Production bugs are not all equal. Some are cosmetic. Some are revenue-impacting. Some are both, until you look closer and find they're actually the symptom of a deeper infrastructure problem that's been waiting to surface. The difference between a team that fixes bugs fast and one that takes days to resolve the same issue usually comes down to process, not skill.
Here's the triage framework we've refined across dozens of production incidents — structured enough to keep the response coordinated, flexible enough to adapt to whatever you actually find.
Step 1: Contain Before You Diagnose
The first priority in any production incident is limiting the blast radius — not finding the cause. These are separate jobs, and confusing them wastes time and risks making things worse.
Containment options, in rough order of speed:
- Feature flag off — If the bug is in a recently deployed feature, disable it. Fastest path to recovery if you have the infrastructure.
- Rollback — If deployment timing correlates with the incident, roll back before diagnosing. You can dig into the cause on the old version.
- Traffic routing — Route affected users to a fallback, a different region, or a degraded-but-functional mode while the main path is being fixed.
- Kill switch — For severe cases, taking a feature completely offline is better than letting a broken experience reach every user.
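As a rough illustration of the feature-flag and kill-switch options, here is a minimal Python sketch. The `FlagStore` class, the `new_pricing_engine` flag name, and the `checkout` function are all hypothetical; a real setup would typically read flags from a config service or a flag provider rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store, standing in for a real flag service."""
    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing entry fails closed.
        return self.flags.get(name, False)

    def disable(self, name: str) -> None:
        # The kill switch: turn the feature off without a deploy.
        self.flags[name] = False


def checkout(flags: FlagStore, cart_total: float) -> str:
    # The recently shipped code path runs only while its flag is on.
    if flags.is_enabled("new_pricing_engine"):
        return f"new:{cart_total:.2f}"
    # Known-good fallback path, kept until the new path is proven.
    return f"legacy:{cart_total:.2f}"
```

The point of the design is that containment becomes a data change rather than a code change: calling `flags.disable("new_pricing_engine")` sends every later request down the legacy path.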
Step 2: Isolate the Problem
Once the blast radius is contained, narrow the scope. You're looking for the smallest reproducible case.
Define the failure boundary
Who is affected? All users, specific accounts, a geographic region, users on a specific plan? The boundaries tell you where to look.
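One way to find the boundary is to slice request outcomes by each candidate dimension and see where failures cluster. The sketch below assumes each event is a dictionary with a boolean `failed` field plus dimension fields such as `plan` and `region`; those field names are illustrative, not a real log schema.

```python
from collections import Counter


def failure_rate_by(events: list[dict], dimension: str) -> dict[str, float]:
    """Return the failure rate for each value of the given dimension."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if event["failed"]:
            failures[key] += 1
    # A dimension whose values show very different rates marks the boundary.
    return {key: failures[key] / totals[key] for key in totals}
```

If the rate is roughly flat across regions but sharply different across plans, the plan-specific code path is the place to look first.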
Identify the trigger
What action or condition causes the bug? A specific API call, a particular data shape, a time-based trigger, a load threshold? This is the most important question in the incident, because the trigger tells you both how to reproduce the failure and which code path to inspect.
Reproduce in a safe environment
Can you reproduce it in staging? If not, what's different about production? The answer to that question is often where the bug lives.
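When staging will not reproduce the bug, a quick first pass is to diff the effective configuration of the two environments. The following sketch assumes both configs are already loaded as flat dictionaries; the `config_diff` helper and the keys in the usage example are hypothetical.

```python
MISSING = "<missing>"


def config_diff(staging: dict, production: dict) -> dict:
    """Return {key: (staging_value, production_value)} for every key that differs."""
    diff = {}
    for key in sorted(staging.keys() | production.keys()):
        staging_value = staging.get(key, MISSING)
        production_value = production.get(key, MISSING)
        if staging_value != production_value:
            diff[key] = (staging_value, production_value)
    return diff
```

For example, `config_diff({"db_pool_size": 5}, {"db_pool_size": 50})` returns `{"db_pool_size": (5, 50)}`. Configuration is only one axis of difference. Data volume, traffic shape, and dependency versions deserve the same comparison.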
Check the diff
What changed since it last worked? Deployments, config changes, data migrations, dependency updates, traffic spikes. Correlation isn't causation, but it's a fast way to narrow the search space.
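A simple way to check the diff is to merge every kind of change into one timeline and keep only what landed shortly before the incident began. This sketch assumes each change is a dictionary with an `at` timestamp and a `kind` label; the six-hour default window is an arbitrary starting point, not a recommendation.

```python
from datetime import datetime, timedelta


def changes_before(
    changes: list[dict],
    incident_start: datetime,
    window: timedelta = timedelta(hours=6),
) -> list[dict]:
    """Return changes inside the window before the incident, most recent first."""
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c["at"] <= incident_start]
    # Most recent first: the change closest to the incident is the first suspect.
    return sorted(recent, key=lambda c: c["at"], reverse=True)
```

The result is a list of suspects, not a verdict. Slow-burning causes such as a migration that only bites once data grows may sit outside any fixed window.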
Step 3: Fix at the Right Level
There are three levels of fix, and using the wrong one makes things worse:
- Workaround — Doesn't fix the bug, but removes the immediate user impact. Buy time, nothing more. Document it as temporary.
- Hotfix — Fixes the immediate symptom in production. Appropriate when you understand the cause well enough to be confident the fix won't introduce new bugs. Ship with a minimal change surface — fix the thing that's broken and nothing else.
- Root cause fix — Addresses what actually caused the bug, not just the manifestation. Often takes longer, should go through normal review and testing, and gets scheduled as a follow-up if a hotfix was already shipped.
Step 4: Verify End-to-End, Not Just the Fix
Before declaring an incident closed, verify three things:
- The specific failure mode is gone (not just in test — in production metrics)
- Adjacent functionality hasn't regressed (spot-check the flows nearest to the change)
- The containment measure is removed (feature flags left off and rollbacks left in place accumulate into configuration debt)
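The three checks above can be encoded as a small closure gate so that none of them is skipped under pressure. This is a sketch under stated assumptions: the `IncidentState` fields and the 10% tolerance over the baseline error rate are illustrative, and in practice the rates would come from your own metrics system.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    error_rate_now: float        # current production error rate
    error_rate_baseline: float   # rate before the incident began
    adjacent_checks_passed: bool
    containment_removed: bool


def closure_blockers(state: IncidentState, tolerance: float = 0.10) -> list[str]:
    """Return the reasons the incident cannot be closed yet (empty means it can)."""
    blockers = []
    if state.error_rate_now > state.error_rate_baseline * (1 + tolerance):
        blockers.append("failure mode still visible in production metrics")
    if not state.adjacent_checks_passed:
        blockers.append("adjacent flows not verified")
    if not state.containment_removed:
        blockers.append("containment measure still in place")
    return blockers
```

Returning the list of blockers rather than a single boolean keeps the remaining work visible to whoever is running the incident.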
Step 5: The Post-Mortem That Actually Helps
A post-mortem's value isn't in documenting what happened — it's in the action items that make the next incident faster or smaller. The questions that matter most:
- How long did detection take, and how do we shorten that?
- What made diagnosis slow, and what would have made it faster (better logs, better observability, better runbooks)?
- Was the root cause something a test should have caught? If so, add the test.
- Is this class of bug possible elsewhere in the codebase? If so, go find it before it finds users.
A good post-mortem produces two to four action items that reduce the likelihood or impact of the next incident. More than that and none of them get done.