Handling a Production Incident Live — What Good Incident Command Looks Like
By Sanjeev Sharma (@webcoderspeed1)
Introduction
Incident response under pressure is a practiced skill, not a natural talent. The engineers who handle incidents well don't do it because they're calmer or smarter — they do it because they've internalized a process that works even when the pressure is high. The most common failure in incidents is everyone trying to fix things simultaneously with no coordination: multiple people making changes at once, no audit trail of what was tried, and no one responsible for communicating status to stakeholders. Good incident command separates diagnosis from execution, maintains a timeline, and delegates clearly.
- What Bad Incident Response Looks Like
- Fix 1: The First 10 Minutes Framework
- Fix 2: The Incident Timeline (Real-Time Log)
- Fix 3: Stakeholder Communication Template
- Fix 4: The Rollback-First Mental Model
- Fix 5: Post-Incident Immediately Actionable Items
- Incident Command Checklist
- Conclusion
What Bad Incident Response Looks Like
Common incident response failure patterns:
T+0: Alert fires
T+1: Three engineers independently start investigating
T+3: Engineer A tries a fix without telling anyone
T+4: Engineer B tries a different fix — now two changes in flight
T+5: Error rate drops — was it A or B? Or neither?
T+7: Third fix deployed — now undoing the changes is unclear
T+10: Stakeholders asking "what's happening" — no one has time to answer
T+15: All three engineers have changed things; nobody knows the current state
T+20: Error rate goes back up — nobody knows why
After: Can't write a postmortem because nobody tracked what happened
Better structure:
- One incident commander (coordinates, doesn't execute)
- One or two responders (investigate and fix)
- One comms person (stakeholder updates)
- Shared incident channel with timestamped updates
- No changes without announcing them
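The role split above can be expressed as a small roster check. This is an illustrative sketch — the `IncidentRole` union and `validateRoster` helper are hypothetical names, not part of any incident tooling:

```typescript
// Sketch of the role structure: one commander, one or two responders, one comms person.
type IncidentRole = 'commander' | 'responder' | 'comms';

interface IncidentAssignment {
  person: string;
  role: IncidentRole;
}

// Returns a list of problems with the roster; empty means the structure is sound.
function validateRoster(roster: IncidentAssignment[]): string[] {
  const errors: string[] = [];
  const count = (r: IncidentRole) => roster.filter((a) => a.role === r).length;
  if (count('commander') !== 1) errors.push('exactly one incident commander required');
  const responders = count('responder');
  if (responders < 1 || responders > 2) errors.push('one or two responders required');
  if (count('comms') !== 1) errors.push('one comms person required');
  return errors;
}
```

The check encodes the key constraint: exactly one commander, so coordination never forks.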
Fix 1: The First 10 Minutes Framework
// Incident Commander mental model for the first 10 minutes
const firstTenMinutes = {
  minute0_2: {
    role: 'Assess and declare',
    actions: [
      'Acknowledge the alert — tell others you have it',
      'Open the incident channel: "#incident-YYYY-MM-DD-[description]"',
      'Declare severity: P1 (service down) / P2 (degraded) / P3 (single user)',
      'Post first status: "Investigating: [symptom]. [your name] is IC."',
    ],
  },
  minute2_5: {
    role: 'Orient',
    actions: [
      'Check: what changed in the last hour? (recent deploys, config changes)',
      'Check: what does the error look like? (error type, affected endpoints)',
      'Check: what is the scope? (% of users, which regions, which services)',
      'DO NOT start fixing yet — understand first',
    ],
  },
  minute5_10: {
    role: 'Delegate',
    actions: [
      'Assign: one engineer to investigate root cause',
      'Assign: one engineer to check rollback feasibility',
      'If P1: notify stakeholders now (not after you know everything)',
      'Post status update: "Root cause investigation in progress. Likely [hypothesis]. ETA for update: 10 min."',
    ],
  },
  minute10_plus: {
    role: 'Coordinate',
    actions: [
      'Every change announced before execution: "I\'m going to try X"',
      'Every change result logged: "X didn\'t help / X reduced error rate"',
      'Status updates every 10-15 minutes regardless of progress',
      'One fix at a time — never two simultaneous changes',
    ],
  },
}
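The minute 0-2 steps are mechanical enough to sketch as code. The helper names and the slug rule are assumptions for illustration, not an existing API:

```typescript
// Build the incident channel name "#incident-YYYY-MM-DD-[description]"
// from a UTC date and a free-text description.
function incidentChannelName(date: Date, description: string): string {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(date.getUTCDate()).padStart(2, '0');
  // Lowercase, replace runs of non-alphanumerics with a single hyphen.
  const slug = description.toLowerCase().replace(/[^a-z0-9]+/g, '-');
  return `#incident-${yyyy}-${mm}-${dd}-${slug}`;
}

// Build the first status post: "Investigating: [symptom]. [your name] is IC."
function firstStatus(symptom: string, ic: string): string {
  return `Investigating: ${symptom}. ${ic} is IC.`;
}
```

For example, `incidentChannelName(new Date(Date.UTC(2026, 2, 15)), 'Checkout Errors')` yields `#incident-2026-03-15-checkout-errors`.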
Fix 2: The Incident Timeline (Real-Time Log)
Maintain this in the incident channel — timestamped, continuous
Format: [HH:MM UTC] [WHO] [WHAT HAPPENED]
Example:
14:03 @alice IC: Error rate at 8%, affecting checkout. Investigating.
14:04 @alice Checked deploys: payment-service v2.4.1 deployed 14:00 UTC
14:05 @bob Investigating: payment service logs show "connection refused to redis"
14:07 @alice Hypothesis: Redis connection issue. @bob check Redis health
14:08 @bob Redis CPU at 95%, max connections reached (100/100)
14:09 @alice Fix option 1: increase Redis max connections. Rollback option: redeploy payment-service v2.4.0
14:10 @alice Stakeholder update sent: "Payment service degraded, investigating Redis"
14:11 @bob Attempting: redis-cli config set maxclients 200
14:12 @bob Redis config updated. Error rate dropping.
14:14 @alice Error rate at 1.2% and falling. Still monitoring.
14:18 @alice Error rate back to baseline 0.1%. Incident resolved.
14:19 @alice Postmortem scheduled for tomorrow 10am. Action items: [link]
This log:
→ Tells the story of what happened
→ Shows what was tried and what worked
→ Provides the timeline for the postmortem
→ Lets latecomers catch up instantly
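The `[HH:MM UTC] [WHO] [WHAT HAPPENED]` format is worth automating so every entry is consistent. A minimal sketch (`timelineEntry` is a hypothetical helper, not part of any chat tool):

```typescript
// Format a timeline entry as "HH:MM @who what happened", with the time in UTC.
function timelineEntry(at: Date, who: string, what: string): string {
  const hh = String(at.getUTCHours()).padStart(2, '0');
  const mm = String(at.getUTCMinutes()).padStart(2, '0');
  return `${hh}:${mm} ${who} ${what}`;
}
```

Pinning the time to UTC avoids the classic postmortem problem of reconciling three responders' local timezones.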
Fix 3: Stakeholder Communication Template
// Communicate early, often, and with specificity
// Silence is the worst stakeholder communication during an incident
// P1 Initial notification (within 5 minutes):
const p1Initial = `
🚨 *INCIDENT P1 - Payment Service Degraded*
Status: Investigating
Impact: ~8% of checkout attempts failing
Affected: Users attempting to complete purchases
Started: 14:00 UTC (approximately)
Next update: 14:15 UTC or sooner if resolved.
IC: @alice
`
// P1 Update (every 10-15 minutes):
const p1Update = `
📊 *INCIDENT UPDATE - 14:15 UTC*
Progress: Root cause identified — Redis reached max connections
Fix in progress: Increasing Redis connection limit
Error rate: Down from 8% to 2%, still declining
Next update: 14:25 UTC or on resolution.
`
// P1 Resolution:
const p1Resolution = `
✅ *INCIDENT RESOLVED - 14:18 UTC*
Duration: 18 minutes (14:00 - 14:18 UTC)
Root cause: Redis max_connections limit (100) hit by traffic spike
Fix applied: Increased to 200 connections, error rate returned to baseline
Customer impact: Approximately 2,400 failed checkout attempts (8% of attempts over 18 min)
Postmortem scheduled: 2026-03-16 10:00 UTC
Action items:
- Increase Redis connection limit permanently (tonight)
- Add CloudWatch alert for Redis connection count > 80% capacity
- Review connection pooling in payment service
`
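One line in these templates that is easy to get wrong under pressure is the "Next update" promise. A minimal sketch, assuming updates are promised 10-15 minutes out as the framework above suggests (the helper name is illustrative):

```typescript
// Compute the "Next update: HH:MM UTC" timestamp by adding a fixed number
// of minutes to the current time.
function nextUpdateTime(now: Date, minutes: number = 15): string {
  const next = new Date(now.getTime() + minutes * 60_000);
  const hh = String(next.getUTCHours()).padStart(2, '0');
  const mm = String(next.getUTCMinutes()).padStart(2, '0');
  return `${hh}:${mm} UTC`;
}
```

Stamping a concrete next-update time into every message is what makes "silence" impossible: a missed timestamp is visible to everyone.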
Fix 4: The Rollback-First Mental Model
// During an incident: rollback is usually faster than finding root cause
const rollbackFirstPrinciple = {
  question: 'Before investigating root cause: can we rollback to restore service?',
  ifRecentDeploy: [
    '1. Check: was there a deploy in the last hour?',
    '2. Can we rollback to the previous version in < 5 minutes?',
    '3. If yes: rollback first, investigate after service is restored',
    '4. Users care about service restoration, not root cause understanding',
  ],
  ifNoRecentDeploy: [
    '1. Look for infrastructure changes (autoscaling event, config change)',
    '2. Look for external dependency issues (third-party API, database)',
    '3. Look for traffic pattern changes (spike, unusual request type)',
    '4. Mitigation (rate limiting, circuit breaker) before fix',
  ],
}
// Common mistake: spending 20 minutes finding the root cause
// when a rollback would have restored service in 5 minutes.
// Restore first. Learn after.
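The triage above collapses into a small decision helper. The "last hour" and "< 5 minutes" thresholds come from the checklist above; everything else here is illustrative:

```typescript
// Decide whether to roll back before investigating root cause.
// minutesSinceDeploy is null when there was no recent deploy to consider.
function shouldRollbackFirst(
  minutesSinceDeploy: number | null,
  estimatedRollbackMinutes: number,
): boolean {
  // Rollback-first applies when a deploy happened in the last hour
  // and rolling back is fast enough to beat an investigation.
  const recentDeploy = minutesSinceDeploy !== null && minutesSinceDeploy <= 60;
  return recentDeploy && estimatedRollbackMinutes <= 5;
}
```

If this returns false, the branch is mitigation (rate limiting, circuit breaker) rather than immediate rollback.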
Fix 5: Post-Incident Immediately Actionable Items
// The most important thing to do right after an incident is resolved:
// write down the immediate next steps while memory is fresh
interface PostIncidentActions {
  within1Hour: string[] // While the team is still assembled
  within24Hours: string[] // Before the postmortem
  within1Week: string[] // Postmortem actions
}

const immediateActions: PostIncidentActions = {
  within1Hour: [
    'Write the timeline while it\'s fresh (you\'ll forget details by tomorrow)',
    'Check if the fix needs a permanent change (config set manually → automate)',
    'Send final stakeholder update with resolution summary',
    'Thank the team and release from incident duty',
  ],
  within24Hours: [
    'Draft postmortem document',
    'Collect any monitoring data before it ages out',
    'Identify who was affected and how (for customer notification)',
  ],
  within1Week: [
    'Complete postmortem with action items',
    'Implement the P1 action items (monitoring, alerts, runbook)',
    'Schedule one-month follow-up on all action items',
  ],
}
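The time buckets above also make the one-month follow-up checkable. A hypothetical sketch of an overdue-item filter (the `ActionItem` shape and helper name are assumptions for illustration):

```typescript
// An action item with a deadline expressed in hours after incident resolution
// (1 hour, 24 hours, or 168 hours for the one-week bucket).
interface ActionItem {
  description: string;
  dueHoursAfterResolution: number;
}

// Return the descriptions of items whose deadline bucket has already passed.
function overdueItems(items: ActionItem[], hoursSinceResolution: number): string[] {
  return items
    .filter((i) => i.dueHoursAfterResolution < hoursSinceResolution)
    .map((i) => i.description);
}
```

Running this at the one-month follow-up surfaces the action items that quietly slipped — the most common way postmortem value is lost.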
Incident Command Checklist
- ✅ Single incident commander — one person coordinates, others execute
- ✅ Incident channel opened within 2 minutes with real-time timeline
- ✅ First stakeholder update sent within 5 minutes (even if "investigating")
- ✅ Rollback considered before root cause investigation
- ✅ Every change announced before execution — no uncoordinated changes
- ✅ Status updates every 10-15 minutes regardless of progress
- ✅ Timeline maintained throughout — postmortem writes itself
- ✅ Immediate action items captured while team is still assembled
Conclusion
Incident response quality is the product of practiced process, not talent. The engineers who handle incidents well have internalized a simple structure: one commander, a shared timeline, early stakeholder communication, and a rollback-before-investigation reflex. The most expensive minutes in an incident are spent with multiple people making uncoordinated changes — the timeline becomes unreadable, the blast radius expands, and the post-incident recovery is harder. A 30-minute "how we run incidents" training session and a shared template do more for incident quality than any additional tooling.