Handling a Postmortem Without Blame — How to Learn From Incidents Without Burning People

Introduction

Blame is the most expensive possible use of the information a postmortem produces. When engineers fear blame, they hide information, deflect responsibility, and avoid the risky work that would actually improve the system. Blameless postmortems don't pretend that humans didn't make decisions; they examine why reasonable people made the decisions they did with the information they had at the time. That examination reveals systemic failures: missing monitoring, confusing runbooks, unclear ownership, inadequate testing. Fix those, and the next engineer in the same situation makes a better decision.

The Blame vs. Blameless Difference

Blame postmortem output:
- "Dev deployed untested code to production"
- "On-call engineer missed alert for 20 minutes"
- "DBA ran migration without backup"
Actions: Performance improvement plan. More process. Fear.

Blameless postmortem asks:
- Why was untested code deployable to production?
  → No staging parity, no required test gate in CI
- Why did the on-call engineer miss the alert?
  → 47 alerts that week, 60% were noise, and alert fatigue was already documented
- Why did the DBA not take a backup first?
  → The migration runbook had no required backup step; in fact, the runbook didn't exist

Blameless output:
- CI now requires passing tests for merge
- Alert audit: cut noise alerts by 70%, actionable rate now > 80%
- Migration runbook added: backup required step, verified by second engineer

The person who made the decision was the last line of defense in a system with many missing earlier defenses. Fix the system; the person already knows what happened.
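One lightweight way to keep action items pointed at the system is a wording check on drafts. A minimal sketch in TypeScript; the patterns and the `looksLikeBlame` helper are illustrative assumptions, not a real tool:

```typescript
// Illustrative heuristic: flag draft action items whose wording targets
// a person ("should have", "be more careful") rather than the system.
const BLAME_PATTERNS: RegExp[] = [
  /should have/i,
  /be more careful/i,
  /pay more attention/i,
]

function looksLikeBlame(actionItem: string): boolean {
  return BLAME_PATTERNS.some(pattern => pattern.test(actionItem))
}

console.log(looksLikeBlame('On-call should have checked the dashboard sooner')) // true
console.log(looksLikeBlame('Add a required backup step to the migration runbook')) // false
```

A regex list will never catch every blame framing, but running it in the postmortem template's lint step makes the norm visible at writing time instead of in review.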

Fix 1: Postmortem Template That Forces Systems Thinking

# Incident Postmortem: [Title]
Date: YYYY-MM-DD
Severity: P1 / P2 / P3
Duration: X hours Y minutes
Impact: [Customer impact, revenue impact if known]
Author: [Name]
Reviewers: [Names — at least one outside the incident team]

## Timeline
All times UTC:
- HH:MM: [Event]
- HH:MM: [Event]
...

## Root Cause
[Describe the immediate cause — what broke]

## Contributing Factors (5 Whys)
Why did X fail?
→ Because Y was true
Why was Y true?
→ Because Z condition existed
Why did Z exist?
→ Because [systemic condition]

## What Went Well
[Things that limited impact: monitoring caught it, rollback worked, communication was clear]

## What Could Have Gone Better
[Not WHO made mistakes — WHAT was missing or unclear]

## Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add CI test gate for payment endpoints | @alice | 2026-03-22 | P1 |
| Alert audit: remove noise alerts | @bob | 2026-03-29 | P2 |
| Add migration backup requirement to runbook | @carol | 2026-03-22 | P1 |

## Questions That Should Not Be Asked in a Postmortem
❌ "Why didn't [person] check X?"
❌ "Who is responsible for this?"
❌ "How could they miss Y?"

## Questions That Should Be Asked
✅ "What conditions made this possible?"
✅ "What information did the on-call engineer have at the time?"
✅ "What would have needed to be true for a different outcome?"
✅ "Where was the first opportunity to detect this — and why wasn't it detected there?"
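The template above can also be enforced mechanically before a postmortem is published. A small sketch; the section names mirror the template, and `missingSections` is a hypothetical helper:

```typescript
// Check that a postmortem draft contains every required section from the
// template before it is published to the wiki.
const REQUIRED_SECTIONS = [
  '## Timeline',
  '## Root Cause',
  '## Contributing Factors',
  '## What Went Well',
  '## What Could Have Gone Better',
  '## Action Items',
]

function missingSections(doc: string): string[] {
  return REQUIRED_SECTIONS.filter(heading => !doc.includes(heading))
}

const draft = [
  '# Incident Postmortem: Payment outage',
  '## Timeline',
  '- 02:10: 500s begin',
  '## Root Cause',
  'Connection pool exhaustion',
  '## Action Items',
].join('\n')

console.log(missingSections(draft))
// Lists the sections still to be written before the draft can ship
```

A check like this is easy to wire into the same CI that publishes the wiki page, so an incomplete analysis never silently becomes the record of the incident.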

Fix 2: The 5 Whys for System Failure

// 5 Whys: always end at a systemic condition, not a person
// The person is always the last "why" — there are always more before it

interface WhyChain {
  why: string
  because: string
  systemicFactor?: string
}

const exampleAnalysis: WhyChain[] = [
  {
    why: 'The payment service returned 500s for 45 minutes',
    because: 'Database connections were exhausted',
  },
  {
    why: 'Database connections were exhausted',
    because: 'A new feature deployed 3x the normal connection count',
  },
  {
    why: 'The new feature deployed 3x the connection count',
    because: 'A connection pool was created per-request instead of shared',
  },
  {
    why: 'A connection pool was created per-request',
    because: 'The code reviewer didn\'t catch it',
  },
  {
    why: 'The code reviewer didn\'t catch it',
    because: 'No automated check for connection pool usage patterns existed',
    systemicFactor: 'Missing lint rule / architectural review checklist item',
  },
]

// Action: add ESLint rule that warns on `new Pool()` inside request handlers
// Instead of: "Code reviewer should have caught this"
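The "always end at a systemic condition" rule can itself be checked. A standalone sketch (the interface is repeated under a different name so the snippet runs on its own; `endsInSystemicFactor` is a hypothetical helper):

```typescript
interface WhyLink {
  why: string
  because: string
  systemicFactor?: string
}

// A 5 Whys chain is complete only when its final link names a systemic
// factor: something fixable in the system, not a person.
function endsInSystemicFactor(chain: WhyLink[]): boolean {
  const last = chain[chain.length - 1]
  return last !== undefined && typeof last.systemicFactor === 'string'
}

const stoppedAtPerson: WhyLink[] = [
  { why: 'The bug reached production', because: 'The reviewer missed it' },
]

const reachedSystem: WhyLink[] = [
  ...stoppedAtPerson,
  {
    why: 'The reviewer missed it',
    because: 'No automated check for the pattern existed',
    systemicFactor: 'Missing lint rule',
  },
]

console.log(endsInSystemicFactor(stoppedAtPerson)) // false: dig deeper
console.log(endsInSystemicFactor(reachedSystem)) // true
```

Running this as a validation step on submitted chains turns "don't stop at a person" from a norm the facilitator must police into a property the tooling asserts.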

Fix 3: Run the Postmortem as a Learning Session

Postmortem facilitation principles:

Before the meeting:
- Incident timeline already written and shared
- "Questions not to ask" reminder sent with invite
- Focus: learning, not judgment

During the meeting:
- Facilitator is not the incident owner (reduces defensiveness)
- Start with "what went well" — establishes psychological safety
- Timeline walkthrough: narrator explains decisions WITH context
  ("At 2 AM, with 20 alerts firing, here's what the data showed...")
- For each decision: "Given the information available, was this reasonable?"
  Not: "Was this the right decision in hindsight?"
- Action items are system fixes, not people fixes

Facilitator interventions:
- "Let's focus on what condition made that possible, not who should have..."
- "What would the documentation have needed to say for a different decision?"
- "What monitoring would have caught this earlier?"

After the meeting:
- Postmortem published (ideally publicly within the company)
- Action items assigned with owners and due dates
- 30-day follow-up: which action items are actually done?

Fix 4: Postmortem Action Item Follow-Through

// The most common postmortem failure: action items never completed
// Track them the same way you track bugs

interface PostmortemAction {
  incidentId: string
  action: string
  owner: string
  dueDate: Date
  priority: 'P1' | 'P2' | 'P3'
  status: 'open' | 'in_progress' | 'completed' | 'deferred'
  ticketUrl?: string
}

// Weekly review: are postmortem actions being completed?
// (db, logger, and notify are assumed application-level clients:
// a SQL pool, a structured logger, and a chat/webhook notifier)
async function reviewPostmortemActions(): Promise<void> {
  const overdue = await db.query(`
    SELECT pa.*, i.title as incident_title
    FROM postmortem_actions pa
    JOIN incidents i ON i.id = pa.incident_id
    WHERE pa.status IN ('open', 'in_progress')
    AND pa.due_date < NOW()
    ORDER BY pa.priority, pa.due_date
  `)

  if (overdue.rows.length > 0) {
    logger.warn({ count: overdue.rows.length }, 'Overdue postmortem action items')

    // Notify in Slack or email
    await notify.engineeringChannel({
      message: `⚠️ ${overdue.rows.length} overdue postmortem action items`,
      items: overdue.rows.map(r => `${r.priority} | ${r.incident_title} | ${r.action} | @${r.owner}`),
    })
  }
}

// Rule: P1 postmortem actions must be completed before the next deploy
// Rule: P2 actions must be in the current sprint
// Rule: P3 actions must be scheduled within 2 sprints
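The first of those rules can be wired directly into a deploy gate. A minimal sketch; the `deployBlocked` helper and `ActionSummary` shape are assumptions for illustration:

```typescript
type Priority = 'P1' | 'P2' | 'P3'
type Status = 'open' | 'in_progress' | 'completed' | 'deferred'

interface ActionSummary {
  priority: Priority
  status: Status
}

// Rule 1 as code: any incomplete P1 postmortem action blocks the deploy.
function deployBlocked(actions: ActionSummary[]): boolean {
  return actions.some(
    a => a.priority === 'P1' && (a.status === 'open' || a.status === 'in_progress'),
  )
}

console.log(deployBlocked([{ priority: 'P1', status: 'open' }])) // true
console.log(deployBlocked([
  { priority: 'P1', status: 'completed' },
  { priority: 'P2', status: 'open' }, // P2 doesn't block; it goes on the sprint
])) // false
```

Called from the deploy pipeline with the open actions for the affected service, this makes P1 follow-through a hard precondition rather than a convention.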

Fix 5: Public Postmortems Build Trust

# When to Share Postmortems Externally

Internal postmortems: always (default)
- Publish to engineering wiki within 5 business days
- All engineers have access — even if unrelated to the incident
- Culture of learning, not secrecy

External postmortems (customer-facing): consider for P1 incidents
Benefits:
- Demonstrates operational maturity
- Customers who felt impact want to understand what happened
- Proactive communication prevents speculation and rumors
- Shows engineering quality, not weakness

Format for external postmortems:
- What happened (customer-understandable language)
- How long customers were affected
- What we did to resolve it
- What we're doing to prevent recurrence
- No internal names, no blame language

Examples that did this well:
- GitHub's outage postmortems
- Cloudflare's incident reports
- Stripe's transparency about reliability

Blameless Postmortem Checklist

  • ✅ Timeline documented before the postmortem meeting
  • ✅ Facilitator separate from the incident owner
  • ✅ "What went well" covered before "what could be better"
  • ✅ All contributing factors are system conditions, not person failures
  • ✅ Every action item has an owner, due date, and priority
  • ✅ Action items tracked to completion — not just written and forgotten
  • ✅ Postmortem published to engineering team within 5 business days
  • ✅ 30-day review: which actions are complete?

Conclusion

A blameless postmortem is the highest-leverage investment in reliability engineering. The alternative — blame — generates fear, incomplete information, and defensive behavior that makes the system more fragile. Blameless doesn't mean consequence-free for truly negligent behavior, but for normal incidents where reasonable people made reasonable decisions with incomplete information, the system conditions that enabled the failure are the only fixable target. Every action item that points at a person (rather than a system, process, or tool) is a missed opportunity to make the next incident less likely.