On-Call Burnout Spiral — When the Pager Becomes the Job
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
On-call burnout follows a predictable spiral: high alert volume → engineers get paged constantly → sleep deprivation → reduced capacity to fix root causes → more incidents → higher alert volume. The spiral is a system problem disguised as a people problem. The fix isn't "hire more engineers" or "be more resilient" — it's reducing alert noise to only actionable signals, fixing the recurring issues that generate 80% of pages, and distributing on-call load so no single person bears it alone.
- The On-Call Burnout Spiral
- Fix 1: Alert Audit — Kill Noisy Alerts Before They Kill Morale
- Fix 2: Runbooks for Every Alert
- Fix 3: Fix Recurring Issues (Not Just Acknowledge Them)
- Fix 4: Sustainable On-Call Rotation Design
- Fix 5: On-Call Metrics and Accountability
- On-Call Health Checklist
- Conclusion
The On-Call Burnout Spiral
How on-call burnout develops:
- Week 1: Engineer starts on-call rotation. 5 pages/week. Manageable.
- Week 4: Redis flapping alert added, never fixed. 20 pages/week.
- Week 8: New service deployed with poor test coverage. 40 pages/week.
- Week 12: Engineer starts ignoring some alerts ("it always self-heals").
- Week 16: Engineer begins missing real incidents because of alert fatigue.
- Week 20: Engineer leaves the company. On-call knowledge disappears.
- Week 24: Remaining engineers cover more rotations. The spiral accelerates.
Leading indicators to catch it early:
- MTTD (mean time to detect) increasing → alerts ignored
- % of alerts that are actionable < 50% → noise problem
- Pages per engineer per week > 5 → unsustainable
- Same alert fires > 10 times without root cause fix → toil
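These leading indicators can be checked mechanically. A minimal sketch, assuming you already collect these four numbers (the `OnCallIndicators` shape and `spiralWarnings` name are illustrative, not part of any monitoring product):

```typescript
// spiral-indicators.ts — flag the early-warning thresholds listed above
interface OnCallIndicators {
  mttdTrendPct: number            // % change in mean time to detect vs. last month
  actionableAlertRate: number     // fraction of alerts that required action (0–1)
  pagesPerEngineerPerWeek: number
  maxRepeatFiresWithoutFix: number // worst repeat count for a single unfixed alert
}

function spiralWarnings(m: OnCallIndicators): string[] {
  const warnings: string[] = []
  if (m.mttdTrendPct > 0) {
    warnings.push('MTTD increasing — alerts are likely being ignored')
  }
  if (m.actionableAlertRate < 0.5) {
    warnings.push('Under 50% of alerts actionable — noise problem')
  }
  if (m.pagesPerEngineerPerWeek > 5) {
    warnings.push('More than 5 pages per engineer per week — unsustainable')
  }
  if (m.maxRepeatFiresWithoutFix > 10) {
    warnings.push('Same alert fired > 10 times without a root cause fix — toil')
  }
  return warnings
}
```

Any non-empty result is a signal to intervene before week 12 of the timeline above, not after.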
Fix 1: Alert Audit — Kill Noisy Alerts Before They Kill Morale
```typescript
// alert-audit.ts — find alerts that are noise, not signal
// Assumes `db` is an existing Postgres client (e.g. a node-postgres Pool)
interface AlertAuditReport {
  noisy: string[]
  toFix: string[]
  healthy: string[]
}

async function auditAlerts(days = 30): Promise<AlertAuditReport> {
  // Parameterize the interval instead of interpolating it into the SQL string
  const alerts = await db.query(
    `
    SELECT
      alert_name,
      COUNT(*) AS total_fires,
      COUNT(*) FILTER (WHERE acknowledged_within_mins > 30) AS slow_ack,
      COUNT(*) FILTER (WHERE action_taken = false) AS no_action_taken,
      COUNT(*) FILTER (WHERE auto_resolved = true) AS auto_resolved,
      COUNT(DISTINCT DATE(fired_at)) AS days_fired,
      MAX(fired_at) AS last_fired
    FROM alert_events
    WHERE fired_at > NOW() - make_interval(days => $1)
    GROUP BY alert_name
    ORDER BY total_fires DESC
    `,
    [days],
  )

  const noisy: string[] = []
  const toFix: string[] = []
  const healthy: string[] = []

  for (const alert of alerts.rows) {
    const actionRate = 1 - alert.no_action_taken / alert.total_fires
    const autoResolveRate = alert.auto_resolved / alert.total_fires
    if (autoResolveRate > 0.8) {
      // This alert auto-resolves > 80% of the time → it's noise
      noisy.push(`${alert.alert_name} (${alert.total_fires} fires, ${(autoResolveRate * 100).toFixed(0)}% auto-resolved)`)
    } else if (alert.total_fires > 50 && actionRate < 0.3) {
      // Fires often but rarely gets acted on → either wrong threshold or unfixable
      toFix.push(`${alert.alert_name} (${alert.total_fires} fires, only ${(actionRate * 100).toFixed(0)}% actionable)`)
    } else {
      healthy.push(alert.alert_name)
    }
  }

  return { noisy, toFix, healthy }
}

// Rule: if an alert fires and no action is taken > 50% of the time, it should be deleted or redesigned
// Rule: if an alert auto-resolves > 80% of the time, it's not alertable — it's a metric to watch
```
Fix 2: Runbooks for Every Alert
```typescript
// Every alert must have a runbook — no exceptions
// An alert without a runbook means the on-call engineer has to figure it out at 3 AM
interface AlertRunbook {
  alertName: string
  severity: 'critical' | 'high' | 'medium'
  summary: string
  impact: string
  firstSteps: string[]
  commonCauses: string[]
  diagnosticCommands: string[]
  escalationPath: string
  recentIncidents: string // Link to past incidents for context
}

const runbooks: Record<string, AlertRunbook> = {
  'redis-connection-failure': {
    alertName: 'Redis Connection Failure',
    severity: 'critical',
    summary: 'Redis is unreachable from one or more application instances',
    impact: 'Sessions, caching, and rate limiting affected. App may return 500s.',
    firstSteps: [
      '1. Check Redis health: redis-cli -h $REDIS_HOST ping',
      '2. Check the ElastiCache dashboard for failover events',
      '3. Check application logs for specific connection errors',
      '4. Verify security group rules: port 6379 open to ECS tasks',
    ],
    commonCauses: [
      'ElastiCache failover in progress (auto-resolves in ~30s)',
      'Security group changed',
      'Redis memory exhaustion (maxmemory policy evicting keys)',
      'Network ACL change',
    ],
    diagnosticCommands: [
      'redis-cli -h $REDIS_HOST info memory | grep used_memory_human',
      'redis-cli -h $REDIS_HOST client list | wc -l',
      'aws elasticache describe-events --duration 60',
    ],
    escalationPath: 'Escalate to the Platform team if not resolved in 15 minutes',
    recentIncidents: 'https://incidents.myapp.com/label/redis',
  },
}
```
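The "no exceptions" rule is enforceable in CI: compare the alert names defined in monitoring config against the runbook registry and fail the build on any gap. A minimal sketch (the alert names and `missingRunbooks` helper are illustrative):

```typescript
// runbook-coverage.ts — CI gate: every pageable alert needs a runbook entry
function missingRunbooks(
  alertNames: string[],
  runbookRegistry: Record<string, unknown>,
): string[] {
  return alertNames.filter(name => !(name in runbookRegistry))
}

// Example: alert names would come from your monitoring config export
const definedAlerts = ['redis-connection-failure', 'api-p99-latency']
const registry = { 'redis-connection-failure': {}, 'api-p99-latency': {} }
const gaps = missingRunbooks(definedAlerts, registry)
if (gaps.length > 0) {
  // Fail the pipeline so an uncovered alert never ships
  console.error(`Alerts without runbooks: ${gaps.join(', ')}`)
  process.exit(1)
}
```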
Fix 3: Fix Recurring Issues (Not Just Acknowledge Them)
```typescript
// toil-tracker.ts — identify and eliminate recurring manual responses
// Rule: if you respond to the same alert > 3 times without a permanent fix,
// it must be treated as a bug, not an incident
interface RecurringIssue {
  alertName: string
  occurrences: number
  estimatedTimePerPage: number // minutes
  totalToilMinutes: number     // occurrences × minutes per page
  rootCause?: string
  owner?: string
  ticket?: string
}

async function identifyToil(): Promise<RecurringIssue[]> {
  const recurring = await db.query(`
    SELECT
      alert_name,
      COUNT(*) AS occurrences,
      AVG(resolution_time_minutes) AS avg_resolution_time
    FROM alert_events
    WHERE fired_at > NOW() - INTERVAL '30 days'
    GROUP BY alert_name
    HAVING COUNT(*) > 3
    ORDER BY COUNT(*) * AVG(resolution_time_minutes) DESC
  `)

  // Most expensive toil = most occurrences × most time per incident
  return recurring.rows.map(r => ({
    alertName: r.alert_name,
    occurrences: r.occurrences,
    estimatedTimePerPage: r.avg_resolution_time,
    totalToilMinutes: r.occurrences * r.avg_resolution_time,
  }))
}

// Rule: spend 20% of sprint capacity on toil reduction
// Prioritize by: total toil minutes per month
// Example: "Redis memory alert: 40 occurrences × 15 min = 10 hours/month of toil"
// Fix: increase the maxmemory setting and add an eviction policy
```
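The prioritization rule (occurrences × minutes per page) can be expressed as a small pure function, which makes the Redis example above easy to verify; the `ToilItem` shape and `rankByToil` name are illustrative:

```typescript
// toil-priority.ts — rank recurring issues by total monthly toil
interface ToilItem {
  alertName: string
  occurrences: number   // fires in the last 30 days
  minutesPerPage: number
}

function rankByToil(items: ToilItem[]): Array<ToilItem & { totalToilMinutes: number }> {
  return items
    .map(i => ({ ...i, totalToilMinutes: i.occurrences * i.minutesPerPage }))
    .sort((a, b) => b.totalToilMinutes - a.totalToilMinutes)
}
```

With the numbers from the example, `{ occurrences: 40, minutesPerPage: 15 }` yields 600 toil minutes — the 10 hours/month that justifies spending sprint capacity on a permanent fix.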
Fix 4: Sustainable On-Call Rotation Design
```typescript
// Rotation design principles that prevent burnout
const onCallPrinciples = {
  maxPageableAlerts: '5 per night maximum — above this is a system problem, not an engineer problem',
  rotationLength: '1 week maximum — longer rotations compound fatigue',
  minTeamSize: '4 engineers minimum for on-call — each person is on-call 1 week per month max',
  handoffProcess: 'Written summary of active issues at rotation handoff',
  primaryAndSecondary: 'Always have primary + secondary — secondary is backup, not co-primary',
  businessHoursEscalation: 'Non-critical alerts (medium) only page during business hours',
  postRotationBuffer: 'Day off or light day after completing a particularly hard on-call week',
}

// PagerDuty schedule structure
interface RotationSchedule {
  primary: string
  secondary: string
  weekStart: Date
  weekEnd: Date
  criticalAlerts: 'primary'           // Only primary paged for critical
  highAlerts: 'primary'               // Primary first, secondary escalation
  mediumAlerts: 'business_hours_only' // Never wake people up for medium
}

// Alert routing by severity:
// Critical: page immediately, 24/7, primary + escalate to secondary after 5 min
// High: page immediately during business hours, email off-hours unless 2+ fires
// Medium: email only, review next business day
// Low: dashboard only, weekly review
```
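The severity routing table above can be sketched as a pure function. This is an illustration of the policy, not a PagerDuty API; `routeAlert` and its parameters are assumptions:

```typescript
// alert-routing.ts — route by severity per the table above
type Severity = 'critical' | 'high' | 'medium' | 'low'
type Channel = 'page' | 'email' | 'dashboard'

function routeAlert(
  severity: Severity,
  businessHours: boolean,
  firesInLastHour = 1, // repeated high-severity fires escalate off-hours
): Channel {
  switch (severity) {
    case 'critical':
      return 'page' // 24/7, no exceptions
    case 'high':
      return businessHours || firesInLastHour >= 2 ? 'page' : 'email'
    case 'medium':
      return 'email' // review next business day
    case 'low':
      return 'dashboard' // weekly review only
  }
}
```

Keeping the policy in one function makes it testable and reviewable, rather than scattered across per-alert notification settings.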
Fix 5: On-Call Metrics and Accountability
```typescript
// Track on-call health as a team metric — not just system metrics
interface OnCallHealthReport {
  week: string
  totalPages: number
  pagesPerEngineer: number
  noisyAlerts: string[]  // Alerts that auto-resolved > 3x
  actionableRate: number // What % of pages required action
  avgResponseTimeMin: number
  topRecurringIssues: string[]
}

// Review weekly in engineering standup
// Target metrics:
// - Pages per engineer per week: < 5
// - Actionable rate: > 80% (if < 80%, you have noise to cut)
// - Top recurring issues: each should have an owner and a fix date
// When these metrics are violated, the team leads a toil reduction sprint
// — not optional, not deferred until "later"
// Example: "Redis flapping alert fired 40 times last month"
// Action: investigate root cause, fix permanently, delete the alert if unfixable
```
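The target metrics can be turned into an automated weekly gate. A minimal sketch, assuming the three raw counts are available (the `WeeklyOnCallStats` shape and `violatesTargets` name are illustrative):

```typescript
// oncall-health-gate.ts — check the weekly targets listed above
interface WeeklyOnCallStats {
  totalPages: number
  engineersOnCall: number
  actionablePages: number // pages that required real action
}

function violatesTargets(s: WeeklyOnCallStats): string[] {
  const violations: string[] = []
  const perEngineer = s.totalPages / s.engineersOnCall
  if (perEngineer > 5) {
    violations.push(`${perEngineer.toFixed(1)} pages/engineer/week (target < 5)`)
  }
  const actionableRate = s.actionablePages / s.totalPages
  if (actionableRate < 0.8) {
    violations.push(`${(actionableRate * 100).toFixed(0)}% actionable (target > 80%)`)
  }
  return violations
}
```

A non-empty result is what triggers the toil reduction sprint described above.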
On-Call Health Checklist
- ✅ Alert audit runs monthly — noisy and non-actionable alerts deleted or redesigned
- ✅ Every alert has a runbook with concrete diagnostic steps
- ✅ Recurring issues tracked and treated as bugs — not just incidents to respond to
- ✅ Rotation has minimum 4 engineers — max 1 week per person per month
- ✅ Pages per engineer per week < 5 on average
- ✅ Medium severity alerts never page off-hours
- ✅ 20% of sprint capacity reserved for toil reduction when above targets
- ✅ On-call health metrics reviewed weekly by engineering lead
Conclusion
On-call burnout is prevented by the same principle as all reliability engineering: measure, identify the root cause, and fix it systematically. The measure is pages per engineer per week and actionable alert rate. The root cause is almost always alert noise (auto-resolving alerts that shouldn't page) and recurring toil (the same unfixed issue paging every week). The fix is a monthly alert audit, a runbook requirement for every alert, and a team commitment to spending sprint capacity on toil reduction. Engineers who are well-rested and not fatigued fix incidents faster, write better code, and stay longer — on-call health is an engineering productivity investment.