On-Call Burnout Spiral — When the Pager Becomes the Job
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
On-call burnout follows a predictable spiral: high alert volume → engineers get paged constantly → sleep deprivation → reduced capacity to fix root causes → more incidents → higher alert volume. The spiral is a system problem disguised as a people problem. The fix isn't "hire more engineers" or "be more resilient" — it's reducing alert noise to only actionable signals, fixing the recurring issues that generate 80% of pages, and distributing on-call load so no single person bears it alone.
- The On-Call Burnout Spiral
- Fix 1: Alert Audit — Kill Noisy Alerts Before They Kill Morale
- Fix 2: Runbooks for Every Alert
- Fix 3: Fix Recurring Issues (Not Just Acknowledge Them)
- Fix 4: Sustainable On-Call Rotation Design
- Fix 5: On-Call Metrics and Accountability
- On-Call Health Checklist
- Conclusion
The On-Call Burnout Spiral
How on-call burnout develops:
- Week 1: Engineer starts on-call rotation. 5 pages/week. Manageable.
- Week 4: Redis flapping alert added, never fixed. 20 pages/week.
- Week 8: New service deployed with poor test coverage. 40 pages/week.
- Week 12: Engineer starts ignoring some alerts ("it always self-heals").
- Week 16: Engineer begins missing real incidents because of alert fatigue.
- Week 20: Engineer leaves the company. On-call knowledge disappears.
- Week 24: Remaining engineers cover more rotations. The spiral accelerates.
Leading indicators to catch it early:
- MTTD (mean time to detect) increasing → alerts ignored
- % of alerts that are actionable < 50% → noise problem
- Pages per engineer per week > 5 → unsustainable
- Same alert fires > 10 times without root cause fix → toil
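These leading indicators can be checked mechanically. A minimal sketch, assuming you already collect these four numbers (the `OnCallIndicators` shape and `spiralWarnings` name are illustrative, not part of any monitoring product):

```typescript
// spiral-indicators.ts — flag the early-warning thresholds listed above
interface OnCallIndicators {
  mttdTrendPct: number            // % change in mean time to detect vs. last month
  actionableAlertRate: number     // fraction of alerts that required action (0–1)
  pagesPerEngineerPerWeek: number
  maxRepeatFiresWithoutFix: number // worst repeat count for a single unfixed alert
}

function spiralWarnings(m: OnCallIndicators): string[] {
  const warnings: string[] = []
  if (m.mttdTrendPct > 0) {
    warnings.push('MTTD increasing — alerts are likely being ignored')
  }
  if (m.actionableAlertRate < 0.5) {
    warnings.push('Under 50% of alerts actionable — noise problem')
  }
  if (m.pagesPerEngineerPerWeek > 5) {
    warnings.push('More than 5 pages per engineer per week — unsustainable')
  }
  if (m.maxRepeatFiresWithoutFix > 10) {
    warnings.push('Same alert fired > 10 times without a root cause fix — toil')
  }
  return warnings
}
```

Any non-empty result is a signal to intervene before week 12 of the timeline above, not after.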
Fix 1: Alert Audit — Kill Noisy Alerts Before They Kill Morale
```typescript
// alert-audit.ts — find alerts that are noise, not signal
// Assumes `db` is an existing Postgres client (e.g. a node-postgres Pool)
interface AlertAuditReport {
  noisy: string[]
  toFix: string[]
  healthy: string[]
}

async function auditAlerts(days = 30): Promise<AlertAuditReport> {
  // Parameterize the interval instead of interpolating it into the SQL string
  const alerts = await db.query(
    `
    SELECT
      alert_name,
      COUNT(*) AS total_fires,
      COUNT(*) FILTER (WHERE acknowledged_within_mins > 30) AS slow_ack,
      COUNT(*) FILTER (WHERE action_taken = false) AS no_action_taken,
      COUNT(*) FILTER (WHERE auto_resolved = true) AS auto_resolved,
      COUNT(DISTINCT DATE(fired_at)) AS days_fired,
      MAX(fired_at) AS last_fired
    FROM alert_events
    WHERE fired_at > NOW() - make_interval(days => $1)
    GROUP BY alert_name
    ORDER BY total_fires DESC
    `,
    [days],
  )

  const noisy: string[] = []
  const toFix: string[] = []
  const healthy: string[] = []

  for (const alert of alerts.rows) {
    const actionRate = 1 - alert.no_action_taken / alert.total_fires
    const autoResolveRate = alert.auto_resolved / alert.total_fires
    if (autoResolveRate > 0.8) {
      // This alert auto-resolves > 80% of the time → it's noise
      noisy.push(`${alert.alert_name} (${alert.total_fires} fires, ${(autoResolveRate * 100).toFixed(0)}% auto-resolved)`)
    } else if (alert.total_fires > 50 && actionRate < 0.3) {
      // Fires often but rarely gets acted on → either wrong threshold or unfixable
      toFix.push(`${alert.alert_name} (${alert.total_fires} fires, only ${(actionRate * 100).toFixed(0)}% actionable)`)
    } else {
      healthy.push(alert.alert_name)
    }
  }

  return { noisy, toFix, healthy }
}

// Rule: if an alert fires and no action is taken > 50% of the time, it should be deleted or redesigned
// Rule: if an alert auto-resolves > 80% of the time, it's not alertable — it's a metric to watch
```
Fix 2: Runbooks for Every Alert
```typescript
// Every alert must have a runbook — no exceptions
// An alert without a runbook means the on-call engineer has to figure it out at 3 AM
interface AlertRunbook {
  alertName: string
  severity: 'critical' | 'high' | 'medium'
  summary: string
  impact: string
  firstSteps: string[]
  commonCauses: string[]
  diagnosticCommands: string[]
  escalationPath: string
  recentIncidents: string // Link to past incidents for context
}

const runbooks: Record<string, AlertRunbook> = {
  'redis-connection-failure': {
    alertName: 'Redis Connection Failure',
    severity: 'critical',
    summary: 'Redis is unreachable from one or more application instances',
    impact: 'Sessions, caching, and rate limiting affected. App may return 500s.',
    firstSteps: [
      '1. Check Redis health: redis-cli -h $REDIS_HOST ping',
      '2. Check the ElastiCache dashboard for failover events',
      '3. Check application logs for specific connection errors',
      '4. Verify security group rules: port 6379 open to ECS tasks',
    ],
    commonCauses: [
      'ElastiCache failover in progress (auto-resolves in ~30s)',
      'Security group changed',
      'Redis memory exhaustion (maxmemory policy evicting keys)',
      'Network ACL change',
    ],
    diagnosticCommands: [
      'redis-cli -h $REDIS_HOST info memory | grep used_memory_human',
      'redis-cli -h $REDIS_HOST client list | wc -l',
      'aws elasticache describe-events --duration 60',
    ],
    escalationPath: 'Escalate to the Platform team if not resolved in 15 minutes',
    recentIncidents: 'https://incidents.myapp.com/label/redis',
  },
}
```
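The "no exceptions" rule is enforceable in CI: compare the alert names defined in monitoring config against the runbook registry and fail the build on any gap. A minimal sketch (the alert names and `missingRunbooks` helper are illustrative):

```typescript
// runbook-coverage.ts — CI gate: every pageable alert needs a runbook entry
function missingRunbooks(
  alertNames: string[],
  runbookRegistry: Record<string, unknown>,
): string[] {
  return alertNames.filter(name => !(name in runbookRegistry))
}

// Example: alert names would come from your monitoring config export
const definedAlerts = ['redis-connection-failure', 'api-p99-latency']
const registry = { 'redis-connection-failure': {}, 'api-p99-latency': {} }
const gaps = missingRunbooks(definedAlerts, registry)
if (gaps.length > 0) {
  // Fail the pipeline so an uncovered alert never ships
  console.error(`Alerts without runbooks: ${gaps.join(', ')}`)
  process.exit(1)
}
```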
Fix 3: Fix Recurring Issues (Not Just Acknowledge Them)
```typescript
// toil-tracker.ts — identify and eliminate recurring manual responses
// Rule: if you respond to the same alert > 3 times without a permanent fix,
// it must be treated as a bug, not an incident
interface RecurringIssue {
  alertName: string
  occurrences: number
  estimatedTimePerPage: number // minutes
  totalToilMinutes: number     // occurrences × minutes per page
  rootCause?: string
  owner?: string
  ticket?: string
}

async function identifyToil(): Promise<RecurringIssue[]> {
  const recurring = await db.query(`
    SELECT
      alert_name,
      COUNT(*) AS occurrences,
      AVG(resolution_time_minutes) AS avg_resolution_time
    FROM alert_events
    WHERE fired_at > NOW() - INTERVAL '30 days'
    GROUP BY alert_name
    HAVING COUNT(*) > 3
    ORDER BY COUNT(*) * AVG(resolution_time_minutes) DESC
  `)

  // Most expensive toil = most occurrences × most time per incident
  return recurring.rows.map(r => ({
    alertName: r.alert_name,
    occurrences: r.occurrences,
    estimatedTimePerPage: r.avg_resolution_time,
    totalToilMinutes: r.occurrences * r.avg_resolution_time,
  }))
}

// Rule: spend 20% of sprint capacity on toil reduction
// Prioritize by: total toil minutes per month
// Example: "Redis memory alert: 40 occurrences × 15 min = 10 hours/month of toil"
// Fix: increase the maxmemory setting and add an eviction policy
```
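The prioritization rule (occurrences × minutes per page) can be expressed as a small pure function, which makes the Redis example above easy to verify; the `ToilItem` shape and `rankByToil` name are illustrative:

```typescript
// toil-priority.ts — rank recurring issues by total monthly toil
interface ToilItem {
  alertName: string
  occurrences: number   // fires in the last 30 days
  minutesPerPage: number
}

function rankByToil(items: ToilItem[]): Array<ToilItem & { totalToilMinutes: number }> {
  return items
    .map(i => ({ ...i, totalToilMinutes: i.occurrences * i.minutesPerPage }))
    .sort((a, b) => b.totalToilMinutes - a.totalToilMinutes)
}
```

With the numbers from the example, `{ occurrences: 40, minutesPerPage: 15 }` yields 600 toil minutes — the 10 hours/month that justifies spending sprint capacity on a permanent fix.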
Fix 4: Sustainable On-Call Rotation Design
```typescript
// Rotation design principles that prevent burnout
const onCallPrinciples = {
  maxPageableAlerts: '5 per night maximum — above this is a system problem, not an engineer problem',
  rotationLength: '1 week maximum — longer rotations compound fatigue',
  minTeamSize: '4 engineers minimum for on-call — each person is on-call 1 week per month max',
  handoffProcess: 'Written summary of active issues at rotation handoff',
  primaryAndSecondary: 'Always have primary + secondary — secondary is backup, not co-primary',
  businessHoursEscalation: 'Non-critical alerts (medium) only page during business hours',
  postRotationBuffer: 'Day off or light day after completing a particularly hard on-call week',
}

// PagerDuty schedule structure
interface RotationSchedule {
  primary: string
  secondary: string
  weekStart: Date
  weekEnd: Date
  criticalAlerts: 'primary'           // Only primary paged for critical
  highAlerts: 'primary'               // Primary first, secondary escalation
  mediumAlerts: 'business_hours_only' // Never wake people up for medium
}

// Alert routing by severity:
// Critical: page immediately, 24/7, primary + escalate to secondary after 5 min
// High: page immediately during business hours, email off-hours unless 2+ fires
// Medium: email only, review next business day
// Low: dashboard only, weekly review
```
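The severity routing table above can be sketched as a pure function. This is an illustration of the policy, not a PagerDuty API; `routeAlert` and its parameters are assumptions:

```typescript
// alert-routing.ts — route by severity per the table above
type Severity = 'critical' | 'high' | 'medium' | 'low'
type Channel = 'page' | 'email' | 'dashboard'

function routeAlert(
  severity: Severity,
  businessHours: boolean,
  firesInLastHour = 1, // repeated high-severity fires escalate off-hours
): Channel {
  switch (severity) {
    case 'critical':
      return 'page' // 24/7, no exceptions
    case 'high':
      return businessHours || firesInLastHour >= 2 ? 'page' : 'email'
    case 'medium':
      return 'email' // review next business day
    case 'low':
      return 'dashboard' // weekly review only
  }
}
```

Keeping the policy in one function makes it testable and reviewable, rather than scattered across per-alert notification settings.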
Fix 5: On-Call Metrics and Accountability
```typescript
// Track on-call health as a team metric — not just system metrics
interface OnCallHealthReport {
  week: string
  totalPages: number
  pagesPerEngineer: number
  noisyAlerts: string[]  // Alerts that auto-resolved > 3x
  actionableRate: number // What % of pages required action
  avgResponseTimeMin: number
  topRecurringIssues: string[]
}

// Review weekly in engineering standup
// Target metrics:
// - Pages per engineer per week: < 5
// - Actionable rate: > 80% (if < 80%, you have noise to cut)
// - Top recurring issues: each should have an owner and a fix date
// When these metrics are violated, the team leads a toil reduction sprint
// — not optional, not deferred until "later"
// Example: "Redis flapping alert fired 40 times last month"
// Action: investigate root cause, fix permanently, delete the alert if unfixable
```
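The target metrics can be turned into an automated weekly gate. A minimal sketch, assuming the three raw counts are available (the `WeeklyOnCallStats` shape and `violatesTargets` name are illustrative):

```typescript
// oncall-health-gate.ts — check the weekly targets listed above
interface WeeklyOnCallStats {
  totalPages: number
  engineersOnCall: number
  actionablePages: number // pages that required real action
}

function violatesTargets(s: WeeklyOnCallStats): string[] {
  const violations: string[] = []
  const perEngineer = s.totalPages / s.engineersOnCall
  if (perEngineer > 5) {
    violations.push(`${perEngineer.toFixed(1)} pages/engineer/week (target < 5)`)
  }
  const actionableRate = s.actionablePages / s.totalPages
  if (actionableRate < 0.8) {
    violations.push(`${(actionableRate * 100).toFixed(0)}% actionable (target > 80%)`)
  }
  return violations
}
```

A non-empty result is what triggers the toil reduction sprint described above.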
On-Call Health Checklist
- ✅ Alert audit runs monthly — noisy and non-actionable alerts deleted or redesigned
- ✅ Every alert has a runbook with concrete diagnostic steps
- ✅ Recurring issues tracked and treated as bugs — not just incidents to respond to
- ✅ Rotation has minimum 4 engineers — max 1 week per person per month
- ✅ Pages per engineer per week < 5 on average
- ✅ Medium severity alerts never page off-hours
- ✅ 20% of sprint capacity reserved for toil reduction when above targets
- ✅ On-call health metrics reviewed weekly by engineering lead
Conclusion
On-call burnout is prevented by the same principle as all reliability engineering: measure, identify the root cause, and fix it systematically. The measure is pages per engineer per week and actionable alert rate. The root cause is almost always alert noise (auto-resolving alerts that shouldn't page) and recurring toil (the same unfixed issue paging every week). The fix is a monthly alert audit, a runbook requirement for every alert, and a team commitment to spending sprint capacity on toil reduction. Engineers who are well-rested and not fatigued fix incidents faster, write better code, and stay longer — on-call health is an engineering productivity investment.