Single Point of Failure Nobody Noticed — Until It Took Down Everything

Sanjeev Sharma (@webcoderspeed1)
Introduction
Single points of failure don't announce themselves. They're the Redis instance that "only handles caching" — until you discover that session management, rate limiting, and distributed locks all depend on it. They're the single deployment pipeline that every team uses to ship. They're the one DNS server that all services rely on for internal service discovery. They hide in the dependencies you assume are stable, and they reveal themselves at the worst possible time — during the traffic spike, during the incident, at 3 AM.
- Common SPOFs That Go Unnoticed
- Fix 1: Multi-AZ + Replica for Every Stateful Component
- Fix 2: Circuit Breaker on Every External Dependency
- Fix 3: Map Your Dependencies to Find Hidden SPOFs
- Fix 4: Chaos Engineering — Find SPOFs Before They Find You
- Fix 5: Multi-Region for True High Availability
- SPOF Elimination Checklist
- Conclusion
Common SPOFs That Go Unnoticed
Hidden single points of failure:
Infrastructure:
- Single Redis instance (sessions, cache, locks, rate limits)
- Single message broker (all async work stops)
- Single Postgres primary with no automatic failover
- Single NAT gateway (all outbound traffic in a VPC)
- Single availability zone deployment
Application:
- Single authentication service (all services fail if it's down)
- Single config service (all services can't start if it's down)
- Hard dependency on a single third-party API with no fallback
- Single cron runner with no redundancy
Operational:
- Single person who knows how to deploy
- Single person who knows the database password
- Single CI/CD pipeline that all teams share
- Single monitoring system that alerts on all other systems
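A quick way to surface these is to invert a feature-to-backing-service map and look at each service's blast radius. The map below is hypothetical; the point is that a service introduced for one job quietly ends up backing several:

```typescript
// Hypothetical feature → backing-service map. Inverting it shows each
// service's blast radius: how many features break (or degrade) when it dies.
const backingServices: Record<string, string[]> = {
  'session-management': ['redis', 'postgres'],
  'rate-limiting': ['redis'],
  'distributed-locks': ['redis'],
  'page-cache': ['redis'],
  'billing': ['postgres', 'stripe'],
}

function blastRadius(map: Record<string, string[]>): Record<string, string[]> {
  const radius: Record<string, string[]> = {}
  for (const [feature, services] of Object.entries(map)) {
    for (const service of services) {
      radius[service] ??= []
      radius[service].push(feature)
    }
  }
  return radius
}

// The "caching" Redis turns out to back four unrelated features
```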
Fix 1: Multi-AZ + Replica for Every Stateful Component
# terraform/redis-cluster.tf — Redis with automatic failover
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "myapp-redis"
  description          = "Redis with multi-AZ failover"

  # Primary + 1 read replica in different AZs
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  # If the primary goes down, the replica promotes automatically (~30 seconds)
  node_type = "cache.r6g.large"
  port      = 6379

  # Spread across AZs
  preferred_cache_cluster_azs = ["us-east-1a", "us-east-1b"]
}
# postgresql/rds-multi-az.tf
resource "aws_db_instance" "postgres" {
  identifier     = "myapp-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.r6g.large"

  # Multi-AZ = synchronous standby, automatic failover on failure
  multi_az = true

  # Automated backups
  backup_retention_period = 14
}
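Failover isn't free on the client side: while the replica is promoted, connections to the old primary fail. Client libraries usually retry for you, but a small generic backoff wrapper (a sketch, independent of any specific Redis client; `withRetry` is a name introduced here) makes the behavior explicit:

```typescript
// Retry an async operation with exponential backoff, e.g. to ride out the
// window while a replica is being promoted. Delays are illustrative.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // 200ms, 400ms, 800ms, ... plus jitter to avoid a thundering herd
      const delay = baseDelayMs * 2 ** i + Math.random() * 100
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
  throw lastError
}

// Usage: await withRetry(() => redis.get('session:abc'))
```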
Fix 2: Circuit Breaker on Every External Dependency
// If Redis is down, degrade gracefully — don't bring down the whole app
import CircuitBreaker from 'opossum'

// The breaker wraps a generic executor so any Redis call can be passed to fire()
const redisOperation = <T>(fn: () => Promise<T>) => fn()

const redisOptions = {
  timeout: 3000, // Fail fast if Redis takes > 3s
  errorThresholdPercentage: 50, // Open circuit after 50% of requests fail
  resetTimeout: 30000, // Try again after 30 seconds
}

const redisBreaker = new CircuitBreaker(redisOperation, redisOptions)

redisBreaker.fallback(() => {
  // Graceful degradation when Redis is down:
  logger.warn('Redis circuit open — using fallback')
  return null // Cache miss, fall through to database
})
// Rate limiting with Redis down: allow the request (fail open)
async function checkRateLimit(userId: string): Promise<boolean> {
  try {
    const allowed = await redisBreaker.fire(async () => {
      const requests = await redis.incr(`rl:${userId}`)
      if (requests === 1) {
        await redis.expire(`rl:${userId}`, 60) // Start the window once, don't reset it on every request
      }
      return requests <= 100
    })
    // The fallback resolves to null when the circuit is open — treat that as "allow"
    return allowed ?? true
  } catch {
    // Redis down: fail open (allow the request)
    // Alternative: fail closed (deny all) for security-critical paths
    logger.warn({ userId }, 'Rate limit check skipped — Redis unavailable')
    return true
  }
}

// Session lookup with Redis down: fall back to database
async function getSession(sessionId: string): Promise<Session | null> {
  const cached = await redisBreaker
    .fire(() => redis.get(`session:${sessionId}`))
    .catch(() => null)
  if (cached) return JSON.parse(cached)

  // Fallback: the database is slower but available
  const result = await db.query('SELECT * FROM sessions WHERE id = $1', [sessionId])
  return result.rows[0] ?? null
}
Fix 3: Map Your Dependencies to Find Hidden SPOFs
// dependency-map.ts — document what breaks when each component goes down
const dependencyMap = {
  redis: {
    dependents: ['session-service', 'rate-limiter', 'distributed-lock', 'cache-layer'],
    degradedBehavior: {
      'session-service': 'Fall back to database sessions (slower but functional)',
      'rate-limiter': 'Fail open — allow requests without rate limiting',
      'distributed-lock': 'Use optimistic locking in database instead',
      'cache-layer': 'Cache miss — all reads go to database',
    },
    isSPOF: true,
    mitigation: 'ElastiCache with multi-AZ failover + circuit breaker fallbacks',
  },
  postgres: {
    dependents: ['all services'],
    degradedBehavior: {
      'all services': 'Service degraded — read-only mode from replica',
    },
    isSPOF: true,
    mitigation: 'RDS Multi-AZ with automatic failover + read replicas',
  },
  stripe: {
    dependents: ['payment-service'],
    degradedBehavior: {
      'payment-service': 'Queue payment retry — accept order, process later',
    },
    isSPOF: false, // Has retry queue
    mitigation: 'Async payment queue with retry and manual reconciliation',
  },
}

// Run quarterly: flag any component marked isSPOF with no documented mitigation
function auditSPOFs() {
  const unmitigated = Object.entries(dependencyMap)
    .filter(([_, dep]) => dep.isSPOF && !dep.mitigation)
  if (unmitigated.length > 0) {
    console.warn('Unmitigated SPOFs found:')
    unmitigated.forEach(([name]) => console.warn(`  - ${name}`))
  }
}
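Once the map exists, you can also query it during an incident: given a failed component, list each dependent and its expected degraded behavior. A sketch against a trimmed-down copy of the map (the `impactOf` helper is a name introduced here for illustration):

```typescript
interface Dependency {
  dependents: string[]
  degradedBehavior: Record<string, string>
  isSPOF: boolean
  mitigation?: string
}

// Trimmed-down map for illustration; the real one lives in dependency-map.ts
const depMap: Record<string, Dependency> = {
  redis: {
    dependents: ['session-service', 'rate-limiter'],
    degradedBehavior: {
      'session-service': 'Fall back to database sessions',
      'rate-limiter': 'Fail open',
    },
    isSPOF: true,
    mitigation: 'Multi-AZ + circuit breaker',
  },
}

// Given a failed component, report each dependent and its expected degraded
// behavior; anything undocumented is a surprise waiting to happen.
function impactOf(component: string, map: Record<string, Dependency>): string[] {
  const dep = map[component]
  if (!dep) return [`${component}: not in the map (that is itself a finding)`]
  return dep.dependents.map(
    (d) => `${d}: ${dep.degradedBehavior[d] ?? 'UNDOCUMENTED: treat as hard failure'}`,
  )
}
```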
Fix 4: Chaos Engineering — Find SPOFs Before They Find You
// chaos-test.ts — deliberately kill components to verify fallback behavior
// Run in staging on a schedule
interface ChaosExperiment {
  name: string
  action: () => Promise<void>
  restore: () => Promise<void>
  verify: () => Promise<boolean>
}

const experiments: ChaosExperiment[] = [
  {
    name: 'Redis unavailable',
    action: async () => {
      // Block the Redis port for the duration of the experiment
      await exec('iptables -A OUTPUT -p tcp --dport 6379 -j DROP')
    },
    restore: async () => {
      await exec('iptables -D OUTPUT -p tcp --dport 6379 -j DROP')
    },
    verify: async () => {
      // App should still serve requests (degraded, not down)
      const response = await fetch('https://staging.myapp.com/health')
      const health = await response.json()
      return health.status !== 'down'
    },
  },
  {
    name: 'Primary database failover',
    action: async () => {
      // Trigger RDS failover to the standby
      await rds.rebootDBInstance({ DBInstanceIdentifier: 'myapp-staging', ForceFailover: true })
    },
    restore: async () => {
      // RDS restores itself after failover
    },
    verify: async () => {
      // Once failover completes, the app should be serving requests again
      await sleep(90_000) // Wait out the failover window
      const response = await fetch('https://staging.myapp.com/api/healthz')
      return response.ok
    },
  },
]

async function runChaosExperiments() {
  for (const experiment of experiments) {
    console.log(`Running: ${experiment.name}`)
    await experiment.action()
    let passed = false
    try {
      passed = await experiment.verify()
    } finally {
      // Always undo the fault, even if verification throws
      await experiment.restore()
    }
    await sleep(30_000) // Recovery buffer between experiments
    console.log(`${experiment.name}: ${passed ? '✅ PASSED' : '❌ FAILED'}`)
    if (!passed) {
      await alerting.warn(`Chaos experiment failed: ${experiment.name} — SPOF identified`)
    }
  }
}
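A pass/fail health probe is a coarse verdict; chaos results are more trustworthy when compared against a steady-state baseline measured before the fault is injected. A small helper along these lines (thresholds are illustrative, not prescriptive) can decide whether the system was "degraded, not down":

```typescript
// Compare a baseline error rate with the error rate observed during an
// experiment. "Degraded, not down" means errors may rise, but must stay
// under an absolute ceiling and not explode relative to the baseline.
function steadyStateHolds(
  baselineErrorRate: number, // e.g. 0.002 (0.2%) before the experiment
  duringErrorRate: number,   // observed while the fault is injected
  maxAbsoluteRate = 0.05,    // hard ceiling: >5% errors means "down"
  maxRelativeIncrease = 10,  // alarm if errors grow >10x over baseline
): boolean {
  if (duringErrorRate > maxAbsoluteRate) return false
  if (baselineErrorRate > 0 && duringErrorRate / baselineErrorRate > maxRelativeIncrease) {
    return false
  }
  return true
}
```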
Fix 5: Multi-Region for True High Availability
# When a full AZ or region goes down:
# Active-passive: one region handles traffic, the other is a warm standby
# Active-active: traffic distributed across both regions

# Route53 health-check based failover:
# Health-check the primary load balancer directly (not the failover record itself)
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "api" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.myapp.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_failover" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.myapp.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
# If the primary health check fails, Route53 automatically routes to the secondary
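DNS failover is bounded by TTLs and resolver caches, so some clients keep hitting the dead region for a while after Route53 flips. Latency-sensitive clients can add application-level failover on top; a sketch with hypothetical endpoint URLs and a pluggable fetch function:

```typescript
// Try each regional endpoint in order; DNS failover can lag behind TTLs
// and resolver caches, so clients can fail over at the application layer
// too. Endpoint URLs are hypothetical; fetchFn is injectable for testing.
async function fetchWithFailover(
  path: string,
  endpoints: string[] = ['https://api-us-east-1.myapp.com', 'https://api-us-west-2.myapp.com'],
  fetchFn: (url: string) => Promise<{ ok: boolean }> = fetch,
): Promise<{ ok: boolean }> {
  let lastError: unknown = new Error('no endpoints configured')
  for (const base of endpoints) {
    try {
      const response = await fetchFn(`${base}${path}`)
      if (response.ok) return response
      lastError = new Error(`non-OK response from ${base}`)
    } catch (err) {
      lastError = err // network error: try the next region
    }
  }
  throw lastError
}
```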
SPOF Elimination Checklist
- ✅ Every stateful component (DB, Redis, message broker) has a replica in a different AZ
- ✅ Circuit breakers wrap every external dependency with graceful fallback behavior
- ✅ Dependency map documents what breaks (and what degrades gracefully) for each failure
- ✅ Chaos experiments in staging verify fallback behavior works before incidents do
- ✅ No single person holds critical credentials or operational knowledge
- ✅ CI/CD pipeline failures don't block all teams — teams can deploy independently
- ✅ Monitoring system itself is redundant — the thing that alerts you can't be the SPOF
Conclusion
SPOFs are discovered in two ways: through deliberate mapping and testing, or through incidents. The first way is much cheaper. Start with a dependency map: for each component, ask what breaks if it's unavailable. Any component whose failure cascades to total service unavailability is a SPOF, and it needs either redundancy, a circuit breaker with graceful degradation, or both. Then verify your fallbacks with chaos engineering in staging — because a fallback that hasn't been tested is just a comment in the code. The goal isn't a system that never fails; it's a system where any single failure degrades gracefully rather than taking everything down with it.