The Single Point of Failure Nobody Noticed — Until It Took Down Everything

Introduction

Single points of failure don't announce themselves. They're the Redis instance that "only handles caching" — until you discover that session management, rate limiting, and distributed locks all depend on it. They're the single deployment pipeline that every team uses to ship. They're the one DNS server that all services rely on for internal service discovery. They hide in the dependencies you assume are stable, and they reveal themselves at the worst possible time — during the traffic spike, during the incident, at 3 AM.

Common SPOFs That Go Unnoticed

Hidden single points of failure:

Infrastructure:
- Single Redis instance (sessions, cache, locks, rate limits)
- Single message broker (all async work stops)
- Single Postgres primary with no automatic failover
- Single NAT gateway (all outbound traffic in a VPC)
- Single availability zone deployment

Application:
- Single authentication service (all services fail if it's down)
- Single config service (no service can start if it's down)
- Hard dependency on a single third-party API with no fallback
- Single cron runner with no redundancy

Operational:
- Single person who knows how to deploy
- Single person who knows the database password
- Single CI/CD pipeline that all teams share
- Single monitoring system that alerts on all other systems
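The lists above are easier to audit if you keep them as data instead of prose. A minimal sketch — the component names and replica counts below are illustrative, not from any real inventory:

```typescript
// Illustrative inventory: flag any component with no redundancy.
interface Component {
  name: string
  category: 'infrastructure' | 'application' | 'operational'
  replicas: number
}

const inventory: Component[] = [
  { name: 'redis', category: 'infrastructure', replicas: 1 },
  { name: 'postgres-primary', category: 'infrastructure', replicas: 1 },
  { name: 'nat-gateway', category: 'infrastructure', replicas: 1 },
  { name: 'auth-service', category: 'application', replicas: 3 },
]

// Anything with fewer than 2 instances is a SPOF candidate
const spofs = inventory.filter(c => c.replicas < 2).map(c => c.name)
console.log(spofs)  // → ['redis', 'postgres-primary', 'nat-gateway']
```

Kept next to the infrastructure code, a list like this can be checked in CI so new single-instance components get flagged at review time rather than during an incident.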

Fix 1: Multi-AZ + Replica for Every Stateful Component

# terraform/redis-cluster.tf — Redis with automatic failover
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "myapp-redis"
  description                = "Redis with multi-AZ failover"

  # Primary + 1 read replica in different AZs
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  # If the primary goes down, the replica is promoted automatically
  # (typically under a minute with multi-AZ enabled)
  node_type                  = "cache.r6g.large"
  port                       = 6379

  # Spread across AZs
  preferred_cache_cluster_azs = ["us-east-1a", "us-east-1b"]
}

# postgresql/rds-multi-az.tf
resource "aws_db_instance" "postgres" {
  identifier           = "myapp-postgres"
  engine               = "postgres"
  engine_version       = "15.4"
  instance_class       = "db.r6g.large"

  # Multi-AZ = synchronous standby, automatic failover on failure
  multi_az             = true

  # Automated backups
  backup_retention_period = 14
}
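Server-side failover is only half the fix: clients have to reconnect once the standby is promoted, rather than surfacing every dropped connection as an error. A minimal sketch of a capped reconnect backoff — `retryDelayMs` and its constants are illustrative choices, not prescribed values:

```typescript
// Reconnect backoff for a client riding out a failover window of up to
// a minute or so. A linear ramp capped at 2s keeps retry pressure bounded
// while the replica is being promoted and DNS is updated.
function retryDelayMs(attempt: number): number {
  return Math.min(attempt * 200, 2000)
}
```

With ioredis, for example, a function like this can be supplied as the `retryStrategy` option so reconnect attempts continue quietly through the failover window instead of failing requests immediately.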

Fix 2: Circuit Breaker on Every External Dependency

// If Redis is down, degrade gracefully — don't bring down the whole app
import CircuitBreaker from 'opossum'

// Generic executor: the breaker wraps "run this Redis operation",
// so any Redis call can be fired through the same circuit
const redisOperation = <T>(op: () => Promise<T>): Promise<T> => op()

const redisOptions = {
  timeout: 3000,                 // Fail fast if Redis takes > 3s
  errorThresholdPercentage: 50,  // Open circuit after 50% of calls fail
  resetTimeout: 30000,           // Probe again after 30 seconds
}

const redisBreaker = new CircuitBreaker(redisOperation, redisOptions)

redisBreaker.fallback(() => {
  // Graceful degradation when the circuit is open:
  logger.warn('Redis circuit open — using fallback')
  return null  // Treated as a cache miss; callers fall through to the database
})

// Rate limiting with Redis down: allow the request (fail open)
async function checkRateLimit(userId: string): Promise<boolean> {
  try {
    const allowed = await redisBreaker.fire(async () => {
      const requests = await redis.incr(`rl:${userId}`)
      await redis.expire(`rl:${userId}`, 60)
      return requests <= 100
    })
    if (allowed === null) {
      // Circuit open — the fallback fired instead of the Redis call
      logger.warn({ userId }, 'Rate limit check skipped — Redis unavailable')
      return true
    }
    return allowed
  } catch {
    // Redis errored before the circuit opened: fail open (allow the request)
    // Alternative: fail closed (deny all) for security-critical paths
    logger.warn({ userId }, 'Rate limit check skipped — Redis unavailable')
    return true
  }
}

// Session lookup with Redis down: fall back to database
async function getSession(sessionId: string): Promise<Session | null> {
  const cached = await redisBreaker.fire(() =>
    redis.get(`session:${sessionId}`)
  ).catch(() => null)

  if (cached) return JSON.parse(cached)

  // Fallback: database is slower but available
  return db.query('SELECT * FROM sessions WHERE id = $1', [sessionId])
    .then(r => r.rows[0] ?? null)
}

Fix 3: Map Your Dependencies to Find Hidden SPOFs

// dependency-map.ts — document what breaks when each component goes down

const dependencyMap = {
  redis: {
    dependents: ['session-service', 'rate-limiter', 'distributed-lock', 'cache-layer'],
    degradedBehavior: {
      'session-service': 'Fall back to database sessions (slower but functional)',
      'rate-limiter': 'Fail open — allow requests without rate limiting',
      'distributed-lock': 'Use optimistic locking in database instead',
      'cache-layer': 'Cache miss — all reads go to database',
    },
    isSPOF: true,
    mitigation: 'ElastiCache with multi-AZ failover + circuit breaker fallbacks',
  },

  postgres: {
    dependents: ['all services'],
    degradedBehavior: {
      'all services': 'Service degraded — read-only mode from replica',
    },
    isSPOF: true,
    mitigation: 'RDS Multi-AZ with automatic failover + read replicas',
  },

  stripe: {
    dependents: ['payment-service'],
    degradedBehavior: {
      'payment-service': 'Queue payment retry — accept order, process later',
    },
    isSPOF: false,  // Has retry queue
    mitigation: 'Async payment queue with retry and manual reconciliation',
  },
}

// Run quarterly: list all SPOFs where isSPOF: true and no mitigation
function auditSPOFs() {
  const unmitigated = Object.entries(dependencyMap)
    .filter(([_, dep]) => dep.isSPOF && !dep.mitigation)

  if (unmitigated.length > 0) {
    console.warn('Unmitigated SPOFs found:')
    unmitigated.forEach(([name]) => console.warn(`  - ${name}`))
  }
}
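The map above is one level deep, but failures propagate through chains: a service that depends on something that depends on Redis is also in the blast radius. A sketch of computing that transitive impact — the edges below are illustrative, not tied to the map object above:

```typescript
// Edges: "service A depends on B and C". Names are illustrative.
const dependsOn: Record<string, string[]> = {
  'session-service': ['redis', 'postgres'],
  'rate-limiter': ['redis'],
  'api-gateway': ['session-service', 'rate-limiter'],
}

// Everything that breaks (directly or transitively) if one node fails.
function blastRadius(failed: string): Set<string> {
  const impacted = new Set<string>()
  let changed = true
  while (changed) {
    changed = false
    for (const [service, deps] of Object.entries(dependsOn)) {
      if (!impacted.has(service) &&
          deps.some(d => d === failed || impacted.has(d))) {
        impacted.add(service)
        changed = true
      }
    }
  }
  return impacted
}

// In this illustrative graph, losing Redis takes out all three services:
// rate-limiter and session-service directly, api-gateway transitively.
console.log([...blastRadius('redis')])
```

A component whose blast radius covers most of the graph is a SPOF even if nothing depends on it directly by name — which is exactly how "Redis only handles caching" outages happen.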

Fix 4: Chaos Engineering — Find SPOFs Before They Find You

// chaos-test.ts — deliberately kill components to verify fallback behavior
// Run in staging on a schedule

interface ChaosExperiment {
  name: string
  action: () => Promise<void>
  restore: () => Promise<void>
  verify: () => Promise<boolean>
}

const experiments: ChaosExperiment[] = [
  {
    name: 'Redis unavailable',
    action: async () => {
      // Block Redis port for 60 seconds
      await exec('iptables -A OUTPUT -p tcp --dport 6379 -j DROP')
    },
    restore: async () => {
      await exec('iptables -D OUTPUT -p tcp --dport 6379 -j DROP')
    },
    verify: async () => {
      // App should still serve requests (degraded, not down)
      const response = await fetch('https://staging.myapp.com/health')
      const health = await response.json()
      return health.status !== 'down'
    },
  },
  {
    name: 'Primary database failover',
    action: async () => {
      // Trigger RDS failover to standby
      await rds.rebootDBInstance({ DBInstanceIdentifier: 'myapp-staging', ForceFailover: true })
    },
    restore: async () => {
      // RDS restores itself after failover
    },
    verify: async () => {
      // After failover (< 60s), app should be serving requests
      await sleep(90_000)  // Wait for failover
      const response = await fetch('https://staging.myapp.com/api/healthz')
      return response.ok
    },
  },
]

async function runChaosExperiments() {
  for (const experiment of experiments) {
    console.log(`Running: ${experiment.name}`)
    await experiment.action()

    let passed = false
    try {
      passed = await experiment.verify()
    } finally {
      // Always restore, even if verification throws mid-experiment
      await experiment.restore()
    }
    await sleep(30_000)  // Recovery buffer

    console.log(`${experiment.name}: ${passed ? '✅ PASSED' : '❌ FAILED'}`)

    if (!passed) {
      await alerting.warn(`Chaos experiment failed: ${experiment.name} — SPOF identified`)
    }
  }
}

Fix 5: Multi-Region for True High Availability

# When a full AZ or region goes down:
# Active-passive: one region handles traffic, other is warm standby
# Active-active: traffic distributed across both regions

# Route53 health-check based failover:
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.myapp.com"
  type    = "A"

  set_identifier = "primary"
  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_failover" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.myapp.com"
  type    = "A"

  set_identifier = "secondary"
  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
# The health check the primary record references:
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3   # Fail over after 3 consecutive failed checks
  request_interval  = 30  # Check every 30 seconds
}
# If the primary health check fails, Route53 automatically routes to secondary

SPOF Elimination Checklist

  • ✅ Every stateful component (DB, Redis, message broker) has a replica in a different AZ
  • ✅ Circuit breakers wrap every external dependency with graceful fallback behavior
  • ✅ Dependency map documents what breaks (and what degrades gracefully) for each failure
  • ✅ Chaos experiments in staging verify fallback behavior works before incidents do
  • ✅ No single person holds critical credentials or operational knowledge
  • ✅ CI/CD pipeline failures don't block all teams — teams can deploy independently
  • ✅ Monitoring system itself is redundant — the thing that alerts you can't be the SPOF

Conclusion

SPOFs are discovered in two ways: through deliberate mapping and testing, or through incidents. The first way is much cheaper. Start with a dependency map: for each component, ask what breaks if it's unavailable. Any component whose failure cascades to total service unavailability is a SPOF, and it needs either redundancy, a circuit breaker with graceful degradation, or both. Then verify your fallbacks with chaos engineering in staging — because a fallback that hasn't been tested is just a comment in the code. The goal isn't a system that never fails; it's a system where any single failure degrades gracefully rather than taking everything down with it.