Scaling Under Black Friday Traffic — When Your Best Day Becomes Your Worst Incident

Introduction

Black Friday traffic isn't just "more traffic" — it's a rapid spike that tests every assumption you made during normal load. The auto-scaler that works fine for gradual growth takes 4 minutes to add capacity; in that gap, your connection pool fills, your cache gets hammered, and your checkout goes down. The teams that survive Black Friday don't just have more capacity — they prepare specifically for the shape of the traffic: sudden, massive, and concentrated on exactly the pages that matter most.

The Anatomy of a Traffic Spike Failure

Black Friday failure timeline (10x traffic spike):

T-60min: Load test said system handles 5x. Team is confident.
T+0:    8:00 AM EST. Traffic starts climbing.
T+2min: Traffic at 3x normal. All good.
T+4min: Traffic at 7x. Auto-scaler triggers (threshold: 70% CPU).
T+5min: Auto-scaler launches new instances. ECS task startup: ~3 minutes.
T+6min: Traffic at 10x. DB connection pool exhausted (max: 100 connections).
T+7min: DB queries start timing out. App returns 503s.
T+8min: Checkout is down. Revenue: $0/min on highest-traffic day.
T+11min: New instances finally healthy (task startup plus health-check grace period). But DB is still overwhelmed.
T+15min: PgBouncer connection pooler manually added. Situation stabilizes.

What worked: auto-scaling (eventually)
What didn't: DB connection pool, cold start time, no pre-warming
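
The timeline's numbers make the cost of the scaling gap easy to estimate. A back-of-envelope sketch (all figures hypothetical):

```typescript
// Rough count of requests arriving above baseline capacity during the gap:
// capacity stays near baseline from the scaler trigger (T+4min)
// until new instances are healthy (T+11min).
const baselineRps = 500        // hypothetical normal request rate
const peakMultiplier = 10      // peak traffic relative to baseline
const gapMinutes = 7           // T+4 (trigger) to T+11 (healthy)

const excessRps = baselineRps * (peakMultiplier - 1)      // 4500 rps over capacity
const requestsOverCapacity = excessRps * 60 * gapMinutes  // ~1.89M requests at risk
```

No queue absorbs nearly two million excess requests; they either time out or take the database down with them.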

Fix 1: Pre-Warm Capacity Before the Event

#!/bin/bash
# pre-warm.sh — run 30 minutes before expected traffic spike

# Scale up BEFORE traffic arrives, not in response to it
aws ecs update-service \
  --cluster production \
  --service myapp-api \
  --desired-count 40  # 4x normal capacity

# Wait for tasks to be healthy
aws ecs wait services-stable \
  --cluster production \
  --services myapp-api

echo "✅ Pre-warmed to 40 instances"

# Verify targets are healthy behind the ALB before traffic arrives
# (true ALB pre-warming for extreme spikes requires an AWS support request)
aws elbv2 describe-target-health \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --query 'TargetHealthDescriptions[*].{Target:Target.Id,Health:TargetHealth.State}'

// Scheduled pre-warming for known events (Node, using node-cron)
import cron from 'node-cron'

// Scale up at 7:30 AM EST on Black Friday
cron.schedule('30 12 28 11 *', async () => {  // 12:30 UTC = 7:30 AM EST
  await scaleService('myapp-api', { desiredCount: 40 })
  await scaleService('myapp-worker', { desiredCount: 20 })
  await alerting.info('Black Friday pre-warm complete: 40 API + 20 worker instances')
})

// Scale back down at 11 PM
cron.schedule('0 4 29 11 *', async () => {
  await scaleService('myapp-api', { desiredCount: 10 })
  await scaleService('myapp-worker', { desiredCount: 5 })
})
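
The `scaleService` helper above is assumed rather than shown. A minimal sketch that shells out to the same aws CLI used in pre-warm.sh (the cluster name is a placeholder):

```typescript
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'

const run = promisify(execFile)

// Builds the aws-cli argument list; kept pure so it is easy to unit test
function scaleArgs(service: string, desiredCount: number, cluster = 'production'): string[] {
  return [
    'ecs', 'update-service',
    '--cluster', cluster,
    '--service', service,
    '--desired-count', String(desiredCount),
  ]
}

async function scaleService(service: string, opts: { desiredCount: number }): Promise<void> {
  await run('aws', scaleArgs(service, opts.desiredCount))
}
```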

Fix 2: PgBouncer — Connection Pooling at the Proxy Level

# Without PgBouncer: each app instance holds N connections to Postgres
# 40 instances × 25 connections each = 1000 connections
# Postgres max_connections: 200 → KABOOM

# With PgBouncer: all app instances connect to PgBouncer
# PgBouncer maintains a small, stable pool to Postgres

# pgbouncer.ini
[databases]
myapp = host=postgres-primary port=5432 dbname=myapp

[pgbouncer]
listen_port = 5432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Transaction pooling: connection released after each transaction
# (not each session — much more efficient)
pool_mode = transaction
max_client_conn = 1000   # App can open 1000 connections to PgBouncer
default_pool_size = 25   # PgBouncer uses only 25 connections to Postgres

server_idle_timeout = 600
client_idle_timeout = 0  # Don't close idle client connections
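
On the application side, the only change is pointing the driver at PgBouncer instead of Postgres. A hypothetical node-postgres config (the host name is a placeholder); note that with pool_mode = transaction, session-level features such as prepared statements, SET, and advisory locks won't behave as expected:

```typescript
// Hypothetical pool settings, aimed at PgBouncer rather than Postgres directly.
// Pass this to `new Pool(dbConfig)` from the 'pg' package.
const dbConfig = {
  host: 'pgbouncer.internal',   // placeholder: the PgBouncer service, not Postgres
  port: 5432,                   // matches listen_port in pgbouncer.ini above
  database: 'myapp',
  max: 25,                      // per-instance cap; PgBouncer absorbs the fan-in
  idleTimeoutMillis: 30000,     // release idle client connections after 30s
}
```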

Fix 3: Shed Non-Critical Load During Spikes

// During high traffic, protect critical paths by dropping less important ones

// Load shedder: track current load
let currentLoad = 0
const MAX_LOAD = 1000  // Max concurrent requests

app.use((req, res, next) => {
  // Critical paths are never shed, but they still count toward load below
  const isCritical = req.path.startsWith('/checkout') ||
                     req.path.startsWith('/payment') ||
                     req.path === '/health'

  if (!isCritical) {
    if (currentLoad > MAX_LOAD) {
      // 100% load: return 503 for everything non-critical
      return res.status(503).set('Retry-After', '10').json({
        error: 'Service at capacity',
        retryAfter: 10,
      })
    }

    if (currentLoad > MAX_LOAD * 0.9 &&
        (req.path.startsWith('/recommendations') || req.path.startsWith('/analytics'))) {
      // 90% load: shed recommendations and analytics first
      return res.status(503).set('Retry-After', '30').json({
        error: 'Service temporarily shedding non-critical load',
        retryAfter: 30,
      })
    }
  }

  currentLoad++
  // 'close' fires for completed AND aborted responses, so the counter
  // can't leak; 'once' guards against a double decrement
  res.once('close', () => { currentLoad-- })
  next()
})
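
Because the thresholds are the part most likely to be tuned under pressure, it can help to pull the decision out into a pure function that is trivial to unit test. A sketch (names are illustrative, thresholds mirror the middleware above):

```typescript
const SHED_FIRST = ['/recommendations', '/analytics']

type Decision = 'allow' | 'shed'

// Critical paths always pass; shed-first paths drop above 90% load;
// everything else non-critical drops above 100% load.
function shedDecision(path: string, load: number, maxLoad: number): Decision {
  const critical = path.startsWith('/checkout') ||
                   path.startsWith('/payment') ||
                   path === '/health'
  if (critical) return 'allow'
  if (load > maxLoad) return 'shed'
  if (load > maxLoad * 0.9 && SHED_FIRST.some((p) => path.startsWith(p))) return 'shed'
  return 'allow'
}
```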

Fix 4: Cache Stampede Protection

// During a spike, cache misses all hit the DB simultaneously
// Solution: probabilistic early expiry + mutex locking

import { createClient } from 'redis'

const redis = createClient()
await redis.connect()  // node-redis v4+: the client must connect before use

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function getWithStampedeProtection<T>(
  key: string,
  ttl: number,
  fetchFn: () => Promise<T>
): Promise<T> {
  // Check cache
  const cached = await redis.get(key)
  if (cached) {
    const { value, expiresAt } = JSON.parse(cached)

    // Probabilistic early refresh: refresh probability grows as expiry nears,
    // spreading refreshes out instead of stampeding at TTL = 0.
    // remainingTtl is in ms; ttl * 200 is 1/5 of the full TTL in ms, so this
    // is exp(-5 × fraction of TTL remaining): ~0.7% when fresh, ~100% at expiry.
    const remainingTtl = expiresAt - Date.now()
    const shouldEarlyRefresh = Math.random() < Math.exp(-remainingTtl / (ttl * 200))

    if (!shouldEarlyRefresh) {
      return value
    }
  }

  // Use a distributed lock to prevent multiple refreshes
  const lockKey = `lock:${key}`
  const lockAcquired = await redis.set(lockKey, '1', { NX: true, EX: 10 })

  if (!lockAcquired) {
    // Another process is refreshing — return stale value or wait
    if (cached) {
      return JSON.parse(cached).value  // Return stale while fresh is being fetched
    }
    // Wait for the lock holder to finish
    await sleep(100)
    return getWithStampedeProtection(key, ttl, fetchFn)
  }

  try {
    const value = await fetchFn()
    await redis.set(key, JSON.stringify({ value, expiresAt: Date.now() + ttl * 1000 }), { EX: ttl })
    return value
  } finally {
    await redis.del(lockKey)
  }
}
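
The early-refresh probability is the subtle part, so it is worth checking in isolation. This sketch restates the same formula with explicit units and shows how the probability climbs as the entry ages:

```typescript
// Probability of refreshing early, given seconds of TTL remaining.
// Same as Math.exp(-remainingTtlMs / (ttlSeconds * 200)) above,
// i.e. exp(-5 × fraction of TTL remaining).
function earlyRefreshProbability(remainingSeconds: number, ttlSeconds: number): number {
  return Math.exp(-(remainingSeconds * 1000) / (ttlSeconds * 200))
}

// With a 60s TTL:
// fresh entry    → exp(-5)   ≈ 0.7% of reads trigger a refresh
// half expired   → exp(-2.5) ≈ 8.2%
// about to expire → exp(0)   = 100%
```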

Fix 5: Load Test at Scale Before the Event

// k6 load test — simulate Black Friday traffic shape
// Run this 2 weeks before the event, with time to fix findings

import http from 'k6/http'
import { sleep, check } from 'k6'

export const options = {
  stages: [
    { duration: '2m', target: 100 },    // Normal load
    { duration: '1m', target: 1000 },   // Spike to 10x
    { duration: '5m', target: 1000 },   // Sustained spike
    { duration: '2m', target: 100 },    // Recovery
    { duration: '2m', target: 0 },      // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<2000'],  // 99% of requests under 2s
    http_req_failed: ['rate<0.01'],     // <1% error rate
  },
}

// Pick a flow with probability proportional to its weight
function weightedRandom(items) {
  const total = items.reduce((sum, item) => sum + item.weight, 0)
  let r = Math.random() * total
  for (const item of items) {
    r -= item.weight
    if (r <= 0) return item
  }
  return items[items.length - 1]
}

export default function () {
  // Simulate Black Friday traffic mix
  const flows = [
    { weight: 40, path: '/products' },        // Browse
    { weight: 30, path: '/products/featured' }, // Featured
    { weight: 20, path: '/cart' },             // Cart
    { weight: 10, path: '/checkout' },         // Checkout (critical)
  ]

  const flow = weightedRandom(flows)
  const res = http.get(`https://staging.myapp.com${flow.path}`)

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time OK': (r) => r.timings.duration < 2000,
  })

  sleep(1)
}

Black Friday Readiness Checklist

  • ✅ Load tested at 2x the expected peak (not just 1x)
  • ✅ Pre-warming scheduled to run 30+ minutes before traffic spike
  • ✅ Database connection pooler (PgBouncer) in place — app doesn't hold raw connections
  • ✅ Critical paths (checkout, payment) protected from load shedding
  • ✅ Cache stampede protection on high-traffic cache keys
  • ✅ Auto-scaling thresholds set low enough to trigger before saturation
  • ✅ War room scheduled: engineers on standby during the peak window
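
For the auto-scaling threshold item, a target-tracking policy aimed well below saturation buys headroom for slow task startup. An illustrative Application Auto Scaling configuration, passed to aws application-autoscaling put-scaling-policy (values are examples, not recommendations):

```json
{
  "TargetValue": 40.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}
```

A short scale-out cooldown with a long scale-in cooldown lets capacity ratchet up quickly during a spike without flapping back down between waves.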

Conclusion

Black Friday failures are almost always capacity failures that happen because the system was designed for gradual traffic growth, not sudden spikes. The fixes are operational as much as technical: pre-warm capacity before the event (don't wait for auto-scaling), install PgBouncer so connection counts don't explode, protect checkout with load shedding that sacrifices less critical features first, and run a full-scale load test with time to act on findings. The teams that have good Black Fridays rehearse them.