Retry Storm Amplifying Failure — When Good Intentions Crash the System

Introduction

A downstream service starts struggling: 30% of requests fail. Your load balancer and every client in your fleet have retry logic configured, and each failed request is retried 3 times. Suddenly that struggling service is absorbing almost twice its normal traffic, all while already failing.

The retry storm turns a partial degradation into a complete outage. You built retries to increase reliability — they made it worse.

The Math of a Retry Storm

Normal traffic: 1,000 req/s
Service degraded: 30% error rate → 300 errors/s

With naive retries (3 per error):
  300 errors × 3 retries = 900 extra requests/s
  Total load: 1,000 + 900 = 1,900 req/s

Overloaded, the service climbs to a 90% error rate → 1,710 errors/s
  1,710 errors × 3 retries = 5,130 extra requests/s
  Total load: 1,000 + 5,130 = 6,130 req/s

Service now 100% error rate → total collapse

The more clients you have and the more retries they do, the faster the cascade.
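The cascade is easy to simulate. A minimal sketch of the model used above (original demand stays constant; every failed request at the current load level is retried a fixed number of times, and those retries land on top of the original demand):

```typescript
// One wave of the storm: original demand plus retries of everything
// that failed at the current total load.
function nextLoad(
  baseRps: number,      // steady-state demand, before any retries
  currentLoad: number,  // total req/s hitting the service right now
  errorRate: number,    // fraction of requests failing at this load
  retries: number       // retries issued per failed request
): number {
  const errorsPerSec = currentLoad * errorRate
  return baseRps + errorsPerSec * retries
}

const wave1 = nextLoad(1_000, 1_000, 0.3, 3)  // 1,000 + 900   = 1,900 req/s
const wave2 = nextLoad(1_000, wave1, 0.9, 3)  // 1,000 + 5,130 = 6,130 req/s
```

The error rates per wave are illustrative, but the shape is not: each wave of retries raises the load, which raises the error rate, which generates more retries.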

Fix 1: Exponential Backoff with Jitter

Never retry immediately, never retry at fixed intervals:

interface RetryOptions {
  maxRetries: number
  initialDelayMs: number
  maxDelayMs: number
  backoffFactor: number
  jitter: boolean
}

async function fetchWithRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {
    maxRetries: 3,
    initialDelayMs: 100,
    maxDelayMs: 30_000,
    backoffFactor: 2,
    jitter: true,
  }
): Promise<T> {
  let lastError: Error | undefined

  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error as Error
      const status = (error as { status?: number }).status

      // 4xx client errors are not transient, so don't retry (429 is the exception)
      if (status !== undefined && status >= 400 && status < 500 && status !== 429) {
        throw error
      }

      if (attempt === options.maxRetries) break

      const exponentialDelay = options.initialDelayMs * Math.pow(options.backoffFactor, attempt)
      const cappedDelay = Math.min(exponentialDelay, options.maxDelayMs)

      // Full jitter: random delay between 0 and cappedDelay
      // This spreads retries across time — no synchronized retry waves
      const delay = options.jitter
        ? Math.random() * cappedDelay
        : cappedDelay

      console.log(`Attempt ${attempt + 1} failed, retrying in ${delay.toFixed(0)}ms`)
      await sleep(delay)
    }
  }

  throw lastError!
}

// Helper used throughout: promise-based sleep
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Without jitter — all clients retry at t=100, t=200, t=400ms in sync
// With full jitter — retries spread randomly, no synchronized spike
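The difference is visible just by computing the delay schedule both ways. A small sketch using the same defaults as fetchWithRetry (100ms initial delay, factor 2, 30s cap):

```typescript
// Delay for a given attempt: exponential, capped, optionally jittered.
function backoffDelay(
  attempt: number,
  initialMs = 100,
  factor = 2,
  capMs = 30_000,
  jitter = false
): number {
  const capped = Math.min(initialMs * Math.pow(factor, attempt), capMs)
  return jitter ? Math.random() * capped : capped
}

// No jitter: every client in the fleet computes the identical schedule,
// so all retries arrive in synchronized waves.
const fixed = [0, 1, 2].map(a => backoffDelay(a))  // [100, 200, 400]

// Full jitter: each client lands somewhere in [0, cap), spreading the wave.
const jittered = [0, 1, 2].map(a => backoffDelay(a, 100, 2, 30_000, true))
```

With thousands of clients, full jitter turns three sharp spikes into a roughly uniform trickle over the same interval.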

Fix 2: Circuit Breaker — Stop Retrying Failed Services

When a service is clearly down, stop sending requests entirely:

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

class CircuitBreaker {
  private state: CircuitState = 'CLOSED'
  private failureCount = 0
  private successCount = 0
  private lastFailureTime = 0

  constructor(
    private readonly options = {
      failureThreshold: 5,
      successThreshold: 2,
      timeoutMs: 30_000,
    }
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      const timeSinceFailure = Date.now() - this.lastFailureTime

      if (timeSinceFailure < this.options.timeoutMs) {
        throw new Error('Circuit breaker OPEN — fast fail (no retry wasted)')
      }

      // Move to half-open — try one request
      this.state = 'HALF_OPEN'
    }

    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  private onSuccess() {
    this.failureCount = 0
    if (this.state === 'HALF_OPEN') {
      this.successCount++
      if (this.successCount >= this.options.successThreshold) {
        this.state = 'CLOSED'
        this.successCount = 0
        console.log('Circuit CLOSED — service recovered')
      }
    }
  }

  private onFailure() {
    this.failureCount++
    this.lastFailureTime = Date.now()

    if (this.failureCount >= this.options.failureThreshold || this.state === 'HALF_OPEN') {
      this.state = 'OPEN'
      this.successCount = 0
      console.log('Circuit OPENED — stopping requests to failed service')
    }
  }

  getState() { return this.state }
}

// Combined: retry with backoff + circuit breaker
const breaker = new CircuitBreaker()

async function callPaymentService(data: PaymentData) {
  return breaker.call(() =>
    fetchWithRetry(() => paymentApi.charge(data), {
      maxRetries: 2,
      initialDelayMs: 200,
      maxDelayMs: 5_000,
      backoffFactor: 2,
      jitter: true,
    })
  )
}

Fix 3: Retry Budgets — Limit Total Retries Across the Fleet

Individual clients look innocent — but fleet-wide retries can still overwhelm:

// Centralized retry budget in Redis
class RetryBudget {
  constructor(
    private redis: Redis,
    private serviceName: string,
    private maxRetryRatePercent: number = 10  // Max 10% of traffic can be retries
  ) {}

  async canRetry(requestCount: number): Promise<boolean> {
    const key = `retry:budget:${this.serviceName}`
    const windowSec = 60

    // ioredis exec() returns one [error, result] tuple per queued command
    const results = await this.redis
      .multi()
      .incr(key)
      .expire(key, windowSec)
      .exec()

    const retryCount = Number(results?.[0]?.[1] ?? 0)
    const maxRetries = requestCount * (this.maxRetryRatePercent / 100)

    if (retryCount > maxRetries) {
      logger.warn(`Retry budget exhausted for ${this.serviceName}`)
      return false  // Skip retry — protect the service
    }

    return true
  }
}

const paymentBudget = new RetryBudget(redis, 'payment-service', 10)

async function chargeWithBudget(data: PaymentData) {
  try {
    return await paymentApi.charge(data)
  } catch (err) {
    const requestsPerMinute = await getRequestRate('payment-service')
    const allowed = await paymentBudget.canRetry(requestsPerMinute)

    if (!allowed) {
      throw new Error('Retry budget exhausted — try again later')
    }

    await sleep(200 + Math.random() * 200)  // jittered backoff before the single retry
    return await paymentApi.charge(data)
  }
}
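A fleet-wide Redis budget needs infrastructure; the same idea also works purely in-process. A hedged sketch of a token-bucket retry budget (names and parameters are illustrative; Finagle's RetryBudget and gRPC's retry throttling follow the same deposit/withdraw pattern):

```typescript
// Each completed request deposits a fraction of a retry token; each retry
// withdraws a whole token. With ratio = 0.1, at most ~10% of this client's
// traffic can be retries, no matter how badly the downstream is failing.
class LocalRetryBudget {
  private tokens: number

  constructor(
    private readonly ratio = 0.1,      // tokens deposited per request
    private readonly maxTokens = 100   // burst cap
  ) {
    this.tokens = maxTokens
  }

  recordRequest(): void {
    this.tokens = Math.min(this.maxTokens, this.tokens + this.ratio)
  }

  tryRetry(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false  // budget exhausted: fail fast instead of amplifying
  }
}
```

The key property: when everything is failing, retries self-limit to a fixed fraction of real traffic instead of multiplying it.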

Fix 4: Only Retry Idempotent Operations

Never blindly retry operations that can cause double execution:

// Classify which errors and operations are safe to retry
function isRetryable(error: Error, method: string): boolean {
  // Never retry POST mutations (might execute twice)
  if (method === 'POST' && !error.isIdempotent) return false

  // Retry on transient errors
  const retryableStatusCodes = [429, 500, 502, 503, 504]
  if (error.status && retryableStatusCodes.includes(error.status)) return true

  // Retry on network errors
  if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') return true

  // Don't retry on 4xx client errors (except 429 rate limit)
  if (error.status >= 400 && error.status < 500) return false

  return false
}

async function smartRetry<T>(
  fn: () => Promise<T>,
  method: string,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await fn()
    } catch (err) {
      if (i === maxRetries || !isRetryable(err as Error, method)) throw err

      const delay = Math.min(100 * Math.pow(2, i) * (0.5 + Math.random()), 10_000)
      await sleep(delay)
    }
  }
  throw new Error('Unreachable')
}
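For POSTs that must be retryable (payments, orders), the standard pattern is an idempotency key generated once per logical operation and reused on every attempt, so the server can deduplicate. A minimal sketch (the header name follows the common Idempotency-Key convention; the request shape is hypothetical):

```typescript
import { randomUUID } from 'node:crypto'

// Build the request object once per logical charge. Every retry sends the
// exact same key, so a server that stores seen keys executes it at most once.
function buildChargeRequest(amountCents: number) {
  return {
    method: 'POST' as const,
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': randomUUID(),  // one key per operation, NOT per attempt
    },
    body: JSON.stringify({ amountCents }),
  }
}

const req = buildChargeRequest(4_999)
// Retries must reuse req as-is. Rebuilding it would mint a fresh key,
// and the server would see two distinct charges.
```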

Fix 5: Respect Retry-After Headers

async function respectServerBackoff(url: string, options: RequestInit) {
  const response = await fetch(url, options)

  if (response.status === 429) {
    const retryAfter = response.headers.get('Retry-After')

    if (retryAfter) {
      const delayMs = isNaN(Number(retryAfter))
        ? new Date(retryAfter).getTime() - Date.now()  // HTTP date format
        : Number(retryAfter) * 1000                     // Seconds format

      console.log(`Rate limited — waiting ${delayMs}ms as server requested`)
      await sleep(Math.max(delayMs, 0))
      return fetch(url, options)  // Single retry after server-specified wait
    }
  }

  return response
}
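The header parsing is worth isolating, since Retry-After legally carries either delta-seconds or an HTTP-date (RFC 9110). A small sketch:

```typescript
// Returns the wait in milliseconds, or 0 if the header is missing context
// or unparseable (in which case callers fall back to their own backoff).
function parseRetryAfter(header: string, nowMs: number = Date.now()): number {
  const seconds = Number(header)
  if (!Number.isNaN(seconds)) {
    return Math.max(seconds * 1000, 0)       // delta-seconds form: "120"
  }
  const dateMs = new Date(header).getTime()  // HTTP-date form
  return Number.isNaN(dateMs) ? 0 : Math.max(dateMs - nowMs, 0)
}
```

Clamping to 0 matters for the HTTP-date form: clock skew can put the server's date in the past, and a negative sleep should just mean "retry now".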

The Retry Rules

Scenario                    Rule
Network timeout             Retry with jitter — might not have executed
500 server error            Retry with jitter + circuit breaker
429 rate limited            Retry after the Retry-After header
404 not found               Never retry — won't change
400 bad request             Never retry — your fault
Payment/order POST          Only retry if an idempotency key was sent
Downstream circuit OPEN     Never retry — fast fail immediately

Conclusion

Retries are a double-edged sword in distributed systems. Without exponential backoff and jitter, they synchronize — causing spike patterns. Without circuit breakers, they hammer already-failing services. Without idempotency, they cause double-execution. The right retry strategy: exponential backoff with full jitter, circuit breakers to fast-fail clearly dead services, retry budgets to limit fleet-wide amplification, and idempotency keys for all non-idempotent operations.