Retry Storm Amplifying Failure — When Good Intentions Crash the System
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
A downstream service starts struggling — 30% of requests fail. Your load balancer and every client in your fleet have retry logic configured. Each failed request retries 3 times. Suddenly that struggling service is receiving 3x its normal traffic, all while already failing.
The retry storm turns a partial degradation into a complete outage. You built retries to increase reliability — they made it worse.
- The Math of a Retry Storm
- Fix 1: Exponential Backoff with Jitter
- Fix 2: Circuit Breaker — Stop Retrying Failed Services
- Fix 3: Retry Budgets — Limit Total Retries Across the Fleet
- Fix 4: Only Retry Idempotent Operations
- Fix 5: Respect Retry-After Headers
- The Retry Rules
- Conclusion
The Math of a Retry Storm
Normal traffic: 1,000 req/s
Service degraded: 30% error rate → 300 errors/s
With naive retries (3 per error):
300 errors/s × 3 retries = 900 extra requests/s
Total load: 1,000 + 900 = 1,900 req/s
The overloaded service now fails 90% of that traffic → 1,710 errors/s
1,710 errors/s × 3 retries = 5,130 extra requests/s
Total load: 1,000 + 5,130 = 6,130 req/s
Service now at 100% error rate → total collapse
The more clients you have and the more retries each one makes, the faster the cascade.
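The cascade can be reproduced in a few lines. This is a deliberately simple model, assuming every failure is retried immediately and retries add straight onto the base load (no queuing, no backoff):

```typescript
// One retry wave: every failed request is retried `retries` more times
// on top of the base traffic.
function loadWithRetries(baseRps: number, failedRps: number, retries: number): number {
  return baseRps + failedRps * retries
}

const baseRps = 1_000
const retries = 3

// Round 1: 30% of 1,000 req/s fail
const errors1 = baseRps * 0.3                            // 300 errors/s
const load1 = loadWithRetries(baseRps, errors1, retries) // 1,900 req/s

// Round 2: the overloaded service now fails 90% of that traffic
const errors2 = load1 * 0.9                              // 1,710 errors/s
const load2 = loadWithRetries(baseRps, errors2, retries) // 6,130 req/s

console.log(load1, load2)
```

Each round feeds the next: more errors produce more retries, which produce more load, which produces more errors.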
Fix 1: Exponential Backoff with Jitter
Never retry immediately, never retry at fixed intervals:
```typescript
interface RetryOptions {
  maxRetries: number
  initialDelayMs: number
  maxDelayMs: number
  backoffFactor: number
  jitter: boolean
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function fetchWithRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {
    maxRetries: 3,
    initialDelayMs: 100,
    maxDelayMs: 30_000,
    backoffFactor: 2,
    jitter: true,
  }
): Promise<T> {
  let lastError: Error | undefined
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error: any) {
      lastError = error as Error
      // Don't retry on client errors (4xx)
      if (error.status >= 400 && error.status < 500 && error.status !== 429) {
        throw error // 4xx errors are not transient — don't retry
      }
      if (attempt === options.maxRetries) break
      const exponentialDelay = options.initialDelayMs * Math.pow(options.backoffFactor, attempt)
      const cappedDelay = Math.min(exponentialDelay, options.maxDelayMs)
      // Full jitter: random delay between 0 and cappedDelay
      // This spreads retries across time — no synchronized retry waves
      const delay = options.jitter ? Math.random() * cappedDelay : cappedDelay
      console.log(`Attempt ${attempt + 1} failed, retrying in ${delay.toFixed(0)}ms`)
      await sleep(delay)
    }
  }
  throw lastError!
}

// Without jitter — all clients retry at t=100, t=200, t=400ms in sync
// With full jitter — retries spread randomly, no synchronized spike
```
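To make those last two comments concrete, here is a hypothetical schedule generator (illustrative, not part of the code above) that computes the delays a single client would use after a failure at t=0:

```typescript
// Compute the retry schedule one client follows after a failure at t=0.
function retrySchedule(jitter: boolean, attempts = 3, initialMs = 100, factor = 2): number[] {
  return Array.from({ length: attempts }, (_, attempt) => {
    const capped = initialMs * Math.pow(factor, attempt)
    return jitter ? Math.random() * capped : capped
  })
}

// Without jitter, every client computes the exact same schedule,
// so thousands of clients hit the service in three synchronized waves.
console.log(retrySchedule(false)) // [100, 200, 400]

// With full jitter, each delay is uniform in [0, cap), so the waves smear out.
console.log(retrySchedule(true)) // e.g. [37.2, 151.8, 88.4]
```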
Fix 2: Circuit Breaker — Stop Retrying Failed Services
When a service is clearly down, stop sending requests entirely:
```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

class CircuitBreaker {
  private state: CircuitState = 'CLOSED'
  private failureCount = 0
  private successCount = 0
  private lastFailureTime = 0

  constructor(
    private readonly options = {
      failureThreshold: 5,
      successThreshold: 2,
      timeoutMs: 30_000,
    }
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      const timeSinceFailure = Date.now() - this.lastFailureTime
      if (timeSinceFailure < this.options.timeoutMs) {
        throw new Error('Circuit breaker OPEN — fast fail (no retry wasted)')
      }
      // Move to half-open — let one probe request through
      this.state = 'HALF_OPEN'
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  private onSuccess() {
    this.failureCount = 0
    if (this.state === 'HALF_OPEN') {
      this.successCount++
      if (this.successCount >= this.options.successThreshold) {
        this.state = 'CLOSED'
        this.successCount = 0
        console.log('Circuit CLOSED — service recovered')
      }
    }
  }

  private onFailure() {
    this.failureCount++
    this.lastFailureTime = Date.now()
    if (this.failureCount >= this.options.failureThreshold || this.state === 'HALF_OPEN') {
      this.state = 'OPEN'
      this.successCount = 0
      console.log('Circuit OPENED — stopping requests to failed service')
    }
  }

  getState() {
    return this.state
  }
}

// Combined: retry with backoff + circuit breaker.
// PaymentData and paymentApi are application-specific and assumed to exist.
const breaker = new CircuitBreaker()

async function callPaymentService(data: PaymentData) {
  return breaker.call(() =>
    fetchWithRetry(() => paymentApi.charge(data), {
      maxRetries: 2,
      initialDelayMs: 200,
      maxDelayMs: 5_000,
      backoffFactor: 2,
      jitter: true,
    })
  )
}
```
Fix 3: Retry Budgets — Limit Total Retries Across the Fleet
Each client's retries look harmless in isolation, but summed across the whole fleet they can still overwhelm the service:
```typescript
// Centralized retry budget in Redis (ioredis client assumed)
import Redis from 'ioredis'

class RetryBudget {
  constructor(
    private redis: Redis,
    private serviceName: string,
    private maxRetryRatePercent: number = 10 // Max 10% of traffic can be retries
  ) {}

  async canRetry(requestCount: number): Promise<boolean> {
    const key = `retry:budget:${this.serviceName}`
    const windowSec = 60
    // ioredis MULTI returns [error, result] pairs — the INCR result is results[0][1]
    const results = await this.redis
      .multi()
      .incr(key)
      .expire(key, windowSec)
      .exec()
    const retryCount = Number(results?.[0]?.[1] ?? 0)
    const maxRetries = requestCount * (this.maxRetryRatePercent / 100)
    if (retryCount > maxRetries) {
      console.warn(`Retry budget exhausted for ${this.serviceName}`)
      return false // Skip retry — protect the service
    }
    return true
  }
}

// getRequestRate and exponentialBackoff are application helpers assumed to exist
const redis = new Redis()
const paymentBudget = new RetryBudget(redis, 'payment-service', 10)

async function chargeWithBudget(data: PaymentData) {
  try {
    return await paymentApi.charge(data)
  } catch (err) {
    const requestsPerMinute = await getRequestRate('payment-service')
    const allowed = await paymentBudget.canRetry(requestsPerMinute)
    if (!allowed) {
      throw new Error('Retry budget exhausted — try again later')
    }
    await sleep(exponentialBackoff(1))
    return await paymentApi.charge(data)
  }
}
```
Fix 4: Only Retry Idempotent Operations
Never blindly retry operations that can cause double execution:
```typescript
// Classify which errors and operations are safe to retry.
// An error object cannot tell you whether the operation was idempotent,
// so that fact is passed in by the caller.
function isRetryable(error: any, method: string, hasIdempotencyKey = false): boolean {
  // Never retry POST mutations without an idempotency key (might execute twice)
  if (method === 'POST' && !hasIdempotencyKey) return false
  // Retry on transient errors
  const retryableStatusCodes = [429, 500, 502, 503, 504]
  if (error.status && retryableStatusCodes.includes(error.status)) return true
  // Retry on network errors
  if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') return true
  // Don't retry on 4xx client errors (429 rate limit was handled above)
  if (error.status >= 400 && error.status < 500) return false
  return false
}

async function smartRetry<T>(
  fn: () => Promise<T>,
  method: string,
  maxRetries = 3,
  hasIdempotencyKey = false
): Promise<T> {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await fn()
    } catch (err) {
      if (i === maxRetries || !isRetryable(err, method, hasIdempotencyKey)) throw err
      // Equal jitter: base delay × random factor in [0.5, 1.5), capped at 10s
      const delay = Math.min(100 * Math.pow(2, i) * (0.5 + Math.random()), 10_000)
      await sleep(delay)
    }
  }
  throw new Error('Unreachable')
}
```
Fix 5: Respect Retry-After Headers
```typescript
async function respectServerBackoff(url: string, options: RequestInit) {
  const response = await fetch(url, options)
  if (response.status === 429) {
    const retryAfter = response.headers.get('Retry-After')
    if (retryAfter) {
      const delayMs = isNaN(Number(retryAfter))
        ? new Date(retryAfter).getTime() - Date.now() // HTTP-date format
        : Number(retryAfter) * 1000 // Seconds format
      console.log(`Rate limited — waiting ${delayMs}ms as server requested`)
      await sleep(Math.max(delayMs, 0))
      return fetch(url, options) // Single retry after server-specified wait
    }
  }
  return response
}
```
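That helper retries exactly once. A natural extension, sketched here under stated assumptions (global `fetch`, a local `sleep` helper, hypothetical function names), is a loop that prefers the server's Retry-After hint when present and falls back to jittered exponential backoff otherwise:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

// Parse Retry-After (seconds or HTTP-date form) into milliseconds, or null if absent/invalid.
function retryAfterMs(header: string | null): number | null {
  if (!header) return null
  const seconds = Number(header)
  if (!isNaN(seconds)) return seconds * 1000
  const dateMs = new Date(header).getTime() - Date.now()
  return isNaN(dateMs) ? null : Math.max(dateMs, 0)
}

async function fetchHonoringRetryAfter(
  url: string,
  options: RequestInit,
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, options)
    if (response.status !== 429 || attempt === maxRetries) return response
    // Prefer the server's hint; otherwise fall back to jittered exponential backoff
    const hinted = retryAfterMs(response.headers.get('Retry-After'))
    const delay = hinted ?? Math.random() * Math.min(100 * 2 ** attempt, 10_000)
    await sleep(delay)
  }
}
```

The server knows its own recovery timeline better than any client-side heuristic, so its hint always wins when available.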
The Retry Rules
| Scenario | Rule |
|---|---|
| Network timeout | Retry with jitter — might not have executed |
| 500 server error | Retry with jitter + circuit breaker |
| 429 rate limited | Retry after Retry-After header |
| 404 not found | Never retry — won't change |
| 400 bad request | Never retry — your fault |
| Payment/order POST | Only retry if idempotency key sent |
| Downstream circuit OPEN | Never retry — fast fail immediately |
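As a rough sketch, the table above can be encoded as a single decision function. The context fields and action names here are illustrative, not from any library:

```typescript
type RetryAction = 'RETRY_WITH_JITTER' | 'RETRY_AFTER_HEADER' | 'NEVER_RETRY'

interface FailureContext {
  status?: number          // HTTP status, if a response arrived
  networkTimeout?: boolean // request may not have executed at all
  circuitOpen?: boolean    // downstream circuit breaker is OPEN
  method?: string
  hasIdempotencyKey?: boolean
}

function retryDecision(ctx: FailureContext): RetryAction {
  if (ctx.circuitOpen) return 'NEVER_RETRY'                       // fast fail immediately
  if (ctx.method === 'POST' && !ctx.hasIdempotencyKey) return 'NEVER_RETRY'
  if (ctx.networkTimeout) return 'RETRY_WITH_JITTER'              // might not have executed
  if (ctx.status === 429) return 'RETRY_AFTER_HEADER'
  if (ctx.status !== undefined && ctx.status >= 500) return 'RETRY_WITH_JITTER'
  return 'NEVER_RETRY'                                            // 4xx won't change
}
```

Note the ordering: the circuit breaker and idempotency checks come first, because no status code makes a non-idempotent retry safe.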
Conclusion
Retries are a double-edged sword in distributed systems. Without exponential backoff and jitter, they synchronize — causing spike patterns. Without circuit breakers, they hammer already-failing services. Without idempotency, they cause double-execution. The right retry strategy: exponential backoff with full jitter, circuit breakers to fast-fail clearly dead services, retry budgets to limit fleet-wide amplification, and idempotency keys for all non-idempotent operations.