Partial Failure Between Services — When Half Your System Lies
By Sanjeev Sharma (@webcoderspeed1)

Introduction
In monoliths, failure is binary — the code runs or it crashes. In distributed systems, failure is a spectrum. A network call might time out after the server processed the request. A service might return 200 but silently discard part of your data. Half your microservices are up, half are down, and none of them agree on the current state.
Partial failure is the defining challenge of distributed systems.
- The Anatomy of Partial Failure
- Fix 1: Idempotency Keys — Safe Retries
- Fix 2: Saga Pattern — Distributed Transactions
- Fix 3: Outbox Pattern — Reliable Event Publishing
- Fix 4: Health Check Propagation
- Fix 5: Timeout + Fallback Strategy
- Conclusion
The Anatomy of Partial Failure
Client → Order Service → Payment Service → Bank API
                              ↑
               Network times out after 3s

Did payment go through?
→ Client doesn't know
→ Order Service doesn't know
→ Payment Service might know
→ Bank definitely knows — but nobody asked in time

Result: Order placed, customer charged, order status = "failed"
Partial failures happen when:
- Request succeeds but response lost — action executed, caller never knew
- Slow response — caller times out, but service is still processing
- Partial write — some data written, some not, no atomicity
- Inconsistent state across services — Service A updated, Service B didn't
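The first failure mode is the treacherous one: a timeout tells you nothing about whether the work actually happened. A minimal sketch (names hypothetical) of a timeout wrapper that surfaces this as an explicitly *ambiguous* outcome instead of a plain failure:

```typescript
// Hypothetical sketch: a timeout that reports "outcome unknown", because the
// server may have completed the work after we stopped waiting
class AmbiguousOutcomeError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'AmbiguousOutcomeError'
  }
}

async function callWithTimeout<T>(op: () => Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new AmbiguousOutcomeError(`no response after ${ms}ms, outcome unknown`)),
      ms,
    )
  })
  try {
    // If the deadline wins, the operation may still complete on the server
    return await Promise.race([op(), deadline])
  } finally {
    clearTimeout(timer) // don't leak the timer when the operation wins
  }
}
```

Callers should treat `AmbiguousOutcomeError` differently from a real failure: the only safe responses are "check" or "retry idempotently", which is exactly what Fix 1 enables.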
Fix 1: Idempotency Keys — Safe Retries
Make every state-changing operation safe to retry:
// Client: generate and send an idempotency key
import { v4 as uuidv4 } from 'uuid'

async function placeOrder(items: CartItem[]): Promise<Order> {
  // Reuse a stored key if a previous attempt is still pending; otherwise
  // generate one per user action. A retry must send the SAME key.
  const idempotencyKey = localStorage.getItem('pending_order_key') ?? uuidv4()
  localStorage.setItem('pending_order_key', idempotencyKey)
  try {
    const order = await orderService.create(items, { idempotencyKey })
    localStorage.removeItem('pending_order_key')
    return order
  } catch (err: any) {
    // On timeout the outcome is unknown, so retrying with the SAME key is safe
    if (err.code === 'TIMEOUT') {
      const order = await orderService.create(items, { idempotencyKey })
      localStorage.removeItem('pending_order_key')
      return order
    }
    throw err
  }
}
// Server: deduplicate using the idempotency key
app.post('/orders', async (req, res) => {
  const key = req.headers['idempotency-key']
  if (!key) return res.status(400).json({ error: 'idempotency-key required' })

  // Atomically claim the key (NX = set only if absent) so two concurrent
  // requests with the same key can't both start processing
  const claimed = await redis.set(`idempotency:${key}`, 'processing', 'EX', 30, 'NX')
  if (!claimed) {
    const existing = await redis.get(`idempotency:${key}`)
    if (existing && existing !== 'processing') {
      // Operation already completed — return the cached response
      return res.status(200).json(JSON.parse(existing))
    }
    // Another request with this key is still in flight
    return res.status(409).json({ error: 'request already in progress, retry shortly' })
  }

  try {
    const order = await createOrder(req.body)
    // Cache the response for 24 hours so later retries get the same result
    await redis.set(`idempotency:${key}`, JSON.stringify(order), 'EX', 86400)
    res.status(201).json(order)
  } catch (err) {
    // Release the key so the client can retry after a genuine failure
    await redis.del(`idempotency:${key}`)
    throw err
  }
})
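The same dedupe logic can be seen in miniature with Redis swapped for an in-process Map (a sketch for illustration, not production code: a Map has no TTL and isn't shared across instances). Storing the *promise* rather than the result also covers concurrent duplicates cleanly:

```typescript
// Sketch: first caller with a key does the work; every later caller with the
// same key, concurrent or not, gets the same result
class IdempotencyStore<T> {
  private results = new Map<string, Promise<T>>()

  run(key: string, op: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key)
    if (existing) return existing // duplicate: reuse the in-flight/cached result
    const result = op().catch((err) => {
      this.results.delete(key) // failed attempts stay retryable
      throw err
    })
    this.results.set(key, result)
    return result
  }
}
```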
Fix 2: Saga Pattern — Distributed Transactions
When you need atomicity across services, use Sagas with compensating transactions:
// Orchestration-based Saga
type CompletedStep =
  | { service: 'inventory'; action: 'reserve'; orderId: string }
  | { service: 'payment'; action: 'charge'; chargeId: string }
  | { service: 'order'; action: 'create'; orderId: string }

class OrderSaga {
  async execute(orderId: string, items: CartItem[], userId: string) {
    const steps: CompletedStep[] = []
    try {
      // Step 1: Reserve inventory
      await inventoryService.reserve(orderId, items)
      steps.push({ service: 'inventory', action: 'reserve', orderId })
      // Step 2: Charge payment
      const charge = await paymentService.charge(userId, calculateTotal(items))
      steps.push({ service: 'payment', action: 'charge', chargeId: charge.id })
      // Step 3: Create order
      const order = await orderService.create({ orderId, items, userId })
      steps.push({ service: 'order', action: 'create', orderId })
      return order
    } catch (error) {
      // Compensate in reverse order
      await this.compensate(steps, error as Error)
      throw error
    }
  }

  private async compensate(steps: CompletedStep[], originalError: Error) {
    // Execute compensating transactions in reverse
    for (const step of [...steps].reverse()) {
      try {
        if (step.service === 'inventory') {
          await inventoryService.release(step.orderId)
        } else if (step.service === 'payment') {
          await paymentService.refund(step.chargeId)
        } else if (step.service === 'order') {
          await orderService.cancel(step.orderId)
        }
      } catch (compensationError) {
        // Log but continue — compensating transactions can fail too
        logger.error(`Compensation failed for ${step.service}:`, compensationError)
        // Send to a dead letter queue for manual intervention
        await dlqService.send({ step, originalError, compensationError })
      }
    }
  }
}
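The essential behavior is easy to demonstrate in isolation. A toy version with the three stages as plain strings (stubs, not the real services) shows that a failure at any step compensates only the steps that completed, in reverse:

```typescript
// Toy saga: run stages in order; on failure, undo completed stages in reverse
function runToySaga(failAt: string): string[] {
  const log: string[] = []
  const done: string[] = []
  for (const service of ['inventory', 'payment', 'order']) {
    if (service === failAt) {
      // Compensate in reverse order, mirroring compensate() above
      for (const s of [...done].reverse()) log.push(`undo:${s}`)
      return log
    }
    log.push(`do:${service}`)
    done.push(service)
  }
  return log
}
```

So if order creation fails after inventory and payment succeeded, the refund runs before the inventory release: `['do:inventory', 'do:payment', 'undo:payment', 'undo:inventory']`.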
Fix 3: Outbox Pattern — Reliable Event Publishing
A service cannot atomically write to its database AND publish an event — one can always succeed while the other fails:
// ❌ DANGEROUS — event publish can fail after the DB commit
async function createOrder(data: CreateOrderInput) {
  const order = await db.order.create(data) // DB committed
  await eventBus.publish('order.created', order) // Can fail!
  // If publish fails: DB says the order exists, but no event was emitted
  // Downstream services never learn the order was created
}

// ✅ Outbox pattern — write the event to the DB atomically, publish separately
async function createOrder(data: CreateOrderInput) {
  await db.transaction(async (tx) => {
    const order = await tx.order.create(data)
    // Write the event to an outbox table IN THE SAME TRANSACTION
    await tx.outbox.create({
      topic: 'order.created',
      payload: JSON.stringify(order),
      status: 'pending',
      createdAt: new Date(),
    })
    // If the transaction fails → both roll back. If it commits → the event is guaranteed.
  })
}
// Separate outbox processor (retry-safe)
async function processOutbox() {
  const pendingEvents = await db.outbox.findMany({
    where: { status: 'pending' },
    take: 100,
    orderBy: { createdAt: 'asc' },
  })
  for (const event of pendingEvents) {
    try {
      await eventBus.publish(event.topic, JSON.parse(event.payload))
      // If this update fails after a successful publish, the event gets
      // published again; delivery is at-least-once, so consumers must dedupe
      await db.outbox.update({
        where: { id: event.id },
        data: { status: 'published', publishedAt: new Date() },
      })
    } catch (err: any) {
      await db.outbox.update({
        where: { id: event.id },
        data: { attempts: { increment: 1 }, lastError: err.message },
      })
    }
  }
}

// Run every 5 seconds
setInterval(processOutbox, 5000)
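One caveat with `setInterval`: if a batch takes longer than 5 seconds, runs overlap and the same pending rows get published twice. A minimal in-process guard is sketched below (with multiple worker processes you'd instead claim rows in the query itself, e.g. Postgres `SELECT ... FOR UPDATE SKIP LOCKED`):

```typescript
// Sketch: wrap a task so that ticks arriving while a run is still in
// flight are skipped instead of starting a second, overlapping run
function nonOverlapping(task: () => Promise<void>): () => Promise<void> {
  let running = false
  return async () => {
    if (running) return // previous run still in flight — skip this tick
    running = true
    try {
      await task()
    } finally {
      running = false
    }
  }
}

// setInterval(nonOverlapping(processOutbox), 5000)
```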
Fix 4: Health Check Propagation
Services should report partial health, not just alive/dead:
// Detailed health check covering downstream dependencies
app.get('/health', async (req, res) => {
  const targets = [
    { name: 'database', check: () => db.query('SELECT 1') },
    { name: 'cache', check: () => redis.ping() },
    { name: 'payment-service', check: () => paymentService.ping() },
  ]
  const checks = await Promise.allSettled(targets.map((t) => t.check()))
  const results = checks.map((check, i) =>
    check.status === 'fulfilled'
      ? { name: targets[i].name, status: 'ok' }
      : { name: targets[i].name, status: 'degraded', error: check.reason?.message },
  )
  const allHealthy = results.every((r) => r.status === 'ok')
  const anyUp = results.some((r) => r.status === 'ok')
  // 200 = fully healthy, 207 = partially degraded, 503 = everything down
  res.status(allHealthy ? 200 : anyUp ? 207 : 503).json({
    status: allHealthy ? 'healthy' : anyUp ? 'degraded' : 'down',
    checks: results,
  })
})
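Whatever consumes this endpoint — a load balancer, service mesh, or orchestrator — has to turn those codes into routing decisions. The mapping below is an assumption about that consumer, not a standard, but it captures the point of reporting partial health:

```typescript
// Hypothetical upstream policy: a degraded instance still takes traffic
// (shedding optional features), only a fully-down instance is ejected
type Action = 'route' | 'route-degraded' | 'eject'

function routingDecision(statusCode: number): Action {
  if (statusCode === 200) return 'route'
  if (statusCode === 207) return 'route-degraded' // serve core paths only
  return 'eject' // 503 or anything else: stop sending traffic
}
```

With a binary alive/dead check, a service with one slow dependency would be ejected entirely; partial health keeps it serving what it still can.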
Fix 5: Timeout + Fallback Strategy
Never wait forever on a downstream service:
async function getProductWithFallback(productId: string) {
  try {
    // Primary: live catalog service (3s timeout)
    return await Promise.race([
      catalogService.getProduct(productId),
      timeout(3000, 'Catalog service timeout'),
    ])
  } catch (err: any) {
    logger.warn(`Catalog service failed: ${err.message} — using cached data`)
    // Fallback 1: Redis cache
    const cached = await redis.get(`product:${productId}`)
    if (cached) return JSON.parse(cached)
    // Fallback 2: stale database snapshot
    const stale = await db.productSnapshot.findById(productId)
    if (stale) return { ...stale, isStale: true }
    // Fallback 3: graceful degradation
    return { id: productId, name: 'Product Unavailable', error: 'catalog_unavailable' }
  }
}

// Note: Promise.race only abandons the losing call; the underlying request
// keeps running in the background unless the client supports cancellation
function timeout<T>(ms: number, message: string): Promise<T> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(message)), ms)
  )
}
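When the downstream client accepts an `AbortSignal` (as `fetch` does), the timed-out call can be cancelled instead of merely abandoned. A sketch of that variant — `run` is any signal-aware operation you supply:

```typescript
// Sketch: run a signal-aware operation with a deadline; on timeout the
// signal is aborted so the underlying work can actually stop
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), ms)
  try {
    return await run(controller.signal)
  } finally {
    clearTimeout(timer) // avoid leaking the timer when the call succeeds
  }
}

// e.g. withTimeout((signal) => fetch(productUrl, { signal }), 3000)
```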
Conclusion
Partial failure is the defining challenge of distributed systems — and accepting that it will happen is the first step to handling it well. Build for partial failure from the start: idempotency keys make retries safe, the outbox pattern guarantees event delivery, sagas provide distributed rollback, and graceful degradation ensures one failing service doesn't bring down everything else. Design every service call assuming it might partially fail — because it will.