Partial Failure Between Services — When Half Your System Lies

Introduction

In monoliths, failure is binary — the code runs or it crashes. In distributed systems, failure is a spectrum. A network call might time out after the server processed the request. A service might return 200 but silently discard part of your data. Half your microservices are up, half are down, and none of them agree on the current state.

Partial failure is the defining challenge of distributed systems.

The Anatomy of Partial Failure

Client → Order Service → Payment Service → Bank API
              Network times out after 3s

Did payment go through?
Client doesn't know
Order Service doesn't know
Payment Service might know
Bank definitely knows — but nobody asked in time

Result: Order placed, customer charged, order status = "failed"

Partial failures happen when:

  1. Request succeeds but response lost — action executed, caller never knew
  2. Slow response — caller times out, but service is still processing
  3. Partial write — some data written, some not, no atomicity
  4. Inconsistent state across services — Service A updated, Service B didn't
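The first two cases share the same trap: from the caller's side, a timeout is indistinguishable from a failure. A minimal sketch (all names invented) where `slowCharge` stands in for a payment call that completes after the client has already given up:

```typescript
// Invented names: a client timeout does NOT mean the server did nothing.
const serverState = { charged: false }

function slowCharge(): Promise<string> {
  return new Promise((resolve) =>
    setTimeout(() => {
      serverState.charged = true  // the server-side effect happens regardless
      resolve('charge-ok')
    }, 50),
  )
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error('TIMEOUT')), ms)),
  ])
}

async function demo(): Promise<void> {
  try {
    await withTimeout(slowCharge(), 10)  // client gives up after 10ms
  } catch {
    // The caller observed a "failure"...
  }
  await new Promise((r) => setTimeout(r, 100))
  // ...but the server kept going: the "failed" call actually charged the customer
  console.log(`charged: ${serverState.charged}`)  // charged: true
}

demo()
```

This is why blind retries are dangerous: the caller saw an error, but the money moved anyway. The fixes below all exist to close this gap.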

Fix 1: Idempotency Keys — Safe Retries

Make every state-changing operation safe to retry:

// Client: generate and send idempotency key
import { v4 as uuidv4 } from 'uuid'

async function placeOrder(items: CartItem[]): Promise<Order> {
  const idempotencyKey = uuidv4()  // Generated once per user action
  // Store key locally so retry uses SAME key
  localStorage.setItem('pending_order_key', idempotencyKey)

  try {
    const order = await orderService.create(items, { idempotencyKey })
    localStorage.removeItem('pending_order_key')
    return order
  } catch (err) {
    // On timeout: retry with the SAME key, which makes the resend safe
    if (err.code === 'TIMEOUT') {
      const existingKey = localStorage.getItem('pending_order_key')!
      const order = await orderService.create(items, { idempotencyKey: existingKey })
      localStorage.removeItem('pending_order_key')  // clear it, or the NEXT order would reuse this key
      return order
    }
    throw err
  }
}
// Server: deduplicate using idempotency key
app.post('/orders', async (req, res) => {
  const key = req.headers['idempotency-key']
  if (!key) return res.status(400).json({ error: 'idempotency-key required' })

  // Atomically claim the key so two concurrent requests can't both proceed
  // (NX = set only if the key doesn't already exist)
  const claimed = await redis.set(`idempotency:${key}`, 'processing', 'EX', 30, 'NX')
  if (!claimed) {
    const existing = await redis.get(`idempotency:${key}`)
    if (existing && existing !== 'processing') {
      // Return cached response — operation already completed
      return res.status(200).json(JSON.parse(existing))
    }
    // Another request with this key is still in flight
    return res.status(409).json({ error: 'request with this key already in progress' })
  }

  try {
    const order = await createOrder(req.body)

    // Cache the response for 24 hours
    await redis.set(`idempotency:${key}`, JSON.stringify(order), 'EX', 86400)
    res.status(201).json(order)
  } catch (err) {
    await redis.del(`idempotency:${key}`)
    throw err
  }
})

Fix 2: Saga Pattern — Distributed Transactions

When you need atomicity across services, use Sagas with compensating transactions:

// Orchestration-based Saga
class OrderSaga {
  async execute(orderId: string, items: CartItem[], userId: string) {
    const steps: CompletedStep[] = []

    try {
      // Step 1: Reserve inventory
      await inventoryService.reserve(orderId, items)
      steps.push({ service: 'inventory', action: 'reserve', orderId })

      // Step 2: Charge payment
      const charge = await paymentService.charge(userId, calculateTotal(items))
      steps.push({ service: 'payment', action: 'charge', chargeId: charge.id })

      // Step 3: Create order
      const order = await orderService.create({ orderId, items, userId })
      steps.push({ service: 'order', action: 'create', orderId })

      return order

    } catch (error) {
      // Compensate in reverse order
      await this.compensate(steps, error)
      throw error
    }
  }

  private async compensate(steps: CompletedStep[], originalError: Error) {
    // Execute compensating transactions in reverse
    for (const step of [...steps].reverse()) {
      try {
        if (step.service === 'inventory') {
          await inventoryService.release(step.orderId)
        } else if (step.service === 'payment') {
          await paymentService.refund(step.chargeId)
        } else if (step.service === 'order') {
          await orderService.cancel(step.orderId)
        }
      } catch (compensationError) {
        // Log but continue — compensating transactions can fail too
        logger.error(`Compensation failed for ${step.service}:`, compensationError)
        // Send to dead letter queue for manual intervention
        await dlqService.send({ step, originalError, compensationError })
      }
    }
  }
}
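The reversal logic is the part worth getting right, and it can be exercised in isolation. A toy sketch (invented names, synchronous steps) showing that compensations run in reverse order of completion, and that the failing step itself is never compensated:

```typescript
// Toy saga: record which steps ran and which were undone, in what order.
const log: string[] = []

type Step = { name: string; compensate: () => void }

function runSaga(steps: Step[], failAt: number): void {
  const done: Step[] = []
  try {
    steps.forEach((step, i) => {
      if (i === failAt) throw new Error(`step ${step.name} failed`)
      log.push(`do:${step.name}`)
      done.push(step)
    })
  } catch {
    // Only steps that COMPLETED get compensated, newest first
    for (const step of [...done].reverse()) step.compensate()
  }
}

runSaga(
  [
    { name: 'inventory', compensate: () => log.push('undo:inventory') },
    { name: 'payment', compensate: () => log.push('undo:payment') },
    { name: 'order', compensate: () => log.push('undo:order') },
  ],
  2,  // the "order" step fails
)

console.log(log.join(','))
// do:inventory,do:payment,undo:payment,undo:inventory
```

Note the ordering: payment is refunded before inventory is released, mirroring how the forward steps stacked up.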

Fix 3: Outbox Pattern — Reliable Event Publishing

A service cannot atomically write to its database and publish an event; one can always succeed while the other fails:

// ❌ DANGEROUS — event publish can fail after DB commit
async function createOrder(data: CreateOrderInput) {
  const order = await db.order.create(data)  // DB committed
  await eventBus.publish('order.created', order)  // Can fail!
  // If publish fails: DB says order exists, but no event was emitted
  // Downstream services never know the order was created
}

// ✅ Outbox pattern — atomically write event to DB, publish separately
async function createOrder(data: CreateOrderInput) {
  await db.transaction(async (tx) => {
    const order = await tx.order.create(data)

    // Write event to outbox table IN THE SAME TRANSACTION
    await tx.outbox.create({
      topic: 'order.created',
      payload: JSON.stringify(order),
      status: 'pending',
      createdAt: new Date(),
    })
    // If DB fails → both rollback. If DB succeeds → event guaranteed.
  })
}

// Separate outbox processor (retry-safe)
async function processOutbox() {
  const pendingEvents = await db.outbox.findMany({
    where: { status: 'pending' },
    take: 100,
    orderBy: { createdAt: 'asc' },
  })

  for (const event of pendingEvents) {
    try {
      await eventBus.publish(event.topic, JSON.parse(event.payload))
      await db.outbox.update({
        where: { id: event.id },
        data: { status: 'published', publishedAt: new Date() },
      })
    } catch (err) {
      await db.outbox.update({
        where: { id: event.id },
        data: { attempts: { increment: 1 }, lastError: err.message },
      })
    }
  }
}

// Run every 5 seconds
setInterval(processOutbox, 5000)
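One consequence worth spelling out: the processor can crash after publishing but before marking the row `published`, so the outbox gives at-least-once delivery and consumers may see duplicates. Consumers should therefore deduplicate, tying back to Fix 1. A minimal sketch, assuming each event carries a unique `eventId` (an invented field, e.g. the outbox row id):

```typescript
// Hypothetical consumer-side dedup for at-least-once outbox delivery.
// In production, `seen` would be a processed_events table or a Redis set,
// not in-process memory.
const seen = new Set<string>()

type OrderCreatedEvent = { eventId: string; orderId: string }

function handleOrderCreated(event: OrderCreatedEvent): boolean {
  if (seen.has(event.eventId)) {
    return false  // duplicate delivery — skip the side effect
  }
  seen.add(event.eventId)
  // ...actual side effect here (send email, update read model, etc.)
  return true
}

console.log(handleOrderCreated({ eventId: 'evt-1', orderId: 'o-1' }))  // true
console.log(handleOrderCreated({ eventId: 'evt-1', orderId: 'o-1' }))  // false — deduped
```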

Fix 4: Health Check Propagation

Services should report partial health, not just alive/dead:

// Detailed health check for upstream dependencies
app.get('/health', async (req, res) => {
  const checks = await Promise.allSettled([
    db.query('SELECT 1').then(() => ({ name: 'database', status: 'ok' })),
    redis.ping().then(() => ({ name: 'cache', status: 'ok' })),
    paymentService.ping().then(() => ({ name: 'payment-service', status: 'ok' })),
  ])

  const results = checks.map((check, i) => {
    const names = ['database', 'cache', 'payment-service']
    return check.status === 'fulfilled'
      ? check.value
      : { name: names[i], status: 'degraded', error: check.reason?.message }
  })

  const allHealthy = results.every(r => r.status === 'ok')
  const anyUp = results.some(r => r.status === 'ok')

  res.status(allHealthy ? 200 : anyUp ? 207 : 503).json({
    status: allHealthy ? 'healthy' : 'degraded',
    checks: results,
  })
})

Fix 5: Timeout + Fallback Strategy

Never wait forever on a downstream service:

async function getProductWithFallback(productId: string) {
  try {
    // Primary: live catalog service (3s timeout)
    const product = await Promise.race([
      catalogService.getProduct(productId),
      timeout(3000, 'Catalog service timeout'),
    ])
    return product
  } catch (err) {
    logger.warn(`Catalog service failed: ${err.message} — using cached data`)

    // Fallback 1: Redis cache
    const cached = await redis.get(`product:${productId}`)
    if (cached) return JSON.parse(cached)

    // Fallback 2: Stale database snapshot
    const stale = await db.productSnapshot.findById(productId)
    if (stale) return { ...stale, isStale: true }

    // Fallback 3: Graceful degradation
    return { id: productId, name: 'Product Unavailable', error: 'catalog_unavailable' }
  }
}

function timeout<T>(ms: number, message: string): Promise<T> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(message)), ms)
  )
}

Conclusion

Partial failure is the defining challenge of distributed systems — and accepting that it will happen is the first step to handling it well. Build for partial failure from the start: idempotency keys make retries safe, the outbox pattern guarantees event delivery, sagas provide distributed rollback, and graceful degradation ensures one failing service doesn't bring down everything else. Design every service call assuming it might partially fail — because it will.