Dealing With Silent System Failure — The Bug That's Been Running for Three Months

Introduction

Silent failures are qualitatively different from loud ones. Loud failures get paged, investigated, and fixed. Silent failures accumulate — they erode correctness quietly, often for months, until a user complaint or an audit surfaces them. Email jobs that log "success" but don't send. Background workers that swallow errors and continue processing. Integrations that retry three times and then silently drop the record. The defense against silent failures is systematic: treat every background operation as suspect until proven observable, instrument everything, and regularly audit expected outcomes against actual ones.

Types of Silent Failures

Silent failure patterns:

1. Error swallowing
try { await doWork() } catch { /* silently continue */ }
The operation failed but nothing indicates it
50,000 calls, 30% failure rate, 0 alerts

2. Job success with failed subtasks
Job reports "completed successfully"
30% of items in the batch failed
No per-item failure tracking

3. Queue processing with no dead-letter handling
Message fails processing 3 times, goes to dead-letter queue
Nobody monitors DLQ
10,000 messages accumulate over 3 months

4. Backup job that uploads 0 bytes
pg_dump failed silently
Upload command succeeded on a 0-byte file
All checks pass, backup is empty

5. Rate limiting that drops instead of queues
API client hits rate limit
Library returns null instead of throwing
Callers don't check for null
Data silently lost

6. Integration sync that skips on errors
Sync job: "if error, log and continue"
10% of records silently skipped every run
After 6 months: database diverged from source of truth
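Pattern 5 deserves a concrete illustration: a null-returning client makes every caller responsible for noticing the drop. A minimal sketch (the `assertDelivered` helper is hypothetical, not from any real library) that converts the silent null into a loud exception at the boundary:

```typescript
// Hypothetical guard: wraps a result from a client that returns null on rate
// limiting, so the drop throws instead of propagating null through callers.
function assertDelivered<T>(result: T | null, context: string): T {
  if (result === null) {
    // The silent drop becomes an exception that normal error handling can see
    throw new Error(`Operation silently dropped: ${context}`)
  }
  return result
}

// Usage sketch:
// const resp = assertDelivered(await client.send(payload), 'send order confirmation')
```

Wrapping at the boundary means callers can't forget the null check, because there is no null to check.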

Fix 1: Never Swallow Errors — Log and Alert

// ❌ Error swallowing — the root cause of most silent failures
async function processEmailQueue() {
  const messages = await queue.receive()
  for (const msg of messages) {
    try {
      await sendEmail(msg.email, msg.subject, msg.body)
      await queue.delete(msg.id)
    } catch (err) {
      // ❌ Silently continue — email dropped
    }
  }
}

// ✅ Error is observable, alertable, and retryable
async function processEmailQueue() {
  const messages = await queue.receive()

  for (const msg of messages) {
    try {
      await sendEmail(msg.email, msg.subject, msg.body)
      await queue.delete(msg.id)
      metrics.increment('email.sent')
    } catch (err) {
      logger.error({ messageId: msg.id, error: err.message }, 'Email send failed')
      metrics.increment('email.failed')

      // Don't delete — move to dead-letter or retry
      if (msg.retryCount < 3) {
        await queue.retry(msg.id, { delay: exponentialBackoff(msg.retryCount) })
      } else {
        await queue.moveToDeadLetter(msg.id, err.message)
        await alerting.warn(`Email permanently failed after 3 retries: ${msg.id}`)
      }
    }
  }
}
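The `exponentialBackoff` helper used above isn't defined in the snippet; a minimal sketch, assuming delays in milliseconds with full jitter:

```typescript
// Sketch: exponential delay (up to 1s, 2s, 4s, ...) capped at 60s, with full
// jitter so a burst of simultaneous failures doesn't retry in lockstep.
function exponentialBackoff(retryCount: number, baseMs = 1000, capMs = 60_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** retryCount)
  return Math.floor(Math.random() * ceiling)
}
```

Full jitter (a random delay between 0 and the exponential ceiling) spreads retries more evenly than a fixed exponential schedule.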

Fix 2: Dead-Letter Queues With Monitoring

// Every queue must have a dead-letter queue
// Every DLQ must have an alert when messages accumulate

# SQS DLQ configuration (Terraform)
resource "aws_sqs_queue" "email_dlq" {
  name                      = "email-dlq"
  message_retention_seconds = 1209600  # 14 days
}

resource "aws_sqs_queue" "email" {
  name = "email"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.email_dlq.arn
    maxReceiveCount     = 3  # After 3 failures, move to DLQ
  })
}

# Alert when DLQ has messages:
resource "aws_cloudwatch_metric_alarm" "email_dlq_messages" {
  alarm_name  = "email-dlq-has-messages"
  namespace   = "AWS/SQS"
  metric_name = "ApproximateNumberOfMessagesVisible"

  dimensions = {
    QueueName = aws_sqs_queue.email_dlq.name
  }

  comparison_operator = "GreaterThanThreshold"
  threshold           = 0  # Alert on first message in DLQ
  statistic           = "Maximum"
  evaluation_periods  = 1
  period              = 300

  alarm_actions = [aws_sns_topic.alerts.arn]
}
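The alarm is only half the loop; the checklist at the end of this post also expects DLQ messages to be reviewed within 24 hours. A sketch of the triage arithmetic (function names and the 24-hour window are illustrative; the age comes from SQS's `SentTimestamp` system attribute):

```typescript
// Age of a DLQ message, given SQS's SentTimestamp attribute (epoch milliseconds)
function messageAgeHours(sentTimestampMs: number, nowMs = Date.now()): number {
  return (nowMs - sentTimestampMs) / 3_600_000
}

type DlqAction = 'redrive' | 'investigate'

// Recent failures are often transient and safe to redrive once a fix ships;
// anything older than the review window has been failing persistently and
// needs a human to look at it
function triageDlqMessage(ageHours: number, reviewWindowHours = 24): DlqAction {
  return ageHours <= reviewWindowHours ? 'redrive' : 'investigate'
}
```

The point is that the DLQ is a work queue for humans, not a landfill: every message gets an explicit decision.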

Fix 3: Outcome Auditing — Verify Expected vs Actual

// Don't just trust that the job "ran" — verify it produced expected outcomes
// Run this daily as a sanity check

async function auditEmailOutcomes(): Promise<void> {
  // For the last 24 hours:
  // How many emails should have been sent (orders completed, signups, etc.)?
  // How many were actually sent according to SendGrid?

  const [ordersYesterday, emailsSent] = await Promise.all([
    db.query(`
      SELECT COUNT(*) as count
      FROM orders
      WHERE status = 'paid'
      AND paid_at > NOW() - INTERVAL '24 hours'
    `),

    sendgrid.request({
      method: 'GET',
      url: '/v3/stats',
      qs: {
        start_date: yesterday(),
        end_date: today(),
        aggregated_by: 'day',
      },
    }),
  ])

  const expectedEmails = Number(ordersYesterday.rows[0].count)  // One confirmation per order
  const actualSent = emailsSent.body[0]?.stats[0]?.metrics.delivered ?? 0

  if (expectedEmails === 0) return  // Nothing to audit today

  // Allow 5% variance for timing differences
  const variance = Math.abs(expectedEmails - actualSent) / expectedEmails

  if (variance > 0.05) {
    await alerting.warn(
      `Email audit: expected ${expectedEmails} order confirmations, SendGrid shows ${actualSent} delivered (${(variance * 100).toFixed(1)}% variance)`
    )
  }
}

// Run daily at 8 AM:
cron.schedule('0 8 * * *', auditEmailOutcomes)
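The `yesterday()` and `today()` helpers passed to SendGrid above aren't shown; a minimal sketch, assuming the stats API accepts `YYYY-MM-DD` date strings in UTC:

```typescript
// Formats a Date as YYYY-MM-DD in UTC, the shape assumed for start_date/end_date
function formatDay(d: Date): string {
  return d.toISOString().slice(0, 10)
}

function today(): string {
  return formatDay(new Date())
}

function yesterday(): string {
  return formatDay(new Date(Date.now() - 24 * 60 * 60 * 1000))
}
```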

Fix 4: Health Check Endpoints for Background Jobs

// Background jobs should expose health status that can be polled
// Don't rely on "the job runs" as evidence it's working

// job-health.ts — each job reports its health
const jobHealth: Map<string, JobHealthStatus> = new Map()

interface JobHealthStatus {
  lastRunAt: Date | null
  lastSuccessAt: Date | null
  lastFailureAt: Date | null
  lastError: string | null
  processedCount: number
  failedCount: number
}

// Jobs register their status after each run:
async function runEmailJob() {
  const start = Date.now()
  let processed = 0
  let failed = 0

  // ... job logic ...

  jobHealth.set('email-job', {
    lastRunAt: new Date(),
    lastSuccessAt: failed === 0 ? new Date() : jobHealth.get('email-job')?.lastSuccessAt ?? null,
    lastFailureAt: failed > 0 ? new Date() : jobHealth.get('email-job')?.lastFailureAt ?? null,
    lastError: null,  // populate from the caught error if the run itself throws
    processedCount: processed,
    failedCount: failed,
  })
}

// Health endpoint for monitoring systems:
app.get('/internal/jobs/health', (req, res) => {
  const jobs = Object.fromEntries(jobHealth)
  const unhealthyJobs = Object.entries(jobs).filter(([name, status]) => {
    const lastRunHoursAgo = status.lastRunAt
      ? (Date.now() - status.lastRunAt.getTime()) / 3600000
      : Infinity

    return lastRunHoursAgo > 2  // Job should run at least every 2 hours
  })

  res.json({
    healthy: unhealthyJobs.length === 0,
    jobs,
    unhealthy: unhealthyJobs.map(([name]) => name),
  })
})
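Rather than hand-rolling that bookkeeping in every job, a reusable wrapper can record health uniformly. A sketch (the `withJobHealth` name is mine; it redeclares the interface and map from above to stay self-contained):

```typescript
interface JobHealthStatus {
  lastRunAt: Date | null
  lastSuccessAt: Date | null
  lastFailureAt: Date | null
  lastError: string | null
  processedCount: number
  failedCount: number
}

const jobHealth = new Map<string, JobHealthStatus>()

// Runs a job body and records its outcome, whether it returns counts or throws
async function withJobHealth(
  name: string,
  run: () => Promise<{ processed: number; failed: number }>
): Promise<void> {
  const prev = jobHealth.get(name)
  try {
    const { processed, failed } = await run()
    jobHealth.set(name, {
      lastRunAt: new Date(),
      lastSuccessAt: failed === 0 ? new Date() : prev?.lastSuccessAt ?? null,
      lastFailureAt: failed > 0 ? new Date() : prev?.lastFailureAt ?? null,
      lastError: null,
      processedCount: processed,
      failedCount: failed,
    })
  } catch (err) {
    // The run itself threw — record the error so the health endpoint surfaces it
    jobHealth.set(name, {
      lastRunAt: new Date(),
      lastSuccessAt: prev?.lastSuccessAt ?? null,
      lastFailureAt: new Date(),
      lastError: err instanceof Error ? err.message : String(err),
      processedCount: 0,
      failedCount: 0,
    })
  }
}

// Usage sketch: await withJobHealth('email-job', emailJobBody)
```

With the wrapper, a job that crashes outright still leaves a record, which is exactly the case the ad-hoc version misses.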

Fix 5: Structured Logging That Makes Silent Failures Visible

// Every background operation should log: what was attempted, what succeeded, what failed
// This creates a queryable audit trail

async function syncRecords(sourceRecords: SourceRecord[]): Promise<SyncResult> {
  const results: SyncResult = {
    total: sourceRecords.length,
    succeeded: 0,
    failed: 0,
    skipped: 0,
    errors: [],
  }

  for (const record of sourceRecords) {
    // Count intentional skips (e.g. records failing basic validation)
    // separately from hard failures so neither vanishes from the summary
    if (!record.id) {
      results.skipped++
      continue
    }

    try {
      await upsertRecord(record)
      results.succeeded++
    } catch (err) {
      results.failed++
      results.errors.push({ recordId: record.id, error: err.message })

      logger.error({
        operation: 'sync_record',
        recordId: record.id,
        error: err.message,
      }, 'Record sync failed')
    }
  }

  // Log summary at the end — makes it easy to query in log aggregation
  logger.info({
    operation: 'sync_batch',
    total: results.total,
    succeeded: results.succeeded,
    failed: results.failed,
    skipped: results.skipped,
    successRate: results.succeeded / results.total,
  }, 'Sync batch completed')

  // Alert if failure rate is too high
  if (results.failed / results.total > 0.05) {
    await alerting.warn(`Sync batch: ${results.failed}/${results.total} records failed (${(results.failed/results.total*100).toFixed(0)}%)`)
  }

  return results
}
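A 5% threshold appears in both Fix 3's variance check and the batch alert above; a tiny shared helper (the name is hypothetical) centralizes the cutoff and guards the empty-batch case, where `failed / total` would evaluate to `NaN`:

```typescript
// Returns true when the batch failure rate exceeds the threshold.
// An empty batch has no failures by definition, so it never alerts.
function exceedsFailureThreshold(failed: number, total: number, threshold = 0.05): boolean {
  if (total === 0) return false
  return failed / total > threshold
}

// Usage sketch:
// if (exceedsFailureThreshold(results.failed, results.total)) await alerting.warn(...)
```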

Silent Failure Prevention Checklist

  • ✅ No bare catch {} blocks — every exception logged and tracked
  • ✅ Every queue has a dead-letter queue with a CloudWatch alarm
  • ✅ Outcome auditing: daily check of expected vs actual results (emails, records, etc.)
  • ✅ Background jobs expose health endpoints with last run time and error counts
  • ✅ Job summary logs: total processed, succeeded, failed per batch
  • ✅ Alert fires when failure rate in any batch exceeds 5%
  • ✅ DLQ messages reviewed and redriven or investigated within 24 hours

Conclusion

Silent failures are the reliability problems that don't show up in error rates or latency dashboards — they show up in wrong data, missing emails, and diverged systems discovered weeks later. The defense is systematic observability of background operations: log what was attempted and what failed, monitor dead-letter queues, audit expected outcomes against actuals, and alert when batch failure rates exceed thresholds. No background operation should be able to fail quietly — every failure should leave a trace, queue for retry, and eventually alert if it can't be resolved automatically.