Backup That Never Worked — The False Safety Net That Fails When You Need It Most

Introduction

Every team believes they have backups — until they need them. The backup job runs on a schedule, the cron logs show success, and everyone sleeps soundly. Then a disk fails or someone runs `DELETE FROM users` without a `WHERE` clause, and you discover the backup files are 0 bytes, the S3 bucket has a lifecycle policy that deleted them last week, or the backup credentials rotated and the job has been silently failing for months.

How Backups Fail Silently

Common failure modes:

1. Backup job exits with code 0 even on partial failure
   Cron reports "success"; the file is 0 bytes or truncated.

2. Disk full mid-backup
   The file is written but truncated at 2 GB; the job still reports success.

3. Credentials expired (S3 key rotated, DB password changed)
   The upload fails, but the cron job swallows the error.

4. Backup runs but excludes the most important table
   `pg_dump --exclude-table=large_logs` — logs excluded for size,
   but someone put critical data in that table.

5. Backup of a replica, not the primary
   The replica was 3 hours behind due to a network issue, so the
   "successful" backup is missing 3 hours of data.

6. Lifecycle policy deleted old backups
   S3 bucket: keep 30 days, delete older. The incident is discovered
   on day 45 — no backups available.

7. Backup encrypted with a lost key
   The files exist, but nobody can decrypt them.

8. Wrong format: backup works, restore doesn't
   A dump created with pg_dump 13 cannot be restored with pg_restore 12:
   incompatible format.
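The first two failure modes come straight from how shells report pipeline status: by default, a pipeline's exit code is the exit code of its *last* command. A minimal demonstration, with `false` standing in for a failing `pg_dump`:

```shell
# `false` stands in for a failing pg_dump; gzip happily compresses
# empty stdin and exits 0, so the pipeline's status is 0.
false | gzip > /tmp/backup.sql.gz
echo "exit=$?"          # exit=0 — cron sees success; the file is near-empty

# With pipefail, the producer's failure surfaces.
set -o pipefail
false | gzip > /tmp/backup.sql.gz
echo "exit=$?"          # exit=1
```

This is why `set -o pipefail` (see Fix 3 below) belongs at the top of every backup script that pipes a dump through a compressor or uploader.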

Fix 1: Verify Every Backup by Restoring It

A backup you've never restored is not a backup. You don't know it works until you've proved it:

// backup-verify.ts — run after every backup
import { exec } from 'child_process'
import { promisify } from 'util'
import * as fs from 'fs'

const execAsync = promisify(exec)

async function verifyBackup(backupFile: string): Promise<void> {
  console.log(`Verifying backup: ${backupFile}`)

  // 1. Check file exists and has non-zero size
  const stats = fs.statSync(backupFile)
  if (stats.size < 1024) {
    throw new Error(`Backup file suspiciously small: ${stats.size} bytes`)
  }
  console.log(`✅ File size: ${(stats.size / 1024 / 1024).toFixed(1)} MB`)

  // 2. Check the backup is valid (without fully restoring)
  // For pg_dump custom format:
  const { stdout } = await execAsync(`pg_restore --list "${backupFile}"`)
  const tableCount = stdout.split('\n').filter(l => l.includes('TABLE DATA')).length
  if (tableCount < 10) {
    throw new Error(`Backup has only ${tableCount} tables — expected at least 10`)
  }
  console.log(`✅ Tables in backup: ${tableCount}`)

  // 3. Restore to a test database and run spot-check queries
  const testDb = `verify_${Date.now()}`
  try {
    await execAsync(`createdb ${testDb}`)
    await execAsync(`pg_restore -d ${testDb} ${backupFile}`)

    // Spot-check: verify critical tables exist and have rows
    const checks = [
      { table: 'users', minRows: 100 },
      { table: 'orders', minRows: 50 },
      { table: 'products', minRows: 10 },
    ]

    for (const check of checks) {
      const { stdout } = await execAsync(
        `psql -d ${testDb} -c "SELECT COUNT(*) FROM ${check.table}" -t`
      )
      const count = parseInt(stdout.trim())
      if (count < check.minRows) {
        throw new Error(`Table ${check.table}: ${count} rows (expected >= ${check.minRows})`)
      }
      console.log(`${check.table}: ${count} rows`)
    }

    console.log('✅ Backup verification passed')
  } finally {
    // Always clean up test database
    await execAsync(`dropdb --if-exists ${testDb}`)
  }
}

// Run after every backup job. `alerting` is assumed to be an alerting
// client (PagerDuty, Slack, etc.) defined elsewhere.
verifyBackup('/backups/prod-2026-03-15.pgdump')
  .catch((err) => {
    console.error('❌ BACKUP VERIFICATION FAILED:', err.message)
    // Alert! This is a critical failure
    alerting.critical(`Backup verification failed: ${err.message}`)
    process.exit(1)
  })

Fix 2: Alert on Backup Failure, Not Just Cron Exit Code

// backup-monitor.ts — separate from the backup job itself
// Checks that a recent backup exists and is valid.
// `alerting` is assumed to be an alerting client defined elsewhere.

import cron from 'node-cron'
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3'

cron.schedule('0 8 * * *', async () => {
  await checkBackupHealth()
})

async function checkBackupHealth() {
  const s3 = new S3Client({ region: 'us-east-1' })

  // List recent backups
  const list = await s3.send(new ListObjectsV2Command({
    Bucket: process.env.BACKUP_BUCKET!,
    Prefix: 'backups/prod-',
  }))

  const objects = list.Contents ?? []

  // Find the most recent backup
  const mostRecent = objects
    .sort((a, b) => (b.LastModified?.getTime() ?? 0) - (a.LastModified?.getTime() ?? 0))[0]

  if (!mostRecent) {
    await alerting.critical('NO BACKUPS FOUND in S3 bucket!')
    return
  }

  const ageMs = Date.now() - (mostRecent.LastModified?.getTime() ?? 0)
  const ageHours = ageMs / (1000 * 60 * 60)

  if (ageHours > 26) {
    // More than 26 hours since last backup (should be daily)
    await alerting.critical(`Last backup is ${ageHours.toFixed(0)}h old — backup may be failing!`)
    return
  }

  const sizeMb = (mostRecent.Size ?? 0) / (1024 * 1024)
  if (sizeMb < 1) {
    await alerting.critical(`Last backup is only ${sizeMb.toFixed(2)}MB — likely corrupted or empty!`)
    return
  }

  console.log(`✅ Backup health check passed: ${mostRecent.Key}, ${sizeMb.toFixed(0)}MB, ${ageHours.toFixed(1)}h ago`)
}

Fix 3: Structured Backup Script with Error Handling

#!/bin/bash
# backup.sh — backup with proper error handling and alerting

set -euo pipefail  # Exit on error, undefined var, pipe failure

BACKUP_FILE="/tmp/backup-$(date +%Y-%m-%d-%H%M%S).pgdump"
BUCKET="s3://my-backups/prod/"
LOG_FILE="/var/log/backup.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"

log() {
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $1" | tee -a "$LOG_FILE"
}

alert_failure() {
  local message="$1"
  log "FAILURE: $message"
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\": \"🚨 *BACKUP FAILED*: $message\"}"
  exit 1
}

log "Starting backup..."

# 1. Create backup
pg_dump \
  --format=custom \
  --compress=9 \
  --no-password \
  --dbname="$DATABASE_URL" \
  --file="$BACKUP_FILE" \
  || alert_failure "pg_dump failed with exit code $?"

# 2. Verify backup file is non-empty
SIZE=$(stat -c%s "$BACKUP_FILE")
if [ "$SIZE" -lt 1048576 ]; then  # < 1MB
  alert_failure "Backup file too small: ${SIZE} bytes"
fi
log "Backup size: $((SIZE / 1024 / 1024)) MB"

# 3. Test backup is valid (list contents without restoring)
pg_restore --list "$BACKUP_FILE" > /dev/null \
  || alert_failure "pg_restore --list failed — backup may be corrupted"

# 4. Upload to S3
aws s3 cp "$BACKUP_FILE" "${BUCKET}$(basename "$BACKUP_FILE")" \
  --storage-class STANDARD_IA \
  || alert_failure "S3 upload failed"

# 5. Verify upload completed
aws s3 ls "${BUCKET}$(basename "$BACKUP_FILE")" \
  || alert_failure "S3 verification failed — file not found after upload"

log "✅ Backup completed successfully"

# 6. Clean up local file
rm "$BACKUP_FILE"

# 7. Alert on success too (so you notice when alerts stop)
curl -s -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "{\"text\": \"✅ Backup completed: $((SIZE / 1024 / 1024))MB\"}"
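A script like this only helps if cron actually runs it and its output lands somewhere inspectable. A sketch of the crontab entry that wires it up; the user, paths, and schedule are assumptions:

```shell
# /etc/cron.d/backup (sketch; user, paths, and schedule are assumptions)
# Run nightly at 02:00 UTC and capture all output, so even "silent"
# failures leave a trace in the log.
0 2 * * * postgres /opt/scripts/backup.sh >> /var/log/backup-cron.log 2>&1
```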

Fix 4: Monthly Restore Drill

// restore-drill.ts — run monthly to confirm you can actually restore.
// Assumes helpers defined elsewhere in the codebase: execAsync, cron,
// db, downloadLatestBackup, runIntegrityChecks, getRowCounts.

async function monthlyRestoreDrill() {
  console.log('Starting monthly restore drill...')

  // 1. Download most recent backup
  const backupFile = await downloadLatestBackup()

  // 2. Restore to a staging database
  const drillDb = `restore_drill_${Date.now()}`
  await execAsync(`createdb ${drillDb}`)

  try {
    const startTime = Date.now()
    await execAsync(`pg_restore -d ${drillDb} ${backupFile}`)
    const restoreTimeMs = Date.now() - startTime

    console.log(`Restore time: ${(restoreTimeMs / 1000 / 60).toFixed(1)} minutes`)

    // 3. Validate data integrity
    await runIntegrityChecks(drillDb)

    // 4. Log the results
    await db.query(`
      INSERT INTO restore_drill_log (backup_file, restore_time_seconds, row_counts, passed, drilled_at)
      VALUES ($1, $2, $3, true, NOW())
    `, [backupFile, restoreTimeMs / 1000, await getRowCounts(drillDb)])

    console.log('✅ Monthly restore drill passed')
  } finally {
    // 5. Clean up the drill database even if the restore or checks failed
    await execAsync(`dropdb --if-exists ${drillDb}`)
  }
}

// Schedule monthly
cron.schedule('0 3 1 * *', monthlyRestoreDrill)

Backup Checklist

  • ✅ Verify every backup by checking file size AND listing contents
  • ✅ Actually restore to a test DB weekly — confirm data integrity
  • ✅ Alert on backup failure, not just log it
  • ✅ Monitor that backups are recent — alert if > 26 hours since last backup
  • ✅ Store backups in a separate account/region from production
  • ✅ Test that backup credentials are still valid monthly
  • ✅ Document and test the restore procedure — don't discover it during an incident
  • ✅ Run a full restore drill quarterly — measure how long it takes
  • ✅ Keep at least 30 days of backups — some incidents are discovered weeks later

Conclusion

The only backup that matters is one you've successfully restored. Everything else is a file that might work. The fix isn't a better backup tool — it's a verification step after every backup, an alert when backups are missing or too small, and a monthly restore drill that proves you can actually get data back. Set up the monitoring before you need it: a daily check that the backup exists, is recent, is non-trivially sized, and passes a structural validity check. That 5-minute cron job might save your company.