Backup That Never Worked — The False Safety Net That Fails When You Need It Most

Introduction

Every team believes they have backups — until they need them. The backup job runs on a schedule, the cron logs show success, and everyone sleeps soundly. Then a disk fails or someone runs `DELETE FROM users` without a `WHERE` clause, and you discover the backup files are 0 bytes, the S3 bucket has a lifecycle policy that deleted them last week, or the backup credentials rotated and the job has been silently failing for months.

How Backups Fail Silently

Common failure modes:

1. Backup job exits with code 0 even on partial failure
   Cron reports "success"; the file is 0 bytes or truncated.

2. Disk full mid-backup
   The file is written but truncated at 2 GB; the job still reports success.

3. Credentials expired (S3 key rotated, DB password changed)
   The upload fails, but the cron job swallows the error.

4. Backup runs but excludes the most important table
   `pg_dump --exclude-table=large_logs` — logs excluded for size,
   but someone put critical data in that table.

5. Backup of a replica, not the primary
   The replica was 3 hours behind due to a network issue, so the
   "successful" backup is missing 3 hours of data.

6. Lifecycle policy deleted old backups
   S3 bucket: keep 30 days, delete older. The incident is discovered
   on day 45 — no backups available.

7. Backup encrypted with a lost key
   The files exist, but nobody can decrypt them.

8. Wrong format: backup works, restore doesn't
   A dump created with pg_dump 13 cannot be restored with pg_restore 12:
   incompatible format.
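The first two failure modes come straight from how shells report pipeline status: by default, a pipeline's exit code is the exit code of its *last* command. A minimal demonstration, with `false` standing in for a failing `pg_dump`:

```shell
# `false` stands in for a failing pg_dump; gzip happily compresses
# empty stdin and exits 0, so the pipeline's status is 0.
false | gzip > /tmp/backup.sql.gz
echo "exit=$?"          # exit=0 — cron sees success; the file is near-empty

# With pipefail, the producer's failure surfaces.
set -o pipefail
false | gzip > /tmp/backup.sql.gz
echo "exit=$?"          # exit=1
```

This is why `set -o pipefail` (see Fix 3 below) belongs at the top of every backup script that pipes a dump through a compressor or uploader.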

Fix 1: Verify Every Backup by Restoring It

A backup you've never restored is not a backup. You don't know it works until you've proved it:

// backup-verify.ts — run after every backup
import { exec } from 'child_process'
import { promisify } from 'util'
import * as fs from 'fs'

const execAsync = promisify(exec)

async function verifyBackup(backupFile: string): Promise<void> {
  console.log(`Verifying backup: ${backupFile}`)

  // 1. Check file exists and has non-zero size
  const stats = fs.statSync(backupFile)
  if (stats.size < 1024) {
    throw new Error(`Backup file suspiciously small: ${stats.size} bytes`)
  }
  console.log(`✅ File size: ${(stats.size / 1024 / 1024).toFixed(1)} MB`)

  // 2. Check the backup is valid (without fully restoring)
  // For pg_dump custom format:
  const { stdout } = await execAsync(`pg_restore --list "${backupFile}"`)
  const tableCount = stdout.split('\n').filter(l => l.includes('TABLE DATA')).length
  if (tableCount < 10) {
    throw new Error(`Backup has only ${tableCount} tables — expected at least 10`)
  }
  console.log(`✅ Tables in backup: ${tableCount}`)

  // 3. Restore to a test database and run spot-check queries
  const testDb = `verify_${Date.now()}`
  try {
    await execAsync(`createdb ${testDb}`)
    await execAsync(`pg_restore -d ${testDb} ${backupFile}`)

    // Spot-check: verify critical tables exist and have rows
    const checks = [
      { table: 'users', minRows: 100 },
      { table: 'orders', minRows: 50 },
      { table: 'products', minRows: 10 },
    ]

    for (const check of checks) {
      const { stdout } = await execAsync(
        `psql -d ${testDb} -c "SELECT COUNT(*) FROM ${check.table}" -t`
      )
      const count = parseInt(stdout.trim())
      if (count < check.minRows) {
        throw new Error(`Table ${check.table}: ${count} rows (expected >= ${check.minRows})`)
      }
      console.log(`${check.table}: ${count} rows`)
    }

    console.log('✅ Backup verification passed')
  } finally {
    // Always clean up test database
    await execAsync(`dropdb --if-exists ${testDb}`)
  }
}

// Run after every backup job. `alerting` is assumed to be an alerting
// client (PagerDuty, Slack, etc.) defined elsewhere.
verifyBackup('/backups/prod-2026-03-15.pgdump')
  .catch((err) => {
    console.error('❌ BACKUP VERIFICATION FAILED:', err.message)
    // Alert! This is a critical failure
    alerting.critical(`Backup verification failed: ${err.message}`)
    process.exit(1)
  })

Fix 2: Alert on Backup Failure, Not Just Cron Exit Code

// backup-monitor.ts — separate from the backup job itself
// Checks that a recent backup exists and is valid.
// `alerting` is assumed to be an alerting client defined elsewhere.

import cron from 'node-cron'
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3'

cron.schedule('0 8 * * *', async () => {
  await checkBackupHealth()
})

async function checkBackupHealth() {
  const s3 = new S3Client({ region: 'us-east-1' })

  // List recent backups
  const list = await s3.send(new ListObjectsV2Command({
    Bucket: process.env.BACKUP_BUCKET!,
    Prefix: 'backups/prod-',
  }))

  const objects = list.Contents ?? []

  // Find the most recent backup
  const mostRecent = objects
    .sort((a, b) => (b.LastModified?.getTime() ?? 0) - (a.LastModified?.getTime() ?? 0))[0]

  if (!mostRecent) {
    await alerting.critical('NO BACKUPS FOUND in S3 bucket!')
    return
  }

  const ageMs = Date.now() - (mostRecent.LastModified?.getTime() ?? 0)
  const ageHours = ageMs / (1000 * 60 * 60)

  if (ageHours > 26) {
    // More than 26 hours since last backup (should be daily)
    await alerting.critical(`Last backup is ${ageHours.toFixed(0)}h old — backup may be failing!`)
    return
  }

  const sizeMb = (mostRecent.Size ?? 0) / (1024 * 1024)
  if (sizeMb < 1) {
    await alerting.critical(`Last backup is only ${sizeMb.toFixed(2)}MB — likely corrupted or empty!`)
    return
  }

  console.log(`✅ Backup health check passed: ${mostRecent.Key}, ${sizeMb.toFixed(0)}MB, ${ageHours.toFixed(1)}h ago`)
}

Fix 3: Structured Backup Script with Error Handling

#!/bin/bash
# backup.sh — backup with proper error handling and alerting

set -euo pipefail  # Exit on error, undefined var, pipe failure

BACKUP_FILE="/tmp/backup-$(date +%Y-%m-%d-%H%M%S).pgdump"
BUCKET="s3://my-backups/prod/"
LOG_FILE="/var/log/backup.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"

log() {
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $1" | tee -a "$LOG_FILE"
}

alert_failure() {
  local message="$1"
  log "FAILURE: $message"
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\": \"🚨 *BACKUP FAILED*: $message\"}"
  exit 1
}

log "Starting backup..."

# 1. Create backup
pg_dump \
  --format=custom \
  --compress=9 \
  --no-password \
  --dbname="$DATABASE_URL" \
  --file="$BACKUP_FILE" \
  || alert_failure "pg_dump failed with exit code $?"

# 2. Verify backup file is non-empty
SIZE=$(stat -c%s "$BACKUP_FILE")
if [ "$SIZE" -lt 1048576 ]; then  # < 1MB
  alert_failure "Backup file too small: ${SIZE} bytes"
fi
log "Backup size: $((SIZE / 1024 / 1024)) MB"

# 3. Test backup is valid (list contents without restoring)
pg_restore --list "$BACKUP_FILE" > /dev/null \
  || alert_failure "pg_restore --list failed — backup may be corrupted"

# 4. Upload to S3
aws s3 cp "$BACKUP_FILE" "${BUCKET}$(basename "$BACKUP_FILE")" \
  --storage-class STANDARD_IA \
  || alert_failure "S3 upload failed"

# 5. Verify upload completed
aws s3 ls "${BUCKET}$(basename "$BACKUP_FILE")" \
  || alert_failure "S3 verification failed — file not found after upload"

log "✅ Backup completed successfully"

# 6. Clean up local file
rm "$BACKUP_FILE"

# 7. Alert on success too (so you notice when alerts stop)
curl -s -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "{\"text\": \"✅ Backup completed: $((SIZE / 1024 / 1024))MB\"}"
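A script like this only helps if cron actually runs it and its output lands somewhere inspectable. A sketch of the crontab entry that wires it up; the user, paths, and schedule are assumptions:

```shell
# /etc/cron.d/backup (sketch; user, paths, and schedule are assumptions)
# Run nightly at 02:00 UTC and capture all output, so even "silent"
# failures leave a trace in the log.
0 2 * * * postgres /opt/scripts/backup.sh >> /var/log/backup-cron.log 2>&1
```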

Fix 4: Monthly Restore Drill

// restore-drill.ts — run monthly to confirm you can actually restore.
// Assumes helpers defined elsewhere in the codebase: execAsync, cron,
// db, downloadLatestBackup, runIntegrityChecks, getRowCounts.

async function monthlyRestoreDrill() {
  console.log('Starting monthly restore drill...')

  // 1. Download most recent backup
  const backupFile = await downloadLatestBackup()

  // 2. Restore to a staging database
  const drillDb = `restore_drill_${Date.now()}`
  await execAsync(`createdb ${drillDb}`)

  try {
    const startTime = Date.now()
    await execAsync(`pg_restore -d ${drillDb} ${backupFile}`)
    const restoreTimeMs = Date.now() - startTime

    console.log(`Restore time: ${(restoreTimeMs / 1000 / 60).toFixed(1)} minutes`)

    // 3. Validate data integrity
    await runIntegrityChecks(drillDb)

    // 4. Log the results
    await db.query(`
      INSERT INTO restore_drill_log (backup_file, restore_time_seconds, row_counts, passed, drilled_at)
      VALUES ($1, $2, $3, true, NOW())
    `, [backupFile, restoreTimeMs / 1000, await getRowCounts(drillDb)])

    console.log('✅ Monthly restore drill passed')
  } finally {
    // 5. Clean up the drill database even if the restore or checks failed
    await execAsync(`dropdb --if-exists ${drillDb}`)
  }
}

// Schedule monthly
cron.schedule('0 3 1 * *', monthlyRestoreDrill)

Backup Checklist

  • ✅ Verify every backup by checking file size AND listing contents
  • ✅ Actually restore to a test DB weekly — confirm data integrity
  • ✅ Alert on backup failure, not just log it
  • ✅ Monitor that backups are recent — alert if > 26 hours since last backup
  • ✅ Store backups in a separate account/region from production
  • ✅ Test that backup credentials are still valid monthly
  • ✅ Document and test the restore procedure — don't discover it during an incident
  • ✅ Run a full restore drill quarterly — measure how long it takes
  • ✅ Keep at least 30 days of backups — some incidents are discovered weeks later

Conclusion

The only backup that matters is one you've successfully restored. Everything else is a file that might work. The fix isn't a better backup tool — it's a verification step after every backup, an alert when backups are missing or too small, and a monthly restore drill that proves you can actually get data back. Set up the monitoring before you need it: a daily check that the backup exists, is recent, is non-trivially sized, and passes a structural validity check. That 5-minute cron job might save your company.