Backup That Never Worked — The False Safety Net That Fails When You Need It Most

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Every team believes they have backups — until they need them. The backup job runs on a schedule, the cron logs show success, and everyone sleeps soundly. Then a disk fails or someone runs `DELETE FROM users` without a `WHERE` clause, and you discover the backup files are 0 bytes, the S3 bucket has a lifecycle policy that deleted them last week, or the backup credentials rotated and the job has been silently failing for months.
- How Backups Fail Silently
- Fix 1: Verify Every Backup by Restoring It
- Fix 2: Alert on Backup Failure, Not Just Cron Exit Code
- Fix 3: Structured Backup Script with Error Handling
- Fix 4: Monthly Restore Drill
- Backup Checklist
- Conclusion
How Backups Fail Silently
Common failure modes:
1. Backup job exits with code 0 even on partial failure
→ Cron reports "success", file is 0 bytes or truncated
2. Disk full mid-backup
→ File written, truncated at 2GB, job reports success
3. Credentials expired (S3 key rotated, DB password changed)
→ Upload fails, but cron job swallows the error
4. Backup runs but excludes the most important table
→ `pg_dump --exclude-table=large_logs` — logs excluded for size,
but someone put critical data in that table
5. Backup of replica, not primary
→ Replica was 3 hours behind due to network issue
→ "Successful" backup is missing 3 hours of data
6. Lifecycle policy deleted old backups
→ S3 bucket: keep 30 days, delete older
→ Incident discovered on day 45 — no backups available
7. Backup encrypted with lost key
→ Files exist, cannot decrypt
8. Wrong format: backup works, restore doesn't
→ Dump created with pg_dump 13, trying to restore with pg_restore 12
→ Incompatible format
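Failure mode 1 is worth seeing concretely. In a typical cron pipeline like `pg_dump | gzip > backup.sql.gz`, the shell reports the exit status of the last command in the pipe, so a failing dump can still look like success. A minimal sketch, using `false` as a stand-in for a failing `pg_dump`:

```bash
# Without pipefail, a pipeline's exit status is the LAST command's.
# `false` stands in for a failing pg_dump; gzip happily compresses the
# empty stream, so the pipeline "succeeds" and cron logs exit code 0.
false | gzip > /tmp/backup.sql.gz
echo "without pipefail: exit=$?"   # exit=0 (the failure is masked)

# With pipefail, the pipeline fails if ANY stage fails.
set -o pipefail
code=0
false | gzip > /tmp/backup.sql.gz || code=$?
echo "with pipefail: exit=$code"   # exit=1 (the failure surfaces)
```

This is why the backup script in Fix 3 starts with `set -euo pipefail`: without it, cron only ever sees the exit status of the final stage.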
Fix 1: Verify Every Backup by Restoring It
A backup you've never restored is not a backup. You don't know it works until you've proved it:
```ts
// backup-verify.ts — run after every backup
import { exec } from 'child_process'
import { promisify } from 'util'
import * as fs from 'fs'
// `alerting` is your paging integration (PagerDuty, Opsgenie, Slack, ...)
import { alerting } from './alerting'

const execAsync = promisify(exec)

async function verifyBackup(backupFile: string): Promise<void> {
  console.log(`Verifying backup: ${backupFile}`)

  // 1. Check file exists and has non-zero size
  const stats = fs.statSync(backupFile)
  if (stats.size < 1024) {
    throw new Error(`Backup file suspiciously small: ${stats.size} bytes`)
  }
  console.log(`✅ File size: ${(stats.size / 1024 / 1024).toFixed(1)} MB`)

  // 2. Check the backup is valid (without fully restoring)
  // For pg_dump custom format:
  const { stdout } = await execAsync(`pg_restore --list ${backupFile}`)
  const tableCount = stdout.split('\n').filter((l) => l.includes('TABLE DATA')).length
  if (tableCount < 10) {
    throw new Error(`Backup has only ${tableCount} tables — expected at least 10`)
  }
  console.log(`✅ Tables in backup: ${tableCount}`)

  // 3. Restore to a test database and run spot-check queries
  const testDb = `verify_${Date.now()}`
  try {
    await execAsync(`createdb ${testDb}`)
    await execAsync(`pg_restore -d ${testDb} ${backupFile}`)

    // Spot-check: verify critical tables exist and have rows
    const checks = [
      { table: 'users', minRows: 100 },
      { table: 'orders', minRows: 50 },
      { table: 'products', minRows: 10 },
    ]
    for (const check of checks) {
      const { stdout: countOut } = await execAsync(
        `psql -d ${testDb} -c "SELECT COUNT(*) FROM ${check.table}" -t`
      )
      const count = parseInt(countOut.trim(), 10)
      if (count < check.minRows) {
        throw new Error(`Table ${check.table}: ${count} rows (expected >= ${check.minRows})`)
      }
      console.log(`✅ ${check.table}: ${count} rows`)
    }
    console.log('✅ Backup verification passed')
  } finally {
    // Always clean up the test database, even if a check failed
    await execAsync(`dropdb --if-exists ${testDb}`)
  }
}

// Run after every backup job
verifyBackup('/backups/prod-2026-03-15.pgdump').catch(async (err) => {
  console.error('❌ BACKUP VERIFICATION FAILED:', err.message)
  // Alert! This is a critical failure
  await alerting.critical(`Backup verification failed: ${err.message}`)
  process.exit(1)
})
```
Fix 2: Alert on Backup Failure, Not Just Cron Exit Code
```ts
// backup-monitor.ts — separate from the backup job itself
// Checks that a recent backup exists and is valid
import cron from 'node-cron'
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3'
// `alerting` is your paging integration (PagerDuty, Opsgenie, Slack, ...)
import { alerting } from './alerting'

cron.schedule('0 8 * * *', async () => {
  await checkBackupHealth()
})

async function checkBackupHealth() {
  const s3 = new S3Client({ region: 'us-east-1' })

  // List recent backups
  const list = await s3.send(
    new ListObjectsV2Command({
      Bucket: process.env.BACKUP_BUCKET!,
      Prefix: 'backups/prod-',
    })
  )
  const objects = list.Contents ?? []

  // Find the most recent backup
  const mostRecent = objects.sort(
    (a, b) => (b.LastModified?.getTime() ?? 0) - (a.LastModified?.getTime() ?? 0)
  )[0]

  if (!mostRecent) {
    await alerting.critical('NO BACKUPS FOUND in S3 bucket!')
    return
  }

  const ageMs = Date.now() - (mostRecent.LastModified?.getTime() ?? 0)
  const ageHours = ageMs / (1000 * 60 * 60)
  if (ageHours > 26) {
    // More than 26 hours since the last backup (should be daily)
    await alerting.critical(`Last backup is ${ageHours.toFixed(0)}h old — backup may be failing!`)
    return
  }

  const sizeMb = (mostRecent.Size ?? 0) / (1024 * 1024)
  if (sizeMb < 1) {
    await alerting.critical(`Last backup is only ${sizeMb.toFixed(2)}MB — likely corrupted or empty!`)
    return
  }

  console.log(
    `✅ Backup health check passed: ${mostRecent.Key}, ${sizeMb.toFixed(0)}MB, ${ageHours.toFixed(1)}h ago`
  )
}
```
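If backups land on a local disk or NFS mount before (or instead of) S3, the same freshness check can be done with `find`. A sketch, with the directory and age window passed in as parameters (the paths in the usage comment are placeholders; `-print -quit` assumes GNU find):

```bash
# check_fresh DIR MAX_AGE_MIN — succeed only if DIR contains a *.pgdump
# file modified within the last MAX_AGE_MIN minutes.
check_fresh() {
  local dir="$1" max_age_min="$2"
  local recent
  # -mmin -N matches files modified less than N minutes ago;
  # -print -quit (GNU find) stops at the first match.
  recent=$(find "$dir" -name '*.pgdump' -mmin "-$max_age_min" -print -quit)
  [ -n "$recent" ]
}

# Cron usage (hypothetical path and alert command): page if no backup
# has appeared in the last 26 hours (1560 minutes).
# check_fresh /backups 1560 || alert_critical "No fresh backup in /backups"
```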
Fix 3: Structured Backup Script with Error Handling
```bash
#!/bin/bash
# backup.sh — backup with proper error handling and alerting
set -euo pipefail # Exit on error, undefined var, pipe failure

BACKUP_FILE="/tmp/backup-$(date +%Y-%m-%d-%H%M%S).pgdump"
BUCKET="s3://my-backups/prod/"
LOG_FILE="/var/log/backup.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"

log() {
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $1" | tee -a "$LOG_FILE"
}

alert_failure() {
  local message="$1"
  log "FAILURE: $message"
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\": \"🚨 *BACKUP FAILED*: $message\"}"
  exit 1
}

log "Starting backup..."

# 1. Create backup
pg_dump \
  --format=custom \
  --compress=9 \
  --no-password \
  --dbname="$DATABASE_URL" \
  --file="$BACKUP_FILE" \
  || alert_failure "pg_dump failed with exit code $?"

# 2. Verify backup file is non-empty
SIZE=$(stat -c%s "$BACKUP_FILE")
if [ "$SIZE" -lt 1048576 ]; then # < 1MB
  alert_failure "Backup file too small: ${SIZE} bytes"
fi
log "Backup size: $((SIZE / 1024 / 1024)) MB"

# 3. Test backup is valid (list contents without restoring)
pg_restore --list "$BACKUP_FILE" > /dev/null \
  || alert_failure "pg_restore --list failed — backup may be corrupted"

# 4. Upload to S3
aws s3 cp "$BACKUP_FILE" "${BUCKET}$(basename "$BACKUP_FILE")" \
  --storage-class STANDARD_IA \
  || alert_failure "S3 upload failed"

# 5. Verify upload completed
aws s3 ls "${BUCKET}$(basename "$BACKUP_FILE")" \
  || alert_failure "S3 verification failed — file not found after upload"

log "✅ Backup completed successfully"

# 6. Clean up local file
rm "$BACKUP_FILE"

# 7. Alert on success too (so you notice when alerts stop)
curl -s -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "{\"text\": \"✅ Backup completed: $((SIZE / 1024 / 1024))MB\"}"
```
Fix 4: Monthly Restore Drill
```ts
// restore-drill.ts — run monthly to confirm you can actually restore
import { exec } from 'child_process'
import { promisify } from 'util'
import cron from 'node-cron'
// Assumed app modules: your DB client plus drill helpers
import { db } from './db'
import { downloadLatestBackup, runIntegrityChecks, getRowCounts } from './drill-helpers'

const execAsync = promisify(exec)

async function monthlyRestoreDrill() {
  console.log('Starting monthly restore drill...')

  // 1. Download the most recent backup
  const backupFile = await downloadLatestBackup()

  // 2. Restore to a staging database
  const drillDb = `restore_drill_${Date.now()}`
  await execAsync(`createdb ${drillDb}`)
  try {
    const startTime = Date.now()
    await execAsync(`pg_restore -d ${drillDb} ${backupFile}`)
    const restoreTimeMs = Date.now() - startTime
    console.log(`Restore time: ${(restoreTimeMs / 1000 / 60).toFixed(1)} minutes`)

    // 3. Validate data integrity
    await runIntegrityChecks(drillDb)

    // 4. Log the results
    await db.query(
      `
      INSERT INTO restore_drill_log (backup_file, restore_time_seconds, row_counts, passed, drilled_at)
      VALUES ($1, $2, $3, true, NOW())
      `,
      [backupFile, restoreTimeMs / 1000, await getRowCounts(drillDb)]
    )

    console.log('✅ Monthly restore drill passed')
  } finally {
    // 5. Clean up the drill database, even if the drill failed
    await execAsync(`dropdb --if-exists ${drillDb}`)
  }
}

// Schedule monthly (03:00 on the 1st)
cron.schedule('0 3 1 * *', monthlyRestoreDrill)
```
Backup Checklist
- ✅ Verify every backup by checking file size AND listing contents
- ✅ Actually restore to a test DB weekly — confirm data integrity
- ✅ Alert on backup failure, not just log it
- ✅ Monitor that backups are recent — alert if > 26 hours since last backup
- ✅ Store backups in a separate account/region from production
- ✅ Test that backup credentials are still valid monthly
- ✅ Document and test the restore procedure — don't discover it during an incident
- ✅ Run a full restore drill monthly — measure how long it takes
- ✅ Keep at least 30 days of backups — some incidents are discovered weeks later
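The credentials item on the list is easy to script. A sketch under stated assumptions: the AWS CLI is installed and configured with the same credentials the backup job uses, and the bucket name in the usage comment is a placeholder.

```bash
# check_backup_creds BUCKET — monthly sanity check that the backup job's
# credentials still work end to end.
check_backup_creds() {
  local bucket="$1"
  # Fails fast if the access key was rotated or the role revoked.
  aws sts get-caller-identity > /dev/null || return 1
  # Confirms the bucket is still visible to this identity...
  aws s3api head-bucket --bucket "$bucket" || return 1
  # ...and still writable (a lifecycle rule can safely prune this prefix).
  echo "ok" | aws s3 cp - "s3://${bucket}/credential-check/$(date +%Y-%m).txt"
}

# Cron usage (placeholder bucket and alert command):
# check_backup_creds my-backups || alert_critical "Backup credentials broken"
```

Running this from the same host and IAM identity as the backup job is the point: a check from your laptop proves nothing about the credentials cron is using.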
Conclusion
The only backup that matters is one you've successfully restored. Everything else is a file that might work. The fix isn't a better backup tool — it's a verification step after every backup, an alert when backups are missing or too small, and a monthly restore drill that proves you can actually get data back. Set up the monitoring before you need it: a daily check that the backup exists, is recent, is non-trivially sized, and passes a structural validity check. That 5-minute cron job might save your company.