# The Restore That Took 9 Hours — Why You Need to Know Your RTO Before the Incident

## Introduction

Recovery Time Objective (RTO) — how long your service can be down — is usually defined in a business document somewhere. What's rarely defined is whether your actual restore procedure can meet that objective. A 9-hour restore of an 800GB database isn't a failure of the backup system. It's a failure to test and optimize the restore path before the incident, while there's still time to fix it.

## Why Restores Take So Long

800GB database backup restore — default approach:

```
Step 1: Download from S3                  → 45 minutes (10 Gbps link, but S3 throttled)
Step 2: pg_restore (single-threaded)      → 6 hours    (indexes rebuilt one at a time)
Step 3: ANALYZE (vacuum/statistics)       → 45 minutes
Step 4: Verify data integrity             → 30 minutes
Step 5: Update DNS / connection strings   → 15 minutes
Step 6: Warm up caches                    → 30 minutes

Total: ~9 hours
```

Same restore — optimized:

```
Step 1: Stream from S3 while restoring    → parallel, saves 45 min
Step 2: pg_restore --jobs=8 (parallel)    → 90 minutes (not 6 hours)
Step 3: ANALYZE VERBOSE                   → 20 minutes (focused)
Step 4+5+6: Automated runbook             → 15 minutes

Total: ~2 hours, 15 minutes
```

## Fix 1: Parallel pg_restore

The single biggest improvement: restore with multiple jobs in parallel:

```bash
# ❌ Default single-threaded restore — rebuilds one index at a time
pg_restore \
  --dbname=postgres \
  --format=custom \
  backup.pgdump

# ✅ Parallel restore — uses multiple CPU cores
# (--jobs=8: eight parallel workers; match the count to your CPU cores.
#  Keep comments off continuation lines — text after a trailing backslash
#  breaks the command.)
pg_restore \
  --dbname=postgres \
  --format=custom \
  --jobs=8 \
  --no-owner \
  --no-privileges \
  backup.pgdump

# Note: --jobs requires custom or directory format, not plain SQL format.
# If your backups are plain SQL, switch to custom format:
pg_dump --format=custom --file=backup.pgdump ...
```

Parallel restore is 4-8x faster for large databases. The bottleneck shifts from index rebuilding (sequential) to I/O and CPU (parallelizable).
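
While a parallel restore runs, you can watch the workers from another psql session. These are standard PostgreSQL progress views (`pg_stat_progress_copy` needs PostgreSQL 14+, `pg_stat_progress_create_index` needs 12+):

```sql
-- Data-loading (COPY) phase — one row per active worker
SELECT relid::regclass AS table_name, bytes_processed, tuples_processed
FROM pg_stat_progress_copy;

-- Index-rebuild phase
SELECT relid::regclass AS table_name, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```

This is how you confirm mid-restore that `--jobs` is actually fanning out, instead of finding out at the end.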

## Fix 2: Stream Download and Restore Simultaneously

Don't wait for the full download before starting the restore:

```bash
# ❌ Sequential: download (45 min) THEN restore (6 hours)
aws s3 cp s3://backups/prod.pgdump /tmp/prod.pgdump
pg_restore --jobs=8 -d mydb /tmp/prod.pgdump

# ✅ Stream directly into pg_restore — restore starts immediately, no temp file.
# Caveat: parallel restore needs a seekable file, so --jobs does NOT work
# when reading from a pipe — a pure stream runs single-threaded.
aws s3 cp s3://backups/prod.pgdump - --no-progress | \
  pg_restore --format=custom -d mydb

# ✅ To keep --jobs=8, use directory-format backups: aws s3 sync downloads
# the pieces in parallel, and pg_restore can parallelize from a directory.
aws s3 sync s3://backups/prod.pgdump.d/ /fast-disk/prod.pgdump.d/
pg_restore --jobs=8 --format=directory -d mydb /fast-disk/prod.pgdump.d
```

## Fix 3: Pre-Create the Database Before Restore

```bash
#!/bin/bash
# fast-restore.sh

set -euo pipefail

DB_NAME="myapp_production"
BACKUP_S3_PATH="s3://my-backups/prod/latest.pgdump"
DB_URL="postgresql://postgres@localhost"
LOCAL_BACKUP="/mnt/fast-disk/restore.pgdump"   # fast local disk, not a pipe
PGDATA="/var/lib/postgresql/data"

echo "Starting fast restore..."

# 1. Create empty database to restore into
psql "$DB_URL" -c "DROP DATABASE IF EXISTS ${DB_NAME}_new"
psql "$DB_URL" -c "CREATE DATABASE ${DB_NAME}_new"

# 2. Tune PostgreSQL for fast bulk loading.
# Note: ALTER SYSTEM settings are cluster-wide, and wal_level can't be
# applied by a reload — it needs a restart (and requires max_wal_senders = 0).
psql "$DB_URL" <<EOF
  ALTER SYSTEM SET max_wal_size = '4GB';
  ALTER SYSTEM SET checkpoint_completion_target = 0.9;
  ALTER SYSTEM SET synchronous_commit = off;   -- don't wait for WAL flush on each commit
  ALTER SYSTEM SET wal_level = minimal;        -- less WAL overhead
EOF
pg_ctl restart -D "$PGDATA" -w

# 3. Download, then restore in parallel.
# (pg_restore --jobs can't read from a pipe, so land the file locally first.)
echo "Downloading backup..."
aws s3 cp "$BACKUP_S3_PATH" "$LOCAL_BACKUP"
echo "Restoring with 8 parallel jobs..."
pg_restore \
  --jobs=8 \
  --format=custom \
  --no-owner \
  --no-privileges \
  --dbname="${DB_URL}/${DB_NAME}_new" \
  "$LOCAL_BACKUP"

# 4. Restore normal settings (wal_level again requires a restart)
psql "$DB_URL" <<EOF
  ALTER SYSTEM SET max_wal_size = '1GB';
  ALTER SYSTEM SET synchronous_commit = on;
  ALTER SYSTEM SET wal_level = replica;
EOF
pg_ctl restart -D "$PGDATA" -w

# 5. Run ANALYZE to update query planner statistics
echo "Running ANALYZE..."
psql "$DB_URL/${DB_NAME}_new" -c "ANALYZE"

# 6. Swap: rename old → _old, new → active
# (ALTER DATABASE ... RENAME fails if clients are still connected)
psql "$DB_URL" -c "ALTER DATABASE ${DB_NAME} RENAME TO ${DB_NAME}_old"
psql "$DB_URL" -c "ALTER DATABASE ${DB_NAME}_new RENAME TO ${DB_NAME}"

echo "✅ Restore complete. Old database available as ${DB_NAME}_old for verification."
```
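
One sharp edge in step 6: `ALTER DATABASE ... RENAME` fails if any client is still connected to the old database. A way to drain lingering connections first (`pg_stat_activity` and `pg_terminate_backend` are standard PostgreSQL; this needs superuser or equivalent privileges):

```sql
-- Terminate remaining connections to the old database before the rename
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'myapp_production'
  AND pid <> pg_backend_pid();
```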

## Fix 4: Point-in-Time Recovery (PITR) — Faster Than Full Restore

For PostgreSQL, PITR + WAL archiving lets you restore to any point in time much faster than a full dump restore:

```
# postgresql.conf — enable WAL archiving
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
wal_level = replica
```

With a base backup + archived WAL, a restore is: (1) copy the base backup into place — it's a physical copy, so indexes come back as-is instead of being rebuilt — then (2) replay WAL segments up to the recovery point. For large databases this is much faster than a logical dump restore:

```bash
# 1. Restore base backup
aws s3 sync s3://my-backups/base-backup/ /var/lib/postgresql/data/

# 2. Configure recovery target (PostgreSQL 12+: recovery.conf is gone —
#    put these settings in postgresql.conf and create an empty recovery.signal)
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'aws s3 cp s3://my-wal-archive/%f %p'
recovery_target_time = '2026-03-15 02:30:00 UTC'  # just before the incident
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal

# 3. Start PostgreSQL — it replays WAL automatically
pg_ctl start
```

PITR also gives you surgical recovery: restore to exactly 1 minute before the accidental DELETE, not to last night's backup.
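
PITR only helps if archiving has been quietly succeeding all along, so verify it before you need it. `pg_stat_archiver` is a standard PostgreSQL statistics view:

```sql
-- failed_count should be 0 and last_archived_time should be recent
SELECT archived_count, failed_count, last_archived_time
FROM pg_stat_archiver;
```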

## Fix 5: Document and Automate the Restore Runbook

The worst time to figure out your restore procedure is during an incident. Document it in advance:

# Database Restore Runbook

## Pre-conditions
- [ ] Access to AWS console and S3 backup bucket
- [ ] PostgreSQL client installed on restore target
- [ ] Database credentials in 1Password under "DB Restore"

## Steps

### 1. Spin up restore instance (5 min)
- Launch RDS snapshot, OR
- Launch EC2 with PostgreSQL 15 matching production version

### 2. Download latest backup (streaming)
```bash
aws s3 ls s3://my-backups/prod/ | sort | tail -5  # find latest
export BACKUP=prod-2026-03-15-020000.pgdump
```

### 3. Run restore (90 min)
```bash
./scripts/fast-restore.sh $BACKUP
```

### 4. Verify
```bash
psql $NEW_DB_URL -f scripts/restore-verify.sql
```

### 5. Cutover
- Update DATABASE_URL in AWS Secrets Manager
- Trigger rolling restart of all app instances
- Monitor error rate for 10 minutes
## Expected Times

| Step | Duration |
|------|----------|
| Spin up instance | 5 min |
| Restore (800GB) | 90 min |
| ANALYZE | 20 min |
| Verify + cutover | 15 min |
| **Total RTO** | **~2 hours** |

## Escalation

- First 15 min: On-call engineer
- After 30 min: Incident commander
- After 1 hour: CTO notification

## Measuring Your Actual RTO

```ts
// restore-drill-timer.ts — run quarterly
// (downloadBackup, runVerificationChecks, db, and alerting are app-specific helpers)
import { exec } from 'node:child_process'
import { promisify } from 'node:util'
const execAsync = promisify(exec)

async function timedRestoreDrill() {
  const timeline: Record<string, number> = {}

  timeline.start = Date.now()

  // Step 1: Download
  await downloadBackup('/tmp/restore-drill.pgdump')
  timeline.download = Date.now()

  // Step 2: Restore
  await execAsync('pg_restore --jobs=8 -d restore_drill /tmp/restore-drill.pgdump')
  timeline.restore = Date.now()

  // Step 3: Analyze
  await execAsync('psql -d restore_drill -c "ANALYZE"')
  timeline.analyze = Date.now()

  // Step 4: Verify
  await runVerificationChecks('restore_drill')
  timeline.verify = Date.now()

  // Log results
  const totalMinutes = (timeline.verify - timeline.start) / 60000

  console.log(`Restore Drill Results:`)
  console.log(`  Download:   ${((timeline.download - timeline.start) / 60000).toFixed(0)} min`)
  console.log(`  Restore:    ${((timeline.restore - timeline.download) / 60000).toFixed(0)} min`)
  console.log(`  ANALYZE:    ${((timeline.analyze - timeline.restore) / 60000).toFixed(0)} min`)
  console.log(`  Verify:     ${((timeline.verify - timeline.analyze) / 60000).toFixed(0)} min`)
  console.log(`  TOTAL RTO:  ${totalMinutes.toFixed(0)} min`)

  // Store in database for trending
  await db.query(`
    INSERT INTO restore_drills (duration_minutes, steps, drilled_at)
    VALUES ($1, $2, NOW())
  `, [totalMinutes, JSON.stringify(timeline)])

  // Alert if RTO exceeds target
  if (totalMinutes > 120) {  // 2 hour target
    await alerting.warn(`Restore drill exceeded RTO target: ${totalMinutes.toFixed(0)} min (target: 120 min)`)
  }
}
```

## Restore Checklist

  • ✅ Use pg_restore --jobs=N to parallelize (N = CPU core count)
  • ✅ Stream from S3 into pg_restore to skip the temp file (note: a piped restore is single-threaded — use local disk or directory format to keep --jobs)
  • ✅ Tune PostgreSQL write settings during restore (synchronous_commit = off, larger max_wal_size)
  • ✅ Consider WAL archiving + PITR for faster, more precise recovery
  • ✅ Document the restore runbook — exact commands, expected times, escalation path
  • ✅ Run a timed restore drill quarterly — measure actual RTO vs target
  • ✅ Keep restore tools (pg_restore, aws CLI) installed on a ready standby instance
  • ✅ Verify the restored database before cutting over — don't find errors mid-cutover
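
The runbook above references `scripts/restore-verify.sql` without showing it. Its contents depend on your schema; a minimal sketch (`users` and `orders` are hypothetical table names) might check row counts, data freshness, and leftover invalid indexes:

```sql
-- restore-verify.sql — hypothetical checks; adapt table and column names
SELECT 'users' AS tbl, count(*) AS n FROM users;   -- critical tables are non-empty
SELECT max(created_at) AS newest FROM orders;      -- data is as fresh as the backup
SELECT indexrelid::regclass AS invalid_index       -- no invalid indexes left behind
FROM pg_index WHERE NOT indisvalid;
```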

## Conclusion

A 9-hour restore is almost always avoidable. The biggest wins are parallel restore (--jobs=8 turns 6 hours into 90 minutes), cutting the download wait (streaming, or a parallel copy to fast local disk), and WAL archiving for PITR, which lets you restore to a surgical point in time instead of last night's backup. But the most important fix is running a timed restore drill before you have an incident — so you know your actual RTO, discover the bottlenecks while there's still time to fix them, and have a written runbook the on-call engineer can follow at 2 AM without guessing. If you've never tested your restore, you don't have a recovery plan. You have a backup file and a hope.