Database Backup and Disaster Recovery 2026: Never Lose Data Again
The question is not if your database will have a problem but when. The difference between a minor incident and a business-ending disaster is preparation.
- RTO and RPO: Define Your Targets
- Automated PostgreSQL Backups
- Point-in-Time Recovery (PITR)
- Read Replicas and Failover
- Backup Verification
- Disaster Recovery Runbook
- S3 Backup Lifecycle Policy
RTO and RPO: Define Your Targets
RTO (Recovery Time Objective): How long can you be down?
RPO (Recovery Point Objective): How much data can you lose?
Tier 1 (e-commerce, banking): RTO < 1h, RPO < 5min
Tier 2 (SaaS apps): RTO < 4h, RPO < 1h
Tier 3 (blogs, internal tools): RTO < 24h, RPO < 24h
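The tier targets above translate directly into constraints on backup cadence: in the worst case you lose everything since the last backup, so the backup interval is a lower bound on achievable RPO. A minimal sketch of that sanity check (the `meets_rpo` function name is illustrative; the numbers come from the tiers above):

```shell
#!/bin/bash
# Worst case: the database dies one second before the next backup runs,
# so the backup interval bounds the RPO you can actually achieve.
meets_rpo() {
  local backup_interval_min=$1 rpo_target_min=$2
  if (( backup_interval_min <= rpo_target_min )); then
    echo "OK: ${backup_interval_min}m interval meets RPO < ${rpo_target_min}m"
  else
    echo "GAP: ${backup_interval_min}m interval cannot meet RPO < ${rpo_target_min}m"
  fi
}

meets_rpo 1440 5     # Daily pg_dump vs Tier 1: not enough -- you need WAL archiving
meets_rpo 1 5        # 60-second WAL archiving vs Tier 1: fine
meets_rpo 1440 1440  # Daily pg_dump vs Tier 3: fine
```

The practical takeaway: Tier 1 targets cannot be met with scheduled dumps alone, which is why the PITR section below matters.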
Automated PostgreSQL Backups
#!/bin/bash
# backup-postgres.sh
set -euo pipefail
DB_URL="${DATABASE_URL}"
S3_BUCKET="webcoderspeed-backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="postgres_${TIMESTAMP}.sql.gz"
echo "Starting PostgreSQL backup..."
# Create compressed backup
pg_dump "$DB_URL" | gzip > "/tmp/$BACKUP_FILE"
echo "Backup created: /tmp/$BACKUP_FILE ($(du -sh /tmp/$BACKUP_FILE | cut -f1))"
# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" "s3://$S3_BUCKET/postgres/$BACKUP_FILE" \
--storage-class STANDARD_IA # Cheaper for backups
# Verify upload
aws s3 ls "s3://$S3_BUCKET/postgres/$BACKUP_FILE"
# Clean up local file
rm "/tmp/$BACKUP_FILE"
# Delete backups older than 30 days (GNU date syntax; the S3 lifecycle
# policy below is the more robust way to handle retention)
aws s3 ls "s3://$S3_BUCKET/postgres/" | while read -r line; do
  createDate=$(echo "$line" | awk '{print $1}')
  fileName=$(echo "$line" | awk '{print $4}')
  [[ -z "$fileName" ]] && continue
  if [[ $(date -d "$createDate" +%s) -lt $(date -d "30 days ago" +%s) ]]; then
    aws s3 rm "s3://$S3_BUCKET/postgres/$fileName"
  fi
done
echo "Backup completed!"
# GitHub Actions: Automated daily backup
name: Database Backup
on:
  schedule:
    - cron: '0 0 * * *' # Daily at midnight UTC
  workflow_dispatch: # Manual trigger
jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run backup
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: bash scripts/backup-postgres.sh
      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: '#alerts'
          slack-message: '🚨 Database backup FAILED! Check GitHub Actions.'
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
Point-in-Time Recovery (PITR)
# Enable continuous archiving (WAL archiving)
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
archive_timeout = 60 # Force archive every 60 seconds
# Recover to specific point in time
# 1. Restore base backup
aws s3 sync s3://backups/base/latest/ /var/lib/postgresql/data/
# 2. Create recovery config (PostgreSQL 12+: settings go in
#    postgresql.auto.conf, plus an empty recovery.signal file)
cat > /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://backups/wal/%f %p'
recovery_target_time = '2026-03-26 14:30:00'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal
# 3. Start PostgreSQL -- it will replay WAL up to the target time
pg_ctl start
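After starting, the server stays in recovery until WAL replay reaches the target time, then promotes itself. A small polling sketch for that wait; the check command is parameterised so the real query, `psql -Atc "SELECT pg_is_in_recovery()"` (which prints `t` while replaying and `f` after promotion), can be substituted in production:

```shell
#!/bin/bash
# Poll until the server has left recovery. $1 is a command that prints
# t (still replaying WAL) or f (promoted) -- in production, pass:
#   psql -Atc "SELECT pg_is_in_recovery()"
wait_for_promotion() {
  local check_cmd=$1
  while [ "$($check_cmd)" = "t" ]; do
    echo "Still replaying WAL toward recovery_target_time..."
    sleep 5
  done
  echo "Recovery complete: server promoted"
}

# Example with a stub check standing in for psql (a promoted server prints f):
wait_for_promotion "echo f"
```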
Read Replicas and Failover
// Use the primary for writes and a replica for reads
import { Pool, PoolClient } from 'pg'

const primaryPool = new Pool({
  host: process.env.DB_PRIMARY_HOST, // Read-write
})

const replicaPool = new Pool({
  host: process.env.DB_REPLICA_HOST, // Read-only
})

class DatabaseService {
  async query(sql: string, params: unknown[]) {
    // Default: route reads to the replica
    return replicaPool.query(sql, params)
  }

  async write(sql: string, params: unknown[]) {
    // Writes always go to the primary
    return primaryPool.query(sql, params)
  }

  async transaction<T>(fn: (client: PoolClient) => Promise<T>): Promise<T> {
    // Transactions run on the primary so reads inside them are never stale
    const client = await primaryPool.connect()
    try {
      await client.query('BEGIN')
      const result = await fn(client)
      await client.query('COMMIT')
      return result
    } catch (error) {
      await client.query('ROLLBACK')
      throw error
    } finally {
      client.release()
    }
  }
}
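Before routing reads to a replica, or promoting one during failover, it is worth checking replication lag. On a replica, `pg_last_xact_replay_timestamp()` reports the last replayed transaction; the sketch below separates the query (shown in a comment, since it needs a live replica) from the threshold check, which is illustrative:

```shell
#!/bin/bash
# Fetch lag in seconds from the replica (requires connectivity):
#   psql -h "$DB_REPLICA_HOST" -Atc \
#     "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)"
# Then decide whether the replica is fresh enough to serve reads.
lag_ok() {
  local lag_seconds=$1 max_seconds=$2
  # awk handles the fractional-seconds comparison that [ ] cannot
  awk -v lag="$lag_seconds" -v max="$max_seconds" 'BEGIN { exit !(lag <= max) }'
}

if lag_ok 2.5 10; then
  echo "replica fresh: route reads to it"
else
  echo "replica lagging: fall back to primary"
fi
```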
Backup Verification
#!/bin/bash
# verify-backup.sh -- test a restore every week
set -euo pipefail
BACKUP_URL="s3://webcoderspeed-backups/postgres/latest.sql.gz"
TEST_DB="backup_test_$(date +%Y%m%d)"
echo "Testing backup restore..."
# Download and restore to temp DB
aws s3 cp "$BACKUP_URL" /tmp/backup.sql.gz
createdb "$TEST_DB"
gunzip -c /tmp/backup.sql.gz | psql "$TEST_DB"
# Verify key tables exist
TABLES=$(psql "$TEST_DB" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema='public'")
echo "Tables in backup: $TABLES"
# Verify row counts
USERS=$(psql "$TEST_DB" -t -c "SELECT count(*) FROM users")
POSTS=$(psql "$TEST_DB" -t -c "SELECT count(*) FROM posts")
echo "Users: $USERS, Posts: $POSTS"
# Cleanup
dropdb "$TEST_DB"
rm /tmp/backup.sql.gz
echo "Backup verification complete!"
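To actually run this every week, a cron entry is enough. A sketch; the file path, log path, and the `postgres` user are assumptions to adjust for your host:

```cron
# /etc/cron.d/verify-backup (assumed path) -- run the restore test every
# Sunday at 03:00 and append output to a log for later inspection
0 3 * * 0  postgres  /usr/local/bin/verify-backup.sh >> /var/log/verify-backup.log 2>&1
```

Wiring the script's exit code into the same Slack alerting used by the backup workflow is a natural next step, so a failed restore test never goes unnoticed.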
Disaster Recovery Runbook
## Database Disaster Recovery Runbook
### Incident: Database server unreachable
**Step 1: Assess (5 minutes)**
- [ ] Check database server status
- [ ] Check connection pool logs
- [ ] Determine if replica is available
**Step 2: Failover to replica (10 minutes)**
- [ ] Promote replica to primary:
`aws rds promote-read-replica --db-instance-identifier mydb-replica`
- [ ] Update DNS CNAME: db.webcoderspeed.com → replica endpoint
- [ ] Verify application connects successfully
- [ ] Notify team via Slack
**Step 3: Restore from backup (if no replica)**
- [ ] Identify latest backup in S3
- [ ] Launch new RDS instance
- [ ] Restore backup
- [ ] Update connection strings
- [ ] Verify data integrity
**Step 4: Post-incident**
- [ ] Document timeline and root cause
- [ ] Create replica for new primary
- [ ] Review monitoring alerts
- [ ] Update runbook if needed
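Step 2's DNS flip deserves its own verification, since cached answers mean the CNAME may not have propagated when you first check. In production the value comes from `dig +short CNAME db.webcoderspeed.com`; the sketch below keeps the comparison logic separate (the endpoint strings are placeholders) and normalises the trailing dot that DNS answers often carry:

```shell
#!/bin/bash
# Compare a resolved CNAME against the promoted replica's endpoint.
# In production: dns_points_to "$(dig +short CNAME db.webcoderspeed.com)" "$EXPECTED"
dns_points_to() {
  local resolved=$1 expected=$2
  # Strip any trailing dot from both sides before comparing
  [ "${resolved%.}" = "${expected%.}" ]
}

if dns_points_to "mydb-replica.abc123.rds.amazonaws.com." \
                 "mydb-replica.abc123.rds.amazonaws.com"; then
  echo "DNS flipped: application traffic now reaches the new primary"
fi
```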
S3 Backup Lifecycle Policy
{
  "Rules": [{
    "ID": "postgres-backup-tiering",
    "Status": "Enabled",
    "Filter": { "Prefix": "postgres/" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
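The policy does not apply itself; it has to be attached to the bucket. With the JSON saved as `lifecycle.json` (an assumed filename), the AWS CLI calls are as follows; both need credentials with permission on the bucket:

```shell
# Attach the lifecycle rules to the backup bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket webcoderspeed-backups \
  --lifecycle-configuration file://lifecycle.json

# Read the rules back to confirm they took effect
aws s3api get-bucket-lifecycle-configuration \
  --bucket webcoderspeed-backups
```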
A backup that has never been tested is not a backup; it is hope. Schedule quarterly disaster recovery drills: the drill reveals what the theory misses.