No Rollback Strategy — The Deploy That Can't Be Undone

Introduction

Rollback sounds simple until you actually need it. Revert the code, redeploy the old version — done. Except the database migration already ran and the old version doesn't understand the new schema. Or the message queue has new event formats the old consumer doesn't recognize. Or the S3 bucket has files in a new structure the old code doesn't know how to read. Every deploy that touches persistence makes rollback a data problem, not just a code problem — and that problem needs to be solved before you push.

Why Rollbacks Fail

Common rollback failures:

1. Database migration ran before code deploy
Old code tries to read new schema: column doesn't exist
New schema has NOT NULL constraint old code can't satisfy

2. Irreversible migration
DROP COLUMN: data is gone
RENAME COLUMN: old code uses old name, fails with "column not found"
Change column type: values truncated, old code can't write

3. Message format changed
Old consumer can't deserialize new event format
Queue has 10,000 messages in new format, old code crashes on each

4. External contract changed
Deployed new API response shape to client
Rolled back the server: clients now receive a response shape they no longer expect

5. File/storage structure changed
New code writes files to /uploads/2026/03/file.jpg
Old code expects /uploads/file.jpg
Rolled back, new files are orphaned
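Failures 3 and 4 above share a fix: make consumers tolerant of both formats before any producer ships the new one. A minimal sketch, with hypothetical event shapes (the `version` field, `payment` nesting, and USD default are illustrative assumptions, not a real schema):

```typescript
// Hypothetical shapes: v1 used a flat `amount`; v2 nests it under
// `payment` with an explicit currency.
interface OrderEventV1 { version?: 1; orderId: string; amount: number }
interface OrderEventV2 {
  version: 2
  orderId: string
  payment: { amount: number; currency: string }
}
type OrderEvent = OrderEventV1 | OrderEventV2

interface NormalizedOrder { orderId: string; amount: number; currency: string }

// Normalize either format, so old and new producers can coexist in the
// queue — and rollback never strands 10,000 unreadable messages.
function normalizeOrderEvent(raw: string): NormalizedOrder {
  const event = JSON.parse(raw) as OrderEvent
  if (event.version === 2) {
    return {
      orderId: event.orderId,
      amount: event.payment.amount,
      currency: event.payment.currency,
    }
  }
  // v1 (or unversioned) events: assume the old implicit default currency
  return { orderId: event.orderId, amount: event.amount, currency: 'USD' }
}
```

Deploy the tolerant consumer first; only then let producers emit v2.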

Fix 1: Expand/Contract Migrations (Never Break Rollback)

-- The pattern: make the DB compatible with BOTH old and new code
-- before deploying new code.

-- ❌ BAD: Rename column in one migration
ALTER TABLE users RENAME COLUMN username TO display_name;
-- Old code breaks immediately. No safe rollback.

-- ✅ GOOD: Expand/Contract (3-phase migration)

-- Phase 1: EXPAND — add new column, keep old one
-- Deploy this migration alone, before code change
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
UPDATE users SET display_name = username;

-- Now BOTH old code (uses username) and new code (uses display_name) work.
-- Deploy new code here. Old code still runs fine during rollout — but keep
-- the two columns in sync (dual writes or a trigger) until the contract
-- phase, so a rollback never reads stale data.

-- Phase 2: CONTRACT — remove old column (in a future deploy, after old code is gone)
-- Only after you're confident rollback won't be needed
ALTER TABLE users DROP COLUMN username;
// TypeScript: write new code to handle both column names during transition
async function getUserDisplayName(userId: string): Promise<string> {
  const user = await db.query(
    'SELECT display_name, username FROM users WHERE id = $1',
    [userId]
  )
  // Handle both: new rows have display_name, migrated rows may use username
  return user.rows[0].display_name ?? user.rows[0].username
}
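During the expand window, writes matter as much as reads: if new code only writes `display_name`, a rollback leaves `username` stale. One way to dual-write is a single UPDATE that sets both columns — sketched here as a pure query builder (`buildDisplayNameUpdate` is a hypothetical helper, not part of any library):

```typescript
interface DualWriteQuery { text: string; values: unknown[] }

// Old code reads `username`, new code reads `display_name` — so during
// the transition, one value is written to BOTH columns.
function buildDisplayNameUpdate(
  userId: string,
  displayName: string
): DualWriteQuery {
  return {
    text: 'UPDATE users SET display_name = $1, username = $1 WHERE id = $2',
    values: [displayName, userId],
  }
}
```

After the contract phase drops `username`, this helper shrinks back to a single-column UPDATE.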

Fix 2: Versioned Migrations With Down Scripts

// Every migration must have an up AND a down
// db/migrations/20260315_add_user_tier.ts

export async function up(db: Knex): Promise<void> {
  await db.schema.table('users', (table) => {
    table.string('tier').defaultTo('free').notNullable()
  })
  await db.raw("UPDATE users SET tier = 'free' WHERE tier IS NULL")
}

export async function down(db: Knex): Promise<void> {
  // Must be safe to run — not just "drop the column"
  await db.schema.table('users', (table) => {
    table.dropColumn('tier')
  })
}
# Rollback one migration
yarn migrate:rollback

# Rollback to specific version
yarn migrate:rollback --to 20260314_create_orders

# Test rollback before production:
# 1. Run migration on staging
# 2. Run down migration
# 3. Verify app still works
# 4. Only then deploy to production
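The "every migration has a down" rule is easy to enforce in CI. A sketch (the `Migration` shape mirrors the Knex-style files above; the helper name is illustrative):

```typescript
// CI-time guard: refuse to merge a migration that lacks a down().
interface Migration {
  name: string
  up: (db: unknown) => Promise<void>
  down?: (db: unknown) => Promise<void>
}

// Returns the names of migrations with no down script, so the build
// can fail loudly instead of discovering this during an incident.
function findIrreversibleMigrations(migrations: Migration[]): string[] {
  return migrations
    .filter((m) => typeof m.down !== 'function')
    .map((m) => m.name)
}
```

Wire this into the same pipeline step that runs the staging rollback test, and an irreversible migration can never reach production unnoticed.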

Fix 3: Blue/Green Deployment for Instant Rollback

# Blue/Green: maintain two identical environments
# Traffic switches atomically between them

# terraform/blue-green.tf
resource "aws_lb_listener_rule" "app" {
  listener_arn = aws_lb_listener.https.arn

  action {
    type             = "forward"
    target_group_arn = var.active_target_group  # blue or green
  }
}

# Deployment process:
# 1. Deploy new version to GREEN (currently inactive)
# 2. Run smoke tests against GREEN
# 3. Switch load balancer to GREEN (takes ~30 seconds)
# 4. BLUE is still running and intact
# 5. If GREEN has issues: switch back to BLUE in 30 seconds
# 6. After confidence period: terminate BLUE instances
#!/bin/bash
# deploy-blue-green.sh

# Which colour is live? (Assumes the deploy pipeline records the active
# colour in an SSM parameter — `describe-target-groups` does not return tags.)
ACTIVE=$(aws ssm get-parameter \
  --name /myapp/active-color \
  --query 'Parameter.Value' \
  --output text)

if [ "$ACTIVE" == "blue" ]; then
  DEPLOY_TO="green"
  KEEP_ACTIVE="blue"
else
  DEPLOY_TO="blue"
  KEEP_ACTIVE="green"
fi

echo "Deploying to $DEPLOY_TO (keeping $KEEP_ACTIVE hot for rollback)"

# Deploy to inactive environment
aws ecs update-service \
  --cluster myapp \
  --service "myapp-${DEPLOY_TO}" \
  --task-definition "myapp:${NEW_TASK_DEF}"

# Wait for healthy
aws ecs wait services-stable \
  --cluster myapp \
  --services "myapp-${DEPLOY_TO}"

# Run smoke tests
./scripts/smoke-tests.sh "${DEPLOY_TO_URL}"

# Switch traffic
aws elbv2 modify-rule \
  --rule-arn "$LISTENER_RULE_ARN" \
  --actions "Type=forward,TargetGroupArn=${DEPLOY_TO_TARGET_GROUP_ARN}"

echo "✅ Traffic switched to $DEPLOY_TO. $KEEP_ACTIVE ready for instant rollback."
echo "Run: ./rollback.sh to switch back to $KEEP_ACTIVE"
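The one piece of logic the script above must never get wrong is the colour swap, and it is small enough to unit-test. A sketch in TypeScript (function name is illustrative):

```typescript
type Color = 'blue' | 'green'

// The blue/green invariant: always deploy to the inactive environment
// and keep the active one hot for instant rollback.
function planDeploy(active: Color): { deployTo: Color; keepHot: Color } {
  const deployTo: Color = active === 'blue' ? 'green' : 'blue'
  return { deployTo, keepHot: active }
}
```

A five-line test of this function catches the classic copy-paste bug where both branches deploy to the same colour.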

Fix 4: Feature Flags as Logical Rollback

// Feature flags give you rollback without redeployment
// When new code breaks in production: flip flag, not deploy

import { getFlag } from './feature-flags'

async function processPayment(order: Order): Promise<PaymentResult> {
  const useNewPaymentFlow = await getFlag('new-payment-flow', {
    userId: order.userId,
    rolloutPercentage: 10,  // Start at 10%, increase if stable
  })

  if (useNewPaymentFlow) {
    return processPaymentV2(order)  // New implementation
  }

  return processPaymentV1(order)  // Proven implementation
}

// If new-payment-flow causes issues:
// 1. Set flag rolloutPercentage to 0 in dashboard
// 2. All traffic immediately falls back to V1
// 3. No deployment, no migration, no downtime
// 4. Debug V2 while V1 handles production
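A percentage rollout only works as a rollback tool if it is deterministic per user — otherwise a user flips between V1 and V2 on every request. A sketch of how a flag service might bucket users (a simple rolling hash; the real `getFlag` implementation is assumed, not shown):

```typescript
// Hash the userId into a stable 0–99 bucket. The same user always lands
// in the same bucket, so their experience is consistent as the
// percentage ramps up — and consistent again when it drops to 0.
function bucketFor(userId: string): number {
  let hash = 0
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0 // unsigned 32-bit rolling hash
  }
  return hash % 100
}

function isEnabled(userId: string, rolloutPercentage: number): boolean {
  return bucketFor(userId) < rolloutPercentage
}
```

Setting `rolloutPercentage` to 0 disables the flag for every bucket — that is the "logical rollback" in one comparison.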

Fix 5: Automated Rollback on Error Rate Spike

// deploy-monitor.ts — watch error rate after deploy, auto-rollback if needed
// (sleep, fetchMetrics, and alerting are app-level helpers assumed in scope)
async function monitorDeploy(deployId: string, rollbackFn: () => Promise<void>) {
  const WINDOW_MINUTES = 10
  const ERROR_RATE_THRESHOLD = 0.05  // 5%
  const CHECK_INTERVAL_MS = 30_000

  console.log(`Monitoring deploy ${deployId} for ${WINDOW_MINUTES} minutes...`)

  const startTime = Date.now()

  while (Date.now() - startTime < WINDOW_MINUTES * 60 * 1000) {
    await sleep(CHECK_INTERVAL_MS)

    const metrics = await fetchMetrics({
      window: '2m',
      deployId,
    })

    // Guard against a zero-traffic window dividing by zero
    const errorRate = metrics.requests > 0 ? metrics.errors / metrics.requests : 0

    if (errorRate > ERROR_RATE_THRESHOLD) {
      console.error(`🚨 Error rate ${(errorRate * 100).toFixed(1)}% exceeds threshold`)
      console.error(`Initiating automatic rollback...`)

      await rollbackFn()
      await alerting.critical(`Deploy ${deployId} auto-rolled-back: error rate ${(errorRate * 100).toFixed(1)}%`)
      return
    }

    console.log(`✅ Error rate ${(errorRate * 100).toFixed(2)}% — OK`)
  }

  console.log(`✅ Deploy ${deployId} stable after ${WINDOW_MINUTES} minutes`)
}

Rollback Readiness Checklist

  • ✅ Every migration has a tested down() script
  • ✅ Migrations use expand/contract — never rename or drop columns in a single step
  • ✅ New code can handle both old and new schema/format during transition window
  • ✅ Blue/green or canary deployment in place — old version stays alive
  • ✅ Feature flags wrap risky changes — flip flag without redeploying
  • ✅ Deploy monitor watches error rate and auto-rolls back on threshold breach
  • ✅ Rollback procedure is documented and tested quarterly
  • ✅ Everyone on the team knows the rollback command, not just the deploy owner

Conclusion

A deploy without a rollback strategy is a one-way door. The way to keep the door two-way is to make rollback a design requirement — expand/contract migrations so the database is always compatible with both versions, keep the previous deployment alive for a confidence period, wrap risky changes in feature flags that can be flipped without a deploy, and automate monitoring that triggers rollback when error rates spike. The 30 minutes you spend planning rollback before each deploy will save hours of emergency debugging after one.