Published on

Migration Locking the Table — The ALTER TABLE That Took Down Production

Authors

Introduction

Database migrations are the riskiest part of most deployments. A migration that adds a NOT NULL column with a DEFAULT rewrites every row. A migration that adds a regular index locks the table for writes. On a small table, this is milliseconds. On a 40-million row table, it's minutes — and every write to that table is blocked the entire time.

The Problem

ALTER TABLE orders ADD COLUMN priority VARCHAR(20) DEFAULT 'normal' NOT NULL;

Production table: 40 million rows
What PostgreSQL does: rewrites every single row to add the new column
Duration: ~8 minutes
Lock held during rewrite: ACCESS EXCLUSIVE (blocks ALL reads AND writes)

Result: Your app can't read or write orders for 8 minutes

Which ALTER TABLE Operations Are Dangerous

-- ❌ DANGEROUS — rewrites table (PostgreSQL < 11)
ALTER TABLE orders ADD COLUMN priority VARCHAR DEFAULT 'normal' NOT NULL;
-- PostgreSQL must update every row to set the default

-- ✅ Safe in PostgreSQL 11+ — no table rewrite
ALTER TABLE orders ADD COLUMN priority VARCHAR DEFAULT 'normal' NOT NULL;
-- PostgreSQL 11+: stores default in catalog, no row rewrite

-- ❌ DANGEROUS — any change to column type rewrites table
ALTER TABLE users ALTER COLUMN age TYPE BIGINT;

-- ❌ DANGEROUS — adding NOT NULL to existing column (even in PG 11+)
ALTER TABLE orders ALTER COLUMN user_id SET NOT NULL;

-- ❌ DANGEROUS — regular index creation (not concurrent)
CREATE INDEX ON orders (user_id);

-- ❌ DANGEROUS — adding constraints that need validation
ALTER TABLE orders ADD CONSTRAINT check_status CHECK (status IN ('pending', 'paid', 'cancelled'));

Safe Pattern: Expand/Contract Migrations

The "expand/contract" (or "parallel change") pattern lets you evolve schema without locking:

Step 1: Add column as nullable (no default, no constraint)

-- Deploy 1: Safe — just adds a null column, no rewrite
ALTER TABLE orders ADD COLUMN priority VARCHAR(20);
-- Instant, no lock, no row rewrite

Step 2: Backfill in batches (background job, not migration)

// Backfill 1000 rows at a time — no lock held between batches
async function backfillPriority() {
  let lastId = '00000000-0000-0000-0000-000000000000'

  while (true) {
    const result = await db.query(`
      UPDATE orders
      SET priority = 'normal'
      WHERE id > $1
        AND priority IS NULL
      ORDER BY id
      LIMIT 1000
      RETURNING id
    `, [lastId])

    if (result.rows.length === 0) break

    lastId = result.rows[result.rows.length - 1].id
    console.log(`Backfilled up to ${lastId}`)
    await sleep(100)  // brief pause to reduce database load
  }
}

Step 3: Add NOT NULL constraint without validation

-- Deploy 2: Add constraint without checking existing rows
-- (Safe because backfill ensures all rows are filled)
ALTER TABLE orders
  ALTER COLUMN priority SET DEFAULT 'normal',
  ALTER COLUMN priority SET NOT NULL;

-- Or use NOT VALID to skip row-level validation scan:
ALTER TABLE orders ADD CONSTRAINT orders_priority_not_null
  CHECK (priority IS NOT NULL) NOT VALID;

-- Later, validate in a separate migration (takes a SHARE UPDATE EXCLUSIVE lock — allows reads/writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_priority_not_null;

Safe Index Creation

-- ❌ Blocks writes for the entire index build (minutes on large table)
CREATE INDEX ON orders (user_id);

-- ✅ CONCURRENTLY — no write lock, builds in background
CREATE INDEX CONCURRENTLY ON orders (user_id);
-- Takes 2-3x longer but doesn't block anything
-- Can't run inside a transaction block
// In a migration file (knex example)
exports.up = async (knex) => {
  // Can't use knex.schema helpers for concurrent index — use raw SQL
  await knex.raw(`
    CREATE INDEX CONCURRENTLY IF NOT EXISTS
    orders_user_id_idx ON orders (user_id)
  `)
}

// Note: knex transactions wrap migrations by default
// For CONCURRENT index, disable the transaction wrapper:
exports.config = { transaction: false }

Safely Adding Foreign Key Constraints

-- ❌ Validates every row while holding lock
ALTER TABLE orders ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users(id);

-- ✅ Step 1: Add constraint without validating existing rows
ALTER TABLE orders ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users(id)
  NOT VALID;

-- ✅ Step 2: Validate separately (SHARE UPDATE EXCLUSIVE lock — allows writes)
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;

Lock Timeout as a Safety Net

Set a lock timeout so that if your migration can't get the lock quickly, it fails fast rather than queueing and blocking all subsequent queries:

-- In your migration script — fail fast if table is busy
SET lock_timeout = '5s';

-- If it can't get the lock within 5 seconds, raises error:
-- ERROR: canceling statement due to lock timeout
-- Much better than silently queuing and blocking for minutes

ALTER TABLE orders ADD COLUMN ...;
// In Node.js migration framework
const client = await pool.connect()
try {
  await client.query("SET lock_timeout = '5s'")
  await client.query("SET statement_timeout = '60s'")
  await client.query('ALTER TABLE orders ADD COLUMN ...')
} finally {
  client.release()
}

Detecting Long-Running Migrations

-- Monitor active locks and waiting queries
SELECT
  pid,
  now() - query_start AS duration,
  state,
  wait_event_type,
  wait_event,
  left(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- Find what's blocking what
SELECT
  blocked.pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

Migration Checklist

  • ✅ Use ALTER TABLE ... ADD COLUMN without NOT NULL on PG < 11
  • ✅ Backfill new columns in batches via background jobs, not migrations
  • ✅ Use NOT VALID when adding constraints, then VALIDATE separately
  • ✅ Use CREATE INDEX CONCURRENTLY — always
  • ✅ Set lock_timeout to fail fast rather than queue indefinitely
  • ✅ Test migrations on a production-sized data copy first
  • ✅ Deploy schema migrations separately from code migrations (expand, then contract)
  • ✅ Monitor pg_stat_activity during migration deploys

Conclusion

Schema changes on large tables require thinking in phases: expand first (add nullable columns, non-enforced constraints), backfill in the background, then contract (add NOT NULL, validate constraints). Never run CREATE INDEX without CONCURRENTLY. Always set lock_timeout so a long-queued migration doesn't block your entire app silently. The safest migrations are the ones that can be run during business hours without anyone noticing — and that's achievable with the expand/contract pattern.