Config Drift Across Environments — When Prod Behaves Differently Than Staging

Introduction

Config drift happens when environment-specific configuration diverges gradually over time — through manual changes, undocumented tweaks, and "temporary fixes" that become permanent. The result is a staging environment that doesn't behave like production, making it useless for verifying that a deploy will work.

Common Sources of Config Drift
Fix 1: Config as Code — Everything Checked In
Fix 2: Environment Parity Check in CI
Fix 3: Secret Rotation Checklist
Fix 4: Infrastructure as Code for All Environments
Fix 5: Change Log for Production Config Changes
Config Drift Checklist
Conclusion

Common Sources of Config Drift

1. Someone SSH'd into prod and changed an env var to fix an incident
   → Never documented, never replicated to staging

2. "Temporary" production tweak became permanent
   → DB_POOL_MAX was 10, changed to 50 during a traffic spike
   → Staging still has 10

3. Different versions of third-party service configs
   → Prod: Stripe webhook timeout 30s | Staging: 10s
   → Feature behaves differently in each environment

4. Secrets rotated in prod but not staging
   → Staging API key expired → staging is broken but nobody notices

5. Feature flags managed per-environment with no audit trail
   → New feature ON in staging, forgotten, never turned ON in prod

6. Infrastructure differences never documented
   → Prod: 16GB RAM | Staging: 4GB RAM
   → Memory leak hidden by larger prod headroom

Fix 1: Config as Code — Everything Checked In

// config/environments/production.ts
export const productionConfig = {
  database: {
    poolMin: 5,
    poolMax: 50,
    connectionTimeout: 30_000,
    idleTimeout: 600_000,
  },
  redis: {
    maxRetriesPerRequest: 3,
    connectTimeout: 5_000,
  },
  api: {
    timeout: 30_000,
    retries: 3,
  },
  cache: {
    ttl: 300,  // 5 minutes
  },
} as const

// config/environments/staging.ts: mirrors production, with documented differences
import { productionConfig } from './production'

export const stagingConfig = {
  ...productionConfig,  // Start from production config
  // Intentional differences documented with reasons:
  database: {
    ...productionConfig.database,
    poolMax: 20,  // staging has less capacity — intentional
  },
} as const

// If staging differs from prod without documentation, CI fails
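
One way to keep every staging difference documented is to make the reason a required part of the override itself. The sketch below is an illustration under assumptions, not part of the configs above: `Override`, `ConfigTree`, `applyOverrides`, and the dotted-path format are hypothetical names, and the config is assumed to be plain JSON-compatible data.

```typescript
// Hypothetical helper: every override must carry a reason, so an
// undocumented difference cannot be expressed at all.
type Override = { path: string; value: unknown; reason: string }
type ConfigTree = { [key: string]: unknown }

function applyOverrides(base: ConfigTree, overrides: Override[]): ConfigTree {
  // Copy via JSON so the base (production) config is never mutated.
  // This assumes the config holds only JSON-compatible values.
  const result: ConfigTree = JSON.parse(JSON.stringify(base))

  for (const { path, value, reason } of overrides) {
    if (reason.trim() === '') {
      throw new Error(`Override for ${path} has no documented reason`)
    }
    const keys = path.split('.')
    const last = keys.pop() as string

    // Walk down to the parent object of the key being overridden.
    let node: ConfigTree = result
    for (const key of keys) {
      const child = node[key]
      if (typeof child !== 'object' || child === null) {
        throw new Error(`Override path ${path} does not exist in the base config`)
      }
      node = child as ConfigTree
    }
    // Reject paths that would add a brand-new key instead of overriding one.
    if (!(last in node)) {
      throw new Error(`Override path ${path} does not exist in the base config`)
    }
    node[last] = value
  }
  return result
}

// Example: the single staging difference from the config above.
const exampleProduction = { database: { poolMin: 5, poolMax: 50 }, cache: { ttl: 300 } }
const exampleOverrides: Override[] = [
  { path: 'database.poolMax', value: 20, reason: 'staging has less capacity' },
]
const exampleStaging = applyOverrides(exampleProduction, exampleOverrides)

// The allow-list for a parity check can be derived from the same data.
const exampleAllowedPaths = exampleOverrides.map(o => o.path)
```

With this shape, the list of permitted differences and the reasons behind them come from one place, so they cannot fall out of sync with each other.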

Fix 2: Environment Parity Check in CI

// ci/check-config-parity.ts
import { productionConfig } from '../config/environments/production'
import { stagingConfig } from '../config/environments/staging'

// Dotted paths where staging is allowed to differ from production
const ALLOWED_DIFFERENCES = ['database.poolMax', 'redis.maxRetriesPerRequest']

function checkConfigParity() {
  // deepDiff is assumed to be a project helper that returns one
  // { path, prodValue, stagingValue } entry per differing leaf value
  const diffs = deepDiff(productionConfig, stagingConfig)

  const unexpectedDiffs = diffs.filter(diff => !ALLOWED_DIFFERENCES.includes(diff.path))

  if (unexpectedDiffs.length > 0) {
    console.error('Config parity check failed: unexpected differences:')
    unexpectedDiffs.forEach(diff => {
      console.error(`  ${diff.path}: prod=${diff.prodValue} staging=${diff.stagingValue}`)
    })
    // A non-zero exit code fails the CI job
    process.exit(1)
  }
}

// Run the check when CI executes this script
checkConfigParity()
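
The parity check calls `deepDiff` without defining it. A minimal sketch of such a helper follows, assuming configs are plain nested objects and that arrays are treated as single values. The names `ConfigDiff` and `isPlainObject` are hypothetical; an existing diff library could be used instead.

```typescript
type ConfigDiff = { path: string; prodValue: unknown; stagingValue: unknown }

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Walks both configs and returns one entry per differing leaf value,
// with the location written as a dotted path such as 'database.poolMax'.
function deepDiff(prod: unknown, staging: unknown, prefix = ''): ConfigDiff[] {
  if (isPlainObject(prod) && isPlainObject(staging)) {
    // Union of keys, so a key present on only one side is still reported.
    const keys = Array.from(new Set(Object.keys(prod).concat(Object.keys(staging))))
    const diffs: ConfigDiff[] = []
    for (const key of keys) {
      const path = prefix ? `${prefix}.${key}` : key
      diffs.push(...deepDiff(prod[key], staging[key], path))
    }
    return diffs
  }
  // Leaves and arrays are compared by their JSON form. A key missing on
  // one side appears as undefined for that side.
  const same = JSON.stringify(prod) === JSON.stringify(staging)
  return same ? [] : [{ path: prefix, prodValue: prod, stagingValue: staging }]
}
```

The returned entries carry the `path`, `prodValue`, and `stagingValue` fields that `checkConfigParity` reads, so the allow-list comparison works on dotted paths.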

Fix 3: Secret Rotation Checklist

// Detect secrets that are invalid, expired, or revoked by exercising each one.
// Assumes stripe, sgMail, db, logger, and alerting are clients initialized elsewhere in the app.
async function checkSecretHealth() {
  const secrets = [
    { name: 'STRIPE_API_KEY', test: () => stripe.balance.retrieve() },
    // Note: this check sends a real email on every run
    { name: 'SENDGRID_API_KEY', test: () => sgMail.send({ to: 'test@example.com', from: 'test@example.com', subject: 'health check', text: 'test' }) },
    { name: 'DATABASE_URL', test: () => db.query('SELECT 1') },
  ]

  for (const secret of secrets) {
    try {
      await secret.test()
      logger.info({ secret: secret.name }, 'Secret health check passed')
    } catch (err) {
      // err is `unknown` under strict TypeScript, so narrow it before reading .message
      const message = err instanceof Error ? err.message : String(err)
      logger.error({ secret: secret.name, error: message }, 'Secret health check FAILED')
      alerting.critical(`Secret ${secret.name} is invalid or expired!`)
    }
  }
}

// Run on deploy and daily
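
The health check above only catches secrets that already fail. To warn before a secret expires, expiry dates can be recorded at rotation time and checked separately. The sketch below assumes such dates are tracked somewhere; `SecretRecord`, `findExpiringSecrets`, and the 30-day default are hypothetical choices, not something a provider API is known to supply.

```typescript
// Hypothetical record kept whenever a secret is rotated.
type SecretRecord = { name: string; expiresAt: Date }

const MS_PER_DAY = 24 * 60 * 60 * 1000

// Returns secrets that expire within `warnDays` of `now`, soonest first.
// Already-expired secrets are included with a negative daysLeft.
function findExpiringSecrets(
  records: SecretRecord[],
  now: Date,
  warnDays = 30,
): { name: string; daysLeft: number }[] {
  return records
    .map(r => ({
      name: r.name,
      daysLeft: Math.floor((r.expiresAt.getTime() - now.getTime()) / MS_PER_DAY),
    }))
    .filter(r => r.daysLeft <= warnDays)
    .sort((a, b) => a.daysLeft - b.daysLeft)
}
```

Running this in the same daily job as the health check would surface a rotation that is due in any environment before the key actually stops working.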

Fix 4: Infrastructure as Code for All Environments

# terraform/environments/staging/main.tf — mirrors production structure
module "app" {
  source = "../../modules/app"

  environment     = "staging"
  instance_type   = "t3.large"  # intentionally smaller than prod (t3.xlarge)
  db_instance     = "db.t3.medium"
  min_instances   = 1
  max_instances   = 2

  # All config values explicitly set — no defaults to drift
  db_pool_max     = 20
  api_timeout_ms  = 30000
  cache_ttl_sec   = 300
}

Fix 5: Change Log for Production Config Changes

# Production Config Change Log

## 2026-03-10
- Changed DB_POOL_MAX: 10 → 50 (reason: traffic spike during campaign)
- TODO: Update staging to match (ticket: ENG-1234)
- Rotated STRIPE_SECRET_KEY (expires 2027-03-10)

## 2026-02-22
- Enabled FEATURE_NEW_CHECKOUT: false → true (rolled out to 100%)
- TODO: Remove flag from codebase (ticket: ENG-1200)
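
If the change log is also kept as structured data, the open TODO items can be checked automatically so that a "update staging to match" follow-up does not linger. This is a sketch under assumptions: the `ConfigChange` shape, the `staleFollowUps` name, and the 14-day threshold are all hypothetical.

```typescript
// Hypothetical structured form of one change log entry.
type ConfigChange = {
  date: string // ISO date, e.g. '2026-03-10'
  key: string
  from: string
  to: string
  reason: string
  followUpTicket?: string // present only while a follow-up is still open
}

// Returns changes whose follow-up has been open for more than maxAgeDays.
function staleFollowUps(
  changes: ConfigChange[],
  today: string,
  maxAgeDays = 14,
): ConfigChange[] {
  const todayMs = Date.parse(today)
  return changes.filter(change => {
    if (!change.followUpTicket) return false
    const ageDays = (todayMs - Date.parse(change.date)) / 86_400_000
    return ageDays > maxAgeDays
  })
}
```

A CI job or scheduled task could print the result and fail, or post a reminder, when the list is not empty.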

Config Drift Checklist

✅ All config is in version control — no undocumented prod-only settings
✅ Staging config is derived from production config with explicit, documented overrides
✅ CI check fails if staging/prod configs diverge unexpectedly
✅ Production config changes are logged with reason and a PR
✅ Secrets are health-checked on deploy and daily
✅ Infrastructure is managed as code — no manual console changes
✅ "Config change" is a first-class operation with runbook and rollback plan

Conclusion

Config drift is a slow poison. Each individual deviation seems harmless, but they compound until staging and production are effectively different systems. The fix is treating configuration as code: every setting lives in version control, differences between environments are explicit and documented, and production config changes go through the same review process as code changes. "It works on staging" should mean something.