Zero-Downtime Deployments — Rolling Updates, Blue/Green, and Health Check Patterns

Introduction

Every deployment is a risk. A single pod restart can drop traffic for a few milliseconds; coordinating 100 pod restarts badly can cascade into a full outage.

Zero-downtime deployment means: no dropped connections, no request failures during rollout, no human intervention to monitor the changeover. It's achieved through rolling updates with proper health checks, graceful shutdown, and careful orchestration.

This guide walks through production-grade zero-downtime deployment patterns.

Rolling Updates in Kubernetes

Kubernetes rolling updates gradually replace old pods with new ones, maintaining service availability.

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max extra pods during rollout (1 = 1 extra)
      maxUnavailable: 0  # Never terminate a pod before replacement ready

  template:
    metadata:
      labels:
        app: api-server
        version: '2.1.0'

    spec:
      # Graceful shutdown: 60 seconds to finish requests
      terminationGracePeriodSeconds: 60

      containers:
        - name: api
          image: api-server:2.1.0
          ports:
            - containerPort: 3000
              name: http

          # Readiness check: is pod ready to receive traffic?
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2

          # Liveness check: is pod alive? (restart if not)
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

          # Startup check: allow extra time for slow startup
          startupProbe:
            httpGet:
              path: /health/startup
              port: http
            failureThreshold: 30
            periodSeconds: 5

          # Graceful shutdown hook
          lifecycle:
            preStop:
              exec:
                command: ['/bin/sh', '-c', 'sleep 15']

Graceful Shutdown and SIGTERM Handling

When Kubernetes sends SIGTERM, the process must stop accepting new connections and drain existing ones.

// lib/gracefulShutdown.ts
import http from 'http';
import net from 'net';

export class GracefulShutdownManager {
  private server: http.Server | null = null;
  private activeConnections = new Set<net.Socket>();
  private isShuttingDown = false;
  private readonly shutdownTimeout = 60000; // 60 seconds

  initialize(server: http.Server) {
    this.server = server;

    // Track open sockets so we can count in-flight connections
    server.on('connection', (conn) => {
      this.activeConnections.add(conn);
      conn.on('close', () => {
        this.activeConnections.delete(conn);
      });
    });

    // Handle SIGTERM (Kubernetes termination signal)
    process.on('SIGTERM', () => this.shutdown('SIGTERM'));
    process.on('SIGINT', () => this.shutdown('SIGINT'));
  }

  async shutdown(signal: string) {
    if (this.isShuttingDown) return;

    this.isShuttingDown = true;
    console.log(`Received ${signal}, starting graceful shutdown...`);

    // Step 1: Stop accepting new connections
    this.server?.close(() => {
      console.log('HTTP server closed, no longer accepting connections');
    });

    // Step 2: Wait for existing connections to close
    const drainStart = Date.now();
    let ticks = 0;
    const drainInterval = setInterval(() => {
      if (this.activeConnections.size === 0) {
        clearInterval(drainInterval);
        console.log(`All connections drained after ${Date.now() - drainStart}ms`);
        this.drainResources().finally(() => process.exit(0));
        return;
      }

      // Log progress roughly every 5 seconds (every 10th 500ms tick)
      if (++ticks % 10 === 0) {
        console.log(
          `Draining: ${this.activeConnections.size} active connections...`
        );
      }
    }, 500);

    // Step 3: Force shutdown if drain timeout exceeded
    setTimeout(() => {
      const remaining = this.activeConnections.size;
      if (remaining > 0) {
        console.warn(
          `Timeout: ${remaining} connections still open, forcing shutdown`
        );
        process.exit(1);
      }
    }, this.shutdownTimeout);
  }

  // Drain database connections, cache, etc.
  // `db` and `redis` are the application's own client singletons
  // (e.g. a Prisma client and an ioredis instance).
  async drainResources() {
    console.log('Draining database connections...');
    await db.$disconnect();

    console.log('Closing cache connections...');
    await redis.quit();
  }
}

// Usage in Next.js (custom server)
import { createServer } from 'http';
import { parse } from 'url';
import next from 'next';
import { GracefulShutdownManager } from './lib/gracefulShutdown';

const dev = process.env.NODE_ENV !== 'production';
const app = next({ dev });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = createServer(async (req, res) => {
    try {
      const parsedUrl = parse(req.url!, true);
      await handle(req, res, parsedUrl);
    } catch (err) {
      res.statusCode = 500;
      res.end('Internal server error');
    }
  });

  const shutdownManager = new GracefulShutdownManager();
  shutdownManager.initialize(server);

  const port = parseInt(process.env.PORT || '3000', 10);
  server.listen(port, () => {
    console.log(`Server running on port ${port}`);
  });
});

Health Check Endpoints

Three distinct health checks serve different purposes in Kubernetes.

// pages/api/health/startup.ts - indicates the app has finished starting
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  // Check: database is reachable, cache is warm, config is loaded
  const checks = {
    database: { ok: true, latency: 2 },
    cache: { ok: true, latency: 1 },
    dependencies: { ok: true },
  };

  const allHealthy = Object.values(checks).every((c) => c.ok);

  if (!allHealthy) {
    return res.status(503).json({ status: 'unhealthy', checks });
  }

  res.json({ status: 'started', checks });
}

// pages/api/health/ready.ts - indicates pod is ready for traffic
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  // Check: quick verification that pod can serve requests.
  // This runs frequently (every 5 seconds), so keep it fast.
  // `pendingRequests`, `maxQueueSize`, and `db` are the application's
  // own queue counters and database client.
  const checks = {
    memory: { ok: process.memoryUsage().heapUsed < 400 * 1024 * 1024 }, // < 400MB
    requestQueue: { ok: pendingRequests < maxQueueSize },
    database: { ok: db.isConnected() }, // Cached connectivity check, not a live ping
  };

  const allHealthy = Object.values(checks).every((c) => c.ok);

  if (!allHealthy) {
    return res.status(503).json({ status: 'not-ready', checks });
  }

  res.json({ status: 'ready', checks });
}
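
The `database` check above is annotated as cached; the article doesn't show that helper, so here is one hypothetical implementation (`cachedDbPing` and the 10-second TTL are assumptions): run a real probe at most once per TTL and reuse the last result, keeping the readiness endpoint cheap even at a 5-second probe interval.

```typescript
// Hypothetical cached database ping for the readiness check: probe at
// most once per TTL, otherwise return the last known result.
let lastPingAt = 0;
let lastResult = true;
const PING_TTL_MS = 10_000;

export async function cachedDbPing(
  ping: () => Promise<void> // e.g. () => prisma.$queryRaw`SELECT 1`
): Promise<boolean> {
  const now = Date.now();
  if (now - lastPingAt < PING_TTL_MS) return lastResult; // Serve cached result
  lastPingAt = now;
  try {
    await ping();
    lastResult = true;
  } catch {
    lastResult = false;
  }
  return lastResult;
}
```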

// pages/api/health/live.ts - indicates pod is alive
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  // Check: the pod itself is not stuck or deadlocked
  // Restart pod if this fails
  const checks = {
    eventLoop: { ok: !isEventLoopBlocked() }, // Check for blocking code
    memory: { ok: process.memoryUsage().heapUsed < 512 * 1024 * 1024 }, // < 512MB
    processPid: { ok: process.pid > 0 },
  };

  const allHealthy = Object.values(checks).every((c) => c.ok);

  if (!allHealthy) {
    return res.status(503).json({ status: 'not-live', checks });
  }

  res.json({ status: 'alive', checks });
}

// Helper to detect event loop blocking. A synchronous check cannot observe
// the loop it runs on (busy-waiting would itself block the loop), so keep a
// background timer and measure how late each tick fires: if a 500ms
// interval arrives much later than scheduled, something blocked the loop.
let lastTick = Date.now();
let eventLoopLag = 0;

setInterval(() => {
  const now = Date.now();
  eventLoopLag = now - lastTick - 500; // How late did this tick fire?
  lastTick = now;
}, 500).unref(); // unref() so the probe doesn't keep the process alive

function isEventLoopBlocked(): boolean {
  return eventLoopLag > 100; // Loop was recently blocked for >100ms
}

PreStop Hook for Connection Draining

The preStop hook runs before SIGTERM, allowing graceful request completion.

# k8s-deployment.yaml (lifecycle section)
containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          # Wait 15 seconds, giving connections time to drain
          # Paired with terminationGracePeriodSeconds: 60
          command: ['/bin/sh', '-c', 'sleep 15']

# Total shutdown window:
# 1. preStop hook executes (15s) — connection draining period
# 2. SIGTERM sent → application shuts down gracefully
# 3. SIGKILL at terminationGracePeriodSeconds (60s); the grace period
#    includes the preStop time, so the app has ~45s after SIGTERM

Advanced preStop with health checks:

containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              # Flag shutdown so the readiness probe fails and the pod
              # stops receiving new requests (the readiness handler must
              # check for this file)
              touch /tmp/shutdown

              # Wait for in-flight requests to complete (max 45 seconds);
              # requires curl and jq in the container image
              timeout 45s /bin/sh -c '
                while [ "$(curl -s http://localhost:3000/health/requests | jq .pending)" -gt 0 ]; do
                  sleep 0.5
                done
              '

              # Allow any remaining requests 5 seconds to finish
              sleep 5
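
The script above polls a `/health/requests` endpoint for a `pending` count that the article never defines. One hypothetical way to back it is a module-level in-flight counter (the names here are assumptions):

```typescript
// lib/pendingRequests.ts — hypothetical in-flight request counter behind
// the /health/requests endpoint polled by the preStop script
let pending = 0;

// Wrap each request's work so the counter tracks in-flight requests
export function trackRequest<T>(fn: () => Promise<T>): Promise<T> {
  pending++;
  return fn().finally(() => {
    pending--;
  });
}

export function pendingRequests(): number {
  return pending;
}

// The endpoint itself then just reports the count:
//   // pages/api/health/requests.ts
//   export default (req, res) => res.json({ pending: pendingRequests() });
```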

Blue/Green Deployments with Weighted DNS

For critical services, use blue/green: two complete copies of the environment with traffic switched between them. Pure blue/green flips 100% of traffic at once; the variant below shifts Route 53 weights gradually, canary-style, so a bad release surfaces on a small slice of traffic first.

// lib/blueGreenDeploy.ts
import aws from 'aws-sdk';

const route53 = new aws.Route53();

interface BlueGreenConfig {
  domainName: string;
  blueEnvironment: string; // Current production
  greenEnvironment: string; // New version
  trafficWeightBlue: number; // 90 (90% to blue, 10% to green)
  trafficWeightGreen: number; // 10
}

// deployToEnvironment, runSmokeTests, getEnvironmentUrl, monitorHealth,
// decommissionEnvironment, and getLoadBalancerDns are application-specific
// helpers; newVersion identifies the build being released.
export async function canaryDeployment(
  config: BlueGreenConfig
): Promise<void> {
  // Phase 1: Deploy to green environment
  console.log('Deploying new version to green environment...');
  await deployToEnvironment(config.greenEnvironment, newVersion);

  // Phase 2: Run smoke tests against green
  console.log('Running smoke tests against green...');
  const smokeTestsPassed = await runSmokeTests(
    getEnvironmentUrl(config.greenEnvironment)
  );

  if (!smokeTestsPassed) {
    throw new Error('Smoke tests failed, rolling back');
  }

  // Phase 3: Gradually shift traffic to green
  const trafficShifts = [
    { blue: 95, green: 5, duration: 60 }, // 1 minute
    { blue: 80, green: 20, duration: 300 }, // 5 minutes
    { blue: 50, green: 50, duration: 600 }, // 10 minutes
    { blue: 0, green: 100, duration: 60 }, // Final: 100% to green
  ];

  for (const shift of trafficShifts) {
    console.log(
      `Shifting traffic: ${shift.blue}% blue, ${shift.green}% green`
    );

    await updateRoute53Weights(config.domainName, {
      blue: config.blueEnvironment,
      green: config.greenEnvironment,
      blueWeight: shift.blue,
      greenWeight: shift.green,
    });

    // Monitor metrics during shift
    const healthy = await monitorHealth(
      config.greenEnvironment,
      shift.duration
    );

    if (!healthy) {
      console.error('Green environment unhealthy, rolling back');
      await updateRoute53Weights(config.domainName, {
        blue: config.blueEnvironment,
        green: config.greenEnvironment,
        blueWeight: 100,
        greenWeight: 0,
      });
      throw new Error('Deployment rolled back due to health check failure');
    }
  }

  // Phase 4: Clean up blue environment
  console.log('Deployment successful, cleaning up blue environment');
  await decommissionEnvironment(config.blueEnvironment);
}

async function updateRoute53Weights(
  domainName: string,
  weights: {
    blue: string;
    green: string;
    blueWeight: number;
    greenWeight: number;
  }
) {
  const params = {
    ChangeBatch: {
      Changes: [
        {
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: domainName,
            Type: 'A',
            SetIdentifier: 'blue',
            Weight: weights.blueWeight,
            AliasTarget: {
              HostedZoneId: 'Z123456',
              DNSName: getLoadBalancerDns(weights.blue),
              EvaluateTargetHealth: true,
            },
          },
        },
        {
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: domainName,
            Type: 'A',
            SetIdentifier: 'green',
            Weight: weights.greenWeight,
            AliasTarget: {
              HostedZoneId: 'Z123456',
              DNSName: getLoadBalancerDns(weights.green),
              EvaluateTargetHealth: true,
            },
          },
        },
      ],
    },
    HostedZoneId: 'Z123456',
  };

  await route53.changeResourceRecordSets(params).promise();
}
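
`canaryDeployment` depends on a `monitorHealth` helper the article doesn't define. A plausible sketch, assuming it is handed the environment's base URL (e.g. via `getEnvironmentUrl`) and polls the readiness endpoint for the duration of the traffic-shift step:

```typescript
// Hypothetical health monitor for a traffic-shift step: poll the target
// environment's readiness endpoint until the step duration elapses, and
// bail out early if too many probes fail in a row.
export async function monitorHealth(
  environmentUrl: string,
  durationSeconds: number,
  intervalMs = 5000,
  maxFailures = 3
): Promise<boolean> {
  const deadline = Date.now() + durationSeconds * 1000;
  let failures = 0;

  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${environmentUrl}/health/ready`, {
        signal: AbortSignal.timeout(3000),
      });
      failures = res.ok ? 0 : failures + 1; // Reset the streak on success
    } catch {
      failures++; // Network error or timeout counts as a failure
    }
    if (failures >= maxFailures) return false; // Trigger rollback
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return true; // Environment stayed healthy for the whole step
}
```

A real deployment would usually watch error rates and latency from a metrics backend (e.g. CloudWatch) rather than a single endpoint, but the shape of the loop is the same.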

Deployment Smoke Test Automation

Automated verification that the new version is production-ready.

// lib/smokeTests.ts
export async function runSmokeTests(
  baseUrl: string,
  timeout = 30000
): Promise<boolean> {
  // fetch() has no `timeout` option; enforce one with an AbortSignal
  const fetchOpts = (extra: RequestInit = {}): RequestInit => ({
    signal: AbortSignal.timeout(timeout),
    ...extra,
  });

  const tests: Array<{
    name: string;
    fn: () => Promise<void>;
  }> = [
    {
      name: 'Health check',
      fn: async () => {
        const res = await fetch(`${baseUrl}/health/ready`, fetchOpts());
        if (!res.ok) throw new Error(`Health check failed: ${res.status}`);
      },
    },
    {
      name: 'API connectivity',
      fn: async () => {
        const res = await fetch(
          `${baseUrl}/api/users`,
          fetchOpts({ headers: { Authorization: 'Bearer test-token' } })
        );
        if (!res.ok) throw new Error(`API call failed: ${res.status}`);
      },
    },
    {
      name: 'Critical database queries',
      fn: async () => {
        const res = await fetch(
          `${baseUrl}/api/stats`,
          fetchOpts({ headers: { Authorization: 'Bearer test-token' } })
        );
        if (!res.ok) throw new Error(`Database query failed: ${res.status}`);
      },
    },
    {
      name: 'Third-party integrations',
      fn: async () => {
        const res = await fetch(
          `${baseUrl}/api/integrations/status`,
          fetchOpts({ headers: { Authorization: 'Bearer test-token' } })
        );
        if (!res.ok)
          throw new Error(`Integration check failed: ${res.status}`);
      },
    },
  ];

  let passed = 0;
  for (const test of tests) {
    try {
      await test.fn();
      console.log(`✓ ${test.name}`);
      passed++;
    } catch (err) {
      console.error(`✗ ${test.name}: ${(err as Error).message}`);
    }
  }

  return passed === tests.length;
}

Conclusion

Zero-downtime deployments require orchestration at every layer: Kubernetes rolling updates with proper health checks, graceful SIGTERM handling, preStop hooks for connection draining, and canary/blue-green deployment for critical services.

Start with rolling updates and health checks. Graduate to blue/green for critical systems. Automate smoke tests to catch issues before reaching users.

The goal is boring deployments—they should work so reliably that nobody notices them happening.