Zero-Downtime Deployments — Rolling Updates, Blue/Green, and Health Check Patterns

By Sanjeev Sharma (@webcoderspeed1)
Introduction
Every deployment is a risk. A single pod restart can drop traffic for a few milliseconds; coordinating 100 pod restarts incorrectly can cascade into an outage.
Zero-downtime deployment means no dropped connections, no request failures during rollout, and no manual intervention during the changeover. It is achieved through rolling updates with proper health checks, graceful shutdown, and careful orchestration.
This guide implements production-grade zero-downtime deployment patterns.
- Rolling Updates in Kubernetes
- Graceful Shutdown and SIGTERM Handling
- Health Check Endpoints
- PreStop Hook for Connection Draining
- Blue/Green Deployments with Weighted DNS
- Deployment Smoke Test Automation
- Conclusion
Rolling Updates in Kubernetes
Kubernetes rolling updates gradually replace old pods with new ones, maintaining service availability.
```yaml
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # Max extra pods during rollout (1 = 1 extra)
      maxUnavailable: 0 # Never terminate a pod before its replacement is ready
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
        version: '2.1.0'
    spec:
      # Graceful shutdown: 60 seconds to finish requests
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: api-server:2.1.0
          ports:
            - containerPort: 3000
              name: http
          # Readiness probe: is the pod ready to receive traffic?
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          # Liveness probe: is the pod alive? (restart if not)
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Startup probe: allow extra time for slow startup
          startupProbe:
            httpGet:
              path: /health/startup
              port: http
            failureThreshold: 30
            periodSeconds: 5
          # Graceful shutdown hook
          lifecycle:
            preStop:
              exec:
                command: ['/bin/sh', '-c', 'sleep 15']
```
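The interplay of `replicas`, `maxSurge`, and `maxUnavailable` determines how the rollout proceeds: total pods never exceed `replicas + maxSurge`, and ready pods never drop below `replicas - maxUnavailable`. A rough, hypothetical simulation of that arithmetic (the real controller is more nuanced):

```typescript
interface RolloutStats {
  steps: number;    // create/remove cycles needed
  minReady: number; // lowest ready-pod count observed during the rollout
}

// Assumes every new pod becomes ready; this only illustrates the budgets.
export function simulateRollingUpdate(
  replicas: number,
  maxSurge: number,
  maxUnavailable: number
): RolloutStats {
  if (maxSurge === 0 && maxUnavailable === 0) {
    throw new Error('Rollout cannot make progress with both budgets at 0');
  }
  let oldPods = replicas; // ready pods running the old version
  let newPods = 0;        // ready pods running the new version
  let steps = 0;
  let minReady = replicas;

  while (newPods < replicas) {
    // Create as many new pods as the surge budget allows.
    const created = Math.min(
      replicas - newPods,
      replicas + maxSurge - (oldPods + newPods)
    );
    newPods += created;
    // Remove old pods while keeping ready >= replicas - maxUnavailable.
    const removed = Math.min(
      oldPods,
      Math.max(0, oldPods + newPods - replicas + maxUnavailable)
    );
    minReady = Math.min(minReady, oldPods - removed + newPods);
    oldPods -= removed;
    steps++;
  }
  return { steps, minReady };
}
```

With the manifest above (`replicas: 3`, `maxSurge: 1`, `maxUnavailable: 0`), this walks pods out one at a time without ever dipping below three ready pods.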
Graceful Shutdown and SIGTERM Handling
When Kubernetes sends SIGTERM, the process must stop accepting new connections and drain existing ones.
```ts
// lib/gracefulShutdown.ts
import http from 'http';
import net from 'net';

export class GracefulShutdownManager {
  private server: http.Server | null = null;
  private activeConnections = new Set<net.Socket>();
  private isShuttingDown = false;
  // Must fit inside terminationGracePeriodSeconds (60s) minus the preStop sleep (15s)
  private readonly shutdownTimeout = 45000; // 45 seconds

  initialize(server: http.Server) {
    this.server = server;
    // Track open sockets so we can report drain progress
    server.on('connection', (conn) => {
      this.activeConnections.add(conn);
      conn.on('close', () => {
        this.activeConnections.delete(conn);
      });
    });
    // Handle SIGTERM (Kubernetes termination signal) and SIGINT (local Ctrl-C)
    process.on('SIGTERM', () => this.shutdown('SIGTERM'));
    process.on('SIGINT', () => this.shutdown('SIGINT'));
  }

  async shutdown(signal: string) {
    if (this.isShuttingDown) return;
    this.isShuttingDown = true;
    console.log(`Received ${signal}, starting graceful shutdown...`);

    // Step 1: Stop accepting new connections
    this.server?.close(() => {
      console.log('HTTP server closed, no longer accepting connections');
    });

    // Step 2: Wait for existing connections to close
    const drainStart = Date.now();
    let lastLogged = 0;
    const drainInterval = setInterval(async () => {
      const elapsed = Date.now() - drainStart;
      if (this.activeConnections.size === 0) {
        console.log(`All connections drained after ${elapsed}ms`);
        clearInterval(drainInterval);
        await this.drainResources();
        process.exit(0);
      }
      if (elapsed - lastLogged >= 5000) {
        lastLogged = elapsed;
        console.log(
          `Draining: ${this.activeConnections.size} active connections...`
        );
      }
    }, 500);

    // Step 3: Force shutdown if the drain timeout is exceeded
    setTimeout(() => {
      const remaining = this.activeConnections.size;
      if (remaining > 0) {
        console.warn(
          `Timeout: ${remaining} connections still open, forcing shutdown`
        );
        process.exit(1);
      }
    }, this.shutdownTimeout);
  }

  // Drain database connections, cache, etc.
  // `db` and `redis` stand in for your own client instances.
  async drainResources() {
    console.log('Draining database connections...');
    await db.$disconnect();
    console.log('Closing cache connections...');
    await redis.quit();
  }
}
```
```ts
// server.ts — usage in a Next.js custom server
import { createServer } from 'http';
import { parse } from 'url';
import next from 'next';
import { GracefulShutdownManager } from './lib/gracefulShutdown';

const dev = process.env.NODE_ENV !== 'production';
const app = next({ dev });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = createServer(async (req, res) => {
    try {
      const parsedUrl = parse(req.url!, true);
      await handle(req, res, parsedUrl);
    } catch (err) {
      res.statusCode = 500;
      res.end('Internal server error');
    }
  });

  const shutdownManager = new GracefulShutdownManager();
  shutdownManager.initialize(server);

  const port = parseInt(process.env.PORT || '3000', 10);
  server.listen(port, () => {
    console.log(`Server running on port ${port}`);
  });
});
```
Health Check Endpoints
Three distinct health checks serve different purposes in Kubernetes.
```ts
// pages/api/health/startup.ts — signals that one-time initialization is complete
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  // Check: database is reachable, cache is warm, config is loaded.
  // The static values here stand in for real checks.
  const checks = {
    database: { ok: true, latency: 2 },
    cache: { ok: true, latency: 1 },
    dependencies: { ok: true },
  };
  const allHealthy = Object.values(checks).every((c) => c.ok);
  if (!allHealthy) {
    return res.status(503).json({ status: 'starting', checks });
  }
  res.json({ status: 'started', checks });
}
```
```ts
// pages/api/health/ready.ts — indicates the pod can receive traffic right now
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  // Quick verification that the pod can serve requests.
  // This runs frequently (every 5 seconds), so keep it fast.
  // `pendingRequests`, `maxQueueSize`, and `db` are app-level state and clients.
  const checks = {
    memory: { ok: process.memoryUsage().heapUsed < 400 * 1024 * 1024 }, // < 400MB
    requestQueue: { ok: pendingRequests < maxQueueSize },
    database: { ok: db.isConnected() }, // cached connectivity flag, not a live ping
  };
  const allHealthy = Object.values(checks).every((c) => c.ok);
  if (!allHealthy) {
    return res.status(503).json({ status: 'not-ready', checks });
  }
  res.json({ status: 'ready', checks });
}
```
```ts
// pages/api/health/live.ts — indicates the process itself is not stuck
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  // Kubernetes restarts the pod if this fails (deadlock, runaway memory).
  const checks = {
    eventLoop: { ok: eventLoopLagMs() < 100 }, // lag sampled in the background
    memory: { ok: process.memoryUsage().heapUsed < 512 * 1024 * 1024 }, // < 512MB
  };
  const allHealthy = Object.values(checks).every((c) => c.ok);
  if (!allHealthy) {
    return res.status(503).json({ status: 'not-live', checks });
  }
  res.json({ status: 'alive', checks });
}

// Background event-loop lag monitor: a timer that should fire every 500ms.
// If it fires late, the difference is how long the loop was blocked.
// (A synchronous check cannot work — setImmediate callbacks never run while
// the checking function itself occupies the loop.)
let lastLag = 0;
let lastTick = Date.now();
setInterval(() => {
  const now = Date.now();
  lastLag = now - lastTick - 500; // how late this tick fired
  lastTick = now;
}, 500).unref();

function eventLoopLagMs(): number {
  // Report the worse of the last measured lag and the currently accruing one
  return Math.max(lastLag, Date.now() - lastTick - 500);
}
```
PreStop Hook for Connection Draining
The preStop hook runs before SIGTERM, allowing graceful request completion.
```yaml
# k8s-deployment.yaml (lifecycle section)
containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          # Wait 15 seconds, giving connections time to drain.
          # Paired with terminationGracePeriodSeconds: 60
          command: ['/bin/sh', '-c', 'sleep 15']

# Total shutdown window:
# 1. preStop hook executes (15s) — connection draining period
# 2. SIGTERM sent → application exits gracefully
# 3. terminationGracePeriodSeconds (60s, counted from the start of termination,
#    including the preStop hook) — SIGKILL if still alive
```
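One thing worth checking in CI: the grace period is a single budget that covers both the preStop sleep and the application's own drain timeout. A tiny sanity-check sketch (the function name and parameters are ours, not Kubernetes API fields):

```typescript
// Returns true if the kubelet's grace period can cover the preStop sleep
// plus the application's own drain timeout before SIGKILL would arrive.
export function shutdownBudgetOk(
  terminationGracePeriodSeconds: number,
  preStopSleepSeconds: number,
  appDrainTimeoutSeconds: number
): boolean {
  // The grace-period countdown starts when termination begins, so the
  // preStop hook's runtime eats into it.
  return (
    preStopSleepSeconds + appDrainTimeoutSeconds <=
    terminationGracePeriodSeconds
  );
}
```

With the values used in this guide (60s grace, 15s preStop), the in-process drain timeout must be 45 seconds or less.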
Advanced preStop with health checks:
```yaml
containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              # Mark the pod as shutting down; the readiness probe should
              # check for this file and start returning 503.
              touch /tmp/shutdown
              # Wait for in-flight requests to complete (max 45 seconds).
              # Assumes curl and jq exist in the image and the app exposes
              # a pending-request counter at /health/requests.
              timeout 45s /bin/sh -c '
                while [ "$(curl -s http://localhost:3000/health/requests | jq .pending)" -gt 0 ]; do
                  sleep 0.5
                done
              '
              # Allow any remaining requests 5 seconds to finish
              sleep 5
```
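The `touch /tmp/shutdown` marker only helps if the readiness endpoint honors it. A minimal sketch of that wiring (the marker path matches the hook above; the function and its return convention are our assumptions):

```typescript
import { existsSync } from 'fs';

// Path the preStop hook touches to signal shutdown (assumption from the hook above)
const SHUTDOWN_MARKER = '/tmp/shutdown';

// Status code the readiness probe should receive: once the marker exists,
// return 503 so the endpoints controller removes the pod from the Service
// before SIGTERM arrives.
export function readinessStatus(markerPath: string = SHUTDOWN_MARKER): number {
  return existsSync(markerPath) ? 503 : 200;
}
```

In the readiness handler, call `readinessStatus()` first and short-circuit with 503 before running any other checks.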
Blue/Green Deployments with Weighted DNS
For critical services, use blue/green: run two complete copies of the environment and shift traffic between them, either instantly or in weighted steps (a canary-style rollout).
```ts
// lib/blueGreenDeploy.ts
// `deployToEnvironment`, `runSmokeTests`, `getEnvironmentUrl`, `monitorHealth`,
// `decommissionEnvironment`, and `getLoadBalancerDns` are app-specific helpers.
import aws from 'aws-sdk';

const route53 = new aws.Route53();

interface BlueGreenConfig {
  domainName: string;
  blueEnvironment: string; // Current production
  greenEnvironment: string; // New version
  newVersion: string; // Version tag to deploy to green
}

export async function canaryDeployment(config: BlueGreenConfig): Promise<void> {
  // Phase 1: Deploy to green environment
  console.log('Deploying new version to green environment...');
  await deployToEnvironment(config.greenEnvironment, config.newVersion);

  // Phase 2: Run smoke tests against green
  console.log('Running smoke tests against green...');
  const smokeTestsPassed = await runSmokeTests(
    getEnvironmentUrl(config.greenEnvironment)
  );
  if (!smokeTestsPassed) {
    throw new Error('Smoke tests failed, rolling back');
  }

  // Phase 3: Gradually shift traffic to green
  const trafficShifts = [
    { blue: 95, green: 5, duration: 60 }, // 1 minute
    { blue: 80, green: 20, duration: 300 }, // 5 minutes
    { blue: 50, green: 50, duration: 600 }, // 10 minutes
    { blue: 0, green: 100, duration: 60 }, // Final: 100% to green
  ];

  for (const shift of trafficShifts) {
    console.log(`Shifting traffic: ${shift.blue}% blue, ${shift.green}% green`);
    await updateRoute53Weights(config.domainName, {
      blue: config.blueEnvironment,
      green: config.greenEnvironment,
      blueWeight: shift.blue,
      greenWeight: shift.green,
    });

    // Monitor metrics during the shift window
    const healthy = await monitorHealth(config.greenEnvironment, shift.duration);
    if (!healthy) {
      console.error('Green environment unhealthy, rolling back');
      await updateRoute53Weights(config.domainName, {
        blue: config.blueEnvironment,
        green: config.greenEnvironment,
        blueWeight: 100,
        greenWeight: 0,
      });
      throw new Error('Deployment rolled back due to health check failure');
    }
  }

  // Phase 4: Clean up the old blue environment
  console.log('Deployment successful, cleaning up blue environment');
  await decommissionEnvironment(config.blueEnvironment);
}

async function updateRoute53Weights(
  domainName: string,
  weights: {
    blue: string;
    green: string;
    blueWeight: number;
    greenWeight: number;
  }
) {
  const params = {
    // 'Z123456' values are placeholders. Note the AliasTarget zone must be
    // the load balancer's hosted zone ID, not the domain's.
    HostedZoneId: 'Z123456',
    ChangeBatch: {
      Changes: [
        {
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: domainName,
            Type: 'A',
            SetIdentifier: 'blue',
            Weight: weights.blueWeight,
            AliasTarget: {
              HostedZoneId: 'Z123456',
              DNSName: getLoadBalancerDns(weights.blue),
              EvaluateTargetHealth: true,
            },
          },
        },
        {
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: domainName,
            Type: 'A',
            SetIdentifier: 'green',
            Weight: weights.greenWeight,
            AliasTarget: {
              HostedZoneId: 'Z123456',
              DNSName: getLoadBalancerDns(weights.green),
              EvaluateTargetHealth: true,
            },
          },
        },
      ],
    },
  };
  await route53.changeResourceRecordSets(params).promise();
}
```
Deployment Smoke Test Automation
Automated verification that new version is production-ready.
```ts
// lib/smokeTests.ts
export async function runSmokeTests(
  baseUrl: string,
  timeout = 30000
): Promise<boolean> {
  // Per-request timeout via AbortSignal — the fetch API has no `timeout` option
  const opts = (extra: RequestInit = {}): RequestInit => ({
    signal: AbortSignal.timeout(timeout),
    ...extra,
  });

  const tests: Array<{ name: string; fn: () => Promise<void> }> = [
    {
      name: 'Health check',
      fn: async () => {
        const res = await fetch(`${baseUrl}/health/ready`, opts());
        if (!res.ok) throw new Error(`Health check failed: ${res.status}`);
      },
    },
    {
      name: 'API connectivity',
      fn: async () => {
        const res = await fetch(
          `${baseUrl}/api/users`,
          opts({ headers: { Authorization: 'Bearer test-token' } })
        );
        if (!res.ok) throw new Error(`API call failed: ${res.status}`);
      },
    },
    {
      name: 'Critical database queries',
      fn: async () => {
        const res = await fetch(
          `${baseUrl}/api/stats`,
          opts({ headers: { Authorization: 'Bearer test-token' } })
        );
        if (!res.ok) throw new Error(`Database query failed: ${res.status}`);
      },
    },
    {
      name: 'Third-party integrations',
      fn: async () => {
        const res = await fetch(
          `${baseUrl}/api/integrations/status`,
          opts({ headers: { Authorization: 'Bearer test-token' } })
        );
        if (!res.ok) throw new Error(`Integration check failed: ${res.status}`);
      },
    },
  ];

  let passed = 0;
  for (const test of tests) {
    try {
      await test.fn();
      console.log(`✓ ${test.name}`);
      passed++;
    } catch (err) {
      console.error(`✗ ${test.name}: ${(err as Error).message}`);
    }
  }
  return passed === tests.length;
}
```
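In CI, a transient network blip should not abort a deploy outright. One option (a hypothetical wrapper, not part of the pipeline above) is to retry the whole suite a few times before declaring failure:

```typescript
// Retry a smoke-test run a few times before failing the deploy.
// `run` is any suite entry point, e.g. () => runSmokeTests(greenUrl).
export async function smokeTestGate(
  run: () => Promise<boolean>,
  attempts = 3,
  delayMs = 5000
): Promise<boolean> {
  for (let i = 1; i <= attempts; i++) {
    if (await run()) return true; // suite passed on this attempt
    if (i < attempts) {
      console.warn(`Smoke tests failed (attempt ${i}/${attempts}), retrying...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false; // all attempts exhausted — block the deploy
}
```

Keep the retry count low: a suite that needs three attempts to pass is telling you something about the environment.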
Conclusion
Zero-downtime deployments require orchestration at every layer: Kubernetes rolling updates with proper health checks, graceful SIGTERM handling, preStop hooks for connection draining, and canary/blue-green deployment for critical services.
Start with rolling updates and health checks. Graduate to blue/green for critical systems. Automate smoke tests to catch issues before they reach users.
The goal is boring deployments—they should work so reliably that nobody notices them happening.