Deploying Without Canary — How One Bad Deploy Hits All Your Users at Once

Introduction

A standard deploy replaces every running instance with the new version simultaneously. If the new version has a bug that affects 5% of requests, 100% of your users are immediately exposed to it. By the time your error rate alert fires and you start a rollback, the incident has been running for minutes across your entire user base. Canary deployments invert this: expose 1–5% of traffic to the new version first, watch the metrics, and only promote if everything looks good.

The Cost of All-at-Once Deploys

All-at-once deploy incident timeline:

T+0:    Deploy ships to all 20 pods simultaneously
T+2min: New version is live on 100% of traffic
T+4min: Error rate climbs from 0.1% to 3%
T+5min: Alert fires (5-minute evaluation window)
T+7min: Engineer acknowledges alert, starts investigation
T+12min: Root cause identified: new code has null-pointer bug
T+15min: Rollback initiated
T+18min: Old version restored
18 minutes of elevated errors for ALL users

Canary deploy — same bug:

T+0:    Deploy ships to 1 of 20 pods (5% traffic)
T+2min: 5% of traffic hits new version
T+4min: Error rate on canary pod: 3%  (vs 0.1% baseline)
T+5min: Canary health check detects divergence, auto-pauses
T+6min: 95% of users never saw the bug
T+7min: Engineer investigates with full prod data, zero urgency
Only 5% of users exposed for 5 minutes

Fix 1: Kubernetes Canary With Traffic Splitting

# Deploy the canary alongside the stable version
# Stable: 19 replicas | Canary: 1 replica = ~5% traffic split

# deployment-stable.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
  labels:
    app: myapp
    track: stable
spec:
  replicas: 19
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
        version: "v1.4.2"
    spec:
      containers:
        - name: myapp
          image: myapp:v1.4.2

---
# deployment-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  labels:
    app: myapp
    track: canary
spec:
  replicas: 1  # 1 out of 20 total = 5% traffic
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
        version: "v1.4.3"  # New version
    spec:
      containers:
        - name: myapp
          image: myapp:v1.4.3

---
# service.yaml — routes to BOTH deployments by app label
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp  # Matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 3000

Fix 2: Automated Canary Analysis

// canary-analyzer.ts — decide to promote or rollback based on metrics
interface CanaryMetrics {
  errorRate: number
  p99Latency: number
  p50Latency: number
  requestCount: number
}

async function analyzeCanary(
  canaryMetrics: CanaryMetrics,
  baselineMetrics: CanaryMetrics
): Promise<'promote' | 'rollback' | 'hold'> {
  // Need enough traffic to be statistically significant
  if (canaryMetrics.requestCount < 100) {
    return 'hold'
  }

  // Error rate check: canary must not be significantly worse
  const errorRateDelta = canaryMetrics.errorRate - baselineMetrics.errorRate
  if (errorRateDelta > 0.01) {  // More than 1 percentage point higher
    console.error(`Canary error rate ${(canaryMetrics.errorRate * 100).toFixed(2)}% vs baseline ${(baselineMetrics.errorRate * 100).toFixed(2)}%`)
    return 'rollback'
  }

  // Latency check: canary p99 must not regress by more than 20%
  const latencyRegression = (canaryMetrics.p99Latency - baselineMetrics.p99Latency) / baselineMetrics.p99Latency
  if (latencyRegression > 0.20) {
    console.error(`Canary p99 latency ${canaryMetrics.p99Latency}ms vs baseline ${baselineMetrics.p99Latency}ms (${(latencyRegression * 100).toFixed(0)}% regression)`)
    return 'rollback'
  }

  // All checks passed
  return 'promote'
}

// Progressive rollout controller
async function progressiveRollout(newVersion: string) {
  const stages = [
    { canaryPercent: 5, durationMinutes: 10 },
    { canaryPercent: 25, durationMinutes: 15 },
    { canaryPercent: 50, durationMinutes: 15 },
    { canaryPercent: 100, durationMinutes: 0 },  // Full rollout
  ]

  for (const stage of stages) {
    console.log(`Setting canary to ${stage.canaryPercent}%...`)
    await setCanaryWeight(newVersion, stage.canaryPercent)

    if (stage.durationMinutes > 0) {
      await sleep(stage.durationMinutes * 60 * 1000)

      const [canaryMetrics, baselineMetrics] = await Promise.all([
        fetchMetrics({ version: newVersion }),
        fetchMetrics({ version: 'stable' }),
      ])

      const decision = await analyzeCanary(canaryMetrics, baselineMetrics)

      if (decision === 'rollback') {
        await setCanaryWeight(newVersion, 0)
        await alerting.critical(`Canary rollback at ${stage.canaryPercent}%: metrics degraded`)
        return
      }
    }
  }

  console.log('✅ Full rollout complete')
}

Fix 3: Nginx / Ingress Traffic Splitting

# nginx canary config — weight-based upstream routing
upstream myapp_stable {
  # split_clients below handles the 95/5 split, so per-server weights
  # are unnecessary here — servers just balance evenly within each group
  server stable-pod-1:3000;
  server stable-pod-2:3000;
  # ... 19 stable pods
}

upstream myapp_canary {
  server canary-pod-1:3000;
}

# Split: 95% stable, 5% canary
split_clients "${remote_addr}AAA" $upstream {
  95%   myapp_stable;
  *     myapp_canary;
}

server {
  location / {
    proxy_pass http://$upstream;
    add_header X-Upstream $upstream;  # For debugging
  }
}

# Kubernetes nginx-ingress canary annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"  # 5% to canary
spec:
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80

Fix 4: User-Segment Canary (Employees First)

// Route specific users to canary: internal employees, beta users, etc.
// Before random traffic splitting, validate with controlled group

function shouldUseCanary(userId: string, request: Request): boolean {
  // Internal employees always get canary
  if (isInternalEmail(request.headers['x-user-email'])) {
    return true
  }

  // Beta users opted in
  if (userFlags.has(userId, 'canary-access')) {
    return true
  }

  // Percentage rollout based on user ID hash (consistent per user)
  const hash = murmurHash(userId) % 100
  return hash < CANARY_PERCENT  // e.g., 5 for 5%
}

// In your load balancer / API gateway
app.use((req, res, next) => {
  const userId = req.user?.id ?? req.ip
  if (shouldUseCanary(userId, req)) {
    req.headers['x-route-to'] = 'canary'
  }
  next()
})

Canary Metrics Dashboard

// Track canary vs stable side-by-side in Prometheus
// These labels let you split metrics by version in Grafana

app.use((req, res, next) => {
  const start = Date.now()
  const version = process.env.APP_VERSION ?? 'unknown'
  const track = process.env.DEPLOY_TRACK ?? 'stable'  // 'stable' or 'canary'

  res.on('finish', () => {
    const duration = Date.now() - start
    const status = res.statusCode >= 500 ? 'error' : 'success'

    // Prometheus metrics with version labels
    httpRequestDuration.observe(
      { method: req.method, route: req.route?.path ?? 'unknown', status, version, track },
      duration / 1000
    )
    httpRequestTotal.inc({ status, version, track })
  })

  next()
})

// Grafana query:
// Error rate canary vs stable:
// rate(http_requests_total{status="error",track="canary"}[5m])
//   / rate(http_requests_total{track="canary"}[5m])

Canary Deployment Checklist

  • ✅ New version deploys to a small slice (1–5%) of traffic first
  • ✅ Canary metrics are compared against stable baseline automatically
  • ✅ Rollback happens automatically if error rate or latency exceeds threshold
  • ✅ Internal employees and beta users receive canary before general traffic
  • ✅ Canary duration is long enough to cover all traffic patterns (15–30 minutes minimum)
  • ✅ Full rollout only happens after all canary stages pass
  • ✅ Canary vs stable metrics are visible in dashboards in real time

Conclusion

All-at-once deploys are a bet that your testing caught everything. Canary deployments are an acknowledgment that testing never catches everything — so you expose 5% of real traffic first and watch what happens. The mechanics are straightforward: run one pod on the new version while the rest stay on the current version, compare error rates and latency, and automate the promote/rollback decision. The extra 15 minutes of canary observation will catch the bugs your staging environment missed.