Leader Election Gone Wrong — When Two Nodes Both Think They're in Charge

Introduction

Leader election sounds simple: pick one node to be in charge, have others stand by. The hard part is handling the moment when the leader goes quiet — is it dead, or just slow? If you declare a new leader too early, you have two leaders. If you wait too long, work stops. Most homegrown election implementations get this wrong in production.

The Problem

3-node cluster. Node 1 is elected leader.

T=0:00  Node 1 is leader, processing job queue
T=0:10  Network partition: Node 1 can't reach Nodes 2 and 3
T=0:15  Nodes 2 and 3 declare Node 1 dead — elect Node 2 as leader
T=0:15  Node 1 still thinks it's leader (it can reach the database!)
T=0:20  Both Node 1 AND Node 2 pull jobs from queue
T=0:20  Same invoice processed twice → customer charged twice
T=0:25  Network heals. Now what?

The core problem is that "is the leader healthy?" is a distributed question with no perfect answer. Every election algorithm is a trade-off between availability and consistency.
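
The failure mode in the timeline comes from a stale local belief: each node keeps its own "am I leader?" flag and never re-validates it against shared state at the moment it acts. A minimal sketch (all names illustrative):

```typescript
// Two nodes each hold a local "I am leader" flag. During the partition,
// nothing can update Node 1's flag, so both flags end up true.
const belief = { node1: true, node2: false }

// T=0:15 — Nodes 2 and 3 elect Node 2, but Node 1 is unreachable
belief.node2 = true // belief.node1 is still true: two self-declared leaders

function pull(node: 'node1' | 'node2', queue: string[]): string | undefined {
  // Both nodes run this check against their own stale flag — nothing stops
  // two "leaders" from seeing the same head-of-queue job.
  return belief[node] ? queue[0] : undefined
}

const queue = ['invoice-42']
console.log(pull('node1', queue)) // 'invoice-42' — partitioned but confident
console.log(pull('node2', queue)) // 'invoice-42' — the double charge
```

The fixes below all attack the same root cause: replace the local flag with a shared, expiring lease that is re-checked at the point of every write.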

Fix 1: Redis Lease with Strict Expiry

The simplest correct approach: a leader holds a short-lived Redis key. If it can't renew the key, it stops working — even if it "feels" healthy.

import { Redis } from 'ioredis'

class LeaderLease {
  private isLeader = false
  private leaseKey: string
  private leaseTTL: number  // milliseconds
  private renewTimer: NodeJS.Timeout | null = null

  constructor(
    private redis: Redis,
    private instanceId: string,
    options: { name: string; ttlMs: number } = { name: 'leader', ttlMs: 15_000 }
  ) {
    this.leaseKey = `leader:${options.name}`
    this.leaseTTL = options.ttlMs
  }

  async start(): Promise<void> {
    await this.tryAcquire()
    // Renew at 1/3 of TTL to give 2 renewal attempts before expiry
    this.renewTimer = setInterval(() => this.renew(), this.leaseTTL / 3)
  }

  private async tryAcquire(): Promise<void> {
    const acquired = await this.redis.set(
      this.leaseKey,
      this.instanceId,
      'PX',   // millisecond precision
      this.leaseTTL,
      'NX'    // only set if key doesn't exist
    )

    if (acquired) {
      this.isLeader = true
      console.log(`[Leader] ${this.instanceId} acquired lease`)
      await this.onBecameLeader()
    }
  }

  private async renew(): Promise<void> {
    if (!this.isLeader) {
      // Not leader — try to acquire
      await this.tryAcquire()
      return
    }

    // Renew only if we still own the lease (Lua script for atomicity)
    const renewed = await this.redis.eval(
      `
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('pexpire', KEYS[1], ARGV[2])
      else
        return 0
      end
      `,
      1,
      this.leaseKey,
      this.instanceId,
      String(this.leaseTTL)
    )

    if (renewed === 0) {
      // Lost the lease (Redis restart? Eviction? Another node won?)
      console.warn(`[Leader] ${this.instanceId} lost lease — stepping down`)
      this.isLeader = false
      await this.onLostLeadership()
    }
  }

  getIsLeader(): boolean {
    return this.isLeader
  }

  // Override these in subclasses or pass as options
  protected async onBecameLeader(): Promise<void> {}
  protected async onLostLeadership(): Promise<void> {}

  async stop(): Promise<void> {
    if (this.renewTimer) clearInterval(this.renewTimer)
    if (this.isLeader) {
      this.isLeader = false
      // Release the lease on graceful shutdown (only if we still own it) so
      // failover is immediate instead of waiting out the full TTL.
      await this.redis.eval(
        `if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) end`,
        1,
        this.leaseKey,
        this.instanceId
      )
    }
  }
}

The critical safety rule: a node must stop doing leader work the moment it can't renew its lease, even if the local process is healthy. Redis being unreachable = no longer leader.

class SafeLeader extends LeaderLease {
  private jobTimer: NodeJS.Timeout | null = null

  protected async onBecameLeader() {
    // Start doing leader-only work
    this.jobTimer = setInterval(() => this.processJobQueue(), 10_000)
  }

  protected async onLostLeadership() {
    // STOP doing leader-only work immediately
    if (this.jobTimer) clearInterval(this.jobTimer)
    this.jobTimer = null
  }

  private async processJobQueue() {
    // Double-check at the start of every critical operation
    if (!this.getIsLeader()) return

    const jobs = await db.job.findPending()
    for (const job of jobs) {
      // Check again before each job — lease could expire mid-loop
      if (!this.getIsLeader()) break
      await processJob(job)
    }
  }
}

Fix 2: Fencing Tokens

Even with lease renewal, there's a window where an old leader might be mid-operation when it loses the lease. A fencing token prevents it from completing that operation.

// Fencing: every lease acquisition returns a monotonically increasing token
// The token must be presented when writing — database rejects stale tokens

class FencedLeaderLease {
  private fencingToken = 0

  constructor(
    private redis: Redis,
    private instanceId: string,
    private leaseKey: string,
    private leaseTTL: number  // milliseconds
  ) {}

  async acquire(): Promise<number | null> {
    // Lua: increment the monotonic counter and set the lease key atomically.
    // The counter only ever goes up, even across leadership changes.
    const result = await this.redis.eval(
      `
      local token = redis.call('incr', KEYS[1] .. ':token')
      local set = redis.call('set', KEYS[1], ARGV[1], 'PX', ARGV[2], 'NX')
      if set then
        return token
      else
        return nil
      end
      `,
      1,
      this.leaseKey,
      this.instanceId,
      String(this.leaseTTL)
    )

    if (result) {
      this.fencingToken = result as number
      return this.fencingToken
    }
    return null
  }

  getToken(): number { return this.fencingToken }
}

// In the job processor — include fencing token in every write
async function processJobWithFencing(jobId: string, token: number) {
  // Database enforces: only accept writes if token > last seen token
  await db.query(`
    UPDATE jobs
    SET status = 'complete', processed_by_token = $1
    WHERE id = $2
      AND (processed_by_token IS NULL OR processed_by_token < $1)
  `, [token, jobId])

  // If 0 rows affected → stale leader, another leader already processed this
}

Fencing tokens turn "maybe processed twice" into "definitely processed once." Even if the old leader wins a race to the database, the write is rejected because its token is lower than the new leader's.
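
What the caller should do with the "0 rows affected" outcome can be sketched like this. `QueryResultLike`, `StaleLeaderError`, and `checkFencedWrite` are illustrative names; `QueryResultLike` stands in for pg's query result, of which only `rowCount` matters here:

```typescript
// Illustrative sketch: reacting to the fenced UPDATE above.
interface QueryResultLike { rowCount: number }

class StaleLeaderError extends Error {}

function checkFencedWrite(result: QueryResultLike, jobId: string, token: number): void {
  if (result.rowCount === 0) {
    // Zero rows updated: a leader with a higher token already processed this
    // job. Retrying with the same token can never succeed — step down instead.
    throw new StaleLeaderError(`token ${token} is stale for job ${jobId}`)
  }
}
```

A worker that catches `StaleLeaderError` should stop all leader-only work and re-enter the election, rather than retrying the write.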

Fix 3: Kubernetes Leader Election (Production-Ready)

For Kubernetes deployments, lean on coordination.k8s.io Lease objects instead of Redis. The snippet below sketches a leader-election helper with Go-client-style options; check what your client library version actually ships:

import * as k8s from '@kubernetes/client-node'

async function runWithKubernetesLeaderElection(onLeader: () => void) {
  const kc = new k8s.KubeConfig()
  kc.loadFromDefault()

  const coordinationClient = kc.makeApiClient(k8s.CoordinationV1Api)
  const leaseName = 'my-service-leader'
  const namespace = process.env.POD_NAMESPACE ?? 'default'
  const identity = process.env.HOSTNAME ?? 'unknown'

  const le = new k8s.LeaderElection(coordinationClient)

  await le.run({
    leaseName,
    namespace,
    identity,
    leaseDuration: 15,   // seconds
    renewDeadline: 10,   // must renew within 10s or step down
    retryPeriod: 2,      // retry every 2s
    onStartedLeading: () => {
      console.log(`${identity} became leader`)
      onLeader()
    },
    onStoppedLeading: () => {
      console.log(`${identity} lost leadership — exiting`)
      process.exit(1)  // Let Kubernetes restart the pod
    },
    onNewLeader: (newLeader: string) => {
      console.log(`New leader elected: ${newLeader}`)
    },
  })
}

The Kubernetes Lease resource replaces Redis, and the client-side election loop handles the renewal protocol. The key invariant is renewDeadline < leaseDuration: a node gives up renewing before its lease can expire under it. Note that timed leases bound the stale-leader window but do not eliminate it; pair them with fencing if duplicate work is unacceptable.
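
The relationship between these three parameters can be checked directly. This sketch uses illustrative names, not a real client API (real clients enforce similar invariants internally):

```typescript
// Sanity checks for election timing parameters (illustrative names).
interface ElectionTiming {
  leaseDuration: number  // seconds others wait before taking over
  renewDeadline: number  // seconds the leader keeps trying to renew
  retryPeriod: number    // seconds between renewal attempts
}

function timingErrors(t: ElectionTiming): string[] {
  const errors: string[] = []
  // The leader must give up before its lease can expire under it, or it may
  // keep acting as leader while another node is being elected.
  if (t.renewDeadline >= t.leaseDuration)
    errors.push('renewDeadline must be < leaseDuration')
  // At least a couple of retries should fit inside the renew deadline, so one
  // dropped request does not force an unnecessary step-down.
  if (t.retryPeriod * 2 > t.renewDeadline)
    errors.push('retryPeriod too large for renewDeadline')
  return errors
}

console.log(timingErrors({ leaseDuration: 15, renewDeadline: 10, retryPeriod: 2 }))  // []
```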

Fix 4: etcd for Strong Consistency

If you need linearizable leader election (stronger than a Redis lease, whose safety rests on timing assumptions during partitions and failover), use etcd:

import { Etcd3 } from 'etcd3'

const etcd = new Etcd3({ hosts: 'etcd:2379' })

async function runAsLeader() {
  const election = etcd.election('my-service-leader')

  // campaign() starts campaigning; the 'elected' event fires once this
  // instance becomes leader. Backed by etcd's Raft consensus.
  const campaign = election.campaign(process.env.HOSTNAME ?? 'unknown')

  campaign.on('elected', () => {
    console.log('Became leader!')
    // Do leader-only work here; call campaign.resign() to step down.
  })

  campaign.on('error', (err) => {
    console.error('Lost leadership or error:', err)
    process.exit(1)  // restart and re-campaign
  })

  // Watch for leadership changes from any node
  const observer = await election.observe()
  observer.on('change', (leader) => {
    console.log('Current leader:', leader)
  })
}

runAsLeader().catch((err) => {
  console.error('Election setup failed:', err)
  process.exit(1)
})

etcd elections are backed by the Raft consensus algorithm: two nodes can never hold the election key at the same time, and a partitioned leader's key is deleted when its session lease expires. A stale leader can still act briefly before it notices, so using the election key's revision as a fencing token remains good practice.

The Lease Duration Trade-off

Short lease (5 seconds):

  • Fast failover — a new leader is elected in ~5 seconds
  • More renewal network traffic
  • Noisy — a brief Redis hiccup causes an unnecessary leader change
  • Old leader gets up to 5 seconds in which it might do stale work

Long lease (60 seconds):

  • Robust to transient network issues
  • Less renewal traffic
  • Slow failover — up to 60 seconds of downtime if the leader dies
  • Old leader has up to 60 seconds to do stale work (unless fencing is used)

A good default for most services:

  • leaseTTL: 15 seconds
  • renewInterval: 5 seconds (renew at 1/3 TTL)
  • failover time: ~15 seconds (one full TTL expiry)
  • Add fencing tokens if even 15 seconds of split-brain is unacceptable
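
The worst-case numbers behind these defaults fall out of simple arithmetic (pure computation, constants from the list above):

```typescript
// Worst-case timing for the recommended defaults.
const leaseTTLMs = 15_000
const renewIntervalMs = leaseTTLMs / 3  // 5s — two retries fit before expiry

// Worst-case failover: the leader dies immediately after a successful renewal,
// so the lease blocks a new election for one full TTL.
const worstCaseFailoverMs = leaseTTLMs

// Worst-case stale-work window without fencing: the same full TTL, since the
// old leader may not notice the loss until its next failed renewal.
const worstCaseStaleWorkMs = leaseTTLMs

console.log({ renewIntervalMs, worstCaseFailoverMs, worstCaseStaleWorkMs })
// { renewIntervalMs: 5000, worstCaseFailoverMs: 15000, worstCaseStaleWorkMs: 15000 }
```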

Leader Election Checklist

Risk                                     Solution
Two leaders simultaneously               Fencing tokens + per-operation lease check
Leader dies, no failover                 TTL on lease — auto-expires, enabling a new election
Redis restart → all leases lost          Handle onLostLeadership; shut down gracefully
Leader can't renew but is healthy        Stop all leader work the moment renewal fails
Split-brain during a partition           etcd election or K8s Lease (Raft-backed store), plus fencing
Stale leader finishes a long operation   Fencing token rejected by the database
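
The partition scenario from the timeline at the top can be sanity-checked without any real infrastructure. This in-memory sketch is illustrative: `FakeLeaseStore` stands in for Redis, and "partition" simply means a node cannot reach the store:

```typescript
// In-memory simulation of the partition timeline (illustrative names).
class FakeLeaseStore {
  private holder: string | null = null
  private reachable = new Set<string>()

  allow(node: string) { this.reachable.add(node) }
  partition(node: string) { this.reachable.delete(node) }

  tryAcquireOrRenew(node: string): boolean {
    if (!this.reachable.has(node)) return false  // node can't reach the store
    if (this.holder === null || this.holder === node) {
      this.holder = node
      return true
    }
    return false
  }

  expire() { this.holder = null }  // TTL elapsed without a renewal
}

// Scenario: node1 leads, gets partitioned, lease expires, node2 takes over.
const store = new FakeLeaseStore()
store.allow('node1'); store.allow('node2')

console.log(store.tryAcquireOrRenew('node1'))  // true — node1 is leader
store.partition('node1')
console.log(store.tryAcquireOrRenew('node1'))  // false — must step down NOW
store.expire()
console.log(store.tryAcquireOrRenew('node2'))  // true — node2 is the new leader
```

The invariant under test is the one Fix 1 enforces: a node that cannot renew must report "not leader", even though its own process is perfectly healthy.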

Conclusion

Leader election bugs are silent in development (single node, no partitions) and catastrophic in production (duplicate billing, corrupted state, split-brain). The minimum viable safe implementation has three properties: a short-lived lease that auto-expires, strict step-down when renewal fails (even if the node is otherwise healthy), and fencing tokens on any writes so stale leaders can't corrupt state. For Kubernetes deployments, use the built-in Lease resource rather than rolling your own. For stronger guarantees, use etcd's Raft-based election. Whatever approach you choose, test the network-partition scenario — that's where elections go wrong.