Leader Election Gone Wrong — When Two Nodes Both Think They're in Charge

Introduction

Leader election sounds simple: pick one node to be in charge, have others stand by. The hard part is handling the moment when the leader goes quiet — is it dead, or just slow? If you declare a new leader too early, you have two leaders. If you wait too long, work stops. Most homegrown election implementations get this wrong in production.

The Problem

3-node cluster. Node 1 is elected leader.

T=0:00  Node 1 is leader, processing job queue
T=0:10  Network partition: Node 1 can't reach Nodes 2 and 3
T=0:15  Nodes 2 and 3 declare Node 1 dead — elect Node 2 as leader
T=0:15  Node 1 still thinks it's leader (it can reach the database!)
T=0:20  Both Node 1 AND Node 2 pull jobs from queue
T=0:20  Same invoice processed twice → customer charged twice
T=0:25  Network heals. Now what?

The core problem is that "is the leader healthy?" is a distributed question with no perfect answer. Every election algorithm is a trade-off between availability and consistency.
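
The failure mode in the timeline comes from a stale local belief: each node keeps its own "am I leader?" flag and never re-validates it against shared state at the moment it acts. A minimal sketch (all names illustrative):

```typescript
// Two nodes each hold a local "I am leader" flag. During the partition,
// nothing can update Node 1's flag, so both flags end up true.
const belief = { node1: true, node2: false }

// T=0:15 — Nodes 2 and 3 elect Node 2, but Node 1 is unreachable
belief.node2 = true // belief.node1 is still true: two self-declared leaders

function pull(node: 'node1' | 'node2', queue: string[]): string | undefined {
  // Both nodes run this check against their own stale flag — nothing stops
  // two "leaders" from seeing the same head-of-queue job.
  return belief[node] ? queue[0] : undefined
}

const queue = ['invoice-42']
console.log(pull('node1', queue)) // 'invoice-42' — partitioned but confident
console.log(pull('node2', queue)) // 'invoice-42' — the double charge
```

The fixes below all attack the same root cause: replace the local flag with a shared, expiring lease that is re-checked at the point of every write.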

Fix 1: Redis Lease with Strict Expiry

The simplest correct approach: a leader holds a short-lived Redis key. If it can't renew the key, it stops working — even if it "feels" healthy.

import { Redis } from 'ioredis'

class LeaderLease {
  private isLeader = false
  private leaseKey: string
  private leaseTTL: number  // milliseconds
  private renewTimer: NodeJS.Timeout | null = null

  constructor(
    private redis: Redis,
    private instanceId: string,
    options: { name: string; ttlMs: number } = { name: 'leader', ttlMs: 15_000 }
  ) {
    this.leaseKey = `leader:${options.name}`
    this.leaseTTL = options.ttlMs
  }

  async start(): Promise<void> {
    await this.tryAcquire()
    // Renew at 1/3 of TTL to give 2 renewal attempts before expiry
    this.renewTimer = setInterval(() => this.renew(), this.leaseTTL / 3)
  }

  private async tryAcquire(): Promise<void> {
    const acquired = await this.redis.set(
      this.leaseKey,
      this.instanceId,
      'PX',   // millisecond precision
      this.leaseTTL,
      'NX'    // only set if key doesn't exist
    )

    if (acquired) {
      this.isLeader = true
      console.log(`[Leader] ${this.instanceId} acquired lease`)
      await this.onBecameLeader()
    }
  }

  private async renew(): Promise<void> {
    if (!this.isLeader) {
      // Not leader — try to acquire
      await this.tryAcquire()
      return
    }

    // Renew only if we still own the lease (Lua script for atomicity)
    const renewed = await this.redis.eval(
      `
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('pexpire', KEYS[1], ARGV[2])
      else
        return 0
      end
      `,
      1,
      this.leaseKey,
      this.instanceId,
      String(this.leaseTTL)
    )

    if (renewed === 0) {
      // Lost the lease (Redis restart? Eviction? Another node won?)
      console.warn(`[Leader] ${this.instanceId} lost lease — stepping down`)
      this.isLeader = false
      await this.onLostLeadership()
    }
  }

  getIsLeader(): boolean {
    return this.isLeader
  }

  // Override these in subclasses or pass as options
  protected async onBecameLeader(): Promise<void> {}
  protected async onLostLeadership(): Promise<void> {}

  async stop(): Promise<void> {
    if (this.renewTimer) clearInterval(this.renewTimer)
    if (this.isLeader) {
      this.isLeader = false
      // Release the lease on graceful shutdown (only if we still own it) so
      // failover is immediate instead of waiting out the full TTL.
      await this.redis.eval(
        `if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) end`,
        1,
        this.leaseKey,
        this.instanceId
      )
    }
  }
}

The critical safety rule: a node must stop doing leader work the moment it can't renew its lease, even if the local process is healthy. Redis being unreachable = no longer leader.

class SafeLeader extends LeaderLease {
  private jobTimer: NodeJS.Timeout | null = null

  protected async onBecameLeader() {
    // Start doing leader-only work
    this.jobTimer = setInterval(() => this.processJobQueue(), 10_000)
  }

  protected async onLostLeadership() {
    // STOP doing leader-only work immediately
    if (this.jobTimer) clearInterval(this.jobTimer)
    this.jobTimer = null
  }

  private async processJobQueue() {
    // Double-check at the start of every critical operation
    if (!this.getIsLeader()) return

    const jobs = await db.job.findPending()
    for (const job of jobs) {
      // Check again before each job — lease could expire mid-loop
      if (!this.getIsLeader()) break
      await processJob(job)
    }
  }
}

Fix 2: Fencing Tokens

Even with lease renewal, there's a window where an old leader might be mid-operation when it loses the lease. A fencing token prevents it from completing that operation.

// Fencing: every lease acquisition returns a monotonically increasing token
// The token must be presented when writing — database rejects stale tokens

class FencedLeaderLease {
  private fencingToken = 0

  constructor(
    private redis: Redis,
    private instanceId: string,
    private leaseKey: string,
    private leaseTTL: number  // milliseconds
  ) {}

  async acquire(): Promise<number | null> {
    // Lua: increment the monotonic counter and set the lease key atomically.
    // The counter only ever goes up, even across leadership changes.
    const result = await this.redis.eval(
      `
      local token = redis.call('incr', KEYS[1] .. ':token')
      local set = redis.call('set', KEYS[1], ARGV[1], 'PX', ARGV[2], 'NX')
      if set then
        return token
      else
        return nil
      end
      `,
      1,
      this.leaseKey,
      this.instanceId,
      String(this.leaseTTL)
    )

    if (result) {
      this.fencingToken = result as number
      return this.fencingToken
    }
    return null
  }

  getToken(): number { return this.fencingToken }
}

// In the job processor — include fencing token in every write
async function processJobWithFencing(jobId: string, token: number) {
  // Database enforces: only accept writes if token > last seen token
  await db.query(`
    UPDATE jobs
    SET status = 'complete', processed_by_token = $1
    WHERE id = $2
      AND (processed_by_token IS NULL OR processed_by_token < $1)
  `, [token, jobId])

  // If 0 rows affected → stale leader, another leader already processed this
}

Fencing tokens turn "maybe processed twice" into "definitely processed once." Even if the old leader wins a race to the database, the write is rejected because its token is lower than the new leader's.
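
What the caller should do with the "0 rows affected" outcome can be sketched like this. `QueryResultLike`, `StaleLeaderError`, and `checkFencedWrite` are illustrative names; `QueryResultLike` stands in for pg's query result, of which only `rowCount` matters here:

```typescript
// Illustrative sketch: reacting to the fenced UPDATE above.
interface QueryResultLike { rowCount: number }

class StaleLeaderError extends Error {}

function checkFencedWrite(result: QueryResultLike, jobId: string, token: number): void {
  if (result.rowCount === 0) {
    // Zero rows updated: a leader with a higher token already processed this
    // job. Retrying with the same token can never succeed — step down instead.
    throw new StaleLeaderError(`token ${token} is stale for job ${jobId}`)
  }
}
```

A worker that catches `StaleLeaderError` should stop all leader-only work and re-enter the election, rather than retrying the write.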

Fix 3: Kubernetes Leader Election (Production-Ready)

For Kubernetes deployments, lean on coordination.k8s.io Lease objects instead of Redis. The snippet below sketches a leader-election helper with Go-client-style options; check what your client library version actually ships:

import * as k8s from '@kubernetes/client-node'

async function runWithKubernetesLeaderElection(onLeader: () => void) {
  const kc = new k8s.KubeConfig()
  kc.loadFromDefault()

  const coordinationClient = kc.makeApiClient(k8s.CoordinationV1Api)
  const leaseName = 'my-service-leader'
  const namespace = process.env.POD_NAMESPACE ?? 'default'
  const identity = process.env.HOSTNAME ?? 'unknown'

  const le = new k8s.LeaderElection(coordinationClient)

  await le.run({
    leaseName,
    namespace,
    identity,
    leaseDuration: 15,   // seconds
    renewDeadline: 10,   // must renew within 10s or step down
    retryPeriod: 2,      // retry every 2s
    onStartedLeading: () => {
      console.log(`${identity} became leader`)
      onLeader()
    },
    onStoppedLeading: () => {
      console.log(`${identity} lost leadership — exiting`)
      process.exit(1)  // Let Kubernetes restart the pod
    },
    onNewLeader: (newLeader: string) => {
      console.log(`New leader elected: ${newLeader}`)
    },
  })
}

The Kubernetes Lease resource replaces Redis, and the client-side election loop handles the renewal protocol. The key invariant is renewDeadline < leaseDuration: a node gives up renewing before its lease can expire under it. Note that timed leases bound the stale-leader window but do not eliminate it; pair them with fencing if duplicate work is unacceptable.
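
The relationship between these three parameters can be checked directly. This sketch uses illustrative names, not a real client API (real clients enforce similar invariants internally):

```typescript
// Sanity checks for election timing parameters (illustrative names).
interface ElectionTiming {
  leaseDuration: number  // seconds others wait before taking over
  renewDeadline: number  // seconds the leader keeps trying to renew
  retryPeriod: number    // seconds between renewal attempts
}

function timingErrors(t: ElectionTiming): string[] {
  const errors: string[] = []
  // The leader must give up before its lease can expire under it, or it may
  // keep acting as leader while another node is being elected.
  if (t.renewDeadline >= t.leaseDuration)
    errors.push('renewDeadline must be < leaseDuration')
  // At least a couple of retries should fit inside the renew deadline, so one
  // dropped request does not force an unnecessary step-down.
  if (t.retryPeriod * 2 > t.renewDeadline)
    errors.push('retryPeriod too large for renewDeadline')
  return errors
}

console.log(timingErrors({ leaseDuration: 15, renewDeadline: 10, retryPeriod: 2 }))  // []
```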

Fix 4: etcd for Strong Consistency

If you need linearizable leader election (stronger than a Redis lease, whose safety rests on timing assumptions during partitions and failover), use etcd:

import { Etcd3 } from 'etcd3'

const etcd = new Etcd3({ hosts: 'etcd:2379' })

async function runAsLeader() {
  const election = etcd.election('my-service-leader')

  // campaign() starts campaigning; the 'elected' event fires once this
  // instance becomes leader. Backed by etcd's Raft consensus.
  const campaign = election.campaign(process.env.HOSTNAME ?? 'unknown')

  campaign.on('elected', () => {
    console.log('Became leader!')
    // Do leader-only work here; call campaign.resign() to step down.
  })

  campaign.on('error', (err) => {
    console.error('Lost leadership or error:', err)
    process.exit(1)  // restart and re-campaign
  })

  // Watch for leadership changes from any node
  const observer = await election.observe()
  observer.on('change', (leader) => {
    console.log('Current leader:', leader)
  })
}

runAsLeader().catch((err) => {
  console.error('Election setup failed:', err)
  process.exit(1)
})

etcd elections are backed by the Raft consensus algorithm: two nodes can never hold the election key at the same time, and a partitioned leader's key is deleted when its session lease expires. A stale leader can still act briefly before it notices, so using the election key's revision as a fencing token remains good practice.

The Lease Duration Trade-off

Short lease (5 seconds):

  • Fast failover — a new leader is elected in ~5 seconds
  • More renewal network traffic
  • Noisy — a brief Redis hiccup causes an unnecessary leader change
  • Old leader gets up to 5 seconds in which it might do stale work

Long lease (60 seconds):

  • Robust to transient network issues
  • Less renewal traffic
  • Slow failover — up to 60 seconds of downtime if the leader dies
  • Old leader has up to 60 seconds to do stale work (unless fencing is used)

A good default for most services:

  • leaseTTL: 15 seconds
  • renewInterval: 5 seconds (renew at 1/3 TTL)
  • failover time: ~15 seconds (one full TTL expiry)
  • Add fencing tokens if even 15 seconds of split-brain is unacceptable
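
The worst-case numbers behind these defaults fall out of simple arithmetic (pure computation, constants from the list above):

```typescript
// Worst-case timing for the recommended defaults.
const leaseTTLMs = 15_000
const renewIntervalMs = leaseTTLMs / 3  // 5s — two retries fit before expiry

// Worst-case failover: the leader dies immediately after a successful renewal,
// so the lease blocks a new election for one full TTL.
const worstCaseFailoverMs = leaseTTLMs

// Worst-case stale-work window without fencing: the same full TTL, since the
// old leader may not notice the loss until its next failed renewal.
const worstCaseStaleWorkMs = leaseTTLMs

console.log({ renewIntervalMs, worstCaseFailoverMs, worstCaseStaleWorkMs })
// { renewIntervalMs: 5000, worstCaseFailoverMs: 15000, worstCaseStaleWorkMs: 15000 }
```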

Leader Election Checklist

Risk                                     Solution
Two leaders simultaneously               Fencing tokens + per-operation lease check
Leader dies, no failover                 TTL on lease — auto-expires, enabling a new election
Redis restart → all leases lost          Handle onLostLeadership; shut down gracefully
Leader can't renew but is healthy        Stop all leader work the moment renewal fails
Split-brain during a partition           etcd election or K8s Lease (Raft-backed store), plus fencing
Stale leader finishes a long operation   Fencing token rejected by the database
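
The partition scenario from the timeline at the top can be sanity-checked without any real infrastructure. This in-memory sketch is illustrative: `FakeLeaseStore` stands in for Redis, and "partition" simply means a node cannot reach the store:

```typescript
// In-memory simulation of the partition timeline (illustrative names).
class FakeLeaseStore {
  private holder: string | null = null
  private reachable = new Set<string>()

  allow(node: string) { this.reachable.add(node) }
  partition(node: string) { this.reachable.delete(node) }

  tryAcquireOrRenew(node: string): boolean {
    if (!this.reachable.has(node)) return false  // node can't reach the store
    if (this.holder === null || this.holder === node) {
      this.holder = node
      return true
    }
    return false
  }

  expire() { this.holder = null }  // TTL elapsed without a renewal
}

// Scenario: node1 leads, gets partitioned, lease expires, node2 takes over.
const store = new FakeLeaseStore()
store.allow('node1'); store.allow('node2')

console.log(store.tryAcquireOrRenew('node1'))  // true — node1 is leader
store.partition('node1')
console.log(store.tryAcquireOrRenew('node1'))  // false — must step down NOW
store.expire()
console.log(store.tryAcquireOrRenew('node2'))  // true — node2 is the new leader
```

The invariant under test is the one Fix 1 enforces: a node that cannot renew must report "not leader", even though its own process is perfectly healthy.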

Conclusion

Leader election bugs are silent in development (single node, no partitions) and catastrophic in production (duplicate billing, corrupted state, split-brain). The minimum viable safe implementation has three properties: a short-lived lease that auto-expires, strict step-down when renewal fails (even if the node is otherwise healthy), and fencing tokens on any writes so stale leaders can't corrupt state. For Kubernetes deployments, use the built-in Lease resource rather than rolling your own. For stronger guarantees, use etcd's Raft-based election. Whatever approach you choose, test the network-partition scenario — that's where elections go wrong.