Distributed Locking — Redis Redlock, Database Locks, and When You Actually Need Them

Introduction

Distributed locks prevent concurrent access to shared resources across multiple servers. They're seductive—they feel safe—but they're also a common source of subtle bugs. A network partition can leave a lock held forever. A crashed process never releases its lock. This post covers the tools (Redis, PostgreSQL) and when you actually need them. Spoiler: often you don't.

Redis SET NX PX: Simple Distributed Lock

The simplest distributed lock uses Redis SET with NX (only if not exists) and PX (expiration).

import Redis from 'ioredis';
import { randomUUID as uuid } from 'crypto';

// Small helper used by the retry loops below
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class RedisLock {
  constructor(private redis: Redis, private lockTimeout = 30000) {}

  async acquire(key: string): Promise<string | null> {
    const token = uuid();
    const result = await this.redis.set(
      `lock:${key}`,
      token,
      'NX', // Only set if not exists
      'PX', // Milliseconds
      this.lockTimeout
    );

    return result === 'OK' ? token : null;
  }

  async release(key: string, token: string): Promise<boolean> {
    // Use Lua script to ensure atomic check-and-delete
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;

    const result = await this.redis.eval(script, 1, `lock:${key}`, token);
    return result === 1;
  }

  async withLock<T>(key: string, fn: () => Promise<T>, retries = 3): Promise<T> {
    for (let i = 0; i < retries; i++) {
      const token = await this.acquire(key);
      if (token) {
        try {
          return await fn();
        } finally {
          await this.release(key, token);
        }
      }
      // Exponential backoff with jitter; no need to sleep after the final attempt
      if (i < retries - 1) {
        await sleep(Math.pow(2, i) * 100 + Math.random() * 100);
      }
    }
    throw new Error(`Failed to acquire lock: ${key}`);
  }
}

Pitfalls:

  • If the client crashes, the lock expires after the timeout (good), but the resource stays locked until then, and any in-flight work can race with the next holder
  • Network partition: the client thinks it released the lock, but Redis never received the DEL; the lock lingers until its TTL expires
  • Process pauses (GC, long context switches) can let the lock expire while the critical section is still running

Redlock Algorithm

Redlock uses multiple independent Redis instances to improve safety: the lock counts as held only when acquired on a majority of nodes (typically 5 nodes, so a quorum of 3).

class RedLock {
  private locks: Redis[];
  private lockTimeout = 30000;
  private clockDrift = 200; // Clock drift tolerance (ms)

  constructor(redisInstances: Redis[]) {
    this.locks = redisInstances;
    if (redisInstances.length % 2 === 0) {
      console.warn('Redlock works best with an odd number of Redis instances');
    }
  }

  async acquire(key: string, retries = 3): Promise<string | null> {
    const token = uuid();
    const quorum = Math.floor(this.locks.length / 2) + 1;

    for (let attempt = 0; attempt < retries; attempt++) {
      const startTime = Date.now();
      let successCount = 0;

      // Try to acquire the lock on all instances; SET ... NX returns null
      // when the key already exists, so only count actual 'OK' replies
      const promises = this.locks.map(lock =>
        lock
          .set(`lock:${key}`, token, 'NX', 'PX', this.lockTimeout)
          .then(result => {
            if (result === 'OK') successCount++;
          })
          .catch(() => {}) // Ignore individual instance failures
      );

      await Promise.all(promises);

      const elapsed = Date.now() - startTime;
      const lockValidityTime = this.lockTimeout - elapsed - this.clockDrift;

      // Did we get a quorum?
      if (successCount >= quorum && lockValidityTime > 0) {
        return token;
      }

      // Release all locks we acquired
      await this.releaseAll(key, token);

      // Backoff before retry
      const backoff = Math.random() * Math.pow(2, attempt) * 100;
      await sleep(backoff);
    }

    return null;
  }

  async release(key: string, token: string): Promise<number> {
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;

    // Release from all instances
    const results = await Promise.all(
      this.locks.map(lock => lock.eval(script, 1, `lock:${key}`, token).catch(() => 0))
    );

    return results.reduce((sum, r) => sum + r, 0);
  }

  private async releaseAll(key: string, token: string): Promise<void> {
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;

    await Promise.all(
      this.locks.map(lock => lock.eval(script, 1, `lock:${key}`, token).catch(() => {}))
    );
  }

  async withLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const token = await this.acquire(key);
    if (!token) {
      throw new Error(`Failed to acquire lock: ${key}`);
    }

    try {
      return await fn();
    } finally {
      await this.release(key, token);
    }
  }
}

Why 5 nodes?

  • Quorum = 3 (majority)
  • Can tolerate 2 failures
  • Even with 1 node down for maintenance, you keep running
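The quorum arithmetic generalizes to any cluster size; a minimal helper (illustrative, not part of the RedLock class above):

```typescript
// Majority quorum for an N-node Redlock cluster: floor(N / 2) + 1 nodes
// must grant the lock, which leaves N - quorum nodes that can fail.
function redlockQuorum(nodes: number): { quorum: number; tolerates: number } {
  const quorum = Math.floor(nodes / 2) + 1;
  return { quorum, tolerates: nodes - quorum };
}

console.log(redlockQuorum(5)); // { quorum: 3, tolerates: 2 }
console.log(redlockQuorum(3)); // { quorum: 2, tolerates: 1 }
```

Note that going from 3 to 4 nodes buys nothing: the quorum rises to 3 but you still tolerate only 1 failure, which is why odd cluster sizes are recommended.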

Lock Renewal for Long Operations

If your operation takes longer than the lock timeout, renew the lock.

class RenewableLock {
  constructor(private redis: Redis) {}

  async withLock<T>(
    key: string,
    fn: () => Promise<T>,
    lockTimeout = 30000,
    renewInterval = 10000
  ): Promise<T> {
    const token = uuid();
    let isLocked = true;
    let renewalError: Error | null = null;

    const acquire = await this.redis.set(
      `lock:${key}`,
      token,
      'NX',
      'PX',
      lockTimeout
    );
    if (acquire !== 'OK') {
      throw new Error(`Failed to acquire lock: ${key}`);
    }

    // Start renewal interval
    const renewalHandle = setInterval(async () => {
      try {
        const script = `
          if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("pexpire", KEYS[1], ARGV[2])
          else
            return 0
          end
        `;
        const result = await this.redis.eval(script, 1, `lock:${key}`, token, lockTimeout);
        if (result === 0) {
          renewalError = new Error('Lock lost during renewal');
          isLocked = false;
        }
      } catch (error) {
        renewalError = error as Error;
        isLocked = false;
      }
    }, renewInterval);

    try {
      const result = await fn();

      // Check if lock was lost during operation
      if (!isLocked || renewalError) {
        throw renewalError || new Error('Lock was lost during operation');
      }

      return result;
    } finally {
      clearInterval(renewalHandle);
      await this.release(key, token);
    }
  }

  private async release(key: string, token: string): Promise<void> {
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;
    await this.redis.eval(script, 1, `lock:${key}`, token);
  }
}

Fencing Tokens for Safety

After a lock expires, another client acquires it and starts work. The first client (which was paused) resumes and makes changes. Fencing tokens prevent this: include a monotonically increasing token with each operation.

class FencedLock {
  constructor(private redis: Redis) {}

  async acquire(key: string): Promise<{ token: string; fence: number } | null> {
    const token = uuid();
    const script = `
      local fence = redis.call("incr", KEYS[1] .. ":fence")
      local result = redis.call("set", KEYS[1], ARGV[1], "NX", "PX", ARGV[2])
      -- SET returns a status reply (truthy table) on success, false on failure,
      -- so test truthiness rather than comparing to the string "OK"
      if result then
        return {ARGV[1], fence}
      else
        return nil
      end
    `;

    const result = await this.redis.eval(
      script,
      1,
      `lock:${key}`,
      token,
      30000
    );

    if (result) {
      return { token: result[0], fence: result[1] };
    }
    return null;
  }

  async executeWithFence<T>(
    key: string,
    fn: (fence: number) => Promise<T>
  ): Promise<T> {
    const lock = await this.acquire(key);
    if (!lock) {
      throw new Error(`Failed to acquire lock: ${key}`);
    }

    try {
      // Pass fence token to the operation
      // The operation must use this token to guard state mutations
      return await fn(lock.fence);
    } finally {
      await this.release(key, lock.token);
    }
  }

  private async release(key: string, token: string): Promise<void> {
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;
    await this.redis.eval(script, 1, `lock:${key}`, token);
  }
}

// Usage with fenced operations
class Account {
  async transfer(fromAccountId: string, toAccountId: string, amount: number): Promise<void> {
    await lock.executeWithFence(`transfer:${fromAccountId}`, async fence => {
      const account = await db.query(
        `SELECT balance, fence FROM accounts WHERE id = $1 FOR UPDATE`,
        [fromAccountId]
      );

      if (account.rows[0].fence >= fence) {
        throw new Error('Lock lost; another process modified this account');
      }

      const newBalance = account.rows[0].balance - amount;
      if (newBalance < 0) {
        throw new Error('Insufficient funds');
      }

      await db.query(
        `UPDATE accounts SET balance = $1, fence = $2 WHERE id = $3`,
        [newBalance, fence, fromAccountId]
      );

      await db.query(
        `UPDATE accounts SET balance = balance + $1 WHERE id = $2`,
        [amount, toAccountId]
      );
    });
  }
}

PostgreSQL Advisory Locks

For resources within a single database, advisory locks are simpler and safer than distributed locks.

import { Pool } from 'pg';

class PostgreSQLAdvisoryLock {
  constructor(private pool: Pool) {}

  async withLock<T>(
    lockId: number,
    fn: () => Promise<T>,
    shared = false
  ): Promise<T> {
    const conn = await this.pool.connect();

    try {
      // Transaction-scoped advisory locks require an open transaction:
      // without BEGIN, each statement auto-commits and the lock would
      // release immediately. pg_advisory_xact_lock blocks until acquired.
      await conn.query('BEGIN');
      const lockFunc = shared ? 'pg_advisory_xact_lock_shared' : 'pg_advisory_xact_lock';
      await conn.query(`SELECT ${lockFunc}($1)`, [lockId]);

      const result = await fn();
      await conn.query('COMMIT'); // Lock auto-releases at transaction end
      return result;
    } catch (error) {
      await conn.query('ROLLBACK'); // Also releases the lock
      throw error;
    } finally {
      conn.release();
    }
  }

  async withTryLock<T>(lockId: number, fn: () => Promise<T>): Promise<T | null> {
    const conn = await this.pool.connect();

    try {
      // pg_try_advisory_lock returns immediately instead of blocking
      const result = await conn.query(`SELECT pg_try_advisory_lock($1) AS locked`, [lockId]);
      if (!result.rows[0].locked) {
        return null; // Lock not acquired
      }

      try {
        return await fn();
      } finally {
        // pg_try_advisory_lock takes a session-level lock, which survives
        // transactions; release it before the connection returns to the pool
        await conn.query(`SELECT pg_advisory_unlock($1)`, [lockId]);
      }
    } finally {
      conn.release();
    }
  }
}
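One practical wrinkle: advisory locks key on integers, not strings, so resource names need a stable mapping to a lock id. A sketch using FNV-1a (an arbitrary choice on my part; hashing inside Postgres with hashtext() also works):

```typescript
// Map a string resource name to a signed 32-bit advisory lock id (Postgres
// int4) using the FNV-1a hash. Collisions are possible but rare; two
// colliding names simply contend on the same lock (safe, just slower).
function advisoryLockId(resource: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < resource.length; i++) {
    hash ^= resource.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit wrapping multiply
  }
  return hash | 0; // Force into the signed 32-bit range
}

// Hypothetical usage with the class above:
// await advisoryLock.withLock(advisoryLockId('billing:acct-42'), fn);
```

Whatever hash you pick, every service touching the resource must use the same one, or the locks silently stop excluding each other.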

SELECT FOR UPDATE SKIP LOCKED for Job Queues

Instead of explicit locks, let the database do the claiming: SELECT ... FOR UPDATE SKIP LOCKED locks a row for the claiming worker while other workers skip it instead of blocking.

class JobQueue {
  async dequeueJob(workerId: string): Promise<Job | null> {
    const result = await this.db.query(
      `UPDATE jobs
       SET worker_id = $1, status = 'processing', claimed_at = NOW()
       WHERE id = (
         SELECT id FROM jobs
         WHERE status = 'pending'
         ORDER BY priority DESC, created_at ASC
         FOR UPDATE SKIP LOCKED
         LIMIT 1
       )
       RETURNING id, payload, attempt_count`,
      [workerId]
    );

    return result.rows[0] || null;
  }

  async completeJob(jobId: string): Promise<void> {
    await this.db.query(
      `UPDATE jobs SET status = 'completed', completed_at = NOW() WHERE id = $1`,
      [jobId]
    );
  }

  async failJob(jobId: string, error: string): Promise<void> {
    // One atomic UPDATE avoids a read-modify-write race: requeue the job
    // until its attempts are exhausted, then mark it failed
    await this.db.query(
      `UPDATE jobs
       SET attempt_count = attempt_count + 1,
           error = $1,
           status = CASE
             WHEN attempt_count + 1 >= max_attempts THEN 'failed'
             ELSE 'pending'
           END
       WHERE id = $2`,
      [error, jobId]
    );
  }
}
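SKIP LOCKED handles contention, but a worker that dies mid-job leaves its row stuck in 'processing'. A periodic sweeper closes that gap; a sketch assuming the claimed_at column from dequeueJob, with a staleness threshold you tune to your job lengths:

```typescript
// Minimal interface matching the slice of pg.Pool used here (keeps the
// sketch self-contained; swap in the real Pool in production).
interface Pool {
  query(sql: string, params: unknown[]): Promise<{ rowCount: number | null }>;
}

// Hypothetical sweeper for the JobQueue above: requeue jobs claimed longer
// ago than `staleAfter`, whose workers presumably died. Run it periodically
// (cron, setInterval). The single UPDATE is atomic, so concurrent sweepers
// are harmless.
class StaleJobSweeper {
  constructor(private db: Pool, private staleAfter = '5 minutes') {}

  async requeueStaleJobs(): Promise<number> {
    const result = await this.db.query(
      `UPDATE jobs
       SET status = 'pending', worker_id = NULL,
           attempt_count = attempt_count + 1
       WHERE status = 'processing'
         AND claimed_at < NOW() - $1::interval`,
      [this.staleAfter]
    );
    return result.rowCount ?? 0; // Number of jobs requeued
  }
}
```

Incrementing attempt_count on reclaim matters: a job that repeatedly kills its workers should eventually hit max_attempts rather than crash-loop forever.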

Lock-Free Alternatives: Optimistic Concurrency

Often you don't need locks. Use version numbers and retry on conflict.

class OptimisticConcurrency {
  async updateAccount(accountId: string, updates: { amount: number }): Promise<void> {
    let retries = 0;
    const maxRetries = 5;

    while (retries < maxRetries) {
      // Read current version
      const current = await this.db.query(
        `SELECT version, balance FROM accounts WHERE id = $1`,
        [accountId]
      );

      if (current.rows.length === 0) {
        throw new Error(`Account ${accountId} not found`);
      }

      const { version, balance } = current.rows[0];

      // Apply updates
      const newBalance = balance + updates.amount;

      // Try to update only if version hasn't changed
      const updateResult = await this.db.query(
        `UPDATE accounts SET balance = $1, version = version + 1
         WHERE id = $2 AND version = $3`,
        [newBalance, accountId, version]
      );

      if (updateResult.rowCount === 1) {
        return; // Success
      }

      // Version mismatch; retry
      retries++;
      await sleep(Math.pow(2, retries) * 10 + Math.random() * 10);
    }

    throw new Error(`Failed to update account after ${maxRetries} retries`);
  }
}

When Distributed Locks Are Wrong

Distributed locks add latency and complexity. Consider alternatives:

  1. Caching: Use cache-aside pattern with weak consistency
  2. Partitioning: Ensure no two servers handle the same resource
  3. Optimistic concurrency: Accept conflicts, retry on mismatch
  4. Work stealing: ONE server owns a resource; if it dies, another steals work
  5. Sharding: Route by customer/resource to a single owner
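Options 2 and 5 both boil down to deterministic routing: if every request for a given resource lands on the same owner, no cross-node lock is needed. A sketch of hash-based routing (the node list and hash are illustrative; production systems usually use consistent hashing so membership changes don't reshuffle every key):

```typescript
// Route a resource key to one owner from a fixed node list. As long as all
// clients share the same list and the same hash, a given resource is only
// ever handled by one node, so mutual exclusion falls out for free.
function ownerFor(resourceKey: string, nodes: string[]): string {
  let hash = 0;
  for (let i = 0; i < resourceKey.length; i++) {
    // Classic polynomial string hash, truncated to 32 bits
    hash = (Math.imul(hash, 31) + resourceKey.charCodeAt(i)) | 0;
  }
  return nodes[Math.abs(hash) % nodes.length];
}

const nodes = ['worker-1', 'worker-2', 'worker-3'];
// Same key always routes to the same node:
// ownerFor('customer:42', nodes) === ownerFor('customer:42', nodes)
```

The trade-off is availability: when an owner dies, its resources are unhandled until routing is updated, which is where the work-stealing option comes in.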

Checklist

  • Understand your locking requirement: are you preventing races or just coordinating?
  • Use database locks for single-database operations
  • Use optimistic concurrency unless strong mutual exclusion is required
  • If using Redis locks, implement token-based release (not just TTL)
  • For long operations, implement renewal or use fencing tokens
  • Test lock failures: crashed processes, network partitions, clock skew
  • Monitor lock contention and timeout rates
  • Avoid nested locks (deadlock risk)
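The monitoring bullet is easy to skip in practice; a minimal sketch of what to track (names are illustrative, and you'd wire this into acquire/release and export it however your metrics stack expects):

```typescript
// Illustrative contention metrics for a lock like RedisLock above: count
// acquisition attempts, failures, and hold times so you can alert when
// contention or hold duration starts climbing.
class LockMetrics {
  private attempts = 0;
  private failures = 0;
  private releases = 0;
  private totalHoldMs = 0;

  recordAttempt(acquired: boolean): void {
    this.attempts++;
    if (!acquired) this.failures++;
  }

  recordRelease(heldMs: number): void {
    this.releases++;
    this.totalHoldMs += heldMs;
  }

  snapshot(): { attempts: number; failureRate: number; avgHoldMs: number } {
    return {
      attempts: this.attempts,
      failureRate: this.attempts ? this.failures / this.attempts : 0,
      avgHoldMs: this.releases ? this.totalHoldMs / this.releases : 0,
    };
  }
}
```

A rising failure rate means contention; an average hold time creeping toward your lock timeout means expiry races are coming.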

Conclusion

Distributed locks feel like a safety net, but they have holes. For single-database operations, use advisory locks. For cross-service coordination, ask if you really need mutual exclusion—often optimistic concurrency or work partitioning is simpler and faster. If you do use locks, include tokens and test failure scenarios ruthlessly.