Reliable Background Jobs — Handling Timeouts, Poison Pills, and Job Duplication
Build durable background job systems resistant to timeouts, poison pills, and duplication. Implement heartbeats, deduplication, and observability patterns.
webcoderspeed.com
33 articles
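The description above mentions heartbeats for detecting stalled workers. A minimal in-memory sketch of the pattern (all names here are hypothetical; a real system would keep job state in a database or Redis):

```typescript
// Heartbeat sketch: a worker periodically bumps `lastBeat`; a reaper
// re-queues jobs whose worker has gone silent past the timeout.
type Job = { id: string; status: "running" | "queued"; lastBeat: number };

const jobs = new Map<string, Job>();
const HEARTBEAT_TIMEOUT_MS = 30_000;

function beat(jobId: string, now = Date.now()): void {
  const job = jobs.get(jobId);
  if (job) job.lastBeat = now;
}

// Reaper: a running job with a stale heartbeat is assumed dead and re-queued.
function reapStalled(now = Date.now()): string[] {
  const reQueued: string[] = [];
  for (const job of jobs.values()) {
    if (job.status === "running" && now - job.lastBeat > HEARTBEAT_TIMEOUT_MS) {
      job.status = "queued";
      reQueued.push(job.id);
    }
  }
  return reQueued;
}

// Example: one healthy worker, one that stopped beating a minute ago.
jobs.set("a", { id: "a", status: "running", lastBeat: Date.now() });
jobs.set("b", { id: "b", status: "running", lastBeat: Date.now() - 60_000 });
const stalled = reapStalled();
```

Re-queuing a stalled job means it may run twice, which is why the same article pairs heartbeats with deduplication.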
Users see stale prices. Admins update settings but the old value is served for 10 minutes. You delete a record but it keeps appearing. Cache invalidation is famously hard — and most implementations have subtle bugs that serve wrong data long after the source changed.
You added a circuit breaker to protect against cascading failures. But it never opens — requests keep failing, the downstream service stays overloaded, and your system doesn't recover. Here's why circuit breakers fail silently and how to configure them correctly.
Master ClickHouse for analytics: MergeTree engines, materialized views, CDC from Postgres, dual-write patterns, and query optimization.
Server A issues a JWT. Server B validates it 2 seconds later but thinks the token was issued in the future — invalid. Or a token that should be expired is still accepted because the validating server's clock is 5 minutes behind. Clock skew causes authentication failures and security holes.
You scale your app to 3 instances. Your daily billing cron runs on all 3 simultaneously. 3x the emails, 3x the charges, 3x the chaos. Distributed cron requires distributed locking. Here's how to ensure your scheduled jobs run exactly once across any number of instances.
Your DLQ has 2 million messages. They've been there for 3 months. Nobody noticed. Those are failed orders, unpaid invoices, and unprocessed refunds — silently rotting. Here's how to build a DLQ strategy that's actually monitored, alerted on, and self-healing.
Your message queue delivers an event twice. Your consumer processes it twice. The order ships twice, the email sends twice, the payment charges twice. At-least-once delivery is a guarantee — not a bug. Here's how to build idempotent consumers that handle duplicate events safely.
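The "exactly once across instances" guarantee comes from a lock that only one instance can hold. A sketch of the semantics, using an in-memory map as a stand-in (in production this would be something like Redis `SET key value NX PX ttl`; the key and job names are hypothetical):

```typescript
// In-memory stand-in for a distributed lock: first caller wins,
// lock expires after ttlMs so a crashed holder cannot block forever.
const locks = new Map<string, { owner: string; expiresAt: number }>();

function tryAcquire(key: string, owner: string, ttlMs: number, now = Date.now()): boolean {
  const held = locks.get(key);
  if (held && held.expiresAt > now) return false; // someone else holds it
  locks.set(key, { owner, expiresAt: now + ttlMs });
  return true;
}

function runDailyBilling(instanceId: string): boolean {
  // Every instance tries to grab the lock; only the winner runs the job.
  if (!tryAcquire("cron:daily-billing", instanceId, 60_000)) return false;
  // ... charge customers ...
  return true;
}

// Three instances wake up at the same moment; exactly one wins.
const results = ["i-1", "i-2", "i-3"].map(runDailyBilling);
```

The TTL is the subtle part: too short and the lock expires mid-job, letting a second instance start; too long and a crashed instance blocks the next run.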
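The core of an idempotent consumer is remembering which event ids were already processed, so a redelivery becomes a no-op. A minimal sketch (the event shape and names are hypothetical; in production `processed` would be a database table or Redis set, written together with the side effect):

```typescript
// Idempotent consumer sketch: the side effect runs at most once per eventId.
const processed = new Set<string>();
let shipments = 0;

type OrderEvent = { eventId: string; orderId: string };

function handleOrderCreated(event: OrderEvent): boolean {
  if (processed.has(event.eventId)) return false; // duplicate delivery: skip
  processed.add(event.eventId);
  shipments += 1; // the side effect (ship the order) runs once
  return true;
}

// The queue redelivers the same event.
const evt = { eventId: "evt-42", orderId: "order-7" };
handleOrderCreated(evt);
handleOrderCreated(evt); // no-op
```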
Order created at 10:00. Order cancelled at 10:01. Your consumer processes them in reverse — cancellation arrives first, then creation "succeeds." The order is now in an invalid state. Event ordering bugs are subtle, expensive, and entirely avoidable.
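One common guard against the reordering above is a per-entity version check: drop any event older than the state already held, assuming the producer stamps events with a monotonic version. A sketch with hypothetical names:

```typescript
// Per-entity version guard: a late-arriving "created" (v1) cannot
// overwrite a "cancelled" (v2) that already arrived.
type Order = { status: string; version: number };
const orders = new Map<string, Order>();

function applyEvent(orderId: string, status: string, version: number): boolean {
  const current = orders.get(orderId);
  if (current && version <= current.version) return false; // stale: drop
  orders.set(orderId, { status, version });
  return true;
}

// Cancellation (version 2) arrives before creation (version 1).
applyEvent("o-1", "cancelled", 2);
const applied = applyEvent("o-1", "created", 1); // rejected as stale
```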
Master read-after-write consistency, version vectors, causal consistency tokens, and UI patterns for systems where data is temporarily stale.
User updates their profile. Refreshes the page — old data shows. They update again. Still old data. They're furious. Your system is eventually consistent — but nobody told the user (or the developer who designed the UI). Here's how to manage consistency expectations in distributed systems.
You horizontally scaled your database to 10 shards, but 90% of traffic still hits just one of them. Writes queue, latency spikes, and one node is on fire while the others idle. This is the hot partition problem — and it's all about key design.
Learn idempotency key design, idempotency stores with TTL, request fingerprinting, and CQRS deduplication patterns for safe retries.
Network timeout on a payment request. Client retries. Customer gets charged twice. This is the most expensive bug in fintech — and it's completely preventable with idempotency keys. Here's the complete implementation.
You shard by user ID. 80% of writes go to 20% of shards because your top customers are assigned to the same shards. Or you shard by date and all writes go to the current month's shard. Uneven distribution turns a scaling solution into a bottleneck.
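The idempotency-key mechanics fit in a few lines: the client sends the same key on every retry, and the server stores the first result under that key and replays it for duplicates. A sketch with hypothetical names (in production the store is a database with a TTL, not a Map, and the lookup-and-insert must be atomic):

```typescript
// Idempotency-key sketch: one key, one charge, no matter how many retries.
type ChargeResult = { chargeId: string; amount: number };
const byKey = new Map<string, ChargeResult>();
let charges = 0;

function charge(idempotencyKey: string, amount: number): ChargeResult {
  const prior = byKey.get(idempotencyKey);
  if (prior) return prior; // retry: replay stored result, no new charge
  charges += 1;
  const result = { chargeId: `ch_${charges}`, amount };
  byKey.set(idempotencyKey, result);
  return result;
}

// The request times out and the client retries with the same key.
const first = charge("key-abc", 4999);
const retry = charge("key-abc", 4999);
```

The retry gets back the exact same response as the original request, which is what makes the timeout-then-retry path safe.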
Master consumer groups, offset management, exactly-once semantics, dead-letter queues, and consumer lag monitoring for production Kafka.
Your service elects a leader to run background jobs. The network hiccups for 5 seconds. The old leader thinks it's still leader. The new leader also thinks it's leader. Both start processing the same queue. Now you have duplicate work, corrupted state, and a split-brain.
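A standard defense against the two-leaders scenario is a fencing token: each election hands out a strictly increasing epoch, and the shared resource rejects writes carrying an older one. A sketch with hypothetical names:

```typescript
// Fencing-token sketch: a deposed leader that still thinks it is in
// charge gets rejected, because its token is lower than the newest one.
let epoch = 0;
function electLeader(): number {
  return ++epoch; // each new leader gets a higher token than any predecessor
}

let highestSeen = 0;
const accepted: number[] = [];
function writeToQueue(token: number, item: number): boolean {
  if (token < highestSeen) return false; // stale leader: fenced off
  highestSeen = token;
  accepted.push(item);
  return true;
}

const oldLeader = electLeader(); // epoch 1
const newLeader = electLeader(); // epoch 2 after the network heals
writeToQueue(newLeader, 100);            // accepted
const staleWrite = writeToQueue(oldLeader, 200); // rejected
```

The key point: the resource being written enforces the fence, so safety does not depend on the old leader noticing it was deposed.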
Master load balancer algorithms for distributing traffic. Learn round-robin limitations, connection-aware routing, consistent hashing, and session affinity patterns.
Master Kafka partition keys, FIFO queues, sequence numbers, global vs per-entity ordering, and when ordering isn't worth the cost.
Your queue has 50 million unprocessed messages. Consumers are processing 1,000/second. New messages arrive at 5,000/second. The backlog will never drain. Here's how queue backlogs form, why they're dangerous, and the patterns to prevent and recover from them.
In distributed systems, failure is never all-or-nothing. A service returns a response — but it's corrupt. An API call times out — but the action already executed. A message is delivered — but the reply never arrives. This is partial failure, and it is the hardest problem in distributed systems.
Two requests check inventory simultaneously — both see 1 item in stock. Both proceed to purchase. You ship 2 items from 1. Race conditions in distributed systems are subtler than single-process races because you can't use mutexes across services. Here's how to prevent them.
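Without cross-service mutexes, the usual fix is optimistic concurrency: each purchase carries the stock version it read, and the write applies only if that version is unchanged (in SQL terms, roughly `UPDATE ... WHERE version = :read_version`). A sketch with hypothetical names:

```typescript
// Compare-and-set sketch: two racing purchases of the last item,
// only the first one whose version check passes succeeds.
let stock = { count: 1, version: 0 };

function purchase(readVersion: number): boolean {
  if (stock.version !== readVersion || stock.count < 1) return false;
  stock = { count: stock.count - 1, version: stock.version + 1 };
  return true;
}

// Both requests read version 0, then race to buy the single item.
const v = stock.version;
const first = purchase(v);  // wins: version still 0
const second = purchase(v); // loses: version has moved to 1
```

The loser gets a clean failure it can surface as "out of stock" or retry against fresh state, instead of silently overselling.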
User saves their profile. Page reloads. Shows old data. They save again — same thing. The write went to the primary. The read came from the replica. The replica is 2 seconds behind. Read-after-write consistency is the hardest problem with read replicas.
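One pattern for this is read-your-writes with a consistency token: the client keeps the version of its last write and a read is served from the replica only if the replica has caught up, otherwise it falls back to the primary. A sketch with hypothetical names:

```typescript
// Read-your-writes sketch: the write returns a version token; reads
// below that version must not be served from the lagging replica.
type Profile = { name: string; version: number };
let primary: Profile = { name: "old", version: 1 };
let replica: Profile = { name: "old", version: 1 }; // lagging copy

function write(name: string): number {
  primary = { name, version: primary.version + 1 };
  return primary.version; // client keeps this as its consistency token
}

function read(minVersion: number): Profile {
  return replica.version >= minVersion ? replica : primary;
}

const token = write("new"); // replica has not replicated yet
const seen = read(token);   // falls back to primary: user sees their write
```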
Understand Redis Cluster architecture, consistent hashing, CROSSSLOT errors, hot slot detection, replication, and monitoring for production deployments.
Your service is degraded, returning errors 30% of the time. Smart clients with retry logic start hammering it — 3 retries each means 3x the load on an already failing system. The retry storm amplifies the original failure until full collapse. Here's how to retry safely.
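The standard antidote to retry storms is exponential backoff with jitter: delays double per attempt up to a cap, and each delay is randomized so retries from many clients spread out instead of arriving in synchronized waves. A sketch (parameter defaults are illustrative, not prescriptive):

```typescript
// "Full jitter" backoff: pick a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,                        // 0-based retry attempt
  baseMs = 100,
  capMs = 10_000,
  random: () => number = Math.random,     // injectable for testing
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exp);
}

// With random pinned to 1, the uncapped growth is visible: 100, 200, 400, 800 ms.
const delays = [0, 1, 2, 3].map((a) => backoffDelayMs(a, 100, 10_000, () => 1));
```

Backoff alone is not enough: a retry budget (give up after a few attempts) keeps clients from adding any load at all once the service is clearly down.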
Master choreography and orchestration sagas, compensation transactions, and failure handling for distributed transaction management without 2PC.
You rename a column. The new service version uses the new name. The old version, still running during the rolling deploy, tries to use the old name. Database error. The migration that passed all your tests breaks production because both old and new code run simultaneously during deployment.
You split into microservices but all of them share the same PostgreSQL database. You have the operational overhead of microservices with none of the independent scalability. A schema migration blocks all teams. A bad query in Service A slows down Service B.
Network partition splits your 3-node cluster into two halves. Both halves think they're the primary. Both accept writes. Network heals. You have two diverged databases with conflicting data. This is split brain — one of the most dangerous failure modes in distributed systems.
Every operation is a synchronous HTTP call. User signup calls email service, which calls template service, which calls asset service. Any service down means signup is down. Any service slow means signup is slow. Synchronous coupling is the enemy of resilience.
Master Temporal.io to build resilient workflows with automatic retries, durability, and orchestration that survive infrastructure failures.
Service A calls Service B synchronously. Service B calls Service C. Service C calls Service A. Now a deploy to any of them requires coordinating all three. A bug in Service B takes down Services A and C. This isn't microservices — it's a distributed monolith.
Your server is in UTC. Your database is in UTC. Your cron job runs at "9 AM" — but 9 AM where? Customer in Tokyo and customer in New York both get charged at your server's 9 AM. Your "end of day" reports include data from tomorrow. Timezone bugs are invisible until they're expensive.