Distributed-systems

33 articles

backend6 min read

Circuit Breaker Not Triggering — When Your Safety Net Has Holes

You added a circuit breaker to protect against cascading failures. But it never opens — requests keep failing, the downstream service stays overloaded, and your system doesn''t recover. Here''s why circuit breakers fail silently and how to configure them correctly.

Read →
backend6 min read

Clock Skew Breaking Tokens — When Servers Disagree on What Time It Is

Server A issues a JWT. Server B validates it 2 seconds later but thinks the token was issued in the future — invalid. Or a token that should be expired is still accepted because the validating server''s clock is 5 minutes behind. Clock skew causes authentication failures and security holes.

Read →
backend6 min read

Dead Letter Queue Ignored for Months — The Silent Data Graveyard

Your DLQ has 2 million messages. They''ve been there for 3 months. Nobody noticed. Those are failed orders, unpaid invoices, and unprocessed refunds — silently rotting. Here''s how to build a DLQ strategy that''s actually monitored, alerting, and self-healing.

Read →
backend7 min read

Event Ordering Problem — When Events Arrive Out of Sequence

Order created at 10:00. Order cancelled at 10:01. Your consumer processes them in reverse — cancellation arrives first, then creation "succeeds." The order is now in an invalid state. Event ordering bugs are subtle, expensive, and entirely avoidable.

Read →
backend6 min read

Inconsistent Reads — The Eventual Consistency Shock

User updates their profile. Refreshes the page — old data shows. They update again. Still old data. They''re furious. Your system is eventually consistent — but nobody told the user (or the developer who designed the UI). Here''s how to manage consistency expectations in distributed systems.

Read →
backend6 min read

Partial Failure Between Services — When Half Your System Lies

In distributed systems, failure is never all-or-nothing. A service returns a response — but it''s corrupt. An API call times out — but the action already executed. A message is delivered — but the reply never arrives. This is partial failure, and it is the hardest problem in distributed systems.

Read →
backend5 min read

Read Replica Lag — Why Your Users See Stale Data After Saving

User saves their profile. Page reloads. Shows old data. They save again — same thing. The write went to the primary. The read came from the replica. The replica is 2 seconds behind. Read-after-write consistency is the hardest problem with read replicas.

Read →
backend4 min read

Shared Database Across Services — The Hidden Monolith

You split into microservices but all of them share the same PostgreSQL database. You have the operational overhead of microservices with none of the independent scalability. A schema migration blocks all teams. A bad query in Service A slows down Service B.

Read →
backend6 min read

Timezone Bugs in Distributed Systems — When 9 AM Means Different Things

Your server is in UTC. Your database is in UTC. Your cron job runs at "9 AM" — but 9 AM where? Customer in Tokyo and customer in New York both get charged at your server''s 9 AM. Your "end of day" reports include data from tomorrow. Timezone bugs are invisible until they''re expensive.

Read →