Reliable Background Jobs — Handling Timeouts, Poison Pills, and Job Duplication
Build durable background job systems resistant to timeouts, poison pills, and duplication. Implement heartbeats, deduplication, and observability patterns.
webcoderspeed.com
33 articles
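The description above mentions heartbeats for detecting stalled workers. A minimal in-memory sketch of the pattern (all names here are hypothetical; a real system would keep job state in a database or Redis):

```typescript
// Heartbeat sketch: a worker periodically bumps `lastBeat`; a reaper
// re-queues jobs whose worker has gone silent past the timeout.
type Job = { id: string; status: "running" | "queued"; lastBeat: number };

const jobs = new Map<string, Job>();
const HEARTBEAT_TIMEOUT_MS = 30_000;

function beat(jobId: string, now = Date.now()): void {
  const job = jobs.get(jobId);
  if (job) job.lastBeat = now;
}

// Reaper: a running job with a stale heartbeat is assumed dead and re-queued.
function reapStalled(now = Date.now()): string[] {
  const reQueued: string[] = [];
  for (const job of jobs.values()) {
    if (job.status === "running" && now - job.lastBeat > HEARTBEAT_TIMEOUT_MS) {
      job.status = "queued";
      reQueued.push(job.id);
    }
  }
  return reQueued;
}

// Example: one healthy worker, one that stopped beating a minute ago.
jobs.set("a", { id: "a", status: "running", lastBeat: Date.now() });
jobs.set("b", { id: "b", status: "running", lastBeat: Date.now() - 60_000 });
const stalled = reapStalled();
```

Re-queuing a stalled job means it may run twice, which is why the same article pairs heartbeats with deduplication.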
Users see stale prices. Admins update settings but the old value is served for 10 minutes. You delete a record but it keeps appearing. Cache invalidation is famously hard — and most implementations have subtle bugs that serve wrong data long after the source changed.
You added a circuit breaker to protect against cascading failures. But it never opens — requests keep failing, the downstream service stays overloaded, and your system doesn't recover. Here's why circuit breakers fail silently and how to configure them correctly.
Master ClickHouse for analytics: MergeTree engines, materialized views, CDC from Postgres, dual-write patterns, and query optimization.
Server A issues a JWT. Server B validates it 2 seconds later but thinks the token was issued in the future — invalid. Or a token that should be expired is still accepted because the validating server's clock is 5 minutes behind. Clock skew causes authentication failures and security holes.
You scale your app to 3 instances. Your daily billing cron runs on all 3 simultaneously. 3x the emails, 3x the charges, 3x the chaos. Distributed cron requires distributed locking. Here's how to ensure your scheduled jobs run exactly once across any number of instances.
Your DLQ has 2 million messages. They've been there for 3 months. Nobody noticed. Those are failed orders, unpaid invoices, and unprocessed refunds — silently rotting. Here's how to build a DLQ strategy that's actually monitored, alerted on, and self-healing.
Your message queue delivers an event twice. Your consumer processes it twice. The order ships twice, the email sends twice, the payment charges twice. At-least-once delivery is a guarantee — not a bug. Here's how to build idempotent consumers that handle duplicate events safely.
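The "exactly once across instances" guarantee comes from a lock that only one instance can hold. A sketch of the semantics, using an in-memory map as a stand-in (in production this would be something like Redis `SET key value NX PX ttl`; the key and job names are hypothetical):

```typescript
// In-memory stand-in for a distributed lock: first caller wins,
// lock expires after ttlMs so a crashed holder cannot block forever.
const locks = new Map<string, { owner: string; expiresAt: number }>();

function tryAcquire(key: string, owner: string, ttlMs: number, now = Date.now()): boolean {
  const held = locks.get(key);
  if (held && held.expiresAt > now) return false; // someone else holds it
  locks.set(key, { owner, expiresAt: now + ttlMs });
  return true;
}

function runDailyBilling(instanceId: string): boolean {
  // Every instance tries to grab the lock; only the winner runs the job.
  if (!tryAcquire("cron:daily-billing", instanceId, 60_000)) return false;
  // ... charge customers ...
  return true;
}

// Three instances wake up at the same moment; exactly one wins.
const results = ["i-1", "i-2", "i-3"].map(runDailyBilling);
```

The TTL is the subtle part: too short and the lock expires mid-job, letting a second instance start; too long and a crashed instance blocks the next run.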
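The core of an idempotent consumer is remembering which event ids were already processed, so a redelivery becomes a no-op. A minimal sketch (the event shape and names are hypothetical; in production `processed` would be a database table or Redis set, written together with the side effect):

```typescript
// Idempotent consumer sketch: the side effect runs at most once per eventId.
const processed = new Set<string>();
let shipments = 0;

type OrderEvent = { eventId: string; orderId: string };

function handleOrderCreated(event: OrderEvent): boolean {
  if (processed.has(event.eventId)) return false; // duplicate delivery: skip
  processed.add(event.eventId);
  shipments += 1; // the side effect (ship the order) runs once
  return true;
}

// The queue redelivers the same event.
const evt = { eventId: "evt-42", orderId: "order-7" };
handleOrderCreated(evt);
handleOrderCreated(evt); // no-op
```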
Order created at 10:00. Order cancelled at 10:01. Your consumer processes them in reverse — cancellation arrives first, then creation "succeeds." The order is now in an invalid state. Event ordering bugs are subtle, expensive, and entirely avoidable.
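One common guard against the reordering above is a per-entity version check: drop any event older than the state already held, assuming the producer stamps events with a monotonic version. A sketch with hypothetical names:

```typescript
// Per-entity version guard: a late-arriving "created" (v1) cannot
// overwrite a "cancelled" (v2) that already arrived.
type Order = { status: string; version: number };
const orders = new Map<string, Order>();

function applyEvent(orderId: string, status: string, version: number): boolean {
  const current = orders.get(orderId);
  if (current && version <= current.version) return false; // stale: drop
  orders.set(orderId, { status, version });
  return true;
}

// Cancellation (version 2) arrives before creation (version 1).
applyEvent("o-1", "cancelled", 2);
const applied = applyEvent("o-1", "created", 1); // rejected as stale
```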
Master read-after-write consistency, version vectors, causal consistency tokens, and UI patterns for systems where data is temporarily stale.
User updates their profile. Refreshes the page — old data shows. They update again. Still old data. They're furious. Your system is eventually consistent — but nobody told the user (or the developer who designed the UI). Here's how to manage consistency expectations in distributed systems.
You horizontally scaled your database to 10 shards, but 90% of traffic still hits just one of them. Writes queue, latency spikes, and one node is on fire while the others idle. This is the hot partition problem — and it's all about key design.
Learn idempotency key design, idempotency stores with TTL, request fingerprinting, and CQRS deduplication patterns for safe retries.
Network timeout on a payment request. Client retries. Customer gets charged twice. This is the most expensive bug in fintech — and it's completely preventable with idempotency keys. Here's the complete implementation.
You shard by user ID. 80% of writes go to 20% of shards because your top customers are assigned to the same shards. Or you shard by date and all writes go to the current month's shard. Uneven distribution turns a scaling solution into a bottleneck.
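The idempotency-key mechanics fit in a few lines: the client sends the same key on every retry, and the server stores the first result under that key and replays it for duplicates. A sketch with hypothetical names (in production the store is a database with a TTL, not a Map, and the lookup-and-insert must be atomic):

```typescript
// Idempotency-key sketch: one key, one charge, no matter how many retries.
type ChargeResult = { chargeId: string; amount: number };
const byKey = new Map<string, ChargeResult>();
let charges = 0;

function charge(idempotencyKey: string, amount: number): ChargeResult {
  const prior = byKey.get(idempotencyKey);
  if (prior) return prior; // retry: replay stored result, no new charge
  charges += 1;
  const result = { chargeId: `ch_${charges}`, amount };
  byKey.set(idempotencyKey, result);
  return result;
}

// The request times out and the client retries with the same key.
const first = charge("key-abc", 4999);
const retry = charge("key-abc", 4999);
```

The retry gets back the exact same response as the original request, which is what makes the timeout-then-retry path safe.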
Master consumer groups, offset management, exactly-once semantics, dead-letter queues, and consumer lag monitoring for production Kafka.
Your service elects a leader to run background jobs. The network hiccups for 5 seconds. The old leader thinks it's still leader. The new leader also thinks it's leader. Both start processing the same queue. Now you have duplicate work, corrupted state, and a split-brain.
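A standard defense against the two-leaders scenario is a fencing token: each election hands out a strictly increasing epoch, and the shared resource rejects writes carrying an older one. A sketch with hypothetical names:

```typescript
// Fencing-token sketch: a deposed leader that still thinks it is in
// charge gets rejected, because its token is lower than the newest one.
let epoch = 0;
function electLeader(): number {
  return ++epoch; // each new leader gets a higher token than any predecessor
}

let highestSeen = 0;
const accepted: number[] = [];
function writeToQueue(token: number, item: number): boolean {
  if (token < highestSeen) return false; // stale leader: fenced off
  highestSeen = token;
  accepted.push(item);
  return true;
}

const oldLeader = electLeader(); // epoch 1
const newLeader = electLeader(); // epoch 2 after the network heals
writeToQueue(newLeader, 100);            // accepted
const staleWrite = writeToQueue(oldLeader, 200); // rejected
```

The key point: the resource being written enforces the fence, so safety does not depend on the old leader noticing it was deposed.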
Master load balancer algorithms for distributing traffic. Learn round-robin limitations, connection-aware routing, consistent hashing, and session affinity patterns.
Master Kafka partition keys, FIFO queues, sequence numbers, global vs per-entity ordering, and when ordering isn't worth the cost.
Your queue has 50 million unprocessed messages. Consumers are processing 1,000/second. New messages arrive at 5,000/second. The backlog will never drain. Here's how queue backlogs form, why they're dangerous, and the patterns to prevent and recover from them.
In distributed systems, failure is never all-or-nothing. A service returns a response — but it's corrupt. An API call times out — but the action already executed. A message is delivered — but the reply never arrives. This is partial failure, and it is the hardest problem in distributed systems.
Two requests check inventory simultaneously — both see 1 item in stock. Both proceed to purchase. You ship 2 items from 1. Race conditions in distributed systems are subtler than single-process races because you can't use mutexes across services. Here's how to prevent them.
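Without cross-service mutexes, the usual fix is optimistic concurrency: each purchase carries the stock version it read, and the write applies only if that version is unchanged (in SQL terms, roughly `UPDATE ... WHERE version = :read_version`). A sketch with hypothetical names:

```typescript
// Compare-and-set sketch: two racing purchases of the last item,
// only the first one whose version check passes succeeds.
let stock = { count: 1, version: 0 };

function purchase(readVersion: number): boolean {
  if (stock.version !== readVersion || stock.count < 1) return false;
  stock = { count: stock.count - 1, version: stock.version + 1 };
  return true;
}

// Both requests read version 0, then race to buy the single item.
const v = stock.version;
const first = purchase(v);  // wins: version still 0
const second = purchase(v); // loses: version has moved to 1
```

The loser gets a clean failure it can surface as "out of stock" or retry against fresh state, instead of silently overselling.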
User saves their profile. Page reloads. Shows old data. They save again — same thing. The write went to the primary. The read came from the replica. The replica is 2 seconds behind. Read-after-write consistency is the hardest problem with read replicas.
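One pattern for this is read-your-writes with a consistency token: the client keeps the version of its last write and a read is served from the replica only if the replica has caught up, otherwise it falls back to the primary. A sketch with hypothetical names:

```typescript
// Read-your-writes sketch: the write returns a version token; reads
// below that version must not be served from the lagging replica.
type Profile = { name: string; version: number };
let primary: Profile = { name: "old", version: 1 };
let replica: Profile = { name: "old", version: 1 }; // lagging copy

function write(name: string): number {
  primary = { name, version: primary.version + 1 };
  return primary.version; // client keeps this as its consistency token
}

function read(minVersion: number): Profile {
  return replica.version >= minVersion ? replica : primary;
}

const token = write("new"); // replica has not replicated yet
const seen = read(token);   // falls back to primary: user sees their write
```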
Understand Redis Cluster architecture, consistent hashing, CROSSSLOT errors, hot slot detection, replication, and monitoring for production deployments.
Your service is degraded, returning errors 30% of the time. Smart clients with retry logic start hammering it — 3 retries each means 3x the load on an already failing system. The retry storm amplifies the original failure until full collapse. Here's how to retry safely.
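The standard antidote to retry storms is exponential backoff with jitter: delays double per attempt up to a cap, and each delay is randomized so retries from many clients spread out instead of arriving in synchronized waves. A sketch (parameter defaults are illustrative, not prescriptive):

```typescript
// "Full jitter" backoff: pick a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,                        // 0-based retry attempt
  baseMs = 100,
  capMs = 10_000,
  random: () => number = Math.random,     // injectable for testing
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exp);
}

// With random pinned to 1, the uncapped growth is visible: 100, 200, 400, 800 ms.
const delays = [0, 1, 2, 3].map((a) => backoffDelayMs(a, 100, 10_000, () => 1));
```

Backoff alone is not enough: a retry budget (give up after a few attempts) keeps clients from adding any load at all once the service is clearly down.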
Master choreography and orchestration sagas, compensation transactions, and failure handling for distributed transaction management without 2PC.
You rename a column. The new service version uses the new name. The old version, still running during the rolling deploy, tries to use the old name. Database error. The migration that passed all your tests breaks production because both old and new code run simultaneously during deployment.
You split into microservices but all of them share the same PostgreSQL database. You have the operational overhead of microservices with none of the independent scalability. A schema migration blocks all teams. A bad query in Service A slows down Service B.
Network partition splits your 3-node cluster into two halves. Both halves think they're the primary. Both accept writes. Network heals. You have two diverged databases with conflicting data. This is split brain — one of the most dangerous failure modes in distributed systems.
Every operation is a synchronous HTTP call. User signup calls email service, which calls template service, which calls asset service. Any service down means signup is down. Any service slow means signup is slow. Synchronous coupling is the enemy of resilience.
Master Temporal.io to build resilient workflows with automatic retries, durability, and orchestration that survive infrastructure failures.
Service A calls Service B synchronously. Service B calls Service C. Service C calls Service A. Now a deploy to any of them requires coordinating all three. A bug in Service B takes down Services A and C. This isn't microservices — it's a distributed monolith.
Your server is in UTC. Your database is in UTC. Your cron job runs at "9 AM" — but 9 AM where? Customer in Tokyo and customer in New York both get charged at your server's 9 AM. Your "end of day" reports include data from tomorrow. Timezone bugs are invisible until they're expensive.