The query works fine in development with 1,000 rows. In production with 50 million rows it locks up the database for 3 minutes. One missing WHERE clause, one implicit type cast, one function wrapping an indexed column — and PostgreSQL ignores your index entirely.
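A minimal sketch of the last of those traps, assuming a plain B-Tree index on users(email); the table, column, and index names are illustrative:

```typescript
// The "function wrapping an indexed column" trap and the expression-index fix.
import { Pool } from "pg";

const pool = new Pool();

async function lookupByEmail(email: string) {
  // A plain B-Tree index on users(email) cannot be used here: lower() must be
  // applied to every row before the comparison, so PostgreSQL falls back to a scan.
  return pool.query("SELECT * FROM users WHERE lower(email) = $1", [email.toLowerCase()]);
}

async function addExpressionIndex() {
  // Fix: index the expression itself (or store emails pre-normalized and drop lower()).
  await pool.query(
    "CREATE INDEX IF NOT EXISTS idx_users_email_lower ON users (lower(email))"
  );
}
```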
You've been running backups for 18 months. The disk dies. You go to restore. The backup files are empty. Or corrupted. Or the backup job failed silently on month 4 and you've been running without a backup ever since. Untested backups are not backups.
Users see stale prices. Admins update settings but the old value is served for 10 minutes. You delete a record but it keeps appearing. Cache invalidation is famously hard — and most implementations have subtle bugs that serve wrong data long after the source changed.
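A minimal read-through sketch of one common mitigation (short TTL plus delete-on-write), assuming ioredis and node-postgres; the settings table and key scheme are illustrative:

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const pool = new Pool();

const key = (id: string) => `settings:${id}`;

export async function getSetting(id: string): Promise<string | null> {
  const cached = await redis.get(key(id));
  if (cached !== null) return cached;

  const { rows } = await pool.query("SELECT value FROM settings WHERE id = $1", [id]);
  const value = rows[0]?.value ?? null;
  if (value !== null) {
    // A short TTL caps how long a missed invalidation can keep serving stale data.
    await redis.set(key(id), value, "EX", 60);
  }
  return value;
}

export async function updateSetting(id: string, value: string): Promise<void> {
  await pool.query("UPDATE settings SET value = $2 WHERE id = $1", [id, value]);
  // Delete, don't overwrite: the next read repopulates from the source of truth.
  await redis.del(key(id));
}
```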
You add ON DELETE CASCADE to a foreign key. You delete a test organization. It cascades to users, which cascades to sessions, orders, invoices, activity_logs — 10,000 rows gone in milliseconds. No warning, no undo. Cascade deletes are powerful and dangerous.
You store a price as a JavaScript float. It reads back as 19.99, but one addition later the total displays as 20.000000000000004. Or you serialize a BigInt user ID to JSON and it comes back as a different number. Serialization bugs corrupt data silently — no error, just wrong values.
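A small sketch of two defensive habits that avoid both failure modes, in plain TypeScript; the Price type and formatPrice helper are made up for illustration:

```typescript
// Money as integer cents, and 64-bit IDs as strings so JSON can't round them.

type Price = { cents: number; currency: string };

// Float drift can't happen when the unit is an integer cent.
const subtotal: Price = { cents: 1999, currency: "USD" };
const fee: Price = { cents: 1, currency: "USD" };
const total: Price = { cents: subtotal.cents + fee.cents, currency: "USD" };

export function formatPrice(p: Price): string {
  return (p.cents / 100).toFixed(2); // formatting happens only at the display edge
}
console.log(formatPrice(total)); // "20.00"

// JSON has no 64-bit integer: 9007199254740993 would come back as
// 9007199254740992 if serialized as a number. Send big IDs as strings.
const user = { id: 9007199254740993n, name: "Ada" };
const body = JSON.stringify(user, (_key, value) =>
  typeof value === "bigint" ? value.toString() : value
);
console.log(body); // {"id":"9007199254740993","name":"Ada"}
```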
The email job has been failing silently for three months. 50,000 emails not sent. Or the background sync has been silently skipping records. Or the backup has been succeeding at creation but failing at upload. Silent failures are the most dangerous kind.
Your system handles 1,000 users today. You're designing for 10,000. Not 10 million — 10,000. Most "design for scale" advice is written for companies you're not. What actually changes at 10x, and what's over-engineering that will hurt more than help?
User updates their profile. Refreshes the page — old data shows. They update again. Still old data. They're furious. Your system is eventually consistent — but nobody told the user (or the developer who designed the UI). Here's how to manage consistency expectations in distributed systems.
"The app is slow. Fix it." — said by the founder, with no further context. Is the homepage slow? Checkout? API responses? For which users? On mobile? Under what conditions? Turning vague business pressure into actionable performance work requires measurement before code.
A user submits a GDPR deletion request. You have 30 days to comply. But their data is in the main DB, the analytics DB, S3, Redis, CloudWatch logs, third-party integrations, and three months of database backups. You have 30 days. Start now.
You shard by user ID. 80% of writes go to 20% of shards because your top customers are assigned to the same shards. Or you shard by date and all writes go to the current month's shard. Uneven distribution turns a scaling solution into a bottleneck.
The senior engineer proposes Kafka for the notification system. You have 500 users. The junior engineer proposes a direct function call. The senior engineer is technically correct and strategically wrong. Knowing when good architecture is overkill is the skill that separates senior from staff.
You need to export 10 million rows. You paginate with OFFSET, fetching 1,000 rows at a time. The first batch takes 50ms. By batch 5,000 the offset is 5 million rows and each batch takes 30 seconds. The total job takes 6 hours and gets slower as it goes.
Audit logs are critical for compliance and debugging. But an audit_logs table that grows without bounds will fill your disk, slow every query that touches it, and eventually crash your database. Here's how to keep your logs without letting them kill production.
You deploy a migration that runs ALTER TABLE on a 40-million row table. PostgreSQL rewrites the entire table. Your app is stuck waiting for the lock. Users see 503s for 8 minutes. Schema changes on large tables require a completely different approach.
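One hedged pattern for this kind of change (add the column without a rewrite, backfill in batches, validate the constraint separately), assuming node-postgres; table and constraint names are illustrative:

```typescript
import { Pool } from "pg";

const pool = new Pool();

async function migrate() {
  // Fast on modern PostgreSQL: no table rewrite, only a brief lock.
  await pool.query("ALTER TABLE orders ADD COLUMN IF NOT EXISTS status text");

  // Backfill in small batches so no single statement holds locks for minutes.
  let updated = 0;
  do {
    const res = await pool.query(
      `UPDATE orders SET status = 'legacy'
        WHERE id IN (SELECT id FROM orders WHERE status IS NULL LIMIT 5000)`
    );
    updated = res.rowCount ?? 0;
  } while (updated > 0);

  // NOT VALID skips the full-table scan now; VALIDATE runs later with a weaker lock.
  await pool.query(
    "ALTER TABLE orders ADD CONSTRAINT orders_status_not_null CHECK (status IS NOT NULL) NOT VALID"
  );
  await pool.query("ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null");
}

migrate().finally(() => pool.end());
```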
Month 1 — queries are fast. Month 6 — users notice slowness. Month 12 — the dashboard times out. The data grew but the indexes didn''t. Finding and adding the right index is often a 10-minute fix that makes queries 1000x faster.
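A diagnostic sketch of that 10-minute fix: confirm the sequential scan with EXPLAIN ANALYZE, then build the index without blocking writes. The events query and index are illustrative, not from any real schema:

```typescript
import { Pool } from "pg";

const pool = new Pool();

async function diagnose() {
  const plan = await pool.query(
    "EXPLAIN ANALYZE SELECT * FROM events WHERE account_id = 'acct_123' ORDER BY created_at DESC LIMIT 50"
  );
  // Look for "Seq Scan on events" in the plan output: that is the smoking gun.
  plan.rows.forEach((r) => console.log(r["QUERY PLAN"]));

  // CONCURRENTLY builds the index without locking the table against writes.
  await pool.query(
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_account_created ON events (account_id, created_at DESC)"
  );
}

diagnose().finally(() => pool.end());
```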
Error rate spikes after deploy. You need to roll back. But the migration already ran, the old binary can't read the new schema, and "reverting the deploy" means a data loss decision. Rollback is only possible if you design for it before you deploy.
Page 1 loads in 10ms. Page 100 loads in 500ms. Page 1000 loads in 5 seconds. OFFSET pagination forces the database to read and discard every skipped row before returning yours. Cursor-based pagination fixes this — same performance on page 1 and page 10,000.
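A minimal keyset (cursor) sketch of that fix, assuming node-postgres; the events table and cursor shape are illustrative, and it works best with an index on (created_at, id):

```typescript
import { Pool } from "pg";

const pool = new Pool();

type Cursor = { createdAt: string | Date; id: number } | null;

export async function nextPage(cursor: Cursor, pageSize = 1000) {
  const { rows } = cursor
    ? await pool.query(
        `SELECT id, created_at, payload FROM events
          WHERE (created_at, id) > ($1, $2)
          ORDER BY created_at, id
          LIMIT $3`,
        [cursor.createdAt, cursor.id, pageSize]
      )
    : await pool.query(
        "SELECT id, created_at, payload FROM events ORDER BY created_at, id LIMIT $1",
        [pageSize]
      );

  const last = rows[rows.length - 1];
  return {
    rows,
    // The caller passes this back on the next call; page 10,000 costs the same
    // index seek as page 1 because nothing is skipped.
    nextCursor: last ? { createdAt: last.created_at, id: last.id } : null,
  };
}
```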
Stripe times out at 30 seconds. Did the charge happen? You don't know. You charge again and double-charge the customer. Or you don't charge and ship for free. Payment idempotency and webhook reconciliation are the only reliable path through this.
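A hedged sketch of the idempotency-key half of that, using Stripe's Node SDK; the key scheme based on an order ID is an assumption, not a Stripe requirement:

```typescript
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function chargeOrder(orderId: string, amountCents: number) {
  // Same key on every retry for this order: if the first attempt timed out but
  // actually succeeded, Stripe returns the original PaymentIntent instead of
  // creating a second charge.
  return stripe.paymentIntents.create(
    { amount: amountCents, currency: "usd", metadata: { orderId } },
    { idempotencyKey: `order-charge-${orderId}` }
  );
}
// After a timeout, retry with the same key, and reconcile against
// payment_intent.succeeded webhooks before deciding the charge never happened.
```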
Master PostgreSQL indexing strategies including B-Tree for general queries, GIN for JSONB/arrays, BRIN for time-series, partial indexes, covering indexes, and how to identify unused indexes with pg_stat_user_indexes.
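A small sketch of that unused-index check, assuming node-postgres; the zero-scan threshold is a starting point, not a rule:

```typescript
import { Pool } from "pg";

const pool = new Pool();

async function findUnusedIndexes() {
  const { rows } = await pool.query(`
    SELECT s.schemaname,
           s.relname      AS table_name,
           s.indexrelname AS index_name,
           s.idx_scan     AS scans,
           pg_size_pretty(pg_relation_size(s.indexrelid)) AS size
      FROM pg_stat_user_indexes s
      JOIN pg_index i ON i.indexrelid = s.indexrelid
     WHERE s.idx_scan = 0
       AND NOT i.indisunique   -- unique indexes enforce constraints; keep them
     ORDER BY pg_relation_size(s.indexrelid) DESC
  `);
  console.table(rows); // never-scanned indexes still cost disk and write amplification
}

findUnusedIndexes().finally(() => pool.end());
```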
Learn when sharding becomes necessary, compare hash vs range vs list partitioning, explore Citus for horizontal scaling, and understand the costs of distributed queries and shard key selection.
Two requests check inventory simultaneously — both see 1 item in stock. Both proceed to purchase. You ship 2 items from 1. Race conditions in distributed systems are subtler than single-process races because you can't use mutexes across services. Here's how to prevent them.
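One sketch of the single-statement fix, assuming node-postgres; the inventory schema is illustrative:

```typescript
import { Pool } from "pg";

const pool = new Pool();

export async function reserveItem(productId: string): Promise<boolean> {
  // The check and the decrement are one atomic statement, so the database
  // serializes them: either this request wins and stock goes 1 -> 0, or the
  // WHERE clause fails and rowCount is 0. There is no window where both
  // requests observe "1 in stock" and both proceed.
  const res = await pool.query(
    `UPDATE inventory
        SET stock = stock - 1
      WHERE product_id = $1 AND stock > 0`,
    [productId]
  );
  return (res.rowCount ?? 0) === 1;
}
// For multi-step flows, SELECT ... FOR UPDATE inside a transaction, or an
// advisory lock, plays the role a mutex would in a single process.
```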
User saves their profile. Page reloads. Shows old data. They save again — same thing. The write went to the primary. The read came from the replica. The replica is 2 seconds behind. Read-after-write consistency is the hardest problem with read replicas.
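A hedged sketch of one mitigation, pinning a user's reads to the primary for a few seconds after their own write; the 5-second window and the in-memory map are simplifying assumptions:

```typescript
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_URL });

// userId -> epoch ms of last write (per-process; real setups keep this in the
// session or a shared store like Redis).
const lastWriteAt = new Map<string, number>();
const REPLICATION_WINDOW_MS = 5_000;

export async function saveProfile(userId: string, name: string) {
  await primary.query("UPDATE profiles SET name = $2 WHERE user_id = $1", [userId, name]);
  lastWriteAt.set(userId, Date.now());
}

export async function getProfile(userId: string) {
  const recentWrite =
    Date.now() - (lastWriteAt.get(userId) ?? 0) < REPLICATION_WINDOW_MS;
  // Read your own writes from the primary; everyone else can use the replica.
  const db = recentWrite ? primary : replica;
  const { rows } = await db.query("SELECT * FROM profiles WHERE user_id = $1", [userId]);
  return rows[0];
}
```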
The codebase is a mess. Nobody wants to touch it. The "obvious fix" requires changing 40 files. Every change breaks three things. Refactoring legacy code safely requires the strangler fig pattern, comprehensive tests before changing anything, and very small steps.
The disk dies at 2 AM. You have backups. But the restore takes 9 hours because nobody tested it, the database is 800GB, the download from S3 is throttled, and pg_restore runs single-threaded by default. You could have restored in 45 minutes with the right setup.
The codebase is painful. The team wants to rewrite it. The CTO wants to maintain velocity. Both are right. The rewrite vs refactor decision is one of the highest-stakes calls in software — get it wrong and you lose two years of productivity or two more years of compounding debt.
Traffic spikes 10x at 8 AM on Black Friday. Auto-scaling triggers but takes 4 minutes to add instances. The database connection pool is exhausted at minute 2. The checkout flow is down for your highest-traffic day of the year.
You rename a column. The new service version uses the new name. The old version, still running during the rolling deploy, tries to use the old name. Database error. The migration that passed all your tests breaks production because both old and new code run simultaneously during deployment.
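A hedged expand/contract sketch for that rename, assuming node-postgres; the users table and column names are illustrative:

```typescript
import { Pool } from "pg";

const pool = new Pool();

async function renameUserNameColumn() {
  // Step 1 (expand): add the new column; the old code keeps working.
  await pool.query("ALTER TABLE users ADD COLUMN IF NOT EXISTS full_name text");

  // Step 2: deploy code that writes both columns and reads the new one,
  // then backfill existing rows (batch this on large tables).
  await pool.query("UPDATE users SET full_name = name WHERE full_name IS NULL");

  // Step 3 (contract): only after no running version touches the old column,
  // drop it in a later deploy.
  await pool.query("ALTER TABLE users DROP COLUMN IF EXISTS name");
}

renameUserNameColumn().finally(() => pool.end());
```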
You split into microservices but all of them share the same PostgreSQL database. You have the operational overhead of microservices with none of the independent scalability. A schema migration blocks all teams. A bad query in Service A slows down Service B.
The database has a replica. The app has multiple pods. You think you're resilient. Then the single Redis instance goes down, and every service that depended on it — auth, sessions, rate limiting, caching — stops working simultaneously. SPOFs hide in plain sight.
Network partition splits your 3-node cluster into two halves. Both halves think they're the primary. Both accept writes. Network heals. You have two diverged databases with conflicting data. This is split brain — one of the most dangerous failure modes in distributed systems.
SQL injection persists in ORM applications. Learn why raw(), $executeRaw(), and stored procedures become injection vectors once user input is concatenated into the query string, and how to defend with parameterization.
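A minimal contrast of the two, using node-postgres; the same idea carries over to an ORM's raw helpers:

```typescript
import { Pool } from "pg";

const pool = new Pool();

export async function findUserUnsafe(email: string) {
  // DON'T: "alice@example.com' OR '1'='1" becomes part of the SQL text itself.
  return pool.query(`SELECT * FROM users WHERE email = '${email}'`);
}

export async function findUserSafe(email: string) {
  // DO: the value travels separately from the SQL, so it can never change the
  // shape of the query, only the value being compared.
  return pool.query("SELECT * FROM users WHERE email = $1", [email]);
}
```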
Your server is in UTC. Your database is in UTC. Your cron job runs at "9 AM" — but 9 AM where? Customer in Tokyo and customer in New York both get charged at your server's 9 AM. Your "end of day" reports include data from tomorrow. Timezone bugs are invisible until they're expensive.
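A small sketch of the "store UTC, localize at the edge" habit, assuming Luxon for timezone math; the per-customer billing hour is made up for illustration:

```typescript
import { DateTime } from "luxon";

// "Charge at 9 AM" only means something once a zone is attached.
export function nextChargeTimeUtc(customerZone: string): Date {
  let local = DateTime.now().setZone(customerZone).set({
    hour: 9, minute: 0, second: 0, millisecond: 0,
  });
  if (local < DateTime.now().setZone(customerZone)) local = local.plus({ days: 1 });
  return local.toUTC().toJSDate(); // store and compare in UTC
}

// Tokyo and New York now resolve to different UTC instants instead of the server's 9 AM.
nextChargeTimeUtc("Asia/Tokyo");       // ~00:00 UTC
nextChargeTimeUtc("America/New_York"); // ~13:00 or 14:00 UTC depending on DST
```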
Sessions table. Events table. Audit log. Each row is small. But with 100,000 active users each generating a few dozen events a day, it's 5 million rows per day. No one added a purge job. Six months later the disk is full and the database crashes.
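A minimal retention-job sketch, assuming node-postgres; the 90-day window, batch size, and events table are assumptions:

```typescript
import { Pool } from "pg";

const pool = new Pool();

export async function purgeOldEvents(batchSize = 10_000) {
  let deleted = 0;
  do {
    // Delete in bounded batches so the purge never holds long locks or writes
    // one giant transaction's worth of WAL.
    const res = await pool.query(
      `DELETE FROM events
        WHERE id IN (
          SELECT id FROM events
           WHERE created_at < now() - interval '90 days'
           LIMIT $1
        )`,
      [batchSize]
    );
    deleted = res.rowCount ?? 0;
  } while (deleted > 0); // run daily from a scheduler, before the disk fills
}
// At very high volume, time-based partitioning (dropping whole partitions) is
// cheaper than row-by-row deletes.
```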
The t3.micro database that "works fine in staging" OOMs under real load. The single-AZ deployment that's been fine for two years fails the week of your biggest launch. Underprovisioning is the other edge of the cost/reliability tradeoff — and it has a much higher price.
Most loggers are synchronous — they block your event loop while writing to disk or a remote service. logixia is async-first, with non-blocking transports for PostgreSQL, MySQL, MongoDB, SQLite, rotating files, Kafka, and WebSocket, plus log search, field redaction, and OpenTelemetry request tracing via AsyncLocalStorage.