Underprovisioned Infrastructure Causing Downtime — When "Good Enough" Isn't
By Sanjeev Sharma (@webcoderspeed1)
Introduction
Underprovisioning is the silent counterpart to overprovisioning. Overprovisioned infrastructure wastes money gradually; underprovisioned infrastructure fails suddenly. The t3.micro RDS that works perfectly in development will OOM under a production JOIN on a 10M-row table. The single-AZ instance that's "never had issues" in two years will have its first issue on the Friday of your product launch. The cost of underprovisioning isn't measured in dollars per month — it's measured in downtime, revenue loss, and customer trust.
- How Underprovisioning Kills You
- Fix 1: Load Test Before You Go Live
- Fix 2: Never Use Burstable Instances for Production Databases
- Fix 3: Minimum Viable Redundancy for Every Stateful Resource
- Fix 4: Connection Pool Sizing That Matches Instance Limits
- Fix 5: Capacity Planning for Known Growth
- Underprovisioning Prevention Checklist
- Conclusion
How Underprovisioning Kills You
Underprovisioning failure patterns:
1. Memory exhaustion under real load
→ Dev: t3.micro with 1GB RAM, 100 rows/table
→ Prod: same instance, 10M rows, JOINs use 3GB RAM
→ OOM killer terminates PostgreSQL mid-query
→ Connection pool error → app returns 500s
2. CPU throttling at critical moments
→ t3/t4g instances use burstable CPU (CPU credits)
→ Credits depleted during traffic spike
→ CPU throttled to 5% baseline → requests time out
3. Single-AZ downtime
→ AZ failure: 2-4 hours of RDS unavailability
→ "It's never happened" is not an architecture
4. Disk I/O bottleneck
→ gp2 volume: 100 IOPS baseline (burst to 3000)
→ Burst credits run out → sustained 100 IOPS
→ Write-heavy workload: queries queue, latency spikes
5. Connection exhaustion
→ RDS db.t3.micro: max_connections = 85
→ 10 app instances × 10 connections each = 100 → FATAL: too many connections
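The connection-exhaustion arithmetic above is worth encoding as a pre-deploy check rather than discovering in logs. A minimal sketch — the `headroomFactor` default and the example `maxConnections` value are illustrative assumptions, not library APIs:

```typescript
// Sketch: flag connection exhaustion before deploy, not after.
// maxConnections comes from your instance class (e.g. ~85 on db.t3.micro);
// headroomFactor leaves room for admin sessions and failover reconnect storms.
function connectionBudget(params: {
  appInstances: number
  poolSizePerInstance: number
  maxConnections: number
  headroomFactor?: number // default 0.8: use at most 80% of the hard limit
}): { demand: number; budget: number; ok: boolean } {
  const { appInstances, poolSizePerInstance, maxConnections, headroomFactor = 0.8 } = params
  const demand = appInstances * poolSizePerInstance
  const budget = Math.floor(maxConnections * headroomFactor)
  return { demand, budget, ok: demand <= budget }
}

// The failure case from the list above: 10 app instances × 10 connections on a t3.micro
const result = connectionBudget({ appInstances: 10, poolSizePerInstance: 10, maxConnections: 85 })
// demand 100 against a budget of 68: this deploy fails before it ships
```

Run this in CI whenever the instance count or pool size changes, and the "FATAL: too many connections" incident becomes a failed build instead.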
Fix 1: Load Test Before You Go Live
# Don't discover underprovisioning during a real incident
# Discover it in staging with production-scale data
# k6 load test to find resource limits
k6 run --vus 100 --duration 10m loadtest.js
# Watch these during the test:
# RDS CloudWatch:
# - CPUUtilization (should stay < 70%)
# - FreeableMemory (should stay > 20% of instance RAM)
# - DatabaseConnections (should stay under max_connections * 0.8)
# ECS CloudWatch:
# - CPUUtilization (should stay < 70% to leave room for spikes)
# - MemoryUtilization (should stay < 80%)
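For reference, here is a minimal `loadtest.js` that the k6 command above could run. It only executes under the k6 runtime, and the endpoint URL and threshold values are placeholders you would replace with your own:

```javascript
// loadtest.js — minimal k6 script; URL and thresholds are placeholder assumptions
import http from 'k6/http'
import { check, sleep } from 'k6'

export const options = {
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if p95 latency exceeds 500ms
    http_req_failed: ['rate<0.01'],   // fail if more than 1% of requests error
  },
}

export default function () {
  const res = http.get('https://staging.example.com/api/health')
  check(res, { 'status is 200': (r) => r.status === 200 })
  sleep(1)
}
```

The thresholds make the load test a pass/fail gate: if the staging environment can't hold p95 under 500ms at 100 VUs, the production instance types need another look before launch.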
// Establish resource baselines BEFORE a traffic spike hits
async function captureResourceBaseline() {
const metrics = await Promise.all([
getMetric('AWS/RDS', 'CPUUtilization', 'myapp-prod'),
getMetric('AWS/RDS', 'FreeableMemory', 'myapp-prod'),
getMetric('AWS/RDS', 'DatabaseConnections', 'myapp-prod'),
getMetric('AWS/ECS', 'CPUUtilization', 'production/myapp-api'),
getMetric('AWS/ECS', 'MemoryUtilization', 'production/myapp-api'),
])
const [rdsCpu, rdsRam, rdsConns, ecsCpu, ecsMem] = metrics
console.log('Current resource utilization:')
console.log(` RDS CPU: ${rdsCpu}% (warn at 70%)`)
console.log(` RDS RAM: ${((1 - rdsRam / INSTANCE_RAM) * 100).toFixed(0)}% used`)
console.log(` RDS Conns: ${rdsConns} / ${MAX_CONNECTIONS}`)
console.log(` ECS CPU: ${ecsCpu}%`)
console.log(` ECS Memory: ${ecsMem}%`)
// If any are above 60%, you have little headroom for traffic spikes
}
Fix 2: Never Use Burstable Instances for Production Databases
# ❌ Burstable instances (t3, t4g) for RDS:
# - CPU credit system: earns credits when idle, spends during load
# - When credits run out: throttled to 5-20% of base CPU
# - Happens at the worst time: sustained production load
# ✅ Use General Purpose (m) or Memory Optimized (r) for production:
# m6g.large: 2 vCPU, 8GB RAM → $0.156/hr (no burst limits)
# r6g.large: 2 vCPU, 16GB RAM → $0.192/hr (better for DB workloads)
# vs
# t4g.medium: 2 vCPU, 4GB RAM → $0.068/hr (looks cheaper, fails under load)
# The $0.124/hr difference ($89/month) is the cost of never having
# a CPU throttling incident on your primary database.
# Check if your instance is running out of CPU credits:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUCreditBalance \
--dimensions Name=DBInstanceIdentifier,Value=myapp-prod \
--start-time "$(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 3600 \
--statistics Minimum
# If minimum ever hits 0: you're getting throttled
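The credit math behind that throttling is easy to sanity-check. A back-of-the-envelope sketch, assuming the standard model where one CPU credit equals one vCPU-minute at 100% utilization (baseline percentages vary by instance class; 20% is the commonly documented t3.medium figure):

```typescript
// Sketch: minutes until a burstable instance exhausts its CPU credit balance.
// One CPU credit = 1 vCPU at 100% for 1 minute. Baseline is the utilization
// level at which credits earned equal credits spent.
function minutesUntilThrottle(params: {
  creditBalance: number
  vcpus: number
  utilization: number // current average utilization per vCPU, 0..1
  baseline: number    // baseline utilization per vCPU, 0..1
}): number {
  const { creditBalance, vcpus, utilization, baseline } = params
  const netBurnPerMinute = vcpus * (utilization - baseline) // credits/min
  if (netBurnPerMinute <= 0) return Infinity // earning faster than spending
  return creditBalance / netBurnPerMinute
}

// A 2-vCPU instance at 90% utilization with a 20% baseline and 288 banked
// credits burns 1.4 credits/min net — the bank runs dry in about 3.5 hours,
// which is typically mid-traffic-spike.
const minutes = minutesUntilThrottle({ creditBalance: 288, vcpus: 2, utilization: 0.9, baseline: 0.2 })
```

This is why burstable instances feel fine in testing and fail in production: a short test never drains the bank, a sustained spike always does.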
Fix 3: Minimum Viable Redundancy for Every Stateful Resource
# Every production stateful component needs:
# 1. Multi-AZ or multi-region (not single point of failure)
# 2. Sufficient instance type (not burstable)
# 3. Enough connection headroom (PgBouncer)
# terraform/production.tf — minimum production configuration
resource "aws_db_instance" "postgres" {
# ✅ NOT t3/t4g for production
instance_class = "db.r6g.large"
# ✅ Multi-AZ for automatic failover
multi_az = true
# ✅ Backup retention
backup_retention_period = 14
# ✅ Storage type that doesn't burst-and-throttle
storage_type = "gp3"
iops = 3000 # Dedicated IOPS, not burst
# ✅ Storage alarm before you run out
allocated_storage = 100
# Set alarm at 80% usage: 80GB
}
resource "aws_cloudwatch_metric_alarm" "rds_storage_high" {
alarm_name = "rds-storage-high"
namespace = "AWS/RDS"
metric_name = "FreeStorageSpace"
dimensions = {
DBInstanceIdentifier = aws_db_instance.postgres.identifier
}
comparison_operator = "LessThanThreshold"
threshold = 20 * 1024 * 1024 * 1024 # 20GB free
evaluation_periods = 2
period = 300
statistic = "Average"
alarm_actions = [aws_sns_topic.alerts.arn]
}
Fix 4: Connection Pool Sizing That Matches Instance Limits
// Database connection pool must be sized for the instance, not just the app
// RDS max_connections formula (rough):
// t3.micro (1GB RAM): max_connections ≈ 85
// t3.small (2GB RAM): max_connections ≈ 170
// r6g.large (16GB RAM): max_connections ≈ 1330
// With 10 app instances, each needing 20 connections:
// Total connections needed: 200
// t3.micro can handle: 85 → YOU WILL HIT FATAL: too many connections
// ✅ Solution 1: Use PgBouncer
// App connects to PgBouncer (handles unlimited connections)
// PgBouncer maintains small pool to RDS (e.g., 50 connections)
// ✅ Solution 2: Size instance for connection count
// If you need 200 connections: use at least r6g.large or m6g.large
// ✅ Solution 3: Reduce pool size per instance
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 10, // Not 20 or 50 — calculate: (max_connections * 0.8) / num_instances
min: 2,
idleTimeoutMillis: 30_000,
connectionTimeoutMillis: 10_000,
})
// Alert when pool is near capacity
pool.on('connect', () => {
if (pool.totalCount > pool.options.max! * 0.8) {
logger.warn({ poolSize: pool.totalCount, max: pool.options.max },
'Connection pool near capacity')
}
})
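The formula in the `max` comment above can be made explicit. A small helper, assuming you know the instance's `max_connections` and how many app instances share it:

```typescript
// Sketch: per-instance pool size that respects the database's hard limit.
// Reserves 20% of max_connections for admin sessions, migrations, and
// the reconnect storm during a failover.
function poolSizePerInstance(maxConnections: number, appInstances: number): number {
  const usable = Math.floor(maxConnections * 0.8)
  return Math.max(1, Math.floor(usable / appInstances))
}

// db.t3.micro (max_connections ≈ 85) shared by 10 app instances:
poolSizePerInstance(85, 10) // → 6, not the framework default of 10 or 20
```

If the result comes out below what each instance genuinely needs under load, that is your signal to add PgBouncer or move up an instance class rather than to raise the pool size.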
Fix 5: Capacity Planning for Known Growth
// capacity-planner.ts — model resource needs before they become incidents
function calculateMinCapacity(params: {
peakRequestsPerSecond: number
avgResponseTimeMs: number
cpuPerRequestMs: number // CPU-ms consumed per request
safetyFactor: number // 2x = 50% headroom at peak
}) {
const { peakRequestsPerSecond, avgResponseTimeMs, cpuPerRequestMs, safetyFactor } = params
// CPU cores needed at peak
const cpuCoresNeeded =
(peakRequestsPerSecond * cpuPerRequestMs) / 1000 * safetyFactor
// Concurrent in-flight requests (Little's law: L = λ × W) — a rough proxy for memory needs
const concurrentRequests = peakRequestsPerSecond * (avgResponseTimeMs / 1000)
console.log('Capacity requirements at peak load:')
console.log(` Peak RPS: ${peakRequestsPerSecond}`)
console.log(` CPU cores: ${cpuCoresNeeded.toFixed(1)} (with ${safetyFactor}x safety factor)`)
console.log(` Concurrent requests: ${concurrentRequests.toFixed(0)}`)
}
// Example: 500 RPS, 100ms avg response, 10ms CPU per request, 2x safety
calculateMinCapacity({
peakRequestsPerSecond: 500,
avgResponseTimeMs: 100,
cpuPerRequestMs: 10,
safetyFactor: 2,
})
// Output: 10 CPU cores needed → 5 x 2-vCPU instances minimum
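Turning that core count into an instance count is one more division. A trivial helper, with the vCPU count as an assumed input:

```typescript
// Sketch: translate required CPU cores into a minimum instance count.
// Always round up — a fractional instance is an underprovisioned one.
function instancesNeeded(cpuCoresNeeded: number, vcpusPerInstance: number): number {
  return Math.ceil(cpuCoresNeeded / vcpusPerInstance)
}

// 10 cores needed on 2-vCPU instances:
instancesNeeded(10, 2) // → 5 instances minimum
```

Note this is a floor, not a target: autoscaling should treat this number as the minimum capacity, with room to scale above it.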
Underprovisioning Prevention Checklist
- ✅ Load tested at 2x expected peak BEFORE launch, with production-scale data
- ✅ No burstable instances (t3/t4g) for production databases
- ✅ All stateful resources are multi-AZ with automatic failover
- ✅ Database connection limits calculated and respected — PgBouncer if needed
- ✅ gp3 storage with dedicated IOPS (not burst credits) for write-heavy workloads
- ✅ CloudWatch alarms: CPU > 70%, memory > 80%, storage < 20% free
- ✅ Capacity planning done before major launches or expected traffic growth
Conclusion
Underprovisioning is a false economy. The t3.micro database saves a little money every month, right up until the outage that costs $50,000 in lost revenue and customer churn. The minimum viable production configuration for any stateful component is: a non-burstable instance type, multi-AZ redundancy, and CloudWatch alarms that fire before resources are exhausted. The cost difference between "will survive a traffic spike" and "might survive a traffic spike" is usually less than $200/month. That's not a cost optimization decision — it's a reliability decision.