Unbounded Table Growth — When Your Database Fills the Disk at 3 AM

Introduction

Unbounded table growth is one of those "it'll never be a problem" problems that becomes a 3 AM incident. The table exists for a legitimate reason. The rows are small. But no one planned for data retention, and the data just keeps accumulating until the disk fills up — at which point the database stops accepting writes entirely.

Tables That Commonly Grow Without Bounds

- sessions / auth_tokens
- events / analytics
- audit_logs / activity_logs
- notifications (read and unread)
- email_queue / job_queue (processed jobs not deleted)
- cache_entries (TTL not enforced at DB level)
- webhook_deliveries
- api_request_logs

These tables share a pattern: rows are written frequently, old rows are rarely needed, and no one added a cleanup job.
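Tables matching this pattern can often be spotted from Postgres's own statistics before they become a problem. A sketch of one heuristic, assuming a node-postgres-style `db.query` helper (the 1% threshold and the function names are illustrative, not a standard):

```typescript
// Flags a table as "unbounded": writes vastly outnumber deletes.
// Pure function, so the threshold logic is easy to unit-test.
function looksUnbounded(inserts: number, deletes: number, minInserts = 10_000): boolean {
  if (inserts < minInserts) return false  // too little traffic to judge
  return deletes < inserts * 0.01         // fewer than 1% of inserted rows ever deleted
}

// Usage sketch against pg_stat_user_tables (db.query is an assumed pg helper)
async function findUnboundedTables(db: { query: (sql: string) => Promise<{ rows: any[] }> }): Promise<string[]> {
  const { rows } = await db.query(
    `SELECT relname, n_tup_ins, n_tup_del FROM pg_stat_user_tables`
  )
  return rows
    .filter(r => looksUnbounded(Number(r.n_tup_ins), Number(r.n_tup_del)))
    .map(r => r.relname)
}
```

Running this weekly and eyeballing the list is a cheap way to catch a forgotten log table before the disk does.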

Fix 1: Scheduled Purge Job

import cron from 'node-cron'

// Run daily at 2 AM — delete old data in small batches
cron.schedule('0 2 * * *', async () => {
  await purgeOldSessions()
  await purgeOldEvents()
  await purgeProcessedJobs()
})

async function purgeOldSessions() {
  const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)  // 30 days ago

  let deleted = 0
  let batch: number

  do {
    // Delete in small batches to avoid long-running transactions
    const result = await db.query(`
      DELETE FROM sessions
      WHERE id IN (
        SELECT id FROM sessions
        WHERE expires_at < $1 OR last_active < $2
        LIMIT 1000
      )
    `, [new Date(), cutoff])

    batch = result.rowCount ?? 0
    deleted += batch

    if (batch > 0) {
      await sleep(100)  // brief pause between batches
    }
  } while (batch > 0)

  console.log(`Purged ${deleted} expired sessions`)
}
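The purge job above assumes a `sleep` helper and a `db` handle that aren't shown. A minimal sketch of what they might look like (the node-postgres Pool setup is an assumption about the surrounding app, not part of the original):

```typescript
// Promise-based pause used between delete batches (`sleep(100)` above),
// so the purge job yields the database to foreground queries.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// `db` is assumed to be a node-postgres Pool, e.g.:
//   import { Pool } from 'pg'
//   const db = new Pool({ connectionString: process.env.DATABASE_URL })
```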

Fix 2: PostgreSQL Table Partitioning by Time

For very large tables, partitioning lets you drop old data instantly without a DELETE:

-- Create a partitioned table (range partition by month)
CREATE TABLE events (
  id UUID DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL,
  event_type VARCHAR(100) NOT NULL,
  meta JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE events_2026_01 PARTITION OF events
  FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

CREATE TABLE events_2026_02 PARTITION OF events
  FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

CREATE TABLE events_2026_03 PARTITION OF events
  FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Dropping an old partition is INSTANT (unlike DELETE)
-- No need to scan rows — just removes the file
DROP TABLE events_2024_01;  -- Instantly frees all space for that month

// Auto-create next month's partition
async function ensureNextPartition() {
  const nextMonth = new Date()
  nextMonth.setMonth(nextMonth.getMonth() + 1)

  const year = nextMonth.getFullYear()
  const month = String(nextMonth.getMonth() + 1).padStart(2, '0')
  const nextMonthStr = `${year}-${month}-01`

  const twoMonthsLater = new Date(nextMonth)
  twoMonthsLater.setMonth(twoMonthsLater.getMonth() + 1)
  const twoMonthsStr = `${twoMonthsLater.getFullYear()}-${String(twoMonthsLater.getMonth() + 1).padStart(2, '0')}-01`

  await db.query(`
    CREATE TABLE IF NOT EXISTS events_${year}_${month}
    PARTITION OF events
    FOR VALUES FROM ('${nextMonthStr}') TO ('${twoMonthsStr}')
  `)
}

// Run monthly — create next month's partition in advance
cron.schedule('0 0 1 * *', ensureNextPartition)

// Drop old partitions
async function dropOldPartitions(keepMonths = 6) {
  const cutoff = new Date()
  cutoff.setMonth(cutoff.getMonth() - keepMonths)

  const year = cutoff.getFullYear()
  const month = String(cutoff.getMonth() + 1).padStart(2, '0')

  await db.query(`DROP TABLE IF EXISTS events_${year}_${month}`)
}
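The date arithmetic in ensureNextPartition and dropOldPartitions is the kind of code that breaks at year boundaries, so it can help to factor it into one pure, testable helper. A sketch (partitionFor is a hypothetical name, not used in the code above):

```typescript
// Returns the partition table name and [from, to) bounds for the month
// `offset` months away from `base` (negative = past, positive = future).
// Uses UTC so server timezone can't shift the month boundary.
function partitionFor(base: Date, offset: number): { name: string; from: string; to: string } {
  const start = new Date(Date.UTC(base.getUTCFullYear(), base.getUTCMonth() + offset, 1))
  const end = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth() + 1, 1))
  const fmt = (d: Date) =>
    `${d.getUTCFullYear()}-${String(d.getUTCMonth() + 1).padStart(2, '0')}-01`
  const [y, m] = fmt(start).split('-')
  return { name: `events_${y}_${m}`, from: fmt(start), to: fmt(end) }
}
```

With this, ensureNextPartition becomes `partitionFor(new Date(), 1)` and dropOldPartitions `partitionFor(new Date(), -keepMonths)`, and the year rollover is covered by a unit test instead of a January incident.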

Fix 3: TTL Column with a Purge Index

When you can't partition, a purge index makes batch deletes fast:

-- Add expires_at column
ALTER TABLE sessions ADD COLUMN expires_at TIMESTAMPTZ NOT NULL
  DEFAULT NOW() + INTERVAL '30 days';

-- Index for efficient purge queries. Note: a partial index like
-- "WHERE expires_at < NOW()" is rejected by Postgres, because NOW()
-- is not IMMUTABLE; a plain index on expires_at works fine.
CREATE INDEX idx_sessions_expires_at ON sessions (expires_at);

-- Purge query is fast because it uses the index
DELETE FROM sessions WHERE expires_at < NOW();

Fix 4: Monitor Table Size

Catch growth before it becomes a crisis:

-- Table sizes with row counts
SELECT
  c.relname AS table_name,
  s.n_live_tup AS row_count,
  pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
  pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
  pg_size_pretty(pg_indexes_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid  -- alias both: relname exists in each
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;

// Alert when any table exceeds a size threshold
async function checkTableSizes() {
  const result = await db.query(`
    SELECT c.relname, pg_total_relation_size(c.oid) AS size_bytes
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    WHERE c.relkind = 'r'
  `)

  for (const row of result.rows) {
    const gbSize = Number(row.size_bytes) / (1024 ** 3)  // pg returns bigint columns as strings
    if (gbSize > 10) {  // alert if any table exceeds 10GB
      await alerting.warn(`Table ${row.relname} is ${gbSize.toFixed(1)}GB`)
    }
  }
}

cron.schedule('0 * * * *', checkTableSizes)  // check hourly

Data Retention Policy

Before writing cleanup code, define the retention policy per table:

const RETENTION_POLICY = {
  sessions: { keepDays: 30, purgeColumn: 'expires_at' },
  events: { keepDays: 90, purgeColumn: 'created_at' },
  audit_logs: { keepDays: 365, purgeColumn: 'created_at' },
  notifications: { keepDays: 60, readOnly: true, purgeColumn: 'created_at' },  // purge only notifications the user has read
  job_queue: { keepCompleted: 7, keepFailed: 30, purgeColumn: 'completed_at' },
}
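With the policy in one object, the per-table purge SQL can be generated instead of hand-written. A sketch (buildPurgeSql is hypothetical; it only builds the statement, so the batched loop from Fix 1 still drives it):

```typescript
// Builds one batched DELETE from a retention policy entry.
// Table and column names are interpolated into the SQL, so they must
// come from the trusted RETENTION_POLICY object, never from user input.
function buildPurgeSql(table: string, purgeColumn: string, keepDays: number, batch = 1000): string {
  return (
    `DELETE FROM ${table} WHERE id IN (` +
    `SELECT id FROM ${table} ` +
    `WHERE ${purgeColumn} < NOW() - INTERVAL '${keepDays} days' ` +
    `LIMIT ${batch})`
  )
}

// e.g. buildPurgeSql('events', 'created_at', 90) ready to run in the batch loop
```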

// Job queue: keep failed jobs longer for debugging
async function purgeJobQueue() {
  // Delete completed jobs older than 7 days (one batch of 5,000 per run;
  // the date filter must live inside the subquery, or LIMIT may pick
  // 5,000 recent rows that the outer WHERE then filters out)
  await db.query(`
    DELETE FROM jobs WHERE id IN (
      SELECT id FROM jobs
      WHERE status = 'complete' AND completed_at < NOW() - INTERVAL '7 days'
      LIMIT 5000
    )
  `)

  // Delete failed jobs older than 30 days (one batch of 5,000 per run)
  await db.query(`
    DELETE FROM jobs WHERE id IN (
      SELECT id FROM jobs
      WHERE status = 'failed' AND completed_at < NOW() - INTERVAL '30 days'
      LIMIT 5000
    )
  `)
}

Unbounded Growth Checklist

  • ✅ Define data retention policy for every high-write table
  • ✅ Add expires_at or created_at columns with indexes for purge queries
  • ✅ Schedule daily purge jobs — delete in batches of 1,000-10,000
  • ✅ Use table partitioning for tables exceeding 100GB
  • ✅ Monitor table sizes hourly, alert at 50% and 80% disk usage
  • ✅ Test purge jobs in staging with production-sized data
  • ✅ Set PostgreSQL disk alerts before they reach 100%
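The 50%/80% disk alerts in the checklist need a way to read disk usage from the app. A minimal sketch that shells out to POSIX `df -P` (Linux/macOS only; the thresholds and the alerting hook are left to your monitoring setup):

```typescript
import { execSync } from 'child_process'

// Percent of disk used on the filesystem containing `path`,
// parsed from the "Capacity" column of POSIX `df -P` output.
function diskUsagePercent(path = '/'): number {
  const out = execSync(`df -P ${path}`).toString()
  const line = out.trim().split('\n').pop()!  // last line = the filesystem row
  const pct = line.split(/\s+/)[4]            // e.g. "42%"
  return parseInt(pct, 10)
}

// e.g. alert when diskUsagePercent('/var/lib/postgresql') exceeds 80
```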

Conclusion

Unbounded table growth is entirely preventable with a data retention policy and a scheduled cleanup job. The time to add both is when you create the table, not when the disk is 95% full. For high-volume tables, partitioning by time is the cleanest solution: dropping an old partition is instant (no row-by-row delete), and queries that filter by date automatically use partition pruning. At minimum, add a created_at index and schedule a daily purge job — that's all it takes to prevent a 3 AM disk-full incident.