Dead Letter Queue Ignored for Months — The Silent Data Graveyard

- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
The dead letter queue (DLQ) is where messages go when processing fails repeatedly. It's meant to be a safety net — a place to investigate failures without losing data.
In practice, most DLQs are configured, forgotten, and slowly fill with critical business events that nobody knows are failing.
- What Ends Up in the DLQ
- Fix 1: Monitor and Alert on DLQ Depth
- Fix 2: Enrich DLQ Messages with Failure Context
- Fix 3: Automatic DLQ Replay
- Fix 4: Classify DLQ Messages Automatically
- Fix 5: Set DLQ Expiry — Don't Keep Failed Messages Forever
- DLQ Health Checklist
- Conclusion
What Ends Up in the DLQ
Messages are sent to the DLQ when they:
- Fail processing and exceed the max retry count
- Exceed the message TTL (time to live)
- Have a malformed payload the consumer can't parse
- Trigger an unhandled exception every time
- Are rejected explicitly by the consumer
// RabbitMQ: configure a DLQ on your main queue
await channel.assertExchange('dlx', 'direct', { durable: true })
await channel.assertQueue('orders.dlq', { durable: true })
await channel.bindQueue('orders.dlq', 'dlx', 'orders')
await channel.assertQueue('orders', {
  durable: true,
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'orders',
    'x-message-ttl': 300_000, // Messages expire after 5 minutes and are dead-lettered
  },
})
// Note: RabbitMQ has no 'x-max-retries' queue argument. Retry counting must
// live in the consumer (see Fix 2); messages are dead-lettered on rejection
// (nack with requeue=false), TTL expiry, or queue overflow.
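The SQS equivalent is a redrive policy on the main queue rather than a dead-letter exchange. A minimal sketch; the queue ARN is a placeholder, and the helper just builds the JSON string SQS expects for the `RedrivePolicy` attribute:

```typescript
// SQS sketch: a redrive policy moves a message to the DLQ after N failed receives.
// SQS expects the policy as a JSON string attribute.
function buildRedrivePolicy(dlqArn: string, maxReceiveCount: number): string {
  return JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount })
}

// With @aws-sdk/client-sqs this would be applied roughly as:
//   await sqs.send(new SetQueueAttributesCommand({
//     QueueUrl: mainQueueUrl,
//     Attributes: {
//       RedrivePolicy: buildRedrivePolicy('arn:aws:sqs:us-east-1:123456789012:orders-dlq', 3),
//     },
//   }))
```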
Fix 1: Monitor and Alert on DLQ Depth
// The most important fix: know when things land in the DLQ
// (metrics, alerting, and pagerduty are app-specific clients)
import * as amqp from 'amqplib'

async function monitorDLQ(connection: amqp.Connection) {
  const channel = await connection.createChannel()

  setInterval(async () => {
    const queueInfo = await channel.checkQueue('orders.dlq')
    const depth = queueInfo.messageCount

    metrics.gauge('dlq.depth', depth, { queue: 'orders' })

    if (depth > 0) {
      // DLQ should always be empty in a healthy system
      alerting.warning(`DLQ has ${depth} failed messages in orders.dlq`)
    }

    if (depth > 100) {
      alerting.critical(`DLQ critical: ${depth} failed orders — immediate investigation needed`)
      // Page on-call
      pagerduty.trigger({
        summary: `Orders DLQ has ${depth} failed messages`,
        severity: 'critical',
      })
    }
  }, 60_000) // Check every minute
}
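The depth thresholds in the monitor above can be factored into a small pure helper so the RabbitMQ and SQS monitors share one alerting policy (the function name is hypothetical):

```typescript
// Map DLQ depth to an alert severity; thresholds mirror the monitor above.
type Severity = 'ok' | 'warning' | 'critical'

function severityForDepth(depth: number): Severity {
  if (depth > 100) return 'critical' // page on-call
  if (depth > 0) return 'warning'    // a healthy DLQ is empty
  return 'ok'
}
```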
// SQS: monitor via CloudWatch (alarm when ApproximateNumberOfMessagesVisible > 0)
// Terraform:
// resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
//   alarm_name          = "orders-dlq-not-empty"
//   metric_name         = "ApproximateNumberOfMessagesVisible"
//   namespace           = "AWS/SQS"
//   dimensions          = { QueueName = aws_sqs_queue.dlq.name }
//   comparison_operator = "GreaterThanThreshold"
//   threshold           = 0
//   alarm_actions       = [aws_sns_topic.alerts.arn]
// }
Fix 2: Enrich DLQ Messages with Failure Context
// When dead-lettering a message, include WHY it failed
// Makes debugging much faster when you finally look at it
async function handleWithDLQ(
  channel: amqp.Channel,
  message: amqp.ConsumeMessage,
  handler: () => Promise<void>
) {
  const maxRetries = 3
  const headers = message.properties.headers ?? {}
  const retryCount = (headers['x-retry-count'] ?? 0) as number
  const failures = (headers['x-failures'] ?? []) as string[]

  try {
    await handler()
    channel.ack(message)
  } catch (err) {
    const error = err as Error
    const enrichedFailures = [...failures, `Attempt ${retryCount + 1}: ${error.message}`]

    if (retryCount < maxRetries) {
      // Retry with exponential backoff
      const delay = Math.min(1000 * Math.pow(2, retryCount), 30_000)
      setTimeout(() => {
        channel.publish('', message.fields.routingKey, message.content, {
          headers: {
            ...headers,
            'x-retry-count': retryCount + 1,
            'x-failures': enrichedFailures,
            'x-first-failed-at': headers['x-first-failed-at'] ?? Date.now(),
          },
        })
      }, delay)
      channel.ack(message) // Ack original, republish delayed copy
    } else {
      // Dead-letter with full failure context
      channel.publish('dlx', 'orders', message.content, {
        headers: {
          ...headers,
          'x-failures': enrichedFailures,
          'x-dead-lettered-at': Date.now(),
          'x-original-queue': message.fields.routingKey,
          'x-error-type': error.constructor.name,
          'x-final-error': error.message,
          'x-stack': error.stack,
        },
      })
      channel.ack(message)
    }
  }
}
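The backoff schedule used above is worth pulling out and sanity-checking on its own. With the same formula, three retries wait 1s, 2s, and 4s before the message is dead-lettered:

```typescript
// Exponential backoff with a 30s cap, same formula as handleWithDLQ above
function retryDelayMs(retryCount: number, baseMs = 1000, capMs = 30_000): number {
  return Math.min(baseMs * Math.pow(2, retryCount), capMs)
}

// retryDelayMs(0) → 1000, retryDelayMs(1) → 2000, retryDelayMs(2) → 4000
```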
Fix 3: Automatic DLQ Replay
// Build a replay tool to reprocess DLQ messages after fixing the bug
class DLQReplayer {
  constructor(
    private connection: amqp.Connection,
    private sourceDLQ: string,
    private targetQueue: string
  ) {}

  async replay(options: {
    maxMessages?: number
    dryRun?: boolean
    filter?: (message: any) => boolean
  } = {}) {
    const channel = await this.connection.createChannel()
    channel.prefetch(1)

    let processed = 0
    let skipped = 0
    let errors = 0
    const maxMessages = options.maxMessages ?? Infinity

    await new Promise<void>((resolve) => {
      let consumerTag = ''
      channel.consume(this.sourceDLQ, async (msg) => {
        if (!msg) {
          resolve()
          return
        }
        if (processed >= maxMessages) {
          channel.nack(msg, false, true) // Put back: batch limit reached
          await channel.cancel(consumerTag)
          resolve()
          return
        }

        try {
          const payload = JSON.parse(msg.content.toString())
          const headers = msg.properties.headers ?? {}

          // Apply filter if provided
          if (options.filter && !options.filter(payload)) {
            channel.nack(msg, false, false) // Discard (or dead-letter again)
            skipped++
            return
          }

          if (options.dryRun) {
            console.log('DRY RUN — would replay:', JSON.stringify(payload, null, 2))
            console.log('Original failures:', headers['x-failures'])
            channel.nack(msg, false, true) // Put back
            processed++
            return
          }

          // Republish to original queue for reprocessing
          channel.publish('', this.targetQueue, msg.content, {
            persistent: true,
            headers: {
              ...headers,
              'x-replayed-at': Date.now(),
              'x-replay-count': ((headers['x-replay-count'] ?? 0) as number) + 1,
            },
          })
          channel.ack(msg)
          processed++
        } catch {
          // Unparseable message: requeue and count it.
          // Caution: a poison message will keep being redelivered,
          // so stop the replay if this counter climbs.
          channel.nack(msg, false, true)
          errors++
        }
      }).then(({ consumerTag: tag }) => { consumerTag = tag })
    })

    console.log(`Replay complete: ${processed} replayed, ${skipped} skipped, ${errors} errors`)
    return { processed, skipped, errors }
  }
}
// Usage: after deploying a bug fix
const replayer = new DLQReplayer(connection, 'orders.dlq', 'orders')
// Dry run first
await replayer.replay({ dryRun: true, maxMessages: 10 })
// Replay in batches
await replayer.replay({ maxMessages: 1000 })
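The `filter` hook makes selective replays possible. A hypothetical predicate that replays only high-value orders; the payload shape and field names are assumptions, adjust to your actual message schema:

```typescript
// Hypothetical payload shape for an orders message
interface OrderMessage {
  orderId: string
  totalCents: number
}

// Replay only orders worth $100 or more
function highValueOrdersOnly(payload: OrderMessage): boolean {
  return payload.totalCents >= 10_000
}

// Usage: await replayer.replay({ filter: highValueOrdersOnly, maxMessages: 500 })
```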
Fix 4: Classify DLQ Messages Automatically
// Not all DLQ messages are equal — auto-classify to triage faster
enum DLQCategory {
  PARSING_ERROR = 'parsing_error',       // Malformed JSON — fix producer
  VALIDATION_ERROR = 'validation_error', // Schema mismatch — fix producer or consumer
  DEPENDENCY_ERROR = 'dependency_error', // DB/API was down — safe to replay
  BUSINESS_ERROR = 'business_error',     // Logic error — needs investigation
  UNKNOWN = 'unknown',
}

function classifyDLQMessage(message: any, headers: any): DLQCategory {
  const errorType = headers['x-error-type'] ?? ''
  const finalError = headers['x-final-error'] ?? ''
  const failures: string[] = headers['x-failures'] ?? []

  if (errorType === 'SyntaxError' || finalError.includes('JSON')) {
    return DLQCategory.PARSING_ERROR
  }
  if (errorType === 'ValidationError' || finalError.includes('schema')) {
    return DLQCategory.VALIDATION_ERROR
  }
  if (
    finalError.includes('ECONNREFUSED') ||
    finalError.includes('timeout') ||
    finalError.includes('Connection')
  ) {
    return DLQCategory.DEPENDENCY_ERROR // Safe to replay after service recovery
  }
  if (failures.length > 0 && failures.every(f => f.includes('Cannot read'))) {
    return DLQCategory.BUSINESS_ERROR // Consistent TypeErrors point at handler logic
  }
  return DLQCategory.UNKNOWN
}
// Dashboard: show DLQ breakdown by category
// DEPENDENCY_ERROR → auto-replay after service recovery
// PARSING_ERROR → alert producer team
// BUSINESS_ERROR → create JIRA ticket
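The routing in the comments above can be encoded as a lookup table so triage is mechanical (the action names are hypothetical; the category strings match the enum values in the classifier):

```typescript
// Map each DLQ category to a triage action
type DLQAction = 'auto_replay' | 'alert_producer_team' | 'create_ticket' | 'manual_review'

const categoryActions: Record<string, DLQAction> = {
  dependency_error: 'auto_replay',       // safe to replay after service recovery
  parsing_error: 'alert_producer_team',  // producer sent a malformed payload
  validation_error: 'alert_producer_team',
  business_error: 'create_ticket',       // needs human investigation
  unknown: 'manual_review',
}

function actionFor(category: string): DLQAction {
  return categoryActions[category] ?? 'manual_review'
}
```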
Fix 5: Set DLQ Expiry — Don't Keep Failed Messages Forever
// DLQ messages older than X days are unrecoverable anyway
// Auto-expire to prevent storage bloat
channel.assertQueue('orders.dlq', {
  durable: true,
  arguments: {
    // After 7 days, messages are deleted — this forces teams to
    // investigate before they expire.
    'x-message-ttl': 7 * 24 * 60 * 60 * 1000, // 7 days
    // Note: RabbitMQ rejects re-declaration with different arguments,
    // so every assertQueue('orders.dlq', ...) must use the same set.
  },
})
// Better: archive to S3 before expiry
// AWS SQS DLQ → Lambda → S3 for long-term audit trail
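A sketch of that archive step: before messages expire, a consumer (or Lambda on the SQS side) writes each one to S3 under a date-partitioned key. The bucket name and key scheme are assumptions; the key-building part is pure and shown here:

```typescript
// Build a date-partitioned S3 key for an expiring DLQ message
// (key scheme is an assumption, adjust to your audit conventions)
function archiveKey(queue: string, messageId: string, deadLetteredAt: number): string {
  const day = new Date(deadLetteredAt).toISOString().slice(0, 10) // YYYY-MM-DD
  return `dlq-archive/${queue}/${day}/${messageId}.json`
}

// With @aws-sdk/client-s3 the archiver would then do roughly:
//   await s3.send(new PutObjectCommand({ Bucket: 'dlq-archive', Key: key, Body: message.content }))
```

Date partitioning keeps the archive queryable (e.g. with S3 Select or Athena) and makes lifecycle rules for old partitions trivial.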
DLQ Health Checklist
- ✅ Every queue has a DLQ configured
- ✅ CloudWatch/Prometheus alert fires when DLQ depth > 0
- ✅ On-call rotation receives DLQ alerts
- ✅ DLQ messages enriched with failure reason + stack trace
- ✅ Replay tool built and tested before it's needed
- ✅ DLQ messages classified by category (parsing vs dependency vs business)
- ✅ DLQ has TTL (7-14 days) with S3 archive for audit trail
- ✅ Weekly DLQ review in engineering standup
Conclusion
A DLQ without monitoring is worse than no DLQ — it creates a false sense of safety while silently losing business-critical data. The fix is simple: alert immediately when anything lands in the DLQ, enrich messages with failure context so debugging is fast, build a replay tool before you need it, and classify messages so your team can triage quickly. A healthy system should have an empty DLQ. Anything in it represents a real failure that needs attention today, not in three months.