Log Aggregation at Scale — Structured Logging, Loki, and Querying Millions of Log Lines

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Logs are the first place engineers dig when systems fail. Unstructured logs are unsearchable. Querying millions of lines is slow. Retention costs explode. This post walks through structured logging from code to ingestion, Grafana Loki's approach to log aggregation, smart sampling strategies for high-volume services, and correlation IDs that tie logs across services together.
- Structured Logging with Pino
- Log Levels and When to Use Them
- Grafana Loki Setup
- LogQL Queries for Debugging
- Log Sampling for High-Volume Services
- Correlation IDs Across Services
- Log Retention Cost Optimization
- Production Logging Checklist
- Conclusion
Structured Logging with Pino
Pino is a fast, low-overhead JSON logger for Node.js; in benchmarks it significantly outperforms Winston and Bunyan:
Installation:
npm install pino pino-http pino-pretty
Basic setup:
// src/logger.ts
import pino from 'pino';

// Development: pretty-printed
// Production: JSON to stdout (for the log aggregator)
const logger = pino(
  {
    level: process.env.LOG_LEVEL || 'info',
    timestamp: pino.stdTimeFunctions.isoTime, // ISO timestamps (pino defaults to epoch ms)
    base: {
      service: 'my-api',
      environment: process.env.NODE_ENV || 'development',
      version: process.env.VERSION || '1.0.0',
    },
  },
  process.env.NODE_ENV === 'production'
    ? pino.destination({ sync: false }) // Async for performance
    : pino.transport({
        target: 'pino-pretty',
        options: {
          colorize: true,
          translateTime: 'HH:MM:ss Z',
          ignore: 'pid,hostname',
        },
      }),
);

export default logger;
Production JSON output (what Loki ingests):
{
  "level": 30,
  "time": "2026-03-15T10:30:45.123Z",
  "pid": 12345,
  "hostname": "api-pod-1",
  "service": "my-api",
  "environment": "production",
  "version": "2.1.0",
  "msg": "Payment processed",
  "orderId": "order_123",
  "customerId": "cust_456",
  "amount": 99.99,
  "duration_ms": 234,
  "processor": "stripe"
}
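Fields like card numbers or passwords must never reach this output. Pino supports this natively through its `redact` option (a list of paths plus a censor string). As a standalone illustration of the idea, here is a minimal recursive masker for log objects; the key list is illustrative, not a pino default:

```typescript
// Recursively mask a fixed set of sensitive keys before a log object is
// serialized. Key names below are illustrative examples.
const SENSITIVE_KEYS = new Set(['password', 'cardNumber', 'authorization', 'ssn']);

function redact(obj: unknown): unknown {
  if (Array.isArray(obj)) return obj.map(redact);
  if (obj !== null && typeof obj === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(value);
    }
    return out;
  }
  return obj;
}

// The card number is masked; everything else passes through unchanged.
const safe = redact({ orderId: 'order_123', payment: { cardNumber: '4242424242424242' } });
```

With pino itself, the equivalent is `pino({ redact: { paths: ['payment.cardNumber'], censor: '[REDACTED]' } })`.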
HTTP middleware with context:
// src/middleware/logging.ts
import pinoHttp from 'pino-http';
import logger from '../logger';

export const httpLogger = pinoHttp({
  logger,
  customLogLevel: (req, res, err) => {
    // Context-aware log levels
    if (res.statusCode >= 500 || err) return 'error';
    if (res.statusCode >= 400) return 'warn';
    if (res.statusCode >= 300) return 'info';
    return 'debug';
  },
  customSuccessMessage: (req, res) => {
    return `${req.method} ${req.url} ${res.statusCode}`;
  },
  customErrorMessage: (req, res, err) => {
    return `${req.method} ${req.url} ${res.statusCode} - ${err.message}`;
  },
});

// Express middleware
app.use(httpLogger);
Per-request context logging:
// src/routes/payments.ts
import logger from '../logger';
import { generateRequestId } from '../utils/correlation';

app.post('/api/payments', (req, res) => {
  const startTime = Date.now();
  const requestId = generateRequestId();
  const childLogger = logger.child({
    requestId,
    customerId: req.body.customerId,
    orderId: req.body.orderId,
  });

  // All logs from this handler include requestId
  childLogger.info('Processing payment request');

  try {
    const result = processPayment(req.body);
    childLogger.info(
      {
        event: 'payment_success',
        chargeId: result.chargeId,
        duration_ms: Date.now() - startTime,
      },
      'Payment processed successfully',
    );
    res.json(result);
  } catch (error) {
    childLogger.error(
      {
        event: 'payment_failed',
        error: error.message,
        errorCode: error.code,
      },
      'Payment processing failed',
    );
    res.status(500).json({ error: error.message });
  }
});
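Passing the child logger through every function call gets tedious in deep call stacks. Node's built-in `AsyncLocalStorage` can carry the request-scoped logger implicitly; a minimal sketch (the `runWithLogger`/`getLogger` helpers are illustrative, not part of pino):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Minimal logger shape; in practice this would be a pino child logger.
interface ContextLogger {
  info(msg: string): void;
  bindings: Record<string, string>;
}

const storage = new AsyncLocalStorage<ContextLogger>();

// Middleware calls this once per request with a child logger.
function runWithLogger<T>(logger: ContextLogger, fn: () => T): T {
  return storage.run(logger, fn);
}

// Any function, however deep in the call stack, retrieves the
// request-scoped logger without it being threaded as a parameter.
function getLogger(): ContextLogger {
  const logger = storage.getStore();
  if (!logger) throw new Error('no logger in context');
  return logger;
}

function deepBusinessLogic() {
  getLogger().info('charging card'); // carries requestId bindings implicitly
}
```

Usage: the correlation middleware calls `runWithLogger(childLogger, next)` once; handlers and services call `getLogger()` anywhere below.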
Log Levels and When to Use Them
Log level hierarchy:
TRACE (10): Very detailed debugging
DEBUG (20): Debugging information for development
INFO (30): General informational messages
WARN (40): Warnings for recoverable issues
ERROR (50): Errors for failures
FATAL (60): Fatal errors causing shutdown
When to use each:
const logger = pino();
// FATAL: System cannot continue
logger.fatal('Database connection lost, exiting');
// ERROR: Operation failed, but system continues
logger.error({ error }, 'Failed to send email');
// WARN: Something unexpected, but recoverable
logger.warn({ statusCode: 503 }, 'Stripe API slow');
// INFO: Important business events (payment, login)
logger.info({ orderId, amount }, 'Payment processed');
// DEBUG: Development diagnostics (disabled in production)
logger.debug({ userId }, 'User session created');
// TRACE: Extreme detail (rarely used)
logger.trace({ memory }, 'Memory state snapshot');
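The numeric values are what pino writes to the `level` field, and they make threshold filtering a single comparison. A simplified sketch of the check a logger performs before emitting a record:

```typescript
// Pino's standard level numbers.
const LEVELS: Record<string, number> = {
  trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60,
};

// A logger configured at 'info' emits a record only when the record's
// level number meets or exceeds the configured threshold.
function shouldEmit(recordLevel: string, configuredLevel: string): boolean {
  return LEVELS[recordLevel] >= LEVELS[configuredLevel];
}
```

So with `LOG_LEVEL=info`, `shouldEmit('debug', 'info')` is false and DEBUG records are dropped before serialization, which is why leaving DEBUG off in production is effectively free.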
Cost control: log events, not steps, and choose levels appropriately
// ❌ BAD: Log everything
app.get('/api/users/:id', (req, res) => {
  logger.info('Received request');
  logger.info('Database query started');
  logger.info({ rowCount: rows.length }, 'Query returned');
  logger.info('Response sent');
  // 4 logs per request × 1000 req/s = 4000 logs/s stored
  // Cost: $4000/month
});

// ✓ GOOD: Log events, not steps
app.get('/api/users/:id', async (req, res) => {
  const start = Date.now();
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.params.id]);
  logger.info(
    { userId: req.params.id, duration_ms: Date.now() - start },
    'User fetched',
  );
  // 1 log per request × 1000 req/s = 1000 logs/s stored
  // Cost: $1000/month (4x cheaper)
});
Grafana Loki Setup
Loki is a log aggregation system that indexes only labels, not log content, which keeps it cheap at scale and makes it a natural fit for Kubernetes:
Kubernetes deployment (DaemonSet for log collection):
# Add Loki Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki stack (Loki + Promtail)
helm install loki grafana/loki-stack \
--namespace observability \
--create-namespace \
--values loki-values.yaml
Loki configuration:
# loki-values.yaml
loki:
  auth_enabled: false # Single-tenant; enable multi-tenancy (X-Scope-OrgID header) for shared clusters
  ingester:
    max_chunk_age: 1h
    chunk_idle_period: 15m
  limits_config:
    # Prevent abuse
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 720h # 30 days
    max_streams_per_user: 10000
    max_global_streams_per_user: 100000
  schema_config:
    configs:
      - from: 2020-10-24
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h

# Promtail: Log shipper (runs on every node)
promtail:
  enabled: true
  config:
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Extract pod labels as Loki labels
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        # Parse JSON logs to extract fields
        pipeline_stages:
          - json:
              expressions:
                level: level
                environment: environment
                message: msg
                requestId: requestId
          # Promote only low-cardinality fields to labels: a requestId or
          # message label would create one stream per value and overwhelm Loki
          - labels:
              level:
              environment:
Verify Loki is receiving logs:
# Port forward
kubectl port-forward -n observability svc/loki 3100:3100
# Query Loki API
curl -s 'http://localhost:3100/loki/api/v1/query' \
--data-urlencode 'query={app="api", environment="production"}' | jq
# Access Loki UI (via Grafana)
# http://grafana.company.com → Explore → Select Loki data source
LogQL Queries for Debugging
LogQL is Loki's query language (similar to PromQL):
Basic queries:
# All logs from api service in production
{app="api", environment="production"}
# Filter on a field parsed from the JSON log body
{app="api"} | json | level="error"
# Search by pattern
{app="api"} |= "payment"
# Regex search
{app="api"} |~ "Error: .*timeout"
# Exclude pattern
{app="api"} != "health check"
Time-series aggregations:
# Count error logs per minute (last 1 hour)
count_over_time({app="api", level="error"}[1m])
# Rate of errors (errors per second)
rate({app="api", level="error"}[5m])
# Average response time (from the duration_ms field)
avg_over_time({app="api"} | json | unwrap duration_ms [5m]) by (endpoint)
Complex debugging queries:
# Find endpoints whose P99 latency exceeds 500ms
quantile_over_time(0.99,
  {app="api"} | json | unwrap duration_ms [5m]
) by (endpoint) > 500

# Error rate as a fraction of all requests; alert if > 5%
sum(rate({app="api", level="error"}[5m]))
  / on() group_left()
sum(rate({app="api"}[5m]))
  > 0.05

# Find requests from a specific customer
{app="api"} | json | customerId="cust_123"

# Trace a request across services (requestId is a parsed field, not a label)
{environment="production"} | json | requestId="req_abc123"

# Find payment failures with context
{app="api"}
  | json
  | event="payment_failed"
  | line_format "{{ .time }} {{ .customerId }} {{ .error }}"
Log Sampling for High-Volume Services
Logging everything is expensive. Sample strategically:
Sampling by log level:
// Sample low-importance logs
const shouldSampleDebugLog = Math.random() < 0.1; // 10% of DEBUG logs
if (level === 'DEBUG' && !shouldSampleDebugLog) {
  return; // Don't log this
}
logger.debug({ ...data }, 'Message');
Sampling by request path:
// Sample health checks (high volume, low value)
const healthCheckRoutes = ['/health', '/readiness', '/liveness'];
const shouldLog = !healthCheckRoutes.includes(req.path) ||
  Math.random() < 0.01; // 1% of health checks
if (shouldLog) {
  logger.info({ path: req.path }, 'HTTP request');
}
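`Math.random()` makes an independent decision per log line, so one request's logs can be split across the keep/drop boundary. Hashing a stable key such as the request ID makes the decision deterministic: a request is either fully logged or fully sampled out. A sketch (the FNV-1a hash is one reasonable choice; any stable hash works):

```typescript
// FNV-1a 32-bit hash: stable, fast, uniform enough for sampling decisions.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Keep a request's logs iff its ID hashes into the sampled fraction.
// Same requestId => same decision, so requests are never half-logged.
function shouldSample(requestId: string, rate: number): boolean {
  return fnv1a(requestId) / 0xffffffff < rate;
}
```

The same key-hashing trick is how tracing systems keep whole traces together when sampling.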
Probabilistic sampling via a custom transport (pino has no built-in sampling; the transport package below is hypothetical, shown only to illustrate the configuration shape):
// Hypothetical sampling transport — illustrative package name
import pino from 'pino';
const logger = pino({
  transport: {
    target: 'pino-sampling-transport', // not a real package; write your own transport
    options: {
      samplingConfig: {
        debug: 0.01, // 1% of DEBUG
        info: 1.0, // 100% of INFO
        warn: 1.0, // 100% of WARN
        error: 1.0, // 100% of ERROR
      },
    },
  },
});
Error-aware sampling (keep errors at a higher rate):
// Keep all ERROR and FATAL
// Sample INFO, DEBUG based on context
const sampleRate: Record<string, number> = {
  error: 1.0,
  warn: 0.5,
  info: 0.1,
  debug: 0.01,
};
if (Math.random() < sampleRate[level]) {
  logger[level](message); // pino has no generic .log(); call the level method
}
Cost estimation:
Service: 5000 req/s
Log target: 2 logs per request (one start, one end)
= 10,000 logs/second
Loki pricing (example): $0.50 per GB
Average log size: 1KB
10,000 logs/s × 1KB × 86,400s/day × 30 days/month
= 25,920 GB/month (unsampled)
= $12,960/month (unsampled)
With sampling:
INFO logs: 10% (1,000/s)
ERROR logs: 100% (50/s)
= 1,050 logs/s
= ~2,722 GB/month
= ~$1,361/month (≈10x cheaper)
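These estimates are straight multiplication; a small helper keeps them reproducible when tuning sampling rates (the 1 KB average line size and $0.50/GB price are the example assumptions above, using decimal GB):

```typescript
// Estimate monthly log storage cost from post-sampling volume.
// logsPerSecond: logs actually shipped; avgKB: average line size in KB.
function monthlyLogCostUSD(
  logsPerSecond: number,
  avgKB = 1,
  pricePerGB = 0.5,
): number {
  // seconds per 30-day month = 86,400 × 30; 1e6 KB per decimal GB
  const gbPerMonth = (logsPerSecond * avgKB * 86_400 * 30) / 1_000_000;
  return gbPerMonth * pricePerGB;
}

// Unsampled: monthlyLogCostUSD(10_000) → 12960 ($12,960/month)
// Sampled:   monthlyLogCostUSD(1_050)  → 1360.8 (~$1,361/month)
```

Re-running the helper after each sampling change gives an instant cost projection before the bill arrives.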
Correlation IDs Across Services
Correlation IDs tie logs across microservices:
Generate correlation ID:
// src/middleware/correlation-id.ts
import { v4 as uuidv4 } from 'uuid';
import logger from '../logger';

export function generateCorrelationId() {
  return uuidv4();
}

// Express middleware
app.use((req, res, next) => {
  // Use existing correlation ID if provided
  const correlationId =
    req.headers['x-correlation-id'] || generateCorrelationId();

  // Add to request object
  req.correlationId = correlationId;

  // Add to response headers (for client to use)
  res.setHeader('X-Correlation-ID', correlationId);

  // Add to all logs from this request
  const childLogger = logger.child({ correlationId });
  req.logger = childLogger;

  next();
});
Propagate across services:
// Service A: Make request to Service B
async function callServiceB(correlationId: string) {
  const response = await fetch('http://service-b/api/users', {
    headers: {
      'X-Correlation-ID': correlationId, // Pass along req.correlationId
    },
  });
  return response.json();
}

// Service B: Receives correlation ID
app.get('/api/users', (req, res) => {
  const correlationId = req.headers['x-correlation-id'];
  const requestLogger = logger.child({ correlationId }); // child of the shared pino instance
  requestLogger.info('Received request from Service A');
  // All logs include the same correlationId
});
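Adding the header by hand on every outbound call is easy to forget. A tiny helper that merges the correlation header into any outgoing request options keeps propagation consistent (the helper name is illustrative):

```typescript
// Merge the correlation header into outbound request headers so every
// downstream call carries the same ID. Shape matches fetch's RequestInit.
function withCorrelation(
  correlationId: string,
  init: { headers?: Record<string, string> } = {},
) {
  return {
    ...init,
    headers: { ...(init.headers ?? {}), 'X-Correlation-ID': correlationId },
  };
}

// Usage:
// fetch('http://service-b/api/users', withCorrelation(req.correlationId));
```

Wrapping your HTTP client once beats auditing every call site; many teams do the same with an axios interceptor or a fetch wrapper.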
Query across services:
# Find all logs for a specific request (all services)
{environment="production"} | json | correlationId="550e8400-e29b-41d4-a716-446655440000"
# Result:
# Service A: 10:30:00.001 Received request
# Service A: 10:30:00.050 Calling Service B
# Service B: 10:30:00.052 Received request from Service A
# Service B: 10:30:00.150 Database query completed
# Service A: 10:30:00.155 Response received
# Service A: 10:30:00.160 Response sent to client
# Total latency: 159ms, visible across both services
Log Retention Cost Optimization
Calculate retention needs:
Service: 10,000 req/s
Logs per request: 2, at ~1 KB each
= 20,000 logs/second × 86,400 seconds/day ≈ 1.73 TB/day
≈ 52 TB/month (unsampled)
At $0.50/GB ≈ $26,000/month for 30-day retention
Tiered retention policies (assuming sampling first reduces each level to ~1.7 GB/day):
- INFO logs: 7 days ≈ 12 GB stored × $0.50 = $6
- WARN logs: 30 days ≈ 51 GB stored × $0.50 = $25.50
- ERROR logs: 90 days ≈ 153 GB stored × $0.50 = $76.50
- Total: ~$108/month for mixed retention
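Each tier's cost is daily volume × retention days × price per GB. A helper to compare policies (the ~1.7 GB/day per-level volume is an assumption chosen to roughly reproduce tier figures like those above):

```typescript
// Steady-state storage cost of one retention tier:
// stored GB at steady state = daily volume × retention window.
function tierCostUSD(
  dailyGB: number,
  retentionDays: number,
  pricePerGB = 0.5,
): number {
  return dailyGB * retentionDays * pricePerGB;
}

// Mixed-retention total, assuming ~1.7 GB/day per level after sampling:
const total =
  tierCostUSD(1.7, 7) +  // INFO:   7 days ≈ $6
  tierCostUSD(1.7, 30) + // WARN:  30 days ≈ $25.50
  tierCostUSD(1.7, 90);  // ERROR: 90 days ≈ $76.50
// total ≈ $108/month
```

The asymmetry is the point: errors are rare but valuable, so 90-day error retention costs far less than 90-day everything retention.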
Optimize by log level:
# Loki retention config (enforced by the compactor)
compactor:
  retention_enabled: true
limits_config:
  retention_period: 30d # Default for INFO/DEBUG
  # Per-stream retention, matched by label selector
  retention_stream:
    - selector: '{level="error"}' # Keep errors longer
      priority: 1
      period: 90d
    - selector: '{endpoint="/health"}' # Keep health checks shorter (requires an endpoint label)
      priority: 1
      period: 1d
    - selector: '{level="debug"}' # Keep DEBUG shorter
      priority: 1
      period: 3d

# Compression (ingester setting)
ingester:
  chunk_encoding: snappy # Reduce storage 40-50%
Archive old logs (further cost reduction):
# Move chunks to S3 (cheaper object storage)
loki:
  schema_config:
    configs:
      - from: 2020-10-24
        store: boltdb-shipper
        object_store: s3 # Chunks go to S3 instead of local disk
        schema: v11
        index:
          prefix: index_
          period: 24h
  storage_config:
    aws:
      s3: s3://logs-archive/loki
# S3 costs: ~$0.023/GB/month (vs $0.50/GB for hot storage)
Production Logging Checklist
# deployment/logging-readiness.yaml
application:
  - "✓ Structured logging (pino with JSON)"
  - "✓ Correlation IDs in all logs"
  - "✓ Appropriate log levels (no DEBUG in production)"
  - "✓ Sensitive data redacted"
  - "✓ Error context captured (stack trace, error code)"
infrastructure:
  - "✓ Loki deployed with persistent storage"
  - "✓ Promtail collecting all pod logs"
  - "✓ Log parsing configured (JSON extraction)"
  - "✓ Retention policy set (tiered by level)"
observability:
  - "✓ Error rate dashboard created"
  - "✓ Slow request queries built"
  - "✓ Alert on error rate spike"
  - "✓ Runbook for common errors"
optimization:
  - "✓ Sampling configured (low-value logs)"
  - "✓ Cost estimated and baseline established"
  - "✓ Compression enabled"
  - "✓ Archive strategy for old logs"
Conclusion
Structured logging is the foundation of production observability. Pino makes JSON logging fast and ergonomic. Loki aggregates these logs at scale. LogQL queries pinpoint issues across microservices. Correlation IDs tie everything together. Sampling and retention policies keep costs linear instead of exponential. Combined, these practices transform logs from noise into a precise debugging tool that reduces mean-time-to-resolution from hours to minutes.