Log Aggregation at Scale — Structured Logging, Loki, and Querying Millions of Log Lines

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Logs are the first place engineers dig when systems fail. Unstructured logs are unsearchable. Querying millions of lines is slow. Retention costs explode. This post walks through structured logging from code to ingestion, Grafana Loki's approach to log aggregation, smart sampling strategies for high-volume services, and correlation IDs that tie logs across services together.
- Structured Logging with Pino
- Log Levels and When to Use Them
- Grafana Loki Setup
- LogQL Queries for Debugging
- Log Sampling for High-Volume Services
- Correlation IDs Across Services
- Log Retention Cost Optimization
- Production Logging Checklist
- Conclusion
Structured Logging with Pino
Pino is a fast, low-overhead JSON logger for Node.js; in benchmarks it significantly outperforms Winston and Bunyan:
Installation:
npm install pino pino-http pino-pretty
Basic setup:
// src/logger.ts
import pino from 'pino';

// Development: pretty-printed
// Production: JSON to stdout (for the log aggregator)
const logger = pino(
  {
    level: process.env.LOG_LEVEL || 'info',
    timestamp: pino.stdTimeFunctions.isoTime, // ISO timestamps (pino defaults to epoch ms)
    base: {
      service: 'my-api',
      environment: process.env.NODE_ENV || 'development',
      version: process.env.VERSION || '1.0.0',
    },
  },
  process.env.NODE_ENV === 'production'
    ? pino.destination({ sync: false }) // Async for performance
    : pino.transport({
        target: 'pino-pretty',
        options: {
          colorize: true,
          translateTime: 'HH:MM:ss Z',
          ignore: 'pid,hostname',
        },
      }),
);

export default logger;
Production JSON output (what Loki ingests):
{
  "level": 30,
  "time": "2026-03-15T10:30:45.123Z",
  "pid": 12345,
  "hostname": "api-pod-1",
  "service": "my-api",
  "environment": "production",
  "version": "2.1.0",
  "msg": "Payment processed",
  "orderId": "order_123",
  "customerId": "cust_456",
  "amount": 99.99,
  "duration_ms": 234,
  "processor": "stripe"
}
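Fields like card numbers or passwords must never reach this output. Pino supports this natively through its `redact` option (a list of paths plus a censor string). As a standalone illustration of the idea, here is a minimal recursive masker for log objects; the key list is illustrative, not a pino default:

```typescript
// Recursively mask a fixed set of sensitive keys before a log object is
// serialized. Key names below are illustrative examples.
const SENSITIVE_KEYS = new Set(['password', 'cardNumber', 'authorization', 'ssn']);

function redact(obj: unknown): unknown {
  if (Array.isArray(obj)) return obj.map(redact);
  if (obj !== null && typeof obj === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(value);
    }
    return out;
  }
  return obj;
}

// The card number is masked; everything else passes through unchanged.
const safe = redact({ orderId: 'order_123', payment: { cardNumber: '4242424242424242' } });
```

With pino itself, the equivalent is `pino({ redact: { paths: ['payment.cardNumber'], censor: '[REDACTED]' } })`.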
HTTP middleware with context:
// src/middleware/logging.ts
import pinoHttp from 'pino-http';
import logger from '../logger';

export const httpLogger = pinoHttp({
  logger,
  customLogLevel: (req, res, err) => {
    // Context-aware log levels
    if (res.statusCode >= 500 || err) return 'error';
    if (res.statusCode >= 400) return 'warn';
    if (res.statusCode >= 300) return 'info';
    return 'debug';
  },
  customSuccessMessage: (req, res) => {
    return `${req.method} ${req.url} ${res.statusCode}`;
  },
  customErrorMessage: (req, res, err) => {
    return `${req.method} ${req.url} ${res.statusCode} - ${err.message}`;
  },
});

// Express middleware
app.use(httpLogger);
Per-request context logging:
// src/routes/payments.ts
import logger from '../logger';
import { generateRequestId } from '../utils/correlation';

app.post('/api/payments', (req, res) => {
  const startTime = Date.now();
  const requestId = generateRequestId();
  const childLogger = logger.child({
    requestId,
    customerId: req.body.customerId,
    orderId: req.body.orderId,
  });

  // All logs from this handler include requestId
  childLogger.info('Processing payment request');

  try {
    const result = processPayment(req.body);
    childLogger.info(
      {
        event: 'payment_success',
        chargeId: result.chargeId,
        duration_ms: Date.now() - startTime,
      },
      'Payment processed successfully',
    );
    res.json(result);
  } catch (error) {
    childLogger.error(
      {
        event: 'payment_failed',
        error: error.message,
        errorCode: error.code,
      },
      'Payment processing failed',
    );
    res.status(500).json({ error: error.message });
  }
});
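Passing the child logger through every function call gets tedious in deep call stacks. Node's built-in `AsyncLocalStorage` can carry the request-scoped logger implicitly; a minimal sketch (the `runWithLogger`/`getLogger` helpers are illustrative, not part of pino):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Minimal logger shape; in practice this would be a pino child logger.
interface ContextLogger {
  info(msg: string): void;
  bindings: Record<string, string>;
}

const storage = new AsyncLocalStorage<ContextLogger>();

// Middleware calls this once per request with a child logger.
function runWithLogger<T>(logger: ContextLogger, fn: () => T): T {
  return storage.run(logger, fn);
}

// Any function, however deep in the call stack, retrieves the
// request-scoped logger without it being threaded as a parameter.
function getLogger(): ContextLogger {
  const logger = storage.getStore();
  if (!logger) throw new Error('no logger in context');
  return logger;
}

function deepBusinessLogic() {
  getLogger().info('charging card'); // carries requestId bindings implicitly
}
```

Usage: the correlation middleware calls `runWithLogger(childLogger, next)` once; handlers and services call `getLogger()` anywhere below.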
Log Levels and When to Use Them
Log level hierarchy:
TRACE (10): Very detailed debugging
DEBUG (20): Debugging information for development
INFO (30): General informational messages
WARN (40): Warnings for recoverable issues
ERROR (50): Errors for failures
FATAL (60): Fatal errors causing shutdown
When to use each:
const logger = pino();
// FATAL: System cannot continue
logger.fatal('Database connection lost, exiting');
// ERROR: Operation failed, but system continues
logger.error({ error }, 'Failed to send email');
// WARN: Something unexpected, but recoverable
logger.warn({ statusCode: 503 }, 'Stripe API slow');
// INFO: Important business events (payment, login)
logger.info({ orderId, amount }, 'Payment processed');
// DEBUG: Development diagnostics (disabled in production)
logger.debug({ userId }, 'User session created');
// TRACE: Extreme detail (rarely used)
logger.trace({ memory }, 'Memory state snapshot');
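The numeric values are what pino writes to the `level` field, and they make threshold filtering a single comparison. A simplified sketch of the check a logger performs before emitting a record:

```typescript
// Pino's standard level numbers.
const LEVELS: Record<string, number> = {
  trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60,
};

// A logger configured at 'info' emits a record only when the record's
// level number meets or exceeds the configured threshold.
function shouldEmit(recordLevel: string, configuredLevel: string): boolean {
  return LEVELS[recordLevel] >= LEVELS[configuredLevel];
}
```

So with `LOG_LEVEL=info`, `shouldEmit('debug', 'info')` is false and DEBUG records are dropped before serialization, which is why leaving DEBUG off in production is effectively free.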
Cost control: log events, not steps, and choose levels appropriately
// ❌ BAD: Log everything
app.get('/api/users/:id', (req, res) => {
  logger.info('Received request');
  logger.info('Database query started');
  logger.info({ rowCount: rows.length }, 'Query returned');
  logger.info('Response sent');
  // 4 logs per request × 1000 req/s = 4000 logs/s stored
  // Cost: $4000/month
});

// ✓ GOOD: Log events, not steps
app.get('/api/users/:id', async (req, res) => {
  const start = Date.now();
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.params.id]);
  logger.info(
    { userId: req.params.id, duration_ms: Date.now() - start },
    'User fetched',
  );
  // 1 log per request × 1000 req/s = 1000 logs/s stored
  // Cost: $1000/month (4x cheaper)
});
Grafana Loki Setup
Loki is a log aggregation system that indexes only labels, not log content, which keeps it cheap at scale and makes it a natural fit for Kubernetes:
Kubernetes deployment (DaemonSet for log collection):
# Add Loki Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki stack (Loki + Promtail)
helm install loki grafana/loki-stack \
--namespace observability \
--create-namespace \
--values loki-values.yaml
Loki configuration:
# loki-values.yaml
loki:
  auth_enabled: false # Single-tenant; enable multi-tenancy (X-Scope-OrgID header) for shared clusters
  ingester:
    max_chunk_age: 1h
    chunk_idle_period: 15m
  limits_config:
    # Prevent abuse
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 720h # 30 days
    max_streams_per_user: 10000
    max_global_streams_per_user: 100000
  schema_config:
    configs:
      - from: 2020-10-24
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h

# Promtail: Log shipper (runs on every node)
promtail:
  enabled: true
  config:
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Extract pod labels as Loki labels
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        # Parse JSON logs to extract fields
        pipeline_stages:
          - json:
              expressions:
                level: level
                environment: environment
                message: msg
                requestId: requestId
          # Promote only low-cardinality fields to labels: a requestId or
          # message label would create one stream per value and overwhelm Loki
          - labels:
              level:
              environment:
Verify Loki is receiving logs:
# Port forward
kubectl port-forward -n observability svc/loki 3100:3100
# Query Loki API
curl -s 'http://localhost:3100/loki/api/v1/query' \
--data-urlencode 'query={app="api", environment="production"}' | jq
# Access Loki UI (via Grafana)
# http://grafana.company.com → Explore → Select Loki data source
LogQL Queries for Debugging
LogQL is Loki's query language (similar to PromQL):
Basic queries:
# All logs from api service in production
{app="api", environment="production"}
# Filter on a field parsed from the JSON log body
{app="api"} | json | level="error"
# Search by pattern
{app="api"} |= "payment"
# Regex search
{app="api"} |~ "Error: .*timeout"
# Exclude pattern
{app="api"} != "health check"
Time-series aggregations:
# Count error logs per minute (last 1 hour)
count_over_time({app="api", level="error"}[1m])
# Rate of errors (errors per second)
rate({app="api", level="error"}[5m])
# Average response time (from the duration_ms field)
avg_over_time({app="api"} | json | unwrap duration_ms [5m]) by (endpoint)
Complex debugging queries:
# Find endpoints whose P99 latency exceeds 500ms
quantile_over_time(0.99,
  {app="api"} | json | unwrap duration_ms [5m]
) by (endpoint) > 500

# Error rate as a fraction of all requests; alert if > 5%
sum(rate({app="api", level="error"}[5m]))
  / on() group_left()
sum(rate({app="api"}[5m]))
  > 0.05

# Find requests from a specific customer
{app="api"} | json | customerId="cust_123"

# Trace a request across services (requestId is a parsed field, not a label)
{environment="production"} | json | requestId="req_abc123"

# Find payment failures with context
{app="api"}
  | json
  | event="payment_failed"
  | line_format "{{ .time }} {{ .customerId }} {{ .error }}"
Log Sampling for High-Volume Services
Logging everything is expensive. Sample strategically:
Sampling by log level:
// Sample low-importance logs
const shouldSampleDebugLog = Math.random() < 0.1; // 10% of DEBUG logs
if (level === 'DEBUG' && !shouldSampleDebugLog) {
  return; // Don't log this
}
logger.debug({ ...data }, 'Message');
Sampling by request path:
// Sample health checks (high volume, low value)
const healthCheckRoutes = ['/health', '/readiness', '/liveness'];
const shouldLog = !healthCheckRoutes.includes(req.path) ||
  Math.random() < 0.01; // 1% of health checks
if (shouldLog) {
  logger.info({ path: req.path }, 'HTTP request');
}
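`Math.random()` makes an independent decision per log line, so one request's logs can be split across the keep/drop boundary. Hashing a stable key such as the request ID makes the decision deterministic: a request is either fully logged or fully sampled out. A sketch (the FNV-1a hash is one reasonable choice; any stable hash works):

```typescript
// FNV-1a 32-bit hash: stable, fast, uniform enough for sampling decisions.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Keep a request's logs iff its ID hashes into the sampled fraction.
// Same requestId => same decision, so requests are never half-logged.
function shouldSample(requestId: string, rate: number): boolean {
  return fnv1a(requestId) / 0xffffffff < rate;
}
```

The same key-hashing trick is how tracing systems keep whole traces together when sampling.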
Probabilistic sampling via a custom transport (pino has no built-in sampling; the transport package below is hypothetical, shown only to illustrate the configuration shape):
// Hypothetical sampling transport — illustrative package name
import pino from 'pino';
const logger = pino({
  transport: {
    target: 'pino-sampling-transport', // not a real package; write your own transport
    options: {
      samplingConfig: {
        debug: 0.01, // 1% of DEBUG
        info: 1.0, // 100% of INFO
        warn: 1.0, // 100% of WARN
        error: 1.0, // 100% of ERROR
      },
    },
  },
});
Error-aware sampling (keep errors at a higher rate):
// Keep all ERROR and FATAL
// Sample INFO, DEBUG based on context
const sampleRate: Record<string, number> = {
  error: 1.0,
  warn: 0.5,
  info: 0.1,
  debug: 0.01,
};
if (Math.random() < sampleRate[level]) {
  logger[level](message); // pino has no generic .log(); call the level method
}
Cost estimation:
Service: 5000 req/s
Log target: 2 logs per request (one start, one end)
= 10,000 logs/second
Loki pricing (example): $0.50 per GB
Average log size: 1KB
10,000 logs/s × 1KB × 86,400s/day × 30 days/month
= 25,920 GB/month (unsampled)
= $12,960/month (unsampled)
With sampling:
INFO logs: 10% (1,000/s)
ERROR logs: 100% (50/s)
= 1,050 logs/s
= ~2,722 GB/month
= ~$1,361/month (≈10x cheaper)
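These estimates are straight multiplication; a small helper keeps them reproducible when tuning sampling rates (the 1 KB average line size and $0.50/GB price are the example assumptions above, using decimal GB):

```typescript
// Estimate monthly log storage cost from post-sampling volume.
// logsPerSecond: logs actually shipped; avgKB: average line size in KB.
function monthlyLogCostUSD(
  logsPerSecond: number,
  avgKB = 1,
  pricePerGB = 0.5,
): number {
  // seconds per 30-day month = 86,400 × 30; 1e6 KB per decimal GB
  const gbPerMonth = (logsPerSecond * avgKB * 86_400 * 30) / 1_000_000;
  return gbPerMonth * pricePerGB;
}

// Unsampled: monthlyLogCostUSD(10_000) → 12960 ($12,960/month)
// Sampled:   monthlyLogCostUSD(1_050)  → 1360.8 (~$1,361/month)
```

Re-running the helper after each sampling change gives an instant cost projection before the bill arrives.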
Correlation IDs Across Services
Correlation IDs tie logs across microservices:
Generate correlation ID:
// src/middleware/correlation-id.ts
import { v4 as uuidv4 } from 'uuid';
import logger from '../logger';

export function generateCorrelationId() {
  return uuidv4();
}

// Express middleware
app.use((req, res, next) => {
  // Use existing correlation ID if provided
  const correlationId =
    req.headers['x-correlation-id'] || generateCorrelationId();

  // Add to request object
  req.correlationId = correlationId;

  // Add to response headers (for client to use)
  res.setHeader('X-Correlation-ID', correlationId);

  // Add to all logs from this request
  const childLogger = logger.child({ correlationId });
  req.logger = childLogger;

  next();
});
Propagate across services:
// Service A: Make request to Service B
async function callServiceB(correlationId: string) {
  const response = await fetch('http://service-b/api/users', {
    headers: {
      'X-Correlation-ID': correlationId, // Pass along req.correlationId
    },
  });
  return response.json();
}

// Service B: Receives correlation ID
app.get('/api/users', (req, res) => {
  const correlationId = req.headers['x-correlation-id'];
  const requestLogger = logger.child({ correlationId }); // child of the shared pino instance
  requestLogger.info('Received request from Service A');
  // All logs include the same correlationId
});
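Adding the header by hand on every outbound call is easy to forget. A tiny helper that merges the correlation header into any outgoing request options keeps propagation consistent (the helper name is illustrative):

```typescript
// Merge the correlation header into outbound request headers so every
// downstream call carries the same ID. Shape matches fetch's RequestInit.
function withCorrelation(
  correlationId: string,
  init: { headers?: Record<string, string> } = {},
) {
  return {
    ...init,
    headers: { ...(init.headers ?? {}), 'X-Correlation-ID': correlationId },
  };
}

// Usage:
// fetch('http://service-b/api/users', withCorrelation(req.correlationId));
```

Wrapping your HTTP client once beats auditing every call site; many teams do the same with an axios interceptor or a fetch wrapper.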
Query across services:
# Find all logs for a specific request (all services)
{environment="production"} | json | correlationId="550e8400-e29b-41d4-a716-446655440000"
# Result:
# Service A: 10:30:00.001 Received request
# Service A: 10:30:00.050 Calling Service B
# Service B: 10:30:00.052 Received request from Service A
# Service B: 10:30:00.150 Database query completed
# Service A: 10:30:00.155 Response received
# Service A: 10:30:00.160 Response sent to client
# Total latency: 159ms, visible across both services
Log Retention Cost Optimization
Calculate retention needs:
Service: 10,000 req/s
Logs per request: 2, at ~1 KB each
= 20,000 logs/second × 86,400 seconds/day ≈ 1.73 TB/day
≈ 52 TB/month (unsampled)
At $0.50/GB ≈ $26,000/month for 30-day retention
Tiered retention policies (assuming sampling first reduces each level to ~1.7 GB/day):
- INFO logs: 7 days ≈ 12 GB stored × $0.50 = $6
- WARN logs: 30 days ≈ 51 GB stored × $0.50 = $25.50
- ERROR logs: 90 days ≈ 153 GB stored × $0.50 = $76.50
- Total: ~$108/month for mixed retention
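Each tier's cost is daily volume × retention days × price per GB. A helper to compare policies (the ~1.7 GB/day per-level volume is an assumption chosen to roughly reproduce tier figures like those above):

```typescript
// Steady-state storage cost of one retention tier:
// stored GB at steady state = daily volume × retention window.
function tierCostUSD(
  dailyGB: number,
  retentionDays: number,
  pricePerGB = 0.5,
): number {
  return dailyGB * retentionDays * pricePerGB;
}

// Mixed-retention total, assuming ~1.7 GB/day per level after sampling:
const total =
  tierCostUSD(1.7, 7) +  // INFO:   7 days ≈ $6
  tierCostUSD(1.7, 30) + // WARN:  30 days ≈ $25.50
  tierCostUSD(1.7, 90);  // ERROR: 90 days ≈ $76.50
// total ≈ $108/month
```

The asymmetry is the point: errors are rare but valuable, so 90-day error retention costs far less than 90-day everything retention.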
Optimize by log level:
# Loki retention config (enforced by the compactor)
compactor:
  retention_enabled: true
limits_config:
  retention_period: 30d # Default for INFO/DEBUG
  # Per-stream retention, matched by label selector
  retention_stream:
    - selector: '{level="error"}' # Keep errors longer
      priority: 1
      period: 90d
    - selector: '{endpoint="/health"}' # Keep health checks shorter (requires an endpoint label)
      priority: 1
      period: 1d
    - selector: '{level="debug"}' # Keep DEBUG shorter
      priority: 1
      period: 3d

# Compression (ingester setting)
ingester:
  chunk_encoding: snappy # Reduce storage 40-50%
Archive old logs (further cost reduction):
# Move chunks to S3 (cheaper object storage)
loki:
  schema_config:
    configs:
      - from: 2020-10-24
        store: boltdb-shipper
        object_store: s3 # Chunks go to S3 instead of local disk
        schema: v11
        index:
          prefix: index_
          period: 24h
  storage_config:
    aws:
      s3: s3://logs-archive/loki
# S3 costs: ~$0.023/GB/month (vs $0.50/GB for hot storage)
Production Logging Checklist
# deployment/logging-readiness.yaml
application:
  - "✓ Structured logging (pino with JSON)"
  - "✓ Correlation IDs in all logs"
  - "✓ Appropriate log levels (no DEBUG in production)"
  - "✓ Sensitive data redacted"
  - "✓ Error context captured (stack trace, error code)"
infrastructure:
  - "✓ Loki deployed with persistent storage"
  - "✓ Promtail collecting all pod logs"
  - "✓ Log parsing configured (JSON extraction)"
  - "✓ Retention policy set (tiered by level)"
observability:
  - "✓ Error rate dashboard created"
  - "✓ Slow request queries built"
  - "✓ Alert on error rate spike"
  - "✓ Runbook for common errors"
optimization:
  - "✓ Sampling configured (low-value logs)"
  - "✓ Cost estimated and baseline established"
  - "✓ Compression enabled"
  - "✓ Archive strategy for old logs"
Conclusion
Structured logging is the foundation of production observability. Pino makes JSON logging fast and ergonomic. Loki aggregates these logs at scale. LogQL queries pinpoint issues across microservices. Correlation IDs tie everything together. Sampling and retention policies keep costs linear instead of exponential. Combined, these practices transform logs from noise into a precise debugging tool that reduces mean-time-to-resolution from hours to minutes.