OpenTelemetry in Node.js — Distributed Tracing From Zero to Production

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Distributed tracing answers questions no metrics dashboard can: "Why is this request slow?" across 15 microservices. OpenTelemetry standardizes tracing, eliminating vendor lock-in. This post walks through production-ready setup, auto-instrumentation that requires zero code changes, custom spans for business context, and the sampling strategies that keep costs sane.
- OTel SDK Setup with Auto-Instrumentation
- Custom Spans and Attributes
- Trace Context Propagation Across Services
- OTLP Exporter to Grafana Tempo/Jaeger
- Sampling Strategies
- Baggage for Business Context
- Trace-Based Alerting
- Production OpenTelemetry Checklist
- Conclusion
OTel SDK Setup with Auto-Instrumentation
OpenTelemetry's magic lies in auto-instrumentation—observe your code without modifying it:
# Install dependencies
npm install \
@opentelemetry/api \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/sdk-trace-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
Entry point instrumentation (must be first import):
// src/instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const traceExporter = new OTLPTraceExporter({
  // When you set url explicitly, the HTTP exporter uses it verbatim —
  // include the /v1/traces signal path, not just the host.
  url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318'}/v1/traces`,
  headers: {
    Authorization: `Bearer ${process.env.OTEL_AUTH_TOKEN || ''}`,
  },
});
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
    [SemanticResourceAttributes.CLOUD_REGION]: process.env.REGION || 'us-east-1',
  }),
  traceExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-http': {
        enabled: true,
        requestHook: (span, req) => {
          // req is a raw http.IncomingMessage (or ClientRequest), not an
          // Express request — there is no req.ip or req.get() here.
          if ('headers' in req) {
            span.setAttributes({
              'http.client_ip': req.socket?.remoteAddress ?? '',
              'http.user_agent': req.headers['user-agent'] ?? '',
            });
          }
        },
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true,
        responseHook: (span, response) => {
          // The pg hook receives { data: QueryResult }, not the result itself
          span.setAttribute('db.rows_affected', response.data.rowCount ?? 0);
        },
      },
    }),
  ],
});
// Start SDK before importing app
sdk.start();
console.log('Tracing initialized');
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((err) => console.error('Failed to shutdown tracing', err));
});
Application entry point:
// src/index.ts
// ⚠️ MUST come after instrumentation
import './instrumentation';
import express from 'express';

const app = express();

app.get('/api/users/:id', async (req, res) => {
  // Auto-instrumentation captures HTTP metadata;
  // your code runs with tracing automatically.
  const user = await getUserFromDb(req.params.id); // your data-access layer
  res.json(user);
});

app.listen(3000, () => {
  console.log('Server running on :3000');
});
Environment configuration:
# .env.production
OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.observability.svc.cluster.local:4318
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer%20secret-token
NODE_ENV=production
VERSION=2.1.0
REGION=us-west-2
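Note the `%20` in `OTEL_EXPORTER_OTLP_HEADERS`: the OpenTelemetry spec defines this variable as comma-separated `key=value` pairs with URL-encoded values. As a minimal sketch of how such a string is interpreted (the function name here is ours, not an SDK API):

```typescript
// Parse an OTEL_EXPORTER_OTLP_HEADERS-style string into a header map.
// Format per the OTel spec: "key1=value1,key2=value2", values URL-encoded.
function parseOtlpHeaders(raw: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const pair of raw.split(',')) {
    const idx = pair.indexOf('=');
    if (idx === -1) continue; // skip malformed entries
    const key = pair.slice(0, idx).trim();
    const value = decodeURIComponent(pair.slice(idx + 1).trim());
    if (key) headers[key] = value;
  }
  return headers;
}
```

The SDK does this parsing for you; the sketch is only to show why a literal space in a token would break the format.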
Auto-instrumentation captures:
- HTTP requests/responses (method, URL, status)
- Database queries (SQL, execution time)
- DNS lookups
- File system operations
- External HTTP calls
- Promise/async-await chains
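The last item works because the Node SDK's context manager is built on Node's own `AsyncLocalStorage`: whatever span is active when an async chain starts stays visible after every `await`. A minimal sketch of that mechanism, with no OTel packages involved (the names here are illustrative):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// A toy "trace context", keyed the way the SDK's context manager works.
const storage = new AsyncLocalStorage<{ traceId: string }>();

async function handler(): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 10));
  // Even after the await, the store set by the caller is still active.
  return storage.getStore()?.traceId ?? 'missing';
}

function withTrace(traceId: string): Promise<string> {
  return storage.run({ traceId }, () => handler());
}
```

This is why you never have to pass a span object through your call stack: the active context follows the async execution automatically.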
Custom Spans and Attributes
Auto-instrumentation covers standard operations. Add custom spans for business logic:
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(orderId: string, amount: number) {
  // Create explicit span
  const span = tracer.startSpan('processPayment', {
    attributes: {
      'order.id': orderId,
      'payment.amount': amount,
      'payment.currency': 'USD',
    },
  });

  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      // Validate payment (a child span — the parent is active in this context)
      const validationSpan = tracer.startSpan('validatePayment');
      validationSpan.addEvent('Checking fraud rules', {
        'fraud.score': 0.15,
      });
      validationSpan.end();

      // Charge customer
      const chargeSpan = tracer.startSpan('chargeCustomer', {
        attributes: {
          'customer.id': 'cust_abc123',
          'processor': 'stripe',
        },
      });

      // Declared outside the inner try so it is in scope for the return below
      let result;
      try {
        result = await stripe.charges.create({
          amount: amount * 100, // cents
          currency: 'usd',
        });
        chargeSpan.addEvent('payment_success', {
          'charge.id': result.id,
          'charge.status': result.status,
        });
        span.setStatus({ code: SpanStatusCode.OK });
      } catch (error) {
        chargeSpan.recordException(error as Error);
        chargeSpan.setStatus({
          code: SpanStatusCode.ERROR,
          message: (error as Error).message,
        });
        throw error;
      } finally {
        chargeSpan.end();
      }

      return { success: true, chargeId: result.id };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: 'Payment processing failed',
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
Span attributes best practices:
// ✓ DO: Use semantic conventions
span.setAttributes({
  'http.method': 'POST',
  'http.url': url,
  'http.status_code': 200,
  'db.system': 'postgresql',
  'db.name': 'payments',
  'messaging.system': 'kafka',
});

// ✓ DO: Add business context
span.setAttributes({
  'customer.id': customerId,
  'subscription.tier': 'premium',
  'experiment.variant': 'treatment',
});

// ✗ DON'T: PII or secrets
// span.setAttributes({
//   'user.password': '...', // NEVER
//   'api.key': '...',       // NEVER
// });
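One way to enforce the "no PII" rule mechanically rather than by convention is a small scrubber applied before attributes reach `span.setAttributes()`. A sketch — the key patterns are illustrative and should be extended to match your own data model:

```typescript
// Denylist-based attribute scrubber: drops keys that look sensitive
// before they are handed to span.setAttributes().
const SENSITIVE_KEY_PATTERN = /(password|secret|token|api[._-]?key|authorization|ssn|card)/i;

function scrubAttributes(
  attrs: Record<string, string | number | boolean>,
): Record<string, string | number | boolean> {
  const safe: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (!SENSITIVE_KEY_PATTERN.test(key)) safe[key] = value;
  }
  return safe;
}
```

A denylist catches accidents; for stricter environments an allowlist (only explicitly approved keys pass) is the safer inversion of the same idea.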
Trace Context Propagation Across Services
Traces span services. Context must propagate across boundaries:
// Service A: Initiates request
import { trace, context, propagation } from '@opentelemetry/api';

async function callDownstream(serviceB_url: string) {
  const span = tracer.startSpan('callServiceB');

  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      // HTTP instrumentation injects the trace context automatically,
      // but you can propagate explicitly when calling out by other means:
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers);

      const response = await fetch(serviceB_url, {
        method: 'POST',
        headers,
        body: JSON.stringify({ data: 'test' }),
      });

      span.addEvent('service_b_response', {
        'http.status_code': response.status,
      });
      return response.json();
    } finally {
      span.end();
    }
  });
}
// Service B: Receives and continues trace
import express from 'express';
import { trace } from '@opentelemetry/api';

const app = express();

app.post('/api/process', (req, res) => {
  // Auto-instrumentation extracts trace context from the incoming headers,
  // so this handler's span continues the parent trace automatically.
  const activeSpan = trace.getActiveSpan();
  activeSpan?.addEvent('Processing request from upstream');

  // Work happens in this span
  const result = processData(req.body);
  res.json(result);
});
W3C Trace Context headers (automatic):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value
baggage: userId=alice,serverNode=DF:28,isProduction=false
OpenTelemetry HTTP instrumentation automatically:
- Extracts `traceparent` from incoming requests
- Injects `traceparent` into outgoing requests
- Continues the trace chain transparently
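The `traceparent` value above follows the W3C Trace Context format: `version-traceid-spanid-flags`, all lowercase hex. As a sketch of what the extraction step does under the hood (this parser is ours, not an OTel API — the SDK's propagators handle this for you):

```typescript
interface TraceParent {
  version: string;  // 2 hex chars
  traceId: string;  // 32 hex chars, must not be all zeros
  parentId: string; // 16 hex chars, must not be all zeros
  sampled: boolean; // bit 0 of the flags byte
}

// Parse a W3C `traceparent` header. Returns null when invalid.
function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version === 'ff') return null; // version ff is forbidden by the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

The trailing `-01` in the example header is the sampled flag: downstream services use it to honor the upstream sampling decision.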
OTLP Exporter to Grafana Tempo/Jaeger
Export traces to your backend:
Docker Compose setup (local development):
# docker-compose.yml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      # Remapped so they don't clash with Tempo's OTLP ports below
      - "14317:4317"  # OTLP gRPC
      - "14318:4318"  # OTLP HTTP

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
    command: -config.file=/etc/tempo/config.yaml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"   # host 3001 — the Node app above already uses 3000
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
Production Kubernetes deployment:
# k8s/otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  # Note: tail sampling needs every span of a trace on the same collector.
  # With multiple replicas, front them with a trace-ID-aware load balancer
  # (e.g. the loadbalancing exporter) or run a single replica.
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          # The contrib distribution is required for the tail_sampling processor
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
              name: otlp-grpc
            - containerPort: 4318 # OTLP HTTP
              name: otlp-http
          env:
            - name: GOGC
              value: "80"
          volumeMounts:
            - name: config
              mountPath: /etc/otel/config.yaml
              subPath: config.yaml
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 256Mi
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      batch:
        timeout: 1s
        send_batch_size: 1024
      tail_sampling:
        policies:
          - name: error-traces
            type: status_code
            status_code:
              status_codes: [ERROR]
    exporters:
      otlp:
        endpoint: tempo.observability:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp]
Sampling Strategies
Sampling controls cost without losing signal:
Head-based sampling (at SDK):
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  // Respect the parent's decision; otherwise sample 10% of new traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...
});
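What makes `TraceIdRatioBasedSampler` better than plain `Math.random()` is determinism: the decision is derived from the trace ID itself, so every span in a trace — and every service configured with the same ratio — reaches the same verdict. A sketch of the idea (the real sampler uses its own hashing; this shows the principle, not the exact algorithm):

```typescript
// Deterministic ratio sampling: map the trace ID to a value in [0, 1)
// and compare it against the ratio. Because the input is the trace ID,
// every participant that sees the same ID makes the same decision.
function shouldSampleByRatio(traceIdHex: string, ratio: number): boolean {
  // Use the last 8 hex chars (32 bits) as a roughly uniform value.
  const slice = parseInt(traceIdHex.slice(-8), 16);
  return slice / 0x100000000 < ratio;
}
```

With a random coin flip per span instead, different services would keep different halves of the same trace, leaving you with broken partial traces.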
Tail-based sampling (in collector):
# Collector config
processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Sample slow requests (>500ms)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 500
      # Sample specific services
      - name: critical-service
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service, auth-service]
          enabled_regex_matching: false
      # Default: sample 5% of remaining
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
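The collector buffers a trace's spans until its decision window elapses, then evaluates the policies: a trace is kept if any policy matches. A toy model of that OR-combination, mirroring the four policies above (the types and defaults here are ours, for illustration):

```typescript
interface TraceSummary {
  hasError: boolean;
  durationMs: number;
  serviceName: string;
}

// Toy tail-sampling decision: keep the trace if ANY policy matches
// (errors, latency over 500ms, a critical service), else fall back
// to a 5% probabilistic keep.
function keepTrace(
  t: TraceSummary,
  critical: string[] = ['payment-service', 'auth-service'],
): boolean {
  if (t.hasError) return true;
  if (t.durationMs > 500) return true;
  if (critical.includes(t.serviceName)) return true;
  return Math.random() < 0.05; // probabilistic fallback
}
```

This is also why tail sampling is memory-hungry: every in-flight trace must be held until its decision is made, which is what the collector's resource limits above are guarding against.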
Dynamic sampling based on business logic:
// Sampler types live in the SDK packages, not @opentelemetry/api
import { Sampler, SamplingResult, SamplingDecision } from '@opentelemetry/sdk-trace-base';
import { Context, SpanKind, Attributes, Link } from '@opentelemetry/api';

class BusinessLogicSampler implements Sampler {
  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[],
  ): SamplingResult {
    // Always sample premium customers
    if (attributes['customer.tier'] === 'premium') {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Always sample checkout flow
    if (spanName.includes('checkout')) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Sample 10% of regular traffic
    if (Math.random() < 0.1) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    return { decision: SamplingDecision.NOT_RECORD };
  }

  toString(): string {
    return 'BusinessLogicSampler';
  }
}
Baggage for Business Context
Baggage carries metadata across service boundaries:
import { context, propagation } from '@opentelemetry/api';

// Service A: Set baggage (create one if none is active yet)
const baggage =
  propagation.getBaggage(context.active()) ?? propagation.createBaggage();
const newBaggage = baggage
  .setEntry('userId', { value: 'user123' })
  .setEntry('experiment.group', { value: 'treatment' })
  .setEntry('customer.tier', { value: 'premium' });

context.with(propagation.setBaggage(context.active(), newBaggage), async () => {
  // Call downstream service — the baggage header is injected automatically
  await fetch('http://service-b/api/process');
});

// Service B: Read baggage
app.post('/api/process', (req, res) => {
  const baggage = propagation.getBaggage(context.active());
  const userId = baggage?.getEntry('userId')?.value;
  const experimentGroup = baggage?.getEntry('experiment.group')?.value;

  console.log(`Processing for user: ${userId} in group: ${experimentGroup}`);

  // Use baggage in business logic
  const result = processUserRequest(userId, experimentGroup);
  res.json(result);
});
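On the wire, baggage is just the W3C `baggage` header shown earlier: comma-separated `key=value` pairs, percent-encoded. A simplified sketch of the encoding round trip (it ignores per-entry `;`-delimited properties, which the real propagator supports):

```typescript
// Encode/decode the W3C `baggage` header carried between services.
function encodeBaggage(entries: Record<string, string>): string {
  return Object.entries(entries)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join(',');
}

function decodeBaggage(header: string): Record<string, string> {
  const entries: Record<string, string> = {};
  for (const part of header.split(',')) {
    // Drop ";"-delimited entry properties; take key and value only.
    const [k, v] = part.split(';')[0].split('=');
    if (k && v !== undefined) {
      entries[decodeURIComponent(k.trim())] = decodeURIComponent(v.trim());
    }
  }
  return entries;
}
```

Because every entry travels as a header on every downstream hop, keep baggage small — a handful of short identifiers, never payload data.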
Trace-Based Alerting
Convert traces to actionable alerts:
# Prometheus rules for trace metrics
# (metric names below assume a span-metrics pipeline, e.g. the collector's
# spanmetrics connector, feeding Prometheus — adjust to your own names)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trace-alerts
spec:
  groups:
    - name: traces
      interval: 30s
      rules:
        # Alert on high error rate
        - alert: HighTraceErrorRate
          expr: |
            rate(traces_error_total[5m]) > 0.05
          for: 5m
          annotations:
            summary: "High error trace rate ({{ $value | humanizePercentage }})"
        # Alert on slow p99 latency
        - alert: HighTraceLatency
          expr: |
            histogram_quantile(0.99, rate(span_duration_ms_bucket[5m])) > 500
          for: 10m
          annotations:
            summary: "P99 span latency exceeds 500ms ({{ $value | humanize }}ms)"
        # Alert on sampling bias
        - alert: LowSamplingRate
          expr: |
            rate(spans_sampled_total[5m]) / rate(spans_total[5m]) < 0.01
          annotations:
            summary: "Sampling rate critically low"
Grafana AlertManager integration:
// Query traces and create alerts.
// NOTE: the endpoint and payload below are illustrative — check the
// Grafana Alerting provisioning API docs for the exact shape in your version.
const HIGH_ERROR_RATE_QUERY = `
{
  "datasourceUid": "tempo-uid",
  "expression": "rate(traces_error_total[5m]) > 0.05"
}
`;

fetch('/api/v1/alerts/create', {
  method: 'POST',
  body: JSON.stringify({
    title: 'High Error Rate Detected',
    query: HIGH_ERROR_RATE_QUERY,
    condition: 'gt 0.05',
    for: '5m',
  }),
});
Production OpenTelemetry Checklist
# deployment/otel-readiness.yaml
infrastructure:
  - "✓ OTLP collector deployed and accessible"
  - "✓ Tempo or Jaeger backend operational"
  - "✓ Network policies allow trace export"
  - "✓ Resource limits set on collector"
application:
  - "✓ Instrumentation runs before app code"
  - "✓ Service name and version configured"
  - "✓ Sampling strategy appropriate for volume"
  - "✓ No PII in span attributes"
observability:
  - "✓ Trace queries working in UI"
  - "✓ Service dependencies visible"
  - "✓ Slow trace detection configured"
  - "✓ Alert rules created for regressions"
operations:
  - "✓ Runbook for high trace volume"
  - "✓ Sampling adjustments documented"
  - "✓ Trace retention policy set"
  - "✓ On-call team trained"
Conclusion
OpenTelemetry transforms debugging from guesswork to precision. Auto-instrumentation means zero code changes for standard operations. Custom spans add business context. Trace context propagation connects services into coherent stories. Intelligent sampling keeps costs predictable as traffic grows. Combined with Grafana Tempo or Jaeger, you shift from reactive incident response to proactive understanding of system behavior. Start with auto-instrumentation, add custom spans where business logic matters, then refine sampling as you scale.