LLM Observability in Production — Tracing Every Token From Request to Response
Introduction
When your LLM application hits production, visibility becomes survival. A single slow model call cascades into timeouts. An unexpected cost spike destroys margins. Tokens disappear into a black box, and you have no way to correlate which user triggered which LLM call or tool invocation.
This post walks you through production-grade LLM observability: tracing every token, tracking costs per request, and alerting on anomalies before they become incidents.
- OpenTelemetry Spans for LLM Calls
- Trace Correlation Across Systems
- Sampling Strategy for High-Volume Applications
- Alerting on Latency and Cost Anomalies
- Conclusion
OpenTelemetry Spans for LLM Calls
OpenTelemetry provides the standard instrumentation layer. Here's how to wrap every LLM API call in a span that captures tokens, latency, and model metadata:
```typescript
import { trace, context, SpanStatusCode, SpanKind } from "@opentelemetry/api";
import { Anthropic } from "@anthropic-ai/sdk";

const tracer = trace.getTracer("llm-service");

interface LLMCallMetrics {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  costUSD: number;
  latencyMs: number;
  model: string;
}

async function callLLMWithTracing(
  prompt: string,
  model: string = "claude-3-5-sonnet-20241022",
  userId: string = "unknown"
): Promise<{ text: string; metrics: LLMCallMetrics }> {
  const span = tracer.startSpan("llm.completion", {
    kind: SpanKind.INTERNAL,
    attributes: {
      "llm.model": model,
      "llm.user_id": userId,
      "llm.prompt_length": prompt.length,
    },
  });

  return context.with(trace.setSpan(context.active(), span), async () => {
    const startTime = Date.now();
    try {
      const client = new Anthropic({
        apiKey: process.env.ANTHROPIC_API_KEY,
      });
      const message = await client.messages.create({
        model,
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });

      const latencyMs = Date.now() - startTime;
      const inputTokens = message.usage.input_tokens;
      const outputTokens = message.usage.output_tokens;
      const totalTokens = inputTokens + outputTokens;

      // Cost calculation (Claude 3.5 Sonnet pricing as an example:
      // $3 per million input tokens, $15 per million output tokens)
      const costUSD = (inputTokens * 0.003 + outputTokens * 0.015) / 1000;

      span.setAttributes({
        "llm.input_tokens": inputTokens,
        "llm.output_tokens": outputTokens,
        "llm.total_tokens": totalTokens,
        "llm.cost_usd": costUSD,
        "llm.latency_ms": latencyMs,
        "llm.finish_reason": message.stop_reason,
      });

      const responseText =
        message.content[0].type === "text" ? message.content[0].text : "";

      span.setStatus({ code: SpanStatusCode.OK });
      return {
        text: responseText,
        metrics: {
          inputTokens,
          outputTokens,
          totalTokens,
          costUSD,
          latencyMs,
          model,
        },
      };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}

export { callLLMWithTracing };
export type { LLMCallMetrics };
```
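The cost arithmetic above hard-codes one model's rates inline. A small lookup table keeps per-model pricing in one place and fails loudly for unknown models. This helper is a sketch: the Sonnet figures mirror the example above, the Haiku entry is illustrative, and real prices change, so treat the table as a placeholder to fill from current pricing:

```typescript
// Per-million-token prices in USD. Illustrative values only — verify
// against current published pricing before relying on these numbers.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "claude-3-5-sonnet-20241022": { inputPerM: 3, outputPerM: 15 },
  "claude-3-5-haiku-20241022": { inputPerM: 0.8, outputPerM: 4 },
};

// Estimate the USD cost of a single call from its token usage
function estimateCostUSD(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  if (!p) throw new Error(`No pricing entry for model: ${model}`);
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}
```

Failing on an unknown model is deliberate: silently reporting $0 for a new model is exactly the kind of blind spot this post is about.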
Trace Correlation Across Systems
The real power of observability appears when you correlate user sessions with LLM calls and tool invocations. Use a request ID that flows through your entire system:
```typescript
import { context, trace } from "@opentelemetry/api";
import { v4 as uuidv4 } from "uuid";
// callLLMWithTracing comes from the previous section

interface RequestContext {
  requestId: string;
  userId: string;
  sessionId: string;
  startTime: number;
}

function createRequestContext(
  userId: string,
  sessionId: string
): RequestContext {
  return {
    requestId: uuidv4(),
    userId,
    sessionId,
    startTime: Date.now(),
  };
}

// Stamp the currently active span with correlation IDs before running fn
function withRequestContext<T>(
  ctx: RequestContext,
  fn: () => Promise<T>
): Promise<T> {
  const span = trace.getActiveSpan();
  if (span) {
    span.setAttributes({
      "request.id": ctx.requestId,
      "user.id": ctx.userId,
      "session.id": ctx.sessionId,
    });
  }
  return fn();
}

async function orchestrateLLMAndTools(
  ctx: RequestContext,
  userQuery: string
): Promise<string> {
  const span = trace.getTracer("service").startSpan("user_request", {
    attributes: {
      "request.id": ctx.requestId,
      "user.id": ctx.userId,
    },
  });

  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      const llmResult = await callLLMWithTracing(
        userQuery,
        undefined,
        ctx.userId
      );
      // Tool calls, database queries, etc. all inherit this span context
      const toolResult = await executeRelevantTools(ctx, llmResult.text);
      span.addEvent("tools_executed", {
        "tools.count": toolResult.length,
      });
      return toolResult.map((t) => t.result).join("\n");
    } finally {
      span.end();
    }
  });
}

async function executeRelevantTools(
  ctx: RequestContext,
  llmSuggestion: string
): Promise<Array<{ toolName: string; result: string }>> {
  // Simulated tool execution with spans
  const span = trace.getTracer("tools").startSpan("execute_tools", {
    attributes: {
      "request.id": ctx.requestId,
    },
  });

  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      // Tool invocations inherit the request.id attribute
      return [];
    } finally {
      span.end();
    }
  });
}

export { createRequestContext, withRequestContext, orchestrateLLMAndTools };
```
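Correlation also pays off in aggregation: once every span carries `request.id`, you can roll up the cost and token totals of every LLM call a single user request triggered. The `RequestLedger` below is a hypothetical in-process helper, not part of OpenTelemetry or any SDK, sketching that roll-up:

```typescript
interface CallCost {
  costUSD: number;
  totalTokens: number;
}

// Hypothetical per-request aggregator: each LLM call records its metrics
// under the same request ID that its trace attributes carry.
class RequestLedger {
  private calls = new Map<string, CallCost[]>();

  record(requestId: string, metrics: CallCost): void {
    const list = this.calls.get(requestId) ?? [];
    list.push(metrics);
    this.calls.set(requestId, list);
  }

  // Sum cost and tokens across every call made for one user request
  totals(requestId: string): CallCost {
    return (this.calls.get(requestId) ?? []).reduce(
      (acc, m) => ({
        costUSD: acc.costUSD + m.costUSD,
        totalTokens: acc.totalTokens + m.totalTokens,
      }),
      { costUSD: 0, totalTokens: 0 }
    );
  }
}
```

In production you would run this query in your observability backend (group spans by `request.id`), but an in-process ledger is handy for per-request budget caps.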
Sampling Strategy for High-Volume Applications
At scale, tracing every request becomes prohibitively expensive. Implement intelligent sampling that increases fidelity when problems occur:
```typescript
import { Context, SpanKind, Attributes, Link } from "@opentelemetry/api";
import {
  Sampler,
  SamplingDecision,
  SamplingResult,
} from "@opentelemetry/sdk-trace-base";

interface SamplingConfig {
  defaultRate: number;      // 0.1 = 10%
  errorRate: number;        // 1.0 = 100%
  slowThresholdMs: number;  // 5000ms
  slowRate: number;         // 0.5 = 50%
  costThresholdUSD: number; // $0.10
  costRate: number;         // 1.0 = 100%
}

// Note: shouldSample runs when a span *starts*, so latency and cost
// attributes are only visible here if you set them up front or apply this
// logic as tail-based sampling (e.g., in the OpenTelemetry Collector).
class AdaptiveLLMSampler implements Sampler {
  constructor(private config: SamplingConfig) {}

  shouldSample(
    _context: Context,
    _traceId: string,
    _spanName: string,
    _spanKind: SpanKind,
    attributes: Attributes,
    _links: Link[]
  ): SamplingResult {
    // Always sample errors
    if (attributes["error"]) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Always sample slow requests
    const latency = attributes["llm.latency_ms"] as number | undefined;
    if (latency && latency > this.config.slowThresholdMs) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Always sample high-cost requests
    const cost = attributes["llm.cost_usd"] as number | undefined;
    if (cost && cost > this.config.costThresholdUSD) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Probabilistic sampling for normal requests
    if (Math.random() < this.config.defaultRate) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Recorded locally (metrics still work) but not exported as a trace
    return { decision: SamplingDecision.RECORD };
  }

  getDescription(): string {
    return "AdaptiveLLMSampler";
  }
}

export { AdaptiveLLMSampler };
export type { SamplingConfig };
```
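One caveat: `Math.random()` gives each service an independent coin flip, so a multi-service trace can end up half-sampled. OpenTelemetry's `TraceIdRatioBasedSampler` avoids this by deriving the decision from the trace ID itself, which every service in the trace shares. The core idea, sketched in plain TypeScript (a simplification, not the SDK's exact algorithm):

```typescript
// Deterministic sampling: the same trace ID yields the same decision in
// every service, so a trace is either fully sampled or fully dropped.
function traceIdShouldSample(traceId: string, rate: number): boolean {
  // Interpret the last 8 hex digits of the trace ID as an unsigned 32-bit int
  const tail = parseInt(traceId.slice(-8), 16);
  // Sample when the tail falls below rate * 2^32; since trace IDs are
  // (pseudo)random, roughly `rate` of all traces pass this check
  return tail < rate * 0x100000000;
}
```

In practice you would compose this with the adaptive rules above: deterministic ratio sampling for the baseline, plus always-sample overrides for errors, slow calls, and expensive calls.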
Alerting on Latency and Cost Anomalies
With spans flowing into your observability backend (Datadog, New Relic, Grafana Tempo), set up alerts that trigger when token counts, costs, or latencies deviate from baseline:
```typescript
interface AnomalyDetectionConfig {
  latencyBaselineMs: number;
  latencyThresholdStdDev: number; // 3.0 = 3 sigma
  costBaseline: number;
  costThresholdMultiplier: number; // 5.0 = 5x normal
  tokenBaselinePer1kReqs: number;
  tokenThreshold: number;
}

class LLMAnomalyDetector {
  private latencyHistory: number[] = [];
  private costHistory: number[] = [];

  constructor(private config: AnomalyDetectionConfig) {}

  checkLatencyAnomaly(latencyMs: number): boolean {
    this.latencyHistory.push(latencyMs);
    // Keep the last 100 requests as the rolling baseline
    if (this.latencyHistory.length > 100) {
      this.latencyHistory.shift();
    }
    if (this.latencyHistory.length < 10) return false;

    const mean =
      this.latencyHistory.reduce((a, b) => a + b, 0) /
      this.latencyHistory.length;
    const variance =
      this.latencyHistory.reduce(
        (acc, val) => acc + Math.pow(val - mean, 2),
        0
      ) / this.latencyHistory.length;
    const stdDev = Math.sqrt(variance);
    if (stdDev === 0) return false; // flat baseline, nothing to compare against

    const zScore = Math.abs((latencyMs - mean) / stdDev);
    return zScore > this.config.latencyThresholdStdDev;
  }

  checkCostAnomaly(costUSD: number): boolean {
    this.costHistory.push(costUSD);
    if (this.costHistory.length > 100) {
      this.costHistory.shift();
    }
    if (this.costHistory.length < 10) return false;

    const avgCost =
      this.costHistory.reduce((a, b) => a + b, 0) / this.costHistory.length;
    const threshold = avgCost * this.config.costThresholdMultiplier;
    return costUSD > threshold;
  }
}

export { LLMAnomalyDetector };
export type { AnomalyDetectionConfig };
```
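The latency check above reduces to a z-score over a rolling window. Restated as a standalone pure function (illustrative, mirroring the detector's math but without mutating the history), the threshold becomes easy to unit-test against known baselines:

```typescript
// Returns true when `value` sits more than `sigmas` standard deviations
// from the mean of `history` (a rolling window of recent latencies).
function isLatencyAnomaly(
  history: number[],
  value: number,
  sigmas: number = 3
): boolean {
  if (history.length < 10) return false; // not enough baseline yet
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((acc, v) => acc + (v - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return value !== mean; // flat baseline: any deviation stands out
  return Math.abs((value - mean) / stdDev) > sigmas;
}
```

A rolling z-score is deliberately simple; it adapts to gradual drift but will also "learn" a slow regression as the new normal, which is why the alert threshold should be paired with an absolute ceiling in production.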
Conclusion
Production LLM observability requires three layers: tracing (OpenTelemetry spans), correlation (request IDs across systems), and anomaly detection (costs, latencies, tokens). Without these, you're flying blind when production breaks.
Start with OpenTelemetry spans around LLM calls, add request context correlation, then layer on sampling and alerting as your volume scales. The earlier you instrument, the faster you'll debug production incidents.