OpenTelemetry for AI Systems — Tracing LLM Calls, Token Usage, and Agent Loops
By Sanjeev Sharma (@webcoderspeed1)
Introduction
LLM inference is a black box without observability: token counts go unlogged, agent loops spiral into recursion silently, and RAG retrieval latency hides in the call stack. OpenTelemetry's GenAI semantic conventions (the gen_ai.* attributes) standardize how LLM calls are observed across models, frameworks, and vendors. Traces then reveal cost bottlenecks, latency culprits, and agent misbehavior at a glance.
- Standard OTel Semantic Conventions for LLM
- Tracing LLM Calls With Token Counts/Latency/Model
- Tracing Across Agent Tool Calls
- Custom Spans for RAG Pipeline Steps
- GenAI Metrics (Token Throughput, Cache Hit Rate, Error Rate)
- Langfuse + OTel Integration
- Grafana Dashboard for AI Metrics
- Alerting on LLM Error Rates and Cost Spikes
- Sampling Strategy for High-Volume AI Traces
- Checklist
- Conclusion
Standard OTel Semantic Conventions for LLM
OpenTelemetry defines gen_ai.* attributes for LLM tracing:
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('llm-service');

const span = tracer.startSpan('llm.completion', {
  attributes: {
    'gen_ai.system': 'openai', // LLM vendor
    'gen_ai.request.model': 'gpt-4o', // Model identifier
    'gen_ai.request.max_tokens': 1024, // Max output tokens
    'gen_ai.request.temperature': 0.7, // Sampling temperature
    'gen_ai.request.frequency_penalty': 0.5, // Frequency penalty
    'gen_ai.request.presence_penalty': 0.5, // Presence penalty
    'gen_ai.usage.input_tokens': 250, // Input tokens
    'gen_ai.usage.output_tokens': 187, // Output tokens
    'gen_ai.response.finish_reason': 'stop', // Why generation stopped
    'http.status_code': 200, // API status
  },
});
span.end();
These attributes enable:
- Token cost tracking (input_tokens / 1000 × input_cost_per_1k + output_tokens / 1000 × output_cost_per_1k)
- Latency attribution (slow models vs API overhead)
- Model performance comparison (GPT-4 vs Claude vs Llama)
- Usage trends over time
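The cost-tracking arithmetic can be made concrete with a small helper. This is a hypothetical sketch — `estimateCostUSD` and the pricing values are illustrative; check your vendor's current price sheet before using real numbers:

```typescript
// Illustrative per-model pricing (USD per 1K tokens) — NOT authoritative.
const PRICING: Record<string, { inputPer1K: number; outputPer1K: number }> = {
  'gpt-4o': { inputPer1K: 0.0025, outputPer1K: 0.01 },
  'claude-3-5-sonnet-20241022': { inputPer1K: 0.003, outputPer1K: 0.015 },
};

// Convert the token counts recorded on a span into an estimated cost,
// suitable for a 'gen_ai.usage.cost_usd' span attribute.
function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) throw new Error(`Unknown model: ${model}`);
  return (inputTokens / 1000) * p.inputPer1K + (outputTokens / 1000) * p.outputPer1K;
}
```

Attaching the result to every span makes cost queries a simple sum over trace data.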
Tracing LLM Calls With Token Counts/Latency/Model
Wrap LLM calls with comprehensive spans:
import Anthropic from '@anthropic-ai/sdk';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const tracer = trace.getTracer('llm-inference');

async function generateResponse(userPrompt: string): Promise<string> {
  const span = tracer.startSpan('claude.completion');
  const startTime = Date.now();

  try {
    const message = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{ role: 'user', content: userPrompt }],
    });

    const latency = Date.now() - startTime;
    const inputTokens = message.usage.input_tokens;
    const outputTokens = message.usage.output_tokens;

    span.setAttributes({
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-3-5-sonnet-20241022',
      'gen_ai.request.max_tokens': 1024,
      'gen_ai.usage.input_tokens': inputTokens,
      'gen_ai.usage.output_tokens': outputTokens,
      'gen_ai.response.finish_reason': message.stop_reason ?? 'unknown',
      'http.request.duration_ms': latency,
      // Claude 3.5 Sonnet: $3 input / $15 output per million tokens
      'gen_ai.usage.cost_usd': (inputTokens * 0.003 + outputTokens * 0.015) / 1000,
    });
    span.setStatus({ code: SpanStatusCode.OK });

    return message.content[0].type === 'text' ? message.content[0].text : '';
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error instanceof Error ? error.message : String(error),
    });
    throw error;
  } finally {
    span.end();
  }
}
Every LLM call is now observable: input/output tokens, latency, cost, and errors.
Tracing Across Agent Tool Calls
Instrument agents to track tool invocations:
import Anthropic from '@anthropic-ai/sdk';
import { trace, context } from '@opentelemetry/api';

const client = new Anthropic();
const tracer = trace.getTracer('agent');

const tools = [
  {
    name: 'database_query',
    description: 'Query the database',
    input_schema: { /* ... */ },
  },
  {
    name: 'external_api',
    description: 'Call external API',
    input_schema: { /* ... */ },
  },
];

async function runAgent(userQuery: string): Promise<string> {
  const agentSpan = tracer.startSpan('agent.run', {
    attributes: {
      'agent.user_query': userQuery,
      'agent.tool_count': tools.length,
    },
  });
  // Parent child spans explicitly via context (the OTel JS API takes a
  // context argument rather than a `parent` span option).
  const agentCtx = trace.setSpan(context.active(), agentSpan);

  const messages: Anthropic.MessageParam[] = [{ role: 'user', content: userQuery }];
  let iteration = 0;

  while (iteration < 10) {
    iteration++;
    const loopSpan = tracer.startSpan(
      'agent.iteration',
      { attributes: { 'agent.iteration': iteration } },
      agentCtx,
    );
    const loopCtx = trace.setSpan(agentCtx, loopSpan);

    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      tools,
      messages,
    });
    messages.push({ role: 'assistant', content: response.content });

    if (response.stop_reason === 'end_turn') {
      loopSpan.end();
      agentSpan.end();
      return response.content[0].type === 'text' ? response.content[0].text : '';
    }

    // Process tool calls
    for (const block of response.content) {
      if (block.type === 'tool_use') {
        const toolSpan = tracer.startSpan(
          'agent.tool_call',
          {
            attributes: {
              'agent.tool_name': block.name,
              'agent.tool_input': JSON.stringify(block.input),
            },
          },
          loopCtx,
        );
        try {
          let toolResult;
          if (block.name === 'database_query') {
            toolResult = await queryDatabase(block.input);
          } else if (block.name === 'external_api') {
            toolResult = await callExternalAPI(block.input);
          }
          toolSpan.setAttributes({
            'agent.tool_result': JSON.stringify(toolResult),
          });
          messages.push({
            role: 'user',
            content: [
              {
                type: 'tool_result',
                tool_use_id: block.id,
                content: JSON.stringify(toolResult),
              },
            ],
          });
        } finally {
          toolSpan.end();
        }
      }
    }
    loopSpan.end();
  }

  agentSpan.end();
  throw new Error('Agent exceeded max iterations');
}
Agent spans reveal:
- Iteration count (spiraling loops detected)
- Tool latencies (which tool is slow?)
- Tool errors (which tools fail frequently?)
- Total cost (sum of all LLM calls)
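Iteration spans also enable a cheap loop guard: flag runs where the agent repeats the same tool call with identical input. A minimal sketch, assuming tool calls are summarized as name/input pairs (`detectToolLoop` is a hypothetical helper, not part of any SDK):

```typescript
interface ToolCall {
  name: string;
  input: string; // JSON-serialized tool input, as recorded on the span
}

// Return the repeated call's key if any (name, input) pair occurs at least
// `threshold` times in one run — a strong signal the agent is stuck.
function detectToolLoop(calls: ToolCall[], threshold = 3): string | null {
  const counts = new Map<string, number>();
  for (const c of calls) {
    const key = `${c.name}:${c.input}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n >= threshold) return key;
  }
  return null;
}
```

Running this over the `agent.tool_call` spans of a trace surfaces spiraling loops long before the max-iteration cap fires.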
Custom Spans for RAG Pipeline Steps
Instrument RAG pipelines step-by-step:
import { trace, context } from '@opentelemetry/api';

const tracer = trace.getTracer('rag-pipeline');

async function ragPipeline(userQuery: string): Promise<string> {
  const pipelineSpan = tracer.startSpan('rag.pipeline', {
    attributes: { 'rag.query': userQuery },
  });
  // Parent each step span to the pipeline span via context
  const pipelineCtx = trace.setSpan(context.active(), pipelineSpan);

  // Step 1: Embed query
  const embeddingSpan = tracer.startSpan('rag.embedding', {}, pipelineCtx);
  const queryVector = await embedModel.embed(userQuery);
  embeddingSpan.setAttributes({
    'rag.embedding.dimension': queryVector.length,
    'rag.embedding.model': 'bge-large-en-v1.5',
  });
  embeddingSpan.end();

  // Step 2: Retrieve documents (measure latency explicitly; the API span
  // object doesn't expose a duration before it ends)
  const retrievalSpan = tracer.startSpan('rag.retrieval', {}, pipelineCtx);
  const retrievalStart = Date.now();
  const documents = await vectorStore.search(queryVector, { topK: 5 });
  retrievalSpan.setAttributes({
    'rag.retrieval.document_count': documents.length,
    'rag.retrieval.latency_ms': Date.now() - retrievalStart,
    'rag.retrieval.top_score': documents[0]?.score,
  });
  retrievalSpan.end();

  // Step 3: Reranking (optional)
  const rerankSpan = tracer.startSpan('rag.reranking', {}, pipelineCtx);
  const reranked = await reranker.rank(userQuery, documents);
  rerankSpan.setAttributes({
    'rag.rerank.document_count': reranked.length,
    'rag.rerank.kept_docs': reranked.filter(d => d.score > 0.5).length,
  });
  rerankSpan.end();

  // Step 4: Generation
  const generationSpan = tracer.startSpan('rag.generation', {}, pipelineCtx);
  const generationStart = Date.now();
  const ragContext = reranked.map(d => d.text).join('\n');
  const response = await generateWithContext(userQuery, ragContext);
  generationSpan.setAttributes({
    'gen_ai.usage.input_tokens': response.input_tokens,
    'gen_ai.usage.output_tokens': response.output_tokens,
    'rag.generation.latency_ms': Date.now() - generationStart,
  });
  generationSpan.end();

  pipelineSpan.end();
  return response.text;
}
RAG spans reveal:
- Which step is slowest (retrieval, reranking, or generation?)
- Document quality (top reranked score <0.3 means poor retrieval)
- Token efficiency (large context window with few output tokens = expensive inefficiency)
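The token-efficiency point can be enforced at context-assembly time: drop low-score documents and cap the context at a token budget before calling the model. A minimal sketch using a rough 4-characters-per-token estimate (`buildContext`, the score floor, and the budget are illustrative assumptions, not a real tokenizer):

```typescript
interface RankedDoc {
  text: string;
  score: number; // reranker score in [0, 1]
}

// Crude token estimate (~4 chars/token for English) — an approximation only.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep documents that clear the rerank-score floor and fit the token budget,
// so the rag.generation span isn't inflated by low-value context.
function buildContext(docs: RankedDoc[], minScore = 0.5, tokenBudget = 4000): string {
  const kept: string[] = [];
  let used = 0;
  for (const d of docs) {
    if (d.score < minScore) continue;
    const t = estimateTokens(d.text);
    if (used + t > tokenBudget) break;
    kept.push(d.text);
    used += t;
  }
  return kept.join('\n');
}
```

Recording `used` as a span attribute then makes context-size regressions visible in traces.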
GenAI Metrics (Token Throughput, Cache Hit Rate, Error Rate)
Define custom metrics for AI systems:
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('llm-metrics');

// Token throughput
const tokenCounter = meter.createCounter('gen_ai.tokens.total', {
  description: 'Total tokens processed',
  unit: '{tokens}',
});

// LLM cache hit rate (for KV cache, semantic cache, etc.)
// Register the observable callback once at setup; it reports the latest
// observed ratio. (Registering a new callback per request would leak.)
let lastCacheHitRatio = 0;
const cacheHitRatio = meter.createObservableGauge('gen_ai.cache.hit_ratio', {
  description: 'Semantic cache hit ratio',
  unit: '1',
});
cacheHitRatio.addCallback(observer => {
  observer.observe(lastCacheHitRatio, { 'gen_ai.system': 'anthropic' });
});

// Model error rate
const modelErrors = meter.createCounter('gen_ai.errors.total', {
  description: 'LLM API errors',
  unit: '{errors}',
});

// Latency histogram
const latencyHistogram = meter.createHistogram('gen_ai.latency_ms', {
  description: 'LLM response latency',
  unit: 'ms',
});

// Usage in code
async function callLLM() {
  const start = Date.now();
  try {
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages,
    });

    const latency = Date.now() - start;
    const totalTokens = response.usage.input_tokens + response.usage.output_tokens;

    tokenCounter.add(totalTokens, {
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-3-5-sonnet-20241022',
    });
    latencyHistogram.record(latency, {
      'gen_ai.system': 'anthropic',
    });

    // Track prompt-cache reads; the gauge callback above reports this value
    if (response.usage.cache_read_input_tokens) {
      lastCacheHitRatio =
        response.usage.cache_read_input_tokens / response.usage.input_tokens;
    }

    return response;
  } catch (error) {
    modelErrors.add(1, {
      'gen_ai.system': 'anthropic',
      'error.type': error instanceof Error ? error.name : 'unknown',
    });
    throw error;
  }
}
Metrics dashboard displays:
- Token throughput (K tokens/min)
- Cache efficiency (hit ratio trending up = optimization working)
- Error rates by model (identify unreliable models)
- P95 latency (SLA tracking)
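For intuition, the P95 number the dashboard tracks is a nearest-rank quantile over recorded latencies. A toy in-process version (real deployments should rely on the histogram instrument and backend aggregation, not client-side arrays):

```typescript
// Nearest-rank p95 over a window of latencies — the same quantity that
// histogram_quantile(0.95, ...) approximates from histogram buckets.
function p95(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[idx];
}
```

Note that `histogram_quantile` interpolates within bucket bounds, so dashboard values will differ slightly from exact quantiles.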
Langfuse + OTel Integration
Langfuse specializes in LLM observability and integrates with OpenTelemetry:
import { Langfuse } from 'langfuse';
import { trace } from '@opentelemetry/api';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

const tracer = trace.getTracer('llm');

async function trackWithLangfuse(userQuery: string) {
  // Start Langfuse trace
  const langfuseTrace = langfuse.trace({
    name: 'user-query',
    userId: 'user-123',
    metadata: { model: 'claude-3-5-sonnet' },
  });

  // Also create an OTel span for broader observability
  const otelSpan = tracer.startSpan('llm.query', {
    attributes: {
      'gen_ai.system': 'anthropic',
      'langfuse.trace_id': langfuseTrace.id,
    },
  });

  const generation = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userQuery }],
  });

  // Log to both systems
  langfuseTrace.generation({
    name: 'claude-response',
    model: 'claude-3-5-sonnet-20241022',
    input: userQuery,
    output: generation.content[0].type === 'text' ? generation.content[0].text : '',
    usage: {
      input: generation.usage.input_tokens,
      output: generation.usage.output_tokens,
    },
  });

  otelSpan.setAttributes({
    'gen_ai.usage.input_tokens': generation.usage.input_tokens,
    'gen_ai.usage.output_tokens': generation.usage.output_tokens,
  });
  otelSpan.end();

  // Langfuse batches events in the background; flush to ensure they ship
  await langfuse.flushAsync();
}
Langfuse UI shows costs, latencies, and error traces. OTel metrics feed into Prometheus/Grafana for alerting.
Grafana Dashboard for AI Metrics
Create a Grafana dashboard for LLM observability:
{
  "dashboard": {
    "title": "LLM Observability",
    "panels": [
      {
        "title": "Token Throughput",
        "targets": [
          {
            "expr": "rate(gen_ai_tokens_total[5m])",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(gen_ai_latency_ms_bucket[5m]))",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      },
      {
        "title": "Cache Hit Ratio",
        "targets": [
          {
            "expr": "gen_ai_cache_hit_ratio",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(gen_ai_errors_total[5m])",
            "legendFormat": "{{error_type}}"
          }
        ]
      },
      {
        "title": "Estimated Daily Cost",
        "targets": [
          {
            "expr": "sum by (gen_ai_system) (rate(gen_ai_tokens_total[24h]) * 86400 * pricing_per_token)",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      }
    ]
  }
}
Alerting on LLM Error Rates and Cost Spikes
Define alerting rules:
groups:
  - name: llm-alerts
    rules:
      - alert: HighLLMErrorRate
        expr: rate(gen_ai_errors_total[5m]) > 0.05  # >0.05 errors/sec sustained
        for: 5m
        annotations:
          summary: "LLM error rate spike detected"

      - alert: CostSpike
        expr: rate(gen_ai_cost_usd[1h]) * 3600 > 100  # >$100/hour
        for: 5m
        annotations:
          summary: "LLM cost spike: {{ $value }}/hour"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(gen_ai_latency_ms_bucket[5m])) > 5000
        for: 5m
        annotations:
          summary: "P95 LLM latency >5 seconds"

      - alert: CacheEfficiencyDegrading
        expr: gen_ai_cache_hit_ratio < 0.2  # <20% hit rate
        for: 15m
        annotations:
          summary: "Semantic cache efficiency low"
Alerts fire to Slack/PagerDuty when LLM systems degrade.
Sampling Strategy for High-Volume AI Traces
Don't trace every LLM call in production — sample a small fraction (around 1%) as the baseline:
import {
  TraceIdRatioBasedSampler,
  SamplingDecision,
  type Sampler,
} from '@opentelemetry/sdk-trace-node';

// Baseline: sample 1% of traces, keyed off the trace ID
const baseSampler = new TraceIdRatioBasedSampler(0.01);

// Oversample errors: always record failed LLM calls
const customSampler: Sampler = {
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    if (attributes['gen_ai.response.finish_reason'] === 'error') {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    return baseSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  },
  toString: () => 'ErrorAwareSampler',
};
Sampling reduces storage costs while preserving error visibility. Oversample high-cost operations (such as long-context RAG) at a higher rate, e.g., 10%.
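For intuition, trace-ID ratio sampling boils down to mapping the trace ID into [0, 1) and comparing against the ratio. A simplified sketch (the real `TraceIdRatioBasedSampler` differs in implementation detail):

```typescript
// Deterministic per-trace decision: every span of a trace hashes the same
// trace ID, so the whole trace is either kept or dropped together.
function sampleByTraceId(traceId: string, ratio: number): boolean {
  // Use the low 8 hex chars of the 32-char trace ID as a 32-bit value in [0, 1)
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}
```

Determinism is what makes ratio sampling safe for distributed traces: no coordination is needed between services to keep a trace intact.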
Checklist
- Define gen_ai.* semantic conventions for all LLM calls
- Create spans for each LLM invocation with input/output tokens
- Track agent iterations with child spans per tool call
- Instrument RAG pipelines with spans per step (retrieval, reranking, generation)
- Define custom metrics for token throughput, cache hit ratio, error rates
- Set up Langfuse or similar for LLM-specific observability
- Create Grafana dashboards for LLM metrics and costs
- Define alerting rules for error rate, latency, and cost spikes
- Implement trace sampling (1% baseline, oversample errors and high-cost calls)
- Log cost calculations with each span
- Monitor cache effectiveness (KV cache, semantic cache hit rates)
- Review traces monthly to identify optimization opportunities
Conclusion
OpenTelemetry semantic conventions bring structure to LLM observability. Traces reveal cost bottlenecks and agent loops. Metrics track token efficiency and cache effectiveness. Combined with Langfuse and Grafana dashboards, you gain complete visibility into LLM systems. The result: cost optimization, latency reduction, and confidence in agent behavior at scale.