OpenTelemetry for AI Systems — Tracing LLM Calls, Token Usage, and Agent Loops

Introduction

LLM inference is a black box without observability: token counts go unlogged, agent loops spiral into recursion silently, and RAG retrieval latency hides in the call stack. OpenTelemetry's Generative AI semantic conventions (the gen_ai.* attributes) standardize how LLM calls are observed across models, frameworks, and vendors. With them, traces reveal cost bottlenecks, latency culprits, and agent misbehavior.

Standard OTel Semantic Conventions for LLM

OpenTelemetry defines gen_ai.* attributes for LLM tracing:

import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('llm-service');

const span = tracer.startSpan('llm.completion', {
  attributes: {
    'gen_ai.system': 'openai',                          // LLM vendor
    'gen_ai.request.model': 'gpt-4o',                   // Model identifier
    'gen_ai.request.max_tokens': 1024,                  // Max output tokens
    'gen_ai.request.temperature': 0.7,                  // Temperature
    'gen_ai.usage.input_tokens': 250,                   // Input tokens
    'gen_ai.usage.output_tokens': 187,                  // Output tokens
    'gen_ai.response.finish_reason': 'stop',            // Why generation stopped
    'http.response.status_code': 200,                   // API status
    'gen_ai.request.frequency_penalty': 0.5,            // Frequency penalty
    'gen_ai.request.presence_penalty': 0.5,             // Presence penalty
  },
});

span.end();

These attributes enable:

  • Token cost tracking (input_tokens / 1000 × input_cost_per_1k + output_tokens / 1000 × output_cost_per_1k)
  • Latency attribution (slow models vs API overhead)
  • Model performance comparison (GPT-4 vs Claude vs Llama)
  • Usage trends over time
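
The cost bullet above can be turned into a small helper that reads the gen_ai.* token attributes. The pricing table is a hypothetical example; substitute your vendor's current rates:

```typescript
// Hypothetical per-1K-token pricing table; real prices vary by vendor and date
const PRICING_PER_1K: Record<string, { input: number; output: number }> = {
  'claude-3-5-sonnet-20241022': { input: 0.003, output: 0.015 },
  'gpt-4o': { input: 0.0025, output: 0.01 },
};

// Estimate USD cost from the gen_ai.usage.* token counts on a span
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING_PER_1K[model];
  if (!price) return 0; // unknown model: report zero rather than guess
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}
```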

Tracing LLM Calls with Token Counts, Latency, and Model

Wrap LLM calls with comprehensive spans:

import Anthropic from '@anthropic-ai/sdk';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const tracer = trace.getTracer('llm-inference');

async function generateResponse(userPrompt: string): Promise<string> {
  const span = tracer.startSpan('claude.completion');

  const startTime = Date.now();

  try {
    const message = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{ role: 'user', content: userPrompt }],
    });

    const latency = Date.now() - startTime;
    const inputTokens = message.usage.input_tokens;
    const outputTokens = message.usage.output_tokens;

    span.setAttributes({
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-3-5-sonnet-20241022',
      'gen_ai.request.max_tokens': 1024,
      'gen_ai.usage.input_tokens': inputTokens,
      'gen_ai.usage.output_tokens': outputTokens,
      'gen_ai.response.finish_reason': message.stop_reason,
      'http.request.duration_ms': latency,
      // Example pricing ($3/M input, $15/M output); check current vendor rates
      'gen_ai.usage.cost_usd': (inputTokens * 0.003 + outputTokens * 0.015) / 1000,
    });

    span.setStatus({ code: SpanStatusCode.OK });
    return message.content[0].type === 'text' ? message.content[0].text : '';
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error instanceof Error ? error.message : String(error),
    });
    throw error;
  } finally {
    span.end();
  }
}

Every LLM call is now observable: input/output tokens, latency, cost, and errors.

Tracing Across Agent Tool Calls

Instrument agents to track tool invocations:

import Anthropic from '@anthropic-ai/sdk';
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const client = new Anthropic();
const tracer = trace.getTracer('agent');

const tools: Anthropic.Tool[] = [
  {
    name: 'database_query',
    description: 'Query the database',
    input_schema: { type: 'object' /* ... */ },
  },
  {
    name: 'external_api',
    description: 'Call external API',
    input_schema: { type: 'object' /* ... */ },
  },
];

// queryDatabase and callExternalAPI are assumed tool implementations
async function runAgent(userQuery: string): Promise<string> {
  const agentSpan = tracer.startSpan('agent.run', {
    attributes: {
      'agent.user_query': userQuery,
      'agent.tool_count': tools.length,
    },
  });

  const messages: Anthropic.MessageParam[] = [{ role: 'user', content: userQuery }];
  let iteration = 0;

  while (iteration < 10) {
    iteration++;

    // OTel JS has no `parent` span option; pass a context carrying the parent span instead
    const loopSpan = tracer.startSpan(
      'agent.iteration',
      { attributes: { 'agent.iteration': iteration } },
      trace.setSpan(context.active(), agentSpan)
    );

    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      tools,
      messages,
    });

    messages.push({ role: 'assistant', content: response.content });

    if (response.stop_reason === 'end_turn') {
      loopSpan.end();
      agentSpan.end();
      return response.content[0].type === 'text' ? response.content[0].text : '';
    }

    // Process tool calls
    for (const block of response.content) {
      if (block.type === 'tool_use') {
        const toolSpan = tracer.startSpan(
          'agent.tool_call',
          {
            attributes: {
              'agent.tool_name': block.name,
              'agent.tool_input': JSON.stringify(block.input),
            },
          },
          trace.setSpan(context.active(), loopSpan)
        );

        try {
          let toolResult;
          if (block.name === 'database_query') {
            toolResult = await queryDatabase(block.input);
          } else if (block.name === 'external_api') {
            toolResult = await callExternalAPI(block.input);
          }

          toolSpan.setAttributes({
            'agent.tool_result': JSON.stringify(toolResult),
          });

          messages.push({
            role: 'user',
            content: [
              {
                type: 'tool_result',
                tool_use_id: block.id,
                content: JSON.stringify(toolResult),
              },
            ],
          });
        } finally {
          toolSpan.end();
        }
      }
    }

    loopSpan.end();
  }

  agentSpan.setStatus({ code: SpanStatusCode.ERROR, message: 'max iterations exceeded' });
  agentSpan.end();
  throw new Error('Agent exceeded max iterations');
}

Agent spans reveal:

  • Iteration count (spiraling loops detected)
  • Tool latencies (which tool is slow?)
  • Tool errors (which tools fail frequently?)
  • Total cost (sum of all LLM calls)
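
The totals in that list can also be derived offline from exported span data, assuming each LLM span records the gen_ai.usage.* attributes shown earlier. A minimal sketch over a simplified span shape (not an OTel type):

```typescript
// Simplified stand-in for exported span data; not an OpenTelemetry type
interface SpanLike {
  name: string;
  attributes: Record<string, number | string>;
}

// Aggregate token usage, cost, and iteration count across one agent trace
function summarizeAgentTrace(spans: SpanLike[]) {
  let inputTokens = 0;
  let outputTokens = 0;
  let costUsd = 0;
  let iterations = 0;
  for (const span of spans) {
    inputTokens += Number(span.attributes['gen_ai.usage.input_tokens'] ?? 0);
    outputTokens += Number(span.attributes['gen_ai.usage.output_tokens'] ?? 0);
    costUsd += Number(span.attributes['gen_ai.usage.cost_usd'] ?? 0);
    if (span.name === 'agent.iteration') iterations++;
  }
  return { inputTokens, outputTokens, costUsd, iterations };
}
```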

Custom Spans for RAG Pipeline Steps

Instrument RAG pipelines step-by-step:

import { trace, context } from '@opentelemetry/api';

const tracer = trace.getTracer('rag-pipeline');

// embedModel, vectorStore, reranker, and generateWithContext are assumed pipeline components
async function ragPipeline(userQuery: string): Promise<string> {
  const pipelineSpan = tracer.startSpan('rag.pipeline', {
    attributes: { 'rag.query': userQuery },
  });
  // Child spans are parented by passing a context, not a `parent` option
  const parentCtx = trace.setSpan(context.active(), pipelineSpan);

  // Step 1: Embed query
  const embeddingSpan = tracer.startSpan('rag.embedding', {}, parentCtx);

  const queryVector = await embedModel.embed(userQuery);

  embeddingSpan.setAttributes({
    'rag.embedding.dimension': queryVector.length,
    'rag.embedding.model': 'bge-large-en-v1.5',
  });
  embeddingSpan.end();

  // Step 2: Retrieve documents
  const retrievalSpan = tracer.startSpan('rag.retrieval', {}, parentCtx);
  const retrievalStart = Date.now();

  const documents = await vectorStore.search(queryVector, { topK: 5 });

  retrievalSpan.setAttributes({
    'rag.retrieval.document_count': documents.length,
    // Spans don't expose their duration before ending; measure it explicitly
    'rag.retrieval.latency_ms': Date.now() - retrievalStart,
    'rag.retrieval.top_score': documents[0]?.score,
  });
  retrievalSpan.end();

  // Step 3: Reranking (optional)
  const rerankSpan = tracer.startSpan('rag.reranking', {}, parentCtx);

  const reranked = await reranker.rank(userQuery, documents);

  rerankSpan.setAttributes({
    'rag.rerank.document_count': reranked.length,
    'rag.rerank.kept_docs': reranked.filter(d => d.score > 0.5).length,
  });
  rerankSpan.end();

  // Step 4: Generation
  const generationSpan = tracer.startSpan('rag.generation', {}, parentCtx);
  const generationStart = Date.now();

  const ragContext = reranked.map(d => d.text).join('\n');
  const response = await generateWithContext(userQuery, ragContext);

  generationSpan.setAttributes({
    'gen_ai.usage.input_tokens': response.input_tokens,
    'gen_ai.usage.output_tokens': response.output_tokens,
    'rag.generation.latency_ms': Date.now() - generationStart,
  });
  generationSpan.end();

  pipelineSpan.end();

  return response.text;
}

RAG spans reveal:

  • Which step is slowest (retrieval, reranking, or generation?)
  • Document quality (top reranked score <0.3 means poor retrieval)
  • Token efficiency (large context window with few output tokens = expensive inefficiency)
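
The retrieval-quality heuristic above can be expressed directly. The 0.3 cutoff is an assumption; tune it against your own relevance judgments:

```typescript
interface ScoredDoc {
  text: string;
  score: number;
}

// Flag poor retrieval when the top reranked score falls below a threshold
function retrievalQuality(reranked: ScoredDoc[], threshold = 0.3): 'poor' | 'ok' {
  const topScore = reranked[0]?.score ?? 0; // empty result set counts as poor
  return topScore < threshold ? 'poor' : 'ok';
}
```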

GenAI Metrics (Token Throughput, Cache Hit Rate, Error Rate)

Define custom metrics for AI systems:

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('llm-metrics');

// Token throughput
const tokenCounter = meter.createCounter('gen_ai.tokens.total', {
  description: 'Total tokens processed',
  unit: '{tokens}',
});

// LLM cache hit rate (for KV cache, semantic cache, etc.)
// Observable gauges take a callback registered once at setup, not per request
let lastCacheHitRatio = 0;
const cacheHitRatio = meter.createObservableGauge('gen_ai.cache.hit_ratio', {
  description: 'Semantic cache hit ratio',
  unit: '1',
});
cacheHitRatio.addCallback(result => {
  result.observe(lastCacheHitRatio, { 'gen_ai.system': 'anthropic' });
});

// Model error rate
const modelErrors = meter.createCounter('gen_ai.errors.total', {
  description: 'LLM API errors',
  unit: '{errors}',
});

// Latency histogram
const latencyHistogram = meter.createHistogram('gen_ai.latency_ms', {
  description: 'LLM response latency',
  unit: 'ms',
});

// Usage in code (`client` and `messages` come from the earlier setup)
async function callLLM() {
  const start = Date.now();

  try {
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages,
    });

    const latency = Date.now() - start;
    const totalTokens = response.usage.input_tokens + response.usage.output_tokens;

    tokenCounter.add(totalTokens, {
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-3-5-sonnet-20241022',
    });

    latencyHistogram.record(latency, {
      'gen_ai.system': 'anthropic',
    });

    // Update cache effectiveness; the gauge callback above reports it
    if (response.usage.cache_read_input_tokens) {
      lastCacheHitRatio =
        response.usage.cache_read_input_tokens / response.usage.input_tokens;
    }

    return response;
  } catch (error) {
    modelErrors.add(1, {
      'gen_ai.system': 'anthropic',
      'error.type': error instanceof Error ? error.name : 'unknown',
    });
    throw error;
  }
}

Metrics dashboard displays:

  • Token throughput (K tokens/min)
  • Cache efficiency (hit ratio trending up = optimization working)
  • Error rates by model (identify unreliable models)
  • P95 latency (SLA tracking)
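
For intuition, the P95 panel computes a nearest-rank percentile over the latency samples the histogram summarizes (Prometheus's histogram_quantile interpolates within buckets, so values differ slightly):

```typescript
// Nearest-rank percentile over raw latency samples
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1; // nearest-rank index
  return sorted[Math.max(0, idx)];
}
```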

Langfuse + OTel Integration

Langfuse specializes in LLM observability and integrates with OpenTelemetry:

import Anthropic from '@anthropic-ai/sdk';
import Langfuse from 'langfuse';
import { trace } from '@opentelemetry/api';

const client = new Anthropic();
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

const tracer = trace.getTracer('llm');

async function trackWithLangfuse(userQuery: string) {
  // Start Langfuse trace
  const langfuseTrace = langfuse.trace({
    name: 'user-query',
    userId: 'user-123',
    metadata: { model: 'claude-3-5-sonnet' },
  });

  // Also create OTel span for broader observability
  const otelSpan = tracer.startSpan('llm.query', {
    attributes: {
      'gen_ai.system': 'anthropic',
      'langfuse.trace_id': langfuseTrace.id,
    },
  });

  const generation = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userQuery }],
  });

  // Log to both systems
  langfuseTrace.generation({
    name: 'claude-response',
    model: 'claude-3-5-sonnet-20241022',
    input: userQuery,
    output: generation.content[0].type === 'text' ? generation.content[0].text : '',
    usage: {
      input: generation.usage.input_tokens,
      output: generation.usage.output_tokens,
    },
  });

  otelSpan.setAttributes({
    'gen_ai.usage.input_tokens': generation.usage.input_tokens,
    'gen_ai.usage.output_tokens': generation.usage.output_tokens,
  });
  otelSpan.end();

  // Langfuse batches events; flush before the process exits
  await langfuse.flushAsync();
}

Langfuse UI shows costs, latencies, and error traces. OTel metrics feed into Prometheus/Grafana for alerting.

Grafana Dashboard for AI Metrics

Create a Grafana dashboard for LLM observability:

{
  "dashboard": {
    "title": "LLM Observability",
    "panels": [
      {
        "title": "Token Throughput",
        "targets": [
          {
            "expr": "rate(gen_ai_tokens_total[5m])",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(gen_ai_latency_ms_bucket[5m]))",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      },
      {
        "title": "Cache Hit Ratio",
        "targets": [
          {
            "expr": "gen_ai_cache_hit_ratio",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(gen_ai_errors_total[5m])",
            "legendFormat": "{{error_type}}"
          }
        ]
      },
      {
        "title": "Estimated Daily Cost",
        "targets": [
          {
            "expr": "sum(rate(gen_ai_tokens_total[24h]) * 24 * pricing_per_token)",
            "legendFormat": "{{gen_ai_system}}"
          }
        ]
      }
    ]
  }
}

Alerting on LLM Error Rates and Cost Spikes

Define alerting rules:

groups:
  - name: llm-alerts
    rules:
      - alert: HighLLMErrorRate
        expr: rate(gen_ai_errors_total[5m]) > 0.05  # >5% error rate
        for: 5m
        annotations:
          summary: "LLM error rate spike detected"

      - alert: CostSpike
        expr: rate(gen_ai_cost_usd[1h]) > 100  # >$100/hour
        for: 5m
        annotations:
          summary: "LLM cost spike: {{ $value }}/hour"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(gen_ai_latency_ms_bucket[5m])) > 5000
        for: 5m
        annotations:
          summary: "P95 LLM latency &gt;5 seconds"

      - alert: CacheEfficiencyDegrading
        expr: gen_ai_cache_hit_ratio < 0.2  # <20% hit rate
        for: 15m
        annotations:
          summary: "Semantic cache efficiency low"

Alerts fire to Slack/PagerDuty when LLM systems degrade.

Sampling Strategy for High-Volume AI Traces

Don't trace every LLM call in production; a ~1% baseline sample rate keeps trace volume manageable:

import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import {
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

const baseSampler = new TraceIdRatioBasedSampler(0.01);  // Sample 1% of traces

// Oversample errors: always record spans whose finish reason indicates failure
const customSampler: Sampler = {
  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[]
  ): SamplingResult {
    if (attributes['gen_ai.response.finish_reason'] === 'error') {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    return baseSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  },
  toString: () => 'LLMErrorOversampler',
};

Sampling reduces storage costs while maintaining error visibility. High-cost operations (e.g. long-context RAG calls) can be oversampled at a higher rate, say 10%.
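
The tiered policy can be captured as a rate function that the sampler consults; the token threshold and rates here are illustrative assumptions:

```typescript
// Pick a sampling rate per call: errors always, high-cost calls at 10%, rest at 1%
function sampleRateFor(attributes: Record<string, string | number>): number {
  if (attributes['gen_ai.response.finish_reason'] === 'error') return 1.0;
  const inputTokens = Number(attributes['gen_ai.usage.input_tokens'] ?? 0);
  if (inputTokens > 50_000) return 0.1; // long-context RAG calls are costly, oversample
  return 0.01; // baseline
}
```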

Checklist

  • Define gen_ai.* semantic conventions for all LLM calls
  • Create spans for each LLM invocation with input/output tokens
  • Track agent iterations with child spans per tool call
  • Instrument RAG pipelines with spans per step (retrieval, reranking, generation)
  • Define custom metrics for token throughput, cache hit ratio, error rates
  • Set up Langfuse or similar for LLM-specific observability
  • Create Grafana dashboards for LLM metrics and costs
  • Define alerting rules for error rate, latency, and cost spikes
  • Implement trace sampling (1% baseline, oversample errors and high-cost calls)
  • Log cost calculations with each span
  • Monitor cache effectiveness (KV cache, semantic cache hit rates)
  • Review traces monthly to identify optimization opportunities

Conclusion

OpenTelemetry semantic conventions bring structure to LLM observability. Traces reveal cost bottlenecks and agent loops. Metrics track token efficiency and cache effectiveness. Combined with Langfuse and Grafana dashboards, you gain complete visibility into LLM systems. The result: cost optimization, latency reduction, and confidence in agent behavior at scale.