Evaluating AI Agents — Trajectory Testing, Tool Use Accuracy, and Task Completion

Introduction

How do you know if an agent is good? Simple metrics like "solved the task" miss important details: Did it use tools correctly? Did it take an efficient path? Did it hallucinate? This post covers comprehensive agent evaluation frameworks that measure not just outcomes but the quality of reasoning and tool use along the way.

Trajectory Evaluation

Trajectory evaluation judges not just the final answer, but the path taken to get there.

interface AgentTrajectory {
  taskId: string;
  startTime: number;
  endTime: number;
  steps: TrajectoryStep[];
  finalOutput: string;
  success: boolean;
}

interface TrajectoryStep {
  stepNumber: number;
  action: string;
  tool?: string;
  toolInput?: Record<string, unknown>;
  toolOutput?: string;
  reasoning: string;
  timestamp: number;
}

class TrajectoryEvaluator {
  async evaluate(trajectory: AgentTrajectory, groundTruth: string): Promise<EvaluationScore> {
    // Score 1: Efficiency (fewer steps is better)
    const efficiencyScore = this.scoreEfficiency(trajectory);

    // Score 2: Tool use appropriateness (used right tools?)
    const toolScore = this.scoreToolUse(trajectory);

    // Score 3: Reasoning quality (made sense at each step?)
    const reasoningScore = await this.scoreReasoning(trajectory);

    // Score 4: Correctness (did it answer right?)
    const correctnessScore = await this.scoreCorrectness(trajectory.finalOutput, groundTruth);

    // Weighted average
    const finalScore =
      efficiencyScore * 0.2 +
      toolScore * 0.25 +
      reasoningScore * 0.25 +
      correctnessScore * 0.3;

    return {
      overall: finalScore,
      efficiency: efficiencyScore,
      toolAccuracy: toolScore,
      reasoning: reasoningScore,
      correctness: correctnessScore,
      issues: this.identifyIssues(trajectory),
    };
  }

  private scoreEfficiency(trajectory: AgentTrajectory): number {
    // Fewer steps = higher score, but not linear (10 steps good, 50 steps bad)
    const stepCount = trajectory.steps.length;

    if (stepCount <= 3) return 1.0;
    if (stepCount <= 5) return 0.9;
    if (stepCount <= 10) return 0.8;
    if (stepCount <= 20) return 0.6;
    if (stepCount <= 50) return 0.3;

    return 0.1; // Way too many steps
  }

  private scoreToolUse(trajectory: AgentTrajectory): number {
    let score = 1.0;

    for (const step of trajectory.steps) {
      // Reasoning-only steps have no tool call to check
      if (!step.tool) continue;

      // Check 1: Did the tool exist?
      if (!this.toolExists(step.tool)) {
        score -= 0.1;
      }

      // Check 2: Were the inputs valid?
      if (step.toolInput && !this.inputsAreValid(step.tool, step.toolInput)) {
        score -= 0.05;
      }

      // Check 3: Was the tool appropriate for the goal?
      if (!this.isToolAppropriate(step.tool, step.reasoning)) {
        score -= 0.1;
      }
    }

    return Math.max(0, score);
  }

  private async scoreReasoning(trajectory: AgentTrajectory): Promise<number> {
    // Use LLM to evaluate reasoning quality
    let score = 1.0;

    for (const step of trajectory.steps) {
      const isReasoningLogical = await this.checkReasoningLogic(step);

      if (!isReasoningLogical) {
        score -= 0.05;
      }
    }

    return Math.max(0, score);
  }

  private async scoreCorrectness(output: string, groundTruth: string): Promise<number> {
    // Use LLM to compare output with ground truth
    const prompt = `Compare these two answers:

Expected: ${groundTruth}
Got: ${output}

How correct is the "Got" answer? (0-1 score)`;

    const response = await this.llmCall(prompt);

    // parseFloat never throws; it returns NaN on failure, so check explicitly
    const score = parseFloat(response);
    if (Number.isNaN(score)) return 0;

    return Math.min(1, Math.max(0, score));
  }

  private toolExists(tool: string): boolean {
    return ['search', 'calculator', 'get_weather'].includes(tool);
  }

  private inputsAreValid(tool: string, inputs: Record<string, unknown>): boolean {
    // Check if inputs match tool schema
    return true;
  }

  private isToolAppropriate(tool: string, reasoning: string): boolean {
    // Does the reasoning justify using this tool?
    return true;
  }

  private async checkReasoningLogic(step: TrajectoryStep): Promise<boolean> {
    // Use LLM to verify reasoning is logical
    return true;
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }

  private identifyIssues(trajectory: AgentTrajectory): string[] {
    const issues: string[] = [];

    if (trajectory.steps.length > 20) {
      issues.push('Agent took too many steps');
    }

    const toolCalls = trajectory.steps.filter((s) => s.tool);
    if (toolCalls.length === 0) {
      issues.push('Agent never used tools');
    }

    return issues;
  }
}

interface EvaluationScore {
  overall: number; // 0-1
  efficiency: number;
  toolAccuracy: number;
  reasoning: number;
  correctness: number;
  issues: string[];
}

Trajectory evaluation reveals how agents think, not just what they conclude.
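The weighted average inside `evaluate` can also be factored into a small standalone helper that normalizes the weights, which makes the formula easy to test in isolation. A minimal sketch; the weight split mirrors the evaluator above and is illustrative, not a recommendation:

```typescript
// Weighted average of named sub-scores. Weights need not sum to 1;
// they are normalized by the sum of the weights actually used.
function weightedScore(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let total = 0;
  let weightSum = 0;

  for (const [name, weight] of Object.entries(weights)) {
    const score = scores[name];
    if (score === undefined) continue; // skip missing sub-scores

    total += score * weight;
    weightSum += weight;
  }

  return weightSum > 0 ? total / weightSum : 0;
}

const overall = weightedScore(
  { efficiency: 0.8, toolAccuracy: 1.0, reasoning: 0.9, correctness: 1.0 },
  { efficiency: 0.2, toolAccuracy: 0.25, reasoning: 0.25, correctness: 0.3 },
);
// 0.8*0.2 + 1.0*0.25 + 0.9*0.25 + 1.0*0.3 = 0.935
```

Normalizing by the sum of applied weights means a missing sub-score degrades gracefully instead of silently dragging the average down.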

Tool Call Accuracy

Did the agent use tools correctly?

interface ToolCallAccuracy {
  toolCallCount: number;
  correctCalls: number;
  hallucinations: number;
  invalidArguments: number;
  accuracyRate: number;
}

class ToolAccuracyEvaluator {
  evaluate(trajectory: AgentTrajectory): ToolCallAccuracy {
    const toolSteps = trajectory.steps.filter((s) => s.tool);

    let correctCalls = 0;
    let hallucinations = 0;
    let invalidArguments = 0;

    for (const step of toolSteps) {
      const tool = step.tool!;

      // Check 1: Does tool exist?
      if (!this.toolExists(tool)) {
        hallucinations++;
        continue;
      }

      // Check 2: Are arguments valid?
      if (!this.validateToolArguments(tool, step.toolInput || {})) {
        invalidArguments++;
        continue;
      }

      // Check 3: Did tool produce output?
      if (!step.toolOutput) {
        invalidArguments++;
        continue;
      }

      correctCalls++;
    }

    return {
      toolCallCount: toolSteps.length,
      correctCalls,
      hallucinations,
      invalidArguments,
      accuracyRate: toolSteps.length > 0 ? correctCalls / toolSteps.length : 0,
    };
  }

  private toolExists(tool: string): boolean {
    return true;
  }

  private validateToolArguments(tool: string, args: Record<string, unknown>): boolean {
    // Validate against tool schema
    return true;
  }
}

Tool accuracy directly impacts agent reliability.
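The `validateToolArguments` stub above is where most real-world failures surface. One way to fill it in is a hand-written schema check; a sketch, assuming the tool names and parameter shapes shown (production systems would validate against full JSON Schema):

```typescript
// Minimal argument validation: required keys must be present and have the
// expected primitive type. Tool names and schemas here are illustrative.
type ParamSpec = { type: "string" | "number" | "boolean"; required: boolean };

const toolSchemas: Record<string, Record<string, ParamSpec>> = {
  get_weather: { city: { type: "string", required: true } },
  calculator: { expression: { type: "string", required: true } },
};

function validateToolArguments(tool: string, args: Record<string, unknown>): boolean {
  const schema = toolSchemas[tool];
  if (!schema) return false; // unknown tool counts as invalid

  for (const [name, spec] of Object.entries(schema)) {
    const value = args[name];

    if (value === undefined) {
      if (spec.required) return false;
      continue;
    }

    if (typeof value !== spec.type) return false;
  }

  return true;
}
```

Returning `false` for unknown tools lets the same check feed both the `invalidArguments` and `hallucinations` counters.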

Task Completion Rate

What percentage of tasks does the agent succeed on?

interface TaskResult {
  taskId: string;
  completed: boolean;
  success: boolean;
  reason?: string;
}

class TaskCompletionEvaluator {
  async evaluateTestSuite(testCases: TestCase[]): Promise<CompletionMetrics> {
    const results: TaskResult[] = [];

    for (const testCase of testCases) {
      const result = await this.runTestCase(testCase);
      results.push(result);
    }

    const completed = results.filter((r) => r.completed).length;
    const successful = results.filter((r) => r.success).length;
    const failed = results.filter((r) => r.completed && !r.success).length;
    const timedOut = results.filter((r) => !r.completed).length;

    return {
      total: results.length,
      completed,
      successful,
      failed,
      timedOut,
      completionRate: results.length > 0 ? completed / results.length : 0,
      successRate: completed > 0 ? successful / completed : 0, // guard: nothing may complete
      summary: this.summarizeFailures(results),
    };
  }

  private async runTestCase(testCase: TestCase): Promise<TaskResult> {
    const startTime = Date.now();
    const timeout = 120000; // 2 minutes; runAgent is assumed to reject once this elapses

    try {
      const trajectory = await this.runAgent(testCase.input);
      const isCorrect = await this.checkCorrectness(trajectory.finalOutput, testCase.expectedOutput);

      return {
        taskId: testCase.id,
        completed: true,
        success: isCorrect,
      };
    } catch (error) {
      const elapsed = Date.now() - startTime;

      if (elapsed > timeout) {
        return {
          taskId: testCase.id,
          completed: false,
          success: false,
          reason: 'Timeout',
        };
      }

      return {
        taskId: testCase.id,
        completed: true,
        success: false,
        reason: (error as Error).message,
      };
    }
  }

  private summarizeFailures(results: TaskResult[]): string {
    const failures = results.filter((r) => !r.success);

    if (failures.length === 0) {
      return 'All tests passed';
    }

    const reasons = new Map<string, number>();

    for (const failure of failures) {
      const reason = failure.reason || 'Unknown';
      reasons.set(reason, (reasons.get(reason) || 0) + 1);
    }

    return Array.from(reasons.entries())
      .map(([reason, count]) => `${count} test(s) failed: ${reason}`)
      .join('; ');
  }

  private async runAgent(input: string): Promise<AgentTrajectory> {
    return {
      taskId: '',
      startTime: 0,
      endTime: 0,
      steps: [],
      finalOutput: '',
      success: false,
    };
  }

  private async checkCorrectness(output: string, expected: string): Promise<boolean> {
    return output === expected;
  }
}

interface CompletionMetrics {
  total: number;
  completed: number;
  successful: number;
  failed: number;
  timedOut: number;
  completionRate: number;
  successRate: number;
  summary: string;
}

interface TestCase {
  id: string;
  input: string;
  expectedOutput: string;
}

Completion rates measure agent effectiveness on real tasks.
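The rate arithmetic is worth isolating, because the edge cases (empty suite, every task timing out) otherwise produce `NaN`. A sketch using a hypothetical `RunOutcome` shape:

```typescript
// Aggregate raw per-task outcomes into rates, guarding both divisions so an
// empty or fully-timed-out run yields 0 instead of NaN.
type RunOutcome = { completed: boolean; success: boolean };

function completionRates(results: RunOutcome[]) {
  const completed = results.filter((r) => r.completed).length;
  const successful = results.filter((r) => r.success).length;

  return {
    completionRate: results.length > 0 ? completed / results.length : 0,
    successRate: completed > 0 ? successful / completed : 0,
  };
}

const rates = completionRates([
  { completed: true, success: true },
  { completed: true, success: false },
  { completed: false, success: false }, // timed out
]);
// completionRate = 2/3, successRate = 1/2
```

Note that success rate is conditioned on completion: a run that times out lowers the completion rate but does not count against the success rate of finished tasks.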

Efficiency Scoring

How well does the agent use resources?

interface EfficiencyMetrics {
  stepCount: number;
  toolCallCount: number;
  tokensUsed: number;
  timeSpentMs: number;
  costInDollars: number;
  efficiency: number; // composite score
  wastedSteps: number;
}

class EfficiencyEvaluator {
  async evaluate(trajectory: AgentTrajectory): Promise<EfficiencyMetrics> {
    const optimalSteps = await this.estimateOptimalSteps(trajectory);

    const wastedSteps = Math.max(0, trajectory.steps.length - optimalSteps);

    // Guard against empty trajectories before dividing
    const efficiency =
      trajectory.steps.length > 0
        ? Math.max(0, 1 - wastedSteps / trajectory.steps.length)
        : 0;

    return {
      stepCount: trajectory.steps.length,
      toolCallCount: trajectory.steps.filter((s) => s.tool).length,
      tokensUsed: this.estimateTokens(trajectory),
      timeSpentMs: trajectory.endTime - trajectory.startTime,
      costInDollars: this.estimateCost(trajectory),
      efficiency,
      wastedSteps,
    };
  }

  private async estimateOptimalSteps(trajectory: AgentTrajectory): Promise<number> {
    // Use LLM to estimate minimum steps needed
    const prompt = `What's the minimum number of steps to solve this task?

Task progression:
${trajectory.steps.map((s) => `${s.stepNumber}. ${s.action}`).join('\n')}`;

    const response = await this.llmCall(prompt);

    // match and parseInt never throw; a missing match is the only failure mode
    const match = response.match(/(\d+)\s+steps?/);
    return match ? parseInt(match[1], 10) : trajectory.steps.length;
  }

  private estimateTokens(trajectory: AgentTrajectory): number {
    // Estimate tokens used by agent
    const input = trajectory.steps.map((s) => s.reasoning).join(' ');

    return Math.ceil(input.length / 4);
  }

  private estimateCost(trajectory: AgentTrajectory): number {
    // Estimate cost based on tokens and model pricing
    const tokens = this.estimateTokens(trajectory);

    const costPerMToken = 5; // $5 per million tokens (rough estimate)

    return (tokens / 1000000) * costPerMToken;
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

Efficiency metrics optimize for cost and latency.
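The token and cost estimates above combine into a one-liner worth sanity-checking. A sketch using the common ~4-characters-per-token heuristic; the $5-per-million-token price is a placeholder, not a real rate:

```typescript
// Rough dollar-cost estimate from a character count. Substitute your model's
// actual input/output pricing; the default here is only a placeholder.
function estimateCostUSD(text: string, dollarsPerMillionTokens = 5): number {
  const tokens = Math.ceil(text.length / 4); // ~4 chars/token heuristic
  return (tokens / 1_000_000) * dollarsPerMillionTokens;
}

// 4,000 characters ≈ 1,000 tokens ≈ $0.005 at $5/M tokens
const cost = estimateCostUSD("x".repeat(4000));
```

For anything beyond dashboards, prefer the exact token counts the model API returns over character heuristics.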

Hallucination Rate

How often does the agent make up information?

interface HallucinationDetection {
  hallucinations: string[];
  totalClaims: number;
  hallucinationRate: number;
}

class HallucinationDetector {
  async detect(trajectory: AgentTrajectory): Promise<HallucinationDetection> {
    const hallucinations: string[] = [];

    // Check 1: Tool hallucinations
    for (const step of trajectory.steps) {
      if (step.tool && !this.toolExists(step.tool)) {
        hallucinations.push(`Hallucinated tool: ${step.tool}`);
      }

      // Check 2: Invalid tool arguments
      if (step.tool && step.toolInput && !this.inputsAreValid(step.tool, step.toolInput)) {
        hallucinations.push(
          `Invalid arguments to ${step.tool}: ${JSON.stringify(step.toolInput)}`,
        );
      }
    }

    // Check 3: Factual hallucinations in reasoning
    const claims = await this.extractClaims(trajectory);

    for (const claim of claims) {
      const isVerifiable = await this.verifyFact(claim);

      if (!isVerifiable) {
        hallucinations.push(`Unverifiable claim: ${claim}`);
      }
    }

    return {
      hallucinations,
      totalClaims: claims.length,
      hallucinationRate: claims.length > 0 ? hallucinations.length / claims.length : 0,
    };
  }

  private async extractClaims(trajectory: AgentTrajectory): Promise<string[]> {
    // Extract factual claims from agent's reasoning
    const reasoning = trajectory.steps.map((s) => s.reasoning).join(' ');

    const prompt = `Extract factual claims from this text:

${reasoning}

Return JSON: { "claims": ["claim 1", "claim 2"] }`;

    const response = await this.llmCall(prompt);

    try {
      const parsed = JSON.parse(response);
      return parsed.claims;
    } catch {
      return [];
    }
  }

  private async verifyFact(claim: string): Promise<boolean> {
    // Use search or knowledge base to verify
    return true;
  }

  private toolExists(tool: string): boolean {
    return true;
  }

  private inputsAreValid(tool: string, inputs: Record<string, unknown>): boolean {
    return true;
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

Hallucination detection catches made-up tools and facts.
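The tool-hallucination half of the check reduces to comparing each step against an allowlist of registered tools. A self-contained sketch; the tool names are illustrative:

```typescript
// Flag calls to tools that are not registered. Steps without a tool field
// (pure reasoning steps) are ignored. Tool names here are assumptions.
const registeredTools = new Set(["search", "calculator", "get_weather"]);

function findHallucinatedTools(steps: { tool?: string }[]): string[] {
  return steps
    .filter((s): s is { tool: string } => s.tool !== undefined && !registeredTools.has(s.tool))
    .map((s) => s.tool);
}

const flagged = findHallucinatedTools([
  { tool: "search" },       // registered, OK
  { tool: "time_machine" }, // not registered: flagged
  {},                       // reasoning-only step, ignored
]);
// flagged = ["time_machine"]
```

Factual-claim verification is the harder half; it needs an external source of truth (search, a knowledge base), which is why `verifyFact` stays a stub above.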

Test Case Design

What makes a good test suite for agents?

interface AgentTestSuite {
  tests: TestCase[];
  coverage: {
    tools: string[];
    reasoning_types: string[];
    complexity: string;
  };
}

class TestSuiteBuilder {
  buildSuite(): AgentTestSuite {
    const tests: TestCase[] = [
      // Easy: single tool call
      {
        id: 'simple-1',
        input: 'What is 2 + 2?',
        expectedOutput: '4',
      },

      // Medium: multiple tool calls
      {
        id: 'medium-1',
        input: 'What is the weather in SF and the capital of France?',
        expectedOutput: 'Weather in SF: [weather], Capital of France: Paris',
      },

      // Hard: reasoning and tool use
      {
        id: 'hard-1',
        input:
          'If the temperature is 32°F and it drops 5 degrees per hour, how long until it freezes?',
        expectedOutput:
          'Already at freezing point. Need to calculate based on...',
      },

      // Edge case: hallucination temptation
      {
        id: 'edge-1',
        input: 'Use the non_existent_tool to do something',
        expectedOutput: 'I don\'t have access to non_existent_tool',
      },

      // Edge case: ambiguous input
      {
        id: 'edge-2',
        input: 'What do you think?',
        expectedOutput: 'I need clarification...',
      },
    ];

    return {
      tests,
      coverage: {
        tools: ['calculator', 'search', 'get_weather'],
        reasoning_types: ['arithmetic', 'lookup', 'comparison'],
        complexity: 'beginner-to-expert',
      },
    };
  }
}

Good test suites cover simple tasks, complex reasoning, and edge cases.

LLM-as-Judge Evaluation

Use another LLM to judge agent performance.

class LLMJudge {
  async evaluateAgentResponse(
    task: string,
    agentOutput: string,
    expectedOutput?: string,
  ): Promise<JudgeScore> {
    const criteria = [
      'correctness',
      'completeness',
      'clarity',
      'reasoning_quality',
      'tool_use',
    ];

    const scores: Record<string, number> = {};

    for (const criterion of criteria) {
      const prompt = `Evaluate this agent's response on the criterion: "${criterion}"

Task: ${task}
Expected: ${expectedOutput || 'Not specified'}
Agent output: ${agentOutput}

Score from 0-1, where 0=poor and 1=excellent. Return only a number.`;

      const response = await this.llmCall(prompt);

      // parseFloat returns NaN rather than throwing on bad input
      const parsed = parseFloat(response);
      scores[criterion] = Number.isNaN(parsed) ? 0.5 : parsed;
    }

    const overall =
      Object.values(scores).reduce((a, b) => a + b, 0) / Object.keys(scores).length;

    return {
      overall,
      criteria: scores,
      feedback: await this.generateFeedback(task, agentOutput, scores),
    };
  }

  private async generateFeedback(
    task: string,
    output: string,
    scores: Record<string, number>,
  ): Promise<string> {
    const lowScores = Object.entries(scores)
      .filter(([_, score]) => score < 0.5)
      .map(([criterion]) => criterion);

    if (lowScores.length === 0) {
      return 'Agent performed well';
    }

    const prompt = `Provide concise feedback on why the agent scored low on: ${lowScores.join(', ')}

Agent output: ${output}`;

    return this.llmCall(prompt);
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

interface JudgeScore {
  overall: number;
  criteria: Record<string, number>;
  feedback: string;
}

LLM-as-judge evaluation captures nuanced quality that metrics miss.
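Judges rarely return a bare number, so the score-parsing step deserves its own defensive helper. A sketch: extract the first number from the reply, clamp it to [0, 1], and fall back to a neutral score when nothing parses:

```typescript
// Parse a judge's numeric reply defensively. The 0.5 fallback is a neutral
// default, an assumption rather than a standard.
function parseJudgeScore(response: string, fallback = 0.5): number {
  const match = response.match(/-?\d+(\.\d+)?/);
  if (!match) return fallback;

  const value = parseFloat(match[0]);
  return Math.min(1, Math.max(0, value)); // clamp to [0, 1]
}
```

Clamping matters because judges occasionally answer on the wrong scale ("8/10") or editorialize around the number; constraining the output format in the prompt helps, but parsing should still not trust it.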

Regression Suite for Agents

Prevent regressions in agent performance.

class AgentRegressionSuite {
  private baselineMetrics: Map<string, EvaluationScore> = new Map();

  async recordBaseline(
    testCase: TestCase,
    agent: { run(input: string): Promise<AgentTrajectory> },
  ): Promise<void> {
    const trajectory = await agent.run(testCase.input);
    const evaluator = new TrajectoryEvaluator();
    const score = await evaluator.evaluate(trajectory, testCase.expectedOutput);

    this.baselineMetrics.set(testCase.id, score);
  }

  async checkForRegressions(
    testCase: TestCase,
    agent: { run(input: string): Promise<AgentTrajectory> },
  ): Promise<RegressionCheck> {
    const trajectory = await agent.run(testCase.input);
    const evaluator = new TrajectoryEvaluator();
    const newScore = await evaluator.evaluate(trajectory, testCase.expectedOutput);

    const baseline = this.baselineMetrics.get(testCase.id);

    if (!baseline) {
      return {
        testCaseId: testCase.id,
        regressed: false,
        baselineScore: 0,
        newScore: newScore.overall,
        message: 'No baseline recorded',
      };
    }

    const regressed = newScore.overall < baseline.overall * 0.9; // 10% degradation threshold

    return {
      testCaseId: testCase.id,
      regressed,
      baselineScore: baseline.overall,
      newScore: newScore.overall,
      message: regressed
        ? `Regression: ${(baseline.overall * 100).toFixed(1)}% -> ${(newScore.overall * 100).toFixed(1)}%`
        : 'No regression',
    };
  }
}

interface RegressionCheck {
  testCaseId: string;
  regressed: boolean;
  baselineScore: number;
  newScore: number;
  message: string;
}

Regression suites catch performance degradation from changes.
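The threshold logic at the heart of the suite is small enough to pull out and test directly. A sketch; the 10% tolerance matches the example above but should be tuned to how noisy your evaluator is:

```typescript
// A run regresses when its score drops more than `tolerance` (a fraction of
// the baseline, default 10%) below the recorded baseline.
function isRegression(baseline: number, current: number, tolerance = 0.1): boolean {
  return current < baseline * (1 - tolerance);
}
```

A relative threshold like this absorbs run-to-run noise from LLM-based scoring; with a deterministic evaluator you could tighten the tolerance considerably.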

Checklist

  • Trajectory: analyze path, not just outcome
  • Tool accuracy: measure correct tool calls vs hallucinations
  • Task completion: test on diverse, representative tasks
  • Efficiency: minimize steps, tokens, cost
  • Hallucination: detect made-up tools and facts
  • Test suite: cover easy, medium, hard, and edge cases
  • LLM-as-judge: nuanced evaluation of quality
  • Regression: catch performance degradation

Conclusion

Comprehensive agent evaluation looks beyond final answers. Measure trajectory quality, tool accuracy, task completion rates, and efficiency. Use LLM-as-judge for nuanced scoring and regression suites to catch degradation. Good evaluation drives continuous improvement.