AI Agent Error Recovery — When Agents Fail, Hallucinate, or Get Stuck

Introduction

Agents fail. They hallucinate tool arguments, get stuck in loops, contradict themselves, or hit rate limits. Production agents need an explicit error recovery strategy: detect failures, reflect on what went wrong, recover gracefully, and escalate to humans when necessary. This post covers practical recovery patterns for real-world agent systems.

Detecting Agent Failure

Detecting failure is the prerequisite for recovery. What signals indicate an agent failed?

interface FailureSignal {
  type:
    | 'max_retries_exceeded'
    | 'timeout'
    | 'contradiction'
    | 'hallucination'
    | 'loop'
    | 'resource_exhausted';
  severity: 'warning' | 'error' | 'critical';
  message: string;
  recoverable: boolean;
}

class FailureDetector {
  private maxIterations: number = 10;
  private maxTokens: number = 100000;
  private timeout: number = 300000; // 5 minutes

  detectFailure(state: AgentState): FailureSignal | null {
    // Check 1: Max iterations exceeded
    if (state.iterations >= this.maxIterations) {
      return {
        type: 'max_retries_exceeded',
        severity: 'error',
        message: `Agent exceeded max iterations (${this.maxIterations})`,
        recoverable: false,
      };
    }

    // Check 2: Timeout
    const elapsedMs = Date.now() - state.startTime;
    if (elapsedMs > this.timeout) {
      return {
        type: 'timeout',
        severity: 'error',
        message: `Agent exceeded timeout (${this.timeout}ms)`,
        recoverable: false,
      };
    }

    // Check 3: Token budget exceeded
    if (state.tokensUsed > this.maxTokens) {
      return {
        type: 'resource_exhausted',
        severity: 'critical',
        message: `Exceeded token budget (${this.maxTokens} tokens)`,
        recoverable: false,
      };
    }

    // Check 4: Contradiction detection
    const contradiction = this.detectContradiction(state.messages);
    if (contradiction) {
      return {
        type: 'contradiction',
        severity: 'warning',
        message: `Agent contradicted itself: ${contradiction}`,
        recoverable: true,
      };
    }

    // Check 5: Hallucination detection
    const hallucination = this.detectHallucination(state);
    if (hallucination) {
      return {
        type: 'hallucination',
        severity: 'error',
        message: `Agent hallucinated invalid tool call: ${hallucination}`,
        recoverable: true,
      };
    }

    // Check 6: Loop detection
    const loop = this.detectLoop(state);
    if (loop) {
      return {
        type: 'loop',
        severity: 'warning',
        message: `Agent appears stuck in loop: ${loop}`,
        recoverable: true,
      };
    }

    return null;
  }

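  private detectContradiction(messages: any[]): string | null {
    // Simple heuristic: find "should be X" or "it's X", then look for the negated form of the same X.
    // A backreference such as \1 cannot refer to a group in a different regex,
    // so the negated pattern is built from the text captured by the affirmative one.
    const text = messages.map((m) => String(m.content ?? '')).join(' ');

    const patterns: Array<[RegExp, string]> = [
      [/should\s+be\s+(.+?)[,.]/gi, 'should\\s+not\\s+be\\s+'],
      [/it'?s\s+(.+?)[,.]/gi, "it'?s\\s+not\\s+"],
    ];

    for (const [affirmative, negatedPrefix] of patterns) {
      let match: RegExpExecArray | null;
      while ((match = affirmative.exec(text)) !== null) {
        const claim = match[1].trim();
        // Escape regex metacharacters in the captured claim before reusing it
        const escaped = claim.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        if (claim && new RegExp(negatedPrefix + escaped, 'i').test(text)) {
          return claim;
        }
      }
    }

    return null;
  }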

  private detectHallucination(state: AgentState): string | null {
    // Check if tool calls reference tools that don't exist
    const availableTools = ['search', 'calculator', 'get_weather'];

    for (const toolCall of state.toolCalls.slice(-3)) {
      if (!availableTools.includes(toolCall.name)) {
        return `Tool does not exist: ${toolCall.name}`;
      }

      // Check if tool call has invalid arguments
      if (
        toolCall.name === 'calculator' &&
        typeof toolCall.input.expression !== 'string'
      ) {
        return `Invalid argument to calculator: ${JSON.stringify(toolCall.input)}`;
      }
    }

    return null;
  }

  private detectLoop(state: AgentState): string | null {
    // Look for repeated tool calls with identical inputs
    const recentTools = state.toolCalls.slice(-5);

    if (recentTools.length < 3) {
      return null;
    }

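    const lastThree = recentTools.slice(-3);
    // Compare tool name and input together; identical inputs sent to different tools are not a loop
    const signatures = lastThree.map((t) => `${t.name}:${JSON.stringify(t.input)}`);

    if (signatures[0] === signatures[1] && signatures[1] === signatures[2]) {
      return `Calling ${lastThree[0].name} with identical inputs repeatedly`;
    }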

    return null;
  }
}

interface AgentState {
  iterations: number;
  startTime: number;
  tokensUsed: number;
  messages: any[];
  toolCalls: any[];
}

Detect failures early so you can recover before the agent creates more problems.
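
The hallucination check above hardcodes the tool list and validates only one tool's arguments. A minimal sketch of a registry-driven variant follows. The `ToolSpec` shape, the `validateToolCall` helper, and the parameter names (`query`, `city`) are assumptions made for illustration, not part of any particular framework.

```typescript
// Hypothetical registry shape: each tool lists its required parameters and their primitive types.
type ParamType = 'string' | 'number' | 'boolean';

interface ToolSpec {
  required: Record<string, ParamType>;
}

interface ToolCallLike {
  name: string;
  input: Record<string, unknown>;
}

// Example registry reusing the tool names from the detector; parameter names are assumed.
const toolRegistry = new Map<string, ToolSpec>([
  ['search', { required: { query: 'string' } }],
  ['calculator', { required: { expression: 'string' } }],
  ['get_weather', { required: { city: 'string' } }],
]);

// Returns a description of the first problem found, or null if the call looks valid.
function validateToolCall(
  call: ToolCallLike,
  registry: Map<string, ToolSpec>,
): string | null {
  const spec = registry.get(call.name);
  if (!spec) {
    return `Tool does not exist: ${call.name}`;
  }

  for (const param of Object.keys(spec.required)) {
    const expected = spec.required[param];
    const actual = typeof call.input[param];
    if (actual !== expected) {
      return `Invalid argument "${param}" for ${call.name}: expected ${expected}, got ${actual}`;
    }
  }

  return null;
}
```

Driving the check from a registry means a newly added tool gets argument validation without touching the detector. In a real system the registry would probably be derived from the same schema that is sent to the model.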

Reflection Prompting on Failure

When an agent fails, ask it to reflect on what went wrong and how to fix it.

class ReflectingRecoveryAgent {
  async runWithReflection(task: string): Promise<string> {
    let state: AgentState = this.initializeState(task);
    let reflection = null;

    while (state.iterations < 10) {
      state.iterations++;

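      // Try to complete task
      const response = await this.llmCall(state.messages);

      // Record the response so the detector inspects the agent's latest output
      state.messages.push({ role: 'assistant', content: response });

      // Check for failure
      const failure = this.detector.detectFailure(state);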

      if (!failure) {
        // Success
        return response;
      }

      if (!failure.recoverable) {
        // Unrecoverable, bail out
        throw new Error(failure.message);
      }

      // Recoverable: ask agent to reflect
      console.log(`Failure detected: ${failure.message}. Asking agent to reflect...`);

      const reflectionPrompt = `You encountered an issue: "${failure.message}"

Looking at your recent actions, what went wrong? How can you approach this differently?
Suggest an alternative strategy.`;

      state.messages.push({
        role: 'user',
        content: reflectionPrompt,
      });

      reflection = await this.llmCall(state.messages);

      state.messages.push({
        role: 'assistant',
        content: reflection,
      });

      // Continue with reflective understanding
      const continuePrompt = `Based on your reflection, try the task again with this new understanding. Take a different approach this time.`;

      state.messages.push({
        role: 'user',
        content: continuePrompt,
      });
    }

    throw new Error('Max iterations exceeded during recovery');
  }

  private initializeState(task: string): AgentState {
    return {
      iterations: 0,
      startTime: Date.now(),
      tokensUsed: 0,
      messages: [{ role: 'user', content: task }],
      toolCalls: [],
    };
  }

  private detector = new FailureDetector();

  private async llmCall(messages: any[]): Promise<string> {
    return '';
  }
}

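Reflection prompting asks the agent to diagnose its own failure before it retries. This often yields a corrected approach, but it is not guaranteed to, so keep the number of reflection rounds bounded.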
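
A reflection prompt tends to be more useful when it shows the agent what it actually did. The sketch below builds such a prompt from the most recent actions and stops after a fixed number of attempts. The `RecentAction` shape, the `buildReflectionPrompt` helper, and the default cap of two attempts are illustrative assumptions.

```typescript
// Hypothetical record of one recent agent action.
interface RecentAction {
  name: string;                   // tool that was called
  input: Record<string, unknown>; // arguments passed to the tool
  error?: string;                 // error text, if the call failed
}

// Builds a reflection prompt, or returns null once the attempt cap is exceeded
// so the caller can escalate or abort instead of reflecting forever.
function buildReflectionPrompt(
  failureMessage: string,
  recentActions: RecentAction[],
  attempt: number,
  maxAttempts: number = 2,
): string | null {
  if (attempt > maxAttempts) {
    return null;
  }

  // Show at most the last three actions to keep the prompt short
  const actionLines = recentActions.slice(-3).map((a, i) => {
    const outcome = a.error ? `failed: ${a.error}` : 'ok';
    return `${i + 1}. ${a.name}(${JSON.stringify(a.input)}) -> ${outcome}`;
  });

  return [
    `You encountered an issue: "${failureMessage}"`,
    '',
    'Your most recent actions were:',
    ...(actionLines.length > 0 ? actionLines : ['(none recorded)']),
    '',
    `This is reflection attempt ${attempt} of ${maxAttempts}.`,
    'Explain what went wrong, then propose one concrete alternative strategy.',
  ].join('\n');
}
```

Returning `null` rather than throwing keeps the decision about what happens after the cap (escalate, degrade, abort) with the caller.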

Alternative Tool Selection

When a tool fails, try a different tool to accomplish the goal.

interface ToolAlternative {
  primary: string;
  alternatives: string[];
  purpose: string;
}

const toolAlternatives: ToolAlternative[] = [
  {
    primary: 'web_search',
    alternatives: ['academic_search', 'news_search', 'site_search'],
    purpose: 'Find information',
  },
  {
    primary: 'database_query',
    alternatives: ['cache_lookup', 'api_call'],
    purpose: 'Fetch data',
  },
  {
    primary: 'send_email',
    alternatives: ['send_slack', 'send_notification'],
    purpose: 'Notify user',
  },
];

interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

class AdaptiveToolSelection {
  async executeGoal(goal: string, toolCalls: ToolCall[]): Promise<string | null> {
    // Nothing to execute
    if (toolCalls.length === 0) {
      return null;
    }

    // First try: use primary tool
    const toolCall = toolCalls[0];
    const primaryTool = toolCall.name;

    let result = await this.executeTool(primaryTool, toolCall.input);

    if (result.success) {
      return result.output ?? null;
    }

    // Find alternatives
    const alternatives = this.findAlternatives(primaryTool);

    if (alternatives.length === 0) {
      return null; // No alternatives
    }

    console.log(`Primary tool ${primaryTool} failed. Trying alternatives: ${alternatives.join(', ')}`);

    // Try each alternative
    for (const altTool of alternatives) {
      const adapted = this.adaptInputs(toolCall.input, primaryTool, altTool);

      result = await this.executeTool(altTool, adapted);

      if (result.success) {
        console.log(`Success with alternative: ${altTool}`);
        return result.output ?? null;
      }
    }

    // All tools failed
    return null;
  }

  private findAlternatives(tool: string): string[] {
    const alt = toolAlternatives.find((ta) => ta.primary === tool);
    return alt?.alternatives || [];
  }

  private adaptInputs(
    originalInputs: Record<string, unknown>,
    fromTool: string,
    toTool: string,
  ): Record<string, unknown> {
    // Adapt parameters for different tools
    const adapted = { ...originalInputs };

    // web_search <-> academic_search: same 'query' parameter
    if (
      (fromTool === 'web_search' && toTool === 'academic_search') ||
      (fromTool === 'academic_search' && toTool === 'web_search')
    ) {
      return adapted;
    }

    // database_query -> api_call: convert SQL to API params
    if (fromTool === 'database_query' && toTool === 'api_call') {
      // Extract what data we need and map to API endpoint
      adapted['endpoint'] = '/api/data';
      adapted['method'] = 'GET';
    }

    return adapted;
  }

  private async executeTool(
    name: string,
    input: Record<string, unknown>,
  ): Promise<{ success: boolean; output?: string; error?: string }> {
    try {
      // Tool execution
      return { success: true, output: 'Result' };
    } catch (error) {
      return { success: false, error: (error as Error).message };
    }
  }
}

Build a tool fallback graph so agents can try alternatives when tools fail.
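
A flat list of alternatives covers one hop. If alternatives have their own alternatives, the structure becomes a graph, and it may contain cycles (for example, web_search and academic_search pointing at each other). The following sketch walks such a graph breadth-first and returns the tools to try in order. The `fallbackGraph` contents and the `fallbackOrder` helper are hypothetical.

```typescript
// Adjacency list: tool -> tools to try next, in preference order. Entries are illustrative.
const fallbackGraph: Record<string, string[]> = {
  web_search: ['academic_search', 'news_search'],
  academic_search: ['web_search', 'site_search'],
  database_query: ['cache_lookup', 'api_call'],
  cache_lookup: ['api_call'],
};

// Breadth-first walk from the failed tool. Returns candidate tools in the order
// they should be tried, excluding the starting tool and never repeating a tool.
function fallbackOrder(
  start: string,
  graph: Record<string, string[]>,
  maxCandidates: number = 5,
): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxCandidates) {
    const current = queue.shift() as string;

    for (const next of graph[current] ?? []) {
      if (visited.has(next)) {
        continue; // already scheduled; this also breaks cycles
      }
      visited.add(next);
      order.push(next);
      queue.push(next);

      if (order.length >= maxCandidates) {
        break;
      }
    }
  }

  return order;
}

// fallbackOrder('web_search', fallbackGraph)
// returns ['academic_search', 'news_search', 'site_search']
```

The `visited` set is what makes cyclic graphs safe, and `maxCandidates` bounds how much time a single failure can consume.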

Human-in-the-Loop Escalation

Some failures require human judgment. Escalate gracefully.

interface EscalationRequest {
  escalationId: string;
  agentName: string;
  situation: string;
  agentState: Record<string, unknown>;
  options: string[];
  timestamp: number;
  timeout: number; // milliseconds to wait for human response
}

class EscalationManager {
  private escalations: Map<string, EscalationRequest> = new Map();
  private humanResponses: Map<string, string> = new Map();

  async escalateToHuman(
    agentName: string,
    situation: string,
    agentState: Record<string, unknown>,
    options: string[],
  ): Promise<string> {
    const request: EscalationRequest = {
      escalationId: `esc-${Date.now()}-${Math.random()}`,
      agentName,
      situation,
      agentState,
      options,
      timestamp: Date.now(),
      timeout: 300000, // 5 minute timeout
    };

    this.escalations.set(request.escalationId, request);

    // Notify human (via email, Slack, dashboard, etc.)
    await this.notifyHuman(request);

    // Wait for response with timeout
    const response = await this.waitForResponse(request.escalationId, request.timeout);

    return response;
  }

  private async notifyHuman(request: EscalationRequest): Promise<void> {
    const message = `
AI Agent Escalation Required
Agent: ${request.agentName}
Situation: ${request.situation}

Options:
${request.options.map((o, i) => `${i + 1}. ${o}`).join('\n')}

Response needed within ${request.timeout / 1000} seconds.
Escalation ID: ${request.escalationId}
`;

    console.log(message);

    // In production: send via Slack, email, or internal API
    // await sendToSlack(message);
  }

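  private async waitForResponse(escalationId: string, timeout: number): Promise<string> {
    const startTime = Date.now();

    while (Date.now() - startTime < timeout) {
      const response = this.humanResponses.get(escalationId);
      if (response !== undefined) {
        // Clean up so resolved escalations do not accumulate in memory
        this.humanResponses.delete(escalationId);
        this.escalations.delete(escalationId);
        return response;
      }

      // Wait 1 second before checking again
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }

    // Timeout: fall back to the safest default, which the caller treats as an abort
    this.escalations.delete(escalationId);
    console.log(`Escalation ${escalationId} timed out. Using default action.`);
    return 'abort';
  }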

  receiveHumanResponse(escalationId: string, response: string): void {
    this.humanResponses.set(escalationId, response);
  }
}

// Usage in agent
class HumanInTheLoopAgent {
  private escalationManager = new EscalationManager();

  async run(task: string): Promise<string> {
    try {
      // Try to complete task
      return await this.attemptTask(task);
    } catch (error) {
      const errorMsg = (error as Error).message;

      // Determine if human input is needed
      if (this.requiresHumanDecision(errorMsg)) {
        const decision = await this.escalationManager.escalateToHuman(
          'MyAgent',
          `Encountered error: ${errorMsg}`,
          { task },
          ['Retry with different approach', 'Skip this task', 'Abort entire process'],
        );

        if (decision === 'Retry with different approach') {
          return this.attemptTaskDifferently(task);
        } else if (decision === 'Skip this task') {
          return `Skipped task: ${task}`;
        } else {
          throw new Error('User chose to abort');
        }
      }

      throw error;
    }
  }

  private requiresHumanDecision(error: string): boolean {
    // Determine which errors need human input
    return error.includes('Permission required') || error.includes('Confirmation needed');
  }

  private async attemptTask(task: string): Promise<string> {
    return 'Result';
  }

  private async attemptTaskDifferently(task: string): Promise<string> {
    return 'Different result';
  }
}

Escalation keeps the agent from guessing on decisions it should not make alone. Use it for permission-required actions and ambiguous situations.
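
The manager above polls once per second for a response. One alternative, sketched here under the assumption that responses arrive through a callback such as a webhook handler, is to keep the promise's resolver and settle it directly when the human answers. The `PendingEscalations` class and its method names are hypothetical.

```typescript
// Settles waiting promises as soon as a human responds, instead of polling.
class PendingEscalations {
  private pending = new Map<
    string,
    { resolve: (value: string) => void; timer: ReturnType<typeof setTimeout> }
  >();

  // Returns a promise that settles with the human response,
  // or with defaultAction once timeoutMs has elapsed.
  wait(escalationId: string, timeoutMs: number, defaultAction: string = 'abort'): Promise<string> {
    return new Promise<string>((resolve) => {
      const timer = setTimeout(() => {
        this.pending.delete(escalationId);
        resolve(defaultAction);
      }, timeoutMs);

      this.pending.set(escalationId, { resolve, timer });
    });
  }

  // Called by whatever receives the human's answer.
  // Returns false if the id is unknown or has already timed out.
  respond(escalationId: string, response: string): boolean {
    const entry = this.pending.get(escalationId);
    if (!entry) {
      return false;
    }

    clearTimeout(entry.timer); // the timeout must not fire after a real response
    this.pending.delete(escalationId);
    entry.resolve(response);
    return true;
  }

  pendingCount(): number {
    return this.pending.size;
  }
}
```

The boolean returned by `respond` lets the caller tell a reviewer that their answer arrived too late, which the polling version cannot report.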

Agent Checkpoint and Resume

Save agent state so you can resume after failures.

interface AgentCheckpoint {
  checkpointId: string;
  sessionId: string;
  iteration: number;
  messages: any[];
  toolCalls: any[];
  timestamp: number;
  status: 'active' | 'failed' | 'completed';
}

class CheckpointingAgent {
  private checkpointStore: Map<string, AgentCheckpoint> = new Map();

  async runWithCheckpoints(sessionId: string, task: string): Promise<string> {
    // Try to resume from last checkpoint
    let state = this.loadLatestCheckpoint(sessionId) || this.initializeState(sessionId, task);

    while (state.iteration < 10) {
      state.iteration++;

      // Create checkpoint before iteration
      const checkpoint = this.saveCheckpoint(state);

      try {
        const response = await this.executeIteration(state);

        if (this.isComplete(response)) {
          checkpoint.status = 'completed';
          return response;
        }

        state.messages.push({ role: 'assistant', content: response });
      } catch (error) {
        checkpoint.status = 'failed';
        console.log(`Iteration ${state.iteration} failed. Checkpoint saved for recovery.`);

        throw error;
      }
    }

    throw new Error('Max iterations exceeded');
  }

  private saveCheckpoint(state: AgentState): AgentCheckpoint {
    const checkpoint: AgentCheckpoint = {
      checkpointId: `cp-${Date.now()}-${Math.random()}`,
      sessionId: state.sessionId,
      iteration: state.iteration,
      messages: JSON.parse(JSON.stringify(state.messages)), // Deep copy
      toolCalls: JSON.parse(JSON.stringify(state.toolCalls)),
      timestamp: Date.now(),
      status: 'active',
    };

    this.checkpointStore.set(checkpoint.checkpointId, checkpoint);
    return checkpoint;
  }

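  private loadLatestCheckpoint(sessionId: string): AgentState | null {
    const checkpoints = Array.from(this.checkpointStore.values()).filter(
      (cp) => cp.sessionId === sessionId,
    );

    // Nothing to resume if the session has no checkpoints or has already finished
    if (checkpoints.length === 0 || checkpoints.some((cp) => cp.status === 'completed')) {
      return null;
    }

    // Newest first; break timestamp ties (same millisecond) by iteration
    const latest = checkpoints.sort(
      (a, b) => b.timestamp - a.timestamp || b.iteration - a.iteration,
    )[0];

    return {
      sessionId,
      // A checkpoint is saved before its iteration runs, so step back one
      // to let the loop re-run that iteration instead of skipping it
      iteration: latest.iteration - 1,
      messages: JSON.parse(JSON.stringify(latest.messages)), // Copy so the stored checkpoint is not mutated
      toolCalls: JSON.parse(JSON.stringify(latest.toolCalls)),
    };
  }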

  private initializeState(sessionId: string, task: string): AgentState {
    return {
      sessionId,
      iteration: 0,
      messages: [{ role: 'user', content: task }],
      toolCalls: [],
    };
  }

  private async executeIteration(state: AgentState): Promise<string> {
    return '';
  }

  private isComplete(response: string): boolean {
    return true;
  }
}

// Checkpoint-oriented state; distinct from the AgentState used in the failure detection example
interface AgentState {
  sessionId: string;
  iteration: number;
  messages: any[];
  toolCalls: any[];
}

Checkpoints enable recovery without losing progress. Always save before expensive operations.
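
Resuming from a checkpoint re-runs the interrupted iteration, which can repeat tool calls that already had side effects. One way to guard against that, sketched here as an assumption rather than as part of the checkpointing class above, is a ledger of completed tool results that is persisted with the checkpoint. The `ToolResultLedger` class and its method names are hypothetical.

```typescript
// Hypothetical ledger of completed tool calls, stored alongside the checkpoint.
class ToolResultLedger {
  private results = new Map<string, string>();

  // The same tool name and input produce the same key.
  // Assumes JSON.stringify emits keys in a stable order for your inputs.
  static keyFor(name: string, input: Record<string, unknown>): string {
    return `${name}:${JSON.stringify(input)}`;
  }

  // Runs `execute` only if no result is recorded for `key`;
  // otherwise replays the stored result without calling the tool again.
  async runOnce(key: string, execute: () => Promise<string>): Promise<string> {
    const cached = this.results.get(key);
    if (cached !== undefined) {
      return cached;
    }

    // If execute throws, nothing is recorded, so the call can be retried later
    const output = await execute();
    this.results.set(key, output);
    return output;
  }

  // Serialize to a string that can be saved with the checkpoint
  serialize(): string {
    return JSON.stringify(Array.from(this.results.entries()));
  }

  // Rebuild a ledger from a previously serialized string
  static deserialize(json: string): ToolResultLedger {
    const ledger = new ToolResultLedger();
    for (const [key, value] of JSON.parse(json) as [string, string][]) {
      ledger.results.set(key, value);
    }
    return ledger;
  }
}
```

This only helps for calls that finished before the failure. A call interrupted midway still needs idempotency on the tool's side.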

Root Cause Analysis

When agents fail, understand why.

interface RootCauseAnalysis {
  failureType: string;
  immediateReason: string;
  rootCause: string;
  contributingFactors: string[];
  recommendation: string;
}

class RootCauseAnalyzer {
  async analyze(failure: FailureSignal, state: AgentState): Promise<RootCauseAnalysis> {
    let rootCause = '';
    let factors: string[] = [];

    switch (failure.type) {
      case 'max_retries_exceeded':
        // Root cause: tool not working or wrong approach
        rootCause = this.analyzeRetryFailure(state);
        factors = [
          'Same tool used repeatedly',
          'No progress between iterations',
          'Agent not recognizing failure',
        ];
        break;

      case 'contradiction':
        // Root cause: LLM confused or reasoning is poor
        rootCause = 'Agent reasoning is inconsistent or hallucinating';
        factors = ['Temperature may be too high', 'Prompt is ambiguous'];
        break;

      case 'hallucination':
        // Root cause: Tool schema not clear enough
        rootCause = 'Agent hallucinated invalid tool call';
        factors = [
          'Tool description unclear',
          'Parameter validation missing',
          'Agent not familiar with tool',
        ];
        break;

      case 'timeout':
        // Root cause: too many iterations or slow tools
        rootCause = this.analyzeTimeoutFailure(state);
        factors = [
          'Tool calls taking too long',
          'Inefficient approach',
          'Network latency',
        ];
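        break;

      default:
        // 'loop' and 'resource_exhausted' have no dedicated analysis here
        rootCause = `No specific analysis for failure type: ${failure.type}`;
        break;
    }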

    return {
      failureType: failure.type,
      immediateReason: failure.message,
      rootCause,
      contributingFactors: factors,
      recommendation: this.recommendFix(failure.type, rootCause),
    };
  }

  private analyzeRetryFailure(state: AgentState): string {
    const lastFew = state.toolCalls.slice(-3);
    const tools = lastFew.map((tc) => tc.name);

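    // Require three calls; with fewer, undefined === undefined would report a false positive.
    // Only tool names are compared here, so the message makes no claim about inputs.
    if (tools.length === 3 && tools[0] === tools[1] && tools[1] === tools[2]) {
      return `Agent is calling ${tools[0]} repeatedly`;
    }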

    return 'Agent is not making progress toward goal';
  }

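  private analyzeTimeoutFailure(state: AgentState): string {
    // Rough estimate of about 4 characters per token; content may be missing or non-string
    const totalTokens = state.messages.reduce(
      (sum, m) => sum + Math.ceil(String(m.content ?? '').length / 4),
      0,
    );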

    if (totalTokens > 50000) {
      return 'Agent has used too many tokens (likely inefficient prompting)';
    }

    return 'Agent is taking too many iterations';
  }

  private recommendFix(failureType: string, rootCause: string): string {
    if (failureType === 'hallucination') {
      return 'Improve tool descriptions and add examples to prompt';
    }

    if (failureType === 'max_retries_exceeded') {
      return 'Add tool alternatives or simplify the task';
    }

    if (failureType === 'timeout') {
      return 'Reduce context window or use fewer tool calls per iteration';
    }

    return 'Review agent prompt and tool definitions';
  }
}

Root cause analysis prevents repeating the same mistakes. Log failures and patterns.
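
Patterns only become visible once failures are aggregated. As a minimal sketch, the function below counts logged failures per failure type and tool, most frequent first. The `FailureRecord` shape and the `summarizeFailures` helper are assumptions for illustration; a production system would more likely run this kind of query in its logging or metrics backend.

```typescript
// Hypothetical shape of one logged failure.
interface FailureRecord {
  type: string;      // e.g. 'loop', 'timeout', 'hallucination'
  tool?: string;     // tool involved, if any
  timestamp: number; // when the failure was logged (ms since epoch)
}

// Counts failures per "type:tool" pair, sorted by count (descending),
// with ties broken alphabetically by key so the output is deterministic.
function summarizeFailures(
  records: FailureRecord[],
): Array<{ key: string; count: number }> {
  const counts = new Map<string, number>();

  for (const record of records) {
    const key = `${record.type}:${record.tool ?? 'n/a'}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  return Array.from(counts.entries())
    .map(([key, count]) => ({ key, count }))
    .sort((a, b) => b.count - a.count || (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}
```

A pair that dominates the summary, such as hallucinated calls to a single tool, points at one tool description to fix rather than at the agent as a whole.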

Graceful Degradation

When agents fail completely, degrade to simpler solutions.

class DegradingAgent {
  async runWithFallback(task: string): Promise<string> {
    // Try level 1: Full agent with tools
    try {
      return await this.runFullAgent(task);
    } catch (error) {
      console.log(`Full agent failed: ${(error as Error).message}. Degrading...`);
    }

    // Try level 2: Agent without tools (just reasoning)
    try {
      return await this.runAgentNoTools(task);
    } catch (error) {
      console.log(`Agent without tools failed. Degrading...`);
    }

    // Try level 3: Simple LLM prompt
    try {
      return await this.runSimplePrompt(task);
    } catch (error) {
      console.log(`Simple prompt failed. Returning error response.`);
    }

    // Level 4: Return error message to user
    return `Unable to complete task: "${task}". Please try again or contact support.`;
  }

  private async runFullAgent(task: string): Promise<string> {
    // Full agent with all tools
    return 'Full agent result';
  }

  private async runAgentNoTools(task: string): Promise<string> {
    // Agent without tool use, pure reasoning
    return 'Reasoning-only result';
  }

  private async runSimplePrompt(task: string): Promise<string> {
    // Simple LLM call with no agent loop
    return 'Simple result';
  }
}

Graceful degradation ensures users get a response even when advanced features fail.
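
When a fallback level answers, the caller usually needs to know that the answer is degraded. The sketch below generalizes the chain into a list of levels and reports which level produced the output, along with the errors from the levels that failed. The `Level` and `DegradedResult` types and the `runWithLevels` function are hypothetical names.

```typescript
// One step in the degradation chain, from most to least capable.
interface Level {
  name: string;
  run: (task: string) => Promise<string>;
}

interface DegradedResult {
  output: string;
  level: string;     // name of the level that answered, or 'none'
  degraded: boolean; // true if anything other than the first level answered
  errors: string[];  // one entry per level that failed
}

// Tries each level in order and returns the first successful output.
async function runWithLevels(task: string, levels: Level[]): Promise<DegradedResult> {
  const errors: string[] = [];

  for (let i = 0; i < levels.length; i++) {
    const level = levels[i];
    try {
      const output = await level.run(task);
      return { output, level: level.name, degraded: i > 0, errors };
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      errors.push(`${level.name}: ${reason}`);
    }
  }

  // Every level failed: return an error message instead of throwing
  return {
    output: `Unable to complete task: "${task}". Please try again or contact support.`,
    level: 'none',
    degraded: true,
    errors,
  };
}
```

Exposing `degraded` and `errors` lets the UI label a reduced-quality answer and lets monitoring count how often each level is reached.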

Checklist

  • Detection: monitor for max iterations, timeouts, token budgets, contradictions, hallucinations, loops
  • Reflection: ask agent what went wrong and suggest alternatives
  • Alternatives: maintain tool fallback chains
  • Escalation: human-in-the-loop for permission-required actions
  • Checkpoints: save state before expensive operations
  • Analysis: understand root causes of failures
  • Degradation: fall back to simpler solutions

Conclusion

Agent failures are inevitable. Build detection, recovery, and fallback mechanisms into every agent. Detect failures early, reflect and recover when possible, escalate to humans for critical decisions, and degrade gracefully. With proper error handling, agents become reliable systems, not unreliable toys.