AI Agent Error Recovery — When Agents Fail, Hallucinate, or Get Stuck
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Agents fail. They hallucinate tool arguments, get stuck in loops, contradict themselves, or hit rate limits. Production agents need sophisticated error recovery strategies: detecting failures, reflecting on what went wrong, recovering gracefully, and escalating to humans when necessary. This post covers practical recovery patterns for real-world agent systems.
- Detecting Agent Failure
- Reflection Prompting on Failure
- Alternative Tool Selection
- Human-in-the-Loop Escalation
- Agent Checkpoint and Resume
- Root Cause Analysis
- Graceful Degradation
- Checklist
- Conclusion
Detecting Agent Failure
Detecting failure is the prerequisite for recovery. What signals indicate an agent failed?
interface FailureSignal {
type:
| 'max_retries_exceeded'
| 'timeout'
| 'contradiction'
| 'hallucination'
| 'loop'
| 'resource_exhausted';
severity: 'warning' | 'error' | 'critical';
message: string;
recoverable: boolean;
}
class FailureDetector {
private maxIterations: number = 10;
private maxTokens: number = 100000;
private timeout: number = 300000; // 5 minutes
detectFailure(state: AgentState): FailureSignal | null {
// Check 1: Max iterations exceeded
if (state.iterations >= this.maxIterations) {
return {
type: 'max_retries_exceeded',
severity: 'error',
message: `Agent exceeded max iterations (${this.maxIterations})`,
recoverable: false,
};
}
// Check 2: Timeout
const elapsedMs = Date.now() - state.startTime;
if (elapsedMs > this.timeout) {
return {
type: 'timeout',
severity: 'error',
message: `Agent exceeded timeout (${this.timeout}ms)`,
recoverable: false,
};
}
// Check 3: Token budget exceeded
if (state.tokensUsed > this.maxTokens) {
return {
type: 'resource_exhausted',
severity: 'critical',
message: `Exceeded token budget (${this.maxTokens} tokens)`,
recoverable: false,
};
}
// Check 4: Contradiction detection
const contradiction = this.detectContradiction(state.messages);
if (contradiction) {
return {
type: 'contradiction',
severity: 'warning',
message: `Agent contradicted itself: ${contradiction}`,
recoverable: true,
};
}
// Check 5: Hallucination detection
const hallucination = this.detectHallucination(state);
if (hallucination) {
return {
type: 'hallucination',
severity: 'error',
message: `Agent hallucinated invalid tool call: ${hallucination}`,
recoverable: true,
};
}
// Check 6: Loop detection
const loop = this.detectLoop(state);
if (loop) {
return {
type: 'loop',
severity: 'warning',
message: `Agent appears stuck in loop: ${loop}`,
recoverable: true,
};
}
return null;
}
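private detectContradiction(messages: any[]): string | null {
// Simple heuristic: look for "should be X" followed by "should not be X"
const text = messages.map((m) => m.content).join(' ');
// A backreference such as \1 cannot refer to a group captured by a different regex,
// so the negated pattern is built from the captured (and escaped) claim text.
const escape = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const patterns: Array<[RegExp, (claim: string) => RegExp]> = [
[/should\s+be\s+(.+?)[,.]/i, (c) => new RegExp(`should\\s+not\\s+be\\s+${escape(c)}`, 'i')],
[/it'?s\s+(.+?)[,.]/i, (c) => new RegExp(`it'?s\\s+not\\s+${escape(c)}`, 'i')],
];
for (const [claimPattern, negate] of patterns) {
const match = text.match(claimPattern);
// Only the first claim matched by each pattern is checked against its negation
if (match && negate(match[1]).test(text)) {
return match[1];
}
}
return null;
}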
private detectHallucination(state: AgentState): string | null {
// Check if tool calls reference tools that don't exist
const availableTools = ['search', 'calculator', 'get_weather'];
for (const toolCall of state.toolCalls.slice(-3)) {
if (!availableTools.includes(toolCall.name)) {
return `Tool does not exist: ${toolCall.name}`;
}
// Check if tool call has invalid arguments
if (
toolCall.name === 'calculator' &&
// input may be missing entirely on a malformed call, so guard the property access
typeof toolCall.input?.expression !== 'string'
) {
return `Invalid argument to calculator: ${JSON.stringify(toolCall.input)}`;
}
}
return null;
}
private detectLoop(state: AgentState): string | null {
// Look for repeated tool calls with identical inputs
const recentTools = state.toolCalls.slice(-5);
if (recentTools.length < 3) {
return null;
}
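const lastThree = recentTools.slice(-3);
// Compare tool name and input together so different tools that share an input are not flagged
const inputs = lastThree.map((t) => JSON.stringify([t.name, t.input]));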
if (inputs[0] === inputs[1] && inputs[1] === inputs[2]) {
return `Calling ${lastThree[0].name} with identical inputs repeatedly`;
}
return null;
}
}
interface AgentState {
iterations: number;
startTime: number;
tokensUsed: number;
messages: any[];
toolCalls: any[];
}
Detect failures early so you can recover before the agent creates more problems.
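The `detectHallucination` check above hard-codes the tool list and a single argument rule. One way to generalize it is to derive both checks from a tool registry. The sketch below assumes a hypothetical `ToolSpec` shape and `validateToolCall` helper, neither of which appears in the original code, and it only checks the `typeof` of required arguments.

```typescript
// Hypothetical registry shape: each tool declares the type of every required argument.
type ArgType = 'string' | 'number' | 'boolean';

interface ToolSpec {
  name: string;
  requiredArgs: Record<string, ArgType>;
}

interface ToolCallLike {
  name: string;
  input: Record<string, unknown>;
}

// Returns a description of the problem, or null when the call looks valid.
function validateToolCall(call: ToolCallLike, registry: ToolSpec[]): string | null {
  const spec = registry.find((t) => t.name === call.name);
  if (!spec) {
    return `Tool does not exist: ${call.name}`;
  }
  for (const arg of Object.keys(spec.requiredArgs)) {
    const expected = spec.requiredArgs[arg];
    const actual = typeof call.input[arg];
    if (actual !== expected) {
      return `Invalid argument "${arg}" for ${call.name}: expected ${expected}, got ${actual}`;
    }
  }
  return null;
}

// Example registry (tool names taken from the detector above; argument names are assumptions).
const registry: ToolSpec[] = [
  { name: 'search', requiredArgs: { query: 'string' } },
  { name: 'calculator', requiredArgs: { expression: 'string' } },
];
```

With a registry like this, adding a tool means adding one entry rather than another `if` branch in the detector.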
Reflection Prompting on Failure
When an agent fails, ask it to reflect on what went wrong and how to fix it.
class ReflectingRecoveryAgent {
async runWithReflection(task: string): Promise<string> {
let state: AgentState = this.initializeState(task);
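let reflection: string | null = null;
while (state.iterations < 10) {
state.iterations++;
// Try to complete the task
const response = await this.llmCall(state.messages);
// Record the response so the detector inspects the latest output, not stale state
state.messages.push({ role: 'assistant', content: response });
// Check for failure
const failure = this.detector.detectFailure(state);
if (!failure) {
// Success
return response;
}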
if (!failure.recoverable) {
// Unrecoverable, bail out
throw new Error(failure.message);
}
// Recoverable: ask agent to reflect
console.log(`Failure detected: ${failure.message}. Asking agent to reflect...`);
const reflectionPrompt = `You encountered an issue: "${failure.message}"
Looking at your recent actions, what went wrong? How can you approach this differently?
Suggest an alternative strategy.`;
state.messages.push({
role: 'user',
content: reflectionPrompt,
});
reflection = await this.llmCall(state.messages);
state.messages.push({
role: 'assistant',
content: reflection,
});
// Continue with reflective understanding
const continuePrompt = `Based on your reflection, try the task again with this new understanding. Take a different approach this time.`;
state.messages.push({
role: 'user',
content: continuePrompt,
});
}
throw new Error('Max iterations exceeded during recovery');
}
private initializeState(task: string): AgentState {
return {
iterations: 0,
startTime: Date.now(),
tokensUsed: 0,
messages: [{ role: 'user', content: task }],
toolCalls: [],
};
}
private detector = new FailureDetector();
private async llmCall(messages: any[]): Promise<string> {
return '';
}
}
Reflection prompting asks the agent to diagnose its own failure. Often it can self-correct.
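Reflection tends to be more useful when the prompt shows the agent what it actually did. The sketch below is one possible way to build such a prompt; `buildReflectionPrompt`, `RecordedToolCall`, and the `maxCallsShown` parameter are hypothetical names introduced here for illustration.

```typescript
interface RecordedToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Builds a reflection prompt that quotes the failure and the most recent tool calls.
function buildReflectionPrompt(
  failureMessage: string,
  toolCalls: RecordedToolCall[],
  maxCallsShown = 3,
): string {
  const recent = toolCalls.slice(-maxCallsShown);
  const actions =
    recent.length > 0
      ? recent.map((c, i) => `${i + 1}. ${c.name}(${JSON.stringify(c.input)})`).join('\n')
      : '(no tool calls recorded)';
  return [
    `You encountered an issue: "${failureMessage}"`,
    'Your most recent actions were:',
    actions,
    'What went wrong? Suggest one alternative strategy before continuing.',
  ].join('\n');
}
```

The returned string could replace the inline `reflectionPrompt` in `runWithReflection`, assuming the agent keeps its tool calls in `state.toolCalls`.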
Alternative Tool Selection
When a tool fails, try a different tool to accomplish the goal.
interface ToolAlternative {
primary: string;
alternatives: string[];
purpose: string;
}
const toolAlternatives: ToolAlternative[] = [
{
primary: 'web_search',
alternatives: ['academic_search', 'news_search', 'site_search'],
purpose: 'Find information',
},
{
primary: 'database_query',
alternatives: ['cache_lookup', 'api_call'],
purpose: 'Fetch data',
},
{
primary: 'send_email',
alternatives: ['send_slack', 'send_notification'],
purpose: 'Notify user',
},
];
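// Minimal shape of a tool call as used by this class
interface ToolCall {
name: string;
input: Record<string, unknown>;
}
class AdaptiveToolSelection {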
async executeGoal(goal: string, toolCalls: ToolCall[]): Promise<string | null> {
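// First try: use primary tool
const toolCall = toolCalls[0];
if (!toolCall) {
return null; // Nothing to execute
}
const primaryTool = toolCall.name;
let result = await this.executeTool(primaryTool, toolCall.input);
if (result.success) {
// output is optional on the result type, so normalize a missing value to null
return result.output ?? null;
}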
// Find alternatives
const alternatives = this.findAlternatives(primaryTool);
if (alternatives.length === 0) {
return null; // No alternatives
}
console.log(`Primary tool ${primaryTool} failed. Trying alternatives: ${alternatives.join(', ')}`);
// Try each alternative
for (const altTool of alternatives) {
const adapted = this.adaptInputs(toolCall.input, primaryTool, altTool);
result = await this.executeTool(altTool, adapted);
if (result.success) {
console.log(`Success with alternative: ${altTool}`);
return result.output ?? null;
}
}
// All tools failed
return null;
}
private findAlternatives(tool: string): string[] {
const alt = toolAlternatives.find((ta) => ta.primary === tool);
return alt?.alternatives || [];
}
private adaptInputs(
originalInputs: Record<string, unknown>,
fromTool: string,
toTool: string,
): Record<string, unknown> {
// Adapt parameters for different tools
const adapted = { ...originalInputs };
// web_search <-> academic_search: same 'query' parameter
if (
(fromTool === 'web_search' && toTool === 'academic_search') ||
(fromTool === 'academic_search' && toTool === 'web_search')
) {
return adapted;
}
// database_query -> api_call: convert SQL to API params
if (fromTool === 'database_query' && toTool === 'api_call') {
// Extract what data we need and map to API endpoint
adapted['endpoint'] = '/api/data';
adapted['method'] = 'GET';
}
return adapted;
}
private async executeTool(
name: string,
input: Record<string, unknown>,
): Promise<{ success: boolean; output?: string; error?: string }> {
try {
// Tool execution
return { success: true, output: 'Result' };
} catch (error) {
return { success: false, error: (error as Error).message };
}
}
}
Build a tool fallback graph so agents can try alternatives when tools fail.
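The `toolAlternatives` list is one level deep. If alternatives can have alternatives of their own, the structure becomes a graph, and the agent needs a traversal order that never revisits a tool. A minimal sketch, assuming a hypothetical `fallbackGraph` map and `fallbackOrder` helper (the graph contents are illustrative):

```typescript
// Hypothetical fallback graph: each tool maps to the tools to try if it fails.
const fallbackGraph: Record<string, string[]> = {
  web_search: ['academic_search', 'news_search'],
  academic_search: ['web_search', 'site_search'],
  database_query: ['cache_lookup', 'api_call'],
};

// Breadth-first walk that returns every reachable tool once, primary first.
function fallbackOrder(primary: string, graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const seen = new Set<string>();
  const queue: string[] = [primary];
  while (queue.length > 0) {
    const tool = queue.shift() as string;
    if (seen.has(tool)) continue;
    seen.add(tool);
    order.push(tool);
    // Tools without an entry in the graph have no further alternatives
    for (const next of graph[tool] ?? []) {
      if (!seen.has(next)) queue.push(next);
    }
  }
  return order;
}

// fallbackOrder('web_search', fallbackGraph)
// → ['web_search', 'academic_search', 'news_search', 'site_search']
```

The `seen` set is what keeps the cycle between `web_search` and `academic_search` from looping forever.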
Human-in-the-Loop Escalation
Some failures require human judgment. Escalate gracefully.
interface EscalationRequest {
escalationId: string;
agentName: string;
situation: string;
agentState: Record<string, unknown>;
options: string[];
timestamp: number;
timeout: number; // milliseconds to wait for human response
}
class EscalationManager {
private escalations: Map<string, EscalationRequest> = new Map();
private humanResponses: Map<string, string> = new Map();
async escalateToHuman(
agentName: string,
situation: string,
agentState: Record<string, unknown>,
options: string[],
): Promise<string> {
const request: EscalationRequest = {
escalationId: `esc-${Date.now()}-${Math.random()}`,
agentName,
situation,
agentState,
options,
timestamp: Date.now(),
timeout: 300000, // 5 minute timeout
};
this.escalations.set(request.escalationId, request);
// Notify human (via email, Slack, dashboard, etc.)
await this.notifyHuman(request);
// Wait for response with timeout
const response = await this.waitForResponse(request.escalationId, request.timeout);
return response;
}
private async notifyHuman(request: EscalationRequest): Promise<void> {
const message = `
AI Agent Escalation Required
Agent: ${request.agentName}
Situation: ${request.situation}
Options:
${request.options.map((o, i) => `${i + 1}. ${o}`).join('\n')}
Response needed within ${request.timeout / 1000} seconds.
Escalation ID: ${request.escalationId}
`;
console.log(message);
// In production: send via Slack, email, or internal API
// await sendToSlack(message);
}
private async waitForResponse(escalationId: string, timeout: number): Promise<string> {
const startTime = Date.now();
while (Date.now() - startTime < timeout) {
const response = this.humanResponses.get(escalationId);
if (response) {
return response;
}
// Wait 1 second before checking again
await new Promise((resolve) => setTimeout(resolve, 1000));
}
// Timeout: choose default action
console.log(`Escalation ${escalationId} timed out. Using default action.`);
return 'abort';
}
receiveHumanResponse(escalationId: string, response: string): void {
this.humanResponses.set(escalationId, response);
}
}
// Usage in agent
class HumanInTheLoopAgent {
private escalationManager = new EscalationManager();
async run(task: string): Promise<string> {
try {
// Try to complete task
return await this.attemptTask(task);
} catch (error) {
const errorMsg = (error as Error).message;
// Determine if human input is needed
if (this.requiresHumanDecision(errorMsg)) {
const decision = await this.escalationManager.escalateToHuman(
'MyAgent',
`Encountered error: ${errorMsg}`,
{ task },
['Retry with different approach', 'Skip this task', 'Abort entire process'],
);
if (decision === 'Retry with different approach') {
return this.attemptTaskDifferently(task);
} else if (decision === 'Skip this task') {
return `Skipped task: ${task}`;
} else {
throw new Error('User chose to abort');
}
}
throw error;
}
}
private requiresHumanDecision(error: string): boolean {
// Determine which errors need human input
return error.includes('Permission required') || error.includes('Confirmation needed');
}
private async attemptTask(task: string): Promise<string> {
return 'Result';
}
private async attemptTaskDifferently(task: string): Promise<string> {
return 'Different result';
}
}
Escalation keeps the agent from making high-stakes decisions on its own. Use it for actions that require permission and for ambiguous situations.
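The agent above compares the human's reply to the option text with strict equality, so a reply of "2" or "skip this task" would fall through to the abort branch. A small normalization step can make this more forgiving. The sketch below assumes a hypothetical `resolveDecision` helper that is not part of the original code.

```typescript
// Maps a free-form human reply onto one of the offered options.
// Accepts a 1-based option number ("2") or the option text in any letter case.
// Anything else, including the 'abort' timeout default, resolves to the fallback.
function resolveDecision(reply: string, options: string[], fallback: string): string {
  const trimmed = reply.trim();
  if (/^\d+$/.test(trimmed)) {
    const index = Number(trimmed) - 1;
    if (index >= 0 && index < options.length) return options[index];
    return fallback;
  }
  const match = options.find((o) => o.toLowerCase() === trimmed.toLowerCase());
  return match ?? fallback;
}
```

Passing the last option as the fallback preserves the original behavior of aborting when the reply is unrecognized or the escalation times out.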
Agent Checkpoint and Resume
Save agent state so you can resume after failures.
interface AgentCheckpoint {
checkpointId: string;
sessionId: string;
iteration: number;
messages: any[];
toolCalls: any[];
timestamp: number;
status: 'active' | 'failed' | 'completed';
}
class CheckpointingAgent {
private checkpointStore: Map<string, AgentCheckpoint> = new Map();
async runWithCheckpoints(sessionId: string, task: string): Promise<string> {
// Try to resume from last checkpoint
let state = this.loadLatestCheckpoint(sessionId) || this.initializeState(sessionId, task);
while (state.iteration < 10) {
state.iteration++;
// Create checkpoint before iteration
const checkpoint = this.saveCheckpoint(state);
try {
const response = await this.executeIteration(state);
if (this.isComplete(response)) {
checkpoint.status = 'completed';
return response;
}
state.messages.push({ role: 'assistant', content: response });
} catch (error) {
checkpoint.status = 'failed';
console.log(`Iteration ${state.iteration} failed. Checkpoint saved for recovery.`);
throw error;
}
}
throw new Error('Max iterations exceeded');
}
private saveCheckpoint(state: AgentState): AgentCheckpoint {
const checkpoint: AgentCheckpoint = {
checkpointId: `cp-${Date.now()}-${Math.random()}`,
sessionId: state.sessionId,
iteration: state.iteration,
messages: JSON.parse(JSON.stringify(state.messages)), // Deep copy
toolCalls: JSON.parse(JSON.stringify(state.toolCalls)),
timestamp: Date.now(),
status: 'active',
};
this.checkpointStore.set(checkpoint.checkpointId, checkpoint);
return checkpoint;
}
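private loadLatestCheckpoint(sessionId: string): AgentState | null {
const checkpoints = Array.from(this.checkpointStore.values()).filter(
(cp) => cp.sessionId === sessionId,
);
if (checkpoints.length === 0) {
return null;
}
// Newest first; break timestamp ties by iteration number
const latest = checkpoints.sort(
(a, b) => b.timestamp - a.timestamp || b.iteration - a.iteration,
)[0];
// If the most recent checkpoint completed the task, there is nothing to resume
if (latest.status === 'completed') {
return null;
}
return {
sessionId,
// Checkpoints are saved after the counter is incremented, so step back one
// to retry the interrupted iteration instead of skipping its number
iteration: latest.iteration - 1,
messages: latest.messages,
toolCalls: latest.toolCalls,
};
}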
private initializeState(sessionId: string, task: string): AgentState {
return {
sessionId,
iteration: 0,
messages: [{ role: 'user', content: task }],
toolCalls: [],
};
}
private async executeIteration(state: AgentState): Promise<string> {
return '';
}
private isComplete(response: string): boolean {
return true;
}
}
interface AgentState {
sessionId: string;
iteration: number;
messages: any[];
toolCalls: any[];
}
Checkpoints enable recovery without losing progress. Always save before expensive operations.
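The `checkpointStore` above is an in-memory `Map`, so checkpoints do not survive a process crash. Persisting them means serializing each checkpoint and validating it when it is read back. The following is a minimal sketch of that round trip; `StoredCheckpoint`, `serializeCheckpoint`, and `parseCheckpoint` are hypothetical names, and where the string is stored (file, database, key-value store) is left open.

```typescript
interface StoredCheckpoint {
  checkpointId: string;
  sessionId: string;
  iteration: number;
  messages: { role: string; content: string }[];
  status: 'active' | 'failed' | 'completed';
}

function serializeCheckpoint(cp: StoredCheckpoint): string {
  return JSON.stringify(cp);
}

// Parses a stored checkpoint and rejects records that are corrupt or missing required fields.
function parseCheckpoint(raw: string): StoredCheckpoint | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // Corrupt or truncated record
  }
  if (typeof data !== 'object' || data === null) return null;
  const cp = data as Partial<StoredCheckpoint>;
  const validStatus =
    cp.status === 'active' || cp.status === 'failed' || cp.status === 'completed';
  if (
    typeof cp.checkpointId !== 'string' ||
    typeof cp.sessionId !== 'string' ||
    typeof cp.iteration !== 'number' ||
    !Array.isArray(cp.messages) ||
    !validStatus
  ) {
    return null;
  }
  return cp as StoredCheckpoint;
}
```

Returning `null` for a bad record lets the caller fall back to an older checkpoint or a fresh start rather than resuming from damaged state.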
Root Cause Analysis
When agents fail, understand why.
interface RootCauseAnalysis {
failureType: string;
immediateReason: string;
rootCause: string;
contributingFactors: string[];
recommendation: string;
}
class RootCauseAnalyzer {
async analyze(failure: FailureSignal, state: AgentState): Promise<RootCauseAnalysis> {
let rootCause = '';
let factors: string[] = [];
switch (failure.type) {
case 'max_retries_exceeded':
// Root cause: tool not working or wrong approach
rootCause = this.analyzeRetryFailure(state);
factors = [
'Same tool used repeatedly',
'No progress between iterations',
'Agent not recognizing failure',
];
break;
case 'contradiction':
// Root cause: LLM confused or reasoning is poor
rootCause = 'Agent reasoning is inconsistent or hallucinating';
factors = ['Temperature may be too high', 'Prompt is ambiguous'];
break;
case 'hallucination':
// Root cause: Tool schema not clear enough
rootCause = 'Agent hallucinated invalid tool call';
factors = [
'Tool description unclear',
'Parameter validation missing',
'Agent not familiar with tool',
];
break;
case 'timeout':
// Root cause: too many iterations or slow tools
rootCause = this.analyzeTimeoutFailure(state);
factors = [
'Tool calls taking too long',
'Inefficient approach',
'Network latency',
];
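break;
default:
// 'loop' and 'resource_exhausted' have no dedicated analysis here
rootCause = `No specific analysis for failure type: ${failure.type}`;
break;
}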
return {
failureType: failure.type,
immediateReason: failure.message,
rootCause,
contributingFactors: factors,
recommendation: this.recommendFix(failure.type, rootCause),
};
}
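private analyzeRetryFailure(state: AgentState): string {
const lastFew = state.toolCalls.slice(-3);
// Compare name and input together so the message below is accurate
const calls = lastFew.map((tc) => `${tc.name}:${JSON.stringify(tc.input)}`);
// Only report repetition when there are three calls and all are identical
if (calls.length === 3 && calls[0] === calls[1] && calls[1] === calls[2]) {
return `Agent is calling ${lastFew[0].name} repeatedly without changing inputs`;
}
return 'Agent is not making progress toward goal';
}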
private analyzeTimeoutFailure(state: AgentState): string {
const totalTokens = state.messages.reduce(
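// Rough estimate of about 4 characters per token; tolerate missing or non-string content
(sum, m) => sum + Math.ceil(String(m.content ?? '').length / 4),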
0,
);
if (totalTokens > 50000) {
return 'Agent has used too many tokens (likely inefficient prompting)';
}
return 'Agent is taking too many iterations';
}
private recommendFix(failureType: string, rootCause: string): string {
if (failureType === 'hallucination') {
return 'Improve tool descriptions and add examples to prompt';
}
if (failureType === 'max_retries_exceeded') {
return 'Add tool alternatives or simplify the task';
}
if (failureType === 'timeout') {
return 'Reduce context window or use fewer tool calls per iteration';
}
return 'Review agent prompt and tool definitions';
}
}
Root cause analysis helps you avoid repeating the same mistakes. Log every failure with its type and root cause so recurring patterns become visible.
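One simple way to surface patterns from logged failures is to count them by type and look at the most frequent first. The sketch below assumes failures are collected into an array of hypothetical `FailureRecord` objects; `summarizeFailures` is likewise an illustrative name.

```typescript
interface FailureRecord {
  type: string;
  rootCause: string;
}

// Counts logged failures per type and returns them sorted, most frequent first.
function summarizeFailures(log: FailureRecord[]): { type: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const record of log) {
    counts.set(record.type, (counts.get(record.type) ?? 0) + 1);
  }
  return Array.from(counts, ([type, count]) => ({ type, count })).sort(
    (a, b) => b.count - a.count,
  );
}
```

If `hallucination` dominates the summary, for example, that points toward the tool descriptions rather than the agent loop.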
Graceful Degradation
When agents fail completely, degrade to simpler solutions.
class DegradingAgent {
async runWithFallback(task: string): Promise<string> {
// Try level 1: Full agent with tools
try {
return await this.runFullAgent(task);
} catch (error) {
console.log(`Full agent failed: ${(error as Error).message}. Degrading...`);
}
// Try level 2: Agent without tools (just reasoning)
try {
return await this.runAgentNoTools(task);
} catch (error) {
console.log(`Agent without tools failed. Degrading...`);
}
// Try level 3: Simple LLM prompt
try {
return await this.runSimplePrompt(task);
} catch (error) {
console.log(`Simple prompt failed. Returning error response.`);
}
// Level 4: Return error message to user
return `Unable to complete task: "${task}". Please try again or contact support.`;
}
private async runFullAgent(task: string): Promise<string> {
// Full agent with all tools
return 'Full agent result';
}
private async runAgentNoTools(task: string): Promise<string> {
// Agent without tool use, pure reasoning
return 'Reasoning-only result';
}
private async runSimplePrompt(task: string): Promise<string> {
// Simple LLM call with no agent loop
return 'Simple result';
}
}
Graceful degradation ensures users get a response even when advanced features fail.
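The `DegradingAgent` above discards the errors from each level and does not report which level finally answered, which makes degraded responses hard to monitor. A generic version of the same chain can return that information alongside the output. This is a sketch under assumptions: `Level`, `DegradedResult`, and `runLevels` are hypothetical names, not part of the original code.

```typescript
interface Level {
  name: string;
  run: (task: string) => Promise<string>;
}

interface DegradedResult {
  level: string; // Which level produced the output, or 'none' if all failed
  output: string;
  errors: string[]; // Error messages from the levels that failed
}

// Tries each level in order and returns the first success along with earlier errors.
async function runLevels(task: string, levels: Level[]): Promise<DegradedResult> {
  const errors: string[] = [];
  for (const level of levels) {
    try {
      const output = await level.run(task);
      return { level: level.name, output, errors };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      errors.push(`${level.name}: ${message}`);
    }
  }
  return { level: 'none', output: `Unable to complete task: "${task}".`, errors };
}
```

Callers can log `level` and `errors` to track how often the full agent is actually serving requests.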
Checklist
- Detection: monitor for max iterations, timeouts, contradictions, hallucinations, loops
- Reflection: ask the agent what went wrong and have it suggest alternatives
- Alternatives: maintain tool fallback chains
- Escalation: human-in-the-loop for permission-required actions
- Checkpoints: save state before expensive operations
- Analysis: understand root causes of failures
- Degradation: fall back to simpler solutions
Conclusion
Agent failures are inevitable. Build detection, recovery, and fallback mechanisms into every agent. Detect failures early, reflect and recover when possible, escalate to humans for critical decisions, and degrade gracefully. With proper error handling, agents become reliable systems, not unreliable toys.