AI Agents in Backend Systems — Building Reliable Tool-Calling Architectures

Introduction

AI agents go beyond single LLM calls. They loop, call tools, make decisions, and persist state. Building reliable agent systems requires managing tool calling patterns, timeouts, parallel execution, human approvals, state management, and recovery from failures. This post covers production patterns for autonomous AI agents.

Tool/Function Calling Patterns
Agent Loop with Timeout
Parallel Tool Execution
Human-in-the-Loop Checkpoints
Agent State Persistence
Error Recovery Strategies
AI Agent Implementation Checklist
Conclusion

Tool/Function Calling Patterns

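LLMs do not run functions themselves: they emit a structured request naming a tool and its arguments, and your backend decides whether and how to execute it. Define tools carefully so the agent can act autonomously while the damage from a bad call stays bounded.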

interface Tool {
  name: string;
  description: string;
  parameters: Record<string, any>;
  requiredApprovals?: string[];
  rateLimitPerMinute?: number;
  timeout?: number;
}

interface ToolCall {
  id: string;
  name: string;
  arguments: Record<string, any>;
  status: 'pending' | 'approved' | 'executing' | 'completed' | 'failed';
  result?: any;
  error?: string;
  executedAt?: Date;
  approvedBy?: string;
}

class ToolRegistry {
  private tools: Map<string, Tool> = new Map();
  private toolExecutors: Map<string, (args: any) => Promise<any>> = new Map();
  private toolCallCounts: Map<string, number[]> = new Map();

  registerTool(tool: Tool, executor: (args: any) => Promise<any>): void {
    this.tools.set(tool.name, tool);
    this.toolExecutors.set(tool.name, executor);
    this.toolCallCounts.set(tool.name, []);
  }

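  // Only name, description, and parameters are sent to the model; approval and rate-limit metadata stay server-side
  getToolDefinitionsForLLM(): Array<{ type: 'function'; function: Pick<Tool, 'name' | 'description' | 'parameters'> }> {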
    return Array.from(this.tools.values()).map(tool => ({
      type: 'function' as const,
      function: {
        name: tool.name,
        description: tool.description,
        parameters: tool.parameters,
      },
    }));
  }

  async executeTool(toolCall: ToolCall, options?: { timeout?: number }): Promise<ToolCall> {
    const tool = this.tools.get(toolCall.name);
    if (!tool) {
      toolCall.status = 'failed';
      toolCall.error = `Tool ${toolCall.name} not found`;
      return toolCall;
    }

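    // Refuse approval-gated tools unless the caller has already marked this call as approved
    if (tool.requiredApprovals?.length && toolCall.status !== 'approved') {
      toolCall.status = 'failed';
      toolCall.error = `Tool ${toolCall.name} requires approval from: ${tool.requiredApprovals.join(', ')}`;
      return toolCall;
    }

    // Check rate limits (sliding one-minute window of call timestamps)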
    if (tool.rateLimitPerMinute) {
      const now = Date.now();
      const counts = this.toolCallCounts.get(toolCall.name) || [];
      const oneMinuteAgo = now - 60000;
      const recentCalls = counts.filter(t => t > oneMinuteAgo);

      if (recentCalls.length >= tool.rateLimitPerMinute) {
        toolCall.status = 'failed';
        toolCall.error = `Rate limit exceeded for ${toolCall.name}`;
        return toolCall;
      }
      recentCalls.push(now);
      this.toolCallCounts.set(toolCall.name, recentCalls);
    }

    const executor = this.toolExecutors.get(toolCall.name);
    if (!executor) {
      toolCall.status = 'failed';
      toolCall.error = `No executor for ${toolCall.name}`;
      return toolCall;
    }

    toolCall.status = 'executing';

    // Keep a handle on the timer so it can be cleared once the race settles
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      const timeoutMs = options?.timeout || tool.timeout || 30000;
      const result = await Promise.race([
        executor(toolCall.arguments),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error(`Tool execution timeout: ${timeoutMs}ms`)), timeoutMs);
        }),
      ]);

      toolCall.status = 'completed';
      toolCall.result = result;
    } catch (error) {
      toolCall.status = 'failed';
      toolCall.error = error instanceof Error ? error.message : String(error);
    } finally {
      // The race only stops waiting; the executor keeps running unless it supports cancellation
      clearTimeout(timer);
      toolCall.executedAt = new Date();
    }

    return toolCall;
  }
}

// Tool definitions
const tools: Tool[] = [
  {
    name: 'fetch_document',
    description: 'Fetch a document by ID and return its contents',
    parameters: {
      type: 'object' as const,
      properties: {
        documentId: { type: 'string', description: 'The document ID' },
      },
      required: ['documentId'],
    },
    rateLimitPerMinute: 60,
    timeout: 10000,
  },
  {
    name: 'update_user_profile',
    description: 'Update user profile (requires approval)',
    parameters: {
      type: 'object' as const,
      properties: {
        userId: { type: 'string' },
        updates: { type: 'object' },
      },
      required: ['userId', 'updates'],
    },
    requiredApprovals: ['admin'],
    rateLimitPerMinute: 10,
    timeout: 5000,
  },
  {
    name: 'send_email',
    description: 'Send an email (requires approval)',
    parameters: {
      type: 'object' as const,
      properties: {
        to: { type: 'string' },
        subject: { type: 'string' },
        body: { type: 'string' },
      },
      required: ['to', 'subject', 'body'],
    },
    requiredApprovals: ['manager'],
    rateLimitPerMinute: 5,
    timeout: 15000,
  },
];
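
Because the model writes the arguments, it can help to check them against the tool's declared schema before execution. The sketch below is a minimal, hypothetical `validateArguments` helper (it is not part of the registry above, and `ParamSchema` is an assumed shape matching the `parameters` objects in the tool definitions). It checks only `required` keys and top-level primitive types; a production system would more likely use a full JSON Schema validator.

```typescript
// Hypothetical helper: checks LLM-supplied arguments against a tool's
// JSON-schema-style `parameters` object. Covers only `required` keys and
// top-level primitive `type` checks, not the full JSON Schema spec.
interface ParamSchema {
  type: 'object';
  properties: Record<string, { type: string; description?: string }>;
  required?: string[];
}

function validateArguments(schema: ParamSchema, args: Record<string, unknown>): string[] {
  const errors: string[] = [];

  // Every required key must be present and non-null
  for (const key of schema.required ?? []) {
    if (args[key] === undefined || args[key] === null) {
      errors.push(`Missing required argument: ${key}`);
    }
  }

  // Every supplied key must be declared and have the declared primitive type
  for (const [key, value] of Object.entries(args)) {
    const prop = schema.properties[key];
    if (!prop) {
      errors.push(`Unknown argument: ${key}`);
      continue;
    }
    if (value === undefined || value === null) continue; // reported above if required
    const actual = Array.isArray(value) ? 'array' : typeof value;
    const matches = prop.type === 'integer' ? Number.isInteger(value) : actual === prop.type;
    if (!matches) {
      errors.push(`Argument ${key} should be ${prop.type}, got ${actual}`);
    }
  }

  return errors;
}

// Same shape as the send_email parameters defined above
const sendEmailSchema: ParamSchema = {
  type: 'object',
  properties: {
    to: { type: 'string' },
    subject: { type: 'string' },
    body: { type: 'string' },
  },
  required: ['to', 'subject', 'body'],
};

// `subject` is missing and `body` has the wrong type, so two problems are reported
const problems = validateArguments(sendEmailSchema, { to: 'a@example.com', body: 42 });
console.log(problems);
```

Returning the error list to the model as the tool result, rather than throwing, gives the agent a chance to correct its own call on the next iteration.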

Agent Loop with Timeout

Agents run iteratively: think → call tools → observe → repeat.

// Message shape follows the OpenAI Chat Completions format: assistant messages may
// carry tool_calls, and each tool message must reference one of them by tool_call_id.
interface AgentMessage {
  role: 'user' | 'assistant' | 'system' | 'tool';
  content: string;
  tool_calls?: Array<{ id: string; type: 'function'; function: { name: string; arguments: string } }>;
  tool_call_id?: string;
}

interface AgentState {
  id: string;
  userId: string;
  goal: string;
  messages: AgentMessage[];
  toolCalls: ToolCall[];
  iterations: number;
  maxIterations: number;
  status: 'running' | 'completed' | 'failed' | 'timeout' | 'awaiting_approval';
  createdAt: Date;
  lastActivityAt: Date;
}

class AgentExecutor {
  private toolRegistry: ToolRegistry;
  private openai: any; // OpenAI client

  // Inject the OpenAI client; runAgent records a failure if it is missing
  constructor(toolRegistry: ToolRegistry, openai?: any) {
    this.toolRegistry = toolRegistry;
    this.openai = openai;
  }

  async runAgent(state: AgentState, timeout: number = 300000): Promise<AgentState> {
    const startTime = Date.now();
    state.status = 'running';

    try {
      if (!this.openai) {
        throw new Error('OpenAI client not configured');
      }

      while (state.iterations < state.maxIterations) {
        // The deadline is checked once per iteration; an in-flight LLM or tool call is not interrupted
        if (Date.now() - startTime > timeout) {
          state.status = 'timeout';
          state.messages.push({
            role: 'system',
            content: `Agent timeout after ${timeout}ms`,
          });
          break;
        }

        state.lastActivityAt = new Date();

        // 1. Call LLM to decide next action
        const llmResponse = await this.openai.chat.completions.create({
          model: 'gpt-4-turbo-preview',
          messages: state.messages,
          tools: this.toolRegistry.getToolDefinitionsForLLM(),
          tool_choice: 'auto',
        });

        const assistantMessage = llmResponse.choices[0].message;
        // Keep tool_calls on the stored message so the tool results below can reference them
        state.messages.push({
          role: 'assistant',
          content: assistantMessage.content || '',
          ...(assistantMessage.tool_calls?.length ? { tool_calls: assistantMessage.tool_calls } : {}),
        });

        // 2. Check if agent is done: no tool calls means the model gave a final answer
        if (!assistantMessage.tool_calls || assistantMessage.tool_calls.length === 0) {
          state.status = 'completed';
          break;
        }

        // 3. Parse tool calls; malformed JSON arguments fail that one call, not the whole run
        const toolCalls: ToolCall[] = assistantMessage.tool_calls.map((call: any): ToolCall => {
          try {
            return {
              id: call.id,
              name: call.function.name,
              arguments: JSON.parse(call.function.arguments || '{}'),
              status: 'pending',
            };
          } catch {
            return {
              id: call.id,
              name: call.function.name,
              arguments: {},
              status: 'failed',
              error: 'Tool arguments were not valid JSON',
            };
          }
        });

        // 4. Execute tools in parallel (calls that failed to parse are passed through as-is)
        const executedCalls = await Promise.all(
          toolCalls.map(call =>
            call.status === 'failed' ? call : this.toolRegistry.executeTool(call)
          )
        );

        state.toolCalls.push(...executedCalls);

        // 5. Add one tool message per call, linked to the request by tool_call_id
        for (const toolCall of executedCalls) {
          state.messages.push({
            role: 'tool',
            tool_call_id: toolCall.id,
            content: JSON.stringify(
              toolCall.status === 'completed'
                ? { result: toolCall.result ?? null }
                : { error: toolCall.error }
            ),
          });
        }

        state.iterations++;
      }

      if (state.iterations >= state.maxIterations && state.status === 'running') {
        state.status = 'failed';
        state.messages.push({
          role: 'system',
          content: `Max iterations (${state.maxIterations}) reached`,
        });
      }
    } catch (error) {
      state.status = 'failed';
      state.messages.push({
        role: 'system',
        content: `Agent error: ${error instanceof Error ? error.message : String(error)}`,
      });
    }

    return state;
  }
}
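
The loop checks its deadline only between iterations, so a single slow tool call can push a run well past the intended limit. One way to tighten this, sketched below under stated assumptions, is to cap each tool's timeout at whatever remains of the agent's overall budget and pass the result through the `options.timeout` argument of `executeTool`. The helper names `remainingBudgetMs` and `effectiveToolTimeout` are hypothetical.

```typescript
// Hypothetical helpers: derive a per-call timeout from the agent's overall deadline.

// Milliseconds left before the agent's total budget is spent (never negative)
function remainingBudgetMs(
  startTime: number,
  totalTimeoutMs: number,
  now: number = Date.now()
): number {
  return Math.max(0, totalTimeoutMs - (now - startTime));
}

// The smaller of the tool's own timeout and the remaining agent budget
function effectiveToolTimeout(
  toolTimeoutMs: number,
  startTime: number,
  totalTimeoutMs: number,
  now: number = Date.now()
): number {
  return Math.min(toolTimeoutMs, remainingBudgetMs(startTime, totalTimeoutMs, now));
}

// An agent started at t=0 with a 300 s budget; at t=295 s a 30 s tool gets only 5 s
const capped = effectiveToolTimeout(30_000, 0, 300_000, 295_000);
console.log(capped); // 5000
```

A result of 0 means the budget is already spent. In that case it is safer to skip the call than to pass 0 along, because `executeTool` treats a falsy timeout as "use the default".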

Parallel Tool Execution

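Run independent tool calls concurrently, so total latency approaches that of the slowest call rather than the sum of all of them. Calls that depend on another tool's output wait until that tool has finished.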

interface ToolDependency {
  tool: Tool;
  dependsOn?: string[]; // Names of tools that must complete first
}

class ParallelToolExecutor {
  private toolRegistry: ToolRegistry;

  constructor(toolRegistry: ToolRegistry) {
    this.toolRegistry = toolRegistry;
  }

  async executeToolsInParallel(toolCalls: ToolCall[], dependencies?: ToolDependency[]): Promise<ToolCall[]> {
    const completed: Map<string, ToolCall> = new Map();
    const queue = new Map(toolCalls.map(tc => [tc.id, tc]));

    while (queue.size > 0) {
      const readyToExecute: ToolCall[] = [];

      for (const [id, toolCall] of queue) {
        const toolDep = dependencies?.find(d => d.tool.name === toolCall.name);

        // Check if dependencies are satisfied
        const dependenciesMet = !toolDep?.dependsOn || toolDep.dependsOn.every(dep =>
          Array.from(completed.values()).some(tc => tc.name === dep)
        );

        if (dependenciesMet) {
          readyToExecute.push(toolCall);
          queue.delete(id);
        }
      }

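      // Nothing is ready but calls remain: either a cycle, or a dependency on a tool that is not in this batch
      if (readyToExecute.length === 0 && queue.size > 0) {
        throw new Error('Unresolvable dependency in tool calls (circular, or the required tool is not in this batch)');
      }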

      // Execute ready tools in parallel
      const results = await Promise.all(readyToExecute.map(tc => this.toolRegistry.executeTool(tc)));

      for (const result of results) {
        completed.set(result.id, result);
      }
    }

    return Array.from(completed.values());
  }
}

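// Example: fetch user data, documents, and analytics in parallel.
// Assumes a ToolRegistry instance named toolRegistry with fetch_user, fetch_documents,
// and fetch_analytics registered (they are not among the tools defined earlier).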
const parallelToolCalls: ToolCall[] = [
  {
    id: '1',
    name: 'fetch_user',
    arguments: { userId: 'user123' },
    status: 'pending',
  },
  {
    id: '2',
    name: 'fetch_documents',
    arguments: { userId: 'user123' },
    status: 'pending',
  },
  {
    id: '3',
    name: 'fetch_analytics',
    arguments: { userId: 'user123' },
    status: 'pending',
  },
];

const parallelExecutor = new ParallelToolExecutor(toolRegistry);
const results = await parallelExecutor.executeToolsInParallel(parallelToolCalls);
// All three tools execute concurrently
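
The scheduling loop above effectively groups calls into waves: everything whose dependencies are already complete runs together, then the next group, and so on. The sketch below isolates that planning step as a pure function so the ordering can be inspected without executing anything. `planWaves` and the `build_report` tool are hypothetical names used only for illustration, and dependencies are matched by tool name, as in `ToolDependency`.

```typescript
// Hypothetical planner: groups tool names into waves. Every tool in a wave
// depends only on tools from earlier waves, so a wave can run concurrently.
function planWaves(toolNames: string[], dependsOn: Record<string, string[]>): string[][] {
  const remaining = new Set<string>(toolNames);
  const done = new Set<string>();
  const waves: string[][] = [];

  while (remaining.size > 0) {
    // A tool is ready when all of its dependencies finished in an earlier wave
    const wave = Array.from(remaining).filter(name =>
      (dependsOn[name] ?? []).every(dep => done.has(dep))
    );

    if (wave.length === 0) {
      throw new Error(`Unresolvable dependencies among: ${Array.from(remaining).join(', ')}`);
    }

    for (const name of wave) {
      remaining.delete(name);
      done.add(name);
    }
    waves.push(wave);
  }

  return waves;
}

const waves = planWaves(
  ['fetch_user', 'fetch_documents', 'fetch_analytics', 'build_report'],
  {
    fetch_documents: ['fetch_user'],
    build_report: ['fetch_documents', 'fetch_analytics'],
  }
);
console.log(waves);
// [['fetch_user', 'fetch_analytics'], ['fetch_documents'], ['build_report']]
```

Planning up front also surfaces a cycle or a missing dependency before any tool with side effects has run.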

Human-in-the-Loop Checkpoints

Agents make mistakes. Require approval for critical actions.

interface ApprovalRequest {
  id: string;
  agentId: string;
  toolCall: ToolCall;
  requiredApprovers: string[];
  approvals: Map<string, { approver: string; approvedAt: Date; reason?: string }>;
  rejectedAt?: Date;
  rejectionReason?: string;
  expiresAt: Date;
}

class ApprovalManager {
  private requests: Map<string, ApprovalRequest> = new Map();
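
  // Inject the notification service; any object with an async send() method works here
  constructor(private notificationService: any) {}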

  async requestApproval(
    agentId: string,
    toolCall: ToolCall,
    requiredApprovers: string[],
    expiryMinutes: number = 30
  ): Promise<ApprovalRequest> {
    const request: ApprovalRequest = {
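      // Random suffix avoids ID collisions when two requests are created in the same millisecond
      id: `approval_${Date.now()}_${Math.random().toString(36).slice(2, 10)}`,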
      agentId,
      toolCall,
      requiredApprovers,
      approvals: new Map(),
      expiresAt: new Date(Date.now() + expiryMinutes * 60000),
    };

    this.requests.set(request.id, request);

    // Notify approvers
    for (const approver of requiredApprovers) {
      await this.notificationService.send({
        to: approver,
        type: 'approval_request',
        data: {
          requestId: request.id,
          toolName: toolCall.name,
          arguments: toolCall.arguments,
        },
      });
    }

    return request;
  }

  async approve(
    requestId: string,
    approver: string,
    reason?: string
  ): Promise<{ approved: boolean; allApprovalsReceived: boolean }> {
    const request = this.requests.get(requestId);
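    if (!request) throw new Error(`Approval request ${requestId} not found`);

    // A rejected or expired request must not collect further approvals
    if (request.rejectedAt) throw new Error(`Approval request ${requestId} was rejected`);
    if (new Date() > request.expiresAt) throw new Error(`Approval request ${requestId} has expired`);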

    if (!request.requiredApprovers.includes(approver)) {
      throw new Error(`${approver} is not an authorized approver`);
    }

    request.approvals.set(approver, { approver, approvedAt: new Date(), reason });

    const allApprovalsReceived = request.requiredApprovers.every(
      approver => request.approvals.has(approver)
    );

    return { approved: true, allApprovalsReceived };
  }

  async reject(requestId: string, reason: string): Promise<void> {
    const request = this.requests.get(requestId);
    if (!request) throw new Error(`Approval request ${requestId} not found`);

    request.rejectedAt = new Date();
    request.rejectionReason = reason;
  }

  isApproved(requestId: string): boolean {
    const request = this.requests.get(requestId);
    if (!request) return false;

    if (request.rejectedAt) return false;
    if (new Date() > request.expiresAt) return false;

    return request.requiredApprovers.every(approver => request.approvals.has(approver));
  }
}
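
One way to route calls before they reach the registry is to split each batch into calls that can run immediately and calls that must wait for a human. The sketch below assumes simplified, hypothetical `GatedTool` and `PendingCall` shapes that carry only the fields needed for the decision; `partitionByApproval` is likewise an illustrative name, not part of the classes above.

```typescript
// Hypothetical routing step: separate tool calls that can run now from
// those whose tool definition lists required approvers.
interface GatedTool {
  name: string;
  requiredApprovals?: string[];
}

interface PendingCall {
  id: string;
  name: string;
}

function partitionByApproval(
  calls: PendingCall[],
  tools: Map<string, GatedTool>
): { runNow: PendingCall[]; needsApproval: PendingCall[] } {
  const runNow: PendingCall[] = [];
  const needsApproval: PendingCall[] = [];

  for (const call of calls) {
    // Unknown tools have no approvers and fall through to runNow,
    // where the registry is expected to reject them as not found
    const approvers = tools.get(call.name)?.requiredApprovals ?? [];
    if (approvers.length > 0) {
      needsApproval.push(call);
    } else {
      runNow.push(call);
    }
  }

  return { runNow, needsApproval };
}

const gatedTools = new Map<string, GatedTool>([
  ['fetch_document', { name: 'fetch_document' }],
  ['send_email', { name: 'send_email', requiredApprovals: ['manager'] }],
]);

const routed = partitionByApproval(
  [
    { id: 'call_1', name: 'fetch_document' },
    { id: 'call_2', name: 'send_email' },
  ],
  gatedTools
);
console.log(routed.runNow.map(c => c.id), routed.needsApproval.map(c => c.id));
// ['call_1'] ['call_2']
```

For each call in `needsApproval`, the agent would call `requestApproval`, set its status to `awaiting_approval`, persist its state, and resume only once `isApproved` returns true.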

Agent State Persistence

Save agent state to resume after crashes or long pauses.

interface AgentStateSnapshot {
  id: string;
  state: AgentState;
  savedAt: Date;
}

class AgentStateStore {
  // MongoDB-style database handle, injected by the caller
  constructor(private db: any) {}

  async saveState(state: AgentState): Promise<void> {
    const snapshot: AgentStateSnapshot = {
      id: state.id,
      state,
      savedAt: new Date(),
    };

    await this.db.collection('agent_states').updateOne(
      { id: state.id },
      { $set: snapshot },
      { upsert: true }
    );
  }

  async loadState(agentId: string): Promise<AgentState | null> {
    const snapshot = await this.db.collection('agent_states').findOne({ id: agentId });
    return snapshot?.state || null;
  }

  async deleteState(agentId: string): Promise<void> {
    await this.db.collection('agent_states').deleteOne({ id: agentId });
  }

  async listActiveAgents(): Promise<AgentState[]> {
    const snapshots = await this.db
      .collection('agent_states')
      .find({ 'state.status': 'running' })
      .toArray();

    return snapshots.map((s: any) => s.state);
  }
}

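// Usage: Resume interrupted agent (assumes `db` and `toolRegistry` are already initialized)
const stateStore = new AgentStateStore(db);
let agentState = await stateStore.loadState('agent_123');

if (!agentState) {
  const goal = 'Analyze and summarize user activity';
  agentState = {
    id: 'agent_123',
    userId: 'user_456',
    goal,
    // Seed the conversation with the goal; an empty message list gives the model nothing to act on
    messages: [{ role: 'user', content: goal }],
    toolCalls: [],
    iterations: 0,
    maxIterations: 10,
    status: 'running',
    createdAt: new Date(),
    lastActivityAt: new Date(),
  };
}

const executor = new AgentExecutor(toolRegistry);
agentState = await executor.runAgent(agentState);

// Persist the final state. This single save does not protect against a crash mid-run;
// for that, state must also be saved after each iteration.
await stateStore.saveState(agentState);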
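
The usage example saves state once, after `runAgent` returns, so a crash in the middle of a run loses all progress. A minimal sketch of per-iteration checkpointing follows. It assumes the agent loop has been factored into a single-iteration `step` function, which the executor above does not expose; `runWithCheckpoints`, `Checkpointable`, and the in-memory `saved` array are all hypothetical stand-ins.

```typescript
// Hypothetical driver: runs one step at a time and saves after each one,
// so a crash loses at most the step that was in flight.
interface Checkpointable {
  id: string;
  iterations: number;
  status: string;
}

async function runWithCheckpoints<S extends Checkpointable>(
  initial: S,
  step: (state: S) => Promise<S>,
  save: (state: S) => Promise<void>,
  maxSteps: number
): Promise<S> {
  let state = initial;
  for (let i = 0; i < maxSteps && state.status === 'running'; i++) {
    state = await step(state);
    await save(state); // checkpoint before starting the next step
  }
  return state;
}

// In-memory stand-in for the state store
const saved: Checkpointable[] = [];

// Toy step: finishes after the third iteration
const demo = runWithCheckpoints<Checkpointable>(
  { id: 'agent_123', iterations: 0, status: 'running' },
  async s => ({
    ...s,
    iterations: s.iterations + 1,
    status: s.iterations + 1 >= 3 ? 'completed' : 'running',
  }),
  async s => {
    saved.push({ ...s });
  },
  10
);

demo.then(final => console.log(final.status, saved.length)); // completed 3
```

On restart, loading the latest snapshot and passing it back in as `initial` resumes from the last completed step. Any tool with side effects that ran during the lost step may run again, so such tools should be idempotent.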

Error Recovery Strategies

Gracefully handle tool failures and agent crashes.

interface RetryStrategy {
  maxRetries: number;
  backoffMs: number;
  backoffMultiplier: number;
  retryableErrors: string[];
}

class ErrorRecovery {
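  async retryToolCall(
    toolCall: ToolCall,
    executor: (tc: ToolCall) => Promise<ToolCall>,
    strategy: RetryStrategy
  ): Promise<ToolCall> {
    let lastError = new Error('Tool call was not attempted');

    for (let attempt = 0; attempt <= strategy.maxRetries; attempt++) {
      try {
        const result = await executor(toolCall);
        // ToolRegistry.executeTool reports failure through status rather than by throwing,
        // so convert a failed result into an error to reuse the retry path below
        if (result.status !== 'failed') return result;
        throw new Error(result.error ?? 'Tool call failed');
      } catch (error) {
        lastError = error instanceof Error ? error : new Error(String(error));

        // Retry only errors whose message contains one of the configured substrings
        const shouldRetry = strategy.retryableErrors.some(err =>
          lastError.message.includes(err)
        );

        if (!shouldRetry || attempt === strategy.maxRetries) {
          throw lastError;
        }

        // Exponential backoff: backoffMs, then backoffMs * multiplier, and so on
        const delayMs = strategy.backoffMs * Math.pow(strategy.backoffMultiplier, attempt);
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }

    throw lastError;
  }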

  async recoverAgentState(state: AgentState, error: Error): Promise<AgentState> {
    const errorMessage = error.message;

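    // Categorize error by message substring (brittle; prefer structured error codes where available)
    if (errorMessage.includes('rate_limit')) {
      // AgentState has no dedicated paused status, so the agent is parked under
      // 'awaiting_approval'; the caller should resume it after a delay
      state.status = 'awaiting_approval';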
    } else if (errorMessage.includes('timeout')) {
      // Retry last operation
      state.iterations = Math.max(0, state.iterations - 1);
      state.status = 'running';
    } else if (errorMessage.includes('invalid_request')) {
      state.status = 'failed'; // Cannot recover
    } else {
      state.status = 'running'; // Try again
    }

    return state;
  }
}
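
To make the backoff formula in `retryToolCall` concrete, the sketch below computes the full delay schedule for a given strategy. The `maxDelayMs` cap is an added assumption (it is not part of `RetryStrategy` above), included because uncapped exponential growth can produce impractically long waits. `backoffDelays` and `BackoffConfig` are hypothetical names.

```typescript
// Hypothetical helper: lists the wait before each retry using
// delay = backoffMs * backoffMultiplier^attempt, optionally capped.
interface BackoffConfig {
  maxRetries: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxDelayMs?: number; // assumption: not present in the RetryStrategy interface above
}

function backoffDelays(config: BackoffConfig): number[] {
  const delays: number[] = [];
  const cap = config.maxDelayMs ?? Infinity;

  // maxRetries retries means maxRetries waits (one before each retry)
  for (let attempt = 0; attempt < config.maxRetries; attempt++) {
    const raw = config.backoffMs * Math.pow(config.backoffMultiplier, attempt);
    delays.push(Math.min(raw, cap));
  }

  return delays;
}

const delays = backoffDelays({
  maxRetries: 4,
  backoffMs: 500,
  backoffMultiplier: 2,
  maxDelayMs: 3000,
});
console.log(delays); // [500, 1000, 2000, 3000]

const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);
console.log(totalWaitMs); // 6500
```

Comparing the total wait against the agent's overall timeout is a quick sanity check: a retry schedule longer than the remaining budget cannot complete.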

AI Agent Implementation Checklist

Conclusion

Production AI agents require careful architecture: clear tool definitions, bounded loops with timeouts, parallel execution for efficiency, human checkpoints for safety, state persistence for reliability, and error handling that separates retryable failures from fatal ones. Together, these patterns let an agent act autonomously while keeping its failures contained and recoverable.