AI Agents in Backend Systems — Building Reliable Tool-Calling Architectures
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
AI agents go beyond single LLM calls. They loop, call tools, make decisions, and persist state. Building reliable agent systems requires managing tool calling patterns, timeouts, parallel execution, human approvals, state management, and recovery from failures. This post covers production patterns for autonomous AI agents.
- Tool/Function Calling Patterns
- Agent Loop with Timeout
- Parallel Tool Execution
- Human-in-the-Loop Checkpoints
- Agent State Persistence
- Error Recovery Strategies
- AI Agent Implementation Checklist
- Conclusion
Tool/Function Calling Patterns
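LLMs do not run functions themselves: they emit a structured request naming a tool and its arguments, and the backend decides whether and how to execute it. Define tools carefully so the agent has enough autonomy to be useful while approvals, rate limits, and timeouts limit the damage a bad call can do.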
interface Tool {
name: string;
description: string;
parameters: Record<string, any>;
requiredApprovals?: string[];
rateLimitPerMinute?: number;
timeout?: number;
}
interface ToolCall {
id: string;
name: string;
arguments: Record<string, any>;
status: 'pending' | 'approved' | 'executing' | 'completed' | 'failed';
result?: any;
error?: string;
executedAt?: Date;
approvedBy?: string;
}
class ToolRegistry {
private tools: Map<string, Tool> = new Map();
private toolExecutors: Map<string, (args: any) => Promise<any>> = new Map();
private toolCallCounts: Map<string, number[]> = new Map();
registerTool(tool: Tool, executor: (args: any) => Promise<any>): void {
this.tools.set(tool.name, tool);
this.toolExecutors.set(tool.name, executor);
this.toolCallCounts.set(tool.name, []);
}
getToolDefinitionsForLLM(): Array<{ type: 'function'; function: Tool }> {
return Array.from(this.tools.values()).map(tool => ({
type: 'function' as const,
function: {
name: tool.name,
description: tool.description,
parameters: tool.parameters,
},
}));
}
async executeTool(toolCall: ToolCall, options?: { timeout?: number }): Promise<ToolCall> {
const tool = this.tools.get(toolCall.name);
if (!tool) {
toolCall.status = 'failed';
toolCall.error = `Tool ${toolCall.name} not found`;
return toolCall;
}
// Check rate limits
if (tool.rateLimitPerMinute) {
const now = Date.now();
const counts = this.toolCallCounts.get(toolCall.name) || [];
const oneMinuteAgo = now - 60000;
const recentCalls = counts.filter(t => t > oneMinuteAgo);
if (recentCalls.length >= tool.rateLimitPerMinute) {
toolCall.status = 'failed';
toolCall.error = `Rate limit exceeded for ${toolCall.name}`;
return toolCall;
}
recentCalls.push(now);
this.toolCallCounts.set(toolCall.name, recentCalls);
}
const executor = this.toolExecutors.get(toolCall.name);
if (!executor) {
toolCall.status = 'failed';
toolCall.error = `No executor for ${toolCall.name}`;
return toolCall;
}
toolCall.status = 'executing';
let timer: ReturnType<typeof setTimeout> | undefined;
try {
const timeoutMs = options?.timeout ?? tool.timeout ?? 30000;
const result = await Promise.race([
executor(toolCall.arguments),
new Promise<never>((_, reject) => {
timer = setTimeout(() => reject(new Error(`Tool execution timeout: ${timeoutMs}ms`)), timeoutMs);
}),
]);
toolCall.status = 'completed';
toolCall.result = result;
toolCall.executedAt = new Date();
} catch (error) {
toolCall.status = 'failed';
toolCall.error = error instanceof Error ? error.message : String(error);
toolCall.executedAt = new Date();
} finally {
// Clear the timer so a finished call does not leave a pending timeout behind
clearTimeout(timer);
}
return toolCall;
}
}
// Tool definitions
const tools: Tool[] = [
{
name: 'fetch_document',
description: 'Fetch a document by ID and return its contents',
parameters: {
type: 'object' as const,
properties: {
documentId: { type: 'string', description: 'The document ID' },
},
required: ['documentId'],
},
rateLimitPerMinute: 60,
timeout: 10000,
},
{
name: 'update_user_profile',
description: 'Update user profile (requires approval)',
parameters: {
type: 'object' as const,
properties: {
userId: { type: 'string' },
updates: { type: 'object' },
},
required: ['userId', 'updates'],
},
requiredApprovals: ['admin'],
rateLimitPerMinute: 10,
timeout: 5000,
},
{
name: 'send_email',
description: 'Send an email (requires approval)',
parameters: {
type: 'object' as const,
properties: {
to: { type: 'string' },
subject: { type: 'string' },
body: { type: 'string' },
},
required: ['to', 'subject', 'body'],
},
requiredApprovals: ['manager'],
rateLimitPerMinute: 5,
timeout: 15000,
},
];
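The model produces tool arguments as free-form JSON, so it can help to check them against the declared schema before running anything. The following is a minimal sketch, assuming schemas shaped like the `parameters` objects above; `ToolSchema` and `validateToolArguments` are hypothetical names, and a production system would more likely use a full JSON Schema validator.

```typescript
// Hypothetical minimal schema shape, mirroring the `parameters` objects above.
interface ToolSchema {
  type: 'object';
  properties: Record<string, { type: string; description?: string }>;
  required?: string[];
}

// Returns a list of human-readable problems; an empty list means the arguments pass.
function validateToolArguments(schema: ToolSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];

  // Every required key must be present.
  for (const key of schema.required ?? []) {
    if (args[key] === undefined) problems.push(`Missing required argument: ${key}`);
  }

  // Every supplied key must be declared and have the declared primitive type.
  for (const [key, value] of Object.entries(args)) {
    const spec = schema.properties[key];
    if (!spec) {
      problems.push(`Unknown argument: ${key}`);
      continue;
    }
    const actual = Array.isArray(value) ? 'array' : value === null ? 'null' : typeof value;
    if (actual !== spec.type) {
      problems.push(`Argument ${key} should be ${spec.type}, got ${actual}`);
    }
  }
  return problems;
}

const sendEmailSchema: ToolSchema = {
  type: 'object',
  properties: {
    to: { type: 'string' },
    subject: { type: 'string' },
    body: { type: 'string' },
  },
  required: ['to', 'subject', 'body'],
};

const argumentProblems = validateToolArguments(sendEmailSchema, { to: 'a@example.com', subject: 42 });
console.log(argumentProblems);
// [ 'Missing required argument: body', 'Argument subject should be string, got number' ]
```

Returning the problems as strings rather than throwing makes it easy to feed them back to the model as a failed tool result, so it can correct the call on the next iteration.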
Agent Loop with Timeout
Agents run iteratively: think → call tools → observe → repeat.
interface AgentState {
id: string;
userId: string;
goal: string;
messages: Array<{
role: 'user' | 'assistant' | 'system' | 'tool';
content: string;
tool_calls?: any[]; // Present on assistant messages that request tools
tool_call_id?: string; // Present on tool messages; links a result to its request
}>;
toolCalls: ToolCall[];
iterations: number;
maxIterations: number;
status: 'running' | 'completed' | 'failed' | 'timeout' | 'awaiting_approval';
createdAt: Date;
lastActivityAt: Date;
}
class AgentExecutor {
private toolRegistry: ToolRegistry;
private openai: any; // OpenAI client, injected by the caller
constructor(toolRegistry: ToolRegistry, openai?: any) {
this.toolRegistry = toolRegistry;
this.openai = openai;
}
async runAgent(state: AgentState, timeout: number = 300000): Promise<AgentState> {
const startTime = Date.now();
state.status = 'running';
try {
while (state.iterations < state.maxIterations) {
if (Date.now() - startTime > timeout) {
state.status = 'timeout';
state.messages.push({
role: 'system',
content: `Agent timeout after ${timeout}ms`,
});
break;
}
state.lastActivityAt = new Date();
// 1. Call LLM to decide next action
const llmResponse = await this.openai.chat.completions.create({
model: 'gpt-4-turbo-preview',
messages: state.messages,
tools: this.toolRegistry.getToolDefinitionsForLLM(),
tool_choice: 'auto',
});
const assistantMessage = llmResponse.choices[0].message;
// Keep tool_calls on the stored assistant message: each tool result must follow the request it answers
state.messages.push({
role: 'assistant',
content: assistantMessage.content || '',
...(assistantMessage.tool_calls ? { tool_calls: assistantMessage.tool_calls } : {}),
});
// 2. Parse tool calls
const toolCalls: ToolCall[] = [];
if (assistantMessage.tool_calls) {
for (const call of assistantMessage.tool_calls) {
toolCalls.push({
id: call.id,
name: call.function.name,
arguments: JSON.parse(call.function.arguments || '{}'),
status: 'pending',
});
}
}
// 3. Execute tools in parallel
const executedCalls = await Promise.all(
toolCalls.map(call => this.toolRegistry.executeTool(call))
);
state.toolCalls.push(...executedCalls);
// 4. Add tool results to messages, one tool message per call
for (const toolCall of executedCalls) {
state.messages.push({
role: 'tool',
tool_call_id: toolCall.id, // Top-level field, not part of the content payload
content: JSON.stringify(
toolCall.status === 'completed' ? { result: toolCall.result } : { error: toolCall.error }
),
});
}
// 5. Check if agent is done
if (!assistantMessage.tool_calls || assistantMessage.tool_calls.length === 0) {
state.status = 'completed';
break;
}
state.iterations++;
}
if (state.iterations >= state.maxIterations && state.status === 'running') {
state.status = 'failed';
state.messages.push({
role: 'system',
content: `Max iterations (${state.maxIterations}) reached`,
});
}
} catch (error) {
state.status = 'failed';
state.messages.push({
role: 'system',
content: `Agent error: ${error instanceof Error ? error.message : String(error)}`,
});
}
return state;
}
}
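The loop above checks elapsed time only between iterations, so a single slow model call can still run past the overall budget. One possible mitigation, sketched here under the assumption that giving up on the awaited promise is acceptable, is to bound each await by the time remaining. `remainingBudgetMs` and `withDeadline` are hypothetical helpers, not part of any library.

```typescript
// Milliseconds left in the overall budget, never negative.
function remainingBudgetMs(startTime: number, timeoutMs: number, now: number = Date.now()): number {
  return Math.max(0, timeoutMs - (now - startTime));
}

// Rejects if `work` has not settled within `ms`. The timer is always cleared,
// but the underlying work is not cancelled; it is only no longer awaited.
async function withDeadline<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Sketch of use inside the loop:
//   const budget = remainingBudgetMs(startTime, timeout);
//   const llmResponse = await withDeadline(callModel(), budget, 'LLM call');
// where `callModel` stands in for the chat completion request.
```

Because the rejection message contains the label and the limit, the surrounding `catch` can record which step ran out of time.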
Parallel Tool Execution
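Execute independent tools concurrently so that total latency approaches that of the slowest call rather than the sum of all calls. Tools that depend on another tool's output wait until that tool has finished.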
interface ToolDependency {
tool: Tool;
dependsOn?: string[]; // Names of tools that must complete first
}
class ParallelToolExecutor {
private toolRegistry: ToolRegistry;
constructor(toolRegistry: ToolRegistry) {
this.toolRegistry = toolRegistry;
}
async executeToolsInParallel(toolCalls: ToolCall[], dependencies?: ToolDependency[]): Promise<ToolCall[]> {
const completed: Map<string, ToolCall> = new Map();
const queue = new Map(toolCalls.map(tc => [tc.id, tc]));
while (queue.size > 0) {
const readyToExecute: ToolCall[] = [];
for (const [id, toolCall] of queue) {
const toolDep = dependencies?.find(d => d.tool.name === toolCall.name);
// Check if dependencies are satisfied
const dependenciesMet = !toolDep?.dependsOn || toolDep.dependsOn.every(dep =>
Array.from(completed.values()).some(tc => tc.name === dep)
);
if (dependenciesMet) {
readyToExecute.push(toolCall);
queue.delete(id);
}
}
if (readyToExecute.length === 0 && queue.size > 0) {
// Either a cycle, or a dependency on a tool that is not part of this batch
throw new Error('Unresolvable or circular dependency detected in tool calls');
}
// Execute ready tools in parallel
const results = await Promise.all(readyToExecute.map(tc => this.toolRegistry.executeTool(tc)));
for (const result of results) {
completed.set(result.id, result);
}
}
return Array.from(completed.values());
}
}
// Example: Fetch user data + fetch documents + fetch analytics in parallel
const parallelToolCalls: ToolCall[] = [
{
id: '1',
name: 'fetch_user',
arguments: { userId: 'user123' },
status: 'pending',
},
{
id: '2',
name: 'fetch_documents',
arguments: { userId: 'user123' },
status: 'pending',
},
{
id: '3',
name: 'fetch_analytics',
arguments: { userId: 'user123' },
status: 'pending',
},
];
const parallelExecutor = new ParallelToolExecutor(toolRegistry);
const results = await parallelExecutor.executeToolsInParallel(parallelToolCalls);
// All three tools execute concurrently
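The dependency check effectively groups calls into waves: everything in a wave can run concurrently, and each wave waits for the previous one. A minimal sketch of that planning step as a pure function follows, which makes the ordering easy to test without executing any tools. `planWaves`, `DependencyGraph`, and the `summarize` tool are hypothetical names used for illustration.

```typescript
// Maps each tool name to the tool names that must finish before it starts.
type DependencyGraph = Record<string, string[]>;

// Groups tool names into waves: every tool in a wave depends only on earlier waves.
function planWaves(toolNames: string[], graph: DependencyGraph): string[][] {
  const pending = new Set(toolNames);
  const done = new Set<string>();
  const waves: string[][] = [];

  while (pending.size > 0) {
    // A tool is ready when all of its dependencies have already completed.
    const wave = Array.from(pending).filter(name =>
      (graph[name] ?? []).every(dep => done.has(dep))
    );
    if (wave.length === 0) {
      // Nothing can make progress: a cycle, or a dependency outside this batch.
      throw new Error(`Unresolvable dependencies among: ${Array.from(pending).join(', ')}`);
    }
    for (const name of wave) {
      pending.delete(name);
      done.add(name);
    }
    waves.push(wave);
  }
  return waves;
}

const plannedWaves = planWaves(
  ['fetch_user', 'fetch_documents', 'summarize'],
  { summarize: ['fetch_user', 'fetch_documents'] }
);
console.log(plannedWaves);
// [ [ 'fetch_user', 'fetch_documents' ], [ 'summarize' ] ]
```

Planning first and executing second also means an unresolvable batch is rejected before any tool with side effects has run.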
Human-in-the-Loop Checkpoints
Agents make mistakes. Require approval for critical actions.
interface ApprovalRequest {
id: string;
agentId: string;
toolCall: ToolCall;
requiredApprovers: string[];
approvals: Map<string, { approver: string; approvedAt: Date; reason?: string }>;
rejectedAt?: Date;
rejectionReason?: string;
expiresAt: Date;
}
class ApprovalManager {
private requests: Map<string, ApprovalRequest> = new Map();
// The notification service is injected; it only needs a send() method
constructor(private notificationService: any) {}
async requestApproval(
agentId: string,
toolCall: ToolCall,
requiredApprovers: string[],
expiryMinutes: number = 30
): Promise<ApprovalRequest> {
const request: ApprovalRequest = {
id: `approval_${Date.now()}`,
agentId,
toolCall,
requiredApprovers,
approvals: new Map(),
expiresAt: new Date(Date.now() + expiryMinutes * 60000),
};
this.requests.set(request.id, request);
// Notify approvers
for (const approver of requiredApprovers) {
await this.notificationService.send({
to: approver,
type: 'approval_request',
data: {
requestId: request.id,
toolName: toolCall.name,
arguments: toolCall.arguments,
},
});
}
return request;
}
async approve(
requestId: string,
approver: string,
reason?: string
): Promise<{ approved: boolean; allApprovalsReceived: boolean }> {
const request = this.requests.get(requestId);
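if (!request) throw new Error(`Approval request ${requestId} not found`);
// A rejected or expired request must not collect further approvals
if (request.rejectedAt) throw new Error(`Approval request ${requestId} was already rejected`);
if (new Date() > request.expiresAt) throw new Error(`Approval request ${requestId} has expired`);
if (!request.requiredApprovers.includes(approver)) {
throw new Error(`${approver} is not an authorized approver`);
}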
request.approvals.set(approver, { approver, approvedAt: new Date(), reason });
const allApprovalsReceived = request.requiredApprovers.every(
approver => request.approvals.has(approver)
);
return { approved: true, allApprovalsReceived };
}
async reject(requestId: string, reason: string): Promise<void> {
const request = this.requests.get(requestId);
if (!request) throw new Error(`Approval request ${requestId} not found`);
request.rejectedAt = new Date();
request.rejectionReason = reason;
}
isApproved(requestId: string): boolean {
const request = this.requests.get(requestId);
if (!request) return false;
if (request.rejectedAt) return false;
if (new Date() > request.expiresAt) return false;
return request.requiredApprovers.every(approver => request.approvals.has(approver));
}
}
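Note that `executeTool` in the registry shown earlier never consults `requiredApprovals`, so the approval gate has to sit in front of execution. One way to do that, sketched here with hypothetical names (`GatedTool`, `PendingCall`, `partitionByApproval`), is to split each batch of calls into those that may run immediately and those that must wait for an approval request to be granted.

```typescript
// Minimal local shapes so the sketch is self-contained.
interface GatedTool {
  name: string;
  requiredApprovals?: string[];
}
interface PendingCall {
  id: string;
  name: string;
}
interface ApprovalSplit {
  runNow: PendingCall[];
  needsApproval: PendingCall[];
}

// Calls to tools with at least one required approver are held back.
// Calls to unknown tools fall into runNow, where the registry rejects them as not found.
function partitionByApproval(calls: PendingCall[], tools: Map<string, GatedTool>): ApprovalSplit {
  const split: ApprovalSplit = { runNow: [], needsApproval: [] };
  for (const call of calls) {
    const approvers = tools.get(call.name)?.requiredApprovals ?? [];
    if (approvers.length > 0) split.needsApproval.push(call);
    else split.runNow.push(call);
  }
  return split;
}

const gatedTools = new Map<string, GatedTool>([
  ['fetch_document', { name: 'fetch_document' }],
  ['send_email', { name: 'send_email', requiredApprovals: ['manager'] }],
]);

const approvalSplit = partitionByApproval(
  [
    { id: 'a', name: 'fetch_document' },
    { id: 'b', name: 'send_email' },
  ],
  gatedTools
);
console.log(approvalSplit.runNow.map(c => c.id), approvalSplit.needsApproval.map(c => c.id));
// [ 'a' ] [ 'b' ]
```

In the agent loop, the held-back calls would go to `requestApproval`, the agent status would become `awaiting_approval`, and execution would resume once `isApproved` returns true.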
Agent State Persistence
Persist agent state so a run can resume after a crash or a long pause, such as waiting on a human approval.
interface AgentStateSnapshot {
id: string;
state: AgentState;
savedAt: Date;
}
class AgentStateStore {
// Database handle is injected; the calls below use a MongoDB-style collection API
constructor(private db: any) {}
async saveState(state: AgentState): Promise<void> {
const snapshot: AgentStateSnapshot = {
id: state.id,
state,
savedAt: new Date(),
};
await this.db.collection('agent_states').updateOne(
{ id: state.id },
{ $set: snapshot },
{ upsert: true }
);
}
async loadState(agentId: string): Promise<AgentState | null> {
const snapshot = await this.db.collection('agent_states').findOne({ id: agentId });
return snapshot?.state || null;
}
async deleteState(agentId: string): Promise<void> {
await this.db.collection('agent_states').deleteOne({ id: agentId });
}
async listActiveAgents(): Promise<AgentState[]> {
const snapshots = await this.db
.collection('agent_states')
.find({ 'state.status': 'running' })
.toArray();
return snapshots.map((s: any) => s.state);
}
}
// Usage: Resume interrupted agent
const stateStore = new AgentStateStore(db);
let agentState = await stateStore.loadState('agent_123');
if (!agentState) {
agentState = {
id: 'agent_123',
userId: 'user_456',
goal: 'Analyze and summarize user activity',
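// Seed the conversation with the goal; the model needs at least one message to act on
messages: [{ role: 'user', content: 'Analyze and summarize user activity' }],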
toolCalls: [],
iterations: 0,
maxIterations: 10,
status: 'running',
createdAt: new Date(),
lastActivityAt: new Date(),
};
}
const executor = new AgentExecutor(toolRegistry);
agentState = await executor.runAgent(agentState);
// Persist before returning
await stateStore.saveState(agentState);
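The usage above saves state once, after `runAgent` returns, so a crash in the middle of a run would lose every iteration since the last load. A common alternative is to checkpoint after each step. The sketch below assumes the loop can be expressed as a single-step function, which `runAgent` as written does not expose; `runWithCheckpoints`, `Checkpointable`, and `CheckpointStore` are hypothetical names.

```typescript
// The minimum a state object needs for checkpointing in this sketch.
interface Checkpointable {
  id: string;
  iterations: number;
  status: string;
}

// Anything with an async save(), for example a wrapper around a state store.
interface CheckpointStore<S> {
  save(state: S): Promise<void>;
}

// Runs `step` until the state leaves 'running' or maxSteps is reached,
// persisting the state after every step.
async function runWithCheckpoints<S extends Checkpointable>(
  state: S,
  step: (current: S) => Promise<S>,
  store: CheckpointStore<S>,
  maxSteps: number
): Promise<S> {
  let current = state;
  for (let i = 0; i < maxSteps && current.status === 'running'; i++) {
    current = await step(current);
    // A crash after this line loses at most the step that was in flight.
    await store.save(current);
  }
  return current;
}
```

The trade-off is one write per iteration. For agents whose steps are dominated by model and tool latency, that cost is usually small compared with repeating the work.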
Error Recovery Strategies
Gracefully handle tool failures and agent crashes.
interface RetryStrategy {
maxRetries: number;
backoffMs: number;
backoffMultiplier: number;
retryableErrors: string[];
}
class ErrorRecovery {
async retryToolCall(
toolCall: ToolCall,
executor: (tc: ToolCall) => Promise<ToolCall>,
strategy: RetryStrategy
): Promise<ToolCall> {
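let lastError: Error | null = null;
for (let attempt = 0; attempt <= strategy.maxRetries; attempt++) {
try {
const result = await executor(toolCall);
// ToolRegistry.executeTool reports failures through status instead of throwing
if (result.status !== 'failed') return result;
lastError = new Error(result.error ?? 'Tool call failed');
} catch (error) {
lastError = error instanceof Error ? error : new Error(String(error));
}
const message = lastError.message;
const shouldRetry = strategy.retryableErrors.some(err => message.includes(err));
if (!shouldRetry || attempt === strategy.maxRetries) {
throw lastError;
}
// Exponential backoff: backoffMs, then backoffMs * multiplier, and so on
const delayMs = strategy.backoffMs * Math.pow(strategy.backoffMultiplier, attempt);
await new Promise(resolve => setTimeout(resolve, delayMs));
}
throw lastError ?? new Error('Retry loop exited without a result');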
}
async recoverAgentState(state: AgentState, error: Error): Promise<AgentState> {
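// Compare case-insensitively so both API error codes and messages such as "Rate limit exceeded" match
const errorMessage = error.message.toLowerCase();
// Categorize error
if (errorMessage.includes('rate_limit') || errorMessage.includes('rate limit')) {
// AgentState has no dedicated paused status, so 'awaiting_approval' doubles as "wait before retrying"
state.status = 'awaiting_approval';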
} else if (errorMessage.includes('timeout')) {
// Retry last operation
state.iterations = Math.max(0, state.iterations - 1);
state.status = 'running';
} else if (errorMessage.includes('invalid_request')) {
state.status = 'failed'; // Cannot recover
} else {
state.status = 'running'; // Try again
}
return state;
}
}
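The retry helper computes a purely exponential delay. When many agents hit the same failing dependency at once, identical delays make them retry in lockstep, so adding a cap and random jitter is a common refinement. The sketch below uses the "full jitter" variant; `backoffDelayMs` and its parameters are hypothetical, and the random source is injectable so the schedule can be tested deterministically.

```typescript
// Delay before retry number `attempt` (0-based), capped at maxDelayMs.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  multiplier: number,
  maxDelayMs: number,
  random: () => number = Math.random
): number {
  // Exponential growth, limited so late attempts do not wait unreasonably long.
  const exponential = Math.min(maxDelayMs, baseMs * Math.pow(multiplier, attempt));
  // Full jitter: pick uniformly in [0, exponential) so concurrent retries spread out.
  return Math.floor(random() * exponential);
}

// With a fixed random value of 0.5 the schedule is deterministic.
const demoSchedule = [0, 1, 2, 3, 4].map(attempt =>
  backoffDelayMs(attempt, 1000, 2, 8000, () => 0.5)
);
console.log(demoSchedule);
// [ 500, 1000, 2000, 4000, 4000 ]
```

The last two entries are equal because the uncapped delay for attempt 4 (16000ms) is limited to 8000ms before jitter is applied.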
AI Agent Implementation Checklist
- Define tools with clear descriptions, parameters, and timeouts
- Implement agent loop with iteration limits and timeout
- Enable parallel tool execution for independent tasks
- Require human approval for sensitive operations
- Persist agent state to enable resumption
- Implement error recovery and retry strategies
- Monitor tool success rates and latencies
- Track agent iterations and token usage
- Add observability for tool calls (what, when, why)
- Test agent behavior under failure conditions
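The monitoring items in the checklist can start very small. As an illustration, the sketch below keeps per-tool counters in memory and derives success rate and average latency from them. `ToolMetrics` and `ToolStats` are hypothetical names, and a real deployment would more likely export these numbers to an existing metrics system than hold them in process memory.

```typescript
interface ToolStats {
  calls: number;
  failures: number;
  totalLatencyMs: number;
}

class ToolMetrics {
  private stats = new Map<string, ToolStats>();

  // Record one finished tool call.
  record(toolName: string, succeeded: boolean, latencyMs: number): void {
    const entry = this.stats.get(toolName) ?? { calls: 0, failures: 0, totalLatencyMs: 0 };
    entry.calls += 1;
    if (!succeeded) entry.failures += 1;
    entry.totalLatencyMs += latencyMs;
    this.stats.set(toolName, entry);
  }

  // Success rate in [0, 1] and mean latency, or undefined if the tool was never called.
  summary(toolName: string): { successRate: number; avgLatencyMs: number } | undefined {
    const entry = this.stats.get(toolName);
    if (!entry) return undefined;
    return {
      successRate: (entry.calls - entry.failures) / entry.calls,
      avgLatencyMs: entry.totalLatencyMs / entry.calls,
    };
  }
}

const toolMetrics = new ToolMetrics();
toolMetrics.record('send_email', true, 100);
toolMetrics.record('send_email', false, 300);
console.log(toolMetrics.summary('send_email'));
// { successRate: 0.5, avgLatencyMs: 200 }
```

A natural place to call `record` is right after a tool call finishes, using the call's final status and the elapsed time measured around it.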
Conclusion
Production AI agents require careful architecture: clear tool definitions, bounded loops with timeouts, parallel execution for efficiency, human checkpoints for safety, state persistence for reliability, and sophisticated error handling. These patterns enable autonomous systems that are both powerful and trustworthy.