Plan-and-Execute — Reducing LLM Costs by 90% With Heterogeneous Agent Fleets

Introduction

Running a frontier model (GPT-4o, Claude Opus) for every step of an agent's execution is expensive. The Plan-and-Execute pattern addresses this by using a high-capability model once to create a detailed plan, then dispatching each step to a specialized, cheaper model. This post covers the pattern, the cost arithmetic, failure modes, and a production implementation.

The Cost Problem

A typical agent workflow processes a user request across multiple steps:

  1. Analyze request
  2. Search for context
  3. Synthesize findings
  4. Write response

Running Claude Opus for all four steps multiplies the per-request cost by four. With Opus at $15 per million tokens and Haiku at $0.80 per million tokens, a roughly 19x price gap, routing every step to Opus is expensive at scale.

The Plan-and-Execute pattern changes the arithmetic. For the four-step request above:

Frontier model cost: $15/1M tokens
Standard model cost: $3/1M tokens
Fast model cost: $0.80/1M tokens

Cost without Plan-and-Execute:
4 steps × (1000 tokens × $15/1M) = $0.06 per request

Cost with Plan-and-Execute:
Planning step (2000 tokens × $15/1M) = $0.03
Search step (800 tokens × $0.80/1M) = $0.00064
Synthesis step (1000 tokens × $3/1M) = $0.003
Writing step (1200 tokens × $3/1M) = $0.0036
Total = $0.0372 per request (38% savings)

At scale, the same per-request figures compound:
100,000 requests/day:
Without optimization: $6,000/day
With Plan-and-Execute: $3,720/day
Annual savings: ~$830,000
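The arithmetic above can be sanity-checked with a small helper. The per-million-token rates and token counts are the illustrative figures from this post, not live pricing:

```typescript
// Illustrative per-million-token rates from the examples in this post.
const RATES = { frontier: 15, standard: 3, fast: 0.8 } as const;

type Tier = keyof typeof RATES;

interface CostedStep {
  tier: Tier;
  tokens: number;
}

// Cost of one step: tokens × (rate per million tokens).
function stepCost(step: CostedStep): number {
  return (step.tokens * RATES[step.tier]) / 1_000_000;
}

function planCost(steps: CostedStep[]): number {
  return steps.reduce((sum, s) => sum + stepCost(s), 0);
}

// The worked example: one frontier planning call, then cheaper executors.
const withPattern = planCost([
  { tier: 'frontier', tokens: 2000 }, // planning
  { tier: 'fast', tokens: 800 }, // search
  { tier: 'standard', tokens: 1000 }, // synthesis
  { tier: 'standard', tokens: 1200 }, // writing
]);

// Baseline: every step on the frontier model.
const withoutPattern = planCost(
  Array.from({ length: 4 }, () => ({ tier: 'frontier' as Tier, tokens: 1000 }))
);

console.log(withPattern.toFixed(4), withoutPattern.toFixed(4)); // 0.0372 0.0600
```

Encoding the rates in one table also means a pricing change is a one-line edit rather than a find-and-replace across prompts and dashboards.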

Step Decomposition and Executor Selection

The planner breaks down the task and specifies which executor handles each step:

import Anthropic from '@anthropic-ai/sdk';

interface ExecutionStep {
  id: string;
  type: 'search' | 'analyze' | 'code' | 'reasoning' | 'writing';
  prompt: string;
  requiredModel: 'fast' | 'standard' | 'frontier';
  timeout: number;
}

interface ExecutionPlan {
  goal: string;
  steps: ExecutionStep[];
  expectedCost: number;
}

const plannerClient = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function planExecution(userRequest: string): Promise<ExecutionPlan> {
  const response = await plannerClient.messages.create({
    model: 'claude-opus-4-1',
    max_tokens: 2048,
    system: `You are an execution planner. Break down user requests into specific steps.
For each step, specify:
- What needs to happen
- Which model should handle it (fast/standard/frontier based on complexity)
- Exact prompt for the executor

Return valid JSON.`,
    messages: [
      {
        role: 'user',
        content: `Plan execution for: ${userRequest}

Return JSON:
{
  "goal": "...",
  "steps": [
    {
      "id": "step-1",
      "type": "search|analyze|code|reasoning|writing",
      "prompt": "...",
      "requiredModel": "fast|standard|frontier",
      "timeout": 30000
    }
  ],
  "expectedCost": 0.05
}`,
      },
    ],
  });

  const content = response.content[0];
  if (content.type !== 'text') {
    throw new Error('Planner returned non-text response');
  }

  const parsed = JSON.parse(content.text);
  return parsed as ExecutionPlan;
}
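One caveat: JSON.parse throws if the planner wraps its answer in markdown fences or adds prose around the object, which models sometimes do despite instructions. A small extraction helper (a defensive assumption, not an SDK feature) makes parsing more forgiving:

```typescript
// Pull the first JSON object out of model output that may be wrapped in
// markdown fences or surrounded by prose. Throws if no object is found.
function extractJson<T>(text: string): T {
  // Prefer the contents of a fenced block if one is present.
  const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : text;

  // Fall back to the outermost { ... } span.
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  return JSON.parse(candidate.slice(start, end + 1)) as T;
}
```

The final lines of planExecution can then call extractJson<ExecutionPlan>(content.text) instead of a bare JSON.parse; the same helper applies to the replanning response later in this post.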

Executor Selection by Task Type

Each executor is optimized for its task type:

type ModelChoice = 'haiku' | 'sonnet' | 'opus';

function selectModelForType(stepType: string): ModelChoice {
  switch (stepType) {
    case 'search':
      return 'haiku'; // Fast retrieval, minimal reasoning
    case 'analyze':
      return 'sonnet'; // Moderate reasoning required
    case 'code':
      return 'opus'; // Complex syntax, edge cases
    case 'reasoning':
      return 'opus'; // Deep thought required
    case 'writing':
      return 'sonnet'; // Fluent but not frontier-level
    default:
      return 'sonnet';
  }
}

const modelConfig: Record<
  ModelChoice,
  { id: string; costPer1mTokens: number; speed: number }
> = {
  haiku: {
    id: 'claude-3-5-haiku-20241022',
    costPer1mTokens: 0.8,
    speed: 800, // ms estimate
  },
  sonnet: {
    id: 'claude-3-5-sonnet-20241022',
    costPer1mTokens: 3,
    speed: 1500,
  },
  opus: {
    id: 'claude-opus-4-1',
    costPer1mTokens: 15,
    speed: 2500,
  },
};

async function executeStep(
  step: ExecutionStep
): Promise<{ output: string; cost: number; tokensUsed: number }> {
  const modelChoice = selectModelForType(step.type);
  const model = modelConfig[modelChoice];

  // Constructed here so the example is self-contained; in production,
  // reuse a single shared client instead of creating one per step.
  const client = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
  });

  const response = await client.messages.create({
    model: model.id,
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: step.prompt,
      },
    ],
  });

  const text =
    response.content[0].type === 'text' ? response.content[0].text : '';
  const tokensUsed =
    response.usage.input_tokens + response.usage.output_tokens;
  // costPer1mTokens is a per-million-token rate.
  const cost = (tokensUsed * model.costPer1mTokens) / 1_000_000;

  return { output: text, cost, tokensUsed };
}

Handling Replanning on Failure

When an executor fails, the planner adjusts the strategy:

interface ExecutionResult {
  stepId: string;
  status: 'success' | 'failure';
  output?: string;
  error?: string;
  cost: number;
}

async function executeWithReplanning(
  plan: ExecutionPlan,
  maxRetries: number = 2
): Promise<{ outputs: Record<string, string>; totalCost: number }> {
  const outputs: Record<string, string> = {};
  let totalCost = 0;
  let currentPlan = plan;

  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    let allStepsSucceeded = true;

    for (const step of currentPlan.steps) {
      try {
        const result = await executeStep(step);
        outputs[step.id] = result.output;
        totalCost += result.cost;
      } catch (error) {
        console.error(
          `Step ${step.id} failed: ${(error as Error).message}`
        );
        allStepsSucceeded = false;

        if (retryCount < maxRetries) {
          console.log(
            `Replanning after step ${step.id} failure (attempt ${retryCount + 1})`
          );

          // Ask planner to adjust strategy
          const replanResponse = await plannerClient.messages.create({
            model: 'claude-opus-4-1',
            max_tokens: 2048,
            messages: [
              {
                role: 'user',
                content: `Step ${step.id} failed with error: ${(error as Error).message}
Current partial outputs: ${JSON.stringify(Object.entries(outputs).slice(0, 2))}

Replan the remaining steps with a different approach. Use stronger models if needed.`,
              },
            ],
          });

          const replanText =
            replanResponse.content[0].type === 'text'
              ? replanResponse.content[0].text
              : '';

          // Note: validate the replanned JSON before trusting it in production.
          currentPlan = JSON.parse(replanText);
          break; // Retry with new plan
        } else {
          throw error;
        }
      }
    }

    if (allStepsSucceeded) {
      break;
    }
  }

  return { outputs, totalCost };
}
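The replanning prompt above asks the planner to "use stronger models if needed", but nothing enforces that. A deterministic fallback (the helper name and escalation order are assumptions, not part of the code above) is to bump each remaining step up one capability tier before retrying:

```typescript
type ModelTier = 'fast' | 'standard' | 'frontier';

interface RetryableStep {
  id: string;
  requiredModel: ModelTier;
}

// Escalation order: each retry moves a step up one capability tier,
// capped at the frontier model.
const ESCALATION: Record<ModelTier, ModelTier> = {
  fast: 'standard',
  standard: 'frontier',
  frontier: 'frontier',
};

function escalateSteps(steps: RetryableStep[]): RetryableStep[] {
  return steps.map((step) => ({
    ...step,
    requiredModel: ESCALATION[step.requiredModel],
  }));
}
```

Mechanical escalation also covers the case where the replanner itself returns malformed JSON: the original plan can be retried at a higher tier without a second planning call.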

Real-World Cost Calculation

Track cost per request for analytics:

interface RequestMetrics {
  requestId: string;
  userInput: string;
  planningCost: number;
  executionCosts: Map<string, number>;
  totalCost: number;
  planningTime: number;
  executionTime: number;
  totalTime: number;
}

async function processRequestWithMetrics(
  userRequest: string
): Promise<{ result: string; metrics: RequestMetrics }> {
  const requestId = crypto.randomUUID();
  const startTime = Date.now();

  // Plan
  const planStart = Date.now();
  const plan = await planExecution(userRequest);
  const planningTime = Date.now() - planStart;

  // Execute
  const execStart = Date.now();
  const executionCosts = new Map<string, number>();
  const outputs: Record<string, string> = {};

  for (const step of plan.steps) {
    try {
      const result = await executeStep(step);
      outputs[step.id] = result.output;
      executionCosts.set(step.id, result.cost);
    } catch (error) {
      console.error(`Step ${step.id} failed:`, error);
      throw error;
    }
  }

  const executionTime = Date.now() - execStart;
  const totalTime = Date.now() - startTime;

  // Estimated; compute from the planner response's usage field for exact figures.
  const planningCost = 0.03;
  let executionCostTotal = 0;
  executionCosts.forEach((cost) => {
    executionCostTotal += cost;
  });

  const metrics: RequestMetrics = {
    requestId,
    userInput: userRequest,
    planningCost,
    executionCosts,
    totalCost: planningCost + executionCostTotal,
    planningTime,
    executionTime,
    totalTime,
  };

  return {
    result: JSON.stringify(outputs),
    metrics,
  };
}

Cost Benchmarks and Monitoring

Monitor actual vs. planned costs:

interface CostBudget {
  maxCostPerRequest: number;
  maxCostPerDay: number;
  currentDayCost: number;
  requestCount: number;
}

const budget: CostBudget = {
  maxCostPerRequest: 0.10,
  maxCostPerDay: 500,
  currentDayCost: 0,
  requestCount: 0,
};

async function checkBudgetAndProcess(
  userRequest: string
): Promise<string | null> {
  if (budget.currentDayCost > budget.maxCostPerDay) {
    console.warn('Daily budget exceeded. Queuing request.');
    return null;
  }

  const { result, metrics } = await processRequestWithMetrics(userRequest);

  if (metrics.totalCost > budget.maxCostPerRequest) {
    console.warn(
      `Request cost ${metrics.totalCost} exceeds budget ${budget.maxCostPerRequest}`
    );
  }

  budget.currentDayCost += metrics.totalCost;
  budget.requestCount += 1;

  // Log for analysis
  console.log(
    `Request ${metrics.requestId}: $${metrics.totalCost.toFixed(4)} (${metrics.totalTime}ms)`
  );

  return result;
}
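One gap in the budget object above: currentDayCost never resets, so after the first day every request would be rejected. A minimal sketch of a UTC day-keyed reset (names and structure are assumptions, not from any particular library):

```typescript
interface DailyBudget {
  maxCostPerDay: number;
  currentDayCost: number;
  day: string; // UTC date key, e.g. "2025-01-31"
}

function utcDayKey(now: Date = new Date()): string {
  return now.toISOString().slice(0, 10);
}

// Zero the running total whenever the UTC day rolls over.
function rollBudget(budget: DailyBudget, now: Date = new Date()): DailyBudget {
  const today = utcDayKey(now);
  return budget.day === today
    ? budget
    : { ...budget, day: today, currentDayCost: 0 };
}

// Record a request's cost; report whether the daily cap is now exceeded.
function recordCost(
  budget: DailyBudget,
  cost: number,
  now: Date = new Date()
): { budget: DailyBudget; overBudget: boolean } {
  const rolled = rollBudget(budget, now);
  const updated = { ...rolled, currentDayCost: rolled.currentDayCost + cost };
  return {
    budget: updated,
    overBudget: updated.currentDayCost > updated.maxCostPerDay,
  };
}
```

checkBudgetAndProcess would call rollBudget before its cap check so a stale total from a previous day doesn't block fresh requests.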

Plan Quality Validation

Ensure the planner creates valid, executable plans:

function validatePlan(plan: ExecutionPlan): boolean {
  if (!plan.steps || plan.steps.length === 0) {
    console.error('Plan has no steps');
    return false;
  }

  for (const step of plan.steps) {
    if (!step.id || !step.type || !step.prompt) {
      console.error(`Step missing required fields: ${JSON.stringify(step)}`);
      return false;
    }

    if (
      !['search', 'analyze', 'code', 'reasoning', 'writing'].includes(
        step.type
      )
    ) {
      console.error(`Invalid step type: ${step.type}`);
      return false;
    }

    if (step.prompt.length < 10) {
      console.error(
        `Step prompt too short: "${step.prompt.substring(0, 50)}"`
      );
      return false;
    }
  }

  if (plan.expectedCost <= 0 || plan.expectedCost > 1.0) {
    console.warn(`Suspicious estimated cost: $${plan.expectedCost}`);
  }

  return true;
}

Checklist

  • Understand frontier vs. standard vs. fast model trade-offs
  • Implement plan-and-execute pattern
  • Build executor selection logic by task type
  • Add cost tracking and budget enforcement
  • Implement replanning on executor failure
  • Validate plan quality before execution
  • Monitor actual vs. projected costs

Conclusion

Plan-and-Execute cut costs by 38% in the short example above; on longer workflows, where cheap execution steps dominate the single planning call, savings of 60-90% are achievable. The key insight is that planning is expensive but happens once, while execution steps are cheaper and numerous. By routing each step to the cheapest model that can handle it and replanning on failures, you can build cost-efficient AI systems that scale. Start by measuring your baseline costs, implement the pattern, and monitor the savings. As your usage grows, this approach becomes increasingly valuable.