LLM Token Economics — Counting, Budgeting, and Optimizing Your AI Costs

Sanjeev Sharma
8 min read

Introduction

LLM costs scale with tokens, not requests. A single poorly optimized prompt can cost more than a thousand efficient ones. This guide teaches you to count, budget, and optimize tokens like a production engineer manages infrastructure.

Understanding Token Counting with Tiktoken

Tokens are not words. "hello" is one token, "antidisestablishmentarianism" is multiple. Counting accurately is critical to predicting costs.

import { encodingForModel } from 'js-tiktoken';

interface TokenCountResult {
  prompt_tokens: number;
  completion_tokens: number;
  estimated_cost: number;
}

class TokenCounter {
  private encoder = encodingForModel('gpt-4o');
  // Illustrative per-1K-token rates; check your provider's current pricing page
  private pricing = {
    'gpt-4o': { input: 0.015, output: 0.06 },
    'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
    'claude-3-opus': { input: 0.015, output: 0.075 }
  };

  countPromptTokens(text: string): number {
    return this.encoder.encode(text).length;
  }

  estimateCost(
    model: string,
    promptTokens: number,
    completionTokens: number
  ): number {
    const rates = this.pricing[model as keyof typeof this.pricing];
    if (!rates) throw new Error(`Unknown model: ${model}`);

    const inputCost = (promptTokens / 1000) * rates.input;
    const outputCost = (completionTokens / 1000) * rates.output;
    return inputCost + outputCost;
  }

  analyzePrompt(prompt: string, model: string): TokenCountResult {
    const promptTokens = this.countPromptTokens(prompt);
    // Estimate completion: typically 30-50% of prompt length for questions
    const completionTokens = Math.ceil(promptTokens * 0.35);
    const estimatedCost = this.estimateCost(
      model,
      promptTokens,
      completionTokens
    );

    return { prompt_tokens: promptTokens, completion_tokens: completionTokens, estimated_cost: estimatedCost };
  }
}

// Usage
const counter = new TokenCounter();
const analysis = counter.analyzePrompt(
  'Explain quantum computing to a 10-year-old',
  'gpt-4o'
);
console.log(`Cost estimate: $${analysis.estimated_cost.toFixed(6)}`);

Context Window Management

Each model has a maximum context length. GPT-4o supports 128K tokens and Claude 3 Opus supports 200K, while older models cap out at 4K. Exceeding the limit causes hard API errors, so reserve completion space explicitly.

interface ContextWindowConfig {
  model: string;
  max_tokens: number;
  reserved_completion: number;
}

class ContextWindowManager {
  private configs: Record<string, ContextWindowConfig> = {
    'gpt-4o': { model: 'gpt-4o', max_tokens: 128000, reserved_completion: 4096 },
    'gpt-3.5-turbo': { model: 'gpt-3.5-turbo', max_tokens: 16384, reserved_completion: 2048 },
    'claude-3-opus': { model: 'claude-3-opus', max_tokens: 200000, reserved_completion: 4096 }
  };

  getAvailablePromptSpace(model: string): number {
    const config = this.configs[model];
    if (!config) throw new Error(`Unknown model: ${model}`);
    return config.max_tokens - config.reserved_completion;
  }

  willFit(model: string, promptTokens: number, estimatedCompletion: number): boolean {
    const available = this.getAvailablePromptSpace(model);
    return promptTokens + estimatedCompletion <= available;
  }

  truncateConversation(
    model: string,
    messages: Array<{ role: string; content: string }>,
    encoder: { encode(text: string): number[] }
  ): Array<{ role: string; content: string }> {
    // Target 80% of available space to leave headroom for counting drift
    const maxPrompt = Math.floor(this.getAvailablePromptSpace(model) * 0.8);
    const system = messages.filter(m => m.role === 'system');
    const rest = messages.filter(m => m.role !== 'system');

    // Always keep system messages; charge them against the budget first
    let totalTokens = system.reduce(
      (sum, m) => sum + encoder.encode(m.content).length,
      0
    );
    const kept: Array<{ role: string; content: string }> = [];

    // Walk backwards so the most recent turns survive; stop at the first
    // message that would overflow rather than skipping it and keeping
    // older ones, which would leave gaps in the conversation
    for (let i = rest.length - 1; i >= 0; i--) {
      const tokens = encoder.encode(rest[i].content).length;
      if (totalTokens + tokens > maxPrompt) break;
      kept.unshift(rest[i]);
      totalTokens += tokens;
    }

    return [...system, ...kept];
  }
}
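The keep-system-plus-most-recent policy behind truncateConversation can be exercised on its own, independent of any particular tokenizer. The sketch below uses a whitespace word count as a stand-in for real token counting (an assumption for illustration only; use tiktoken in production):

```typescript
// Standalone sketch of the truncation policy: keep system messages,
// then walk backwards keeping the most recent turns that still fit.
// countTokens is a word-count stand-in for a real tokenizer.
interface ChatMessage { role: string; content: string; }

const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

function truncateToFit(messages: ChatMessage[], budget: number): ChatMessage[] {
  const system = messages.filter(m => m.role === 'system');
  const rest = messages.filter(m => m.role !== 'system');

  let used = system.reduce((sum, m) => sum + countTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens(rest[i].content);
    if (used + cost > budget) break; // older turns are dropped
    kept.unshift(rest[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```

Dropping older turns wholesale is crude but predictable; summarizing the dropped prefix into a single message is a common refinement.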

Per-Request Cost Tracking Middleware

Track the costs you actually incur per request, not estimates: read the token usage each API response reports and log it alongside the call.

interface CostMetrics {
  model: string;
  prompt_tokens: number;
  completion_tokens: number;
  actual_cost: number;
  user_id?: string;
  timestamp: Date;
}

class CostTrackingMiddleware {
  private metrics: CostMetrics[] = [];
  private dailyBudget: Map<string, number> = new Map();
  // Illustrative per-1K-token rates; keep in sync with current provider pricing
  private pricing = {
    'gpt-4o': { input: 0.015, output: 0.06 },
    'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 }
  };

  calculateCost(
    model: string,
    promptTokens: number,
    completionTokens: number
  ): number {
    const rates = this.pricing[model as keyof typeof this.pricing];
    if (!rates) throw new Error(`Unknown model: ${model}`);
    return (promptTokens / 1000) * rates.input + (completionTokens / 1000) * rates.output;
  }

  recordRequest(
    model: string,
    promptTokens: number,
    completionTokens: number,
    userId?: string
  ): void {
    const cost = this.calculateCost(model, promptTokens, completionTokens);
    this.metrics.push({
      model,
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      actual_cost: cost,
      user_id: userId,
      timestamp: new Date()
    });

    if (userId) {
      const key = `${userId}-${new Date().toDateString()}`;
      const current = this.dailyBudget.get(key) || 0;
      this.dailyBudget.set(key, current + cost);
    }
  }

  getUserDailySpend(userId: string): number {
    const key = `${userId}-${new Date().toDateString()}`;
    return this.dailyBudget.get(key) || 0;
  }

  getDailyReport(): Record<string, number> {
    const report: Record<string, number> = {};
    this.metrics
      .filter(m => m.timestamp.toDateString() === new Date().toDateString())
      .forEach(m => {
        const key = m.user_id || 'anonymous';
        report[key] = (report[key] || 0) + m.actual_cost;
      });
    return report;
  }
}

export const costTracking = new CostTrackingMiddleware();
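To feed recordRequest actual numbers, read the token usage the provider reports on every response rather than re-estimating. A framework-agnostic sketch follows; the `usage` field shape assumed here matches the OpenAI chat completions response body, and the call is shown synchronously for brevity (a real client call would be awaited):

```typescript
// Wrap an LLM call and forward the provider-reported token usage to a
// recorder such as costTracking.recordRequest. The `usage` shape is
// assumed to match the OpenAI chat completions response.
interface Usage { prompt_tokens: number; completion_tokens: number; }

type UsageRecorder = (promptTokens: number, completionTokens: number) => void;

function withUsageTracking<T extends { usage: Usage }>(
  call: () => T,
  record: UsageRecorder
): T {
  const response = call();
  record(response.usage.prompt_tokens, response.usage.completion_tokens);
  return response;
}
```

Wire this at the single choke point where your service calls the model so no request escapes accounting.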

Token Budget Enforcement

Prevent runaway costs by enforcing hard budgets at the user and system level.

class TokenBudgetEnforcer {
  private userBudgets: Map<string, number> = new Map();
  private systemBudget: number = 1000; // $1000/day system budget
  private systemSpent: number = 0;

  setUserBudget(userId: string, dailyLimit: number): void {
    this.userBudgets.set(userId, dailyLimit);
  }

  canUseTokens(
    userId: string,
    estimatedCost: number,
    currentDaySpend: number
  ): { allowed: boolean; reason?: string } {
    const userBudget = this.userBudgets.get(userId);

    if (userBudget && currentDaySpend + estimatedCost > userBudget) {
      return { allowed: false, reason: `Budget limit exceeded for user ${userId}` };
    }

    if (this.systemSpent + estimatedCost > this.systemBudget) {
      return { allowed: false, reason: 'System budget exhausted' };
    }

    return { allowed: true };
  }

  recordSpend(amount: number): void {
    this.systemSpent += amount;
    if (this.systemSpent > this.systemBudget * 0.8) {
      console.warn(`System spend at 80% of daily budget: $${this.systemSpent}`);
    }
  }

  resetDaily(): void {
    this.systemSpent = 0;
  }
}
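resetDaily only works if something calls it. One minimal approach, sketched below with Node timers (a cron job or scheduled cloud function works equally well), is to compute the delay until the next local midnight and re-arm after each reset:

```typescript
// Delay in ms from `now` until the next local midnight, used to
// schedule the daily budget reset.
function msUntilMidnight(now: Date = new Date()): number {
  const next = new Date(now);
  next.setHours(24, 0, 0, 0); // rolls over to 00:00:00 of the next day
  return next.getTime() - now.getTime();
}

function scheduleDailyReset(reset: () => void): void {
  setTimeout(() => {
    reset();
    scheduleDailyReset(reset); // re-arm for the following midnight
  }, msUntilMidnight());
}
```

Usage would be `scheduleDailyReset(() => enforcer.resetDaily())`; note that in-process timers do not survive restarts, so persistent deployments should prefer an external scheduler.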

Prompt Compression Techniques

Compression techniques such as LLMLingua reduce token counts without losing semantic meaning, typically cutting costs by 20-40%.

interface CompressionResult {
  original_tokens: number;
  compressed_tokens: number;
  compression_ratio: number;
  compressed_text: string;
}

class PromptCompressor {
  // Simplified compression: removes redundant words and whitespace
  compressText(text: string): CompressionResult {
    const encoder = encoding_for_model('gpt-4o');
    const original = encoder.encode(text);

    // Remove common filler phrases and compress whitespace
    const compressed = text
      .replace(/\b(actually|basically|essentially|literally|really|very)\b/g, '')
      .replace(/\s+/g, ' ')
      .trim();

    const compressedTokens = encoder.encode(compressed);

    return {
      original_tokens: original.length,
      compressed_tokens: compressedTokens.length,
      compression_ratio: compressedTokens.length / original.length,
      compressed_text: compressed
    };
  }

  summarizeContext(text: string, targetTokens: number): string {
    const encoder = encoding_for_model('gpt-4o');
    const tokens = encoder.encode(text);

    if (tokens.length <= targetTokens) {
      return text;
    }

    // Simple truncation strategy
    const ratio = targetTokens / tokens.length;
    const charLimit = Math.floor(text.length * ratio);
    return text.substring(0, charLimit) + '...';
  }
}
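The hard truncation in summarizeContext can cut mid-sentence and always discards the tail. A gentler drop-the-middle strategy keeps the opening and the most recent context, which is usually where the signal lives. A sketch, again with a word count standing in for real token counting:

```typescript
// Drop-the-middle truncation: keep leading and trailing sentences,
// replace the middle with an ellipsis marker. Word count is a
// stand-in for a real tokenizer here.
const wordCount = (s: string): number => s.split(/\s+/).filter(Boolean).length;

function dropMiddle(text: string, targetTokens: number): string {
  if (wordCount(text) <= targetTokens) return text;
  const sentences = text.split(/(?<=[.!?])\s+/);

  const head: string[] = [];
  const tail: string[] = [];
  let used = 1; // reserve one "token" for the ellipsis marker
  let lo = 0;
  let hi = sentences.length - 1;

  // Alternate taking sentences from the front and the back until full
  while (lo <= hi) {
    const fromFront = head.length <= tail.length;
    const s = fromFront ? sentences[lo] : sentences[hi];
    if (used + wordCount(s) > targetTokens) break;
    used += wordCount(s);
    if (fromFront) { head.push(s); lo++; } else { tail.unshift(s); hi--; }
  }
  return [...head, '…', ...tail].join(' ');
}
```

This preserves sentence boundaries, unlike character-level truncation, at the cost of a slightly less predictable final token count.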

Streaming vs Batch Cost Difference

Streaming costs exactly the same as a standard request; it only feels faster to users. The Batch API, by contrast, offers roughly a 50% discount for non-time-sensitive work that can tolerate a turnaround of up to 24 hours.

interface CostComparison {
  model: string;
  requests: number;
  streaming_cost: number;
  batch_cost: number;
  savings: number;
}

class BatchOptimizer {
  private batchPricing = {
    'gpt-4o': { regular: 0.015, batch: 0.0075 }, // per 1K input tokens
  };

  compareCosts(
    model: string,
    totalRequests: number,
    avgPromptTokens: number,
    avgCompletionTokens: number
  ): CostComparison {
    const rates = this.batchPricing[model as keyof typeof this.batchPricing];
    if (!rates) throw new Error('Model not configured');

    const totalInputTokens = totalRequests * avgPromptTokens;
    const totalOutputTokens = totalRequests * avgCompletionTokens;

    const streamingCost = (totalInputTokens / 1000) * rates.regular +
                         (totalOutputTokens / 1000) * 0.06;  // regular output rate per 1K
    const batchCost = (totalInputTokens / 1000) * rates.batch +
                     (totalOutputTokens / 1000) * 0.03;      // batch output rate per 1K

    return {
      model,
      requests: totalRequests,
      streaming_cost: streamingCost,
      batch_cost: batchCost,
      savings: streamingCost - batchCost
    };
  }

  shouldUseBatch(
    cost: CostComparison,
    latencySensitivity: boolean
  ): boolean {
    if (latencySensitivity) return false;
    // Use batch if savings > 25% and not time-sensitive
    return (cost.savings / cost.streaming_cost) > 0.25;
  }
}
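A quick worked example makes the gap concrete. At the illustrative rates above (0.015/0.06 per 1K regular input/output tokens, 0.0075/0.03 batched), 10,000 requests averaging 500 prompt and 200 completion tokens cost $195 synchronously versus $97.50 batched:

```typescript
// Worked example at the illustrative rates used above; real rates
// change, so treat these numbers as arithmetic, not a price sheet.
function totalCost(
  requests: number,
  avgPrompt: number,
  avgCompletion: number,
  inputPer1K: number,
  outputPer1K: number
): number {
  return (requests * avgPrompt / 1000) * inputPer1K +
         (requests * avgCompletion / 1000) * outputPer1K;
}

const regular = totalCost(10_000, 500, 200, 0.015, 0.06);  // 75 + 120 = 195
const batched = totalCost(10_000, 500, 200, 0.0075, 0.03); // 37.5 + 60 = 97.5
```

Because both input and output rates halve, the saving is exactly 50% regardless of the prompt/completion mix.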

Monthly Cost Projection Formula

Forecast monthly spend based on current velocity to catch runaway costs early.

interface ProjectionMetrics {
  daily_spend: number;
  projected_monthly: number;
  daily_average_requests: number;
  avg_cost_per_request: number;
  projected_monthly_requests: number;
}

class CostProjector {
  private dailyHistory: number[] = [];
  private requestHistory: number[] = [];

  recordDay(spend: number, requests: number): void {
    this.dailyHistory.push(spend);
    this.requestHistory.push(requests);
  }

  project(): ProjectionMetrics {
    if (this.dailyHistory.length === 0) {
      return {
        daily_spend: 0,
        projected_monthly: 0,
        daily_average_requests: 0,
        avg_cost_per_request: 0,
        projected_monthly_requests: 0
      };
    }

    const avgDaily = this.dailyHistory.reduce((a, b) => a + b, 0) / this.dailyHistory.length;
    const avgRequests = this.requestHistory.reduce((a, b) => a + b, 0) / this.requestHistory.length;
    const costPerRequest = avgRequests > 0 ? avgDaily / avgRequests : 0;

    return {
      daily_spend: avgDaily,
      projected_monthly: avgDaily * 30,
      daily_average_requests: avgRequests,
      avg_cost_per_request: costPerRequest,
      projected_monthly_requests: avgRequests * 30
    };
  }

  getAlertThreshold(budget: number): number {
    const projected = this.project();
    return projected.projected_monthly > budget * 0.8 ? projected.projected_monthly : 0;
  }
}
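Stripped of the bookkeeping, the projection itself is just a rolling average scaled to a 30-day month:

```typescript
// projected_monthly = mean(daily spends) * 30
function projectMonthly(dailySpends: number[]): number {
  if (dailySpends.length === 0) return 0;
  const mean = dailySpends.reduce((a, b) => a + b, 0) / dailySpends.length;
  return mean * 30;
}
```

Compare the result against your monthly budget at the same 80% threshold getAlertThreshold uses to raise an early warning.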

Response Length Control

Cap completion tokens to avoid surprise costs from verbose models.

interface ResponseControl {
  max_tokens: number;
  enforced: boolean;
}

class ResponseLengthController {
  enforceMaxTokens(
    model: string,
    maxTokens: number
  ): ResponseControl {
    // Different models have different sweet spots
    const limits: Record<string, number> = {
      'gpt-4o': Math.min(maxTokens, 4096),
      'gpt-3.5-turbo': Math.min(maxTokens, 2048),
      'claude-3-opus': Math.min(maxTokens, 4096)
    };

    return {
      max_tokens: limits[model] || maxTokens,
      enforced: true
    };
  }

  truncateResponse(
    text: string,
    maxTokens: number,
    encoder: { encode(t: string): number[]; decode(tokens: number[]): string }
  ): string {
    const tokens = encoder.encode(text);
    if (tokens.length <= maxTokens) return text;

    // Decode back to text at token boundary
    const truncated = encoder.decode(tokens.slice(0, maxTokens));
    return truncated + '...';
  }
}

Checklist

  • Install and use js-tiktoken for accurate token counting
  • Implement per-request cost tracking middleware in your API layer
  • Set daily/monthly budgets per user and at system level
  • Use token-based rate limiting, not request-based
  • Track actual costs vs estimates to calibrate models
  • Compress prompts using LLMLingua or similar (20-40% savings)
  • Batch non-time-sensitive requests for 40-50% cost reduction
  • Monitor context window usage and truncate conversations at 80% capacity
  • Project monthly spend from rolling daily averages
  • Cap completion tokens to prevent runaway costs

Conclusion

Token economics is not optional; it is foundational to sustainable LLM applications. The engineers who win treat tokens like a metered resource: count them, budget them, optimize them. Start with accurate token counting, add budget enforcement, then layer in compression and batching. Your CFO will thank you.

Written by Sanjeev Sharma

Full Stack Engineer · E-mopro