LLM Token Economics — Counting, Budgeting, and Optimizing Your AI Costs

Introduction

LLM costs scale with tokens, not requests. A single poorly optimized prompt can cost more than a thousand efficient ones. This guide teaches you to count, budget, and optimize tokens like a production engineer manages infrastructure.

Understanding Token Counting with Tiktoken

Tokens are not words. "hello" is one token, "antidisestablishmentarianism" is multiple. Counting accurately is critical to predicting costs.

import { encoding_for_model } from 'js-tiktoken';

interface TokenCountResult {
  prompt_tokens: number;
  completion_tokens: number;
  estimated_cost: number;
}

class TokenCounter {
  private encoder = encoding_for_model('gpt-4o');
  private pricing = {
    // Illustrative per-1K-token rates; verify against current provider pricing
    'gpt-4o': { input: 0.005, output: 0.015 },
    'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
    'claude-3-opus': { input: 0.015, output: 0.075 }
  };

  countPromptTokens(text: string): number {
    return this.encoder.encode(text).length;
  }

  estimateCost(
    model: string,
    promptTokens: number,
    completionTokens: number
  ): number {
    const rates = this.pricing[model as keyof typeof this.pricing];
    if (!rates) throw new Error(`Unknown model: ${model}`);

    const inputCost = (promptTokens / 1000) * rates.input;
    const outputCost = (completionTokens / 1000) * rates.output;
    return inputCost + outputCost;
  }

  analyzePrompt(prompt: string, model: string): TokenCountResult {
    const promptTokens = this.countPromptTokens(prompt);
    // Rough heuristic: completions for Q&A-style prompts often run
    // 30-50% of prompt length; calibrate against your own traffic
    const completionTokens = Math.ceil(promptTokens * 0.35);
    const estimatedCost = this.estimateCost(
      model,
      promptTokens,
      completionTokens
    );

    return {
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      estimated_cost: estimatedCost
    };
  }
}

// Usage
const counter = new TokenCounter();
const analysis = counter.analyzePrompt(
  'Explain quantum computing to a 10-year-old',
  'gpt-4o'
);
console.log(`Cost estimate: $${analysis.estimated_cost.toFixed(6)}`);

Context Window Management

Each model has a maximum token limit: GPT-4o supports 128K tokens, Claude 3 Opus supports 200K, while older models cap out at 4K. Exceeding the limit causes hard request failures, so always reserve room for the completion.

interface ContextWindowConfig {
  model: string;
  max_tokens: number;
  reserved_completion: number;
}

class ContextWindowManager {
  private configs: Record<string, ContextWindowConfig> = {
    'gpt-4o': { model: 'gpt-4o', max_tokens: 128000, reserved_completion: 4096 },
    'gpt-3.5-turbo': { model: 'gpt-3.5-turbo', max_tokens: 16384, reserved_completion: 2048 },
    'claude-3-opus': { model: 'claude-3-opus', max_tokens: 200000, reserved_completion: 4096 }
  };

  getAvailablePromptSpace(model: string): number {
    const config = this.configs[model];
    if (!config) throw new Error(`Unknown model: ${model}`);
    return config.max_tokens - config.reserved_completion;
  }

  willFit(model: string, promptTokens: number, estimatedCompletion: number): boolean {
    const available = this.getAvailablePromptSpace(model);
    return promptTokens + estimatedCompletion <= available;
  }

  truncateConversation(
    model: string,
    messages: Array<{ role: string; content: string }>,
    encoder: { encode(text: string): number[] }
  ): Array<{ role: string; content: string }> {
    // Leave 20% headroom below the prompt budget
    const maxPrompt = this.getAvailablePromptSpace(model) * 0.8;
    const systemMsg = messages.find(m => m.role === 'system');
    let totalTokens = systemMsg ? encoder.encode(systemMsg.content).length : 0;
    const kept: Array<{ role: string; content: string }> = [];

    // Walk backward from the most recent message and stop at the first
    // one that no longer fits, so the kept window has no gaps
    for (let i = messages.length - 1; i >= 0; i--) {
      const msg = messages[i];
      if (msg.role === 'system') continue; // re-added at the front below
      const tokens = encoder.encode(msg.content).length;
      if (totalTokens + tokens > maxPrompt) break;
      kept.unshift(msg);
      totalTokens += tokens;
    }

    // Always keep the system message, at the front
    if (systemMsg) kept.unshift(systemMsg);
    return kept;
  }
}
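A quick sanity check of the budgeting arithmetic, using the gpt-4o numbers from the config above:

```typescript
// gpt-4o figures from the config above: 128K window, 4,096 reserved
const maxTokens = 128_000;
const reservedCompletion = 4_096;
const availablePrompt = maxTokens - reservedCompletion; // 123,904 tokens

const fits = (promptTokens: number, estimatedCompletion: number): boolean =>
  promptTokens + estimatedCompletion <= availablePrompt;
```

A 120K-token prompt with a 2K estimated completion fits; a 123K-token prompt with the same completion does not.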

Per-Request Cost Tracking Middleware

Track actual costs incurred per request, not estimates. Implement middleware that logs costs alongside responses.

interface CostMetrics {
  model: string;
  prompt_tokens: number;
  completion_tokens: number;
  actual_cost: number;
  user_id?: string;
  timestamp: Date;
}

class CostTrackingMiddleware {
  private metrics: CostMetrics[] = [];
  private dailyBudget: Map<string, number> = new Map();
  private pricing = {
    // Illustrative per-1K-token rates; keep in sync with provider pricing
    'gpt-4o': { input: 0.005, output: 0.015 },
    'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 }
  };

  calculateCost(
    model: string,
    promptTokens: number,
    completionTokens: number
  ): number {
    const rates = this.pricing[model as keyof typeof this.pricing];
    if (!rates) throw new Error(`Unknown model: ${model}`);
    return (promptTokens / 1000) * rates.input +
           (completionTokens / 1000) * rates.output;
  }

  recordRequest(
    model: string,
    promptTokens: number,
    completionTokens: number,
    userId?: string
  ): void {
    const cost = this.calculateCost(model, promptTokens, completionTokens);
    this.metrics.push({
      model,
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      actual_cost: cost,
      user_id: userId,
      timestamp: new Date()
    });

    if (userId) {
      const key = `${userId}-${new Date().toDateString()}`;
      const current = this.dailyBudget.get(key) || 0;
      this.dailyBudget.set(key, current + cost);
    }
  }

  getUserDailySpend(userId: string): number {
    const key = `${userId}-${new Date().toDateString()}`;
    return this.dailyBudget.get(key) || 0;
  }

  getDailyReport(): Record<string, number> {
    const report: Record<string, number> = {};
    this.metrics
      .filter(m => m.timestamp.toDateString() === new Date().toDateString())
      .forEach(m => {
        const key = m.user_id || 'anonymous';
        report[key] = (report[key] || 0) + m.actual_cost;
      });
    return report;
  }
}

export const costTracking = new CostTrackingMiddleware();
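In production, record the token counts the provider reports back rather than your own estimates. A minimal sketch, assuming an OpenAI-style `usage` object on the response (`prompt_tokens`, `completion_tokens`) and illustrative per-1K rates:

```typescript
// Illustrative per-1K rates; verify against the provider's current price list
const RATES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.005, output: 0.015 }
};

// Shape of the usage block an OpenAI-style response reports
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

function costFromUsage(model: string, usage: Usage): number {
  const r = RATES[model];
  if (!r) throw new Error(`No pricing configured for ${model}`);
  return (usage.prompt_tokens / 1000) * r.input +
         (usage.completion_tokens / 1000) * r.output;
}

// 1,200 prompt tokens + 300 completion tokens on gpt-4o:
// 1.2 * 0.005 + 0.3 * 0.015 = $0.0105
const cost = costFromUsage('gpt-4o', { prompt_tokens: 1200, completion_tokens: 300 });
```

Feeding these reported counts into `recordRequest` keeps the daily budget map aligned with what you are actually billed.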

Token Budget Enforcement

Prevent runaway costs by enforcing hard budgets at the user and system level.

class TokenBudgetEnforcer {
  private userBudgets: Map<string, number> = new Map();
  private systemBudget: number = 1000; // $1000/day system budget
  private systemSpent: number = 0;

  setUserBudget(userId: string, dailyLimit: number): void {
    this.userBudgets.set(userId, dailyLimit);
  }

  canUseTokens(
    userId: string,
    estimatedCost: number,
    currentDaySpend: number
  ): { allowed: boolean; reason?: string } {
    const userBudget = this.userBudgets.get(userId);

    if (userBudget !== undefined && currentDaySpend + estimatedCost > userBudget) {
      return { allowed: false, reason: `Budget limit exceeded for user ${userId}` };
    }

    if (this.systemSpent + estimatedCost > this.systemBudget) {
      return { allowed: false, reason: 'System budget exhausted' };
    }

    return { allowed: true };
  }

  recordSpend(amount: number): void {
    this.systemSpent += amount;
    if (this.systemSpent > this.systemBudget * 0.8) {
      console.warn(`System spend at 80% of daily budget: $${this.systemSpent}`);
    }
  }

  resetDaily(): void {
    this.systemSpent = 0;
  }
}
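Budget enforcement caps spend per day; token-based rate limiting caps burst rate. A minimal sketch of a per-user token bucket, assuming a fixed refill rate (tokens per second) rather than any particular provider's limits:

```typescript
class TokenRateLimiter {
  private buckets = new Map<string, { tokens: number; lastRefill: number }>();

  constructor(
    private capacity: number,       // max tokens a user can burst
    private refillPerSecond: number // sustained tokens/second allowance
  ) {}

  // Returns true and deducts if the user may spend `tokens` right now
  tryConsume(userId: string, tokens: number, now: number = Date.now()): boolean {
    const bucket = this.buckets.get(userId) ??
      { tokens: this.capacity, lastRefill: now };

    // Refill proportionally to elapsed time, capped at capacity
    const elapsedSeconds = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond
    );
    bucket.lastRefill = now;

    if (tokens > bucket.tokens) {
      this.buckets.set(userId, bucket);
      return false;
    }
    bucket.tokens -= tokens;
    this.buckets.set(userId, bucket);
    return true;
  }
}
```

Check `tryConsume` with the estimated prompt-plus-completion tokens before calling the model; if you want tighter control, reconcile against the provider's reported usage afterwards.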

Prompt Compression Techniques

Compression techniques such as LLMLingua shrink prompts while preserving most of their semantic content, and can cut costs by roughly 20-40% depending on the workload.

interface CompressionResult {
  original_tokens: number;
  compressed_tokens: number;
  compression_ratio: number;
  compressed_text: string;
}

class PromptCompressor {
  // Simplified compression: removes redundant words and whitespace
  compressText(text: string): CompressionResult {
    const encoder = encoding_for_model('gpt-4o');
    const original = encoder.encode(text);

    // Remove common filler phrases and compress whitespace
    const compressed = text
      .replace(/\b(actually|basically|essentially|literally|really|very)\b/g, '')
      .replace(/\s+/g, ' ')
      .trim();

    const compressedTokens = encoder.encode(compressed);

    return {
      original_tokens: original.length,
      compressed_tokens: compressedTokens.length,
      compression_ratio: compressedTokens.length / original.length,
      compressed_text: compressed
    };
  }

  summarizeContext(text: string, targetTokens: number): string {
    const encoder = encoding_for_model('gpt-4o');
    const tokens = encoder.encode(text);

    if (tokens.length <= targetTokens) {
      return text;
    }

    // Fallback: proportional character truncation. A production system
    // would summarize with a cheaper model rather than cut mid-sentence
    const ratio = targetTokens / tokens.length;
    const charLimit = Math.floor(text.length * ratio);
    return text.substring(0, charLimit) + '...';
  }
}
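To see what the filler-stripping pass does in isolation, here is the same regex applied to a sample sentence (plain string operations, no tokenizer required):

```typescript
const text = 'This is basically a really very simple example that actually works';
const compressed = text
  .replace(/\b(actually|basically|essentially|literally|really|very)\b/g, '')
  .replace(/\s+/g, ' ')
  .trim();
// 'This is a simple example that works'
```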

Streaming vs Batch Cost Difference

Streaming costs the same as a standard request; it only improves perceived latency. Batch APIs, by contrast, typically offer around a 50% discount for work that is not time-sensitive.

interface CostComparison {
  model: string;
  requests: number;
  streaming_cost: number;
  batch_cost: number;
  savings: number;
}

class BatchOptimizer {
  private batchPricing = {
    // Illustrative per-1K rates; batch APIs typically discount both
    // input and output by about 50%. Verify current pricing
    'gpt-4o': {
      regular: { input: 0.005, output: 0.015 },
      batch: { input: 0.0025, output: 0.0075 }
    }
  };

  compareCosts(
    model: string,
    totalRequests: number,
    avgPromptTokens: number,
    avgCompletionTokens: number
  ): CostComparison {
    const rates = this.batchPricing[model as keyof typeof this.batchPricing];
    if (!rates) throw new Error(`Model not configured: ${model}`);

    const totalInputTokens = totalRequests * avgPromptTokens;
    const totalOutputTokens = totalRequests * avgCompletionTokens;

    const streamingCost = (totalInputTokens / 1000) * rates.regular.input +
                          (totalOutputTokens / 1000) * rates.regular.output;
    const batchCost = (totalInputTokens / 1000) * rates.batch.input +
                      (totalOutputTokens / 1000) * rates.batch.output;

    return {
      model,
      requests: totalRequests,
      streaming_cost: streamingCost,
      batch_cost: batchCost,
      savings: streamingCost - batchCost
    };
  }

  shouldUseBatch(
    cost: CostComparison,
    latencySensitivity: boolean
  ): boolean {
    if (latencySensitivity) return false;
    // Use batch if savings > 25% and the work is not time-sensitive
    return (cost.savings / cost.streaming_cost) > 0.25;
  }
}
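Worked through with round, illustrative numbers: 10,000 requests averaging 1,000 prompt tokens and 200 completion tokens, with a 50% batch discount on both input and output:

```typescript
// Illustrative per-1K rates; batch assumed to be a flat 50% discount
const regular = { input: 0.005, output: 0.015 };
const batch = { input: 0.0025, output: 0.0075 };

const requests = 10_000;
const inputTokens = requests * 1_000; // 10M input tokens
const outputTokens = requests * 200;  //  2M output tokens

const regularCost = (inputTokens / 1000) * regular.input +
                    (outputTokens / 1000) * regular.output; // 50 + 30 = $80
const batchCost = (inputTokens / 1000) * batch.input +
                  (outputTokens / 1000) * batch.output;     // 25 + 15 = $40
```

At this volume the batch path saves $40 per run, a flat 50% because both rates carry the same discount.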

Monthly Cost Projection Formula

Forecast monthly spend based on current velocity to catch runaway costs early.

interface ProjectionMetrics {
  daily_spend: number;
  projected_monthly: number;
  daily_average_requests: number;
  avg_cost_per_request: number;
  projected_monthly_requests: number;
}

class CostProjector {
  private dailyHistory: number[] = [];
  private requestHistory: number[] = [];

  recordDay(spend: number, requests: number): void {
    this.dailyHistory.push(spend);
    this.requestHistory.push(requests);
  }

  project(): ProjectionMetrics {
    if (this.dailyHistory.length === 0) {
      return {
        daily_spend: 0,
        projected_monthly: 0,
        daily_average_requests: 0,
        avg_cost_per_request: 0,
        projected_monthly_requests: 0
      };
    }

    const avgDaily = this.dailyHistory.reduce((a, b) => a + b, 0) / this.dailyHistory.length;
    const avgRequests = this.requestHistory.reduce((a, b) => a + b, 0) / this.requestHistory.length;
    const costPerRequest = avgRequests > 0 ? avgDaily / avgRequests : 0;

    return {
      daily_spend: avgDaily,
      projected_monthly: avgDaily * 30,
      daily_average_requests: avgRequests,
      avg_cost_per_request: costPerRequest,
      projected_monthly_requests: avgRequests * 30
    };
  }

  getAlertThreshold(budget: number): number {
    const projected = this.project();
    return projected.projected_monthly > budget * 0.8 ? projected.projected_monthly : 0;
  }
}
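The underlying formula is just a rolling daily average scaled to 30 days. With a week of hypothetical spend figures:

```typescript
// Hypothetical spend over the last 7 days, in dollars
const dailySpend = [12.4, 15.1, 9.8, 14.0, 13.3, 16.2, 11.9];
const avgDaily = dailySpend.reduce((a, b) => a + b, 0) / dailySpend.length;
const projectedMonthly = avgDaily * 30; // ~$397.29/month
```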

Response Length Control

Cap completion tokens to avoid surprise costs from verbose models.

interface ResponseControl {
  max_tokens: number;
  enforced: boolean;
}

class ResponseLengthController {
  enforceMaxTokens(
    model: string,
    maxTokens: number
  ): ResponseControl {
    // Different models have different sweet spots
    const limits: Record<string, number> = {
      'gpt-4o': Math.min(maxTokens, 4096),
      'gpt-3.5-turbo': Math.min(maxTokens, 2048),
      'claude-3-opus': Math.min(maxTokens, 4096)
    };

    return {
      max_tokens: limits[model] || maxTokens,
      enforced: true
    };
  }

  truncateResponse(
    text: string,
    maxTokens: number,
    encoder: { encode(t: string): number[]; decode(t: number[]): string }
  ): string {
    const tokens = encoder.encode(text);
    if (tokens.length <= maxTokens) return text;

    // Decode back to text at token boundary
    const truncated = encoder.decode(tokens.slice(0, maxTokens));
    return truncated + '...';
  }
}

Checklist

  • Install and use js-tiktoken for accurate token counting
  • Implement per-request cost tracking middleware in your API layer
  • Set daily/monthly budgets per user and at system level
  • Use token-based rate limiting, not request-based
  • Track actual costs vs estimates to calibrate models
  • Compress prompts using LLMLingua or similar (20-40% savings)
  • Batch non-time-sensitive requests for roughly 50% cost reduction
  • Monitor context window usage and truncate conversations at 80% capacity
  • Project monthly spend from rolling daily averages
  • Cap completion tokens to prevent runaway costs

Conclusion

Token economics is not optional; it is foundational to sustainable LLM applications. The engineers who win are those who treat tokens like a metered resource: count them, budget them, optimize them. Start with accurate token counting, add budget enforcement, then layer in compression and batching. Your CFO will thank you.