LLM Token Economics — Counting, Budgeting, and Optimizing Your AI Costs

Author: Sanjeev Sharma (@webcoderspeed1)

Introduction
LLM costs scale with tokens, not requests. A single poorly optimized prompt can cost more than a thousand efficient ones. This guide teaches you to count, budget, and optimize tokens like a production engineer manages infrastructure.
- Understanding Token Counting with Tiktoken
- Context Window Management
- Per-Request Cost Tracking Middleware
- Token Budget Enforcement
- Prompt Compression Techniques
- Streaming vs Batch Cost Difference
- Monthly Cost Projection Formula
- Response Length Control
- Checklist
- Conclusion
Understanding Token Counting with Tiktoken
Tokens are not words. "hello" is one token, "antidisestablishmentarianism" is multiple. Counting accurately is critical to predicting costs.
import { encodingForModel } from 'js-tiktoken';
interface TokenCountResult {
prompt_tokens: number;
completion_tokens: number;
estimated_cost: number;
}
class TokenCounter {
private encoder = encodingForModel('gpt-4o');
private pricing = {
'gpt-4o': { input: 0.015, output: 0.06 }, // illustrative USD per 1K tokens; verify against current provider pricing
'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
'claude-3-opus': { input: 0.015, output: 0.075 }
};
countPromptTokens(text: string): number {
return this.encoder.encode(text).length;
}
estimateCost(
model: string,
promptTokens: number,
completionTokens: number
): number {
const rates = this.pricing[model as keyof typeof this.pricing];
if (!rates) throw new Error(`Unknown model: ${model}`);
const inputCost = (promptTokens / 1000) * rates.input;
const outputCost = (completionTokens / 1000) * rates.output;
return inputCost + outputCost;
}
analyzePrompt(prompt: string, model: string): TokenCountResult {
const promptTokens = this.countPromptTokens(prompt);
// Estimate completion: typically 30-50% of prompt length for questions
const completionTokens = Math.ceil(promptTokens * 0.35);
const estimatedCost = this.estimateCost(
model,
promptTokens,
completionTokens
);
return { prompt_tokens: promptTokens, completion_tokens: completionTokens, estimated_cost: estimatedCost };
}
}
// Usage
const counter = new TokenCounter();
const analysis = counter.analyzePrompt(
'Explain quantum computing to a 10-year-old',
'gpt-4o'
);
console.log(`Cost estimate: $${analysis.estimated_cost.toFixed(6)}`);
Context Window Management
Each model has a maximum token limit. GPT-4o supports 128K, Claude 3 Opus supports 200K, but older models cap at 4K. Exceeding the limit causes failures.
interface ContextWindowConfig {
model: string;
max_tokens: number;
reserved_completion: number;
}
class ContextWindowManager {
private configs: Record<string, ContextWindowConfig> = {
'gpt-4o': { model: 'gpt-4o', max_tokens: 128000, reserved_completion: 4096 },
'gpt-3.5-turbo': { model: 'gpt-3.5-turbo', max_tokens: 16384, reserved_completion: 2048 },
'claude-3-opus': { model: 'claude-3-opus', max_tokens: 200000, reserved_completion: 4096 }
};
getAvailablePromptSpace(model: string): number {
const config = this.configs[model];
if (!config) throw new Error(`Unknown model: ${model}`);
return config.max_tokens - config.reserved_completion;
}
willFit(model: string, promptTokens: number, estimatedCompletion: number): boolean {
const available = this.getAvailablePromptSpace(model);
return promptTokens + estimatedCompletion <= available;
}
truncateConversation(
model: string,
messages: Array<{ role: string; content: string }>,
encoder: { encode(text: string): number[] }
): Array<{ role: string; content: string }> {
// Reserve 20% headroom, then fill with the most recent messages first
const maxPrompt = Math.floor(this.getAvailablePromptSpace(model) * 0.8);
const systemMsg = messages[0]?.role === 'system' ? messages[0] : null;
let totalTokens = systemMsg ? encoder.encode(systemMsg.content).length : 0;
const kept: Array<{ role: string; content: string }> = [];
for (let i = messages.length - 1; i >= 0; i--) {
const msg = messages[i];
if (msg === systemMsg) continue;
const tokens = encoder.encode(msg.content).length;
if (totalTokens + tokens > maxPrompt) break; // stop at the oldest message that no longer fits
kept.unshift(msg);
totalTokens += tokens;
}
// The system message is always kept, even when the budget is tight
if (systemMsg) kept.unshift(systemMsg);
return kept;
}
}
}
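A quick sanity check of the fit rule above, as a standalone sketch. The limits mirror the gpt-4o entry in the config table, not live API values:

```typescript
// Standalone sketch of the fit check: prompt + estimated completion must fit
// in (context limit - reserved completion). Numbers mirror the table above.
const MAX_CONTEXT = 128_000;       // gpt-4o context window
const RESERVED_COMPLETION = 4_096; // tokens held back for the response
const availablePromptSpace = MAX_CONTEXT - RESERVED_COMPLETION;

function willFit(promptTokens: number, estimatedCompletion: number): boolean {
  return promptTokens + estimatedCompletion <= availablePromptSpace;
}

console.log(availablePromptSpace);    // 123904
console.log(willFit(120_000, 2_000)); // true
console.log(willFit(123_000, 2_000)); // false
```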
Per-Request Cost Tracking Middleware
Track actual costs incurred per request, not estimates. Implement middleware that logs costs alongside responses.
interface CostMetrics {
model: string;
prompt_tokens: number;
completion_tokens: number;
actual_cost: number;
user_id?: string;
timestamp: Date;
}
class CostTrackingMiddleware {
private metrics: CostMetrics[] = [];
private dailyBudget: Map<string, number> = new Map();
private pricing = {
'gpt-4o': { input: 0.015, output: 0.06 }, // illustrative USD per 1K tokens
'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 }
};
calculateCost(
model: string,
promptTokens: number,
completionTokens: number
): number {
const rates = this.pricing[model as keyof typeof this.pricing];
if (!rates) throw new Error(`Unknown model: ${model}`);
return (promptTokens / 1000) * rates.input + (completionTokens / 1000) * rates.output;
}
recordRequest(
model: string,
promptTokens: number,
completionTokens: number,
userId?: string
): void {
const cost = this.calculateCost(model, promptTokens, completionTokens);
this.metrics.push({
model,
prompt_tokens: promptTokens,
completion_tokens: completionTokens,
actual_cost: cost,
user_id: userId,
timestamp: new Date()
});
if (userId) {
const key = `${userId}-${new Date().toDateString()}`;
const current = this.dailyBudget.get(key) || 0;
this.dailyBudget.set(key, current + cost);
}
}
getUserDailySpend(userId: string): number {
const key = `${userId}-${new Date().toDateString()}`;
return this.dailyBudget.get(key) || 0;
}
getDailyReport(): Record<string, number> {
const report: Record<string, number> = {};
this.metrics
.filter(m => m.timestamp.toDateString() === new Date().toDateString())
.forEach(m => {
const key = m.user_id || 'anonymous';
report[key] = (report[key] || 0) + m.actual_cost;
});
return report;
}
}
export const costTracking = new CostTrackingMiddleware();
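The recording step can also live in a thin wrapper around the model call rather than framework middleware. A minimal, framework-agnostic sketch, where `fakeLLMCall` is a hypothetical stand-in for a real client and the rates are the illustrative per-1K figures used above:

```typescript
// Sketch: wrap the model call so every request records its cost per user.
// fakeLLMCall is a hypothetical stand-in returning a simulated usage payload.
type Usage = { prompt_tokens: number; completion_tokens: number };
const RATE_INPUT = 0.015 / 1000; // illustrative USD per input token
const RATE_OUTPUT = 0.06 / 1000; // illustrative USD per output token

const spendByUser = new Map<string, number>();

function fakeLLMCall(_prompt: string): Usage {
  return { prompt_tokens: 1200, completion_tokens: 400 }; // simulated usage
}

function trackedCall(userId: string, prompt: string): Usage {
  const usage = fakeLLMCall(prompt);
  const cost = usage.prompt_tokens * RATE_INPUT + usage.completion_tokens * RATE_OUTPUT;
  spendByUser.set(userId, (spendByUser.get(userId) ?? 0) + cost);
  return usage;
}

trackedCall('user-1', 'Summarize this document');
trackedCall('user-1', 'Now translate it');
console.log(spendByUser.get('user-1')); // ~0.084 USD across the two calls
```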
Token Budget Enforcement
Prevent runaway costs by enforcing hard budgets at the user and system level.
class TokenBudgetEnforcer {
private userBudgets: Map<string, number> = new Map();
private systemBudget: number = 1000; // $1000/day system budget
private systemSpent: number = 0;
setUserBudget(userId: string, dailyLimit: number): void {
this.userBudgets.set(userId, dailyLimit);
}
canUseTokens(
userId: string,
estimatedCost: number,
currentDaySpend: number
): { allowed: boolean; reason?: string } {
const userBudget = this.userBudgets.get(userId);
if (userBudget && currentDaySpend + estimatedCost > userBudget) {
return { allowed: false, reason: `Budget limit exceeded for user ${userId}` };
}
if (this.systemSpent + estimatedCost > this.systemBudget) {
return { allowed: false, reason: 'System budget exhausted' };
}
return { allowed: true };
}
recordSpend(amount: number): void {
this.systemSpent += amount;
if (this.systemSpent > this.systemBudget * 0.8) {
console.warn(`System spend at 80% of daily budget: $${this.systemSpent}`);
}
}
resetDaily(): void {
this.systemSpent = 0;
}
}
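The heart of the enforcer is a pre-flight gate: reject the request before it is sent if the estimate would push the user past their limit. A minimal standalone sketch with assumed dollar figures:

```typescript
// Pre-flight budget gate: allow the request only if the estimated cost keeps
// the user's day total at or under their limit.
function canSpend(dailyLimit: number, spentToday: number, estimatedCost: number): boolean {
  return spentToday + estimatedCost <= dailyLimit;
}

console.log(canSpend(5.00, 4.50, 0.40)); // true  — still under the $5 limit
console.log(canSpend(5.00, 4.90, 0.40)); // false — would exceed the limit
```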
Prompt Compression Techniques
Compression techniques such as LLMLingua reduce token counts while preserving semantic meaning, typically cutting costs by 20-40% on long prompts.
import { encodingForModel } from 'js-tiktoken';
interface CompressionResult {
original_tokens: number;
compressed_tokens: number;
compression_ratio: number;
compressed_text: string;
}
class PromptCompressor {
// Simplified compression: removes filler words and redundant whitespace
compressText(text: string): CompressionResult {
const encoder = encodingForModel('gpt-4o');
const original = encoder.encode(text);
// Remove common filler phrases and compress whitespace
const compressed = text
.replace(/\b(actually|basically|essentially|literally|really|very)\b/g, '')
.replace(/\s+/g, ' ')
.trim();
const compressedTokens = encoder.encode(compressed);
return {
original_tokens: original.length,
compressed_tokens: compressedTokens.length,
compression_ratio: compressedTokens.length / original.length,
compressed_text: compressed
};
}
summarizeContext(text: string, targetTokens: number): string {
const encoder = encodingForModel('gpt-4o');
const tokens = encoder.encode(text);
if (tokens.length <= targetTokens) {
return text;
}
// Simple truncation strategy
const ratio = targetTokens / tokens.length;
const charLimit = Math.floor(text.length * ratio);
return text.substring(0, charLimit) + '...';
}
}
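To see the filler-removal pass in isolation, here is a standalone sketch that substitutes the rough 4-characters-per-token heuristic for a real tokenizer, so it runs without js-tiktoken:

```typescript
// Standalone sketch of the filler-removal pass. approxTokens uses the rough
// ~4 chars/token heuristic; a real implementation would count with a tokenizer.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function stripFiller(text: string): string {
  return text
    .replace(/\b(actually|basically|essentially|literally|really|very)\b/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

const before = 'This is actually a very simple and really basically clear request.';
const after = stripFiller(before);
console.log(after); // "This is a simple and clear request."
console.log(approxTokens(before), approxTokens(after)); // 17 9
```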
Streaming vs Batch Cost Difference
Streaming costs the same as a regular request but feels faster to users. The Batch API, by contrast, offers roughly a 50% discount for non-time-sensitive work.
interface CostComparison {
model: string;
requests: number;
streaming_cost: number;
batch_cost: number;
savings: number;
}
class BatchOptimizer {
private batchPricing = {
'gpt-4o': {
input: { regular: 0.015, batch: 0.0075 }, // illustrative USD per 1K input tokens
output: { regular: 0.06, batch: 0.03 } // illustrative USD per 1K output tokens
}
};
compareCosts(
model: string,
totalRequests: number,
avgPromptTokens: number,
avgCompletionTokens: number
): CostComparison {
const rates = this.batchPricing[model as keyof typeof this.batchPricing];
if (!rates) throw new Error(`Model not configured: ${model}`);
const totalInputTokens = totalRequests * avgPromptTokens;
const totalOutputTokens = totalRequests * avgCompletionTokens;
const streamingCost = (totalInputTokens / 1000) * rates.input.regular +
(totalOutputTokens / 1000) * rates.output.regular;
const batchCost = (totalInputTokens / 1000) * rates.input.batch +
(totalOutputTokens / 1000) * rates.output.batch;
return {
model,
requests: totalRequests,
streaming_cost: streamingCost,
batch_cost: batchCost,
savings: streamingCost - batchCost
};
}
shouldUseBatch(
cost: CostComparison,
latencySensitivity: boolean
): boolean {
if (latencySensitivity) return false;
// Use batch if savings > 25% and not time-sensitive
return (cost.savings / cost.streaming_cost) > 0.25;
}
}
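A worked example of the 50% discount, using the illustrative gpt-4o rates above and assumed volumes of 10,000 requests per day at 800 prompt plus 300 completion tokens each:

```typescript
// Worked batch-discount example (assumed volumes, illustrative rates).
const requests = 10_000;
const inPer1K = 0.015; // regular input rate, USD per 1K tokens (illustrative)
const outPer1K = 0.06; // regular output rate, USD per 1K tokens (illustrative)
const regularDaily = requests * ((800 / 1000) * inPer1K + (300 / 1000) * outPer1K);
const batchDaily = regularDaily * 0.5; // batch halves both input and output rates
console.log(regularDaily.toFixed(2), batchDaily.toFixed(2)); // "300.00" "150.00"
```

At this volume the discount is worth about $150/day, which is why routing anything latency-insensitive through the batch queue pays off quickly.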
Monthly Cost Projection Formula
Forecast monthly spend based on current velocity to catch runaway costs early.
interface ProjectionMetrics {
daily_spend: number;
projected_monthly: number;
daily_average_requests: number;
avg_cost_per_request: number;
projected_monthly_requests: number;
}
class CostProjector {
private dailyHistory: number[] = [];
private requestHistory: number[] = [];
recordDay(spend: number, requests: number): void {
this.dailyHistory.push(spend);
this.requestHistory.push(requests);
}
project(): ProjectionMetrics {
if (this.dailyHistory.length === 0) {
return {
daily_spend: 0,
projected_monthly: 0,
daily_average_requests: 0,
avg_cost_per_request: 0,
projected_monthly_requests: 0
};
}
const avgDaily = this.dailyHistory.reduce((a, b) => a + b, 0) / this.dailyHistory.length;
const avgRequests = this.requestHistory.reduce((a, b) => a + b, 0) / this.requestHistory.length;
const costPerRequest = avgRequests > 0 ? avgDaily / avgRequests : 0;
return {
daily_spend: avgDaily,
projected_monthly: avgDaily * 30,
daily_average_requests: avgRequests,
avg_cost_per_request: costPerRequest,
projected_monthly_requests: avgRequests * 30
};
}
// Returns the projected monthly spend once it crosses 80% of budget, else 0
getAlertThreshold(budget: number): number {
const projected = this.project();
return projected.projected_monthly > budget * 0.8 ? projected.projected_monthly : 0;
}
}
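The projection itself is just a rolling average extrapolated to 30 days. A worked example with three days of assumed spend history:

```typescript
// Worked projection: average the daily history, then extrapolate to 30 days —
// the same formula project() applies. The history values are assumed.
const history = [42.10, 38.50, 45.40];
const avgDaily = history.reduce((a, b) => a + b, 0) / history.length;
const projectedMonthly = avgDaily * 30;
console.log(avgDaily.toFixed(2));         // "42.00"
console.log(projectedMonthly.toFixed(2)); // "1260.00"
```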
Response Length Control
Cap completion tokens to avoid surprise costs from verbose models.
interface ResponseControl {
max_tokens: number;
enforced: boolean;
}
class ResponseLengthController {
enforceMaxTokens(
model: string,
maxTokens: number
): ResponseControl {
// Different models have different sweet spots
const limits: Record<string, number> = {
'gpt-4o': Math.min(maxTokens, 4096),
'gpt-3.5-turbo': Math.min(maxTokens, 2048),
'claude-3-opus': Math.min(maxTokens, 4096)
};
return {
max_tokens: limits[model] || maxTokens,
enforced: true
};
}
truncateResponse(text: string, maxTokens: number, encoder: any): string {
const tokens = encoder.encode(text);
if (tokens.length <= maxTokens) return text;
// Decode back to text at token boundary
const truncated = encoder.decode(tokens.slice(0, maxTokens));
return truncated + '...';
}
}
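For a quick client-side guard without a tokenizer dependency, the truncation step can be approximated with the same ~4 chars/token heuristic; a real implementation would slice tokenizer output as above:

```typescript
// Sketch of length capping via the ~4 chars/token approximation. Only a
// safety net — real truncation should happen at true token boundaries.
function truncateApprox(text: string, maxTokens: number): string {
  const charLimit = maxTokens * 4;
  return text.length <= charLimit ? text : text.slice(0, charLimit) + '...';
}

console.log(truncateApprox('short answer', 10));  // "short answer" (unchanged)
console.log(truncateApprox('x'.repeat(100), 10)); // 40 chars plus "..."
```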
Checklist
- Install and use js-tiktoken for accurate token counting
- Implement per-request cost tracking middleware in your API layer
- Set daily/monthly budgets per user and at system level
- Use token-based rate limiting, not request-based
- Track actual costs vs estimates to calibrate models
- Compress prompts using LLMLingua or similar (20-40% savings)
- Batch non-time-sensitive requests for roughly 50% cost reduction
- Monitor context window usage and truncate conversations at 80% capacity
- Project monthly spend from rolling daily averages
- Cap completion tokens to prevent runaway costs
Conclusion
Token economics is not optional; it is foundational to sustainable LLM applications. The engineers who win are those who treat tokens like a metered resource: count them, budget them, optimize them. Start with accurate token counting, add budget enforcement, then layer in compression and batching. Your CFO will thank you.