Intelligent LLM Model Routing — Sending the Right Query to the Right Model

- Published on
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Not every query needs GPT-4. Simple questions can be answered on GPT-3.5 for a small fraction of GPT-4's price (roughly 30x less per input token at the tier rates used below). This guide covers intelligent routing: classify query complexity, match queries to model tiers, and A/B test routing decisions.
- Query Complexity Classification
- Cost Tiers (Nano/Micro/Standard/Premium)
- Router Model for Intelligent Routing
- Intent-Based Routing
- Confidence-Based Escalation
- Latency-Based Routing for SLA
- A/B Testing Across Models
- Routing Metrics and Monitoring
- Checklist
- Conclusion
Query Complexity Classification
Classify queries as simple, medium, or complex to route appropriately.
```typescript
interface QueryClassification {
  complexity: 'simple' | 'medium' | 'complex';
  score: number;
  reasons: string[];
}

class QueryComplexityClassifier {
  classify(query: string): QueryClassification {
    const reasons: string[] = [];
    let score = 0;

    // Length is a strong signal
    const wordCount = query.trim().split(/\s+/).length;
    if (wordCount < 10) {
      score += 10;
      reasons.push('Very short query');
    } else if (wordCount < 30) {
      score += 20;
      reasons.push('Short query');
    } else if (wordCount < 100) {
      score += 40;
      reasons.push('Medium length query');
    } else if (wordCount < 300) {
      score += 60;
      reasons.push('Long query');
    } else {
      score += 80;
      reasons.push('Very long query');
    }

    // Code presence signals complexity (count fenced pairs, not lone fences)
    const codeBlocks = Math.floor((query.match(/```/g) || []).length / 2);
    if (codeBlocks > 0) {
      score += 30 * codeBlocks;
      reasons.push(`Contains ${codeBlocks} code block(s)`);
    }

    // Math/formula presence
    const hasFormulas = /[+\-*/=^]|integral|derivative|sqrt|matrix/i.test(query);
    if (hasFormulas) {
      score += 25;
      reasons.push('Contains mathematical expressions');
    }

    // Multi-step reasoning indicators
    const complexKeywords = /explain|why|how|design|architecture|optimize|refactor|debug|analyze/i;
    if (complexKeywords.test(query)) {
      score += 20;
      reasons.push('Requires reasoning');
    }

    // Multiple questions
    const questionCount = (query.match(/\?/g) || []).length;
    if (questionCount > 1) {
      score += 15 * questionCount;
      reasons.push(`Contains ${questionCount} questions`);
    }

    // Determine category
    let complexity: 'simple' | 'medium' | 'complex';
    if (score < 30) complexity = 'simple';
    else if (score < 70) complexity = 'medium';
    else complexity = 'complex';

    return { complexity, score, reasons };
  }
}
```
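The scoring thresholds can be exercised in isolation. This is a minimal standalone sketch using the same word-count buckets and category cutoffs as the classifier above (the sample query is hypothetical):

```typescript
type Complexity = 'simple' | 'medium' | 'complex';

// Same category cutoffs as the full classifier: <30 simple, <70 medium, else complex.
function complexityOf(score: number): Complexity {
  if (score < 30) return 'simple';
  if (score < 70) return 'medium';
  return 'complex';
}

// Word-count contribution only, same buckets as the full classifier.
function wordScore(query: string): number {
  const n = query.trim().split(/\s+/).length;
  return n < 10 ? 10 : n < 30 ? 20 : n < 100 ? 40 : n < 300 ? 60 : 80;
}

console.log(complexityOf(wordScore('What is TypeScript?'))); // → 'simple'
```

Because the score is additive, a short query that also contains code or multiple questions can still land in a higher tier than its length alone suggests.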
Cost Tiers (Nano/Micro/Standard/Premium)
Define model tiers by cost and capability.
```typescript
interface ModelTier {
  name: 'nano' | 'micro' | 'standard' | 'premium';
  models: string[];
  cost_per_1k_input: number;
  cost_per_1k_output: number;
  latency_p95_ms: number;
  capability_score: number;
}

class ModelTierRegistry {
  // Models and prices are illustrative; check your providers' current rates.
  private tiers: Record<string, ModelTier> = {
    nano: {
      name: 'nano',
      models: ['mistral-7b', 'phi-2', 'gemma-7b'],
      cost_per_1k_input: 0.0001,
      cost_per_1k_output: 0.0003,
      latency_p95_ms: 400,
      capability_score: 2
    },
    micro: {
      name: 'micro',
      models: ['gpt-3.5-turbo', 'mistral-small', 'llama-2-13b'],
      cost_per_1k_input: 0.0005,
      cost_per_1k_output: 0.0015,
      latency_p95_ms: 800,
      capability_score: 5
    },
    standard: {
      name: 'standard',
      models: ['gpt-4', 'claude-3-sonnet', 'gemini-pro'],
      cost_per_1k_input: 0.015,
      cost_per_1k_output: 0.06,
      latency_p95_ms: 2000,
      capability_score: 8
    },
    premium: {
      name: 'premium',
      models: ['gpt-4o', 'claude-3-opus', 'gemini-ultra'],
      cost_per_1k_input: 0.03,
      cost_per_1k_output: 0.12,
      latency_p95_ms: 3000,
      capability_score: 10
    }
  };

  getTierForComplexity(complexity: 'simple' | 'medium' | 'complex'): ModelTier {
    const tierMap = {
      simple: 'nano',
      medium: 'standard',
      complex: 'premium'
    } as const;
    return this.tiers[tierMap[complexity]];
  }

  getCheapestModel(): string {
    return this.tiers.nano.models[0];
  }

  getMostCapableModel(): string {
    return this.tiers.premium.models[0];
  }

  estimateCost(model: string, inputTokens: number, outputTokens: number): number {
    // Find the tier containing this model
    for (const tier of Object.values(this.tiers)) {
      if (tier.models.includes(model)) {
        const inputCost = (inputTokens / 1000) * tier.cost_per_1k_input;
        const outputCost = (outputTokens / 1000) * tier.cost_per_1k_output;
        return inputCost + outputCost;
      }
    }
    throw new Error(`Unknown model: ${model}`);
  }
}
```
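As a sanity check on the tier math, here is the per-request cost formula on its own; the rates plugged in are the micro-tier numbers from the table above:

```typescript
// cost = (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
function requestCost(
  inputTokens: number,
  outputTokens: number,
  ratePer1kInput: number,
  ratePer1kOutput: number
): number {
  return (inputTokens / 1000) * ratePer1kInput + (outputTokens / 1000) * ratePer1kOutput;
}

// Micro tier: $0.0005/1K input, $0.0015/1K output.
// 1200 input + 400 output tokens → 0.0006 + 0.0006 = $0.0012
const cost = requestCost(1200, 400, 0.0005, 0.0015);
```

At the standard-tier rates (0.015 / 0.06), the same request would cost $0.042, which is the gap that routing exploits.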
Router Model for Intelligent Routing
Use a cheap model to classify queries and select more expensive ones only when needed.
```typescript
interface RoutingDecision {
  selected_model: string;
  tier: string;
  confidence: number;
  explanation: string;
}

class IntelligentRouter {
  private classifier = new QueryComplexityClassifier();
  private tierRegistry = new ModelTierRegistry();
  private classifierClient: any; // OpenAI client

  async routeQuery(query: string): Promise<RoutingDecision> {
    // First: classify with heuristics (free)
    const heuristicClassification = this.classifier.classify(query);

    // Second: verify with a cheap model (cost: ~$0.001)
    const llmClassification = await this.classifyWithLLM(query, heuristicClassification);

    // Select tier based on consensus
    const tier = this.selectTier(
      heuristicClassification.complexity,
      llmClassification.confidence
    );

    // Pick the cheapest model in the tier
    const selectedModel = tier.models[0];

    return {
      selected_model: selectedModel,
      tier: tier.name,
      confidence: llmClassification.confidence,
      explanation: `Query is ${llmClassification.reasoning}. Routing to ${selectedModel}.`
    };
  }

  private async classifyWithLLM(
    query: string,
    heuristic: QueryClassification
  ): Promise<{ complexity: string; confidence: number; reasoning: string }> {
    const prompt = `Classify query complexity (simple/medium/complex). Query: "${query.substring(0, 100)}"`;
    const response = await this.classifierClient.chat.completions.create({
      model: 'gpt-3.5-turbo', // Cheap classifier
      messages: [{ role: 'user', content: prompt }],
      temperature: 0,
      max_tokens: 50
    });
    const content = (response.choices[0].message.content ?? '').toLowerCase();

    let complexity: 'simple' | 'medium' | 'complex' = heuristic.complexity;
    let confidence = 0.7;
    if (content.includes('simple') || content.includes('easy')) {
      complexity = 'simple';
      confidence = 0.8;
    } else if (content.includes('complex') || content.includes('hard')) {
      complexity = 'complex';
      confidence = 0.8;
    }

    return {
      complexity,
      confidence,
      reasoning: content.substring(0, 50)
    };
  }

  private selectTier(
    heuristic: 'simple' | 'medium' | 'complex',
    llmConfidence: number
  ): ModelTier {
    // If the cheap classifier is confident, trust the heuristic category as-is
    if (llmConfidence > 0.75) {
      return this.tierRegistry.getTierForComplexity(heuristic);
    }
    // Otherwise be conservative and escalate one complexity level
    const escalate: Record<'simple' | 'medium' | 'complex', 'simple' | 'medium' | 'complex'> = {
      simple: 'medium',
      medium: 'complex',
      complex: 'complex'
    };
    return this.tierRegistry.getTierForComplexity(escalate[heuristic]);
  }
}
```
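The consensus rule reduces to a small pure function. This sketch expresses one conservative variant: escalate one complexity level when the LLM classifier is unsure, then map complexity to a tier the way the registry does (simple → nano, medium → standard, complex → premium):

```typescript
type Complexity = 'simple' | 'medium' | 'complex';
type TierName = 'nano' | 'micro' | 'standard' | 'premium';

const TIER_FOR: Record<Complexity, TierName> = {
  simple: 'nano',
  medium: 'standard',
  complex: 'premium'
};

// When the cheap LLM classifier is not confident, escalate one
// complexity level before mapping to a tier.
function pickTier(heuristic: Complexity, llmConfidence: number): TierName {
  const bump: Record<Complexity, Complexity> = {
    simple: 'medium',
    medium: 'complex',
    complex: 'complex'
  };
  const effective = llmConfidence > 0.75 ? heuristic : bump[heuristic];
  return TIER_FOR[effective];
}
```

The 0.75 threshold is a tunable knob: raising it routes more traffic to expensive tiers; lowering it trusts the heuristic more.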
Intent-Based Routing
Route based on detected intent (customer support, code generation, creative writing).
```typescript
interface IntentClassification {
  intent: 'support' | 'coding' | 'creative' | 'analysis' | 'general';
  confidence: number;
}

type Intent = IntentClassification['intent'];

class IntentBasedRouter {
  private intentPatterns: Record<string, RegExp[]> = {
    support: [
      /help|support|issue|problem|error|bug|broken|fail/i,
      /how to|how do i|how can i/i
    ],
    coding: [
      /code|function|class|script|implement|debug|refactor|api/i,
      /write|generate|create.*code/i
    ],
    creative: [
      /write|story|poem|essay|article|novel|song|blog/i,
      /creative|imagine|brainstorm/i
    ],
    analysis: [
      /analyze|analysis|explain|understand|why|research/i,
      /data|statistics|trend|pattern/i
    ]
  };

  classify(query: string): IntentClassification {
    const scores: Record<Intent, number> = {
      support: 0,
      coding: 0,
      creative: 0,
      analysis: 0,
      general: 0
    };

    for (const [intent, patterns] of Object.entries(this.intentPatterns)) {
      for (const pattern of patterns) {
        if (pattern.test(query)) {
          scores[intent as Intent]++;
        }
      }
    }

    let maxScore = 0;
    let bestIntent: Intent = 'general';
    for (const [intent, score] of Object.entries(scores)) {
      if (score > maxScore) {
        maxScore = score;
        bestIntent = intent as Intent;
      }
    }

    const totalScore = Object.values(scores).reduce((a, b) => a + b, 0);
    const confidence = totalScore > 0 ? maxScore / totalScore : 0.5;
    return { intent: bestIntent, confidence };
  }

  selectModelForIntent(intent: string): string {
    const modelMap: Record<string, string> = {
      support: 'gpt-3.5-turbo', // Fast, follows instructions well
      coding: 'gpt-4', // Strong at code
      creative: 'claude-3-opus', // Strong at writing
      analysis: 'gpt-4', // Strong at reasoning
      general: 'gpt-3.5-turbo' // Default
    };
    return modelMap[intent] || 'gpt-3.5-turbo';
  }
}
```
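The core of this router is pattern voting: count how many patterns match per intent and take the argmax. A standalone sketch with two hypothetical intents:

```typescript
// Count matching patterns per intent; return the intent with the most votes,
// falling back to 'general' when nothing matches.
function voteIntent(query: string, patterns: Record<string, RegExp[]>): string {
  let best = 'general';
  let bestScore = 0;
  for (const [intent, regs] of Object.entries(patterns)) {
    const score = regs.filter((r) => r.test(query)).length;
    if (score > bestScore) {
      best = intent;
      bestScore = score;
    }
  }
  return best;
}

const demoPatterns: Record<string, RegExp[]> = {
  coding: [/code|function|implement/i, /debug|refactor/i],
  support: [/help|issue|error/i]
};
```

Note the regexes deliberately omit the `g` flag: a global regex keeps `lastIndex` state between `test()` calls, which would make votes order-dependent.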
Confidence-Based Escalation
Start with cheap model, escalate to expensive one if confidence is low.
```typescript
interface EscalationConfig {
  initial_model: string;
  escalation_threshold: number;
  max_escalations: number;
}

class ConfidenceBasedEscalator {
  private escalationChain = [
    'gpt-3.5-turbo',
    'gpt-4',
    'gpt-4o',
    'claude-3-opus'
  ];

  // client comes before config so the default config is actually usable
  async respondWithEscalation(
    query: string,
    client: any, // OpenAI-compatible client
    config: EscalationConfig = {
      initial_model: 'gpt-3.5-turbo',
      escalation_threshold: 0.6,
      max_escalations: 2
    }
  ): Promise<{ response: string; model_used: string; escalations: number }> {
    // Fall back to the start of the chain if the initial model is unknown
    const startIndex = this.escalationChain.indexOf(config.initial_model);
    let currentModelIndex = startIndex >= 0 ? startIndex : 0;
    let escalations = 0;

    for (let attempt = 0; attempt < config.max_escalations + 1; attempt++) {
      const model = this.escalationChain[currentModelIndex];
      const response = await client.chat.completions.create({
        model,
        messages: [{ role: 'user', content: query }],
        temperature: 0.5
      });
      const content = response.choices[0].message.content;

      // Estimate confidence (heuristic)
      const confidence = this.estimateConfidence(content, query);
      if (confidence >= config.escalation_threshold) {
        return { response: content, model_used: model, escalations };
      }

      // Escalate to the next model
      if (currentModelIndex < this.escalationChain.length - 1) {
        currentModelIndex++;
        escalations++;
      } else {
        // Already at the strongest model; return the answer anyway
        return { response: content, model_used: model, escalations };
      }
    }
    throw new Error('Failed to get confident response after max escalations');
  }

  private estimateConfidence(response: string, query: string): number {
    let confidence = 0.5;
    // Very short response = low confidence
    if (response.length < 50) confidence -= 0.2;
    // Hedging phrases = low confidence
    if (/i'm not sure|i don't know|uncertain|unsure|likely|probably|maybe/i.test(response)) {
      confidence -= 0.15;
    }
    // Specific details = higher confidence
    if (/\b\d{4}\b|specifically|exactly|definitely|certainly/i.test(response)) {
      confidence += 0.15;
    }
    return Math.max(0, Math.min(1, confidence));
  }
}
```
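The confidence heuristic is easy to test on its own. This standalone version uses the same signals and weights (length penalty, hedging penalty, specificity bonus), clamped to [0, 1]; the sample responses are hypothetical:

```typescript
// Cheap text-only confidence estimate: penalize short or hedged answers,
// reward specific details such as years or emphatic wording.
function estimateConfidence(response: string): number {
  let confidence = 0.5;
  if (response.length < 50) confidence -= 0.2;
  if (/i'm not sure|i don't know|uncertain|unsure|likely|probably|maybe/i.test(response)) {
    confidence -= 0.15;
  }
  if (/\b\d{4}\b|specifically|exactly|definitely|certainly/i.test(response)) {
    confidence += 0.15;
  }
  return Math.max(0, Math.min(1, confidence));
}

// Short + hedged: 0.5 - 0.2 - 0.15 = 0.15 → triggers escalation at a 0.6 threshold
const low = estimateConfidence('Maybe.');
```

Treat this as a placeholder: logprob-based confidence or a dedicated verifier model will beat regex heuristics, but this version costs nothing extra per request.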
Latency-Based Routing for SLA
Route based on latency budget to meet SLA requirements.
```typescript
interface LatencySLA {
  p95_ms: number;
  p99_ms: number;
}

class LatencyBasedRouter {
  // Observed latencies; refresh these from production metrics
  private modelLatencies: Record<string, { p95: number; p99: number }> = {
    'gpt-3.5-turbo': { p95: 800, p99: 1200 },
    'gpt-4': { p95: 2000, p99: 3000 },
    'gpt-4o': { p95: 2500, p99: 3500 },
    'claude-3-opus': { p95: 3000, p99: 4000 },
    'mistral-7b': { p95: 200, p99: 400 } // Local deployment
  };

  selectModelForSLA(sla: LatencySLA): string {
    // Find models that meet the SLA, preferring the fastest
    const candidates = Object.entries(this.modelLatencies)
      .filter(([, latency]) => latency.p95 <= sla.p95_ms && latency.p99 <= sla.p99_ms)
      .sort((a, b) => a[1].p95 - b[1].p95);
    if (candidates.length === 0) {
      throw new Error(`No model meets SLA: p95=${sla.p95_ms}ms, p99=${sla.p99_ms}ms`);
    }
    return candidates[0][0];
  }

  // Estimate end-to-end time: routing overhead + model latency + output streaming
  estimateTotalLatency(model: string, query: string, estimatedOutputTokens: number = 500): number {
    const routingOverhead = 50; // ms
    const modelLatency = this.modelLatencies[model]?.p95 || 1000;
    const outputLatency = (estimatedOutputTokens / 100) * 20; // ~20ms per 100 tokens
    return routingOverhead + modelLatency + outputLatency;
  }
}
```
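The total-latency estimate is a plain sum, shown here standalone with the gpt-3.5-turbo p95 from the table above:

```typescript
// total = fixed routing overhead + model p95 + output streaming time (~20ms per 100 tokens)
function totalLatencyMs(modelP95Ms: number, estimatedOutputTokens: number = 500): number {
  const routingOverheadMs = 50;
  const outputMs = (estimatedOutputTokens / 100) * 20;
  return routingOverheadMs + modelP95Ms + outputMs;
}

// gpt-3.5-turbo with the default 500 output tokens: 50 + 800 + 100 = 950ms
const estimate = totalLatencyMs(800);
```

For a 1000ms p95 SLA, this is why only gpt-3.5-turbo and a local mistral-7b qualify in the table: the larger models blow the budget on model latency alone before output streaming is counted.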
A/B Testing Across Models
Measure model performance differences with live traffic.
```typescript
interface ABTestResult {
  model_a: string;
  model_b: string;
  winner: string | null;
  metric: 'latency' | 'cost' | 'quality';
  improvement_percent: number;
  confidence: number;
}

class ABTestRunner {
  private results: Map<string, number[]> = new Map();

  recordVariant(variant: string, metric: number): void {
    if (!this.results.has(variant)) {
      this.results.set(variant, []);
    }
    this.results.get(variant)!.push(metric);
  }

  computeStats(variant: string): { mean: number; stddev: number; count: number } {
    const values = this.results.get(variant) || [];
    const mean = values.length > 0
      ? values.reduce((a, b) => a + b, 0) / values.length
      : 0;
    const variance = values.length > 0
      ? values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length
      : 0;
    return {
      mean,
      stddev: Math.sqrt(variance),
      count: values.length
    };
  }

  // Assumes a lower metric is better (cost or latency)
  compareVariants(variantA: string, variantB: string, minSamples: number = 100): ABTestResult {
    const statsA = this.computeStats(variantA);
    const statsB = this.computeStats(variantB);

    // Need sufficient samples before declaring a winner
    if (statsA.count < minSamples || statsB.count < minSamples) {
      return {
        model_a: variantA,
        model_b: variantB,
        winner: null,
        metric: 'cost',
        improvement_percent: 0,
        confidence: 0
      };
    }

    // Positive improvement means B's mean is lower than A's
    const improvement = ((statsA.mean - statsB.mean) / statsA.mean) * 100;
    const winner = improvement > 0 ? variantB : variantA;

    // Pooled-stddev t-score
    const pooledStddev = Math.sqrt(
      (statsA.stddev ** 2 + statsB.stddev ** 2) / 2
    );
    const tScore = Math.abs(statsA.mean - statsB.mean) /
      (pooledStddev * Math.sqrt(1 / statsA.count + 1 / statsB.count));

    // Rough significance approximation (t > 1.96 ≈ 95% for large samples)
    const confidence = tScore > 1.96 ? 0.95 : Math.min(0.95, tScore / 1.96);

    return {
      model_a: variantA,
      model_b: variantB,
      winner: confidence > 0.8 ? winner : null,
      metric: 'cost',
      improvement_percent: Math.abs(improvement),
      confidence
    };
  }
}
```
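The significance math can be checked in isolation. This sketch mirrors the runner's pooled-stddev t-score (population stddev, equal-weight pooling); the sample arrays are made up:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation (divide by n, matching computeStats above)
function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((s, v) => s + (v - m) ** 2, 0) / xs.length);
}

// Pooled t-score between two samples, as used by compareVariants.
function tScore(a: number[], b: number[]): number {
  const pooled = Math.sqrt((stddev(a) ** 2 + stddev(b) ** 2) / 2);
  return Math.abs(mean(a) - mean(b)) / (pooled * Math.sqrt(1 / a.length + 1 / b.length));
}
```

With per-query costs [1, 2, 3, 4] vs [3, 4, 5, 6] the t-score is about 2.53, above the 1.96 cutoff, so the runner would report 95% confidence even at these tiny sample sizes; the minSamples guard exists precisely because this approximation is unreliable for small n.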
Routing Metrics and Monitoring
Track routing effectiveness and model performance.
```typescript
interface RoutingMetrics {
  total_queries: number;
  routes_by_model: Record<string, number>;
  avg_cost_per_query: number;
  cost_saved_vs_premium: number;
  escalation_rate: number;
}

class RoutingMetricsCollector {
  private routingDecisions: Array<{
    model: string;
    cost: number;
    latency_ms: number;
    timestamp: Date;
  }> = [];

  recordRouting(model: string, cost: number, latency: number): void {
    this.routingDecisions.push({
      model,
      cost,
      latency_ms: latency,
      timestamp: new Date()
    });
  }

  getMetrics(days: number = 1): RoutingMetrics {
    const cutoff = new Date();
    cutoff.setDate(cutoff.getDate() - days);
    const recentDecisions = this.routingDecisions.filter(
      d => d.timestamp >= cutoff
    );

    const routesByModel: Record<string, number> = {};
    let totalCost = 0;
    for (const decision of recentDecisions) {
      routesByModel[decision.model] = (routesByModel[decision.model] || 0) + 1;
      totalCost += decision.cost;
    }

    // Compare to always using the premium model ($0.15/query is a rough estimate)
    const premiumCostBaseline = recentDecisions.length * 0.15;
    const saved = Math.max(0, premiumCostBaseline - totalCost);

    // Slow responses serve as a crude proxy for escalations
    const escalationCount = recentDecisions.filter(
      d => d.latency_ms > 2000
    ).length;

    return {
      total_queries: recentDecisions.length,
      routes_by_model: routesByModel,
      avg_cost_per_query: recentDecisions.length > 0 ? totalCost / recentDecisions.length : 0,
      cost_saved_vs_premium: saved,
      escalation_rate: recentDecisions.length > 0 ? escalationCount / recentDecisions.length : 0
    };
  }
}
```
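The headline savings number reduces to a one-liner; the $0.15-per-query premium baseline below is the same rough estimate used in the collector:

```typescript
// Savings vs. routing every query to the premium tier.
// premiumCostPerQuery is a rough baseline estimate, not a measured value.
function savingsVsPremium(perQueryCosts: number[], premiumCostPerQuery: number = 0.15): number {
  const actual = perQueryCosts.reduce((a, b) => a + b, 0);
  return Math.max(0, perQueryCosts.length * premiumCostPerQuery - actual);
}
```

Three routed queries costing $0.01, $0.02, and $0.03 against a $0.45 premium baseline yield roughly $0.39 saved; the `Math.max(0, …)` clamp keeps the metric from going negative if routing ever costs more than the baseline.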
Checklist
- Classify query complexity with heuristics (length, code, math, questions)
- Use cheap model (GPT-3.5) as classifier for routing decisions
- Route simple queries to nano/micro tiers, complex to premium
- Detect intent (support, coding, creative) and route by capability
- Implement confidence-based escalation for low-confidence responses
- Respect latency SLA and select fastest model that meets budget
- A/B test model changes before full rollout
- Monitor escalation rate to detect broken classifications
- Calculate cost savings vs. baseline (always premium model)
- Export routing metrics to analytics for dashboard visibility
Conclusion
Intelligent routing is the single biggest lever for cost reduction without sacrificing quality. Classify once, route decisively, and measure relentlessly. The savings compound: cutting spend in half while keeping quality essentially flat is an achievable target, but only for teams that measure routing decisions as carefully as they measure model outputs.