LLM Production Architecture — A Complete Backend Design for AI-Powered Applications
Introduction
Deploying a proof-of-concept LLM app and running one at scale are entirely different problems. A production LLM backend requires request queuing, response caching, token budgeting, multi-tenant isolation, comprehensive observability, and graceful degradation. This guide covers the complete system architecture.
- Architecture Overview
- Request Pipeline Design
- Async vs Sync LLM Calls
- Cost and Latency Optimization Layers
- Multi-Tenant Data Isolation
- Observability Stack
- Scaling to 10K Concurrent Users
- Disaster Recovery
- Cost at Scale Estimation
- Checklist
- Conclusion
Architecture Overview
A production LLM system has distinct layers:
interface LLMArchitecture {
apiLayer: APIGateway;
requestQueue: RequestQueue;
llmGateway: LLMGateway;
cache: ResponseCache;
vectorDB: VectorDatabase;
monitoring: MonitoringStack;
rateLimiter: RateLimiter;
}
// Request flow: API → RateLimit → Queue → LLMGateway → Cache/Model → Response
class APIGateway {
async handleRequest(request: UserRequest): Promise<Response> {
// Input validation and parsing
if (!request.apiKey) {
return { status: 401, error: 'Unauthorized' };
}
// Rate limiting
const allowed = await this.checkRateLimit(request.userId);
if (!allowed) {
return { status: 429, error: 'Rate limited' };
}
// Enqueue request
const requestId = await this.queue.enqueue({
userId: request.userId,
prompt: request.prompt,
model: request.model,
priority: 'normal', // default tier; paid plans could map to 'high'
timestamp: Date.now()
});
// Return immediately for async (user polls for result)
// Or wait for completion if sync endpoint
return {
status: 202,
requestId,
statusUrl: `/status/${requestId}`
};
}
private checkRateLimit(userId: string): Promise<boolean> {
// Placeholder: a production limiter would enforce per-user quotas (e.g. in Redis)
return Promise.resolve(true);
}
private queue: RequestQueue = new RequestQueue();
}
interface UserRequest {
userId: string;
prompt: string;
model: string;
apiKey: string;
maxTokens?: number;
temperature?: number;
}
interface Response {
status: number;
error?: string;
requestId?: string;
statusUrl?: string;
result?: string;
}
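The `checkRateLimit` stub above always allows traffic. In practice it would typically be backed by a token bucket per user, with the counters kept in Redis so every gateway instance shares state. A minimal in-memory sketch; the capacity and refill rate below are illustrative assumptions, not recommendations:

```typescript
// Minimal in-memory token bucket per user. Production systems would keep
// these counters in Redis so all gateway instances share state.
class TokenBucketLimiter {
  private buckets = new Map<string, { tokens: number; lastRefill: number }>();

  constructor(
    private capacity: number = 10,      // max burst size (illustrative)
    private refillPerSecond: number = 1 // sustained requests/sec (illustrative)
  ) {}

  allow(userId: string, now: number = Date.now()): boolean {
    const bucket = this.buckets.get(userId) ?? { tokens: this.capacity, lastRefill: now };
    // Refill proportionally to elapsed time, capped at capacity
    const elapsedSec = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSecond);
    bucket.lastRefill = now;
    if (bucket.tokens < 1) {
      this.buckets.set(userId, bucket);
      return false; // out of tokens: caller should return 429
    }
    bucket.tokens -= 1;
    this.buckets.set(userId, bucket);
    return true;
  }
}
```

Taking an explicit `now` parameter keeps the bucket deterministic and testable; the default falls back to the wall clock.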
Request Pipeline Design
interface QueuedRequest {
requestId: string;
userId: string;
prompt: string;
model: string;
priority: 'high' | 'normal' | 'low';
timestamp: number;
status: 'queued' | 'processing' | 'completed' | 'failed';
result?: string;
error?: string;
}
class RequestQueue {
private highPriorityQueue: QueuedRequest[] = [];
private normalQueue: QueuedRequest[] = [];
private lowPriorityQueue: QueuedRequest[] = [];
// Index of every enqueued request, so status survives dequeue
private requests: Map<string, QueuedRequest> = new Map();
async enqueue(request: Omit<QueuedRequest, 'requestId' | 'status'>): Promise<string> {
const queuedRequest: QueuedRequest = {
requestId: this.generateId(),
...request,
status: 'queued'
};
// Route to appropriate queue
if (request.priority === 'high') {
this.highPriorityQueue.push(queuedRequest);
} else if (request.priority === 'low') {
this.lowPriorityQueue.push(queuedRequest);
} else {
this.normalQueue.push(queuedRequest);
}
this.requests.set(queuedRequest.requestId, queuedRequest);
return queuedRequest.requestId;
}
async dequeue(): Promise<QueuedRequest | null> {
// Strict priority: high > normal > low
if (this.highPriorityQueue.length > 0) {
return this.highPriorityQueue.shift()!;
}
if (this.normalQueue.length > 0) {
return this.normalQueue.shift()!;
}
if (this.lowPriorityQueue.length > 0) {
return this.lowPriorityQueue.shift()!;
}
return null;
}
async getStatus(requestId: string): Promise<QueuedRequest | null> {
// Look up the index, not the queues: a dequeued request is no longer
// in any queue, but its status must remain pollable
return this.requests.get(requestId) ?? null;
}
async getDepth(): Promise<number> {
return this.highPriorityQueue.length + this.normalQueue.length + this.lowPriorityQueue.length;
}
private generateId(): string {
return `req-${Date.now()}-${Math.random().toString(36).substring(7)}`;
}
}
class LLMGateway {
private requestWorkers: number = 4;
async processQueue(queue: RequestQueue, cache: ResponseCache): Promise<void> {
// Run multiple workers processing requests in parallel
const workers = Array(this.requestWorkers)
.fill(null)
.map(() => this.workerLoop(queue, cache));
await Promise.all(workers);
}
private async workerLoop(queue: RequestQueue, cache: ResponseCache): Promise<void> {
while (true) {
const request = await queue.dequeue();
if (!request) {
// Wait before checking again
await this.sleep(100);
continue;
}
try {
// Check cache first
const cachedResponse = await cache.get(request.prompt, request.model);
if (cachedResponse) {
request.result = cachedResponse;
request.status = 'completed';
continue;
}
// Call LLM
request.status = 'processing';
const response = await this.callLLM(request.prompt, request.model);
request.result = response;
request.status = 'completed';
// Cache response
await cache.set(request.prompt, request.model, response);
} catch (error) {
request.status = 'failed';
request.error = error instanceof Error ? error.message : String(error);
}
}
}
private async callLLM(prompt: string, model: string): Promise<string> {
// Call external LLM API or local model
return 'response';
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
}
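`callLLM` above is a stub. Real provider calls fail transiently (timeouts, 429 rate limits, 5xx), so the gateway would normally wrap them in retries with exponential backoff and jitter. A minimal sketch; the attempt count and delay constants are illustrative:

```typescript
// Retry wrapper for transient LLM API failures (timeouts, 429s, 5xx).
// Backoff parameters are illustrative; tune per provider rate limits.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 200
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff with jitter: ~200ms, ~400ms, ~800ms, ...
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // all attempts exhausted
}
```

Inside `workerLoop`, the model call would become `await withRetries(() => this.callLLM(request.prompt, request.model))`. A production version would also distinguish retryable errors (429/5xx) from permanent ones (400/401) and fail fast on the latter.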
Async vs Sync LLM Calls
Sync blocks the client until the model responds, which suits short interactive requests; async returns immediately with a request ID and suits long-running generations the client can poll for:
interface RequestOptions {
mode: 'sync' | 'async';
timeout: number; // ms, for sync mode
pollInterval: number; // ms, for async polling
}
class RequestHandler {
async handleRequest(
request: UserRequest,
options: RequestOptions
): Promise<Response> {
if (options.mode === 'sync') {
return this.syncRequest(request, options.timeout);
} else {
return this.asyncRequest(request);
}
}
private async syncRequest(request: UserRequest, timeout: number): Promise<Response> {
const requestId = await this.enqueueRequest(request);
const startTime = Date.now();
// Poll for result with timeout
while (Date.now() - startTime < timeout) {
const status = await this.checkStatus(requestId);
if (status.status === 'completed') {
return { status: 200, result: status.result };
}
if (status.status === 'failed') {
return { status: 500, error: status.error };
}
// Wait before next poll
await this.sleep(100);
}
// Timeout: hand the request ID back so the client can keep polling
return {
status: 504,
error: 'Request timeout. Poll status endpoint for result.',
requestId
};
}
private async asyncRequest(request: UserRequest): Promise<Response> {
const requestId = await this.enqueueRequest(request);
return {
status: 202,
requestId,
statusUrl: `/status/${requestId}`
};
}
private async enqueueRequest(request: UserRequest): Promise<string> {
return 'req-123';
}
private async checkStatus(requestId: string): Promise<QueuedRequest> {
return {} as QueuedRequest;
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
}
Cost and Latency Optimization Layers
interface OptimizationConfig {
cacheStrategy: 'none' | 'semantic' | 'exact';
compressionEnabled: boolean;
modelSelection: 'cost-optimized' | 'latency-optimized' | 'balanced';
batchingEnabled: boolean;
batchSize: number;
}
class CostAndLatencyOptimizer {
private config: OptimizationConfig;
private cache: Map<string, string> = new Map();
private requestBatch: QueuedRequest[] = [];
constructor(config: OptimizationConfig) {
this.config = config;
}
async optimizeRequest(request: QueuedRequest): Promise<string> {
// Layer 1: Cache check
let result = await this.checkCache(request.prompt, request.model);
if (result) {
return result;
}
// Layer 2: Request batching for cost savings
if (this.config.batchingEnabled) {
this.requestBatch.push(request);
if (this.requestBatch.length < this.config.batchSize) {
// Give concurrent requests a moment to fill the batch
// (a production system would use a flush timer instead)
await this.sleep(50);
}
const batch = this.requestBatch;
this.requestBatch = [];
const index = batch.indexOf(request);
if (index !== -1) {
const results = await this.batchProcess(batch);
// Return this request's own result, not the first in the batch
return results[index];
}
// Another worker already flushed this request's batch; fall through
}
// Layer 3: Prompt compression for cost reduction
let prompt = request.prompt;
if (this.config.compressionEnabled) {
prompt = await this.compressPrompt(prompt);
}
// Layer 4: Model selection
const selectedModel = this.selectModel(request.model);
// Layer 5: Generate response
result = await this.generateResponse(prompt, selectedModel);
// Cache for future requests (Map.set is synchronous)
this.cache.set(`${request.prompt}:${request.model}`, result);
return result;
}
private async checkCache(
prompt: string,
model: string
): Promise<string | null> {
if (this.config.cacheStrategy === 'none') {
return null;
}
const cacheKey = `${prompt}:${model}`;
const cached = this.cache.get(cacheKey);
if (cached) {
// An exact hit satisfies both the 'exact' and 'semantic' strategies
return cached;
}
if (this.config.cacheStrategy === 'semantic') {
// Find semantically similar cached prompts
const similarKey = await this.findSemanticallySimilar(prompt);
if (similarKey) {
return this.cache.get(similarKey) || null;
}
}
return null;
}
private async findSemanticallySimilar(prompt: string): Promise<string | null> {
// Use embeddings to find similar cached prompts
return null;
}
private async compressPrompt(prompt: string): Promise<string> {
// Toy compression: collapse adjacent duplicate words only.
// Real systems use learned prompt-compression techniques.
const words = prompt.split(/\s+/);
const compressed = words.filter((w, i) => {
return i === 0 || w !== words[i - 1];
});
return compressed.join(' ');
}
private selectModel(requestedModel: string): string {
if (this.config.modelSelection === 'cost-optimized') {
// Use smaller model
return requestedModel.replace('gpt-4', 'gpt-3.5');
}
if (this.config.modelSelection === 'latency-optimized') {
// Use faster local model if available
return 'local-7b';
}
return requestedModel;
}
private async batchProcess(batch: QueuedRequest[]): Promise<string[]> {
// Process multiple requests in single API call
const prompts = batch.map((r) => r.prompt);
const results = await this.generateResponses(prompts, batch[0].model);
return results;
}
private async generateResponse(prompt: string, model: string): Promise<string> {
// Call LLM
return 'response';
}
private async generateResponses(prompts: string[], model: string): Promise<string[]> {
// Batch API call
return [];
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
}
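`findSemanticallySimilar` is stubbed above. A semantic cache typically embeds each prompt and compares cosine similarity against the embeddings of cached prompts. A sketch assuming embeddings are computed elsewhere and passed in precomputed; the 0.92 threshold is an assumption to tune against real traffic:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the cache key of the most similar cached entry, if it clears
// the threshold; otherwise null (cache miss).
function findSimilarCached(
  queryEmbedding: number[],
  cachedEntries: Array<{ key: string; embedding: number[] }>,
  threshold: number = 0.92 // assumption: tune on observed traffic
): string | null {
  let bestKey: string | null = null;
  let bestScore = threshold;
  for (const entry of cachedEntries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      bestScore = score;
      bestKey = entry.key;
    }
  }
  return bestKey;
}
```

At scale, the linear scan would be replaced by the vector database already in the architecture, but the similarity-and-threshold logic is the same.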
Multi-Tenant Data Isolation
interface TenantContext {
tenantId: string;
userId: string;
organization: string;
dataResidency: 'us' | 'eu' | 'apac';
complianceRequirements: string[];
}
class MultiTenantIsolation {
async processRequest(
request: QueuedRequest,
context: TenantContext
): Promise<string> {
// Verify user has access to tenant
const hasAccess = await this.verifyTenantAccess(context.userId, context.tenantId);
if (!hasAccess) {
throw new Error('Access denied');
}
// Route to tenant-specific infrastructure
const tenantGateway = await this.getTenantGateway(context.tenantId);
// Encrypt sensitive data for transit into the tenant boundary
// (decryption happens inside the boundary before the model call)
const encryptedPrompt = await this.encryptForTenant(request.prompt, context.tenantId);
// Process request
const response = await tenantGateway.process(encryptedPrompt);
// Decrypt response
const decrypted = await this.decryptForTenant(response, context.tenantId);
// Log with audit trail
await this.auditLog(context, request);
return decrypted;
}
private async verifyTenantAccess(userId: string, tenantId: string): Promise<boolean> {
// Check user's roles in tenant
return true;
}
private async getTenantGateway(tenantId: string): Promise<LLMGateway> {
// Return dedicated gateway for tenant (isolation)
return new LLMGateway();
}
private async encryptForTenant(data: string, tenantId: string): Promise<string> {
// Use tenant-specific key
return data;
}
private async decryptForTenant(data: string, tenantId: string): Promise<string> {
return data;
}
private async auditLog(context: TenantContext, request: QueuedRequest): Promise<void> {
// Log: who accessed what, when
console.log(`[AUDIT] User ${context.userId} processed request in ${context.tenantId}`);
}
}
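The `encryptForTenant`/`decryptForTenant` stubs above pass data through unchanged. One common pattern derives a per-tenant key from a master secret via HKDF and encrypts with AES-256-GCM, so a payload sealed for one tenant cannot be opened with another tenant's key. A sketch using Node's crypto module; in production the master key would live in a KMS, and the salt/info labels here are illustrative:

```typescript
import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from 'node:crypto';

// Derive a per-tenant key from a master secret. A KMS would own the
// master key in production; 'tenant-salt' is an illustrative label.
function tenantKey(masterSecret: Buffer, tenantId: string): Buffer {
  return Buffer.from(hkdfSync('sha256', masterSecret, 'tenant-salt', tenantId, 32));
}

// AES-256-GCM encrypt: output is iv || authTag || ciphertext, base64.
function encryptForTenant(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit IV, standard for GCM
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

// Decrypt and authenticate; throws if the key or payload is wrong.
function decryptForTenant(payload: string, key: Buffer): string {
  const raw = Buffer.from(payload, 'base64');
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28); // 128-bit GCM auth tag
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

Because GCM is authenticated, a cross-tenant decryption attempt fails loudly instead of silently yielding garbage, which is exactly the isolation property the class above needs.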
Observability Stack
interface ObservabilityRequirements {
traces: TracingConfig;
metrics: MetricsConfig;
logs: LoggingConfig;
alerts: AlertConfig;
}
interface TracingConfig {
enabled: boolean;
samplingRate: number; // 0-1
backend: 'jaeger' | 'datadog' | 'honeycomb';
}
interface MetricsConfig {
enabled: boolean;
collectInterval: number; // ms
metrics: string[]; // latency, throughput, cost, errors
}
interface LoggingConfig {
enabled: boolean;
level: 'debug' | 'info' | 'warn' | 'error';
backend: 'cloudwatch' | 'stackdriver' | 'datadog';
}
interface AlertConfig {
rules: Array<{ metric: string; threshold: number; action: string }>;
}
class ObservabilityStack {
private tracer: any;
private metrics: MetricsCollector;
private logger: Logger;
async recordRequest(request: QueuedRequest): Promise<void> {
const span = this.tracer.startSpan('llm-request', {
attributes: {
userId: request.userId,
model: request.model,
promptLength: request.prompt.length
}
});
const startTime = Date.now();
try {
// Process request...
const duration = Date.now() - startTime;
// Record metrics
await this.metrics.record({
name: 'llm.request.latency',
value: duration,
tags: { model: request.model, status: 'success' }
});
span.setAttribute('status', 'success');
} catch (error) {
const err = error instanceof Error ? error : new Error(String(error));
await this.metrics.record({
name: 'llm.request.error',
value: 1,
tags: { error: err.message }
});
span.recordException(err);
} finally {
span.end();
}
}
async monitorHealthChecks(): Promise<HealthStatus> {
// Illustrative snapshot; a real implementation queries the metrics backend
return {
queueDepth: 150,
modelLatencyP95: 520, // ms
errorRate: 0.002,
cacheHitRate: 0.25,
costPerRequest: 0.015,
healthy: true
};
}
}
interface HealthStatus {
queueDepth: number;
modelLatencyP95: number;
errorRate: number;
cacheHitRate: number;
costPerRequest: number;
healthy: boolean;
}
class MetricsCollector {
async record(metric: { name: string; value: number; tags: Record<string, string> }): Promise<void> {
// Send to metrics backend
}
}
class Logger {
info(message: string, context: any): void {
console.log(`[INFO] ${message}`, context);
}
error(message: string, error: Error): void {
console.error(`[ERROR] ${message}`, error);
}
}
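The `modelLatencyP95` figure in the health check is typically computed over a sliding window of recent latency samples. Metrics backends use sketch structures (histograms, t-digests) to avoid storing every sample, but for small windows the exact nearest-rank percentile is straightforward:

```typescript
// Nearest-rank percentile over a window of latency samples (ms).
// Exact version for small windows; real backends approximate with
// histograms so they never store every sample.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('no samples');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value such that p% of samples are <= it
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

Alerting on p95 or p99 rather than the mean matters for LLM workloads, where a few long generations can hide behind a healthy-looking average.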
Scaling to 10K Concurrent Users
interface ScalingConfig {
maxConcurrentRequests: number;
maxQueueDepth: number;
autoScalingThreshold: number; // % queue utilization before scaling up
regionConfig: {
primary: string;
secondary: string;
};
}
class AutoScalingController {
private currentWorkers: number = 4;
constructor(private config: ScalingConfig) {}
async monitorAndScale(queue: RequestQueue): Promise<void> {
while (true) {
const queueDepth = await queue.getDepth();
const utilizationPercent = (queueDepth / this.config.maxQueueDepth) * 100;
if (utilizationPercent > this.config.autoScalingThreshold) {
// Scale up
const newWorkers = Math.ceil(this.currentWorkers * 1.5);
await this.scaleWorkers(newWorkers);
this.currentWorkers = newWorkers;
console.log(`Scaled to ${newWorkers} workers`);
} else if (utilizationPercent < 20) {
// Scale down
const newWorkers = Math.max(2, Math.floor(this.currentWorkers * 0.7));
await this.scaleWorkers(newWorkers);
this.currentWorkers = newWorkers;
}
await this.sleep(30000); // Check every 30 seconds
}
}
async scaleWorkers(targetWorkers: number): Promise<void> {
// Provision new containers/pods
console.log(`Scaling to ${targetWorkers} workers`);
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
}
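The scale-up/scale-down decision can be factored into a pure function, which makes the hysteresis easy to test in isolation. This sketch mirrors the controller's constants (1.5× up past the threshold, 0.7× down below 20% utilization, never fewer than 2 workers); those constants are the illustrative ones used above, not tuned values:

```typescript
// Pure scaling decision: returns the new worker count for a given
// queue utilization. The stable band between 20% and the scale-up
// threshold leaves worker count unchanged, preventing flapping.
function targetWorkers(
  current: number,
  utilizationPercent: number,
  scaleUpThreshold: number
): number {
  if (utilizationPercent > scaleUpThreshold) {
    return Math.ceil(current * 1.5); // scale up aggressively
  }
  if (utilizationPercent < 20) {
    return Math.max(2, Math.floor(current * 0.7)); // scale down gently, floor at 2
  }
  return current; // within the stable band: no change
}
```

`monitorAndScale` would then just be a loop that measures utilization, calls this function, and provisions the difference.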
class RegionalDistribution {
async routeRequest(request: QueuedRequest, regions: string[]): Promise<string> {
// Route to least-loaded region
const loads = await Promise.all(
regions.map((r) => this.getRegionLoad(r))
);
const leastLoadedRegion = regions[loads.indexOf(Math.min(...loads))];
const response = await this.processInRegion(request, leastLoadedRegion);
return response;
}
private async getRegionLoad(region: string): Promise<number> {
// Query region health/load metrics
return Math.random() * 100;
}
private async processInRegion(request: QueuedRequest, region: string): Promise<string> {
// Route to regional gateway
return 'response';
}
}
Disaster Recovery
interface DisasterRecoveryPlan {
rtoMinutes: number; // Recovery Time Objective
rpoMinutes: number; // Recovery Point Objective
failoverStrategy: 'active-passive' | 'active-active';
backupFrequency: number; // minutes
}
class DisasterRecoverySystem {
private primaryRegion: string = 'us-east-1';
private backupRegion: string = 'us-west-2';
private primaryHealthy = true;
async monitorAndFailover(): Promise<void> {
while (true) {
const primaryHealth = await this.checkHealth(this.primaryRegion);
if (!primaryHealth && this.primaryHealthy) {
// Detect failure
console.log('Primary region failure detected, initiating failover');
await this.failoverToBackup();
this.primaryHealthy = false;
} else if (primaryHealth && !this.primaryHealthy) {
// Primary recovered
console.log('Primary region recovered, failing back');
await this.failbackToPrimary();
this.primaryHealthy = true;
}
await this.sleep(10000); // Check every 10 seconds
}
}
private async failoverToBackup(): Promise<void> {
// 1. Update DNS to point to backup
// 2. Promote backup to primary
// 3. Redirect traffic
// 4. Verify data consistency
console.log(`Failover to ${this.backupRegion} complete`);
}
private async failbackToPrimary(): Promise<void> {
// 1. Sync data from backup to primary
// 2. Update DNS to point to primary
// 3. Resume normal operations
console.log(`Failback to ${this.primaryRegion} complete`);
}
private async checkHealth(region: string): Promise<boolean> {
// Health check: can we process requests?
return true;
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
}
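One detail the loop above glosses over: a single failed probe should not trigger failover, or a network blip will flap traffic between regions. Monitors usually require several consecutive contrary observations before flipping state (in both directions). A small debouncing tracker; the threshold of 3 is an illustrative assumption:

```typescript
// Debounces health probes: only flips state after N consecutive
// contrary observations, preventing failover flapping on one blip.
class HealthDebouncer {
  private consecutive = 0;

  constructor(
    private threshold: number = 3, // consecutive probes required (illustrative)
    public healthy: boolean = true
  ) {}

  // Feed in one probe result; returns the current debounced state.
  observe(probeHealthy: boolean): boolean {
    if (probeHealthy === this.healthy) {
      this.consecutive = 0; // agreement resets the counter
      return this.healthy;
    }
    this.consecutive++;
    if (this.consecutive >= this.threshold) {
      this.healthy = probeHealthy; // enough evidence: flip state
      this.consecutive = 0;
    }
    return this.healthy;
  }
}
```

`monitorAndFailover` would route its raw `checkHealth` result through `observe` and act only on the debounced value, so failover and failback each require sustained evidence.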
Cost at Scale Estimation
interface CostEstimate {
modelCostPerMillion: number;
infrastructureCostPerMonth: number;
operationsCostPerMonth: number;
totalCostPerMillionRequests: number;
}
class CostCalculator {
async estimateCosts(
monthlyRequests: number,
avgTokensPerRequest: number,
selectedModel: string
): Promise<CostEstimate> {
// Model costs (illustrative per-1K-token prices; check current provider pricing)
const modelPricing: Record<string, number> = {
'gpt-4': 0.03, // $0.03 per 1K input tokens
'gpt-3.5': 0.0005,
'claude': 0.008,
'llama-2': 0 // Open weights, self-hosted; compute shows up in infrastructure cost
};
const inputTokens = monthlyRequests * avgTokensPerRequest;
const modelCost = (inputTokens / 1000) * (modelPricing[selectedModel] ?? 0);
// Infrastructure costs (Little's law: concurrency = arrival rate x service time)
const avgRequestSeconds = 6; // rough assumption for one LLM completion
const requestsPerSecond = monthlyRequests / (30 * 24 * 60 * 60);
const concurrentRequests = requestsPerSecond * avgRequestSeconds;
const workerCost = Math.ceil(concurrentRequests / 1000) * 500; // $500 per 1K concurrent (illustrative)
const storageCost = 500; // Cache, logs, etc.
const infrastructureCost = workerCost + storageCost;
// Operations costs
const monitoringCost = 300;
const supportCost = 200;
const operationsCost = monitoringCost + supportCost;
return {
modelCostPerMillion: (modelCost / monthlyRequests) * 1000000,
infrastructureCostPerMonth: infrastructureCost,
operationsCostPerMonth: operationsCost,
totalCostPerMillionRequests:
((modelCost + infrastructureCost + operationsCost) / monthlyRequests) * 1000000
};
}
}
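As a sanity check on the model-cost arithmetic, the core formula is just requests × tokens × price-per-token. Using the illustrative prices from the table above (not current list prices), 1M monthly requests averaging 500 input tokens lands around $250/month on the cheap model and $15,000/month on the expensive one — which is why the model-selection layer earlier matters:

```typescript
// Monthly model spend: (requests * tokens / 1000) * price-per-1K-tokens.
function monthlyModelCost(
  monthlyRequests: number,
  avgTokensPerRequest: number,
  pricePer1KTokens: number
): number {
  const inputTokens = monthlyRequests * avgTokensPerRequest;
  return (inputTokens / 1000) * pricePer1KTokens;
}

// 1M requests x 500 tokens at $0.0005/1K tokens -> $250/month
const cheap = monthlyModelCost(1_000_000, 500, 0.0005);
// Same traffic at $0.03/1K tokens -> $15,000/month
const expensive = monthlyModelCost(1_000_000, 500, 0.03);
```

A 60x spread on identical traffic is the strongest argument for routing only the requests that need the large model to it.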
Checklist
- Design API gateway with request validation and authentication
- Implement request queuing with priority levels
- Deploy multiple LLM gateway workers for parallelism
- Set up response caching (exact or semantic)
- Implement async request handling with polling
- Add prompt compression for cost savings
- Configure model selection based on cost/latency tradeoffs
- Enable request batching for throughput optimization
- Implement multi-tenant data isolation and encryption
- Deploy comprehensive observability (traces, metrics, logs, alerts)
- Set up auto-scaling based on queue depth
- Configure regional distribution for global serving
- Implement active-passive disaster recovery
- Estimate costs for target scale (10K concurrent users)
Conclusion
Production LLM systems require far more than calling a model API. Request queuing, caching, and batching handle scale; the optimization layers keep cost and latency in check; tenant isolation protects customer data; comprehensive observability enables rapid debugging; auto-scaling absorbs traffic spikes; disaster recovery provides resilience. Together, this architecture can serve 10K+ concurrent users at reasonable cost and latency. Build this infrastructure before scaling to production.