LLM Production Architecture — A Complete Backend Design for AI-Powered Applications

Introduction

Deploying a proof-of-concept LLM app and running one at scale are entirely different problems. A production LLM backend requires request queuing, response caching, token budgeting, multi-tenant isolation, comprehensive observability, and graceful degradation. This guide walks through a complete system architecture.

Architecture Overview

A production LLM system has distinct layers:

interface LLMArchitecture {
  apiLayer: APIGateway;
  requestQueue: RequestQueue;
  llmGateway: LLMGateway;
  cache: ResponseCache;
  vectorDB: VectorDatabase;
  monitoring: MonitoringStack;
  rateLimiter: RateLimiter;
}

// Request flow: API → RateLimit → Queue → LLMGateway → Cache/Model → Response

class APIGateway {
  async handleRequest(request: UserRequest): Promise<Response> {
    // Input validation and parsing
    if (!request.apiKey) {
      return { status: 401, error: 'Unauthorized' };
    }

    // Rate limiting
    const allowed = await this.checkRateLimit(request.userId);
    if (!allowed) {
      return { status: 429, error: 'Rate limited' };
    }

    // Enqueue request
    const requestId = await this.queue.enqueue({
      userId: request.userId,
      prompt: request.prompt,
      model: request.model,
      timestamp: Date.now()
    });

    // Return immediately for async (user polls for result)
    // Or wait for completion if sync endpoint
    return {
      status: 202,
      requestId,
      statusUrl: `/status/${requestId}`
    };
  }

  private queue: RequestQueue = new RequestQueue();

  private checkRateLimit(userId: string): Promise<boolean> {
    // Stub: a real implementation consults a shared rate limiter store
    return Promise.resolve(true);
  }
}

interface UserRequest {
  userId: string;
  prompt: string;
  model: string;
  apiKey: string;
  maxTokens?: number;
  temperature?: number;
}

interface Response {
  status: number;
  error?: string;
  requestId?: string;
  statusUrl?: string;
  result?: string;
}
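The gateway's checkRateLimit above is a stub. One common approach is a token bucket; here is a minimal in-memory sketch (the capacity and refill rate are illustrative assumptions, and a production deployment would typically back the buckets with Redis so limits hold across gateway instances):

```typescript
// Minimal in-memory token bucket; parameters are assumptions, not a production limiter.
class TokenBucketLimiter {
  private buckets = new Map<string, { tokens: number; lastRefill: number }>();

  constructor(
    private capacity: number = 10,    // max burst size
    private refillPerSec: number = 1  // steady-state requests/sec
  ) {}

  allow(userId: string, now: number = Date.now()): boolean {
    const bucket = this.buckets.get(userId) ?? { tokens: this.capacity, lastRefill: now };

    // Refill proportionally to elapsed time, capped at capacity
    const elapsedSec = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.lastRefill = now;

    if (bucket.tokens < 1) {
      this.buckets.set(userId, bucket);
      return false; // caller should respond 429
    }
    bucket.tokens -= 1;
    this.buckets.set(userId, bucket);
    return true;
  }
}
```

A gateway would call `allow(request.userId)` inside checkRateLimit and map `false` to the 429 path shown above.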

Request Pipeline Design

interface QueuedRequest {
  requestId: string;
  userId: string;
  prompt: string;
  model: string;
  priority: 'high' | 'normal' | 'low';
  timestamp: number;
  status: 'queued' | 'processing' | 'completed' | 'failed';
  result?: string;
  error?: string;
}

class RequestQueue {
  private highPriorityQueue: QueuedRequest[] = [];
  private normalQueue: QueuedRequest[] = [];
  private lowPriorityQueue: QueuedRequest[] = [];
  // Dequeued requests are tracked here so getStatus() can still find them
  // after a worker picks them up (in production this lives in Redis or a DB,
  // with completed entries evicted after a TTL)
  private inFlight: Map<string, QueuedRequest> = new Map();

  async enqueue(request: Omit<QueuedRequest, 'requestId' | 'status'>): Promise<string> {
    const queuedRequest: QueuedRequest = {
      requestId: this.generateId(),
      ...request,
      status: 'queued'
    };

    // Route to appropriate queue
    if (request.priority === 'high') {
      this.highPriorityQueue.push(queuedRequest);
    } else if (request.priority === 'low') {
      this.lowPriorityQueue.push(queuedRequest);
    } else {
      this.normalQueue.push(queuedRequest);
    }

    return queuedRequest.requestId;
  }

  async dequeue(): Promise<QueuedRequest | null> {
    // Strict priority: high > normal > low
    const request =
      this.highPriorityQueue.shift() ??
      this.normalQueue.shift() ??
      this.lowPriorityQueue.shift() ??
      null;
    if (request) {
      this.inFlight.set(request.requestId, request);
    }
    return request;
  }

  async getDepth(): Promise<number> {
    return (
      this.highPriorityQueue.length +
      this.normalQueue.length +
      this.lowPriorityQueue.length
    );
  }

  async getStatus(requestId: string): Promise<QueuedRequest | null> {
    // Check in-flight/completed requests first, then the pending queues
    const tracked = this.inFlight.get(requestId);
    if (tracked) return tracked;

    const allQueues = [this.highPriorityQueue, this.normalQueue, this.lowPriorityQueue];
    for (const queue of allQueues) {
      const request = queue.find((r) => r.requestId === requestId);
      if (request) return request;
    }
    return null;
  }

  private generateId(): string {
    return `req-${Date.now()}-${Math.random().toString(36).substring(7)}`;
  }
}
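The dequeue policy above is strict priority, which is worth seeing in isolation because it has a known tradeoff: low-priority work can starve under sustained high-priority load. A standalone sketch of the same rule over plain arrays:

```typescript
// Standalone illustration of strict-priority dequeue (high > normal > low).
type Priority = 'high' | 'normal' | 'low';

function dequeueStrict(queues: Record<Priority, string[]>): string | null {
  // Always drain higher-priority queues first. Under sustained high-priority
  // load, low-priority items never run: the main tradeoff of this policy,
  // often mitigated with weighted or aging schemes.
  for (const p of ['high', 'normal', 'low'] as Priority[]) {
    if (queues[p].length > 0) return queues[p].shift()!;
  }
  return null;
}

const queues = { high: ['h1'], normal: ['n1', 'n2'], low: ['l1'] };
// Drains in order: h1, n1, n2, l1
```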

class LLMGateway {
  private requestWorkers: number = 4;

  async processQueue(queue: RequestQueue, cache: ResponseCache): Promise<void> {
    // Run multiple workers processing requests in parallel
    const workers = Array(this.requestWorkers)
      .fill(null)
      .map(() => this.workerLoop(queue, cache));

    await Promise.all(workers);
  }

  private async workerLoop(queue: RequestQueue, cache: ResponseCache): Promise<void> {
    while (true) {
      const request = await queue.dequeue();
      if (!request) {
        // Wait before checking again
        await this.sleep(100);
        continue;
      }

      try {
        // Check cache first
        const cachedResponse = await cache.get(request.prompt, request.model);
        if (cachedResponse) {
          request.result = cachedResponse;
          request.status = 'completed';
          continue;
        }

        // Call LLM
        request.status = 'processing';
        const response = await this.callLLM(request.prompt, request.model);

        request.result = response;
        request.status = 'completed';

        // Cache response
        await cache.set(request.prompt, request.model, response);
      } catch (error) {
        request.status = 'failed';
        request.error = error instanceof Error ? error.message : String(error);
      }
    }
  }

  private async callLLM(prompt: string, model: string): Promise<string> {
    // Call external LLM API or local model
    return 'response';
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

Async vs Sync LLM Calls

Sync mode suits short, interactive requests; async mode suits long-running generations, where holding a connection open risks gateway timeouts:

interface RequestOptions {
  mode: 'sync' | 'async';
  timeout: number; // ms, for sync mode
  pollInterval: number; // ms, for async polling
}

class RequestHandler {
  async handleRequest(
    request: UserRequest,
    options: RequestOptions
  ): Promise<Response> {
    if (options.mode === 'sync') {
      return this.syncRequest(request, options.timeout);
    } else {
      return this.asyncRequest(request);
    }
  }

  private async syncRequest(request: UserRequest, timeout: number): Promise<Response> {
    const requestId = await this.enqueueRequest(request);
    const startTime = Date.now();

    // Poll for result with timeout
    while (Date.now() - startTime < timeout) {
      const status = await this.checkStatus(requestId);

      if (status.status === 'completed') {
        return { status: 200, result: status.result };
      }

      if (status.status === 'failed') {
        return { status: 500, error: status.error };
      }

      // Wait before next poll
      await this.sleep(100);
    }

    // Timeout: return 504 with the request ID so the client can keep polling
    return {
      status: 504,
      error: 'Request timeout. Poll status endpoint for result.',
      requestId
    };
  }

  private async asyncRequest(request: UserRequest): Promise<Response> {
    const requestId = await this.enqueueRequest(request);
    return {
      status: 202,
      requestId,
      statusUrl: `/status/${requestId}`
    };
  }

  private async enqueueRequest(request: UserRequest): Promise<string> {
    return 'req-123';
  }

  private async checkStatus(requestId: string): Promise<QueuedRequest> {
    return {} as QueuedRequest;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

Cost and Latency Optimization Layers

interface OptimizationConfig {
  cacheStrategy: 'none' | 'semantic' | 'exact';
  compressionEnabled: boolean;
  modelSelection: 'cost-optimized' | 'latency-optimized' | 'balanced';
  batchingEnabled: boolean;
  batchSize: number;
}

class CostAndLatencyOptimizer {
  private config: OptimizationConfig;
  private cache: Map<string, string> = new Map();
  private requestBatch: QueuedRequest[] = [];

  constructor(config: OptimizationConfig) {
    this.config = config;
  }

  async optimizeRequest(request: QueuedRequest): Promise<string> {
    // Layer 1: Cache check
    let result = await this.checkCache(request.prompt, request.model);
    if (result) {
      return result;
    }

    // Layer 2: Request batching for cost savings. Simplified: a production
    // implementation resolves each request with its own result (e.g. via
    // per-request promises) and flushes partial batches on a timer.
    if (this.config.batchingEnabled) {
      this.requestBatch.push(request);

      if (this.requestBatch.length >= this.config.batchSize) {
        const batch = this.requestBatch;
        this.requestBatch = []; // reset before the async call to avoid races
        const results = await this.batchProcess(batch);
        return results[batch.length - 1]; // this request was pushed last
      }

      // Batch not yet full: pause briefly, then fall through to single-request handling
      await this.sleep(50);
    }

    // Layer 3: Prompt compression for cost reduction
    let prompt = request.prompt;
    if (this.config.compressionEnabled) {
      prompt = await this.compressPrompt(prompt);
    }

    // Layer 4: Model selection
    const selectedModel = this.selectModel(request.model);

    // Layer 5: Generate response
    result = await this.generateResponse(prompt, selectedModel);

    // Cache for future requests (Map.set is synchronous, no await needed)
    this.cache.set(`${request.prompt}:${request.model}`, result);

    return result;
  }

  private async checkCache(
    prompt: string,
    model: string
  ): Promise<string | null> {
    if (this.config.cacheStrategy === 'none') {
      return null;
    }

    const cacheKey = `${prompt}:${model}`;
    const cached = this.cache.get(cacheKey);

    if (cached) {
      return cached; // an exact hit satisfies either strategy
    }

    if (this.config.cacheStrategy === 'semantic') {
      // Find semantically similar cached prompts
      const similarKey = await this.findSemanticallySimilar(prompt);
      if (similarKey) {
        return this.cache.get(similarKey) || null;
      }
    }

    return null;
  }

  private async findSemanticallySimilar(prompt: string): Promise<string | null> {
    // Use embeddings to find similar cached prompts
    return null;
  }

  private async compressPrompt(prompt: string): Promise<string> {
    // Remove redundant words, compress examples
    const words = prompt.split(/\s+/);
    const compressed = words.filter((w, i) => {
      return i === 0 || w !== words[i - 1];
    });

    return compressed.join(' ');
  }

  private selectModel(requestedModel: string): string {
    if (this.config.modelSelection === 'cost-optimized') {
      // Use smaller model
      return requestedModel.replace('gpt-4', 'gpt-3.5');
    }
    if (this.config.modelSelection === 'latency-optimized') {
      // Use faster local model if available
      return 'local-7b';
    }
    return requestedModel;
  }

  private async batchProcess(batch: QueuedRequest[]): Promise<string[]> {
    // Process multiple requests in single API call
    const prompts = batch.map((r) => r.prompt);
    const results = await this.generateResponses(prompts, batch[0].model);
    return results;
  }

  private async generateResponse(prompt: string, model: string): Promise<string> {
    // Call LLM
    return 'response';
  }

  private async generateResponses(prompts: string[], model: string): Promise<string[]> {
    // Batch API call
    return [];
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

}
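findSemanticallySimilar is left unimplemented above. A hedged sketch of the idea, using cosine similarity over simple term-frequency vectors as a stand-in for real embeddings (a production system would embed prompts with a model and query a vector index; the 0.8 threshold is an assumption to tune against your false-positive tolerance):

```typescript
// Toy semantic lookup: cosine similarity over term-frequency vectors.
// Stand-in for embedding-based search; threshold of 0.8 is an assumption.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    tf.set(word, (tf.get(word) ?? 0) + 1);
  }
  return tf;
}

function cosineSim(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, normA = 0, normB = 0;
  for (const [word, count] of a) {
    dot += count * (b.get(word) ?? 0);
    normA += count * count;
  }
  for (const count of b.values()) normB += count * count;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

function findSimilarCached(
  prompt: string,
  cachedPrompts: string[],
  threshold = 0.8
): string | null {
  const target = termFreq(prompt);
  let best: { key: string; score: number } | null = null;
  for (const key of cachedPrompts) {
    const score = cosineSim(target, termFreq(key));
    if (score >= threshold && (!best || score > best.score)) {
      best = { key, score };
    }
  }
  return best?.key ?? null;
}
```

The same shape carries over to embeddings: replace termFreq with an embedding call and the linear scan with a vector-database query.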

Multi-Tenant Data Isolation

interface TenantContext {
  tenantId: string;
  userId: string;
  organization: string;
  dataResidency: 'us' | 'eu' | 'apac';
  complianceRequirements: string[];
}

class MultiTenantIsolation {
  async processRequest(
    request: QueuedRequest,
    context: TenantContext
  ): Promise<string> {
    // Verify user has access to tenant
    const hasAccess = await this.verifyTenantAccess(context.userId, context.tenantId);
    if (!hasAccess) {
      throw new Error('Access denied');
    }

    // Route to tenant-specific infrastructure
    const tenantGateway = await this.getTenantGateway(context.tenantId);

    // Encrypt sensitive data
    const encryptedPrompt = await this.encryptForTenant(request.prompt, context.tenantId);

    // Process request
    const response = await tenantGateway.process(encryptedPrompt);

    // Decrypt response
    const decrypted = await this.decryptForTenant(response, context.tenantId);

    // Log with audit trail
    await this.auditLog(context, request);

    return decrypted;
  }

  private async verifyTenantAccess(userId: string, tenantId: string): Promise<boolean> {
    // Check user's roles in tenant
    return true;
  }

  private async getTenantGateway(tenantId: string): Promise<LLMGateway> {
    // Return dedicated gateway for tenant (isolation)
    return new LLMGateway();
  }

  private async encryptForTenant(data: string, tenantId: string): Promise<string> {
    // Use tenant-specific key
    return data;
  }

  private async decryptForTenant(data: string, tenantId: string): Promise<string> {
    return data;
  }

  private async auditLog(context: TenantContext, request: QueuedRequest): Promise<void> {
    // Log: who accessed what, when
    console.log(`[AUDIT] User ${context.userId} processed request in ${context.tenantId}`);
  }
}
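encryptForTenant and decryptForTenant are stubs above. One hedged sketch using Node's built-in crypto with AES-256-GCM and a per-tenant key (deriving the key locally via scrypt is an assumption for the example; a production system would fetch tenant keys from a KMS and rotate them):

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from 'node:crypto';

// Per-tenant AES-256-GCM: sketch only. Real deployments pull tenant keys
// from a KMS rather than deriving them locally.
function tenantKey(tenantSecret: string, tenantId: string): Buffer {
  // scrypt with the tenant ID as salt yields a stable 32-byte key per tenant
  return scryptSync(tenantSecret, tenantId, 32);
}

function encryptForTenant(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // standard GCM nonce size
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  // Pack iv + auth tag + ciphertext so decryption is self-describing
  return Buffer.concat([iv, cipher.getAuthTag(), encrypted]).toString('base64');
}

function decryptForTenant(payload: string, key: Buffer): string {
  const raw = Buffer.from(payload, 'base64');
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28); // GCM auth tag is 16 bytes
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag); // tampered ciphertext makes final() throw
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

GCM's authentication tag means a prompt encrypted under one tenant's key cannot be silently decrypted, or modified, under another's.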

Observability Stack

interface ObservabilityRequirements {
  traces: TracingConfig;
  metrics: MetricsConfig;
  logs: LoggingConfig;
  alerts: AlertConfig;
}

interface TracingConfig {
  enabled: boolean;
  samplingRate: number; // 0-1
  backend: 'jaeger' | 'datadog' | 'honeycomb';
}

interface MetricsConfig {
  enabled: boolean;
  collectInterval: number; // ms
  metrics: string[]; // latency, throughput, cost, errors
}

interface LoggingConfig {
  enabled: boolean;
  level: 'debug' | 'info' | 'warn' | 'error';
  backend: 'cloudwatch' | 'stackdriver' | 'datadog';
}

interface AlertConfig {
  rules: Array<{ metric: string; threshold: number; action: string }>;
}

class ObservabilityStack {
  private tracer: any;
  private metrics: MetricsCollector;
  private logger: Logger;

  async recordRequest(request: QueuedRequest): Promise<void> {
    const span = this.tracer.startSpan('llm-request', {
      attributes: {
        userId: request.userId,
        model: request.model,
        promptLength: request.prompt.length
      }
    });

    const startTime = Date.now();

    try {
      // Process request...

      const duration = Date.now() - startTime;

      // Record metrics
      await this.metrics.record({
        name: 'llm.request.latency',
        value: duration,
        tags: { model: request.model, status: 'success' }
      });

      span.setAttribute('status', 'success');
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      await this.metrics.record({
        name: 'llm.request.error',
        value: 1,
        tags: { error: err.message }
      });

      span.recordException(err);
    } finally {
      span.end();
    }
  }

  async monitorHealthChecks(): Promise<HealthStatus> {
    return {
      queueDepth: 150,
      modelLatencyP95: 520, // ms
      errorRate: 0.002,
      cacheHitRate: 0.25,
      costPerRequest: 0.015,
      healthy: true
    };
  }
}

interface HealthStatus {
  queueDepth: number;
  modelLatencyP95: number;
  errorRate: number;
  cacheHitRate: number;
  costPerRequest: number;
  healthy: boolean;
}

class MetricsCollector {
  async record(metric: { name: string; value: number; tags: Record<string, string> }): Promise<void> {
    // Send to metrics backend
  }
}

class Logger {
  info(message: string, context: any): void {
    console.log(`[INFO] ${message}`, context);
  }

  error(message: string, error: Error): void {
    console.error(`[ERROR] ${message}`, error);
  }
}
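The AlertConfig rules defined earlier are never evaluated in the sketch above. A minimal evaluator that checks a metrics snapshot against threshold rules (the greater-than semantics, i.e. higher is worse, are an assumption; rules on rates like cacheHitRate would invert it):

```typescript
interface AlertRule { metric: string; threshold: number; action: string }

// Evaluate a metrics snapshot against alert rules; a rule fires when its
// metric exceeds the threshold (assumed semantics: higher is worse).
function evaluateAlerts(
  snapshot: Record<string, number>,
  rules: AlertRule[]
): string[] {
  const actions: string[] = [];
  for (const rule of rules) {
    const value = snapshot[rule.metric];
    if (value !== undefined && value > rule.threshold) {
      actions.push(rule.action);
    }
  }
  return actions;
}

// Example: page on deep queues, open a ticket on high error rates
const fired = evaluateAlerts(
  { queueDepth: 150, errorRate: 0.002 },
  [
    { metric: 'queueDepth', threshold: 100, action: 'page-oncall' },
    { metric: 'errorRate', threshold: 0.01, action: 'open-ticket' }
  ]
);
// fired → ['page-oncall']
```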

Scaling to 10K Concurrent Users

interface ScalingConfig {
  maxConcurrentRequests: number;
  maxQueueDepth: number;
  autoScalingThreshold: number; // % queue utilization before scaling up
  regionConfig: {
    primary: string;
    secondary: string;
  };
}

class AutoScalingController {
  private config: ScalingConfig;
  private currentWorkers: number = 4;

  async monitorAndScale(queue: RequestQueue): Promise<void> {
    while (true) {
      const queueDepth = await queue.getDepth();
      const utilizationPercent = (queueDepth / this.config.maxQueueDepth) * 100;

      if (utilizationPercent > this.config.autoScalingThreshold) {
        // Scale up
        const newWorkers = Math.ceil(this.currentWorkers * 1.5);
        await this.scaleWorkers(newWorkers);
        this.currentWorkers = newWorkers;
        console.log(`Scaled to ${newWorkers} workers`);
      } else if (utilizationPercent < 20) {
        // Scale down
        const newWorkers = Math.max(2, Math.floor(this.currentWorkers * 0.7));
        await this.scaleWorkers(newWorkers);
        this.currentWorkers = newWorkers;
      }

      await this.sleep(30000); // Check every 30 seconds
    }
  }

  async scaleWorkers(targetWorkers: number): Promise<void> {
    // Provision new containers/pods
    console.log(`Scaling to ${targetWorkers} workers`);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
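The scale-up/down arithmetic in monitorAndScale can be isolated as a pure function, which makes the hysteresis band easy to test: scale up above the threshold, down below 20%, hold in between. The 1.5x/0.7x factors and the floor of 2 workers mirror the controller above:

```typescript
// Pure scaling decision mirroring AutoScalingController: 1.5x up above the
// threshold, 0.7x down below 20% utilization, hold otherwise, never below 2.
function nextWorkerCount(
  current: number,
  utilizationPercent: number,
  scaleUpThreshold: number
): number {
  if (utilizationPercent > scaleUpThreshold) {
    return Math.ceil(current * 1.5);
  }
  if (utilizationPercent < 20) {
    return Math.max(2, Math.floor(current * 0.7));
  }
  return current; // within the hysteresis band: no change
}

// Example progression at sustained 90% utilization with an 80% threshold:
// 4 → 6 → 9 → 14 workers
```

The gap between the 20% floor and the scale-up threshold prevents flapping: small load oscillations inside the band cause no churn.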

class RegionalDistribution {
  async routeRequest(request: QueuedRequest, regions: string[]): Promise<string> {
    // Route to least-loaded region
    const loads = await Promise.all(
      regions.map((r) => this.getRegionLoad(r))
    );

    const leastLoadedRegion = regions[loads.indexOf(Math.min(...loads))];
    const response = await this.processInRegion(request, leastLoadedRegion);

    return response;
  }

  private async getRegionLoad(region: string): Promise<number> {
    // Query region health/load metrics
    return Math.random() * 100;
  }

  private async processInRegion(request: QueuedRequest, region: string): Promise<string> {
    // Route to regional gateway
    return 'response';
  }
}

Disaster Recovery

interface DisasterRecoveryPlan {
  rtoMinutes: number; // Recovery Time Objective
  rpoMinutes: number; // Recovery Point Objective
  failoverStrategy: 'active-passive' | 'active-active';
  backupFrequency: number; // minutes
}

class DisasterRecoverySystem {
  private primaryRegion: string = 'us-east-1';
  private backupRegion: string = 'us-west-2';
  private primaryHealthy = true;

  async monitorAndFailover(): Promise<void> {
    while (true) {
      const primaryHealth = await this.checkHealth(this.primaryRegion);

      if (!primaryHealth && this.primaryHealthy) {
        // Detect failure
        console.log('Primary region failure detected, initiating failover');
        await this.failoverToBackup();
        this.primaryHealthy = false;
      } else if (primaryHealth && !this.primaryHealthy) {
        // Primary recovered
        console.log('Primary region recovered, failing back');
        await this.failbackToPrimary();
        this.primaryHealthy = true;
      }

      await this.sleep(10000); // Check every 10 seconds
    }
  }

  private async failoverToBackup(): Promise<void> {
    // 1. Update DNS to point to backup
    // 2. Promote backup to primary
    // 3. Redirect traffic
    // 4. Verify data consistency
    console.log(`Failover to ${this.backupRegion} complete`);
  }

  private async failbackToPrimary(): Promise<void> {
    // 1. Sync data from backup to primary
    // 2. Update DNS to point to primary
    // 3. Resume normal operations
    console.log(`Failback to ${this.primaryRegion} complete`);
  }

  private async checkHealth(region: string): Promise<boolean> {
    // Health check: can we process requests?
    return true;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

Cost at Scale Estimation

interface CostEstimate {
  modelCostPerMillion: number;
  infrastructureCostPerMonth: number;
  operationsCostPerMonth: number;
  totalCostPerMillionRequests: number;
}

class CostCalculator {
  async estimateCosts(
    monthlyRequests: number,
    avgTokensPerRequest: number,
    selectedModel: string
  ): Promise<CostEstimate> {
    // Model costs
    // Illustrative per-1K-input-token prices only; check current provider rates
    const modelPricing: Record<string, number> = {
      'gpt-4': 0.03,
      'gpt-3.5': 0.0005,
      'claude': 0.008,
      'llama-2': 0 // Open weights, self-hosted; compute shows up under infrastructure
    };

    const inputTokens = monthlyRequests * avgTokensPerRequest;
    const modelCost = (inputTokens / 1000) * modelPricing[selectedModel];

    // Infrastructure costs (Little's law: concurrency = arrival rate x duration)
    const requestsPerMinute = monthlyRequests / (30 * 24 * 60);
    const concurrentUsers = requestsPerMinute * 10; // assuming 10 min avg session
    const workerCost = Math.ceil(concurrentUsers / 1000) * 500; // $500 per 1K concurrent
    const storageCost = 500; // Cache, logs, etc.
    const infrastructureCost = workerCost + storageCost;

    // Operations costs
    const monitoringCost = 300;
    const supportCost = 200;
    const operationsCost = monitoringCost + supportCost;

    return {
      modelCostPerMillion: (modelCost / monthlyRequests) * 1000000,
      infrastructureCostPerMonth: infrastructureCost,
      operationsCostPerMonth: operationsCost,
      totalCostPerMillionRequests:
        ((modelCost + infrastructureCost + operationsCost) / monthlyRequests) * 1000000
    };
  }
}
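Plugging representative numbers through the same arithmetic, reimplemented here as a standalone sketch: 1M requests/month at 500 input tokens each on the illustrative gpt-3.5 rate. All dollar figures are the assumed values from the estimator above, not real quotes:

```typescript
// Standalone worked example of the cost estimate (illustrative prices only).
const monthlyRequests = 1_000_000;
const avgTokensPerRequest = 500;
const pricePer1KTokens = 0.0005; // assumed gpt-3.5 input rate

const inputTokens = monthlyRequests * avgTokensPerRequest;  // 500M tokens
const modelCost = (inputTokens / 1000) * pricePer1KTokens;  // ≈ $250

const infrastructureCost = 500 + 500; // workers + storage (assumed flat tiers)
const operationsCost = 300 + 200;     // monitoring + support (assumed)

const totalPerMillionRequests =
  ((modelCost + infrastructureCost + operationsCost) / monthlyRequests) * 1_000_000;
// modelCost ≈ $250, totalPerMillionRequests ≈ $1,750
```

Note that at this volume the fixed infrastructure and operations lines dominate the model bill; the balance flips as request volume grows.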

Checklist

  • Design API gateway with request validation and authentication
  • Implement request queuing with priority levels
  • Deploy multiple LLM gateway workers for parallelism
  • Set up response caching (exact or semantic)
  • Implement async request handling with polling
  • Add prompt compression for cost savings
  • Configure model selection based on cost/latency tradeoffs
  • Enable request batching for throughput optimization
  • Implement multi-tenant data isolation and encryption
  • Deploy comprehensive observability (traces, metrics, logs, alerts)
  • Set up auto-scaling based on queue depth
  • Configure regional distribution for global serving
  • Implement active-passive disaster recovery
  • Estimate costs for target scale (10K concurrent users)

Conclusion

Production LLM systems require far more than calling a model API. Request queuing, caching, and batching handle scale; cost and latency optimization layers keep the system efficient; multi-tenant isolation protects customer data; comprehensive observability enables rapid debugging; auto-scaling absorbs traffic spikes; and disaster recovery provides resilience. Together, this architecture can serve 10K+ concurrent users at reasonable cost and latency. Build this infrastructure before scaling to production.