A/B Testing LLM Models and Prompts — Replacing Guesswork With Data

Introduction

You've written a new prompt that you think is better. Or you want to try a faster, cheaper model. How do you know whether the change actually improves the user experience without breaking production?

This post covers A/B testing architectures for LLM model and prompt changes, with statistical rigor and safety nets.

Model A/B Testing Architecture

Split traffic between two model versions and measure quality metrics:

interface ExperimentConfig {
  experimentId: string;
  controlModelName: string;
  treatmentModelName: string;
  trafficSplitPercent: number; // e.g., 50 for 50/50 split
  minimumSampleSize: number;
  statisticalSignificanceThreshold: number; // e.g., 0.05
}

interface ExperimentResult {
  userId: string;
  experimentId: string;
  variant: "control" | "treatment";
  modelName: string;
  responseQuality: number; // 0-1
  latencyMs: number;
  costUSD: number;
  timestamp: Date;
}

class ABTestingOrchestrator {
  async assignVariant(
    userId: string,
    experimentId: string
  ): Promise<"control" | "treatment"> {
    // Deterministic assignment: the same user always gets the same variant.
    // Salt the hash with the experiment ID so assignments stay independent
    // across concurrent experiments.
    const hash = this.hashUserId(`${experimentId}:${userId}`);
    const assignment = hash % 100;

    // Read from feature flag system or database
    const experiment = await this.getExperiment(experimentId);

    if (assignment < experiment.trafficSplitPercent) {
      return "treatment";
    }
    return "control";
  }

  async selectModel(
    userId: string,
    experimentId: string,
    defaultModel: string
  ): Promise<string> {
    const variant = await this.assignVariant(userId, experimentId);
    const experiment = await this.getExperiment(experimentId);

    if (variant === "treatment") {
      return experiment.treatmentModelName;
    }
    return experiment.controlModelName;
  }

  async recordResult(result: ExperimentResult): Promise<void> {
    // Store in database for analysis
    await this.persistResult(result);

    // Check for early stopping (quality regression)
    const canary = await this.checkCanary(result.experimentId);
    if (canary.shouldStop) {
      await this.stopExperiment(result.experimentId);
      console.error(`Experiment stopped: ${canary.reason}`);
    }
  }

  private hashUserId(userId: string): number {
    let hash = 0;
    for (let i = 0; i < userId.length; i++) {
      const char = userId.charCodeAt(i);
      hash = (hash << 5) - hash + char;
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }

  private async getExperiment(
    experimentId: string
  ): Promise<ExperimentConfig> {
    // Fetch from cache or database
    return {
      experimentId,
      controlModelName: "claude-3-5-sonnet-20241022",
      treatmentModelName: "claude-3-5-haiku-20241022",
      trafficSplitPercent: 50,
      minimumSampleSize: 1000,
      statisticalSignificanceThreshold: 0.05,
    };
  }

  private async persistResult(result: ExperimentResult): Promise<void> {
    // Store in database
  }

  private async checkCanary(experimentId: string): Promise<{
    shouldStop: boolean;
    reason: string;
  }> {
    // Early stopping if quality drops below threshold
    return { shouldStop: false, reason: "" };
  }

  private async stopExperiment(experimentId: string): Promise<void> {
    // Mark experiment as stopped in feature flag system
  }
}

export { ABTestingOrchestrator, ExperimentConfig, ExperimentResult };
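
The heart of the orchestrator is deterministic bucketing: hash a stable key into a 0-99 bucket and compare it against the traffic split. A minimal standalone sketch of that idea, salting the hash with the experiment ID so assignments stay independent across concurrent experiments:

```typescript
// Deterministic bucketing: the same (userId, experimentId) pair always
// lands in the same bucket, so a user never flips between variants.
function hashString(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash << 5) - hash + input.charCodeAt(i);
    hash = hash & hash; // keep it a 32-bit integer
  }
  return Math.abs(hash);
}

function assignVariant(
  userId: string,
  experimentId: string,
  trafficSplitPercent: number
): "control" | "treatment" {
  const bucket = hashString(`${experimentId}:${userId}`) % 100;
  return bucket < trafficSplitPercent ? "treatment" : "control";
}
```

Because assignment is a pure function of the inputs, no assignment table is needed: any service can recompute a user's variant locally and agree with every other service.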

Shadow Mode: Run Both Models, Show One

In shadow mode, call both the control and treatment model, but only show the user the control response. Use treatment responses for evaluation:

interface ShadowModeResult {
  controlResponse: string;
  treatmentResponse: string;
  comparisonScore: number;
  recommendTreatment: boolean;
}

class ShadowModeEvaluator {
  async runInShadowMode(
    prompt: string,
    controlModel: string,
    treatmentModel: string,
    userId: string
  ): Promise<ShadowModeResult> {
    const startTime = Date.now();

    // Run both models in parallel to minimize latency impact.
    // Note: in production, catch treatment errors separately so a failing
    // treatment model never blocks the control response shown to the user.
    const [controlResult, treatmentResult] = await Promise.all([
      this.callModel(prompt, controlModel, userId),
      this.callModel(prompt, treatmentModel, userId),
    ]);

    const latencyMs = Date.now() - startTime;

    // Compare responses without user seeing it
    const comparison = await this.compareResponses(
      prompt,
      controlResult.response,
      treatmentResult.response
    );

    // Log for analysis
    await this.logShadowComparison({
      userId,
      controlModel,
      treatmentModel,
      controlLatency: controlResult.latencyMs,
      treatmentLatency: treatmentResult.latencyMs,
      comparisonScore: comparison.score,
      totalLatency: latencyMs,
    });

    return {
      controlResponse: controlResult.response,
      treatmentResponse: treatmentResult.response,
      comparisonScore: comparison.score,
      recommendTreatment: comparison.score > 0.7,
    };
  }

  private async callModel(
    prompt: string,
    model: string,
    userId: string
  ): Promise<{
    response: string;
    latencyMs: number;
    costUSD: number;
  }> {
    const startTime = Date.now();
    // Call the actual LLM API
    return {
      response: "sample response",
      latencyMs: Date.now() - startTime,
      costUSD: 0.01,
    };
  }

  private async compareResponses(
    prompt: string,
    controlResponse: string,
    treatmentResponse: string
  ): Promise<{ score: number; reasoning: string }> {
    // Use LLM-as-judge to compare quality
    return {
      score: 0.85,
      reasoning: "Treatment response is more detailed",
    };
  }

  private async logShadowComparison(
    data: Record<string, unknown>
  ): Promise<void> {
    // Persist to database for analysis
  }
}

export { ShadowModeEvaluator, ShadowModeResult };
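
Once shadow comparisons accumulate, a useful aggregate is the treatment win rate: the fraction of prompts where the judge preferred the treatment response. A minimal sketch; the 0.5 win threshold is an assumption for illustration, tune it to however your judge scores are calibrated:

```typescript
interface ShadowComparison {
  comparisonScore: number; // 0-1, judge's preference for treatment
}

// Fraction of shadow comparisons where the treatment response "won",
// i.e. the judge scored it above the given threshold.
function treatmentWinRate(
  comparisons: ShadowComparison[],
  winThreshold = 0.5
): number {
  if (comparisons.length === 0) return 0;
  const wins = comparisons.filter(
    (c) => c.comparisonScore > winThreshold
  ).length;
  return wins / comparisons.length;
}
```

A win rate persistently above 0.5 is a signal worth graduating to a live A/B test; shadow mode alone cannot measure user-facing metrics like retention or task completion.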

Statistical Significance Testing

Know when your experiment results are statistically valid:

interface ExperimentAnalysis {
  controlQualityScore: number;
  treatmentQualityScore: number;
  qualityLift: number; // percentage improvement
  sampleSize: number;
  pValue: number;
  isSignificant: boolean;
  confidenceIntervalMin: number;
  confidenceIntervalMax: number;
  recommendation: "promote" | "reject" | "continue_testing";
}

class StatisticalAnalyzer {
  async analyzeExperiment(
    experimentId: string,
    minSampleSize: number = 1000,
    significanceThreshold: number = 0.05
  ): Promise<ExperimentAnalysis> {
    const results = await this.getExperimentResults(experimentId);

    const control = results.filter((r) => r.variant === "control");
    const treatment = results.filter((r) => r.variant === "treatment");

    if (
      control.length < minSampleSize ||
      treatment.length < minSampleSize
    ) {
      return {
        controlQualityScore: 0,
        treatmentQualityScore: 0,
        qualityLift: 0,
        sampleSize: Math.min(control.length, treatment.length),
        pValue: 1,
        isSignificant: false,
        confidenceIntervalMin: 0,
        confidenceIntervalMax: 0,
        recommendation: "continue_testing",
      };
    }

    const controlScore =
      control.reduce((sum, r) => sum + r.responseQuality, 0) /
      control.length;
    const treatmentScore =
      treatment.reduce((sum, r) => sum + r.responseQuality, 0) /
      treatment.length;

    const pValue = this.calculateTTest(control, treatment);
    const isSignificant = pValue < significanceThreshold;

    const qualityLift = ((treatmentScore - controlScore) / controlScore) * 100;

    const [confMin, confMax] = this.calculateConfidenceInterval(
      treatmentScore,
      treatment.length,
      0.95
    );

    let recommendation: "promote" | "reject" | "continue_testing" =
      "continue_testing";
    if (isSignificant && qualityLift > 0) {
      recommendation = "promote";
    } else if (isSignificant && qualityLift < -5) {
      recommendation = "reject";
    }

    return {
      controlQualityScore: controlScore,
      treatmentQualityScore: treatmentScore,
      qualityLift,
      sampleSize: Math.min(control.length, treatment.length),
      pValue,
      isSignificant,
      confidenceIntervalMin: confMin,
      confidenceIntervalMax: confMax,
      recommendation,
    };
  }

  private calculateTTest(
    control: ExperimentResult[],
    treatment: ExperimentResult[]
  ): number {
    // Welch's t-test for unequal variances
    const controlMean =
      control.reduce((sum, r) => sum + r.responseQuality, 0) /
      control.length;
    const treatmentMean =
      treatment.reduce((sum, r) => sum + r.responseQuality, 0) /
      treatment.length;

    const controlVar =
      control.reduce((sum, r) => sum + Math.pow(r.responseQuality - controlMean, 2), 0) /
      (control.length - 1);
    const treatmentVar =
      treatment.reduce((sum, r) => sum + Math.pow(r.responseQuality - treatmentMean, 2), 0) /
      (treatment.length - 1);

    const tStatistic =
      (controlMean - treatmentMean) /
      Math.sqrt(controlVar / control.length + treatmentVar / treatment.length);

    // Rough two-sided p-value approximation; in production, use a proper
    // t-distribution CDF from a statistics library instead.
    return Math.min(1, 2 * Math.exp(-Math.abs(tStatistic) / 2));
  }

  private calculateConfidenceInterval(
    mean: number,
    sampleSize: number,
    confidenceLevel: number
  ): [number, number] {
    // 1.96 is the z-score for the 95% level passed in above; 0.25 is the
    // maximum variance of a metric bounded in [0, 1], so this is conservative.
    const marginOfError = 1.96 * Math.sqrt(0.25 / sampleSize);
    return [mean - marginOfError, mean + marginOfError];
  }

  private async getExperimentResults(
    experimentId: string
  ): Promise<ExperimentResult[]> {
    // Query database
    return [];
  }
}

export { StatisticalAnalyzer, ExperimentAnalysis };
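
The `minimumSampleSize: 1000` used earlier should not be arbitrary; it should come from a power calculation. For a two-sample comparison of means, a standard approximation is n ≈ 2(z_α/2 + z_β)²σ²/δ² per variant, where δ is the smallest quality lift you care to detect. A sketch with z-values hardcoded for the common 95% confidence / 80% power case:

```typescript
// Approximate per-variant sample size for a two-sample test of means,
// at 95% confidence (z = 1.96) and 80% power (z = 0.84).
function requiredSampleSize(
  stdDev: number, // expected std dev of the quality score
  minDetectableDiff: number // smallest lift worth detecting
): number {
  const zAlpha = 1.96; // two-sided, alpha = 0.05
  const zBeta = 0.84; // power = 0.80
  const n =
    (2 * Math.pow(zAlpha + zBeta, 2) * Math.pow(stdDev, 2)) /
    Math.pow(minDetectableDiff, 2);
  return Math.ceil(n);
}
```

For example, detecting a 0.05 lift in a quality score with standard deviation 0.2 needs roughly 250 samples per variant; halving the detectable lift quadruples the requirement.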

Gradual Rollout: Canary to 100%

Once your experiment shows a statistically significant improvement, roll out gradually:

interface RolloutStage {
  stageId: number;
  trafficPercent: number;
  durationHours: number;
  checkpointName: string;
}

class GradualRolloutManager {
  private readonly DEFAULT_ROLLOUT_STAGES: RolloutStage[] = [
    {
      stageId: 1,
      trafficPercent: 5,
      durationHours: 2,
      checkpointName: "canary",
    },
    {
      stageId: 2,
      trafficPercent: 25,
      durationHours: 4,
      checkpointName: "early_adopters",
    },
    {
      stageId: 3,
      trafficPercent: 50,
      durationHours: 6,
      checkpointName: "half_rollout",
    },
    {
      stageId: 4,
      trafficPercent: 100,
      durationHours: 0,
      checkpointName: "full_rollout",
    },
  ];

  async initiateRollout(
    modelOrPromptId: string,
    stages?: RolloutStage[]
  ): Promise<void> {
    const rolloutStages = stages || this.DEFAULT_ROLLOUT_STAGES;

    for (const stage of rolloutStages) {
      console.log(`Rolling out to ${stage.trafficPercent}%`);

      // Update feature flag system
      await this.updateTrafficAllocation(modelOrPromptId, stage.trafficPercent);

      if (stage.durationHours > 0) {
        // Wait and monitor
        await this.monitorStage(modelOrPromptId, stage);

        // Check for issues
        const health = await this.checkServiceHealth(modelOrPromptId);
        if (!health.isHealthy) {
          console.error(`Rollout failed at stage ${stage.checkpointName}`);
          await this.rollback(modelOrPromptId);
          return;
        }
      }
    }

    console.log(`Rollout complete: ${modelOrPromptId}`);
  }

  private async updateTrafficAllocation(
    modelOrPromptId: string,
    percent: number
  ): Promise<void> {
    // Update feature flag service (LaunchDarkly, Unleash, etc.)
  }

  private async monitorStage(
    modelOrPromptId: string,
    stage: RolloutStage
  ): Promise<void> {
    const endTime = Date.now() + stage.durationHours * 60 * 60 * 1000;
    while (Date.now() < endTime) {
      // Wait 30 seconds between checks
      await new Promise((resolve) => setTimeout(resolve, 30000));

      const metrics = await this.getMetrics(modelOrPromptId);
      console.log(
        `Stage ${stage.checkpointName}: latency=${metrics.latencyMs}ms, errors=${metrics.errorRate}%`
      );
    }
  }

  private async checkServiceHealth(
    modelOrPromptId: string
  ): Promise<{ isHealthy: boolean; reason?: string }> {
    const metrics = await this.getMetrics(modelOrPromptId);

    if (metrics.errorRate > 5) {
      return { isHealthy: false, reason: "Error rate too high" };
    }

    if (metrics.latencyMs > 10000) {
      return { isHealthy: false, reason: "Latency degradation" };
    }

    return { isHealthy: true };
  }

  private async rollback(modelOrPromptId: string): Promise<void> {
    console.log(`Rolling back: ${modelOrPromptId}`);
    await this.updateTrafficAllocation(modelOrPromptId, 0);
  }

  private async getMetrics(modelOrPromptId: string): Promise<{
    latencyMs: number;
    errorRate: number;
  }> {
    // Query observability backend
    return { latencyMs: 1200, errorRate: 0.5 };
  }
}

export { GradualRolloutManager, RolloutStage };

Conclusion

A/B testing LLM models and prompts transforms experimentation from guesswork to data-driven decisions. Use shadow mode to test without user impact, statistical significance testing to validate results, and gradual rollouts to protect production.

The combination of these techniques lets you ship improvements with confidence, knowing you'll catch regressions before they affect users at scale.