- Published on
A/B Testing LLM Models and Prompts — Replacing Guesswork With Data
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
You've written a new prompt that you think is better. Or you want to try a faster, cheaper model. How do you know whether the change actually improves the user experience without breaking production?
This post covers A/B testing architectures for LLM model and prompt changes, with statistical rigor and safety nets.
- Model A/B Testing Architecture
- Shadow Mode: Run Both Models, Show One
- Statistical Significance Testing
- Gradual Rollout: Canary to 100%
- Conclusion
Model A/B Testing Architecture
Split traffic between two model versions and measure quality metrics:
interface ExperimentConfig {
experimentId: string;
controlModelName: string;
treatmentModelName: string;
trafficSplitPercent: number; // e.g., 50 for 50/50 split
minimumSampleSize: number;
statisticalSignificanceThreshold: number; // e.g., 0.05
}
interface ExperimentResult {
userId: string;
experimentId: string;
variant: "control" | "treatment";
modelName: string;
responseQuality: number; // 0-1
latencyMs: number;
costUSD: number;
timestamp: Date;
}
class ABTestingOrchestrator {
async assignVariant(
userId: string,
experimentId: string
): Promise<"control" | "treatment"> {
// Deterministic assignment: same user always gets same variant
const hash = this.hashUserId(userId);
const assignment = hash % 100;
// Read from feature flag system or database
const experiment = await this.getExperiment(experimentId);
if (assignment < experiment.trafficSplitPercent) {
return "treatment";
}
return "control";
}
async selectModel(
userId: string,
experimentId: string,
defaultModel: string
): Promise<string> {
const variant = await this.assignVariant(userId, experimentId);
const experiment = await this.getExperiment(experimentId);
if (variant === "treatment") {
return experiment.treatmentModelName;
}
return experiment.controlModelName;
}
async recordResult(result: ExperimentResult): Promise<void> {
// Store in database for analysis
await this.persistResult(result);
// Check for early stopping (quality regression)
const canary = await this.checkCanary(result.experimentId);
if (canary.shouldStop) {
await this.stopExperiment(result.experimentId);
console.error(`Experiment stopped: ${canary.reason}`);
}
}
private hashUserId(userId: string): number {
let hash = 0;
for (let i = 0; i < userId.length; i++) {
const char = userId.charCodeAt(i);
hash = (hash << 5) - hash + char;
hash = hash & hash; // Convert to 32-bit integer
}
return Math.abs(hash);
}
private async getExperiment(
experimentId: string
): Promise<ExperimentConfig> {
// Fetch from cache or database
return {
experimentId,
controlModelName: "claude-3-5-sonnet-20241022",
treatmentModelName: "claude-3-5-haiku-20241022",
trafficSplitPercent: 50,
minimumSampleSize: 1000,
statisticalSignificanceThreshold: 0.05,
};
}
private async persistResult(result: ExperimentResult): Promise<void> {
// Store in database
}
private async checkCanary(experimentId: string): Promise<{
shouldStop: boolean;
reason: string;
}> {
// Early stopping if quality drops below threshold
return { shouldStop: false, reason: "" };
}
private async stopExperiment(experimentId: string): Promise<void> {
// Mark experiment as stopped in feature flag system
}
}
export { ABTestingOrchestrator, ExperimentConfig, ExperimentResult };
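To see the deterministic bucketing in isolation, here is a minimal, self-contained sketch of the same hash-and-bucket logic used by `assignVariant` above (the standalone function names are illustrative):

```typescript
// Standalone sketch of the deterministic bucketing used by assignVariant.
// Same 32-bit string hash as in the orchestrator above.
function hashUserId(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash << 5) - hash + userId.charCodeAt(i);
    hash = hash & hash; // keep within 32-bit integer range
  }
  return Math.abs(hash);
}

function assignVariant(
  userId: string,
  trafficSplitPercent: number
): "control" | "treatment" {
  // Bucket 0-99; users below the split percentage get the treatment
  return hashUserId(userId) % 100 < trafficSplitPercent
    ? "treatment"
    : "control";
}
```

Because the assignment is a pure function of the user ID, the same user gets the same variant across sessions with no stored state. One refinement worth considering: hash the user ID together with the experiment ID, so the same users don't land in the treatment bucket of every concurrent experiment.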
Shadow Mode: Run Both Models, Show One
In shadow mode, call both the control and treatment models, but only show the user the control response. Use treatment responses for evaluation:
interface ShadowModeResult {
controlResponse: string;
treatmentResponse: string;
comparisonScore: number;
recommendTreatment: boolean;
}
class ShadowModeEvaluator {
async runInShadowMode(
prompt: string,
controlModel: string,
treatmentModel: string,
userId: string
): Promise<ShadowModeResult> {
const startTime = Date.now();
// Run both models in parallel to minimize latency impact
const [controlResult, treatmentResult] = await Promise.all([
this.callModel(prompt, controlModel, userId),
this.callModel(prompt, treatmentModel, userId),
]);
const latencyMs = Date.now() - startTime;
// Compare responses without user seeing it
const comparison = await this.compareResponses(
prompt,
controlResult.response,
treatmentResult.response
);
// Log for analysis
await this.logShadowComparison({
userId,
controlModel,
treatmentModel,
controlLatency: controlResult.latencyMs,
treatmentLatency: treatmentResult.latencyMs,
comparisonScore: comparison.score,
totalLatency: latencyMs,
});
return {
controlResponse: controlResult.response,
treatmentResponse: treatmentResult.response,
comparisonScore: comparison.score,
recommendTreatment: comparison.score > 0.7,
};
}
private async callModel(
prompt: string,
model: string,
userId: string
): Promise<{
response: string;
latencyMs: number;
costUSD: number;
}> {
const startTime = Date.now();
// Call the actual LLM API
return {
response: "sample response",
latencyMs: Date.now() - startTime,
costUSD: 0.01,
};
}
private async compareResponses(
prompt: string,
controlResponse: string,
treatmentResponse: string
): Promise<{ score: number; reasoning: string }> {
// Use LLM-as-judge to compare quality
return {
score: 0.85,
reasoning: "Treatment response is more detailed",
};
}
private async logShadowComparison(
data: Record<string, unknown>
): Promise<void> {
// Persist to database for analysis
}
}
export { ShadowModeEvaluator, ShadowModeResult };
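The core contract of shadow mode can be reduced to a few lines. This sketch (with injected stub model callers, since the real API calls live elsewhere) shows the two properties that matter: the user only ever receives the control output, and a failure in the treatment call can never break the user-facing path:

```typescript
// Minimal sketch of the shadow-mode contract: both models run in parallel,
// but only the control output reaches the user. Model callers are injected.
type ModelCall = (prompt: string) => Promise<string>;

async function serveWithShadow(
  prompt: string,
  callControl: ModelCall,
  callTreatment: ModelCall,
  log: (entry: { control: string; treatment: string }) => void
): Promise<string> {
  // A treatment failure must never break the user-facing path
  const safeTreatment = callTreatment(prompt).catch(() => null);
  const [control, treatment] = await Promise.all([
    callControl(prompt),
    safeTreatment,
  ]);
  if (treatment !== null) {
    log({ control, treatment }); // comparison happens offline
  }
  return control; // the user only ever sees the control output
}
```

Note that shadowing doubles inference cost for every shadowed request, so teams typically run it on a sampled fraction of traffic rather than all of it.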
Statistical Significance Testing
Know when your experiment results are statistically valid:
interface ExperimentAnalysis {
controlQualityScore: number;
treatmentQualityScore: number;
qualityLift: number; // percentage improvement
sampleSize: number;
pValue: number;
isSignificant: boolean;
confidenceIntervalMin: number;
confidenceIntervalMax: number;
recommendation: "promote" | "reject" | "continue_testing";
}
class StatisticalAnalyzer {
async analyzeExperiment(
experimentId: string,
minSampleSize: number = 1000,
significanceThreshold: number = 0.05
): Promise<ExperimentAnalysis> {
const results = await this.getExperimentResults(experimentId);
const control = results.filter((r) => r.variant === "control");
const treatment = results.filter((r) => r.variant === "treatment");
if (
control.length < minSampleSize ||
treatment.length < minSampleSize
) {
return {
controlQualityScore: 0,
treatmentQualityScore: 0,
qualityLift: 0,
sampleSize: Math.min(control.length, treatment.length),
pValue: 1,
isSignificant: false,
confidenceIntervalMin: 0,
confidenceIntervalMax: 0,
recommendation: "continue_testing",
};
}
const controlScore =
control.reduce((sum, r) => sum + r.responseQuality, 0) /
control.length;
const treatmentScore =
treatment.reduce((sum, r) => sum + r.responseQuality, 0) /
treatment.length;
const pValue = this.calculateTTest(control, treatment);
const isSignificant = pValue < significanceThreshold;
const qualityLift = ((treatmentScore - controlScore) / controlScore) * 100;
const [confMin, confMax] = this.calculateConfidenceInterval(
treatmentScore,
treatment.length,
0.95
);
let recommendation: "promote" | "reject" | "continue_testing" =
"continue_testing";
if (isSignificant && qualityLift > 0) {
recommendation = "promote";
} else if (isSignificant && qualityLift < -5) {
recommendation = "reject";
}
return {
controlQualityScore: controlScore,
treatmentQualityScore: treatmentScore,
qualityLift,
sampleSize: Math.min(control.length, treatment.length),
pValue,
isSignificant,
confidenceIntervalMin: confMin,
confidenceIntervalMax: confMax,
recommendation,
};
}
private calculateTTest(
control: ExperimentResult[],
treatment: ExperimentResult[]
): number {
// Welch's t-test for unequal variances
const controlMean =
control.reduce((sum, r) => sum + r.responseQuality, 0) /
control.length;
const treatmentMean =
treatment.reduce((sum, r) => sum + r.responseQuality, 0) /
treatment.length;
const controlVar =
control.reduce((sum, r) => sum + Math.pow(r.responseQuality - controlMean, 2), 0) /
(control.length - 1);
const treatmentVar =
treatment.reduce((sum, r) => sum + Math.pow(r.responseQuality - treatmentMean, 2), 0) /
(treatment.length - 1);
const tStatistic =
(controlMean - treatmentMean) /
Math.sqrt(controlVar / control.length + treatmentVar / treatment.length);
// Crude heuristic, not a true p-value: for rigor, compute a two-sided
// p-value from a t-distribution CDF with Welch's degrees of freedom
return Math.min(1, 2 * Math.exp(-Math.abs(tStatistic) / 2));
}
private calculateConfidenceInterval(
mean: number,
sampleSize: number,
confidenceLevel: number
): [number, number] {
// Assumes 95% confidence (z = 1.96) and worst-case variance 0.25 for a
// 0-1 score; the confidenceLevel parameter would drive a z-lookup in production
const marginOfError = 1.96 * Math.sqrt(0.25 / sampleSize);
return [mean - marginOfError, mean + marginOfError];
}
private async getExperimentResults(
experimentId: string
): Promise<ExperimentResult[]> {
// Query database
return [];
}
}
export { StatisticalAnalyzer, ExperimentAnalysis };
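The exponential shortcut in `calculateTTest` above is only a rough heuristic. A sketch of a more defensible p-value follows: Welch's t-statistic with a two-sided p-value under a standard-normal approximation, which is reasonable once each arm has hundreds of samples. Since JavaScript's `Math` has no built-in `erf`, this uses the Abramowitz & Stegun polynomial approximation; all function names here are illustrative, not part of the classes above:

```typescript
// Sketch: Welch's t-statistic with a normal-approximation two-sided p-value.
// Valid for large samples; for small n, use a proper t-distribution CDF.
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
}

// Abramowitz & Stegun 7.1.26 polynomial approximation of erf
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t -
      0.284496736) *
      t +
      0.254829592) *
    t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function welchPValue(control: number[], treatment: number[]): number {
  const se = Math.sqrt(
    sampleVariance(control) / control.length +
      sampleVariance(treatment) / treatment.length
  );
  const tStat = (mean(treatment) - mean(control)) / se;
  // Two-sided p-value: p = 2 * (1 - Phi(|t|)) = 1 - erf(|t| / sqrt(2))
  return 1 - erf(Math.abs(tStat) / Math.SQRT2);
}
```

In production you would more likely reach for an established stats library than hand-roll this, but the sketch makes the mechanics of the significance check concrete.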
Gradual Rollout: Canary to 100%
Once your experiment is statistically significant, roll out gradually:
interface RolloutStage {
stageId: number;
trafficPercent: number;
durationHours: number;
checkpointName: string;
}
class GradualRolloutManager {
private readonly DEFAULT_ROLLOUT_STAGES: RolloutStage[] = [
{
stageId: 1,
trafficPercent: 5,
durationHours: 2,
checkpointName: "canary",
},
{
stageId: 2,
trafficPercent: 25,
durationHours: 4,
checkpointName: "early_adopters",
},
{
stageId: 3,
trafficPercent: 50,
durationHours: 6,
checkpointName: "half_rollout",
},
{
stageId: 4,
trafficPercent: 100,
durationHours: 0,
checkpointName: "full_rollout",
},
];
async initiateRollout(
modelOrPromptId: string,
stages?: RolloutStage[]
): Promise<void> {
const rolloutStages = stages || this.DEFAULT_ROLLOUT_STAGES;
for (const stage of rolloutStages) {
console.log(`Rolling out to ${stage.trafficPercent}%`);
// Update feature flag system
await this.updateTrafficAllocation(modelOrPromptId, stage.trafficPercent);
if (stage.durationHours > 0) {
// Wait and monitor
await this.monitorStage(modelOrPromptId, stage);
// Check for issues
const health = await this.checkServiceHealth(modelOrPromptId);
if (!health.isHealthy) {
console.error(`Rollout failed at stage ${stage.checkpointName}`);
await this.rollback(modelOrPromptId);
return;
}
}
}
console.log(`Rollout complete: ${modelOrPromptId}`);
}
private async updateTrafficAllocation(
modelOrPromptId: string,
percent: number
): Promise<void> {
// Update feature flag service (LaunchDarkly, Unleash, etc.)
}
private async monitorStage(
modelOrPromptId: string,
stage: RolloutStage
): Promise<void> {
const endTime = Date.now() + stage.durationHours * 60 * 60 * 1000;
while (Date.now() < endTime) {
// Wait 30 seconds between checks
await new Promise((resolve) => setTimeout(resolve, 30000));
const metrics = await this.getMetrics(modelOrPromptId);
console.log(
`Stage ${stage.checkpointName}: latency=${metrics.latencyMs}ms, errors=${metrics.errorRate}%`
);
}
}
private async checkServiceHealth(
modelOrPromptId: string
): Promise<{ isHealthy: boolean; reason?: string }> {
const metrics = await this.getMetrics(modelOrPromptId);
if (metrics.errorRate > 5) {
return { isHealthy: false, reason: "Error rate too high" };
}
if (metrics.latencyMs > 10000) {
return { isHealthy: false, reason: "Latency degradation" };
}
return { isHealthy: true };
}
private async rollback(modelOrPromptId: string): Promise<void> {
console.log(`Rolling back: ${modelOrPromptId}`);
await this.updateTrafficAllocation(modelOrPromptId, 0);
}
private async getMetrics(modelOrPromptId: string): Promise<{
latencyMs: number;
errorRate: number;
}> {
// Query observability backend
return { latencyMs: 1200, errorRate: 0.5 };
}
}
export { GradualRolloutManager, RolloutStage };
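The stage-advance and rollback behavior above can be distilled into a small, testable core. In this sketch the waiting and metric polling are collapsed into an injected health check so the control flow runs instantly; the function and parameter names are illustrative:

```typescript
// Sketch of the stage-advance/rollback logic, with monitoring collapsed
// into an injected health check so the control flow is easy to test.
interface Stage {
  trafficPercent: number;
  checkpointName: string;
}

async function runRollout(
  stages: Stage[],
  setTraffic: (percent: number) => Promise<void>,
  isHealthy: (checkpoint: string) => Promise<boolean>
): Promise<number> {
  let current = 0;
  for (const stage of stages) {
    await setTraffic(stage.trafficPercent);
    current = stage.trafficPercent;
    if (!(await isHealthy(stage.checkpointName))) {
      await setTraffic(0); // rollback: all traffic returns to the control
      return 0;
    }
  }
  return current; // healthy path reaches the final stage's percentage
}
```

Keeping the traffic mutation and the health decision behind injected functions also mirrors how this would sit on top of a feature flag service in practice.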
Conclusion
A/B testing LLM models and prompts transforms experimentation from guesswork to data-driven decisions. Use shadow mode to test without user impact, statistical significance testing to validate results, and gradual rollouts to protect production.
The combination of these techniques lets you ship improvements with confidence, knowing you'll catch regressions before they affect users at scale.