AI Model Versioning — Managing Model Updates Without Breaking Your Application

Introduction

Deploying a new model version without breaking production requires more than committing to git and pushing to main. This guide covers semantic versioning for AI models, maintaining model registries, canary deployments, shadow testing, A/B testing, and automated rollback strategies.

Semantic Versioning for AI Models

Apply semantic versioning to models: MAJOR.MINOR.PATCH.

  • MAJOR: Breaking changes (different input schema, output format incompatible with downstream systems)
  • MINOR: New capabilities, improved performance that doesn't break contracts
  • PATCH: Bug fixes, numerical improvements without behavioral changes

interface ModelVersion {
  name: string;
  semanticVersion: string;
  trainingData: {
    version: string;
    dateRange: { start: Date; end: Date };
    size: number;
  };
  evaluationMetrics: Record<string, number>;
  releaseDate: Date;
  changelog: string;
  breakingChanges: string[];
  compatibleDownstreamVersions: string[];
}

function parseModelVersion(version: string): {
  major: number;
  minor: number;
  patch: number;
} {
  const [major, minor, patch] = version.split('.').map(Number);
  return { major, minor, patch };
}

function isBreakingChange(
  oldVersion: ModelVersion,
  newVersion: ModelVersion
): boolean {
  // A breaking change is a major version bump or an explicitly declared
  // breaking change -- changed evaluation metrics alone don't break contracts
  const prev = parseModelVersion(oldVersion.semanticVersion);
  const next = parseModelVersion(newVersion.semanticVersion);
  return next.major > prev.major || newVersion.breakingChanges.length > 0;
}
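
For example, a deployment script can use these helpers to decide whether a rollout needs downstream coordination (the version strings here are hypothetical):

const current = parseModelVersion('2.3.1');
const candidate = parseModelVersion('3.0.0');

if (candidate.major > current.major) {
  // Major bump: notify downstream consumers before routing traffic
  console.log('Breaking release: coordinate rollout with downstream teams');
}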

Model Registry Architecture

A model registry centralizes model metadata, lineage, and governance:

interface ModelRegistry {
  listVersions(modelName: string): Promise<ModelVersion[]>;
  getVersion(modelName: string, version: string): Promise<ModelVersion>;
  registerModel(metadata: ModelVersion): Promise<void>;
  markStage(modelName: string, version: string, stage: 'staging' | 'production'): Promise<void>;
  getProductionModel(modelName: string): Promise<ModelVersion>;
}

class LocalModelRegistry implements ModelRegistry {
  private models: Map<string, ModelVersion[]> = new Map();
  private stages: Map<string, string> = new Map(); // `${name}@${stage}` -> version

  async registerModel(metadata: ModelVersion): Promise<void> {
    if (!this.models.has(metadata.name)) {
      this.models.set(metadata.name, []);
    }
    const versions = this.models.get(metadata.name)!;
    versions.push(metadata);
    versions.sort((a, b) => b.releaseDate.getTime() - a.releaseDate.getTime());
  }

  async getVersion(modelName: string, version: string): Promise<ModelVersion> {
    const versions = this.models.get(modelName) || [];
    const found = versions.find((v) => v.semanticVersion === version);
    if (!found) throw new Error(`Version ${version} not found`);
    return found;
  }

  async listVersions(modelName: string): Promise<ModelVersion[]> {
    return this.models.get(modelName) || [];
  }

  async markStage(
    modelName: string,
    version: string,
    stage: 'staging' | 'production'
  ): Promise<void> {
    await this.getVersion(modelName, version); // throws if the version is unknown
    this.stages.set(`${modelName}@${stage}`, version);
  }

  async getProductionModel(modelName: string): Promise<ModelVersion> {
    const version = this.stages.get(`${modelName}@production`);
    if (!version) throw new Error(`No production version for ${modelName}`);
    return this.getVersion(modelName, version);
  }
}
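
In practice, registering and promoting a version might look like the following (the model name and metric values are made up for illustration; run inside an async context):

const registry = new LocalModelRegistry();

await registry.registerModel({
  name: 'support-classifier',
  semanticVersion: '1.2.0',
  trainingData: {
    version: 'td-2024-06',
    dateRange: { start: new Date('2024-01-01'), end: new Date('2024-06-01') },
    size: 250_000
  },
  evaluationMetrics: { accuracy: 0.91, f1: 0.89 },
  releaseDate: new Date(),
  changelog: 'Improved handling of multilingual tickets',
  breakingChanges: [],
  compatibleDownstreamVersions: ['1.x']
});

await registry.markStage('support-classifier', '1.2.0', 'production');
const prod = await registry.getProductionModel('support-classifier'); // 1.2.0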

Canary Deployment Strategy

Don't shift 100% of traffic to a new model immediately. Start with 5%, monitor metrics, and increase gradually:

interface CanaryConfig {
  initialTrafficPercent: number;
  increments: number[];
  metricsThresholds: Record<string, number>;
  monitoringDurationMinutes: number;
}

class CanaryDeployment {
  private currentTrafficPercent: number = 0;

  async deploy(
    newModel: ModelVersion,
    oldModel: ModelVersion,
    config: CanaryConfig
  ): Promise<boolean> {
    // Shift traffic in steps; hold at each level, then check metrics before advancing
    const steps = [config.initialTrafficPercent, ...config.increments];

    for (const target of steps) {
      this.currentTrafficPercent = target;
      console.log(`Routing ${this.currentTrafficPercent}% traffic to new model`);

      // Let the new model serve at this level before evaluating
      await this.sleep(config.monitoringDurationMinutes * 60 * 1000);

      const passedMonitoring = await this.monitorMetrics(
        newModel,
        oldModel,
        config
      );

      if (!passedMonitoring) {
        console.log('Metrics degradation detected, rolling back');
        return false;
      }
    }

    console.log('Canary deployment successful, routing 100% traffic');
    return true;
  }

  private async monitorMetrics(
    newModel: ModelVersion,
    oldModel: ModelVersion,
    config: CanaryConfig
  ): Promise<boolean> {
    for (const [metric, threshold] of Object.entries(config.metricsThresholds)) {
      const newValue = await this.getMetricValue(newModel, metric);
      const oldValue = await this.getMetricValue(oldModel, metric);

      // Assumes higher-is-better metrics; invert the sign for latency-style metrics
      const degradation = (oldValue - newValue) / oldValue;
      if (degradation > threshold) {
        console.log(`Metric ${metric} degraded by ${(degradation * 100).toFixed(1)}%`);
        return false;
      }
    }
    return true;
  }

  private async getMetricValue(
    model: ModelVersion,
    metric: string
  ): Promise<number> {
    // Placeholder: in production, query the monitoring system for live values
    return model.evaluationMetrics[metric] || 0;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
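
A configuration that starts at 5% and steps up from there might look like this; the metric names and thresholds are illustrative and express the maximum tolerated fractional degradation:

const canaryConfig: CanaryConfig = {
  initialTrafficPercent: 5,
  increments: [10, 25, 50, 100],
  metricsThresholds: {
    accuracy: 0.02,       // tolerate at most a 2% relative drop
    thumbs_up_rate: 0.05  // tolerate at most a 5% relative drop
  },
  monitoringDurationMinutes: 30
};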

Shadow Testing New Models

Run the new model in parallel with the production model and compare outputs without impacting users:

interface ShadowTestResult {
  queryId: string;
  productionOutput: string;
  newModelOutput: string;
  differenceSeverity: 'none' | 'minor' | 'moderate' | 'severe';
  agreement: number;
}

// Minimal interface assumed for the models under test
interface LLMModel {
  generate(prompt: string): Promise<string>;
}

class ShadowTester {
  async runShadowTest(
    productionModel: LLMModel,
    newModel: LLMModel,
    testQueries: string[]
  ): Promise<ShadowTestResult[]> {
    const results: ShadowTestResult[] = [];

    for (const query of testQueries) {
      const prodResponse = await productionModel.generate(query);
      const newResponse = await newModel.generate(query);

      const similarity = this.computeSimilarity(prodResponse, newResponse);

      results.push({
        queryId: this.generateId(),
        productionOutput: prodResponse,
        newModelOutput: newResponse,
        differenceSeverity: this.classifySeverity(similarity),
        agreement: similarity
      });
    }

    // Alert if agreement is unexpectedly low
    const avgAgreement = results.reduce((a, b) => a + b.agreement, 0) / results.length;
    if (avgAgreement < 0.7) {
      console.warn(`Low agreement detected: ${(avgAgreement * 100).toFixed(1)}%`);
    }

    return results;
  }

  private computeSimilarity(output1: string, output2: string): number {
    // Token overlap is a cheap proxy; prefer embedding-based semantic
    // similarity in production (see the sketch after this class)
    const tokens1 = output1.split(/\s+/);
    const tokens2 = output2.split(/\s+/);
    const common = tokens1.filter((t) => tokens2.includes(t)).length;
    return common / Math.max(tokens1.length, tokens2.length);
  }

  private classifySeverity(agreement: number): ShadowTestResult['differenceSeverity'] {
    if (agreement > 0.9) return 'none';
    if (agreement > 0.7) return 'minor';
    if (agreement > 0.5) return 'moderate';
    return 'severe';
  }

  private generateId(): string {
    return Math.random().toString(36).slice(2, 10);
  }
}
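
The token-overlap metric above is only a cheap proxy. When an embedding model is available, cosine similarity over embeddings tracks meaning more faithfully. A sketch, assuming a hypothetical `embed` function that maps a string to a vector:

async function semanticSimilarity(
  output1: string,
  output2: string,
  embed: (text: string) => Promise<number[]> // assumed embedding client
): Promise<number> {
  const [e1, e2] = await Promise.all([embed(output1), embed(output2)]);
  let dot = 0;
  let norm1 = 0;
  let norm2 = 0;
  for (let i = 0; i < e1.length; i++) {
    dot += e1[i] * e2[i];
    norm1 += e1[i] * e1[i];
    norm2 += e2[i] * e2[i];
  }
  return dot / (Math.sqrt(norm1) * Math.sqrt(norm2)); // cosine similarity
}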

Rollback Triggers

Define metrics that trigger automatic rollback:

interface RollbackPolicy {
  metrics: Array<{
    name: string;
    threshold: number;
    direction: 'greater' | 'less'; // greater for latency, less for accuracy
  }>;
  evalWindow: number; // minutes
  triggerAfter: number; // consecutive violations
}

class RollbackManager {
  private violationCounts: Record<string, number> = {};

  /** Returns true when the policy calls for rolling back the current model. */
  async monitorForRollback(
    currentModel: ModelVersion,
    policy: RollbackPolicy,
    metrics: Record<string, number>
  ): Promise<boolean> {
    for (const metricPolicy of policy.metrics) {
      const value = metrics[metricPolicy.name];
      const triggered =
        metricPolicy.direction === 'greater'
          ? value > metricPolicy.threshold
          : value < metricPolicy.threshold;

      if (triggered) {
        this.violationCounts[metricPolicy.name] =
          (this.violationCounts[metricPolicy.name] || 0) + 1;

        if (this.violationCounts[metricPolicy.name] >= policy.triggerAfter) {
          console.log(
            `Rollback triggered: ${metricPolicy.name} violated ${policy.triggerAfter} consecutive times`
          );
          return true;
        }
      } else {
        // Reset on recovery so only consecutive violations count
        this.violationCounts[metricPolicy.name] = 0;
      }
    }

    return false;
  }
}
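
An example policy with illustrative thresholds: roll back if p95 latency exceeds 2000 ms or accuracy drops below 0.85 for three consecutive evaluation windows:

const rollbackPolicy: RollbackPolicy = {
  metrics: [
    { name: 'p95_latency_ms', threshold: 2000, direction: 'greater' },
    { name: 'accuracy', threshold: 0.85, direction: 'less' }
  ],
  evalWindow: 5,
  triggerAfter: 3
};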

A/B Testing Model Versions

For longer-term comparisons, split users between the two versions and track business metrics:

interface ABTestConfig {
  modelA: ModelVersion;
  modelB: ModelVersion;
  trafficSplit: number; // 0.5 for 50/50
  duration: number; // days
  businessMetrics: string[];
  minSampleSize: number;
}

interface ABTestResult {
  winner: 'A' | 'B' | 'tie';
  metricsA: Record<string, number>;
  metricsB: Record<string, number>;
  confidence: number;
}

class ModelABTest {
  async runABTest(config: ABTestConfig): Promise<ABTestResult> {
    const usersA: string[] = [];
    const usersB: string[] = [];

    // Partition users
    const allUsers = await this.getAllUsers();
    for (const user of allUsers) {
      if (Math.random() < config.trafficSplit) {
        usersA.push(user);
      } else {
        usersB.push(user);
      }
    }

    // Collect metrics
    const metricsA = await this.collectMetrics(config.modelA, usersA, config.duration);
    const metricsB = await this.collectMetrics(config.modelB, usersB, config.duration);

    // Compare
    const winner = this.determineWinner(metricsA, metricsB);

    return {
      winner,
      metricsA,
      metricsB,
      confidence: this.calculateConfidence(metricsA, metricsB)
    };
  }

  private determineWinner(
    metricsA: Record<string, number>,
    metricsB: Record<string, number>
  ): 'A' | 'B' | 'tie' {
    let aWins = 0;
    let bWins = 0;

    for (const key in metricsA) {
      if (metricsA[key] > metricsB[key]) aWins++;
      else if (metricsB[key] > metricsA[key]) bWins++;
    }

    if (aWins > bWins) return 'A';
    if (bWins > aWins) return 'B';
    return 'tie';
  }

  private calculateConfidence(
    metricsA: Record<string, number>,
    metricsB: Record<string, number>
  ): number {
    // Use statistical test to calculate p-value
    return 0.95; // Placeholder
  }

  private async collectMetrics(
    model: ModelVersion,
    users: string[],
    durationDays: number
  ): Promise<Record<string, number>> {
    // Query analytics for business metrics
    return {};
  }

  private async getAllUsers(): Promise<string[]> {
    return [];
  }
}
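
One way to replace the `calculateConfidence` placeholder is a standard two-proportion z-test, which applies when the business metric is a rate (e.g., conversion). This is a sketch of one common choice, not the only valid test:

function twoProportionZTest(
  successesA: number, totalA: number,
  successesB: number, totalB: number
): number {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  const pooled = (successesA + successesB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = Math.abs(pA - pB) / se;
  return 1 - 2 * (1 - normalCdf(z)); // confidence = 1 - two-sided p-value
}

function normalCdf(x: number): number {
  // Zelen-Severo polynomial approximation of the standard normal CDF (x >= 0)
  const t = 1 / (1 + 0.2316419 * x);
  const d = 0.3989423 * Math.exp((-x * x) / 2);
  const poly =
    t * (0.3193815 + t * (-0.3565638 + t * (1.7814779 + t * (-1.8212560 + t * 1.3302744))));
  return 1 - d * poly;
}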

Version Pinning for Deterministic Behavior

Critical applications may require pinning to specific model versions:

interface ApplicationConfig {
  modelName: string;
  pinned: boolean;
  pinnedVersion?: string;
  allowMinorUpdates: boolean;
  maxMajorVersion: number;
}

class VersionConstraint {
  static checkCompatibility(
    app: ApplicationConfig,
    availableVersion: string
  ): boolean {
    if (app.pinned && app.pinnedVersion) {
      return availableVersion === app.pinnedVersion;
    }

    const next = parseModelVersion(availableVersion);

    if (app.pinnedVersion && !app.allowMinorUpdates) {
      // Only allow patch updates within the pinned major.minor
      const current = parseModelVersion(app.pinnedVersion);
      return current.major === next.major && current.minor === next.minor;
    }

    // Allow minor and patch updates up to the configured major version
    return next.major <= app.maxMajorVersion;
  }
}
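
Usage is straightforward; here an application pinned to a hypothetical 2.3.1 rejects every other version:

const appConfig: ApplicationConfig = {
  modelName: 'support-classifier',
  pinned: true,
  pinnedVersion: '2.3.1',
  allowMinorUpdates: false,
  maxMajorVersion: 2
};

VersionConstraint.checkCompatibility(appConfig, '2.3.1'); // true
VersionConstraint.checkCompatibility(appConfig, '2.4.0'); // false (pinned)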

Checklist

  • Implement semantic versioning for all models
  • Set up centralized model registry with metadata
  • Document training data version and date range for each model
  • Track evaluation metrics with each model version
  • Create canary deployment pipelines with graduated traffic shifts
  • Implement shadow testing before production deployment
  • Define rollback triggers based on monitoring metrics
  • Run A/B tests for significant model updates
  • Allow version pinning for critical applications
  • Automate model changelog generation from commits and eval changes

Conclusion

Model versioning is infrastructure for confident deployments. Semantic versioning prevents unexpected breaking changes. Model registries centralize governance. Canary deployments catch issues early. Shadow testing validates new models risk-free. Automated rollback prevents cascading failures. A/B testing validates business impact. Version pinning gives applications control. Together, these practices enable continuous improvement without breaking production systems.