AI Model Versioning — Managing Model Updates Without Breaking Your Application

Introduction

Deploying a new model version without breaking production requires more than committing to git and pushing to main. This guide covers semantic versioning for AI models, maintaining model registries, canary deployments, shadow testing, A/B testing, and automated rollback strategies.

Semantic Versioning for AI Models

Apply semantic versioning to models: MAJOR.MINOR.PATCH.

  • MAJOR: Breaking changes (different input schema, output format incompatible with downstream systems)
  • MINOR: New capabilities, improved performance that doesn't break contracts
  • PATCH: Bug fixes, numerical improvements without behavioral changes

interface ModelVersion {
  name: string;
  semanticVersion: string;
  trainingData: {
    version: string;
    dateRange: { start: Date; end: Date };
    size: number;
  };
  evaluationMetrics: Record<string, number>;
  releaseDate: Date;
  changelog: string;
  breakingChanges: string[];
  compatibleDownstreamVersions: string[];
}

function parseModelVersion(version: string): {
  major: number;
  minor: number;
  patch: number;
} {
  const [major, minor, patch] = version.split('.').map(Number);
  return { major, minor, patch };
}

function isBreakingChange(
  oldVersion: ModelVersion,
  newVersion: ModelVersion
): boolean {
  // A breaking change is a major version bump or an explicitly declared
  // breaking change -- changed evaluation metrics alone don't break contracts
  const prev = parseModelVersion(oldVersion.semanticVersion);
  const next = parseModelVersion(newVersion.semanticVersion);
  return next.major > prev.major || newVersion.breakingChanges.length > 0;
}
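
For example, a deployment script can use these helpers to decide whether a rollout needs downstream coordination (the version strings here are hypothetical):

const current = parseModelVersion('2.3.1');
const candidate = parseModelVersion('3.0.0');

if (candidate.major > current.major) {
  // Major bump: notify downstream consumers before routing traffic
  console.log('Breaking release: coordinate rollout with downstream teams');
}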

Model Registry Architecture

A model registry centralizes model metadata, lineage, and governance:

interface ModelRegistry {
  listVersions(modelName: string): Promise<ModelVersion[]>;
  getVersion(modelName: string, version: string): Promise<ModelVersion>;
  registerModel(metadata: ModelVersion): Promise<void>;
  markStage(modelName: string, version: string, stage: 'staging' | 'production'): Promise<void>;
  getProductionModel(modelName: string): Promise<ModelVersion>;
}

class LocalModelRegistry implements ModelRegistry {
  private models: Map<string, ModelVersion[]> = new Map();
  private stages: Map<string, string> = new Map(); // `${name}@${stage}` -> version

  async registerModel(metadata: ModelVersion): Promise<void> {
    if (!this.models.has(metadata.name)) {
      this.models.set(metadata.name, []);
    }
    const versions = this.models.get(metadata.name)!;
    versions.push(metadata);
    versions.sort((a, b) => b.releaseDate.getTime() - a.releaseDate.getTime());
  }

  async getVersion(modelName: string, version: string): Promise<ModelVersion> {
    const versions = this.models.get(modelName) || [];
    const found = versions.find((v) => v.semanticVersion === version);
    if (!found) throw new Error(`Version ${version} not found`);
    return found;
  }

  async listVersions(modelName: string): Promise<ModelVersion[]> {
    return this.models.get(modelName) || [];
  }

  async markStage(
    modelName: string,
    version: string,
    stage: 'staging' | 'production'
  ): Promise<void> {
    await this.getVersion(modelName, version); // throws if the version is unknown
    this.stages.set(`${modelName}@${stage}`, version);
  }

  async getProductionModel(modelName: string): Promise<ModelVersion> {
    const version = this.stages.get(`${modelName}@production`);
    if (!version) throw new Error(`No production version for ${modelName}`);
    return this.getVersion(modelName, version);
  }
}
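
In practice, registering and promoting a version might look like the following (the model name and metric values are made up for illustration; run inside an async context):

const registry = new LocalModelRegistry();

await registry.registerModel({
  name: 'support-classifier',
  semanticVersion: '1.2.0',
  trainingData: {
    version: 'td-2024-06',
    dateRange: { start: new Date('2024-01-01'), end: new Date('2024-06-01') },
    size: 250_000
  },
  evaluationMetrics: { accuracy: 0.91, f1: 0.89 },
  releaseDate: new Date(),
  changelog: 'Improved handling of multilingual tickets',
  breakingChanges: [],
  compatibleDownstreamVersions: ['1.x']
});

await registry.markStage('support-classifier', '1.2.0', 'production');
const prod = await registry.getProductionModel('support-classifier'); // 1.2.0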

Canary Deployment Strategy

Don't shift 100% of traffic to a new model immediately. Start with 5%, monitor metrics, and increase gradually:

interface CanaryConfig {
  initialTrafficPercent: number;
  increments: number[];
  metricsThresholds: Record<string, number>;
  monitoringDurationMinutes: number;
}

class CanaryDeployment {
  private currentTrafficPercent: number = 0;

  async deploy(
    newModel: ModelVersion,
    oldModel: ModelVersion,
    config: CanaryConfig
  ): Promise<boolean> {
    // Shift traffic in steps; hold at each level, then check metrics before advancing
    const steps = [config.initialTrafficPercent, ...config.increments];

    for (const target of steps) {
      this.currentTrafficPercent = target;
      console.log(`Routing ${this.currentTrafficPercent}% traffic to new model`);

      // Let the new model serve at this level before evaluating
      await this.sleep(config.monitoringDurationMinutes * 60 * 1000);

      const passedMonitoring = await this.monitorMetrics(
        newModel,
        oldModel,
        config
      );

      if (!passedMonitoring) {
        console.log('Metrics degradation detected, rolling back');
        return false;
      }
    }

    console.log('Canary deployment successful, routing 100% traffic');
    return true;
  }

  private async monitorMetrics(
    newModel: ModelVersion,
    oldModel: ModelVersion,
    config: CanaryConfig
  ): Promise<boolean> {
    for (const [metric, threshold] of Object.entries(config.metricsThresholds)) {
      const newValue = await this.getMetricValue(newModel, metric);
      const oldValue = await this.getMetricValue(oldModel, metric);

      // Assumes higher-is-better metrics; invert the sign for latency-style metrics
      const degradation = (oldValue - newValue) / oldValue;
      if (degradation > threshold) {
        console.log(`Metric ${metric} degraded by ${(degradation * 100).toFixed(1)}%`);
        return false;
      }
    }
    return true;
  }

  private async getMetricValue(
    model: ModelVersion,
    metric: string
  ): Promise<number> {
    // Placeholder: in production, query the monitoring system for live values
    return model.evaluationMetrics[metric] || 0;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
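
A configuration that starts at 5% and steps up from there might look like this; the metric names and thresholds are illustrative and express the maximum tolerated fractional degradation:

const canaryConfig: CanaryConfig = {
  initialTrafficPercent: 5,
  increments: [10, 25, 50, 100],
  metricsThresholds: {
    accuracy: 0.02,       // tolerate at most a 2% relative drop
    thumbs_up_rate: 0.05  // tolerate at most a 5% relative drop
  },
  monitoringDurationMinutes: 30
};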

Shadow Testing New Models

Run the new model in parallel with the production model and compare outputs without impacting users:

interface ShadowTestResult {
  queryId: string;
  productionOutput: string;
  newModelOutput: string;
  differenceSeverity: 'none' | 'minor' | 'moderate' | 'severe';
  agreement: number;
}

// Minimal interface assumed for the models under test
interface LLMModel {
  generate(prompt: string): Promise<string>;
}

class ShadowTester {
  async runShadowTest(
    productionModel: LLMModel,
    newModel: LLMModel,
    testQueries: string[]
  ): Promise<ShadowTestResult[]> {
    const results: ShadowTestResult[] = [];

    for (const query of testQueries) {
      const prodResponse = await productionModel.generate(query);
      const newResponse = await newModel.generate(query);

      const similarity = this.computeSimilarity(prodResponse, newResponse);

      results.push({
        queryId: this.generateId(),
        productionOutput: prodResponse,
        newModelOutput: newResponse,
        differenceSeverity: this.classifySeverity(similarity),
        agreement: similarity
      });
    }

    // Alert if agreement is unexpectedly low
    const avgAgreement = results.reduce((a, b) => a + b.agreement, 0) / results.length;
    if (avgAgreement < 0.7) {
      console.warn(`Low agreement detected: ${(avgAgreement * 100).toFixed(1)}%`);
    }

    return results;
  }

  private computeSimilarity(output1: string, output2: string): number {
    // Token overlap is a cheap proxy; prefer embedding-based semantic
    // similarity in production (see the sketch after this class)
    const tokens1 = output1.split(/\s+/);
    const tokens2 = output2.split(/\s+/);
    const common = tokens1.filter((t) => tokens2.includes(t)).length;
    return common / Math.max(tokens1.length, tokens2.length);
  }

  private classifySeverity(agreement: number): ShadowTestResult['differenceSeverity'] {
    if (agreement > 0.9) return 'none';
    if (agreement > 0.7) return 'minor';
    if (agreement > 0.5) return 'moderate';
    return 'severe';
  }

  private generateId(): string {
    return Math.random().toString(36).slice(2, 10);
  }
}
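
The token-overlap metric above is only a cheap proxy. When an embedding model is available, cosine similarity over embeddings tracks meaning more faithfully. A sketch, assuming a hypothetical `embed` function that maps a string to a vector:

async function semanticSimilarity(
  output1: string,
  output2: string,
  embed: (text: string) => Promise<number[]> // assumed embedding client
): Promise<number> {
  const [e1, e2] = await Promise.all([embed(output1), embed(output2)]);
  let dot = 0;
  let norm1 = 0;
  let norm2 = 0;
  for (let i = 0; i < e1.length; i++) {
    dot += e1[i] * e2[i];
    norm1 += e1[i] * e1[i];
    norm2 += e2[i] * e2[i];
  }
  return dot / (Math.sqrt(norm1) * Math.sqrt(norm2)); // cosine similarity
}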

Rollback Triggers

Define metrics that trigger automatic rollback:

interface RollbackPolicy {
  metrics: Array<{
    name: string;
    threshold: number;
    direction: 'greater' | 'less'; // greater for latency, less for accuracy
  }>;
  evalWindow: number; // minutes
  triggerAfter: number; // consecutive violations
}

class RollbackManager {
  private violationCounts: Record<string, number> = {};

  /** Returns true when the policy calls for rolling back the current model. */
  async monitorForRollback(
    currentModel: ModelVersion,
    policy: RollbackPolicy,
    metrics: Record<string, number>
  ): Promise<boolean> {
    for (const metricPolicy of policy.metrics) {
      const value = metrics[metricPolicy.name];
      const triggered =
        metricPolicy.direction === 'greater'
          ? value > metricPolicy.threshold
          : value < metricPolicy.threshold;

      if (triggered) {
        this.violationCounts[metricPolicy.name] =
          (this.violationCounts[metricPolicy.name] || 0) + 1;

        if (this.violationCounts[metricPolicy.name] >= policy.triggerAfter) {
          console.log(
            `Rollback triggered: ${metricPolicy.name} violated ${policy.triggerAfter} consecutive times`
          );
          return true;
        }
      } else {
        // Reset on recovery so only consecutive violations count
        this.violationCounts[metricPolicy.name] = 0;
      }
    }

    return false;
  }
}
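
An example policy with illustrative thresholds: roll back if p95 latency exceeds 2000 ms or accuracy drops below 0.85 for three consecutive evaluation windows:

const rollbackPolicy: RollbackPolicy = {
  metrics: [
    { name: 'p95_latency_ms', threshold: 2000, direction: 'greater' },
    { name: 'accuracy', threshold: 0.85, direction: 'less' }
  ],
  evalWindow: 5,
  triggerAfter: 3
};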

A/B Testing Model Versions

For longer-term comparisons, split users between the two versions and track business metrics:

interface ABTestConfig {
  modelA: ModelVersion;
  modelB: ModelVersion;
  trafficSplit: number; // 0.5 for 50/50
  duration: number; // days
  businessMetrics: string[];
  minSampleSize: number;
}

interface ABTestResult {
  winner: 'A' | 'B' | 'tie';
  metricsA: Record<string, number>;
  metricsB: Record<string, number>;
  confidence: number;
}

class ModelABTest {
  async runABTest(config: ABTestConfig): Promise<ABTestResult> {
    const usersA: string[] = [];
    const usersB: string[] = [];

    // Partition users
    const allUsers = await this.getAllUsers();
    for (const user of allUsers) {
      if (Math.random() < config.trafficSplit) {
        usersA.push(user);
      } else {
        usersB.push(user);
      }
    }

    // Collect metrics
    const metricsA = await this.collectMetrics(config.modelA, usersA, config.duration);
    const metricsB = await this.collectMetrics(config.modelB, usersB, config.duration);

    // Compare
    const winner = this.determineWinner(metricsA, metricsB);

    return {
      winner,
      metricsA,
      metricsB,
      confidence: this.calculateConfidence(metricsA, metricsB)
    };
  }

  private determineWinner(
    metricsA: Record<string, number>,
    metricsB: Record<string, number>
  ): 'A' | 'B' | 'tie' {
    let aWins = 0;
    let bWins = 0;

    for (const key in metricsA) {
      if (metricsA[key] > metricsB[key]) aWins++;
      else if (metricsB[key] > metricsA[key]) bWins++;
    }

    if (aWins > bWins) return 'A';
    if (bWins > aWins) return 'B';
    return 'tie';
  }

  private calculateConfidence(
    metricsA: Record<string, number>,
    metricsB: Record<string, number>
  ): number {
    // Use statistical test to calculate p-value
    return 0.95; // Placeholder
  }

  private async collectMetrics(
    model: ModelVersion,
    users: string[],
    durationDays: number
  ): Promise<Record<string, number>> {
    // Query analytics for business metrics
    return {};
  }

  private async getAllUsers(): Promise<string[]> {
    return [];
  }
}
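
One way to replace the `calculateConfidence` placeholder is a standard two-proportion z-test, which applies when the business metric is a rate (e.g., conversion). This is a sketch of one common choice, not the only valid test:

function twoProportionZTest(
  successesA: number, totalA: number,
  successesB: number, totalB: number
): number {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  const pooled = (successesA + successesB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = Math.abs(pA - pB) / se;
  return 1 - 2 * (1 - normalCdf(z)); // confidence = 1 - two-sided p-value
}

function normalCdf(x: number): number {
  // Zelen-Severo polynomial approximation of the standard normal CDF (x >= 0)
  const t = 1 / (1 + 0.2316419 * x);
  const d = 0.3989423 * Math.exp((-x * x) / 2);
  const poly =
    t * (0.3193815 + t * (-0.3565638 + t * (1.7814779 + t * (-1.8212560 + t * 1.3302744))));
  return 1 - d * poly;
}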

Version Pinning for Deterministic Behavior

Critical applications may require pinning to specific model versions:

interface ApplicationConfig {
  modelName: string;
  pinned: boolean;
  pinnedVersion?: string;
  allowMinorUpdates: boolean;
  maxMajorVersion: number;
}

class VersionConstraint {
  static checkCompatibility(
    app: ApplicationConfig,
    availableVersion: string
  ): boolean {
    if (app.pinned && app.pinnedVersion) {
      return availableVersion === app.pinnedVersion;
    }

    const next = parseModelVersion(availableVersion);

    if (app.pinnedVersion && !app.allowMinorUpdates) {
      // Only allow patch updates within the pinned major.minor
      const current = parseModelVersion(app.pinnedVersion);
      return current.major === next.major && current.minor === next.minor;
    }

    // Allow minor and patch updates up to the configured major version
    return next.major <= app.maxMajorVersion;
  }
}
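
Usage is straightforward; here an application pinned to a hypothetical 2.3.1 rejects every other version:

const appConfig: ApplicationConfig = {
  modelName: 'support-classifier',
  pinned: true,
  pinnedVersion: '2.3.1',
  allowMinorUpdates: false,
  maxMajorVersion: 2
};

VersionConstraint.checkCompatibility(appConfig, '2.3.1'); // true
VersionConstraint.checkCompatibility(appConfig, '2.4.0'); // false (pinned)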

Checklist

  • Implement semantic versioning for all models
  • Set up centralized model registry with metadata
  • Document training data version and date range for each model
  • Track evaluation metrics with each model version
  • Create canary deployment pipelines with graduated traffic shifts
  • Implement shadow testing before production deployment
  • Define rollback triggers based on monitoring metrics
  • Run A/B tests for significant model updates
  • Allow version pinning for critical applications
  • Automate model changelog generation from commits and eval changes

Conclusion

Model versioning is infrastructure for confident deployments. Semantic versioning prevents unexpected breaking changes. Model registries centralize governance. Canary deployments catch issues early. Shadow testing validates new models risk-free. Automated rollback prevents cascading failures. A/B testing validates business impact. Version pinning gives applications control. Together, these practices enable continuous improvement without breaking production systems.