Zero-Downtime AI System Updates — Deploying New Models and Prompts Without Outages

Introduction

Deploying a new AI model or prompt can break conversations: users suddenly get different behavior, and if the new model is broken, your entire service degrades. Zero-downtime updates require careful orchestration: shadow mode, feature flags, canary deployments, and rollback plans.

Deploying New LLM Versions Without Breaking Conversations

Conversations maintain context across messages. Switching models mid-conversation breaks continuity. Deploy carefully.

// Store model version in conversation
interface Conversation {
  id: string;
  userId: string;
  messages: Message[];
  modelVersion: string; // Track which model generated each response
  createdAt: Date;
  migratedAt?: Date; // When conversation switched models
}

// New conversations use new model
async function createConversation(userId: string) {
  const latestModel = await getLatestModel(); // e.g., gpt-4o-2024-05-13

  return await db.conversations.insertOne({
    id: uuidv4(),
    userId,
    messages: [],
    modelVersion: latestModel,
    createdAt: new Date()
  });
}

// Existing conversations stay on old model
async function getResponse(conversationId: string, userMessage: string) {
  const conversation = await db.conversations.findOne({ id: conversationId });
  if (!conversation) {
    throw new Error(`Conversation not found: ${conversationId}`);
  }

  const model = conversation.modelVersion; // Use the model the conversation started with

  const response = await openai.createChatCompletion({
    model,
    messages: [...conversation.messages, { role: 'user', content: userMessage }]
  });

  return response.choices[0].message.content;
}

// Optionally: migrate old conversations to new model
async function migrateConversationModel(conversationId: string) {
  const conversation = await db.conversations.findOne({ id: conversationId });
  const oldModel = conversation.modelVersion;
  const newModel = await getLatestModel();

  if (oldModel === newModel) return; // Already on latest

  // Update model version
  await db.conversations.updateOne(
    { id: conversationId },
    { $set: { modelVersion: newModel, migratedAt: new Date() } }
  );

  // Log the migration; surface a notice to the user that behavior may change
  logger.info('Conversation migrated to new model', {
    conversationId,
    from: oldModel,
    to: newModel
  });
}

Keep conversations on their original model version. Only migrate deliberately, and with user consent.
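The `getLatestModel` helper used above is assumed; one way to back it is a small registry so the default model is data, not code, and promoting a new model needs no redeploy. A minimal in-memory sketch (class and method names are illustrative, not from the original):

```typescript
// Minimal model registry sketch: new conversations read the current
// default from here, while existing conversations keep their stored
// modelVersion. Promoting a model is a data change, not a deploy.
class ModelRegistry {
  private latest: string;

  constructor(initialModel: string) {
    this.latest = initialModel;
  }

  // Called by getLatestModel() when creating new conversations
  getLatest(): string {
    return this.latest;
  }

  // Called by an admin tool or rollout script to promote a model
  promote(model: string): void {
    this.latest = model;
  }
}

const registry = new ModelRegistry('gpt-4o-2024-05-13');
registry.promote('gpt-4o-2024-08-06');
// New conversations now get the promoted model; old ones are unaffected
```

In production this would typically live in a config table or feature-flag service rather than process memory, so all instances agree on the current default.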

Shadow Mode for New Models

Run both old and new models. Compare outputs before switching.

// Generate responses from both models
async function generateWithShadow(prompt: string) {
  const oldModel = 'gpt-4o-2024-05-13';
  const newModel = 'gpt-4o-2024-08-06';

  // Primary: return old model's response to user
  const primaryPromise = openai.createChatCompletion({
    model: oldModel,
    messages: [{ role: 'user', content: prompt }]
  });

  // Shadow: run new model in background (don't wait)
  const shadowPromise = openai.createChatCompletion({
    model: newModel,
    messages: [{ role: 'user', content: prompt }]
  }).then(response => {
    return {
      model: newModel,
      content: response.choices[0].message.content,
      tokens: response.usage.total_tokens
    };
  }).catch(error => {
    logger.error('Shadow model error', { error, model: newModel });
    return null;
  });

  const [primary, shadow] = await Promise.all([primaryPromise, shadowPromise]);

  // Return primary response immediately
  const response = primary.choices[0].message.content;

  // Compare in background
  setImmediate(async () => {
    if (shadow) {
      const comparison = {
        prompt,
        oldModel,
        oldResponse: response,
        newModel,
        newResponse: shadow.content,
        similarity: calculateSimilarity(response, shadow.content),
        timestamp: new Date()
      };

      await db.modelComparisons.insertOne(comparison);

      // Alert if new model diverges significantly
      if (comparison.similarity < 0.7) {
        logger.warn('New model diverges from current model', comparison);
      }
    }
  });

  return response;
}

// Analyze shadow comparisons
async function analyzeModelMigration() {
  const comparisons = await db.modelComparisons.find({
    timestamp: { $gte: last7Days() }
  });

  if (comparisons.length === 0) {
    return { sampleSize: 0, readyToSwitch: false };
  }

  const similarities = comparisons.map(c => c.similarity);
  const avgSimilarity = similarities.reduce((a, b) => a + b, 0) / similarities.length;
  const minSimilarity = Math.min(...similarities);

  return {
    sampleSize: comparisons.length,
    averageSimilarity: avgSimilarity,
    minSimilarity,
    readyToSwitch: avgSimilarity > 0.9 && minSimilarity > 0.7
  };
}

Shadow mode catches model regressions before they hit users. Run for 1-2 weeks, analyze results, then switch.
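The `calculateSimilarity` helper in the shadow path is assumed; one cheap option is token-level Jaccard overlap (embedding cosine similarity or an LLM judge are stronger but costlier). A sketch:

```typescript
// Rough lexical similarity via Jaccard overlap of lowercased word sets.
// Cheap enough to run on every shadow comparison; treat the score as a
// screening signal, not ground truth.
function calculateSimilarity(a: string, b: string): number {
  const tokenize = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));

  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1;

  let intersection = 0;
  for (const token of setA) {
    if (setB.has(token)) intersection++;
  }
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}
```

Lexical overlap will under-score paraphrases that mean the same thing, so tune the 0.7 alert threshold against a sample of real divergences before trusting it.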

Feature Flags for Model Switching

Use feature flags to control model versions.

async function getModelForUser(userId: string) {
  // Get user's feature flags
  const flags = await featureFlags.getForUser(userId);

  // 1% of users get new model
  if (flags['use_gpt4o_2024_08'] === true) {
    return 'gpt-4o-2024-08-06';
  }

  // Everyone else gets stable model
  return 'gpt-4o-2024-05-13';
}

// Incrementally roll out: each step runs on a separate day (via a
// scheduler or manual trigger), gated on metrics from the previous stage
const ROLLOUT_STAGES = [
  { day: 1, percentage: 0.01 },
  { day: 3, percentage: 0.05 },
  { day: 7, percentage: 0.25 },
  { day: 14, percentage: 1.0 }
];

async function advanceRollout(stageIndex: number) {
  const { percentage } = ROLLOUT_STAGES[stageIndex];
  await featureFlags.updatePercentageRollout('use_gpt4o_2024_08', percentage);
  logger.info('Rollout advanced', { stageIndex, percentage });
}

// Kill switch: revert instantly
async function killNewModel() {
  await featureFlags.updatePercentageRollout('use_gpt4o_2024_08', 0);
  logger.warn('Reverted to stable model');
}

Incremental rollout catches issues early: if the new model has bugs, only 1% of users see them before you hit the kill switch.
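The day-by-day percentages are a schedule, not a policy: each step should be gated on observed metrics before advancing. A simplified gating sketch (the stage list and error-rate cutoff are assumptions):

```typescript
// Decide the next rollout percentage from the current stage and the
// observed error rate on the new model. Returns 0 (kill switch) if the
// new model is misbehaving. Thresholds here are illustrative.
const STAGES = [0.01, 0.05, 0.25, 1.0];
const MAX_ERROR_RATE = 0.02; // Assumed acceptable error rate

function nextRolloutPercentage(current: number, errorRate: number): number {
  if (errorRate > MAX_ERROR_RATE) return 0; // Revert to stable model

  const idx = STAGES.indexOf(current);
  if (idx === -1) return STAGES[0];          // Not started yet
  if (idx === STAGES.length - 1) return 1.0; // Fully rolled out
  return STAGES[idx + 1];
}
```

A scheduler can call this daily with the latest error rate and feed the result into the feature-flag service, making both advancement and the kill switch automatic.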

Prompt Versioning With Rollback Capability

Prompts change behavior. Version them.

interface PromptVersion {
  id: string;
  name: string; // e.g., "summarization_v1"
  content: string;
  version: number;
  status: 'draft' | 'active' | 'deprecated';
  createdAt: Date;
  createdBy: string;
  metrics?: {
    testPassRate: number;
    userSatisfaction: number;
    cost: number;
  };
}

// Store prompt versions
async function createPromptVersion(name: string, content: string) {
  const existing = await db.prompts.findOne({ name });
  const nextVersion = existing ? existing.version + 1 : 1;

  return await db.prompts.insertOne({
    id: uuidv4(),
    name,
    content,
    version: nextVersion,
    status: 'draft',
    createdAt: new Date(),
    createdBy: getCurrentUser()
  });
}

// Test prompt before activation
async function testPromptVersion(promptVersionId: string, testCases: Array<{ input: string; expectedOutput: string }>) {
  const prompt = await db.prompts.findOne({ id: promptVersionId });

  let passCount = 0;

  for (const testCase of testCases) {
    const response = await openai.createChatCompletion({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: prompt.content },
        { role: 'user', content: testCase.input }
      ]
    });

    if (isSimilar(response.choices[0].message.content, testCase.expectedOutput)) {
      passCount++;
    }
  }

  const passRate = passCount / testCases.length;

  await db.prompts.updateOne(
    { id: promptVersionId },
    { $set: { metrics: { testPassRate: passRate } } }
  );

  return { passRate, passed: passRate >= 0.8 };
}

// Activate prompt
async function activatePromptVersion(promptVersionId: string) {
  const prompt = await db.prompts.findOne({ id: promptVersionId });

  if ((prompt.metrics?.testPassRate || 0) < 0.8) {
    throw new Error('Prompt did not pass tests');
  }

  // Deactivate previous version
  await db.prompts.updateOne(
    { name: prompt.name, status: 'active' },
    { $set: { status: 'deprecated' } }
  );

  // Activate new version
  await db.prompts.updateOne(
    { id: promptVersionId },
    { $set: { status: 'active' } }
  );
}

// Rollback to previous version
async function rollbackPrompt(name: string) {
  const versions = await db.prompts.find({ name }).sort({ version: -1 });

  const current = versions[0];
  const previous = versions[1];

  if (!previous) {
    throw new Error('No previous version to roll back to');
  }

  await db.prompts.updateOne(
    { id: current.id },
    { $set: { status: 'deprecated' } }
  );

  await db.prompts.updateOne(
    { id: previous.id },
    { $set: { status: 'active' } }
  );

  logger.info('Prompt rolled back', { name, from: current.version, to: previous.version });
}

// Use active prompt
async function getActivePrompt(name: string) {
  const prompt = await db.prompts.findOne({ name, status: 'active' });

  if (!prompt) {
    throw new Error(`No active prompt: ${name}`);
  }

  return prompt.content;
}

Test prompts before deployment. Rollback takes seconds.
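Since `getActivePrompt` runs on every request, a short TTL cache keeps the database off the hot path while still picking up activations and rollbacks within seconds. A sketch (the cache shape and TTL are assumptions; the loader is injected for testability):

```typescript
// TTL cache over a prompt loader: rollbacks propagate within ttlMs
// without a DB read per request.
function cachedPromptLoader(
  load: (name: string) => Promise<string>,
  ttlMs = 10_000
) {
  const cache = new Map<string, { content: string; expires: number }>();

  return async function getPrompt(name: string): Promise<string> {
    const hit = cache.get(name);
    if (hit && hit.expires > Date.now()) return hit.content;

    const content = await load(name); // e.g. getActivePrompt(name)
    cache.set(name, { content, expires: Date.now() + ttlMs });
    return content;
  };
}
```

With a 10-second TTL, a rollback becomes visible on every instance within about 10 seconds, which is usually an acceptable trade-off against one DB read per request.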

A/B Testing Prompts in Production

Run experiments: which prompt version performs better?

async function generateWithExperiment(userId: string, prompt: string) {
  // Assign user to experiment variant
  const variant = await getExperimentVariant(userId, 'summarization_v1_vs_v2');

  let promptContent;

  if (variant === 'control') {
    promptContent = await getActivePrompt('summarization_v1');
  } else {
    promptContent = await getActivePrompt('summarization_v2');
  }

  const response = await openai.createChatCompletion({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: promptContent },
      { role: 'user', content: prompt }
    ]
  });

  // Record which variant user saw
  await db.experiments.insertOne({
    experimentId: 'summarization_v1_vs_v2',
    userId,
    variant,
    prompt,
    response: response.choices[0].message.content,
    timestamp: new Date()
  });

  return response.choices[0].message.content;
}

// Analyze experiment results
async function analyzeExperiment(experimentId: string) {
  const results = await db.experiments.find({ experimentId });

  const byVariant: Record<string, { scores: number[] }> = {};
  for (const result of results) {
    if (!byVariant[result.variant]) {
      byVariant[result.variant] = { scores: [] };
    }

    // Score response quality (or use user feedback)
    const score = await scoreResponse(result.response);
    byVariant[result.variant].scores.push(score);
  }

  return {
    control: {
      avgScore: avg(byVariant.control.scores),
      sampleSize: byVariant.control.scores.length
    },
    variant: {
      avgScore: avg(byVariant.variant.scores),
      sampleSize: byVariant.variant.scores.length
    },
    winner: avg(byVariant.variant.scores) > avg(byVariant.control.scores) ? 'variant' : 'control'
  };
}

A/B tests reveal which prompt performs better in the wild. Use results to decide rollout.
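Declaring a winner on raw averages alone is fragile. At minimum, gate the decision on sample size and a margin; a real analysis would use a proper significance test. A simplified sketch (both thresholds are assumptions):

```typescript
// Only declare a winner with enough samples and a clear margin;
// otherwise keep the experiment running. Thresholds are illustrative.
function decideWinner(
  controlScores: number[],
  variantScores: number[],
  minSamples = 200,
  minMargin = 0.02
): 'control' | 'variant' | 'inconclusive' {
  if (controlScores.length < minSamples || variantScores.length < minSamples) {
    return 'inconclusive';
  }

  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const diff = avg(variantScores) - avg(controlScores);

  if (Math.abs(diff) < minMargin) return 'inconclusive';
  return diff > 0 ? 'variant' : 'control';
}
```

Treating "inconclusive" as a first-class outcome prevents shipping a prompt change on noise from a few dozen samples.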

Canary Deployment for RAG Pipeline Changes

When updating RAG (document embedding, retrieval), deploy to a small subset first.

async function searchDocumentsWithCanary(userId: string, query: string) {
  // 10% of users get new RAG pipeline
  const useNewRAG = shouldCanary(userId, 0.1);

  if (useNewRAG) {
    return await newRagPipeline.search(query, { tenantId: userId });
  }

  return await oldRagPipeline.search(query, { tenantId: userId });
}

// New RAG: new embedding model, new retrieval logic
const newRagPipeline = {
  async search(query: string, options: { tenantId: string }) {
    // Use new embedding model
    const embedding = await model.embed(query, {
      model: 'text-embedding-3-large' // New model
    });

    // New retrieval: hybrid search (keyword + semantic)
    const results = await vectorDb.hybridSearch(embedding, query, {
      filter: { tenantId: options.tenantId },
      topK: 10,
      returnMetadata: true
    });

    return results;
  }
};

// Monitor canary quality
async function monitorCanary() {
  const canaryResults = await db.searches.find({ usedNewRAG: true });
  const controlResults = await db.searches.find({ usedNewRAG: false });

  // Compare metrics
  const canaryRelevance = calculateAverageRelevance(canaryResults);
  const controlRelevance = calculateAverageRelevance(controlResults);

  if (canaryRelevance < controlRelevance * 0.95) {
    // New RAG is worse, revert canary
    logger.warn('Canary RAG underperforming, reverting');
    await featureFlags.disable('use_new_rag');
  } else if (canaryRelevance > controlRelevance * 1.05) {
    // New RAG is better, gradually roll out
    await featureFlags.updatePercentageRollout('use_new_rag', 0.25);
  }
}

Canary deployments catch quality regressions on 10% of users instead of 100%.
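The `shouldCanary` helper above is assumed; one common implementation hashes the user ID deterministically, so a user stays in the same bucket across requests and raising the canary percentage only adds users, never reshuffles them. A sketch using FNV-1a (the hash choice is an assumption):

```typescript
// Deterministic canary bucketing: hash the user ID into [0, 1) and
// compare against the rollout percentage. Same user -> same bucket,
// every request, on every instance.
function shouldCanary(userId: string, percentage: number): boolean {
  // FNV-1a 32-bit hash
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  const bucket = (hash >>> 0) / 0x100000000; // Normalize to [0, 1)
  return bucket < percentage;
}
```

Salting the hash with the experiment name (e.g. hashing `'use_new_rag:' + userId`) keeps buckets independent across concurrent rollouts, so the same users are not always first in line.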

Embedding Model Migration (Re-Embedding Strategy)

Changing embedding models requires re-embedding all documents. Plan this carefully.

async function migrateEmbeddingModel(oldModel: string, newModel: string) {
  // Phase 1: Create parallel vector store with new embeddings
  const newVectorDb = new VectorDB({
    name: `vectordb_${newModel}`,
    dimension: await getModelDimension(newModel)
  });

  logger.info('Starting embedding migration', { oldModel, newModel });

  // Phase 2: Batch re-embed all documents
  const allDocuments = await db.documents.find({});
  const batchSize = 100;

  for (let i = 0; i < allDocuments.length; i += batchSize) {
    const batch = allDocuments.slice(i, i + batchSize);

    const embeddings = await Promise.all(
      batch.map(doc =>
        model.embed(doc.content, { model: newModel })
      )
    );

    // Insert into new vector DB
    await Promise.all(
      batch.map((doc, idx) =>
        newVectorDb.insert({
          embedding: embeddings[idx],
          documentId: doc.id,
          tenantId: doc.tenantId,
          content: doc.content
        })
      )
    );

    logger.info('Re-embedded batch', {
      oldModel,
      newModel,
      progress: `${Math.min(i + batchSize, allDocuments.length)} of ${allDocuments.length}`
    });
  }

  // Phase 3: Dual-write (new documents go to both old and new vector DBs)
  await featureFlags.enable('dual_write_embeddings');

  // In practice, let each phase run for days and verify retrieval quality
  // before advancing to the next.

  // Phase 4: Dual-read (try new first, fall back to old)
  await featureFlags.enable('read_new_embeddings_first');

  // Phase 5: Cutover (read only from new)
  await featureFlags.disable('read_old_embeddings_fallback');

  logger.info('Embedding migration complete', { oldModel, newModel });
}

// During migration: dual write
async function indexDocument(tenantId: string, document: string) {
  const chunks = splitIntoChunks(document);

  const oldEmbeddings = await model.embed(chunks, { model: 'text-embedding-3-small' });
  const newEmbeddings = await model.embed(chunks, { model: 'text-embedding-3-large' });

  // Write to both vector DBs
  await Promise.all([
    oldVectorDb.insert({ chunks, embeddings: oldEmbeddings, tenantId }),
    newVectorDb.insert({ chunks, embeddings: newEmbeddings, tenantId })
  ]);
}

Embedding migrations take time. Use dual-write/dual-read patterns to migrate without downtime.
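The dual-read phase (Phase 4 above) can be a thin wrapper: query the new store first and fall back to the old one on error or empty results. A sketch with the two search functions injected as assumptions, so the wrapper stays independent of any particular vector DB client:

```typescript
// Dual-read: prefer the new vector store, fall back to the old one if
// the new store errors or returns nothing. During migration the old
// store remains authoritative, so a broken new store degrades gracefully.
async function dualReadSearch<T>(
  searchNew: () => Promise<T[]>,
  searchOld: () => Promise<T[]>
): Promise<T[]> {
  try {
    const results = await searchNew();
    if (results.length > 0) return results;
  } catch {
    // New store failed; fall through to the old store
  }
  return searchOld();
}
```

Logging which branch served each query gives you a fallback rate to watch: cut over (Phase 5) only once it stays near zero.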

Conversation Context Migration

When schema changes, migrate existing conversation history.

// Old schema
interface OldConversation {
  id: string;
  messages: Array<{ role: string; content: string }>;
}

// New schema: add metadata
interface NewConversation {
  id: string;
  messages: Array<{ role: string; content: string; model: string; timestamp: Date }>;
  modelVersion: string;
  metadata: Record<string, any>;
}

async function migrateConversationSchema() {
  const oldConversations = await db.conversations.find({});

  for (const oldConv of oldConversations) {
    const newConv: NewConversation = {
      id: oldConv.id,
      messages: oldConv.messages.map(msg => ({
        role: msg.role,
        content: msg.content,
        model: 'gpt-4o-2024-05-13', // Assume all old messages used this
        timestamp: oldConv.createdAt // Approximate: per-message timestamps were not stored
      })),
      modelVersion: 'gpt-4o-2024-05-13',
      metadata: {}
    };

    // Use upsert to handle partial migrations
    await db.conversations.updateOne(
      { id: newConv.id },
      { $set: newConv },
      { upsert: true }
    );
  }
}

Upserts allow phased migrations. New conversations use new schema. Old conversations are migrated in the background.

Checklist

  • Keep conversations on their original model version
  • Run shadow mode for 1-2 weeks before switching models
  • Use feature flags with percentage rollouts (1% → 5% → 25% → 100%)
  • Version prompts; test before deployment; rollback in seconds
  • A/B test new prompts; measure user satisfaction
  • Canary RAG changes to 10% of users first
  • Plan embedding model migrations with dual-write/dual-read phases
  • Use upserts for gradual schema migrations
  • Monitor metrics during rollouts; kill switch if needed
  • Test zero-downtime deployments in staging before production

Conclusion

Zero-downtime AI updates require paranoia. Shadow mode catches regressions. Feature flags enable gradual rollout. Canary deployments limit blast radius. Prompt versioning enables instant rollback. Combining these patterns makes model and prompt updates boring, safe, and reversible.