Zero-Downtime AI System Updates — Deploying New Models and Prompts Without Outages
Introduction
Deploying new AI models or prompts breaks conversations. Users suddenly get different behavior. Worse: if the new model is broken, your entire service degrades. Zero-downtime updates require careful orchestration: shadow mode, feature flags, canary deployments, and rollback plans.
- Deploying New LLM Versions Without Breaking Conversations
- Shadow Mode for New Models
- Feature Flags for Model Switching
- Prompt Versioning With Rollback Capability
- A/B Testing Prompts in Production
- Canary Deployment for RAG Pipeline Changes
- Embedding Model Migration (Re-Embedding Strategy)
- Conversation Context Migration
- Checklist
- Conclusion
Deploying New LLM Versions Without Breaking Conversations
Conversations maintain context across messages. Switching models mid-conversation breaks continuity. Deploy carefully.
```typescript
// Store the model version alongside each conversation
interface Conversation {
  id: string;
  userId: string;
  messages: Message[];
  modelVersion: string; // Track which model generated each response
  createdAt: Date;
  migratedAt?: Date; // When the conversation switched models
}

// New conversations use the new model
async function createConversation(userId: string) {
  const latestModel = await getLatestModel(); // e.g., gpt-4o-2024-05-13
  return await db.conversations.insertOne({
    id: uuidv4(),
    userId,
    messages: [],
    modelVersion: latestModel,
    createdAt: new Date()
  });
}

// Existing conversations stay on their old model
async function getResponse(conversationId: string, userMessage: string) {
  const conversation = await db.conversations.findOne({ id: conversationId });
  const model = conversation.modelVersion; // Use the model the conversation started with
  const response = await openai.createChatCompletion({
    model,
    messages: [
      ...conversation.messages,
      { role: 'user', content: userMessage } // Include the new message in the request
    ]
  });
  return response.choices[0].message.content;
}

// Optionally: migrate old conversations to the new model
async function migrateConversationModel(conversationId: string) {
  const conversation = await db.conversations.findOne({ id: conversationId });
  const oldModel = conversation.modelVersion;
  const newModel = await getLatestModel();
  if (oldModel === newModel) return; // Already on latest

  await db.conversations.updateOne(
    { id: conversationId },
    { $set: { modelVersion: newModel, migratedAt: new Date() } }
  );

  // Warn the user: behavior may change
  logger.info('Conversation migrated to new model', {
    conversationId,
    from: oldModel,
    to: newModel
  });
}
```
Keep conversations on their original model version. Only migrate deliberately, and with user consent.
Shadow Mode for New Models
Run both old and new models. Compare outputs before switching.
```typescript
// Generate responses from both models
async function generateWithShadow(prompt: string) {
  const oldModel = 'gpt-4o-2024-05-13';
  const newModel = 'gpt-4o-2024-08-06';

  // Shadow: start the new model in the background (never block on it)
  const shadowPromise = openai.createChatCompletion({
    model: newModel,
    messages: [{ role: 'user', content: prompt }]
  }).then(response => ({
    model: newModel,
    content: response.choices[0].message.content,
    tokens: response.usage.total_tokens
  })).catch(error => {
    logger.error('Shadow model error', { error, model: newModel });
    return null;
  });

  // Primary: the user only ever waits on the old model's response
  const primary = await openai.createChatCompletion({
    model: oldModel,
    messages: [{ role: 'user', content: prompt }]
  });
  const response = primary.choices[0].message.content;

  // Compare in the background once the shadow call resolves
  setImmediate(async () => {
    const shadow = await shadowPromise;
    if (shadow) {
      const comparison = {
        prompt,
        oldModel,
        oldResponse: response,
        newModel,
        newResponse: shadow.content,
        similarity: calculateSimilarity(response, shadow.content),
        timestamp: new Date()
      };
      await db.modelComparisons.insertOne(comparison);

      // Alert if the new model diverges significantly
      if (comparison.similarity < 0.7) {
        logger.warn('New model diverges from current model', comparison);
      }
    }
  });

  return response;
}

// Analyze shadow comparisons
async function analyzeModelMigration() {
  const comparisons = await db.modelComparisons.find({
    timestamp: { $gte: last7Days() }
  });
  const similarities = comparisons.map(c => c.similarity);
  const avgSimilarity = similarities.reduce((a, b) => a + b, 0) / similarities.length;
  const minSimilarity = Math.min(...similarities);
  return {
    sampleSize: comparisons.length,
    averageSimilarity: avgSimilarity,
    minSimilarity,
    readyToSwitch: avgSimilarity > 0.9 && minSimilarity > 0.7
  };
}
```
Shadow mode catches model regressions before they hit users. Run for 1-2 weeks, analyze results, then switch.
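The `calculateSimilarity` helper above is left abstract. A minimal sketch, assuming token-level Jaccard overlap is an acceptable first signal (embedding-based cosine similarity is the usual upgrade for semantic comparison):

```typescript
// Token-level Jaccard similarity: fast, dependency-free, and good enough
// to flag gross divergence between two model responses.
function calculateSimilarity(a: string, b: string): number {
  const tokenize = (text: string) =>
    new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  if (tokensA.size === 0 && tokensB.size === 0) return 1;
  let intersection = 0;
  for (const token of tokensA) {
    if (tokensB.has(token)) intersection++;
  }
  const union = tokensA.size + tokensB.size - intersection;
  return intersection / union;
}
```

Identical responses score 1, fully disjoint responses score 0; the 0.7 alert threshold above would then flag pairs sharing fewer than ~70% of their tokens.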
Feature Flags for Model Switching
Use feature flags to control model versions.
```typescript
async function getModelForUser(userId: string) {
  // Get the user's feature flags
  const flags = await featureFlags.getForUser(userId);

  // Users inside the rollout percentage get the new model
  if (flags['use_gpt4o_2024_08'] === true) {
    return 'gpt-4o-2024-08-06';
  }

  // Everyone else gets the stable model
  return 'gpt-4o-2024-05-13';
}

// Incremental rollout: each step runs only after the previous one
// has been monitored (triggered by cron, runbook, or release tooling)
async function rolloutNewModel(day: number) {
  const schedule: Record<number, number> = {
    1: 0.01,  // Day 1: 1% of users
    3: 0.05,  // Day 3: 5%
    7: 0.25,  // Day 7: 25%
    14: 1.0   // Day 14: 100%
  };
  const percentage = schedule[day];
  if (percentage === undefined) return;
  await featureFlags.updatePercentageRollout('use_gpt4o_2024_08', percentage);
}

// Kill switch: revert instantly
async function killNewModel() {
  await featureFlags.updatePercentageRollout('use_gpt4o_2024_08', 0);
  logger.warn('Reverted to stable model');
}
```
Incremental rollout catches issues early: if the new model has a bug, only 1% of users see it, and the kill switch reverts them instantly.
Prompt Versioning With Rollback Capability
Prompts change behavior. Version them.
```typescript
interface PromptVersion {
  id: string;
  name: string; // e.g., "summarization_v1"
  content: string;
  version: number;
  status: 'draft' | 'active' | 'deprecated';
  createdAt: Date;
  createdBy: string;
  metrics?: {
    testPassRate: number;
    userSatisfaction: number;
    cost: number;
  };
}

// Store prompt versions
async function createPromptVersion(name: string, content: string) {
  // Find the highest existing version, not just any matching document
  const [latest] = await db.prompts.find({ name }).sort({ version: -1 }).limit(1);
  const nextVersion = latest ? latest.version + 1 : 1;
  return await db.prompts.insertOne({
    id: uuidv4(),
    name,
    content,
    version: nextVersion,
    status: 'draft',
    createdAt: new Date(),
    createdBy: getCurrentUser()
  });
}

// Test a prompt before activation
async function testPromptVersion(
  promptVersionId: string,
  testCases: Array<{ input: string; expectedOutput: string }>
) {
  const prompt = await db.prompts.findOne({ id: promptVersionId });
  let passCount = 0;
  for (const testCase of testCases) {
    const response = await openai.createChatCompletion({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: prompt.content },
        { role: 'user', content: testCase.input }
      ]
    });
    if (isSimilar(response.choices[0].message.content, testCase.expectedOutput)) {
      passCount++;
    }
  }
  const passRate = passCount / testCases.length;
  await db.prompts.updateOne(
    { id: promptVersionId },
    { $set: { metrics: { testPassRate: passRate } } }
  );
  return { passRate, passed: passRate > 0.8 };
}

// Activate a prompt
async function activatePromptVersion(promptVersionId: string) {
  const prompt = await db.prompts.findOne({ id: promptVersionId });
  if ((prompt.metrics?.testPassRate || 0) < 0.8) {
    throw new Error('Prompt did not pass tests');
  }
  // Deactivate the previous version
  await db.prompts.updateOne(
    { name: prompt.name, status: 'active' },
    { $set: { status: 'deprecated' } }
  );
  // Activate the new version
  await db.prompts.updateOne(
    { id: promptVersionId },
    { $set: { status: 'active' } }
  );
}

// Roll back to the previous version
async function rollbackPrompt(name: string) {
  // Only consider versions that have actually shipped (skip drafts)
  const versions = await db.prompts
    .find({ name, status: { $in: ['active', 'deprecated'] } })
    .sort({ version: -1 });
  const current = versions[0];
  const previous = versions[1];
  if (!previous) {
    throw new Error('No previous version to roll back to');
  }
  await db.prompts.updateOne({ id: current.id }, { $set: { status: 'deprecated' } });
  await db.prompts.updateOne({ id: previous.id }, { $set: { status: 'active' } });
  logger.info('Prompt rolled back', { name, from: current.version, to: previous.version });
}

// Use the active prompt
async function getActivePrompt(name: string) {
  const prompt = await db.prompts.findOne({ name, status: 'active' });
  if (!prompt) {
    throw new Error(`No active prompt: ${name}`);
  }
  return prompt.content;
}
```
Test prompts before deployment. Rollback takes seconds.
A/B Testing Prompts in Production
Run experiments: which prompt version performs better?
```typescript
async function generateWithExperiment(userId: string, prompt: string) {
  // Assign the user to an experiment variant
  const variant = await getExperimentVariant(userId, 'summarization_v1_vs_v2');
  const promptContent = variant === 'control'
    ? await getActivePrompt('summarization_v1')
    : await getActivePrompt('summarization_v2');

  const response = await openai.createChatCompletion({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: promptContent },
      { role: 'user', content: prompt }
    ]
  });

  // Record which variant the user saw
  await db.experiments.insertOne({
    experimentId: 'summarization_v1_vs_v2',
    userId,
    variant,
    prompt,
    response: response.choices[0].message.content,
    timestamp: new Date()
  });
  return response.choices[0].message.content;
}

// Analyze experiment results
async function analyzeExperiment(experimentId: string) {
  const results = await db.experiments.find({ experimentId });
  const byVariant: Record<string, { scores: number[] }> = {};
  for (const result of results) {
    if (!byVariant[result.variant]) {
      byVariant[result.variant] = { scores: [] };
    }
    // Score response quality (or use user feedback)
    const score = await scoreResponse(result.response);
    byVariant[result.variant].scores.push(score);
  }
  // Guard against a variant with no recorded results yet
  const controlScores = byVariant.control?.scores ?? [];
  const variantScores = byVariant.variant?.scores ?? [];
  return {
    control: { avgScore: avg(controlScores), sampleSize: controlScores.length },
    variant: { avgScore: avg(variantScores), sampleSize: variantScores.length },
    winner: avg(variantScores) > avg(controlScores) ? 'variant' : 'control'
  };
}
```
A/B tests reveal which prompt performs better in the wild. Use results to decide rollout.
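Comparing raw averages can mislead on small samples. A sketch of a significance check using Welch's t-statistic, so the experiment declares a winner only when the gap is unlikely to be noise (the 1.96 threshold approximates p < 0.05 for large samples; function names here are illustrative):

```typescript
// Welch's t-statistic for two independent samples of quality scores.
function welchT(control: number[], variant: number[]): number {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const se = Math.sqrt(
    variance(control) / control.length + variance(variant) / variant.length
  );
  return (mean(variant) - mean(control)) / se;
}

// Declare a winner only when |t| clears the ~p < 0.05 threshold.
function pickWinner(control: number[], variant: number[]) {
  const t = welchT(control, variant);
  if (Math.abs(t) < 1.96) return 'inconclusive'; // Keep the experiment running
  return t > 0 ? 'variant' : 'control';
}
```

An "inconclusive" result is a valid outcome: keep collecting samples rather than shipping a prompt on a coin flip.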
Canary Deployment for RAG Pipeline Changes
When updating the RAG pipeline (embedding model, retrieval logic), deploy the change to a small subset of users first.
```typescript
async function searchDocumentsWithCanary(userId: string, query: string) {
  // 10% of users get the new RAG pipeline; keep this fraction in sync
  // with the 'use_new_rag' flag that monitorCanary() adjusts
  const useNewRAG = shouldCanary(userId, 0.1);
  if (useNewRAG) {
    return await newRagPipeline.search(query, { tenantId: userId });
  }
  return await oldRagPipeline.search(query, { tenantId: userId });
}

// New RAG: new embedding model, new retrieval logic
const newRagPipeline = {
  async search(query: string, options: { tenantId: string }) {
    // Use the new embedding model
    const embedding = await model.embed(query, {
      model: 'text-embedding-3-large' // New model
    });
    // New retrieval: hybrid search (keyword + semantic)
    const results = await vectorDb.hybridSearch(embedding, query, {
      filter: { tenantId: options.tenantId },
      topK: 10,
      returnMetadata: true
    });
    return results;
  }
};

// Monitor canary quality
async function monitorCanary() {
  const canaryResults = await db.searches.find({ usedNewRAG: true });
  const controlResults = await db.searches.find({ usedNewRAG: false });

  // Compare metrics
  const canaryRelevance = calculateAverageRelevance(canaryResults);
  const controlRelevance = calculateAverageRelevance(controlResults);

  if (canaryRelevance < controlRelevance * 0.95) {
    // New RAG is worse: revert the canary
    logger.warn('Canary RAG underperforming, reverting');
    await featureFlags.disable('use_new_rag');
  } else if (canaryRelevance > controlRelevance * 1.05) {
    // New RAG is better: widen the rollout
    await featureFlags.updatePercentageRollout('use_new_rag', 0.25);
  }
}
```
Canary deployments catch quality regressions on 10% of users instead of 100%.
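The `shouldCanary` helper above is assumed. A sketch using a stable hash of the user ID, so a given user is consistently in or out of the canary rather than flipping between pipelines on every request:

```typescript
import { createHash } from 'crypto';

// Deterministically bucket a user into [0, 1) from a hash of their ID,
// so the same user always lands on the same side of the canary split.
function shouldCanary(userId: string, fraction: number): boolean {
  const digest = createHash('sha256').update(userId).digest();
  const bucket = digest.readUInt32BE(0) / 0x100000000; // uniform in [0, 1)
  return bucket < fraction;
}
```

The same userId always yields the same answer, and across many users roughly `fraction` of them fall inside the canary.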
Embedding Model Migration (Re-Embedding Strategy)
Changing embedding models requires re-embedding all documents. Plan this carefully.
```typescript
async function migrateEmbeddingModel(oldModel: string, newModel: string) {
  // Phase 1: Create a parallel vector store with the new dimensionality
  const newVectorDb = new VectorDB({
    name: `vectordb_${newModel}`,
    dimension: await getModelDimension(newModel)
  });
  logger.info('Starting embedding migration', { oldModel, newModel });

  // Phase 2: Batch re-embed all documents
  // (for large corpora, stream with a cursor instead of loading everything)
  const allDocuments = await db.documents.find({});
  const batchSize = 100;
  for (let i = 0; i < allDocuments.length; i += batchSize) {
    const batch = allDocuments.slice(i, i + batchSize);
    const embeddings = await Promise.all(
      batch.map(doc => model.embed(doc.content, { model: newModel }))
    );
    // Insert into the new vector DB
    await Promise.all(
      batch.map((doc, idx) =>
        newVectorDb.insert({
          embedding: embeddings[idx],
          documentId: doc.id,
          tenantId: doc.tenantId,
          content: doc.content
        })
      )
    );
    logger.info('Re-embedded batch', {
      oldModel,
      newModel,
      progress: `${Math.min(i + batchSize, allDocuments.length)} of ${allDocuments.length}`
    });
  }

  // Phases 3-5 are separated in time; flip each flag only after the
  // previous phase has been verified in production:
  // Phase 3: Dual-write (new documents go to both vector DBs)
  await featureFlags.enable('dual_write_embeddings');
  // Phase 4: Dual-read (try new first, fall back to old)
  await featureFlags.enable('read_new_embeddings_first');
  // Phase 5: Cutover (read only from new)
  await featureFlags.disable('read_old_embeddings_fallback');
  logger.info('Embedding migration complete', { oldModel, newModel });
}

// During migration: dual write
async function indexDocument(tenantId: string, document: string) {
  const chunks = splitIntoChunks(document);
  const oldEmbeddings = await model.embed(chunks, { model: 'text-embedding-3-small' });
  const newEmbeddings = await model.embed(chunks, { model: 'text-embedding-3-large' });
  // Write to both vector DBs
  await Promise.all([
    oldVectorDb.insert({ chunks, embeddings: oldEmbeddings, tenantId }),
    newVectorDb.insert({ chunks, embeddings: newEmbeddings, tenantId })
  ]);
}
```
Embedding migrations take time. Use dual-write/dual-read patterns to migrate without downtime.
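The dual-read phase can be sketched as a thin wrapper over both stores: try the new one first, fall back to the old one on errors or empty results (the `SearchStore` interface here is an illustrative simplification):

```typescript
interface SearchStore {
  search(query: string): Promise<string[]>;
}

// Dual-read: prefer the new vector store, but keep the old one as a
// safety net until the new embeddings are verified in production.
async function searchWithFallback(
  newStore: SearchStore,
  oldStore: SearchStore,
  query: string
): Promise<{ results: string[]; source: 'new' | 'old' }> {
  try {
    const results = await newStore.search(query);
    if (results.length > 0) {
      return { results, source: 'new' };
    }
  } catch {
    // Fall through: the old store still serves the request
  }
  const results = await oldStore.search(query);
  return { results, source: 'old' };
}
```

Logging which source served each request also tells you when the fallback has gone unused long enough to cut over safely.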
Conversation Context Migration
When schema changes, migrate existing conversation history.
```typescript
// Old schema
interface OldConversation {
  id: string;
  messages: Array<{ role: string; content: string }>;
  createdAt: Date;
}

// New schema: add metadata
interface NewConversation {
  id: string;
  messages: Array<{ role: string; content: string; model: string; timestamp: Date }>;
  modelVersion: string;
  metadata: Record<string, any>;
}

async function migrateConversationSchema() {
  const oldConversations = await db.conversations.find({});
  for (const oldConv of oldConversations) {
    const newConv: NewConversation = {
      id: oldConv.id,
      messages: oldConv.messages.map(msg => ({
        role: msg.role,
        content: msg.content,
        model: 'gpt-4o-2024-05-13', // Assume all old messages used this model
        timestamp: oldConv.createdAt // Approximate: per-message timestamps were never recorded
      })),
      modelVersion: 'gpt-4o-2024-05-13',
      metadata: {}
    };
    // Use upsert so the migration can be re-run safely after partial failures
    await db.conversations.updateOne(
      { id: newConv.id },
      { $set: newConv },
      { upsert: true }
    );
  }
}
```
Upserts allow phased migrations. New conversations use new schema. Old conversations are migrated in the background.
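While the background migration runs, the read path has to tolerate both shapes. A sketch of normalize-on-read, upgrading old-schema conversations lazily (the `AnyConversation` type and default model here are illustrative assumptions):

```typescript
// A conversation as it may appear mid-migration: new fields optional.
interface AnyConversation {
  id: string;
  messages: Array<{ role: string; content: string; model?: string; timestamp?: Date }>;
  modelVersion?: string;
  metadata?: Record<string, any>;
  createdAt?: Date;
}

// Fill in new-schema fields with defaults when reading an old-schema document.
function normalizeConversation(conv: AnyConversation, defaultModel: string) {
  return {
    id: conv.id,
    messages: conv.messages.map(msg => ({
      role: msg.role,
      content: msg.content,
      model: msg.model ?? defaultModel,
      timestamp: msg.timestamp ?? conv.createdAt ?? new Date(0)
    })),
    modelVersion: conv.modelVersion ?? defaultModel,
    metadata: conv.metadata ?? {}
  };
}
```

With this in place, the batch migration becomes a cleanup job rather than a blocker: readers never see a half-migrated schema.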
Checklist
- Keep conversations on their original model version
- Run shadow mode for 1-2 weeks before switching models
- Use feature flags with percentage rollouts (1% → 5% → 25% → 100%)
- Version prompts; test before deployment; rollback in seconds
- A/B test new prompts; measure user satisfaction
- Canary RAG changes to 10% of users first
- Plan embedding model migrations with dual-write/dual-read phases
- Use upserts for gradual schema migrations
- Monitor metrics during rollouts; kill switch if needed
- Test zero-downtime deployments in staging before production
Conclusion
Zero-downtime AI updates require paranoia. Shadow mode catches regressions. Feature flags enable gradual rollout. Canary deployments limit blast radius. Prompt versioning enables instant rollback. Combining these patterns makes model and prompt updates boring, safe, and reversible.