Feature Flags for AI Systems — Model Switching, Gradual Rollout, and Kill Switches
Introduction
AI adds complexity to feature flags. Traditional flags switch code paths; AI flags switch models and prompts, each with its own cost profile. They need cost awareness, per-flag quality metrics, and instant kill switches for runaway spend. This post covers production patterns.
- Why AI Needs Feature Flags More Than Regular Features
- Flag Types for AI
- Percentage Rollout by User
- Targeting Rules: Enterprise Gets GPT-4o, Free Tier Gets GPT-4o-mini
- Kill Switch for Runaway Costs
- Flag-Driven A/B Testing for AI Features
- OpenFeature SDK With AI Context
- Monitoring Per-Flag AI Quality Metrics
- Checklist
- Conclusion
Why AI Needs Feature Flags More Than Regular Features
Regular features break code or break UX. AI features break cost or break output quality.
A code bug affects all users equally. A degraded AI model affects output quality unpredictably—some prompts worse, some fine. You can't wait for 100% rollout before discovering the problem.
Non-determinism demands gradual rollout. Start with 1% of users, monitor quality metrics, then increase. If quality drops, kill the flag instantly.
// Traditional feature flag: simple on/off
if (featureFlags.get('dark_mode')) {
  applyDarkMode();
}

// AI feature flag: cost-aware, with metrics
const llmModel = featureFlags.get('use_gpt4o_model') === true
  ? 'gpt-4o'       // Significantly more expensive per token
  : 'gpt-4o-mini'; // Cheaper

const response = await openai.chat.completions.create({
  model: llmModel,
  messages: [{ role: 'user', content: prompt }]
});

// Record metrics
metrics.gauge('ai.model_used', llmModel === 'gpt-4o' ? 1 : 0);
metrics.counter('ai.tokens_by_model', response.usage.total_tokens, {
  model: llmModel
});

// Cost tracking per flag
const costUsd = (response.usage.total_tokens / 1000) * getPricing(llmModel);
await db.costByFlag.increment('gpt4o_model_flag', costUsd);
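The cost calculation above assumes a getPricing helper. A minimal sketch — the per-1K-token prices are illustrative placeholders, not current OpenAI list prices:

```typescript
// Per-1K-token prices in USD (placeholder values — check current provider pricing)
const PRICE_PER_1K_TOKENS_USD: Record<string, number> = {
  'gpt-4o': 0.01,
  'gpt-4o-mini': 0.001,
};

function getPricing(model: string): number {
  const price = PRICE_PER_1K_TOKENS_USD[model];
  if (price === undefined) {
    // Fail loudly rather than silently billing at $0
    throw new Error(`No pricing configured for model: ${model}`);
  }
  return price;
}

// 2,000 tokens on gpt-4o-mini at the placeholder rate
console.log((2000 / 1000) * getPricing('gpt-4o-mini')); // 0.002
```

Throwing on an unknown model is deliberate: a missing price entry should surface as an error, not as free usage.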
Flag Types for AI
AI requires different flag types than regular code.
// Flag types: booleans, enums, and percentage rollouts
const flags = {
  // Boolean: enable/disable feature
  'enable_ai_search': { type: 'boolean', value: true },

  // Model selection: which model to use
  'llm_model': {
    type: 'enum',
    options: ['gpt-4o-mini', 'gpt-4o'],
    value: 'gpt-4o-mini'
  },

  // Prompt version: which prompt template
  'summarization_prompt': {
    type: 'enum',
    options: ['summarization_v1', 'summarization_v2', 'summarization_v3'],
    value: 'summarization_v1'
  },

  // Feature toggle: RAG on/off
  'enable_rag_context': { type: 'boolean', value: true },

  // Percentage rollout: gradual deployment
  'new_gpt4o_model_rollout': {
    type: 'percentage',
    percentageOfUsers: 0.05, // 5%
    seed: 'new_gpt4o_model_rollout'
  }
};

async function getModel(userId: string) {
  // For percentage rollouts, hash user ID to decide deterministically
  const useNewModel = featureFlags.isEnabledForUser(
    'new_gpt4o_model_rollout',
    userId
  );
  if (useNewModel) {
    return 'gpt-4o'; // New, expensive model
  }
  return 'gpt-4o-mini'; // Stable, cheap model
}
Percentage Rollout by User
Consistent assignment: same user always gets same variant.
class PercentageRollout {
  isEnabledForUser(flagName: string, userId: string, percentage: number): boolean {
    // Create deterministic hash from user ID and flag name
    const hash = hashFunction(`${flagName}:${userId}`);
    // Map hash to a 0-99 bucket
    const value = hash % 100;
    return value < percentage;
  }
}

// Usage
const rollout = new PercentageRollout();

// 5% of users: consistent assignment
for (const userId of ['user_1', 'user_2', 'user_3']) {
  const enabled = rollout.isEnabledForUser('new_model', userId, 5);
  console.log(`${userId}: ${enabled}`);
  // Example output — each user's assignment is stable across calls:
  // user_1: false
  // user_2: true (always true for this user, even on retry)
  // user_3: false
}

// Increase rollout to 10%
const users10pct = await db.users.find({});
const rolledOutUsers = users10pct.filter(u =>
  rollout.isEnabledForUser('new_model', u.id, 10)
);
console.log(`Rolled out to ${rolledOutUsers.length} users (10%)`);
Percentage rollout must be deterministic: the same user always gets the same variant, and the hash must be stable across deployments.
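The hashFunction above is left abstract; any stable, well-distributed hash works. A minimal sketch using 32-bit FNV-1a (the constants are the standard FNV offset basis and prime):

```typescript
// FNV-1a: fast, dependency-free, and stable across processes and deployments
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash;
}

function bucket(flagName: string, userId: string): number {
  // Including the flag name decorrelates buckets across different flags
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// Same user + flag always lands in the same bucket, on any machine
console.log(bucket('new_model', 'user_42') === bucket('new_model', 'user_42')); // true
```

Avoid hashes that vary by process (like object identity) or by runtime version; the bucket must be reproducible everywhere the flag is evaluated.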
Targeting Rules: Enterprise Gets GPT-4o, Free Tier Gets GPT-4o-mini
Different tiers deserve different models.
interface FlagRule {
  id: string;
  flagName: string;
  conditions: Array<{
    attribute: string; // e.g., 'tier', 'country', 'plan'
    operator: 'equals' | 'in' | 'contains' | 'gt' | 'lt';
    value: string | string[] | number;
  }>;
  variant: string;  // Which variant to serve if rule matches
  priority: number; // Higher priority matches first
}

const rules: FlagRule[] = [
  {
    id: 'rule_enterprise_gpt4o',
    flagName: 'llm_model',
    conditions: [
      {
        attribute: 'tier',
        operator: 'equals',
        value: 'enterprise'
      }
    ],
    variant: 'gpt-4o',
    priority: 100
  },
  {
    id: 'rule_pro_gpt4o_mini',
    flagName: 'llm_model',
    conditions: [
      {
        attribute: 'tier',
        operator: 'equals',
        value: 'pro'
      }
    ],
    variant: 'gpt-4o-mini',
    priority: 50
  },
  {
    id: 'rule_free_gpt4o_mini',
    flagName: 'llm_model',
    conditions: [
      {
        attribute: 'tier',
        operator: 'equals',
        value: 'free'
      }
    ],
    variant: 'gpt-4o-mini',
    priority: 25
  }
];
async function evaluateFlags(userId: string) {
  const user = await db.users.findOne({ id: userId });
  // Sort rules by priority (highest first) without mutating the original array
  const sorted = [...rules].sort((a, b) => b.priority - a.priority);
  for (const rule of sorted) {
    if (matchesConditions(user, rule.conditions)) {
      return rule.variant; // First matching rule wins
    }
  }
  // Default
  return 'gpt-4o-mini';
}

function matchesConditions(user: any, conditions: any[]): boolean {
  return conditions.every(condition => {
    const value = user[condition.attribute];
    switch (condition.operator) {
      case 'equals':
        return value === condition.value;
      case 'in':
        return (condition.value as string[]).includes(value);
      case 'contains':
        return (value as string).includes(condition.value);
      case 'gt':
        return value > condition.value;
      case 'lt':
        return value < condition.value;
      default:
        return false;
    }
  });
}
Targeting rules let you serve better models to paying customers and cheaper models to the free tier.
Kill Switch for Runaway Costs
Instantly disable expensive features if costs spike.
class CostKillSwitch {
  async checkCostLimit(flagName: string) {
    const costThreshold = 1000; // Kill switch at $1000/hour
    const hourlySpend = await this.getHourlySpend(flagName);
    if (hourlySpend > costThreshold) {
      logger.error('Cost kill switch triggered', {
        flagName,
        hourlySpend,
        threshold: costThreshold
      });
      // Disable the flag
      await featureFlags.disable(flagName);
      // Alert ops
      await sendAlert({
        severity: 'critical',
        title: 'Cost kill switch triggered',
        message: `${flagName} spending $${hourlySpend}/hour, disabled`
      });
      return { allowed: false, reason: 'Cost limit exceeded' };
    }
    return { allowed: true };
  }

  async getHourlySpend(flagName: string) {
    const costs = await db.costByFlag.find({
      flagName,
      timestamp: { $gte: oneHourAgo() }
    });
    return costs.reduce((sum, c) => sum + c.costUsd, 0);
  }
}

// Middleware: check kill switch before using flag
app.use(async (req, res, next) => {
  const killSwitch = new CostKillSwitch();
  const flagName = getRelevantFlag(req);
  const check = await killSwitch.checkCostLimit(flagName);
  if (!check.allowed) {
    return res.status(503).json({
      error: 'Feature temporarily unavailable',
      reason: check.reason
    });
  }
  next();
});
Cost kill switches are insurance. If a bug triggers a runaway expensive operation, the flag is disabled automatically before it costs thousands of dollars.
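The getHourlySpend query above hits the database on every check. For hot paths, the same sliding-window total can be kept in memory; a minimal sketch (class and method names are illustrative, and a real deployment would need this shared across instances):

```typescript
// In-memory sliding window of per-request costs for one flag
class SlidingCostWindow {
  private entries: { ts: number; costUsd: number }[] = [];

  constructor(private windowMs: number = 60 * 60 * 1000) {} // default: 1 hour

  record(costUsd: number, now: number = Date.now()): void {
    this.entries.push({ ts: now, costUsd });
  }

  // Total spend inside the window, evicting expired entries as we go
  total(now: number = Date.now()): number {
    this.entries = this.entries.filter(e => now - e.ts < this.windowMs);
    return this.entries.reduce((sum, e) => sum + e.costUsd, 0);
  }
}

const costWindow = new SlidingCostWindow();
const t0 = Date.now();
costWindow.record(400, t0);
costWindow.record(700, t0 + 1000);
console.log(costWindow.total(t0 + 2000));                 // 1100 — over a $1000/hour threshold
console.log(costWindow.total(t0 + 2 * 60 * 60 * 1000));   // 0 — entries expired
```

This only tracks a single process; a fleet of servers would aggregate via Redis or the database as in the original code, using the in-memory window as a fast local pre-check.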
Flag-Driven A/B Testing for AI Features
Use flags to run experiments.
interface Experiment {
  id: string;
  flagName: string;
  control: { variant: string; percentage: number };
  treatment: { variant: string; percentage: number };
  startDate: Date;
  endDate: Date;
  metric: 'user_satisfaction' | 'latency' | 'cost' | 'quality';
}
async function runExperiment(experiment: Experiment, userId: string) {
  // Assign users to control or treatment
  const flag = featureFlags.get(experiment.flagName);

  // 50% control, 50% treatment
  flag.type = 'experiment';
  flag.control = experiment.control;
  flag.treatment = experiment.treatment;
  flag.metric = experiment.metric;

  // Log which variant each user sees
  await db.experiments.insertOne({
    experimentId: experiment.id,
    userId,
    variant: getVariant(experiment, userId),
    timestamp: new Date()
  });
}
async function analyzeExperiment(experimentId: string) {
  const results = await db.experiments.find({ experimentId });
  const byVariant = {};
  for (const result of results) {
    if (!byVariant[result.variant]) {
      byVariant[result.variant] = { scores: [], users: 0 };
    }
    // Fetch metric (user satisfaction, latency, etc.)
    const metric = await getMetric(result);
    byVariant[result.variant].scores.push(metric);
    byVariant[result.variant].users++;
  }

  // Calculate winner
  const controlMean = mean(byVariant.control.scores);
  const treatmentMean = mean(byVariant.treatment.scores);
  return {
    control: { mean: controlMean, n: byVariant.control.users },
    treatment: { mean: treatmentMean, n: byVariant.treatment.users },
    winner: treatmentMean > controlMean ? 'treatment' : 'control',
    confidence: calculateStatisticalSignificance(
      byVariant.control.scores,
      byVariant.treatment.scores
    )
  };
}
Experiments measure which flag variant performs better. Use results to decide rollout.
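The calculateStatisticalSignificance helper above is left abstract. One common choice is Welch's t-test; a minimal sketch that computes the t-statistic (a full implementation would also convert this to a p-value):

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample variance (Bessel-corrected, dividing by n - 1)
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t-statistic: mean difference scaled by the combined standard error.
// Unlike Student's t, it does not assume equal variances between groups.
function welchT(control: number[], treatment: number[]): number {
  const se = Math.sqrt(
    variance(control) / control.length + variance(treatment) / treatment.length
  );
  return (mean(treatment) - mean(control)) / se;
}

// Satisfaction scores with a clear separation between variants
const controlScores = [3.1, 3.0, 2.9, 3.2, 3.0];
const treatmentScores = [3.9, 4.1, 4.0, 3.8, 4.2];
console.log(welchT(controlScores, treatmentScores)); // large positive t → treatment likely better
```

A large |t| suggests a real difference; small samples with noisy AI quality scores need many more observations than this toy example before the result is trustworthy.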
OpenFeature SDK With AI Context
OpenFeature is a standard for feature flags. Use it with AI context.
// Setup OpenFeature
import { OpenFeature } from '@openfeature/js-sdk';

const provider = new MyCustomFlagProvider();
OpenFeature.setProvider(provider);
const client = OpenFeature.getClient();

// AI-specific context
const aiContext = {
  userId: req.user.id,
  tenantId: req.user.tenantId,
  userTier: req.user.tier,
  region: req.headers['x-region'],
  // AI-specific
  modelFamily: 'gpt-4',        // What family of models
  promptType: 'summarization', // What task
  inputTokens: 150,            // Estimated input tokens
  costBudget: 5.0              // Max willing to spend
};

const model = await client.getStringDetails('llm_model', 'gpt-4o-mini', aiContext);
console.log(model.value); // e.g., 'gpt-4o' for enterprise, 'gpt-4o-mini' for free

// Cost-aware flag evaluation
const maxCost = await client.getNumberDetails('max_request_cost_usd', 1.0, aiContext);
const estimatedCost = estimateTokenCost(prompt, model.value);
if (estimatedCost > maxCost.value) {
  return res.status(429).json({
    error: 'Request would exceed cost limit',
    estimated: estimatedCost,
    limit: maxCost.value
  });
}
OpenFeature provides a vendor-agnostic flag API. Custom context includes AI-specific fields.
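The MyCustomFlagProvider above is not shown. A minimal, self-contained sketch of what its string resolution could look like, evaluating the tier-based rule from earlier — this mirrors the shape of an OpenFeature resolution result ({ value, reason }) but is simplified, and the real Provider interface in the SDK has additional methods and metadata:

```typescript
// Simplified evaluation context; real OpenFeature contexts are open-ended maps
interface EvalContext {
  userTier?: string;
  [key: string]: unknown;
}

function resolveStringEvaluation(
  flagKey: string,
  defaultValue: string,
  context: EvalContext
): { value: string; reason: string } {
  // Targeting rule: enterprise tier gets the premium model
  if (flagKey === 'llm_model' && context.userTier === 'enterprise') {
    return { value: 'gpt-4o', reason: 'TARGETING_MATCH' };
  }
  // No rule matched: fall back to the caller-supplied default
  return { value: defaultValue, reason: 'DEFAULT' };
}

console.log(resolveStringEvaluation('llm_model', 'gpt-4o-mini', { userTier: 'enterprise' }).value); // gpt-4o
console.log(resolveStringEvaluation('llm_model', 'gpt-4o-mini', { userTier: 'free' }).value);       // gpt-4o-mini
```

Returning a reason alongside the value is what makes flag decisions debuggable: logs can show *why* a user got an expensive model, not just that they did.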
Monitoring Per-Flag AI Quality Metrics
Track quality for each flag variant.
class FlagMetricsCollector {
  async recordMetric(flagName: string, variant: string, metrics: {
    latency: number;
    tokens: number;
    costUsd: number;
    userSatisfaction?: number; // 1-5 rating
    errorRate?: number;
  }) {
    await db.flagMetrics.insertOne({
      flagName,
      variant,
      ...metrics,
      timestamp: new Date()
    });

    // Aggregate per variant, bucketed by UTC hour (e.g. '2024-05-01T14')
    const hour = new Date().toISOString().slice(0, 13);
    await db.flagMetricsHourly.updateOne(
      { flagName, variant, hour },
      {
        $inc: {
          requestCount: 1,
          totalCost: metrics.costUsd,
          totalTokens: metrics.tokens,
          totalLatency: metrics.latency
        },
        $push: { satisfactionScores: metrics.userSatisfaction }
      },
      { upsert: true }
    );
  }

  async getMetricsForVariant(flagName: string, variant: string, period: string) {
    const metrics = await db.flagMetricsHourly.find({
      flagName,
      variant,
      hour: { $gte: startOfPeriod(period) }
    });
    const sum = (fn: (m: any) => number) =>
      metrics.reduce((total, m) => total + (fn(m) || 0), 0);
    return {
      avgCost: sum(m => m.totalCost) / sum(m => m.requestCount),
      avgLatency: sum(m => m.totalLatency) / sum(m => m.requestCount),
      avgSatisfaction: mean(metrics.flatMap(m => m.satisfactionScores)),
      errorRate: sum(m => m.totalErrors) / sum(m => m.requestCount)
    };
  }
}

// Compare variants
async function compareVariants(flagName: string) {
  const collector = new FlagMetricsCollector();
  const variants = await getVariants(flagName);
  const comparison = {};
  for (const variant of variants) {
    comparison[variant] = await collector.getMetricsForVariant(flagName, variant, 'last_7_days');
  }
  return comparison;
}
Monitor cost, latency, tokens, and user satisfaction per variant. Use this data to make rollout decisions.
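The comparison data can also drive automated regression detection. A minimal sketch that flags a candidate variant whose cost or latency exceeds the control by more than a threshold (the 20% default and the interface name are illustrative choices):

```typescript
interface VariantMetrics {
  avgCost: number;
  avgLatency: number;
  avgSatisfaction: number;
}

// Returns the list of metrics on which the candidate regressed vs. control
function detectRegressions(
  control: VariantMetrics,
  candidate: VariantMetrics,
  threshold = 0.2 // 20% relative change
): string[] {
  const issues: string[] = [];
  // Cost and latency regress upward
  if (candidate.avgCost > control.avgCost * (1 + threshold)) issues.push('cost');
  if (candidate.avgLatency > control.avgLatency * (1 + threshold)) issues.push('latency');
  // Satisfaction regresses downward
  if (candidate.avgSatisfaction < control.avgSatisfaction * (1 - threshold)) issues.push('satisfaction');
  return issues;
}

const control = { avgCost: 0.002, avgLatency: 800, avgSatisfaction: 4.1 };
const candidate = { avgCost: 0.007, avgLatency: 820, avgSatisfaction: 4.0 };
console.log(detectRegressions(control, candidate)); // [ 'cost' ]
```

Wiring this into the rollout loop closes the circle: a detected regression can halt a percentage increase or trip the kill switch automatically.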
Checklist
- Feature flags for model selection, not hardcoded models
- Percentage rollouts: 1% → 5% → 25% → 100%
- Targeting rules for tier-based model assignment
- Cost kill switches: auto-disable if cost > threshold
- A/B tests built on flags with statistical analysis
- OpenFeature SDK for vendor-agnostic flag management
- Per-flag quality metrics: cost, latency, satisfaction
- Flag evaluation includes AI-specific context (tokens, budget)
- Kill switch response is clear: why disabled, when available
- Monitor flag health: compare variants, detect regressions
Conclusion
Feature flags transform AI deployments from binary on/off to gradual, measurable rollouts. Kill switches provide insurance. Metrics reveal quality regressions early. Using flags for model versions, prompts, and features makes AI deployment boring and safe.