Feature Flags for AI Systems — Model Switching, Gradual Rollout, and Kill Switches
Introduction
AI adds complexity to feature flags. Traditional flags switch code paths; AI flags switch models, prompts, and cost profiles. They need cost awareness, per-flag quality metrics, and instant kill switches for runaway spend. This post covers production patterns.
- Why AI Needs Feature Flags More Than Regular Features
- Flag Types for AI
- Percentage Rollout by User
- Targeting Rules: Enterprise Gets GPT-4o, Free Tier Gets GPT-4o-mini
- Kill Switch for Runaway Costs
- Flag-Driven A/B Testing for AI Features
- OpenFeature SDK With AI Context
- Monitoring Per-Flag AI Quality Metrics
- Checklist
- Conclusion
Why AI Needs Feature Flags More Than Regular Features
Regular features break code or break UX. AI features break cost or break output quality.
A code bug affects all users equally. A degraded AI model affects output quality unpredictably—some prompts worse, some fine. You can't wait for 100% rollout before discovering the problem.
Non-determinism demands gradual rollout. Start with 1% of users, monitor quality metrics, then increase. If quality drops, kill the flag instantly.
// Traditional feature flag: simple on/off
if (featureFlags.get('dark_mode')) {
applyDarkMode();
}
// AI feature flag: cost-aware, with metrics
const llmModel = featureFlags.get('use_gpt4o_model') === true
? 'gpt-4o' // Roughly an order of magnitude more expensive per token
: 'gpt-4o-mini'; // Cheaper default
const response = await openai.chat.completions.create({
model: llmModel,
messages: [{ role: 'user', content: prompt }]
});
// Record metrics
metrics.gauge('ai.model_used', llmModel === 'gpt-4o' ? 1 : 0);
metrics.counter('ai.tokens_by_model', response.usage.total_tokens, {
model: llmModel
});
// Cost tracking per flag
const costUsd = (response.usage.total_tokens / 1000) * getPricing(llmModel);
await db.costByFlag.increment('gpt4o_model_flag', costUsd);
Flag Types for AI
AI requires different flag types than regular code.
// Type 1: Model version flags
const flags = {
// Boolean: enable/disable feature
'enable_ai_search': { type: 'boolean', value: true },
// Model selection: which model to use
'llm_model': {
type: 'enum',
options: ['gpt-4o-mini', 'gpt-4o'],
value: 'gpt-4o-mini'
},
// Prompt version: which prompt template
'summarization_prompt': {
type: 'enum',
options: ['summarization_v1', 'summarization_v2', 'summarization_v3'],
value: 'summarization_v1'
},
// Feature toggle: RAG on/off
'enable_rag_context': { type: 'boolean', value: true },
// Percentage rollout: gradual deployment
'new_gpt4o_model_rollout': {
type: 'percentage',
percentageOfUsers: 0.05, // 5%
seed: 'new_gpt4o_model_rollout'
}
};
async function getModel(userId: string) {
// For percentage rollouts, hash user ID to decide deterministically
const useNewModel = featureFlags.isEnabledForUser(
'new_gpt4o_model_rollout',
userId
);
if (useNewModel) {
return 'gpt-4o'; // New, expensive model
}
return 'gpt-4o-mini'; // Stable, cheap model
}
Percentage Rollout by User
Consistent assignment: same user always gets same variant.
class PercentageRollout {
isEnabledForUser(flagName: string, userId: string, percentage: number): boolean {
// Create deterministic hash from user ID and flag name
const hash = hashFunction(`${flagName}:${userId}`);
// Map hash to percentage
const value = hash % 100;
return value < percentage;
}
}
// Usage
const rollout = new PercentageRollout();
// 5% of users: consistent assignment
for (const userId of ['user_1', 'user_2', 'user_3']) {
const enabled = rollout.isEnabledForUser('new_model', userId, 5);
console.log(`${userId}: ${enabled}`);
// user_1: false
// user_2: true (always true for this user, even on retry)
// user_3: false
}
// Increase rollout to 10%
const users10pct = await db.users.find({});
const rolledOutUsers = users10pct.filter(u =>
rollout.isEnabledForUser('new_model', u.id, 10)
);
console.log(`Rolled out to ${rolledOutUsers.length} users (10%)`);
Percentage rollout must be deterministic. Same user always gets same variant. Hash is stable across deployments.
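The hashFunction above is left abstract; any hash that is stable across deployments and well distributed works. A minimal sketch using FNV-1a (the function names and the 100-bucket mapping are illustrative):

```typescript
// FNV-1a: a small, stable string hash. The same input always yields the
// same 32-bit value, across processes, restarts, and deployments.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply mod 2^32, keep unsigned
  }
  return hash >>> 0;
}

// Map a (flag, user) pair into a 0-99 bucket for percentage rollouts.
// A user is enabled when their bucket is below the rollout percentage.
function bucketFor(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 5 to 10 keeps the original 5% enabled and adds new users on top, rather than reshuffling everyone.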
Targeting Rules: Enterprise Gets GPT-4o, Free Tier Gets GPT-4o-mini
Different tiers deserve different models.
interface FlagRule {
id: string;
flagName: string;
conditions: Array<{
attribute: string; // e.g., 'tier', 'country', 'plan'
operator: 'equals' | 'in' | 'contains' | 'gt' | 'lt';
value: string | string[] | number;
}>;
variant: string; // Which variant to serve if rule matches
priority: number; // Higher priority matches first
}
const rules: FlagRule[] = [
{
id: 'rule_enterprise_gpt4o',
flagName: 'llm_model',
conditions: [
{
attribute: 'tier',
operator: 'equals',
value: 'enterprise'
}
],
variant: 'gpt-4o',
priority: 100
},
{
id: 'rule_pro_gpt4o_mini',
flagName: 'llm_model',
conditions: [
{
attribute: 'tier',
operator: 'equals',
value: 'pro'
}
],
variant: 'gpt-4o-mini',
priority: 50
},
{
id: 'rule_free_gpt4o_mini',
flagName: 'llm_model',
conditions: [
{
attribute: 'tier',
operator: 'equals',
value: 'free'
}
],
variant: 'gpt-4o-mini',
priority: 25
}
];
async function evaluateFlags(userId: string) {
const user = await db.users.findOne({ id: userId });
// Sort rules by priority (highest first) without mutating the shared array
const sorted = [...rules].sort((a, b) => b.priority - a.priority);
for (const rule of sorted) {
if (matchesConditions(user, rule.conditions)) {
return rule.variant; // First matching rule wins
}
}
// Default
return 'gpt-4o-mini';
}
function matchesConditions(user: any, conditions: any[]): boolean {
return conditions.every(condition => {
const value = user[condition.attribute];
switch (condition.operator) {
case 'equals':
return value === condition.value;
case 'in':
return (condition.value as string[]).includes(value);
case 'contains':
return (value as string).includes(condition.value);
case 'gt':
return value > condition.value;
case 'lt':
return value < condition.value;
default:
return false;
}
});
}
Targeting rules let you serve better models to paying customers and cheaper models to the free tier.
Kill Switch for Runaway Costs
Instantly disable expensive features if costs spike.
class CostKillSwitch {
async checkCostLimit(flagName: string) {
const costThreshold = 1000; // Kill switch at $1000/hour
const hourlySpend = await this.getHourlySpend(flagName);
if (hourlySpend > costThreshold) {
logger.error('Cost kill switch triggered', {
flagName,
hourlySpend,
threshold: costThreshold
});
// Disable the flag
await featureFlags.disable(flagName);
// Alert ops
await sendAlert({
severity: 'critical',
title: 'Cost kill switch triggered',
message: `${flagName} spending $${hourlySpend}/hour, disabled`
});
return { allowed: false, reason: 'Cost limit exceeded' };
}
return { allowed: true };
}
async getHourlySpend(flagName: string) {
const costs = await db.costByFlag.find({
flagName,
timestamp: { $gte: oneHourAgo() }
});
return costs.reduce((sum, c) => sum + c.costUsd, 0);
}
}
// Middleware: check kill switch before using flag
app.use(async (req, res, next) => {
const killSwitch = new CostKillSwitch();
const flagName = getRelevantFlag(req);
const check = await killSwitch.checkCostLimit(flagName);
if (!check.allowed) {
return res.status(503).json({
error: 'Feature temporarily unavailable',
reason: check.reason
});
}
next();
});
Cost kill switches are insurance. If a bug causes a runaway expensive operation, it's disabled automatically before it costs thousands.
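getHourlySpend above queries the database on every request. As a sketch of a cheaper hot path (the class and method names are hypothetical), an in-process sliding window can answer the same question from memory:

```typescript
// In-memory sliding-window spend tracker: record each request's cost,
// then sum only the entries from the last hour. Timestamps are passed in
// explicitly here to keep the sketch testable; in production they default
// to Date.now().
class SpendWindow {
  private entries: Array<{ at: number; usd: number }> = [];

  record(usd: number, now: number = Date.now()): void {
    this.entries.push({ at: now, usd });
  }

  hourlySpend(now: number = Date.now()): number {
    const cutoff = now - 60 * 60 * 1000;
    this.entries = this.entries.filter(e => e.at >= cutoff); // drop stale entries
    return this.entries.reduce((sum, e) => sum + e.usd, 0);
  }

  overLimit(limitUsd: number, now: number = Date.now()): boolean {
    return this.hourlySpend(now) > limitUsd;
  }
}
```

An in-memory window is per-process, so it complements rather than replaces the database check: use it as a fast first gate, and reconcile against the shared cost table asynchronously.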
Flag-Driven A/B Testing for AI Features
Use flags to run experiments.
interface Experiment {
id: string;
flagName: string;
control: { variant: string; percentage: number };
treatment: { variant: string; percentage: number };
startDate: Date;
endDate: Date;
metric: 'user_satisfaction' | 'latency' | 'cost' | 'quality';
}
async function runExperiment(experiment: Experiment, userId: string) {
// Assign users to control or treatment
const flag = featureFlags.get(experiment.flagName);
// 50% control, 50% treatment
flag.type = 'experiment';
flag.control = experiment.control;
flag.treatment = experiment.treatment;
flag.metric = experiment.metric;
// Log which variant each user sees
await db.experiments.insertOne({
experimentId: experiment.id,
userId,
variant: getVariant(experiment, userId),
timestamp: new Date()
});
}
async function analyzeExperiment(experimentId: string) {
const results = await db.experiments.find({ experimentId });
const byVariant = {};
for (const result of results) {
if (!byVariant[result.variant]) {
byVariant[result.variant] = { scores: [], users: 0 };
}
// Fetch metric (user satisfaction, latency, etc.)
const metric = await getMetric(result);
byVariant[result.variant].scores.push(metric);
byVariant[result.variant].users++;
}
// Calculate winner
const controlMean = mean(byVariant.control.scores);
const treatmentMean = mean(byVariant.treatment.scores);
return {
control: { mean: controlMean, n: byVariant.control.users },
treatment: { mean: treatmentMean, n: byVariant.treatment.users },
winner: treatmentMean > controlMean ? 'treatment' : 'control',
confidence: calculateStatisticalSignificance(byVariant.control.scores, byVariant.treatment.scores)
};
}
Experiments measure which flag variant performs better. Use results to decide rollout.
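calculateStatisticalSignificance is assumed above; one common choice is Welch's t-statistic for two independent samples with unequal variances. A minimal sketch (this computes the t-statistic only, using a normal approximation as a rough guide rather than a full t-distribution p-value):

```typescript
// Sample mean of a list of metric values.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t-statistic: (treatment mean - control mean) over the combined
// standard error. For reasonably large samples, |t| > 1.96 roughly
// corresponds to p < 0.05 under a normal approximation.
function welchT(control: number[], treatment: number[]): number {
  const se = Math.sqrt(
    variance(control) / control.length + variance(treatment) / treatment.length
  );
  return (mean(treatment) - mean(control)) / se;
}
```

A positive t favors treatment, a negative t favors control; either way, declare a winner only once the magnitude clears your significance bar.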
OpenFeature SDK With AI Context
OpenFeature is a standard for feature flags. Use it with AI context.
// Setup OpenFeature
import { OpenFeature } from '@openfeature/js-sdk';
const provider = new MyCustomFlagProvider();
OpenFeature.setProvider(provider);
const client = OpenFeature.getClient();
// AI-specific context
const aiContext = {
userId: req.user.id,
tenantId: req.user.tenantId,
userTier: req.user.tier,
region: req.headers['x-region'],
// AI-specific
modelFamily: 'gpt-4', // What family of models
promptType: 'summarization', // What task
inputTokens: 150, // Estimated input tokens
costBudget: 5.0 // Max willing to spend
};
const model = await client.getStringDetails('llm_model', 'gpt-4o-mini', aiContext);
console.log(model.value); // e.g., 'gpt-4o' for enterprise, 'gpt-4o-mini' for free
// Cost-aware flag evaluation
const maxCost = await client.getNumberDetails('max_request_cost_usd', 1.0, aiContext);
const estimatedCost = estimateTokenCost(prompt, model.value);
if (estimatedCost > maxCost.value) {
return res.status(429).json({
error: 'Request would exceed cost limit',
estimated: estimatedCost,
limit: maxCost.value
});
}
OpenFeature provides a vendor-agnostic flag API. Custom context includes AI-specific fields.
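estimateTokenCost is an assumed helper; a crude sketch using the common ~4-characters-per-token heuristic for English text (the price table below is a placeholder, not official pricing; always read current rates from the provider's pricing page):

```typescript
// Placeholder per-1K-input-token prices in USD; keep these in config, not
// code, and refresh them from the provider's published pricing.
const PRICE_PER_1K_INPUT_TOKENS_USD: Record<string, number> = {
  'gpt-4o': 0.0025,       // placeholder rate
  'gpt-4o-mini': 0.00015, // placeholder rate
};

// Rough pre-flight cost estimate: chars / 4 approximates the token count,
// which is good enough for budget gating before the request is sent.
function estimateTokenCost(prompt: string, model: string): number {
  const estimatedTokens = Math.ceil(prompt.length / 4);
  const pricePer1k = PRICE_PER_1K_INPUT_TOKENS_USD[model] ?? 0;
  return (estimatedTokens / 1000) * pricePer1k;
}
```

This deliberately ignores output tokens, which you cannot know in advance; for tighter gating, add an assumed maximum completion length to the estimate.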
Monitoring Per-Flag AI Quality Metrics
Track quality for each flag variant.
class FlagMetricsCollector {
async recordMetric(flagName: string, variant: string, metrics: {
latency: number;
tokens: number;
costUsd: number;
userSatisfaction?: number; // 1-5 rating
errorRate?: number;
}) {
await db.flagMetrics.insertOne({
flagName,
variant,
...metrics,
timestamp: new Date()
});
// Aggregate metrics per variant
const hour = new Date().toISOString().slice(0, 13); // UTC hour bucket: "YYYY-MM-DDTHH"
await db.flagMetricsHourly.updateOne(
{ flagName, variant, hour },
{
$inc: {
requestCount: 1,
totalCost: metrics.costUsd,
totalTokens: metrics.tokens,
totalLatency: metrics.latency
},
$push: { satisfactionScores: metrics.userSatisfaction }
},
{ upsert: true }
);
}
async getMetricsForVariant(flagName: string, variant: string, period: string) {
const metrics = await db.flagMetricsHourly.find({
flagName,
variant,
hour: { $gte: startOfPeriod(period) }
});
const total = (pick: (m: any) => number) =>
metrics.reduce((sum, m) => sum + (pick(m) || 0), 0);
const requests = total(m => m.requestCount);
return {
avgCost: total(m => m.totalCost) / requests,
avgLatency: total(m => m.totalLatency) / requests,
avgSatisfaction: mean(metrics.flatMap(m => m.satisfactionScores || [])),
errorRate: total(m => m.totalErrors) / requests // requires aggregating an error counter in recordMetric too
};
}
}
// Compare variants
async function compareVariants(flagName: string) {
const collector = new FlagMetricsCollector();
const variants = await getVariants(flagName);
const comparison = {};
for (const variant of variants) {
comparison[variant] = await collector.getMetricsForVariant(flagName, variant, 'last_7_days');
}
return comparison;
}
Monitor cost, latency, tokens, and user satisfaction per variant. Use this data to make rollout decisions.
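Comparison data is only useful if it drives decisions. A simple regression check (the thresholds and the VariantStats shape are illustrative, not a standard) can gate automatic rollout increases:

```typescript
// The per-variant aggregates the comparison produces, reduced to the
// fields the check needs.
interface VariantStats {
  avgCost: number;
  avgLatency: number;
  avgSatisfaction: number; // 1-5 rating
}

// Flag a treatment variant that is meaningfully worse than control on any
// tracked dimension. Thresholds here are illustrative; tune them per
// product and pair this with a significance test before acting.
function detectRegression(control: VariantStats, treatment: VariantStats): string[] {
  const issues: string[] = [];
  if (treatment.avgCost > control.avgCost * 1.5) issues.push('cost up more than 50%');
  if (treatment.avgLatency > control.avgLatency * 1.25) issues.push('latency up more than 25%');
  if (treatment.avgSatisfaction < control.avgSatisfaction - 0.3) issues.push('satisfaction dropped');
  return issues;
}
```

An empty result means the variant is safe to widen; any issue should pause the rollout and page a human rather than auto-advance.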
Checklist
- Feature flags for model selection, not hardcoded models
- Percentage rollouts: 1% → 5% → 25% → 100%
- Targeting rules for tier-based model assignment
- Cost kill switches: auto-disable if cost > threshold
- A/B tests built on flags with statistical analysis
- OpenFeature SDK for vendor-agnostic flag management
- Per-flag quality metrics: cost, latency, satisfaction
- Flag evaluation includes AI-specific context (tokens, budget)
- Kill switch response is clear: why disabled, when available
- Monitor flag health: compare variants, detect regressions
Conclusion
Feature flags transform AI deployments from binary on/off to gradual, measurable rollouts. Kill switches provide insurance. Metrics reveal quality regressions early. Using flags for model versions, prompts, and features makes AI deployment boring and safe.