Feature Flags for AI Systems — Model Switching, Gradual Rollout, and Kill Switches

Introduction

AI adds complexity to feature flags. Traditional flags switch code paths. AI flags switch models, prompts, and costs. They need cost-awareness, per-flag metrics, and instant kill switches for runaway costs. This post covers production patterns.

Why AI Needs Feature Flags More Than Regular Features

Regular features break code or break UX. AI features break cost or break output quality.

A code bug affects all users equally. A degraded AI model affects output quality unpredictably—some prompts worse, some fine. You can't wait for 100% rollout before discovering the problem.

Non-determinism demands gradual rollout. Start with 1% of users, monitor quality metrics, then increase. If quality drops, kill the flag instantly.

// Traditional feature flag: simple on/off
if (featureFlags.get('dark_mode')) {
  applyDarkMode();
}

// AI feature flag: cost-aware, with metrics
const llmModel = featureFlags.get('use_gpt4o_model') === true
  ? 'gpt-4o' // Significantly more expensive per token
  : 'gpt-4o-mini'; // Cheaper

const response = await openai.chat.completions.create({
  model: llmModel,
  messages: [{ role: 'user', content: prompt }]
});

// Record metrics
metrics.gauge('ai.model_used', llmModel === 'gpt-4o' ? 1 : 0);
metrics.counter('ai.tokens_by_model', response.usage.total_tokens, {
  model: llmModel
});

// Cost tracking per flag (getPricing returns USD per 1K tokens)
const costUsd = (response.usage.total_tokens / 1000) * getPricing(llmModel);
await db.costByFlag.increment('gpt4o_model_flag', costUsd);

Flag Types for AI

AI requires different flag types than regular code.

// The flag types an AI system typically needs
const flags = {
  // Boolean: enable/disable feature
  'enable_ai_search': { type: 'boolean', value: true },

  // Model selection: which model to use
  'llm_model': {
    type: 'enum',
    options: ['gpt-4o-mini', 'gpt-4o'],
    value: 'gpt-4o-mini'
  },

  // Prompt version: which prompt template
  'summarization_prompt': {
    type: 'enum',
    options: ['summarization_v1', 'summarization_v2', 'summarization_v3'],
    value: 'summarization_v1'
  },

  // Feature toggle: RAG on/off
  'enable_rag_context': { type: 'boolean', value: true },

  // Percentage rollout: gradual deployment
  'new_gpt4o_model_rollout': {
    type: 'percentage',
    percentageOfUsers: 5, // 5%, on the same 0-100 scale PercentageRollout uses
    seed: 'new_gpt4o_model_rollout'
  }
};

async function getModel(userId: string) {
  // For percentage rollouts, hash user ID to decide deterministically
  const useNewModel = featureFlags.isEnabledForUser(
    'new_gpt4o_model_rollout',
    userId
  );

  if (useNewModel) {
    return 'gpt-4o'; // New, expensive model
  }

  return 'gpt-4o-mini'; // Stable, cheap model
}

Percentage Rollout by User

Consistent assignment: same user always gets same variant.

class PercentageRollout {
  isEnabledForUser(flagName: string, userId: string, percentage: number): boolean {
    // Create deterministic hash from user ID and flag name
    const hash = hashFunction(`${flagName}:${userId}`); // Any stable string hash (FNV-1a, murmur3, etc.)

    // Map hash to percentage
    const value = hash % 100;

    return value < percentage;
  }
}

// Usage
const rollout = new PercentageRollout();

// 5% of users: consistent assignment
for (const userId of ['user_1', 'user_2', 'user_3']) {
  const enabled = rollout.isEnabledForUser('new_model', userId, 5);
  console.log(`${userId}: ${enabled}`);
  // Exact values depend on the hash, but each user's
  // assignment is stable across calls and deployments
}

// Increase rollout to 10%
const allUsers = await db.users.find({});
const rolledOutUsers = allUsers.filter(u =>
  rollout.isEnabledForUser('new_model', u.id, 10)
);

console.log(`Rolled out to ${rolledOutUsers.length} users (10%)`);

Percentage rollout must be deterministic. Same user always gets same variant. Hash is stable across deployments.
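The `hashFunction` above is left abstract. Here is one minimal sketch using FNV-1a; the specific hash is an assumption, and any stable string hash works, as long as the same input always produces the same bucket:

```typescript
// FNV-1a: a simple, fast, deterministic string hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // Force unsigned 32-bit
}

// Map a (flag, user) pair to a stable bucket in [0, 100).
function bucketFor(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

function isEnabledForUser(flagName: string, userId: string, percentage: number): boolean {
  return bucketFor(flagName, userId) < percentage;
}
```

Because buckets only depend on the hash, raising the percentage from 5 to 10 keeps every already-enrolled user enrolled and only adds new ones.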

Targeting Rules: Enterprise Gets GPT-4o, Free Tier Gets GPT-4o-mini

Different tiers deserve different models.

interface FlagRule {
  id: string;
  flagName: string;
  conditions: Array<{
    attribute: string; // e.g., 'tier', 'country', 'plan'
    operator: 'equals' | 'in' | 'contains' | 'gt' | 'lt';
    value: string | string[] | number;
  }>;
  variant: string; // Which variant to serve if rule matches
  priority: number; // Higher priority matches first
}

const rules: FlagRule[] = [
  {
    id: 'rule_enterprise_gpt4o',
    flagName: 'llm_model',
    conditions: [
      {
        attribute: 'tier',
        operator: 'equals',
        value: 'enterprise'
      }
    ],
    variant: 'gpt-4o',
    priority: 100
  },
  {
    id: 'rule_pro_gpt4o_mini',
    flagName: 'llm_model',
    conditions: [
      {
        attribute: 'tier',
        operator: 'equals',
        value: 'pro'
      }
    ],
    variant: 'gpt-4o-mini',
    priority: 50
  },
  {
    id: 'rule_free_gpt4o_mini',
    flagName: 'llm_model',
    conditions: [
      {
        attribute: 'tier',
        operator: 'equals',
        value: 'free'
      }
    ],
    variant: 'gpt-4o-mini',
    priority: 25
  }
];

async function evaluateFlags(userId: string) {
  const user = await db.users.findOne({ id: userId });

  // Sort a copy by priority (highest first) to avoid mutating the shared rule list
  const sorted = [...rules].sort((a, b) => b.priority - a.priority);

  for (const rule of sorted) {
    if (matchesConditions(user, rule.conditions)) {
      return rule.variant; // First matching rule wins
    }
  }

  // Default
  return 'gpt-4o-mini';
}

function matchesConditions(user: any, conditions: FlagRule['conditions']): boolean {
  return conditions.every(condition => {
    const value = user[condition.attribute];

    switch (condition.operator) {
      case 'equals':
        return value === condition.value;
      case 'in':
        return (condition.value as string[]).includes(value);
      case 'contains':
        return (value as string).includes(condition.value);
      case 'gt':
        return value > condition.value;
      case 'lt':
        return value < condition.value;
      default:
        return false;
    }
  });
}

Targeting rules let you serve better models to paying customers and cheaper models to the free tier.
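First-match-wins resolution can be sketched in miniature. The rule shape here is deliberately simplified and illustrative, but the mechanics mirror the evaluator above: sort by priority, take the first match, fall back to a default:

```typescript
// Simplified rule: one tier condition per rule.
interface MiniRule { variant: string; priority: number; tier: string }

function resolveVariant(userTier: string, rules: MiniRule[], fallback: string): string {
  // Copy before sorting so the shared rule list is not mutated
  const sorted = [...rules].sort((a, b) => b.priority - a.priority);
  const match = sorted.find(r => r.tier === userTier);
  return match ? match.variant : fallback;
}

const miniRules: MiniRule[] = [
  { variant: 'gpt-4o', priority: 100, tier: 'enterprise' },
  { variant: 'gpt-4o-mini', priority: 50, tier: 'pro' }
];

resolveVariant('enterprise', miniRules, 'gpt-4o-mini'); // 'gpt-4o'
resolveVariant('free', miniRules, 'gpt-4o-mini');       // no rule matches: default
```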

Kill Switch for Runaway Costs

Instantly disable expensive features if costs spike.

class CostKillSwitch {
  async checkCostLimit(flagName: string) {
    const costThreshold = 1000; // Kill switch at $1000/hour
    const hourlySpend = await this.getHourlySpend(flagName);

    if (hourlySpend > costThreshold) {
      logger.error('Cost kill switch triggered', {
        flagName,
        hourlySpend,
        threshold: costThreshold
      });

      // Disable the flag
      await featureFlags.disable(flagName);

      // Alert ops
      await sendAlert({
        severity: 'critical',
        title: 'Cost kill switch triggered',
        message: `${flagName} spending $${hourlySpend}/hour, disabled`
      });

      return { allowed: false, reason: 'Cost limit exceeded' };
    }

    return { allowed: true };
  }

  async getHourlySpend(flagName: string) {
    const costs = await db.costByFlag.find({
      flagName,
      timestamp: { $gte: oneHourAgo() } // oneHourAgo(): helper returning a Date one hour back
    });

    return costs.reduce((sum, c) => sum + c.costUsd, 0);
  }
}

// Middleware: check kill switch before using flag
app.use(async (req, res, next) => {
  const killSwitch = new CostKillSwitch();

  const flagName = getRelevantFlag(req);
  const check = await killSwitch.checkCostLimit(flagName);

  if (!check.allowed) {
    return res.status(503).json({
      error: 'Feature temporarily unavailable',
      reason: check.reason
    });
  }

  next();
});

Cost kill switches are insurance. If a bug causes a runaway expensive operation, it's disabled automatically before it costs thousands.
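The spend query behind the kill switch can be sketched in memory. The post stores cost events in `db.costByFlag`; this illustrative version keeps them in an array, but the windowing and threshold logic are the same:

```typescript
// In-memory sketch of per-flag hourly spend tracking.
class HourlySpendTracker {
  private events: Array<{ flag: string; costUsd: number; at: number }> = [];

  record(flag: string, costUsd: number, at: number = Date.now()): void {
    this.events.push({ flag, costUsd, at });
  }

  // Sum cost events for this flag within the last hour.
  hourlySpend(flag: string, now: number = Date.now()): number {
    const oneHourAgo = now - 60 * 60 * 1000;
    return this.events
      .filter(e => e.flag === flag && e.at >= oneHourAgo)
      .reduce((sum, e) => sum + e.costUsd, 0);
  }

  breached(flag: string, thresholdUsd: number, now: number = Date.now()): boolean {
    return this.hourlySpend(flag, now) > thresholdUsd;
  }
}
```

In production this check should run on a schedule (e.g., every minute), not only inside the request path, so a runaway background job also trips the switch.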

Flag-Driven A/B Testing for AI Features

Use flags to run experiments.

interface Experiment {
  id: string;
  flagName: string;
  control: { variant: string; percentage: number };
  treatment: { variant: string; percentage: number };
  startDate: Date;
  endDate: Date;
  metric: 'user_satisfaction' | 'latency' | 'cost' | 'quality';
}

async function runExperiment(experiment: Experiment, userId: string) {
  // Assign users to control or treatment
  const flag = featureFlags.get(experiment.flagName);

  // 50% control, 50% treatment
  flag.type = 'experiment';
  flag.control = experiment.control;
  flag.treatment = experiment.treatment;
  flag.metric = experiment.metric;

  // Log which variant this user sees
  await db.experiments.insertOne({
    experimentId: experiment.id,
    userId,
    variant: getVariant(experiment, userId),
    timestamp: new Date()
  });
}

async function analyzeExperiment(experimentId: string) {
  const results = await db.experiments.find({ experimentId });

  const byVariant: Record<string, { scores: number[]; users: number }> = {};
  for (const result of results) {
    if (!byVariant[result.variant]) {
      byVariant[result.variant] = { scores: [], users: 0 };
    }

    // Fetch metric (user satisfaction, latency, etc.)
    const metric = await getMetric(result);
    byVariant[result.variant].scores.push(metric);
    byVariant[result.variant].users++;
  }

  // Calculate winner
  const controlMean = mean(byVariant.control.scores);
  const treatmentMean = mean(byVariant.treatment.scores);

  return {
    control: { mean: controlMean, n: byVariant.control.users },
    treatment: { mean: treatmentMean, n: byVariant.treatment.users },
    winner: treatmentMean > controlMean ? 'treatment' : 'control',
    confidence: calculateStatisticalSignificance(byVariant.control.scores, byVariant.treatment.scores)
  };
}

Experiments measure which flag variant performs better. Use results to decide rollout.
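`calculateStatisticalSignificance` is left abstract above. One common choice is Welch's t-statistic, sketched here; treating roughly |t| > 2 as significant is an assumption that only holds for reasonably large samples, and a proper analysis would compute a p-value from the t-distribution:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample variance (n - 1 denominator).
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t-statistic: difference in means over pooled standard error.
// Does not assume equal variances or equal sample sizes.
function welchT(control: number[], treatment: number[]): number {
  const se = Math.sqrt(
    variance(control) / control.length + variance(treatment) / treatment.length
  );
  return (mean(treatment) - mean(control)) / se;
}
```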

OpenFeature SDK With AI Context

OpenFeature is a standard for feature flags. Use it with AI context.

// Setup OpenFeature
import { OpenFeature } from '@openfeature/server-sdk'; // Formerly published as '@openfeature/js-sdk'

const provider = new MyCustomFlagProvider();
OpenFeature.setProvider(provider);

const client = OpenFeature.getClient();

// AI-specific context
const aiContext = {
  targetingKey: req.user.id, // OpenFeature's standard per-user targeting field
  tenantId: req.user.tenantId,
  userTier: req.user.tier,
  region: req.headers['x-region'],

  // AI-specific
  modelFamily: 'gpt-4', // What family of models
  promptType: 'summarization', // What task
  inputTokens: 150, // Estimated input tokens
  costBudget: 5.0 // Max willing to spend
};

const model = await client.getStringDetails('llm_model', 'gpt-4o-mini', aiContext);
console.log(model.value); // e.g., 'gpt-4o' for enterprise, 'gpt-4o-mini' for free

// Cost-aware flag evaluation
const maxCost = await client.getNumberDetails('max_request_cost_usd', 1.0, aiContext);
const estimatedCost = estimateTokenCost(prompt, model.value);

if (estimatedCost > maxCost.value) {
  return res.status(429).json({
    error: 'Request would exceed cost limit',
    estimated: estimatedCost,
    limit: maxCost.value
  });
}

OpenFeature provides a vendor-agnostic flag API. Custom context includes AI-specific fields.
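`estimateTokenCost` is left abstract above. A rough sketch, assuming ~4 characters per token and illustrative per-1K-token prices (check your provider's current rates; both the heuristic and the price table are assumptions):

```typescript
// Illustrative prices in USD per 1K input tokens; not authoritative.
const PRICE_PER_1K_TOKENS: Record<string, number> = {
  'gpt-4o': 0.005,
  'gpt-4o-mini': 0.00015
};

function estimateTokenCost(prompt: string, model: string): number {
  // ~4 characters per token is a common rough heuristic for English text;
  // a real implementation would use the model's tokenizer.
  const estimatedTokens = Math.ceil(prompt.length / 4);
  const pricePer1k = PRICE_PER_1K_TOKENS[model] ?? 0;
  return (estimatedTokens / 1000) * pricePer1k;
}
```

An estimate made before the request can only cover input tokens; output cost is unknown until the response arrives, so budget checks should leave headroom.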

Monitoring Per-Flag AI Quality Metrics

Track quality for each flag variant.

class FlagMetricsCollector {
  async recordMetric(flagName: string, variant: string, metrics: {
    latency: number;
    tokens: number;
    costUsd: number;
    userSatisfaction?: number; // 1-5 rating
    errorRate?: number;
  }) {
    await db.flagMetrics.insertOne({
      flagName,
      variant,
      ...metrics,
      timestamp: new Date()
    });

    // Aggregate metrics per variant, bucketed by UTC hour ("YYYY-MM-DDTHH")
    const hour = new Date().toISOString().slice(0, 13);
    await db.flagMetricsHourly.updateOne(
      { flagName, variant, hour },
      {
        $inc: {
          requestCount: 1,
          totalCost: metrics.costUsd,
          totalTokens: metrics.tokens,
          totalLatency: metrics.latency
        },
        $push: { satisfactionScores: metrics.userSatisfaction }
      },
      { upsert: true }
    );
  }

  async getMetricsForVariant(flagName: string, variant: string, period: string) {
    const metrics = await db.flagMetricsHourly.find({
      flagName,
      variant,
      hour: { $gte: startOfPeriod(period) }
    });

    const sum = (f: (m: any) => number) => metrics.reduce((acc, m) => acc + f(m), 0);

    return {
      avgCost: sum(m => m.totalCost) / sum(m => m.requestCount),
      avgLatency: sum(m => m.totalLatency) / sum(m => m.requestCount),
      avgSatisfaction: mean(metrics.flatMap(m => m.satisfactionScores)),
      errorRate: sum(m => m.totalErrors) / sum(m => m.requestCount)
    };
  }
}

// Compare variants
async function compareVariants(flagName: string) {
  const collector = new FlagMetricsCollector();
  const variants = await getVariants(flagName);

  const comparison: Record<string, any> = {};
  for (const variant of variants) {
    comparison[variant] = await collector.getMetricsForVariant(flagName, variant, 'last_7_days');
  }

  return comparison;
}

Monitor cost, latency, tokens, and user satisfaction per variant. Use this data to make rollout decisions.
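Regression detection on these aggregates can be as simple as a tolerance check. The field names match `getMetricsForVariant` above; the 20% threshold is illustrative, and real systems should also require minimum sample sizes before acting:

```typescript
interface VariantStats { avgCost: number; avgLatency: number; avgSatisfaction: number }

// Flag the treatment variant if any metric degrades beyond `tolerance`
// relative to control. Returns the names of regressed metrics.
function detectRegressions(
  control: VariantStats,
  treatment: VariantStats,
  tolerance = 0.2 // Allow 20% degradation before flagging
): string[] {
  const issues: string[] = [];
  if (treatment.avgCost > control.avgCost * (1 + tolerance)) issues.push('cost');
  if (treatment.avgLatency > control.avgLatency * (1 + tolerance)) issues.push('latency');
  if (treatment.avgSatisfaction < control.avgSatisfaction * (1 - tolerance)) issues.push('satisfaction');
  return issues;
}
```

A non-empty result is a reason to pause a rollout; whether it should also trip the kill switch depends on which metric regressed.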

Checklist

  • Feature flags for model selection, not hardcoded models
  • Percentage rollouts: 1% → 5% → 25% → 100%
  • Targeting rules for tier-based model assignment
  • Cost kill switches: auto-disable if cost > threshold
  • A/B tests built on flags with statistical analysis
  • OpenFeature SDK for vendor-agnostic flag management
  • Per-flag quality metrics: cost, latency, satisfaction
  • Flag evaluation includes AI-specific context (tokens, budget)
  • Kill switch response is clear: why disabled, when available
  • Monitor flag health: compare variants, detect regressions

Conclusion

Feature flags transform AI deployments from binary on/off to gradual, measurable rollouts. Kill switches provide insurance. Metrics reveal quality regressions early. Using flags for model versions, prompts, and features makes AI deployment boring and safe.