SLOs, SLIs, and Error Budgets — Reliability Engineering That Product Teams Will Actually Use
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
SLOs (Service Level Objectives) translate reliability into business language. Instead of abstract uptime percentages, SLOs answer questions like "How much latency can we tolerate?" and "How many errors break our promise to users?" Error budgets quantify how much unreliability we can afford to spend. Without them, teams either move too fast and cause outages, or move too cautiously and sacrifice velocity. The right framework enables both speed and stability.
- SLI Definition: Measuring What Matters
- SLO Target Setting
- Error Budget Calculation
- Multi-Window Burn Rate Alerting
- Prometheus Recording Rules for SLIs
- SLO Dashboard in Grafana
- Error Budget Policies
- SLO Review Process
- Checklist
- Conclusion
SLI Definition: Measuring What Matters
SLIs (Service Level Indicators) are the measurements that matter to users:
// Service Level Indicators - actual measurements
interface SLIDefinition {
name: string;
description: string;
measurement: 'availability' | 'latency' | 'error_rate';
threshold: number; // percent for availability/error rate, milliseconds for latency
window: string; // e.g., "p99", "p95", "per request", "per minute"
}
const sliDefinitions: SLIDefinition[] = [
{
name: 'API Availability',
description: 'Percentage of requests that return 2xx or 3xx',
measurement: 'availability',
threshold: 99.9,
window: 'per request',
},
{
name: 'API Latency (p99)',
description: '99th percentile of request latency',
measurement: 'latency',
threshold: 500, // milliseconds
window: 'p99',
},
{
name: 'Error Rate',
description: 'Percentage of requests returning 5xx',
measurement: 'error_rate',
threshold: 0.1, // 0.1% = 1 in 1000
window: 'per minute',
},
{
name: 'Database Query Latency (p95)',
description: 'Time for 95th percentile database query',
measurement: 'latency',
threshold: 100, // milliseconds
window: 'p95',
},
];
// Measure SLIs continuously
class SLIMeasurement {
private durations: Map<string, number[]> = new Map();
private outcomes: Map<string, { success: number; total: number }> = new Map();
recordRequest(
sliName: string,
duration: number,
statusCode: number
): void {
if (!this.durations.has(sliName)) {
this.durations.set(sliName, []);
this.outcomes.set(sliName, { success: 0, total: 0 });
}
const values = this.durations.get(sliName)!;
values.push(duration);
// Keep a rolling window (~1 hour of data at ~100 req/s)
if (values.length > 360000) {
values.shift();
}
// Success = 2xx or 3xx
const outcome = this.outcomes.get(sliName)!;
outcome.total++;
if (statusCode >= 200 && statusCode < 400) {
outcome.success++;
}
}
calculateSLI(sliName: string, definition: SLIDefinition): number {
if (definition.measurement === 'latency') {
const values = this.durations.get(sliName) || [];
const percentile = parseInt(definition.window.replace('p', ''), 10);
return this.percentile(values, percentile / 100);
}
if (definition.measurement === 'availability') {
const outcome = this.outcomes.get(sliName);
if (!outcome || outcome.total === 0) return 100; // no data yet
return (outcome.success / outcome.total) * 100;
}
return 0;
}
private percentile(arr: number[], p: number): number {
if (arr.length === 0) return 0;
const sorted = [...arr].sort((a, b) => a - b);
const index = Math.ceil(sorted.length * p) - 1;
return sorted[Math.max(0, index)];
}
}
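The percentile math above is worth sanity-checking in isolation. A standalone sketch using the same nearest-rank method, fed with synthetic latencies (the numbers here are illustrative, not real traffic):

```typescript
// Nearest-rank percentile, the same formula used in SLIMeasurement.
function percentile(arr: number[], p: number): number {
  const sorted = [...arr].sort((a, b) => a - b);
  const index = Math.ceil(sorted.length * p) - 1;
  return sorted[Math.max(0, index)];
}

// 101 synthetic request latencies: 5ms, 10ms, ..., 505ms
const latencies = Array.from({ length: 101 }, (_, i) => (i + 1) * 5);

console.log(percentile(latencies, 0.99)); // 500 -> the p99 latency SLI
console.log(percentile(latencies, 0.5)); // 255 -> the median
```

Note that nearest-rank is only one of several percentile definitions; monitoring backends like Prometheus interpolate instead, so small discrepancies between app-side and backend-side percentiles are normal.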
SLO Target Setting
Set SLO targets that balance ambition with feasibility:
// SLO targets: What we promise to users
interface SLOTarget {
sliName: string;
target: number; // percentage for availability, milliseconds for latency
window: string; // "monthly", "quarterly", "yearly"
rationale: string;
}
const sloTargets: SLOTarget[] = [
{
sliName: 'API Availability',
target: 99.9, // 99.9% uptime
window: 'monthly',
rationale: 'Allow 43 minutes of downtime per month for maintenance/incidents',
},
{
sliName: 'API Latency (p99)',
target: 500, // 500ms
window: 'monthly',
rationale: 'Users notice latency >500ms, competitive requirement',
},
{
sliName: 'Error Rate',
target: 0.1, // 0.1%
window: 'monthly',
rationale: '1 error per 1,000 requests is acceptable and aligns with user expectations',
},
];
// Why NOT 99.99%?
// 99.9%  = ~43 minutes of downtime/month
// 99.99% = ~4.3 minutes of downtime/month
// 99.999% = ~26 seconds of downtime/month
// Each extra nine multiplies cost (infrastructure, testing, on-call burden)
// Diminishing returns: each nine buys you 10x less downtime at sharply higher cost,
// often for an improvement most users will never notice
function calculateDowntimeAllowance(sloTarget: number, windowDays: number): number {
const totalMinutes = windowDays * 24 * 60;
const allowedDowntimeMinutes = totalMinutes * (1 - sloTarget / 100);
return allowedDowntimeMinutes;
}
console.log('Monthly downtime allowance:');
console.log(`99.9%: ${calculateDowntimeAllowance(99.9, 30).toFixed(2)} minutes`); // 43.2
console.log(`99.95%: ${calculateDowntimeAllowance(99.95, 30).toFixed(2)} minutes`); // 21.6
console.log(`99.99%: ${calculateDowntimeAllowance(99.99, 30).toFixed(2)} minutes`); // 4.32
Error Budget Calculation
How much reliability can we "spend"?
interface ErrorBudget {
sloTarget: number;
windowDays: number;
totalEvents: number;
allowedBadEvents: number;
burnRate: number;
burnedSoFar: number;
remainingBudget: number;
daysRemaining: number;
}
class ErrorBudgetTracker {
private startDate: Date;
private sloTarget: number;
private windowDays: number;
private badEvents = 0;
private totalEvents = 0;
constructor(sloTarget: number, windowDays: number = 30) {
this.sloTarget = sloTarget;
this.windowDays = windowDays;
this.startDate = new Date();
}
recordEvent(success: boolean): void {
this.totalEvents++;
if (!success) {
this.badEvents++;
}
}
calculateBudget(): ErrorBudget {
const allowedBadEvents = Math.floor(
this.totalEvents * (1 - this.sloTarget / 100)
);
const burnRate = this.totalEvents > 0 ? this.badEvents / this.totalEvents : 0;
const burnedSoFar = this.badEvents;
const remainingBudget = Math.max(0, allowedBadEvents - burnedSoFar);
const elapsedDays =
(Date.now() - this.startDate.getTime()) / (1000 * 60 * 60 * 24);
const daysRemaining = this.windowDays - elapsedDays;
return {
sloTarget: this.sloTarget,
windowDays: this.windowDays,
totalEvents: this.totalEvents,
allowedBadEvents,
burnRate,
burnedSoFar,
remainingBudget,
daysRemaining,
};
}
getRiskLevel(): 'safe' | 'warning' | 'critical' {
const budget = this.calculateBudget();
if (budget.daysRemaining <= 0) return 'critical';
if (budget.allowedBadEvents === 0) return 'safe'; // no traffic recorded yet
const percentageOfBudgetRemaining = (budget.remainingBudget / budget.allowedBadEvents) * 100;
if (percentageOfBudgetRemaining > 50) return 'safe';
if (percentageOfBudgetRemaining > 10) return 'warning';
return 'critical';
}
}
// Example: 99.9% availability SLO for March
const budget = new ErrorBudgetTracker(99.9, 31);
// March so far: 10M requests, 1000 errors
for (let i = 0; i < 10000000; i++) {
budget.recordEvent(i < 9999000); // 1000 errors
}
const currentBudget = budget.calculateBudget();
console.log('Error Budget Report:');
console.log(`SLO Target: ${currentBudget.sloTarget}%`);
console.log(`Total Requests: ${currentBudget.totalEvents}`);
console.log(`Allowed Errors: ${currentBudget.allowedBadEvents}`);
console.log(`Errors So Far: ${currentBudget.burnedSoFar}`);
console.log(`Remaining Budget: ${currentBudget.remainingBudget}`);
console.log(`Risk Level: ${budget.getRiskLevel()}`);
// Output:
// SLO Target: 99.9%
// Total Requests: 10000000
// Allowed Errors: 10000
// Errors So Far: 1000
// Remaining Budget: 9000
// Risk Level: safe (9000/10000 remaining)
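The arithmetic behind that report is simple enough to verify by hand. An illustrative one-liner (using rounding here, rather than the tracker's more conservative floor):

```typescript
// Allowed bad events = total events x (1 - SLO/100), rounded.
// The tracker above uses Math.floor, which is stricter at the margin.
function allowedErrors(totalEvents: number, sloTarget: number): number {
  return Math.round(totalEvents * (1 - sloTarget / 100));
}

console.log(allowedErrors(10_000_000, 99.9)); // 10000 -- matches the report above
console.log(allowedErrors(1_000_000, 99.9)); // 1000
```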
Multi-Window Burn Rate Alerting
Alert on fast burn rates before budget exhaustion:
interface BurnRateAlert {
window: string; // "5m", "1h", "6h"
threshold: number; // burn rate as a multiple of the budgeted error rate
consecutiveBreaches: number; // how many windows must breach before alerting
}
class BurnRateAlerter {
private sloTarget: number;
private allowedBadEventsPerSecond: number;
private windows: Map<string, number[]> = new Map(); // window -> errors in that window
constructor(sloTarget: number, requestsPerSecond: number = 1000) {
this.sloTarget = sloTarget;
// 99.9% SLO = 0.1% error rate = 1 error per 1000 requests
const errorRate = (1 - sloTarget / 100);
this.allowedBadEventsPerSecond = requestsPerSecond * errorRate;
}
recordError(timestamp: number = Date.now()): void {
// Track errors across multiple windows
this.updateWindow('5m', timestamp);
this.updateWindow('1h', timestamp);
this.updateWindow('6h', timestamp);
}
private updateWindow(windowLabel: string, timestamp: number): void {
if (!this.windows.has(windowLabel)) {
this.windows.set(windowLabel, []);
}
const errors = this.windows.get(windowLabel)!;
errors.push(timestamp);
// Remove old errors outside window
const windowMs = this.getWindowMs(windowLabel);
const cutoff = timestamp - windowMs;
const filtered = errors.filter(t => t > cutoff);
this.windows.set(windowLabel, filtered);
}
private getWindowMs(label: string): number {
const map: Record<string, number> = {
'5m': 5 * 60 * 1000,
'1h': 60 * 60 * 1000,
'6h': 6 * 60 * 60 * 1000,
};
return map[label] || 60000;
}
checkBurnRates(): Map<string, number> {
const burnRates = new Map<string, number>();
for (const [window, errors] of this.windows.entries()) {
const windowMs = this.getWindowMs(window);
const seconds = windowMs / 1000;
const burnRate = errors.length / seconds;
burnRates.set(window, burnRate);
// Alert if burning >5x the allowed rate
if (burnRate > this.allowedBadEventsPerSecond * 5) {
console.warn(
`ALERT: High burn rate (${window}): ${burnRate.toFixed(2)} errors/sec ` +
`(5x threshold: ${(this.allowedBadEventsPerSecond * 5).toFixed(2)})`
);
}
}
return burnRates;
}
}
// Multi-window alerting thresholds
const burnRateAlerts: BurnRateAlert[] = [
{
window: '5m',
threshold: 10, // 10x burn rate for 5 minutes
consecutiveBreaches: 1, // Alert immediately
},
{
window: '1h',
threshold: 3, // 3x burn rate for 1 hour
consecutiveBreaches: 2, // Alert after 2 consecutive breaches (2 hours)
},
{
window: '6h',
threshold: 1, // 1x burn rate (entire error budget) for 6 hours
consecutiveBreaches: 1, // Alert immediately
},
];
// Benefits of multi-window:
// - 5m x 10x: Catches immediate spikes (deploy bug)
// - 1h x 3x: Catches sustained issues (resource exhaustion)
// - 6h x 1x: Catches slow budget burn (gradual performance regression)
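These thresholds have an intuitive reading: a sustained burn-rate multiple determines how quickly the monthly budget empties. A small illustrative helper makes the relationship concrete:

```typescript
// At a sustained burn rate of Nx, a budget sized for `windowDays`
// is fully consumed in windowDays / N days.
function daysToExhaustion(burnRateMultiple: number, windowDays: number = 30): number {
  return windowDays / burnRateMultiple;
}

console.log(daysToExhaustion(10)); // 3 -- a 10x burn empties the budget in 3 days
console.log(daysToExhaustion(3)); // 10 -- a 3x burn gives ~10 days to react
console.log(daysToExhaustion(1)); // 30 -- exactly on budget
```

This is why the short window gets the highest multiple: a 10x burn is an emergency you must catch in minutes, while a 1x burn over 6 hours is a trend you can address during business hours.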
Prometheus Recording Rules for SLIs
Precompute SLI metrics for fast queries:
# prometheus_recording_rules.yml
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Availability SLI: % of requests returning 2xx or 3xx
      - record: sli:request:success_ratio
        expr: |
          sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Latency SLI: p99 latency
      - record: sli:request:latency_p99
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )
      # Error rate SLI: % of requests returning 5xx
      - record: sli:request:error_ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Burn rate: actual error ratio divided by the SLO error budget
      # (0.001 = 0.1% budget for a 99.9% SLO; a value > 1 means burning too fast)
      - record: sli:request:burn_rate_5m
        expr: |
          (1 - sli:request:success_ratio) / 0.001
      # 1-hour burn rate for multi-window alerting
      - record: sli:request:burn_rate_1h
        expr: |
          (1 - avg_over_time(sli:request:success_ratio[1h])) / 0.001
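Recording rules pair naturally with Prometheus alerting rules. Below is an illustrative sketch of a fast-burn alert (the group, alert name, and thresholds are assumptions; only the recorded metric names come from the rules above). Requiring both the short and long window to breach is what keeps brief blips from paging anyone:

```yaml
# prometheus_alert_rules.yml (illustrative sketch)
groups:
  - name: slo_burn_rate_alerts
    rules:
      - alert: ErrorBudgetFastBurn
        # Page only when the 5m and 1h windows agree the burn is >10x,
        # so a momentary spike that immediately recovers stays silent
        expr: |
          sli:request:burn_rate_5m > 10
          and
          sli:request:burn_rate_1h > 10
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >10x -- exhausted in ~3 days at this rate"
```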
SLO Dashboard in Grafana
Visualize SLO status and error budget:
interface SLODashboard {
title: string;
description: string;
panels: DashboardPanel[];
}
interface DashboardPanel {
title: string;
type: 'graph' | 'stat' | 'gauge';
targets: MetricQuery[];
}
interface MetricQuery {
expr: string;
legendFormat: string;
}
const sloDashboard: SLODashboard = {
title: 'Service SLO Dashboard',
description: 'Real-time SLO status and error budget tracking',
panels: [
{
title: 'API Availability (99.9% SLO)',
type: 'gauge',
targets: [
{
expr: 'sli:request:success_ratio * 100',
legendFormat: 'Current: {{ instance }}',
},
],
},
{
title: 'Error Budget Remaining (Monthly)',
type: 'stat',
targets: [
{
// Assumes an additional recording rule that computes remaining budget
expr: 'sli:error_budget:remaining_percent',
legendFormat: 'Remaining %',
},
],
},
{
title: 'Burn Rate (Multi-Window)',
type: 'graph',
targets: [
{
expr: 'sli:request:burn_rate_5m',
legendFormat: '5min burn rate',
},
{
expr: 'sli:request:burn_rate_1h',
legendFormat: '1h burn rate',
},
],
},
{
title: 'API Latency (p99)',
type: 'graph',
targets: [
{
expr: 'sli:request:latency_p99',
legendFormat: 'p99 latency (seconds)',
},
],
},
{
title: 'Requests Per Second',
type: 'graph',
targets: [
{
expr: 'sum(rate(http_requests_total[5m]))',
legendFormat: 'Throughput',
},
],
},
],
};
// Grafana API to create/update dashboard
async function syncSLODashboard(dashboard: SLODashboard): Promise<void> {
const response = await fetch('http://grafana:3000/api/dashboards/db', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.GRAFANA_API_TOKEN}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
dashboard: {
title: dashboard.title,
description: dashboard.description,
panels: dashboard.panels,
},
}),
});
if (!response.ok) {
throw new Error(`Failed to sync dashboard: ${response.statusText}`);
}
}
Error Budget Policies
How to spend or protect error budget:
interface ErrorBudgetPolicy {
name: string;
condition: string;
action: string;
}
const errorBudgetPolicies: ErrorBudgetPolicy[] = [
{
name: 'Freeze on Critical Budget Depletion',
condition: 'Error budget remaining < 5%',
action: 'Halt all non-critical deployments. Production fixes only.',
},
{
name: 'Incident Review on Burn',
condition: 'Burn rate exceeds 2x the budgeted rate over the month',
action: 'Mandatory incident postmortem. Identify systemic issues.',
},
{
name: 'Testing Increase on Budget Pressure',
condition: 'Error budget remaining between 10-20%',
action: 'All deployments require explicit QA sign-off and canary testing.',
},
{
name: 'Infrastructure Investment on Consistent Burn',
condition: '3+ months of budget exhaustion',
action: 'Allocate engineering time to infrastructure improvements.',
},
];
async function enforceErrorBudgetPolicy(budget: ErrorBudget): Promise<void> {
const percentageRemaining = (budget.remainingBudget / budget.allowedBadEvents) * 100;
if (percentageRemaining < 5) {
console.warn('CRITICAL: Error budget depleted. Deploy freeze in effect.');
// Prevent non-hotfix deployments
// Notify on-call team
} else if (percentageRemaining < 20) {
console.warn('WARNING: Error budget at risk. Extra caution for deployments.');
// Require explicit approval
// Enable canary deployments
}
}
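These policies can be wired directly into the deploy pipeline. The sketch below is illustrative, not a definitive implementation -- the function name and thresholds simply mirror the policy table, and the hotfix bypass is an assumption about how most teams run a freeze:

```typescript
interface DeployDecision {
  allowed: boolean;
  reason: string;
}

// Gate a deployment on remaining error budget, per the policy table above.
function gateDeploy(percentBudgetRemaining: number, isHotfix: boolean): DeployDecision {
  if (isHotfix) {
    return { allowed: true, reason: 'Hotfixes bypass the gate' };
  }
  if (percentBudgetRemaining < 5) {
    return { allowed: false, reason: 'Budget depleted: deploy freeze in effect' };
  }
  if (percentBudgetRemaining < 20) {
    return { allowed: true, reason: 'Budget at risk: QA sign-off and canary required' };
  }
  return { allowed: true, reason: 'Budget healthy' };
}

console.log(gateDeploy(3, false).reason); // 'Budget depleted: deploy freeze in effect'
console.log(gateDeploy(3, true).reason); // 'Hotfixes bypass the gate'
```

Running a check like this in CI turns the error budget policy from a document into an enforced contract.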
SLO Review Process
Quarterly retrospectives on SLO targets:
interface SLOReviewMetrics {
sloName: string;
targetAchieved: boolean;
monthlyStatus: Array<{
month: string;
actual: number;
target: number;
}>;
incidents: number;
recommendation: string;
}
const sloReviewProcess = {
quarterly: async () => {
// Collect metrics from last quarter
const reviews: SLOReviewMetrics[] = [];
// Review each SLO
reviews.push({
sloName: 'API Availability',
targetAchieved: true, // Achieved 99.92-99.95% vs 99.9% target
monthlyStatus: [
{ month: 'January', actual: 99.92, target: 99.9 },
{ month: 'February', actual: 99.95, target: 99.9 },
{ month: 'March', actual: 99.94, target: 99.9 },
],
incidents: 1, // One incident with 2-hour duration
recommendation: 'SLO is achievable. Consider tightening to 99.95% next quarter.',
});
// Decision matrix
const decisions = {
consistentlyMissing: 'Loosen SLO to realistic target',
consistentlyExceeding: 'Tighten SLO to use resources optimally',
oneIncident: 'Keep as-is, focus on incident prevention',
multipleIncidents: 'Investigate root causes, may need infrastructure investment',
};
return reviews;
},
};
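The decision matrix can also be mechanized. A minimal illustrative sketch (the function name, thresholds, and recommendation wording are assumptions) that maps a quarter's monthly results onto a recommendation:

```typescript
interface MonthlyResult {
  month: string;
  actual: number; // achieved SLI, e.g. 99.94
  target: number; // SLO target, e.g. 99.9
}

// Map last quarter's monthly results onto the decision matrix above.
function recommendAction(months: MonthlyResult[]): string {
  const missed = months.filter((m) => m.actual < m.target).length;
  if (missed === months.length) return 'Loosen SLO to a realistic target';
  if (missed === 0) return 'Consider tightening the SLO';
  return 'Keep as-is; focus on incident prevention';
}

const q1: MonthlyResult[] = [
  { month: 'January', actual: 99.92, target: 99.9 },
  { month: 'February', actual: 99.95, target: 99.9 },
  { month: 'March', actual: 99.94, target: 99.9 },
];
console.log(recommendAction(q1)); // 'Consider tightening the SLO'
```

The point is not to automate the review away, but to force the conversation: a quarter of all-green months is evidence the target is too loose, not a reason to celebrate and move on.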
Checklist
- Define SLIs matching user experience (availability, latency, error rate)
- Set SLO targets (99.9% is usually right, not 99.99%)
- Implement error budget tracking (rolling window per month)
- Set up multi-window burn rate alerts (5m, 1h, 6h)
- Create Prometheus recording rules for fast SLI queries
- Build SLO dashboard in Grafana visible to entire team
- Document error budget policies (when to freeze deployments)
- Conduct quarterly SLO reviews with product and engineering
- Link deployment velocity to error budget health
- Use error budget data in postmortem process
Conclusion
SLOs align product expectations with engineering reality. Error budgets quantify the tradeoff between reliability and velocity: with good SLIs and burn rate alerts, you can safely deploy multiple times per day while knowing exactly when to slow down. Product teams get predictable stability, engineering teams get permission to move fast, and the system stays reliable.