SLOs, SLIs, and Error Budgets — Reliability Engineering That Product Teams Will Actually Use
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
SLOs (Service Level Objectives) translate reliability into business language. Instead of abstract uptime percentages, SLOs answer questions like "How much latency can we tolerate?" and "How many errors break our promise to users?" Error budgets quantify how much unreliability we can afford to spend. Without them, teams either move too fast and cause outages, or move too cautiously and sacrifice velocity. The right framework enables both speed and stability.
- SLI Definition: Measuring What Matters
- SLO Target Setting
- Error Budget Calculation
- Multi-Window Burn Rate Alerting
- Prometheus Recording Rules for SLIs
- SLO Dashboard in Grafana
- Error Budget Policies
- SLO Review Process
- Checklist
- Conclusion
SLI Definition: Measuring What Matters
SLIs (Service Level Indicators) are the measurements that matter to users:
// Service Level Indicators - actual measurements
interface SLIDefinition {
name: string;
description: string;
measurement: 'availability' | 'latency' | 'error_rate';
threshold: number; // percent for availability/error rate, milliseconds for latency
window: string; // e.g., "p99", "p95", "per request", "per minute"
}
const sliDefinitions: SLIDefinition[] = [
{
name: 'API Availability',
description: 'Percentage of requests that return 2xx or 3xx',
measurement: 'availability',
threshold: 99.9,
window: 'per request',
},
{
name: 'API Latency (p99)',
description: '99th percentile of request latency',
measurement: 'latency',
threshold: 500, // milliseconds
window: 'p99',
},
{
name: 'Error Rate',
description: 'Percentage of requests returning 5xx',
measurement: 'error_rate',
threshold: 0.1, // 0.1% = 1 in 1000
window: 'per minute',
},
{
name: 'Database Query Latency (p95)',
description: 'Time for 95th percentile database query',
measurement: 'latency',
threshold: 100, // milliseconds
window: 'p95',
},
];
// Measure SLIs continuously
class SLIMeasurement {
private durations: Map<string, number[]> = new Map();
private outcomes: Map<string, { success: number; total: number }> = new Map();
recordRequest(
sliName: string,
duration: number,
statusCode: number
): void {
if (!this.durations.has(sliName)) {
this.durations.set(sliName, []);
this.outcomes.set(sliName, { success: 0, total: 0 });
}
const values = this.durations.get(sliName)!;
values.push(duration);
// Keep a rolling window (~1 hour of data at ~100 req/s)
if (values.length > 360000) {
values.shift();
}
// Success = 2xx or 3xx
const outcome = this.outcomes.get(sliName)!;
outcome.total++;
if (statusCode >= 200 && statusCode < 400) {
outcome.success++;
}
}
calculateSLI(sliName: string, definition: SLIDefinition): number {
if (definition.measurement === 'latency') {
const values = this.durations.get(sliName) || [];
const percentile = parseInt(definition.window.replace('p', ''), 10);
return this.percentile(values, percentile / 100);
}
if (definition.measurement === 'availability') {
const outcome = this.outcomes.get(sliName);
if (!outcome || outcome.total === 0) return 100; // no data yet
return (outcome.success / outcome.total) * 100;
}
return 0;
}
private percentile(arr: number[], p: number): number {
if (arr.length === 0) return 0;
const sorted = [...arr].sort((a, b) => a - b);
const index = Math.ceil(sorted.length * p) - 1;
return sorted[Math.max(0, index)];
}
}
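The percentile math above is worth sanity-checking in isolation. A standalone sketch using the same nearest-rank method, fed with synthetic latencies (the numbers here are illustrative, not real traffic):

```typescript
// Nearest-rank percentile, the same formula used in SLIMeasurement.
function percentile(arr: number[], p: number): number {
  const sorted = [...arr].sort((a, b) => a - b);
  const index = Math.ceil(sorted.length * p) - 1;
  return sorted[Math.max(0, index)];
}

// 101 synthetic request latencies: 5ms, 10ms, ..., 505ms
const latencies = Array.from({ length: 101 }, (_, i) => (i + 1) * 5);

console.log(percentile(latencies, 0.99)); // 500 -> the p99 latency SLI
console.log(percentile(latencies, 0.5)); // 255 -> the median
```

Note that nearest-rank is only one of several percentile definitions; monitoring backends like Prometheus interpolate instead, so small discrepancies between app-side and backend-side percentiles are normal.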
SLO Target Setting
Set SLO targets that balance ambition with feasibility:
// SLO targets: What we promise to users
interface SLOTarget {
sliName: string;
target: number; // percentage for availability, milliseconds for latency
window: string; // "monthly", "quarterly", "yearly"
rationale: string;
}
const sloTargets: SLOTarget[] = [
{
sliName: 'API Availability',
target: 99.9, // 99.9% uptime
window: 'monthly',
rationale: 'Allow 43 minutes of downtime per month for maintenance/incidents',
},
{
sliName: 'API Latency (p99)',
target: 500, // 500ms
window: 'monthly',
rationale: 'Users notice latency >500ms, competitive requirement',
},
{
sliName: 'Error Rate',
target: 0.1, // 0.1%
window: 'monthly',
rationale: '1 error per 1,000 requests is acceptable and aligns with user expectations',
},
];
// Why NOT 99.99%?
// 99.9%  = ~43 minutes of downtime/month
// 99.99% = ~4.3 minutes of downtime/month
// 99.999% = ~26 seconds of downtime/month
// Each extra nine multiplies cost (infrastructure, testing, on-call burden)
// Diminishing returns: each nine buys you 10x less downtime at sharply higher cost,
// often for an improvement most users will never notice
function calculateDowntimeAllowance(sloTarget: number, windowDays: number): number {
const totalMinutes = windowDays * 24 * 60;
const allowedDowntimeMinutes = totalMinutes * (1 - sloTarget / 100);
return allowedDowntimeMinutes;
}
console.log('Monthly downtime allowance:');
console.log(`99.9%: ${calculateDowntimeAllowance(99.9, 30).toFixed(2)} minutes`); // 43.2
console.log(`99.95%: ${calculateDowntimeAllowance(99.95, 30).toFixed(2)} minutes`); // 21.6
console.log(`99.99%: ${calculateDowntimeAllowance(99.99, 30).toFixed(2)} minutes`); // 4.32
Error Budget Calculation
How much reliability can we "spend"?
interface ErrorBudget {
sloTarget: number;
windowDays: number;
totalEvents: number;
allowedBadEvents: number;
burnRate: number;
burnedSoFar: number;
remainingBudget: number;
daysRemaining: number;
}
class ErrorBudgetTracker {
private startDate: Date;
private sloTarget: number;
private windowDays: number;
private badEvents = 0;
private totalEvents = 0;
constructor(sloTarget: number, windowDays: number = 30) {
this.sloTarget = sloTarget;
this.windowDays = windowDays;
this.startDate = new Date();
}
recordEvent(success: boolean): void {
this.totalEvents++;
if (!success) {
this.badEvents++;
}
}
calculateBudget(): ErrorBudget {
const allowedBadEvents = Math.floor(
this.totalEvents * (1 - this.sloTarget / 100)
);
const burnRate = this.totalEvents > 0 ? this.badEvents / this.totalEvents : 0;
const burnedSoFar = this.badEvents;
const remainingBudget = Math.max(0, allowedBadEvents - burnedSoFar);
const elapsedDays =
(Date.now() - this.startDate.getTime()) / (1000 * 60 * 60 * 24);
const daysRemaining = this.windowDays - elapsedDays;
return {
sloTarget: this.sloTarget,
windowDays: this.windowDays,
totalEvents: this.totalEvents,
allowedBadEvents,
burnRate,
burnedSoFar,
remainingBudget,
daysRemaining,
};
}
getRiskLevel(): 'safe' | 'warning' | 'critical' {
const budget = this.calculateBudget();
if (budget.daysRemaining <= 0) return 'critical';
if (budget.allowedBadEvents === 0) return 'safe'; // no traffic recorded yet
const percentageOfBudgetRemaining = (budget.remainingBudget / budget.allowedBadEvents) * 100;
if (percentageOfBudgetRemaining > 50) return 'safe';
if (percentageOfBudgetRemaining > 10) return 'warning';
return 'critical';
}
}
// Example: 99.9% availability SLO for March
const budget = new ErrorBudgetTracker(99.9, 31);
// March so far: 10M requests, 1000 errors
for (let i = 0; i < 10000000; i++) {
budget.recordEvent(i < 9999000); // 1000 errors
}
const currentBudget = budget.calculateBudget();
console.log('Error Budget Report:');
console.log(`SLO Target: ${currentBudget.sloTarget}%`);
console.log(`Total Requests: ${currentBudget.totalEvents}`);
console.log(`Allowed Errors: ${currentBudget.allowedBadEvents}`);
console.log(`Errors So Far: ${currentBudget.burnedSoFar}`);
console.log(`Remaining Budget: ${currentBudget.remainingBudget}`);
console.log(`Risk Level: ${budget.getRiskLevel()}`);
// Output:
// SLO Target: 99.9%
// Total Requests: 10000000
// Allowed Errors: 10000
// Errors So Far: 1000
// Remaining Budget: 9000
// Risk Level: safe (9000/10000 remaining)
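The arithmetic behind that report is simple enough to verify by hand. An illustrative one-liner (using rounding here, rather than the tracker's more conservative floor):

```typescript
// Allowed bad events = total events x (1 - SLO/100), rounded.
// The tracker above uses Math.floor, which is stricter at the margin.
function allowedErrors(totalEvents: number, sloTarget: number): number {
  return Math.round(totalEvents * (1 - sloTarget / 100));
}

console.log(allowedErrors(10_000_000, 99.9)); // 10000 -- matches the report above
console.log(allowedErrors(1_000_000, 99.9)); // 1000
```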
Multi-Window Burn Rate Alerting
Alert on fast burn rates before budget exhaustion:
interface BurnRateAlert {
window: string; // "5m", "1h", "6h"
threshold: number; // burn rate as a multiple of the budgeted error rate
consecutiveBreaches: number; // how many windows must breach before alerting
}
class BurnRateAlerter {
private sloTarget: number;
private allowedBadEventsPerSecond: number;
private windows: Map<string, number[]> = new Map(); // window -> errors in that window
constructor(sloTarget: number, requestsPerSecond: number = 1000) {
this.sloTarget = sloTarget;
// 99.9% SLO = 0.1% error rate = 1 error per 1000 requests
const errorRate = (1 - sloTarget / 100);
this.allowedBadEventsPerSecond = requestsPerSecond * errorRate;
}
recordError(timestamp: number = Date.now()): void {
// Track errors across multiple windows
this.updateWindow('5m', timestamp);
this.updateWindow('1h', timestamp);
this.updateWindow('6h', timestamp);
}
private updateWindow(windowLabel: string, timestamp: number): void {
if (!this.windows.has(windowLabel)) {
this.windows.set(windowLabel, []);
}
const errors = this.windows.get(windowLabel)!;
errors.push(timestamp);
// Remove old errors outside window
const windowMs = this.getWindowMs(windowLabel);
const cutoff = timestamp - windowMs;
const filtered = errors.filter(t => t > cutoff);
this.windows.set(windowLabel, filtered);
}
private getWindowMs(label: string): number {
const map: Record<string, number> = {
'5m': 5 * 60 * 1000,
'1h': 60 * 60 * 1000,
'6h': 6 * 60 * 60 * 1000,
};
return map[label] || 60000;
}
checkBurnRates(): Map<string, number> {
const burnRates = new Map<string, number>();
for (const [window, errors] of this.windows.entries()) {
const windowMs = this.getWindowMs(window);
const seconds = windowMs / 1000;
const burnRate = errors.length / seconds;
burnRates.set(window, burnRate);
// Alert if burning >5x the allowed rate
if (burnRate > this.allowedBadEventsPerSecond * 5) {
console.warn(
`ALERT: High burn rate (${window}): ${burnRate.toFixed(2)} errors/sec ` +
`(5x threshold: ${(this.allowedBadEventsPerSecond * 5).toFixed(2)})`
);
}
}
return burnRates;
}
}
// Multi-window alerting thresholds
const burnRateAlerts: BurnRateAlert[] = [
{
window: '5m',
threshold: 10, // 10x burn rate for 5 minutes
consecutiveBreaches: 1, // Alert immediately
},
{
window: '1h',
threshold: 3, // 3x burn rate for 1 hour
consecutiveBreaches: 2, // Alert after 2 consecutive breaches (2 hours)
},
{
window: '6h',
threshold: 1, // 1x burn rate (entire error budget) for 6 hours
consecutiveBreaches: 1, // Alert immediately
},
];
// Benefits of multi-window:
// - 5m x 10x: Catches immediate spikes (deploy bug)
// - 1h x 3x: Catches sustained issues (resource exhaustion)
// - 6h x 1x: Catches slow budget burn (gradual performance regression)
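These thresholds have an intuitive reading: a sustained burn-rate multiple determines how quickly the monthly budget empties. A small illustrative helper makes the relationship concrete:

```typescript
// At a sustained burn rate of Nx, a budget sized for `windowDays`
// is fully consumed in windowDays / N days.
function daysToExhaustion(burnRateMultiple: number, windowDays: number = 30): number {
  return windowDays / burnRateMultiple;
}

console.log(daysToExhaustion(10)); // 3 -- a 10x burn empties the budget in 3 days
console.log(daysToExhaustion(3)); // 10 -- a 3x burn gives ~10 days to react
console.log(daysToExhaustion(1)); // 30 -- exactly on budget
```

This is why the short window gets the highest multiple: a 10x burn is an emergency you must catch in minutes, while a 1x burn over 6 hours is a trend you can address during business hours.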
Prometheus Recording Rules for SLIs
Precompute SLI metrics for fast queries:
# prometheus_recording_rules.yml
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Availability SLI: % of requests returning 2xx or 3xx
      - record: sli:request:success_ratio
        expr: |
          sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Latency SLI: p99 latency
      - record: sli:request:latency_p99
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )
      # Error rate SLI: % of requests returning 5xx
      - record: sli:request:error_ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Burn rate: actual error ratio divided by the SLO error budget
      # (0.001 = 0.1% budget for a 99.9% SLO; a value > 1 means burning too fast)
      - record: sli:request:burn_rate_5m
        expr: |
          (1 - sli:request:success_ratio) / 0.001
      # 1-hour burn rate for multi-window alerting
      - record: sli:request:burn_rate_1h
        expr: |
          (1 - avg_over_time(sli:request:success_ratio[1h])) / 0.001
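Recording rules pair naturally with Prometheus alerting rules. Below is an illustrative sketch of a fast-burn alert (the group, alert name, and thresholds are assumptions; only the recorded metric names come from the rules above). Requiring both the short and long window to breach is what keeps brief blips from paging anyone:

```yaml
# prometheus_alert_rules.yml (illustrative sketch)
groups:
  - name: slo_burn_rate_alerts
    rules:
      - alert: ErrorBudgetFastBurn
        # Page only when the 5m and 1h windows agree the burn is >10x,
        # so a momentary spike that immediately recovers stays silent
        expr: |
          sli:request:burn_rate_5m > 10
          and
          sli:request:burn_rate_1h > 10
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >10x -- exhausted in ~3 days at this rate"
```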
SLO Dashboard in Grafana
Visualize SLO status and error budget:
interface SLODashboard {
title: string;
description: string;
panels: DashboardPanel[];
}
interface DashboardPanel {
title: string;
type: 'graph' | 'stat' | 'gauge';
targets: MetricQuery[];
}
interface MetricQuery {
expr: string;
legendFormat: string;
}
const sloDashboard: SLODashboard = {
title: 'Service SLO Dashboard',
description: 'Real-time SLO status and error budget tracking',
panels: [
{
title: 'API Availability (99.9% SLO)',
type: 'gauge',
targets: [
{
expr: 'sli:request:success_ratio * 100',
legendFormat: 'Current: {{ instance }}',
},
],
},
{
title: 'Error Budget Remaining (Monthly)',
type: 'stat',
targets: [
{
// Assumes an additional recording rule that computes remaining budget
expr: 'sli:error_budget:remaining_percent',
legendFormat: 'Remaining %',
},
],
},
{
title: 'Burn Rate (Multi-Window)',
type: 'graph',
targets: [
{
expr: 'sli:request:burn_rate_5m',
legendFormat: '5min burn rate',
},
{
expr: 'sli:request:burn_rate_1h',
legendFormat: '1h burn rate',
},
],
},
{
title: 'API Latency (p99)',
type: 'graph',
targets: [
{
expr: 'sli:request:latency_p99',
legendFormat: 'p99 latency (seconds)',
},
],
},
{
title: 'Requests Per Second',
type: 'graph',
targets: [
{
expr: 'sum(rate(http_requests_total[5m]))',
legendFormat: 'Throughput',
},
],
},
],
};
// Grafana API to create/update dashboard
async function syncSLODashboard(dashboard: SLODashboard): Promise<void> {
const response = await fetch('http://grafana:3000/api/dashboards/db', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.GRAFANA_API_TOKEN}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
dashboard: {
title: dashboard.title,
description: dashboard.description,
panels: dashboard.panels,
},
}),
});
if (!response.ok) {
throw new Error(`Failed to sync dashboard: ${response.statusText}`);
}
}
Error Budget Policies
How to spend or protect error budget:
interface ErrorBudgetPolicy {
name: string;
condition: string;
action: string;
}
const errorBudgetPolicies: ErrorBudgetPolicy[] = [
{
name: 'Freeze on Critical Budget Depletion',
condition: 'Error budget remaining < 5%',
action: 'Halt all non-critical deployments. Production fixes only.',
},
{
name: 'Incident Review on Burn',
condition: 'Burn rate exceeds 2x the budgeted rate over the month',
action: 'Mandatory incident postmortem. Identify systemic issues.',
},
{
name: 'Testing Increase on Budget Pressure',
condition: 'Error budget remaining between 10-20%',
action: 'All deployments require explicit QA sign-off and canary testing.',
},
{
name: 'Infrastructure Investment on Consistent Burn',
condition: '3+ months of budget exhaustion',
action: 'Allocate engineering time to infrastructure improvements.',
},
];
async function enforceErrorBudgetPolicy(budget: ErrorBudget): Promise<void> {
const percentageRemaining = (budget.remainingBudget / budget.allowedBadEvents) * 100;
if (percentageRemaining < 5) {
console.warn('CRITICAL: Error budget depleted. Deploy freeze in effect.');
// Prevent non-hotfix deployments
// Notify on-call team
} else if (percentageRemaining < 20) {
console.warn('WARNING: Error budget at risk. Extra caution for deployments.');
// Require explicit approval
// Enable canary deployments
}
}
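These policies can be wired directly into the deploy pipeline. The sketch below is illustrative, not a definitive implementation -- the function name and thresholds simply mirror the policy table, and the hotfix bypass is an assumption about how most teams run a freeze:

```typescript
interface DeployDecision {
  allowed: boolean;
  reason: string;
}

// Gate a deployment on remaining error budget, per the policy table above.
function gateDeploy(percentBudgetRemaining: number, isHotfix: boolean): DeployDecision {
  if (isHotfix) {
    return { allowed: true, reason: 'Hotfixes bypass the gate' };
  }
  if (percentBudgetRemaining < 5) {
    return { allowed: false, reason: 'Budget depleted: deploy freeze in effect' };
  }
  if (percentBudgetRemaining < 20) {
    return { allowed: true, reason: 'Budget at risk: QA sign-off and canary required' };
  }
  return { allowed: true, reason: 'Budget healthy' };
}

console.log(gateDeploy(3, false).reason); // 'Budget depleted: deploy freeze in effect'
console.log(gateDeploy(3, true).reason); // 'Hotfixes bypass the gate'
```

Running a check like this in CI turns the error budget policy from a document into an enforced contract.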
SLO Review Process
Quarterly retrospectives on SLO targets:
interface SLOReviewMetrics {
sloName: string;
targetAchieved: boolean;
monthlyStatus: Array<{
month: string;
actual: number;
target: number;
}>;
incidents: number;
recommendation: string;
}
const sloReviewProcess = {
quarterly: async () => {
// Collect metrics from last quarter
const reviews: SLOReviewMetrics[] = [];
// Review each SLO
reviews.push({
sloName: 'API Availability',
targetAchieved: true, // Achieved 99.92-99.95% vs 99.9% target
monthlyStatus: [
{ month: 'January', actual: 99.92, target: 99.9 },
{ month: 'February', actual: 99.95, target: 99.9 },
{ month: 'March', actual: 99.94, target: 99.9 },
],
incidents: 1, // One incident with 2-hour duration
recommendation: 'SLO is achievable. Consider tightening to 99.95% next quarter.',
});
// Decision matrix
const decisions = {
consistentlyMissing: 'Loosen SLO to realistic target',
consistentlyExceeding: 'Tighten SLO to use resources optimally',
oneIncident: 'Keep as-is, focus on incident prevention',
multipleIncidents: 'Investigate root causes, may need infrastructure investment',
};
return reviews;
},
};
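The decision matrix can also be mechanized. A minimal illustrative sketch (the function name, thresholds, and recommendation wording are assumptions) that maps a quarter's monthly results onto a recommendation:

```typescript
interface MonthlyResult {
  month: string;
  actual: number; // achieved SLI, e.g. 99.94
  target: number; // SLO target, e.g. 99.9
}

// Map last quarter's monthly results onto the decision matrix above.
function recommendAction(months: MonthlyResult[]): string {
  const missed = months.filter((m) => m.actual < m.target).length;
  if (missed === months.length) return 'Loosen SLO to a realistic target';
  if (missed === 0) return 'Consider tightening the SLO';
  return 'Keep as-is; focus on incident prevention';
}

const q1: MonthlyResult[] = [
  { month: 'January', actual: 99.92, target: 99.9 },
  { month: 'February', actual: 99.95, target: 99.9 },
  { month: 'March', actual: 99.94, target: 99.9 },
];
console.log(recommendAction(q1)); // 'Consider tightening the SLO'
```

The point is not to automate the review away, but to force the conversation: a quarter of all-green months is evidence the target is too loose, not a reason to celebrate and move on.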
Checklist
- Define SLIs matching user experience (availability, latency, error rate)
- Set SLO targets (99.9% is usually right, not 99.99%)
- Implement error budget tracking (rolling window per month)
- Set up multi-window burn rate alerts (5m, 1h, 6h)
- Create Prometheus recording rules for fast SLI queries
- Build SLO dashboard in Grafana visible to entire team
- Document error budget policies (when to freeze deployments)
- Conduct quarterly SLO reviews with product and engineering
- Link deployment velocity to error budget health
- Use error budget data in postmortem process
Conclusion
SLOs align product expectations with engineering reality. Error budgets quantify the tradeoff between reliability and velocity: with good SLIs and burn rate alerts, you can safely deploy multiple times per day while knowing exactly when to slow down. Product teams get predictable stability, engineering teams get permission to move fast, and the system stays reliable.