Chaos Engineering in Practice — From Gamedays to Automated Resilience Testing
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Your system is "highly available" until it isn't. Chaos engineering deliberately breaks things in controlled ways to find bugs and weak points before a real outage finds them for you. Netflix runs it at scale. Most teams never run it at all. The gap between Netflix and your team isn't complexity; it's permission to break things and learn from it. This post covers designing steady-state hypotheses, running AWS Fault Injection Simulator (FIS) experiments, structuring gamedays that build confidence, and automating chaos to catch regressions.
- Defining Steady-State Hypothesis
- AWS Fault Injection Simulator Setup
- Gameday Structure and Communication
- Automated Chaos with Steady-State Validation
- Learning Reviews and Incident Response
- Chaos Maturity Model
- Checklist
- Conclusion
Defining Steady-State Hypothesis
Before you break anything, define what "working" means. A steady-state hypothesis is an observable property of your system when it's healthy.
// chaos/hypotheses.ts
interface SteadyStateHypothesis {
name: string;
metric: string;
expected: {
minValue?: number;
maxValue?: number;
operator: 'gte' | 'lte' | 'between' | 'eq';
};
window: number; // Time window in seconds to measure
}
const hypotheses: SteadyStateHypothesis[] = [
{
name: 'API response time p99',
metric: 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))',
expected: {
maxValue: 0.5, // 500ms p99
operator: 'lte',
},
window: 60,
},
{
name: 'Error rate',
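// Ratio of 5xx responses to all responses, so the 1% threshold is meaningful
metric: 'sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))',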
expected: {
maxValue: 0.01, // < 1% errors
operator: 'lte',
},
window: 60,
},
{
name: 'Database connection pool utilization',
metric: 'db_connection_pool_used / db_connection_pool_size',
expected: {
maxValue: 0.8, // < 80% utilization
operator: 'lte',
},
window: 60,
},
{
name: 'Message queue depth',
metric: 'rabbitmq_queue_messages_ready',
expected: {
maxValue: 10000, // Less than 10k pending
operator: 'lte',
},
window: 60,
},
];
export { hypotheses };
export type { SteadyStateHypothesis }; // Interfaces are type-only exports
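A hypothesis with a missing bound silently passes or fails every check, so it can help to lint the definitions before any experiment uses them. The following is a minimal sketch, not part of the original module: `lintHypothesis` is a hypothetical helper, and the rule that `'eq'` needs `maxValue` is an assumption based on how the validation code later in this post compares values.

```typescript
type Operator = 'gte' | 'lte' | 'between' | 'eq';

// Same shape as SteadyStateHypothesis in chaos/hypotheses.ts, repeated here
// so the sketch is self-contained.
interface SteadyStateHypothesis {
  name: string;
  metric: string;
  expected: { minValue?: number; maxValue?: number; operator: Operator };
  window: number; // seconds
}

// Hypothetical lint step: returns a list of problems, empty when the
// definition has every bound its operator needs.
function lintHypothesis(h: SteadyStateHypothesis): string[] {
  const problems: string[] = [];
  const { operator, minValue, maxValue } = h.expected;

  if (h.window <= 0) {
    problems.push(`${h.name}: window must be positive`);
  }
  // Assumption: 'eq' is compared against maxValue
  if (operator !== 'gte' && maxValue === undefined) {
    problems.push(`${h.name}: '${operator}' needs maxValue`);
  }
  if ((operator === 'gte' || operator === 'between') && minValue === undefined) {
    problems.push(`${h.name}: '${operator}' needs minValue`);
  }
  if (minValue !== undefined && maxValue !== undefined && minValue > maxValue) {
    problems.push(`${h.name}: minValue exceeds maxValue`);
  }
  return problems;
}

const wellFormed: SteadyStateHypothesis = {
  name: 'Message queue depth',
  metric: 'rabbitmq_queue_messages_ready',
  expected: { maxValue: 10000, operator: 'lte' },
  window: 60,
};

// Deliberately broken: 'gte' without minValue, and a zero-length window
const broken: SteadyStateHypothesis = {
  name: 'Error rate',
  metric: 'error_ratio',
  expected: { maxValue: 0.01, operator: 'gte' },
  window: 0,
};

console.log(lintHypothesis(wellFormed)); // []
console.log(lintHypothesis(broken)); // two problems: window and missing minValue
```

Running a check like this in CI keeps a typo in a threshold from turning into a false "steady state held" result.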
AWS Fault Injection Simulator Setup
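Use AWS FIS to inject controlled failures. Each experiment template names its targets by tag, the action to run against them, and a CloudWatch alarm that stops the experiment if the impact goes beyond what you planned for: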
// chaos/fis-experiments.ts
import { FisClient, CreateExperimentTemplateCommand } from '@aws-sdk/client-fis';
const fisClient = new FisClient({ region: 'us-east-1' });
export async function setupLatencyExperiment() {
const command = new CreateExperimentTemplateCommand({
clientToken: `latency-exp-${Date.now()}`,
description: 'Throttle the API Gateway stage to 1000 req/sec for 5 minutes',
targets: {
'api-target': {
resourceType: 'aws:apigateway:stage',
resourceTags: {
'Environment': 'production',
'ChaosReady': 'true',
},
selectionMode: 'ALL',
},
},
actions: {
'inject-latency': {
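// NOTE: confirm this action ID and the target resource type above against
// the current FIS actions reference before relying on this template.
actionId: 'aws:apigateway:throttle-api',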
parameters: {
duration: 'PT5M', // 5 minutes
rateLimit: '1000', // 1000 req/sec allowed
},
targets: {
'GatewayTarget': 'api-target',
},
},
},
stopConditions: [
{
source: 'aws:cloudwatch:alarm',
value: 'arn:aws:cloudwatch:us-east-1:ACCOUNT:alarm:p99-latency-high', // Alarm ARN
},
],
roleArn: 'arn:aws:iam::ACCOUNT:role/FISExperimentRole',
tags: {
'Team': 'platform',
'Type': 'chaos',
},
});
return fisClient.send(command);
}
export async function setupDatabaseFailoverExperiment() {
const command = new CreateExperimentTemplateCommand({
clientToken: `db-failover-${Date.now()}`,
description: 'Force RDS failover during traffic',
targets: {
'rds-target': {
resourceType: 'aws:rds:db',
resourceTags: {
'Failover': 'enabled',
},
selectionMode: 'ALL',
},
},
actions: {
'trigger-failover': {
actionId: 'aws:rds:reboot-db-instances',
parameters: {
forceFailover: 'true', // Fail over to the standby instead of a plain reboot
},
targets: {
'DBInstances': 'rds-target',
},
},
},
stopConditions: [
{
source: 'aws:cloudwatch:alarm',
value: 'arn:aws:cloudwatch:us-east-1:ACCOUNT:alarm:error-rate-critical', // Stop if errors spike
},
],
roleArn: 'arn:aws:iam::ACCOUNT:role/FISExperimentRole',
});
return fisClient.send(command);
}
export async function setupNetworkPartitionExperiment() {
// Partition between AZs
const command = new CreateExperimentTemplateCommand({
clientToken: `network-partition-${Date.now()}`,
description: 'Simulate network partition between AZ-A and AZ-B',
// Uses the built-in aws:network:disrupt-connectivity action, which targets
// subnets (not instances) and blocks traffic with temporary network ACLs.
// Assumes the AZ-A subnets carry the tags below.
targets: {
'az-a-subnets': {
resourceType: 'aws:ec2:subnet',
resourceTags: {
'AZ': 'us-east-1a',
'ChaosReady': 'true',
},
selectionMode: 'ALL',
},
},
actions: {
'partition-network': {
actionId: 'aws:network:disrupt-connectivity',
parameters: {
duration: 'PT5M', // 5 minutes
scope: 'availability-zone', // Deny traffic to and from other AZs
},
targets: {
'Subnets': 'az-a-subnets',
},
},
},
stopConditions: [
{
source: 'aws:cloudwatch:alarm',
value: 'arn:aws:cloudwatch:us-east-1:ACCOUNT:alarm:availability-critical', // Alarm ARN
},
],
roleArn: 'arn:aws:iam::ACCOUNT:role/FISExperimentRole',
});
return fisClient.send(command);
}
export async function setupPodKillExperiment() {
const command = new CreateExperimentTemplateCommand({
clientToken: `pod-kill-${Date.now()}`,
description: 'Terminate 25% of EKS worker nodes to kill the pods running on them',
targets: {
'eks-pods': {
resourceType: 'aws:ec2:instance',
resourceTags: {
'KubernetesCluster': 'production',
'ChaosReady': 'true',
},
selectionMode: 'PERCENT(25)', // 25% of nodes (kills pods running on them)
},
},
actions: {
'terminate-instances': {
actionId: 'aws:ec2:terminate-instances',
parameters: {
force: 'true',
},
targets: {
'Instances': 'eks-pods',
},
},
},
stopConditions: [
{
source: 'aws:cloudwatch:alarm',
value: 'arn:aws:cloudwatch:us-east-1:ACCOUNT:alarm:critical-service-error-rate', // Alarm ARN
},
],
roleArn: 'arn:aws:iam::ACCOUNT:role/FISExperimentRole',
});
return fisClient.send(command);
}
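FIS expects `selectionMode` as a single string: `ALL`, `COUNT(n)`, or `PERCENT(n)`. Building that string through a small helper makes it harder to mistype and gives you one place to cap blast radius. This is a sketch under assumptions: `selectionMode`, the `Selection` type, and the `maxPercent` cap are hypothetical names, not part of the AWS SDK.

```typescript
// Hypothetical description of how many tagged resources an action may hit
type Selection =
  | { kind: 'all' }
  | { kind: 'count'; value: number }
  | { kind: 'percent'; value: number };

// Builds the selectionMode string for an FIS target.
// maxPercent is an assumed team policy: refuse anything larger.
function selectionMode(sel: Selection, maxPercent = 50): string {
  switch (sel.kind) {
    case 'all':
      return 'ALL';
    case 'count':
      if (!Number.isInteger(sel.value) || sel.value < 1) {
        throw new RangeError(`count must be a positive integer, got ${sel.value}`);
      }
      return `COUNT(${sel.value})`;
    case 'percent':
      if (!Number.isInteger(sel.value) || sel.value < 1 || sel.value > 100) {
        throw new RangeError(`percent must be an integer from 1 to 100, got ${sel.value}`);
      }
      if (sel.value > maxPercent) {
        throw new RangeError(`percent ${sel.value} exceeds blast radius cap of ${maxPercent}`);
      }
      return `PERCENT(${sel.value})`;
  }
}

console.log(selectionMode({ kind: 'percent', value: 25 })); // PERCENT(25)
console.log(selectionMode({ kind: 'count', value: 2 })); // COUNT(2)
```

The returned string can be passed directly as the `selectionMode` field of a target in `CreateExperimentTemplateCommand`.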
Gameday Structure and Communication
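A gameday is a scheduled chaos experiment with all hands on deck. Before it starts, set up PagerDuty overrides (or pause the relevant escalation policies) for the gameday window so that planned failures don't page people who aren't taking part.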
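// chaos/gameday-template.ts
export interface ExperimentAction {
time_minute: number; // Minutes after gameday start
action: string;
}
export interface GamedayPlan {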
date: string;
duration: number; // minutes
facilitator: string;
participants: string[];
slack_channel: string;
hypothesis: string;
experiment_actions: ExperimentAction[];
success_criteria: string[];
communication_template: string;
}
export const gameday: GamedayPlan = {
date: '2026-03-20T14:00:00Z',
duration: 120,
facilitator: 'alice@example.com',
participants: [
'backend-team',
'infrastructure-team',
'on-call-engineer',
'product-manager',
],
slack_channel: '#gameday-2026-03-20',
hypothesis:
'System maintains availability when 50% of API instances are killed',
experiment_actions: [
{
time_minute: 0,
action: 'Baseline traffic established (100 req/sec)',
},
{
time_minute: 5,
action: 'Kill 50% of API pods in AZ-A',
},
{
time_minute: 10,
action: 'Verify requests route to surviving pods',
},
{
time_minute: 20,
action: 'Restore killed pods',
},
{
time_minute: 25,
action: 'Return to baseline',
},
],
success_criteria: [
'p99 latency stays below 500ms',
'Error rate stays below 1%',
'No page alert triggered',
'Load balancer routes successfully',
],
communication_template: `
[14:00] Gameday start: Testing pod failure resilience
[14:05] CHAOS EVENT: Killing 50% of API pods
[14:10] Status: Requests routing to surviving pods, p99 latency: 250ms, error rate: 0.1%
[14:20] Chaos event complete
[14:30] Gameday end - SUCCESS
`,
};
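export async function notifySlack(message: string, channel: string) {
const webhookUrl = process.env.SLACK_WEBHOOK_URL;
if (!webhookUrl) {
throw new Error('SLACK_WEBHOOK_URL is not set');
}
const response = await fetch(webhookUrl, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
channel,
text: message, // Fallback text for notifications
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: message,
},
},
],
}),
});
if (!response.ok) {
throw new Error(`Slack webhook failed with HTTP ${response.status}`);
}
// Incoming webhooks reply with the plain-text body "ok", not JSON
return response.text();
}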
export async function triggerGameday(plan: GamedayPlan) {
const startTime = new Date(plan.date);
const endTime = new Date(startTime.getTime() + plan.duration * 60 * 1000);
await notifySlack(
`🎮 *Gameday Starting*: ${plan.hypothesis}\nDuration: ${
plan.duration
} minutes\nFacilitator: ${plan.facilitator}\nEnd time: ${endTime.toISOString()}`,
plan.slack_channel
);
return {
startTime,
endTime,
hypothesis: plan.hypothesis,
};
}
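The timestamps in `communication_template` are written by hand, so they can drift from `experiment_actions` when the plan changes. One option is to derive the announcement lines from the plan itself. The sketch below assumes the `ExperimentAction` shape used in the plan above; `renderTimeline` and `hhmmUtc` are hypothetical helpers.

```typescript
// Same shape as the entries in GamedayPlan.experiment_actions
interface ExperimentAction {
  time_minute: number; // minutes after gameday start
  action: string;
}

// Formats a Date as HH:MM in UTC
function hhmmUtc(d: Date): string {
  return d.toISOString().slice(11, 16);
}

// Turns a start time plus relative offsets into wall-clock announcement lines,
// sorted by offset so the output reads in chronological order.
function renderTimeline(startIso: string, actions: ExperimentAction[]): string[] {
  const start = new Date(startIso).getTime();
  if (Number.isNaN(start)) {
    throw new Error(`Invalid start date: ${startIso}`);
  }
  return [...actions]
    .sort((a, b) => a.time_minute - b.time_minute)
    .map((a) => `[${hhmmUtc(new Date(start + a.time_minute * 60000))}] ${a.action}`);
}

const lines = renderTimeline('2026-03-20T14:00:00Z', [
  { time_minute: 5, action: 'Kill 50% of API pods in AZ-A' },
  { time_minute: 0, action: 'Baseline traffic established (100 req/sec)' },
  { time_minute: 20, action: 'Restore killed pods' },
]);

console.log(lines.join('\n'));
// [14:00] Baseline traffic established (100 req/sec)
// [14:05] Kill 50% of API pods in AZ-A
// [14:20] Restore killed pods
```

Times are rendered in UTC to match the `Z` suffix on the plan's `date`; a team that announces in local time would format differently.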
Automated Chaos with Steady-State Validation
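Once gamedays are routine, run experiments on a schedule with no human in the loop. The runner verifies steady state before injecting a fault and keeps checking it while the experiment runs: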
// chaos/automated-chaos.ts
import { FisClient, StartExperimentCommand } from '@aws-sdk/client-fis';
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';
import type { SteadyStateHypothesis } from './hypotheses';
const fisClient = new FisClient({ region: 'us-east-1' });
const cwClient = new CloudWatchClient({ region: 'us-east-1' });
export async function validateSteadyState(
hypothesis: SteadyStateHypothesis
): Promise<boolean> {
const command = new GetMetricStatisticsCommand({
// Assumes hypothesis.metric is a CloudWatch metric name in this namespace.
// PromQL expressions must be evaluated against Prometheus instead.
Namespace: 'CustomMetrics',
MetricName: hypothesis.metric,
StartTime: new Date(Date.now() - hypothesis.window * 1000),
EndTime: new Date(),
Period: hypothesis.window,
Statistics: ['Average'],
});
const result = await cwClient.send(command);
const value = result.Datapoints?.[0]?.Average;
if (value === undefined) {
// No data is not evidence of health: treat it as a failed check
return false;
}
const { operator, minValue, maxValue } = hypothesis.expected;
// ?? keeps a legitimate bound of 0, which || would discard
const min = minValue ?? -Infinity;
const max = maxValue ?? Infinity;
switch (operator) {
case 'lte':
return value <= max;
case 'gte':
return value >= min;
case 'between':
return value >= min && value <= max;
case 'eq':
return value === maxValue;
}
}
export async function runAutomatedChaosExperiment(
experimentTemplateId: string,
hypotheses: SteadyStateHypothesis[]
): Promise<{
success: boolean;
findings: string[];
duration: number;
}> {
// Validate baseline
const baselineValid = await Promise.all(
hypotheses.map((h) => validateSteadyState(h))
);
if (!baselineValid.every(Boolean)) {
return {
success: false,
findings: ['Baseline steady state not met'],
duration: 0,
};
}
const startCommand = new StartExperimentCommand({
experimentTemplateId,
});
const experiment = await fisClient.send(startCommand);
const experimentId = experiment.experiment?.id;
if (!experimentId) {
throw new Error('FIS did not return an experiment ID');
}
const startTime = Date.now();
// Monitor during experiment; a Set records each violated hypothesis once
const violated = new Set<string>();
const monitorInterval = setInterval(async () => {
try {
const results = await Promise.all(
hypotheses.map((h) => validateSteadyState(h))
);
results.forEach((valid, i) => {
if (!valid) {
violated.add(`VIOLATED: ${hypotheses[i].name}`);
}
});
} catch (err) {
// A failed check is a finding too, not an unhandled rejection
violated.add(`CHECK FAILED: ${String(err)}`);
}
}, 10000); // Check every 10 seconds
// Wait for experiment to complete
await new Promise<void>((resolve) => {
const checkInterval = setInterval(() => {
// Simplification: wait a fixed 5 minutes.
// In real code, poll GetExperimentCommand with experimentId until
// state.status is terminal (completed, stopped, or failed).
if (Date.now() - startTime > 5 * 60 * 1000) {
clearInterval(checkInterval);
clearInterval(monitorInterval);
resolve();
}
}, 5000);
});
const findings = [...violated];
return {
success: findings.length === 0,
findings,
duration: Date.now() - startTime,
};
}
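An automated run ends in one of several FIS states, and only some of them say anything about the hypothesis. The sketch below shows one possible policy for mapping the final state and the recorded violations onto a `success`, `failure`, or `inconclusive` result. The status strings are assumed to mirror the FIS experiment states; `classifyOutcome` and the policy itself are illustrative, not something FIS provides.

```typescript
// Assumed to mirror the experiment states reported by FIS
type FisStatus =
  | 'pending'
  | 'initiating'
  | 'running'
  | 'completed'
  | 'stopping'
  | 'stopped'
  | 'failed';

type Outcome = 'success' | 'failure' | 'inconclusive';

// True once the experiment can no longer change state
function isTerminal(status: FisStatus): boolean {
  return status === 'completed' || status === 'stopped' || status === 'failed';
}

// Hypothetical policy:
// - any steady-state violation is a failure, whatever the final state
// - a completed run with no violations is a success
// - a stopped or failed run with no violations proves nothing, because the
//   fault may never have been fully injected
function classifyOutcome(status: FisStatus, violations: string[]): Outcome {
  if (!isTerminal(status)) {
    throw new Error(`Experiment is still ${status}`);
  }
  if (violations.length > 0) {
    return 'failure';
  }
  return status === 'completed' ? 'success' : 'inconclusive';
}

console.log(classifyOutcome('completed', [])); // success
console.log(classifyOutcome('completed', ['VIOLATED: Error rate'])); // failure
console.log(classifyOutcome('failed', [])); // inconclusive
```

Separating "the system failed" from "the experiment failed" keeps a broken IAM role or a mistagged target from being recorded as a resilience win.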
Learning Reviews and Incident Response
Document findings and create action items:
// chaos/learning-review.ts
export interface ChaosLearningReview {
experimentId: string;
date: string;
hypothesis: string;
result: 'success' | 'failure' | 'inconclusive';
findings: {
title: string;
severity: 'critical' | 'high' | 'medium' | 'low';
description: string;
rootCause?: string;
}[];
actionItems: {
title: string;
owner: string;
dueDate: string;
priority: 'p0' | 'p1' | 'p2' | 'p3';
}[];
metrics: {
baselineP99Latency: number;
peakP99LatencyDuringChaos: number;
errorRateBaseline: number;
errorRatePeakDuringChaos: number;
timeToRecover: number;
};
attendees: string[];
notesUrl: string;
}
const learningReview: ChaosLearningReview = {
experimentId: 'exp-2026-03-20-pod-kill',
date: '2026-03-20T16:00:00Z',
hypothesis: 'System maintains availability when 50% of API pods killed',
result: 'failure',
findings: [
{
title: 'Session state lost during pod death',
severity: 'critical',
description:
'Users were logged out when their pod died because sessions stored in-memory',
rootCause: 'No distributed session store (Redis/DynamoDB)',
},
{
title: 'Database connection pool exhaustion',
severity: 'high',
description:
'Surviving pods tried to establish new connections; pool hit limit',
rootCause:
'No connection pooling middleware; each pod creates own pool',
},
{
title: 'Load balancer slow to detect dead pods',
severity: 'medium',
description: '~15 seconds before traffic fully shifted to healthy pods',
rootCause: 'Health check interval set to 10 seconds',
},
],
actionItems: [
{
title: 'Implement Redis session store',
owner: 'backend-team',
dueDate: '2026-04-03',
priority: 'p0',
},
{
title: 'Add PgBouncer for DB connection pooling',
owner: 'infrastructure-team',
dueDate: '2026-04-10',
priority: 'p0',
},
{
title: 'Reduce ALB health check interval to 5 seconds',
owner: 'platform-team',
dueDate: '2026-03-27',
priority: 'p1',
},
{
title: 'Document resilience patterns in runbooks',
owner: 'documentation-team',
dueDate: '2026-04-17',
priority: 'p2',
},
],
metrics: {
baselineP99Latency: 250,
peakP99LatencyDuringChaos: 3500,
errorRateBaseline: 0.001,
errorRatePeakDuringChaos: 0.25,
timeToRecover: 45000, // 45 seconds
},
attendees: [
'alice@example.com',
'bob@example.com',
'charlie@example.com',
],
notesUrl:
'https://docs.example.com/learning-reviews/2026-03-20-pod-kill.md',
};
export async function createJiraTickets(review: ChaosLearningReview) {
for (const item of review.actionItems) {
const issuePayload = {
fields: {
project: { key: 'INFRA' },
summary: item.title,
description: `From chaos experiment ${review.experimentId}`,
issuetype: { name: 'Task' },
assignee: { name: item.owner },
duedate: item.dueDate,
labels: ['chaos-outcome', `priority-${item.priority}`],
},
};
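// Create via the Jira REST API. The v2 endpoint accepts a plain-text
// description and an assignee name; v3 (Jira Cloud) expects Atlassian
// Document Format and an assignee accountId instead.
const response = await fetch('https://jira.example.com/rest/api/2/issue', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.JIRA_API_TOKEN}`,
'Content-Type': 'application/json',
},
body: JSON.stringify(issuePayload),
});
if (!response.ok) {
throw new Error(`Jira returned HTTP ${response.status} for "${item.title}"`);
}
const created = await response.json();
console.log(`Created ticket: ${created.key}`);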
}
}
export async function postLearningReviewToWiki(review: ChaosLearningReview) {
const markdown = `
# Chaos Learning Review: ${review.experimentId}
**Date**: ${review.date}
**Hypothesis**: ${review.hypothesis}
**Result**: ${review.result.toUpperCase()}
## Findings
${review.findings
.map(
(f) => `
### ${f.title} (${f.severity.toUpperCase()})
${f.description}
**Root Cause**: ${f.rootCause || 'TBD'}
`
)
.join('\n')}
## Action Items
${review.actionItems
.map(
(a) => `
- [ ] **${a.title}** (Owner: ${a.owner}, Due: ${a.dueDate}, ${a.priority})`
)
.join('\n')}
## Metrics
| Metric | Baseline | Peak During Chaos |
|--------|----------|-------------------|
| P99 Latency | ${review.metrics.baselineP99Latency}ms | ${review.metrics.peakP99LatencyDuringChaos}ms |
| Error Rate | ${(review.metrics.errorRateBaseline * 100).toFixed(2)}% | ${(review.metrics.errorRatePeakDuringChaos * 100).toFixed(2)}% |
| Time to Recover | - | ${review.metrics.timeToRecover}ms |
`;
// Post to wiki (Confluence, Notion, etc.)
console.log('Learning review markdown:', markdown);
}
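Raw numbers in the metrics table are easier to discuss once they are expressed relative to baseline. The following sketch derives those ratios from the `metrics` block of a review. `summarizeImpact` and `ratio` are hypothetical helpers, and the field units (milliseconds and fractions) are assumed from the example review above.

```typescript
// Same fields as ChaosLearningReview['metrics']
interface ChaosMetrics {
  baselineP99Latency: number; // ms
  peakP99LatencyDuringChaos: number; // ms
  errorRateBaseline: number; // fraction, 0..1
  errorRatePeakDuringChaos: number; // fraction, 0..1
  timeToRecover: number; // ms
}

// Peak divided by baseline, rounded to one decimal place.
// Returns null when the baseline is zero or negative, where a ratio is undefined.
function ratio(peak: number, baseline: number): number | null {
  if (baseline <= 0) return null;
  return Math.round((peak / baseline) * 10) / 10;
}

function summarizeImpact(m: ChaosMetrics) {
  return {
    latencyMultiplier: ratio(m.peakP99LatencyDuringChaos, m.baselineP99Latency),
    errorRateMultiplier: ratio(m.errorRatePeakDuringChaos, m.errorRateBaseline),
    recoverySeconds: m.timeToRecover / 1000,
  };
}

// Values taken from the example learning review above
const impact = summarizeImpact({
  baselineP99Latency: 250,
  peakP99LatencyDuringChaos: 3500,
  errorRateBaseline: 0.001,
  errorRatePeakDuringChaos: 0.25,
  timeToRecover: 45000,
});

console.log(impact);
// { latencyMultiplier: 14, errorRateMultiplier: 250, recoverySeconds: 45 }
```

For the pod-kill review, that reads as p99 latency rising 14x and the error rate rising 250x, with recovery in 45 seconds.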
Chaos Maturity Model
// chaos/maturity.ts
export const CHAOS_MATURITY = {
LEVEL_1: {
name: 'Manual Experiments',
description: 'One-off chaos tests run by engineers',
practices: [
'Gamedays with planned scenarios',
'Manual impact assessment',
'Post-event learning reviews',
],
},
LEVEL_2: {
name: 'Automated Execution',
description: 'Chaos experiments run on schedule',
practices: [
'Scheduled FIS experiments',
'Automated steady-state validation',
'Automated alert on hypothesis violation',
],
},
LEVEL_3: {
name: 'Continuous Chaos',
description: 'Chaos runs constantly in non-prod and prod',
practices: [
'Per-deployment chaos test',
'Blast radius limits enforced',
'Automated remediation for safe failures',
'Chaos as code in CI/CD',
],
},
};
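The model is cumulative: a team is at a level only when it has adopted every practice at that level and the ones below it. A rough self-assessment could look like the sketch below. `assessMaturity` is a hypothetical helper, and the `LEVELS` array restates the practices above so the example is self-contained.

```typescript
// Practices per level, restated from CHAOS_MATURITY in ascending order
const LEVELS: { level: number; practices: string[] }[] = [
  {
    level: 1,
    practices: [
      'Gamedays with planned scenarios',
      'Manual impact assessment',
      'Post-event learning reviews',
    ],
  },
  {
    level: 2,
    practices: [
      'Scheduled FIS experiments',
      'Automated steady-state validation',
      'Automated alert on hypothesis violation',
    ],
  },
  {
    level: 3,
    practices: [
      'Per-deployment chaos test',
      'Blast radius limits enforced',
      'Automated remediation for safe failures',
      'Chaos as code in CI/CD',
    ],
  },
];

// Returns the highest fully adopted level (0 if none) and the practices
// still missing to reach the next one.
function assessMaturity(adopted: ReadonlySet<string>): {
  level: number;
  missingForNext: string[];
} {
  let level = 0;
  for (const l of LEVELS) {
    const missing = l.practices.filter((p) => !adopted.has(p));
    if (missing.length > 0) {
      return { level, missingForNext: missing };
    }
    level = l.level;
  }
  return { level, missingForNext: [] };
}

// Illustrative team: all of level 1 plus one level 2 practice
const assessment = assessMaturity(
  new Set([
    'Gamedays with planned scenarios',
    'Manual impact assessment',
    'Post-event learning reviews',
    'Scheduled FIS experiments',
  ])
);

console.log(assessment.level); // 1
console.log(assessment.missingForNext);
// [ 'Automated steady-state validation', 'Automated alert on hypothesis violation' ]
```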
Checklist
- Define steady-state hypotheses for critical services
- Run first gameday with all-hands participation
- Schedule monthly chaos experiments
- Set up AWS FIS with stop conditions
- Tag resources as ChaosReady in production
- Create PagerDuty overrides for gameday window
- Document learning reviews in wiki
- Create Jira tickets for findings
- Measure MTTR before/after chaos improvements
- Build institutional knowledge through repeated experiments
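For the MTTR item, the arithmetic is just the mean of recovery durations over a period, compared before and after the fixes land. A minimal sketch follows, assuming incidents are recorded with ISO 8601 detection and recovery timestamps. The `Incident` shape and `mttrMinutes` are hypothetical, and the timestamps are made-up inputs for illustration only.

```typescript
// Assumed record shape: ISO 8601 timestamps for detection and recovery
interface Incident {
  detectedAt: string;
  recoveredAt: string;
}

// Mean time to recover, in minutes, across a set of incidents
function mttrMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) {
    throw new Error('No incidents to average');
  }
  const totalMs = incidents.reduce((sum, i) => {
    const ms = Date.parse(i.recoveredAt) - Date.parse(i.detectedAt);
    // Also rejects NaN from unparseable timestamps
    if (!(ms >= 0)) {
      throw new Error(`Invalid incident window: ${i.detectedAt} to ${i.recoveredAt}`);
    }
    return sum + ms;
  }, 0);
  return totalMs / incidents.length / 60000;
}

// Illustrative data only: 30 and 50 minute recoveries before the fixes
const before = mttrMinutes([
  { detectedAt: '2026-01-05T10:00:00Z', recoveredAt: '2026-01-05T10:30:00Z' },
  { detectedAt: '2026-01-19T02:00:00Z', recoveredAt: '2026-01-19T02:50:00Z' },
]);

// Illustrative data only: 10 and 20 minute recoveries after the fixes
const after = mttrMinutes([
  { detectedAt: '2026-04-06T09:00:00Z', recoveredAt: '2026-04-06T09:10:00Z' },
  { detectedAt: '2026-04-20T15:00:00Z', recoveredAt: '2026-04-20T15:20:00Z' },
]);

const reductionPercent = ((before - after) / before) * 100;

console.log(before); // 40
console.log(after); // 15
console.log(reductionPercent); // 62.5
```

With only a handful of incidents per quarter the mean is noisy, so treat the comparison as a trend rather than a precise measurement.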
Conclusion
Chaos engineering transforms operational knowledge from anecdotal ("the system just handles it") to empirical ("we tested it and know exactly how it fails"). Start small with a gameday. Fail. Learn. Fix. Repeat. The hypotheses you violate become your roadmap to resilience. Every chaos experiment you run is an insurance policy against the failure you didn't anticipate.