The Overconfident Junior Breaking Prod — Guardrails That Protect Without Demoralizing

Introduction

Junior engineers break production in predictable ways: they have good intentions, incomplete mental models of system consequences, and access that carries less friction than it should. The temptation is to respond with surveillance or restrictions that demoralize the engineer and stunt their growth. The better response is to design systems where potentially catastrophic actions require extra steps (production deploys require peer review, database migrations have mandatory backup steps, and irreversible operations require explicit confirmation) while the engineer retains full autonomy in development.

How Junior Engineers Break Production

Common junior-engineer production incidents:

1. "I'll fix this quickly in prod directly"
Direct production database change without a ticket or review
Wrong WHERE clause → deletes wrong rows

2. "This migration is the same as staging"
Runs migration on prod without checking prod-specific conditions
Table size: staging 1000 rows, prod 50M rows → locks table for hours

3. "I need to update this config quickly"
Modifies production environment variable
Forgets to revert → the stale value causes an incident 3 weeks later

4. "I'll just force push to fix the merge conflict"
Runs git push --force on main
3 other engineers' commits gone

5. "This is staging, right?"
S3 bucket delete on production bucket
All user uploads gone, no versioning enabled

Prevention philosophy:
- Make the dangerous action require extra steps
- Make the safe action require fewer steps
- Don't remove access — add friction to irreversible operations
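
One way to read the philosophy above is as a gate that stays out of the way in dev and staging but asks for a typed confirmation before an irreversible production operation. The sketch below is only an illustration under that reading; `guardIrreversible` and its parameter names are hypothetical, not part of any tool mentioned in this post.

```javascript
// Hypothetical guard: every name here is an illustrative assumption.
// Irreversible operations in production require the caller to type the
// exact target name; everything else passes through with no friction.
function guardIrreversible({ environment, target, irreversible, typedConfirmation }) {
  const needsFriction = environment === 'production' && irreversible;
  if (!needsFriction) {
    return { allowed: true, reason: 'no confirmation required' };
  }
  const hasTarget = typeof target === 'string' && target.length > 0;
  if (hasTarget && typedConfirmation === target) {
    return { allowed: true, reason: 'target name confirmed' };
  }
  return {
    allowed: false,
    reason: `type "${target}" exactly to confirm this production operation`,
  };
}

// Usage: a near-miss confirmation is rejected, a dev operation is not gated.
const blocked = guardIrreversible({
  environment: 'production',
  target: 'user-uploads-prod',
  irreversible: true,
  typedConfirmation: 'user-uploads',
});
const ungated = guardIrreversible({
  environment: 'dev',
  target: 'scratch-bucket',
  irreversible: true,
});
console.log(blocked.allowed, ungated.allowed);
```

Requiring the exact resource name, rather than a generic "yes", is a deliberate choice in this sketch: it forces a second look at which environment the command is pointed at.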

Fix 1: Branch Protection and Required Reviews

# Branch protection for main: blocks force pushes and direct commits
# GitHub does not read these rules from a file in the repository.
# Configure them in GitHub Settings → Branches → Branch protection rules (or through the REST API)

# Rules for 'main':
# ✅ Require a pull request before merging
# ✅ Require 2 approvals, plus review from Code Owners so a senior engineer signs off on sensitive paths
# ✅ Dismiss stale approvals when new commits are pushed
# ✅ Require status checks to pass (CI, lint, tests)
# ✅ Restrict who can push (no direct push, even for seniors)
# ✅ Require linear history (no merge commits)
# ✅ Include administrators, shown in newer settings pages as "Do not allow bypassing the above settings" (no exceptions for "quick fixes")
# ❌ Allow force pushes: leave disabled

# CODEOWNERS for sensitive paths — senior engineer must review
# .github/CODEOWNERS
/db/migrations/    @senior-engineer @staff-engineer  # Migration reviewer required
/infrastructure/   @platform-team                    # Infra change reviewer required
/scripts/deploy*   @senior-engineer                  # Deploy script change reviewed
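
Settings configured by hand tend to drift, so it can help to check them periodically. The sketch below audits a branch protection object against the checklist above. The field names are assumed to follow the shape of GitHub's REST "get branch protection" response; verify them against the current API documentation before relying on this. `auditBranchProtection` is a hypothetical helper, and fetching the object is left out.

```javascript
// Returns human-readable violations; an empty array means the settings
// satisfy the checklist. Field names are assumed to match GitHub's REST
// branch protection response and should be verified against the API docs.
function auditBranchProtection(protection) {
  const violations = [];
  const reviews = protection.required_pull_request_reviews;

  if (!reviews) {
    violations.push('pull request reviews are not required');
  } else {
    if ((reviews.required_approving_review_count ?? 0) < 2) {
      violations.push('fewer than 2 approvals required');
    }
    if (!reviews.dismiss_stale_reviews) {
      violations.push('stale reviews are not dismissed');
    }
    if (!reviews.require_code_owner_reviews) {
      violations.push('code owner review is not required');
    }
  }
  if (!protection.required_status_checks) {
    violations.push('status checks are not required');
  }
  if (!protection.required_linear_history?.enabled) {
    violations.push('linear history is not required');
  }
  if (!protection.enforce_admins?.enabled) {
    violations.push('administrators can bypass the rules');
  }
  // Treat a missing field as a violation so the audit fails closed.
  if (protection.allow_force_pushes?.enabled !== false) {
    violations.push('force pushes are not explicitly disabled');
  }
  return violations;
}

// Usage with a hand-written example object (not real API output).
const exampleProtection = {
  required_pull_request_reviews: {
    required_approving_review_count: 1,
    dismiss_stale_reviews: true,
    require_code_owner_reviews: true,
  },
  required_status_checks: { strict: true, contexts: ['ci'] },
  required_linear_history: { enabled: true },
  enforce_admins: { enabled: true },
  allow_force_pushes: { enabled: false },
};
console.log(auditBranchProtection(exampleProtection));
// → [ 'fewer than 2 approvals required' ]
```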

Fix 2: Production Database Safety Rails

#!/bin/bash
# safe-migration.sh — migration script that enforces safety steps

set -euo pipefail

ENV="${1:-}"
if [ -z "$ENV" ]; then
  echo "Usage: ./safe-migration.sh [staging|production]"
  exit 1
fi

# Production requires extra steps
if [ "$ENV" == "production" ]; then
  echo "⚠️  Production migration — extra checks required"

  # Step 1: Mandatory backup
  echo "Creating backup before migration..."
  BACKUP_NAME="pre-migration-$(date +%Y%m%d-%H%M%S)"
  aws rds create-db-snapshot \
    --db-instance-identifier myapp-prod \
    --db-snapshot-identifier "$BACKUP_NAME"

  echo "Backup created: $BACKUP_NAME"
  echo "Waiting for backup to complete..."
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$BACKUP_NAME"

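  # Step 2: Estimate migration impact
  # reltuples is the planner's row estimate for the whole table, not an exact count.
  # -A and -t strip padding and headers so the value can be compared as an integer.
  ROW_COUNT=$(psql "$DATABASE_URL" -At -c "SELECT reltuples::bigint FROM pg_class WHERE relname = '${MIGRATION_TABLE:-unknown}'")
  if ! [[ "$ROW_COUNT" =~ ^[0-9]+$ ]]; then
    # Fail closed: an unknown table or a never-analyzed table gives no usable estimate
    echo "Could not estimate rows for table '${MIGRATION_TABLE:-unknown}'. Set MIGRATION_TABLE and run ANALYZE if needed."
    exit 1
  fi
  echo "Estimated rows in table: $ROW_COUNT"

  if [ "$ROW_COUNT" -gt 1000000 ]; then
    echo "⚠️  Large table migration (>1M rows). This may lock the table."
    read -r -p "Continue? (type 'yes' to confirm): " CONFIRM
    if [ "$CONFIRM" != "yes" ]; then
      echo "Aborted."
      exit 1
    fi
  fi

  # Step 3: Record who ran the migration and which second engineer confirmed it
  # Typed names are an honor-system check, not authentication
  read -r -p "Enter your name (for audit log): " ENGINEER_NAME
  read -r -p "Enter the name of the second engineer who reviewed this migration: " REVIEWER_NAME
  if [ -z "$ENGINEER_NAME" ] || [ -z "$REVIEWER_NAME" ] || [ "$ENGINEER_NAME" == "$REVIEWER_NAME" ]; then
    echo "A second engineer must confirm production migrations. Aborted."
    exit 1
  fi
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) Production migration by $ENGINEER_NAME, confirmed by $REVIEWER_NAME" >> /var/log/migration-audit.log
fi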

# Run the migration
echo "Running migration..."
yarn migrate:latest

echo "✅ Migration complete"
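
The production checks in the script are easier to review and unit test when the decision logic is separated from the commands that act on it. The sketch below expresses the same three steps as a pure function. It is an illustration only: `migrationPreflight` and its field names are hypothetical, and the 1M-row threshold is copied from the script.

```javascript
// Same threshold the shell script uses for its large-table warning.
const LARGE_TABLE_ROWS = 1000000;

// Returns the list of reasons a migration must not run yet.
// An empty array means the migration may proceed.
function migrationPreflight(check) {
  const blockers = [];
  if (check.environment !== 'production') {
    return blockers; // staging and dev get no extra friction
  }
  if (!check.backupCompleted) {
    blockers.push('backup snapshot has not completed');
  }
  if (check.estimatedRows > LARGE_TABLE_ROWS && !check.confirmedLargeTable) {
    blockers.push('large table migration needs explicit confirmation');
  }
  if (!check.operator) {
    blockers.push('operator name is required for the audit log');
  }
  // The second-engineer confirmation named in step 3 of the script.
  if (!check.reviewer || check.reviewer === check.operator) {
    blockers.push('a second engineer must confirm the migration');
  }
  return blockers;
}

// Usage: the 50M-row table from the earlier incident list, not yet confirmed.
const preflight = migrationPreflight({
  environment: 'production',
  backupCompleted: true,
  estimatedRows: 50000000,
  confirmedLargeTable: false,
  operator: 'alex',
  reviewer: 'sam',
});
console.log(preflight);
// → [ 'large table migration needs explicit confirmation' ]
```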

Fix 3: IAM and Access Boundaries

# Principle of least privilege: junior engineers get dev/staging access
# Production access requires explicit request and senior approval

# AWS IAM: Junior engineer policy
# - Full access to resources tagged Environment=dev or Environment=staging
# - Read-only access (Describe/List/Get) for troubleshooting production
# - Cannot: delete production buckets, modify production RDS, change prod security groups
#
# Caveat: a tag condition only matches when the action supports the
# aws:ResourceTag condition key. Many read-only calls (for example ec2:Describe*)
# do not, so the read-only statement carries no tag condition. Check each action
# in the AWS Service Authorization Reference before relying on tag-based Allow
# or Deny statements, especially for S3.

# aws-iam-junior-engineer.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Environment": ["dev", "staging"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ecs:Describe*",
        "ecs:List*",
        "rds:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "logs:Get*",
        "logs:FilterLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "s3:DeleteBucket",
        "s3:DeleteObject",
        "rds:DeleteDBInstance",
        "rds:DeleteDBSnapshot",
        "ec2:TerminateInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Environment": "production"
        }
      }
    }
  ]
}
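
To reason about how statements like these interact, it helps to walk through the evaluation order: an explicit Deny wins, then any matching Allow, and everything else is an implicit deny. The sketch below is a heavily simplified model of that order, not AWS's implementation. It handles only `*` wildcards in actions and `StringEquals` conditions, ignores `Resource`, and uses a cut-down hypothetical policy. It also shows why a request whose context lacks the tag key (as with EC2 `Describe*` calls, which do not support resource-level permissions) matches neither a tag-conditioned Allow nor a tag-conditioned Deny.

```javascript
// Simplified model only: real IAM evaluation has many more rules.
function actionMatches(pattern, action) {
  const source = pattern
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${source}$`, 'i').test(action); // action names are case-insensitive
}

function conditionHolds(condition, context) {
  if (!condition) return true;
  return Object.entries(condition.StringEquals ?? {}).every(([key, expected]) => {
    const accepted = Array.isArray(expected) ? expected : [expected];
    // A key absent from the request context never satisfies StringEquals.
    return key in context && accepted.includes(context[key]);
  });
}

function evaluate(policy, action, context = {}) {
  const matching = policy.Statement.filter(
    (s) =>
      [].concat(s.Action).some((p) => actionMatches(p, action)) &&
      conditionHolds(s.Condition, context),
  );
  if (matching.some((s) => s.Effect === 'Deny')) return 'explicit deny';
  if (matching.some((s) => s.Effect === 'Allow')) return 'allow';
  return 'implicit deny';
}

// Cut-down, hypothetical policy with two tag-conditioned statements.
const TAG = 'aws:ResourceTag/Environment';
const examplePolicy = {
  Statement: [
    { Effect: 'Allow', Action: '*', Condition: { StringEquals: { [TAG]: ['dev', 'staging'] } } },
    { Effect: 'Deny', Action: ['s3:DeleteBucket'], Condition: { StringEquals: { [TAG]: 'production' } } },
  ],
};

console.log(evaluate(examplePolicy, 'ec2:StartInstances', { [TAG]: 'staging' })); // → allow
console.log(evaluate(examplePolicy, 's3:DeleteBucket', { [TAG]: 'production' })); // → explicit deny
console.log(evaluate(examplePolicy, 'ec2:DescribeInstances', {})); // → implicit deny
```

For real policies, the IAM policy simulator is the authoritative check; a model like this is only useful for building intuition.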

Fix 4: Incident as Learning, Not Punishment

// When a junior engineer causes a production incident:
// The system failed before the person did

const incidentResponseForJuniorEngineer = {
  immediate: {
    priority: 'Fix the incident rather than look for someone to blame',
    engineerRole: 'Junior engineer learns from the fix, doesn\'t own it alone',
    mentorRole: 'Senior engineer guides but doesn\'t push junior aside',
    afterAction: 'Thank the engineer for being transparent about what happened',
  },

  postmortems: {
    avoid: [
      'Calling out individual mistakes in the public postmortem',
      '"This wouldn\'t have happened if they\'d been more careful"',
      'Making the engineer feel they can\'t be trusted',
    ],
    do: [
      'Focus on which guardrail was missing',
      '"The migration script didn\'t require a backup step — that\'s what we\'re fixing"',
      'Include the junior engineer in designing the fix — they\'ll never forget it',
    ],
  },

  systemFixes: [
    'Add the missing guardrail to prevent this class of mistake',
    'Add the scenario to onboarding so future engineers learn before doing',
    'Review what other paths have the same missing guardrail',
  ],
}

Fix 5: Production Access Ladder

Graduated access that grows with demonstrated judgment:

Months 1-3 (Onboarding):
- Full dev environment access
- Staging access while pairing with a senior engineer
- Read-only production access
- No direct production deploys

Months 3-6 (Foundation):
- Independent staging deploys
- Production deploys via CI/CD with required approval
- Can access production logs
- Still no direct production database access

Months 6-12 (Growing):
- Production deploys independently (still CI/CD, not manual)
- Can run pre-approved maintenance scripts in production
- Read access to production DB through read replica
- No write access to production DB without approval

Year 1+ (Trusted):
- All above, plus escalated access for incidents
- Can run production DB queries with senior review
- On-call rotation begins

The ladder gives junior engineers a clear path to more access
while protecting the system from mistakes that happen during
the learning period.
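
The ladder can also be kept as data so that tooling and onboarding docs read from one source. The sketch below is one possible encoding, not a prescribed one. Two assumptions are worth flagging: months that fall on a boundary (3, 6, 12) are assigned to the higher tier, and tenure only makes an engineer eligible, while a hypothetical `approvedTierName` set by their lead caps the result, to reflect "demonstrated judgment". All field names are illustrative.

```javascript
// Tiers mirror the ladder in this post; field names are illustrative.
const ACCESS_LADDER = [
  { name: 'Onboarding', fromMonth: 0, prodDeploy: 'none', prodDb: 'none', onCall: false },
  { name: 'Foundation', fromMonth: 3, prodDeploy: 'ci-with-approval', prodDb: 'none', onCall: false },
  { name: 'Growing', fromMonth: 6, prodDeploy: 'ci-independent', prodDb: 'read-replica', onCall: false },
  { name: 'Trusted', fromMonth: 12, prodDeploy: 'ci-independent', prodDb: 'queries-with-senior-review', onCall: true },
];

// Tenure sets the highest tier an engineer is eligible for; the tier their
// lead has approved (an assumption of this sketch) caps the result.
function tierFor(monthsOnTeam, approvedTierName = 'Onboarding') {
  const eligibleIndex = ACCESS_LADDER.reduce(
    (best, tier, index) => (monthsOnTeam >= tier.fromMonth ? index : best),
    0,
  );
  const approvedIndex = ACCESS_LADDER.findIndex((tier) => tier.name === approvedTierName);
  if (approvedIndex === -1) {
    throw new Error(`unknown tier: ${approvedTierName}`);
  }
  return ACCESS_LADDER[Math.min(eligibleIndex, approvedIndex)];
}

// Usage: seven months in, approved for anything, still capped by tenure.
console.log(tierFor(7, 'Trusted').name); // → Growing
// Fourteen months in, but only approved through Foundation.
console.log(tierFor(14, 'Foundation').name); // → Foundation
```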

Junior Engineer Safety Checklist

  • ✅ Branch protection: no direct push to main, required reviews for merges
  • ✅ CODEOWNERS: migrations and infrastructure require senior reviewer
  • ✅ IAM least privilege: production is read-only until access is earned
  • ✅ Migration scripts have mandatory backup steps for production
  • ✅ Large table migrations warn and require explicit confirmation
  • ✅ Graduated access ladder: responsibility grows with demonstrated judgment
  • ✅ Incidents treated as system failures, not personal failures — junior engineers design the fix

Conclusion

Overconfident juniors breaking production is a systems design problem, not a people problem. The guardrails that prevent most of these incidents (protected branches, required reviews, IAM least privilege, migration safety scripts, and graduated access ladders) take a few hours to implement and eliminate whole categories of incidents. The right question after an incident isn't "how do we watch this engineer more closely?" but "what made this action too easy to take, and how do we add appropriate friction?" Designed well, those guardrails don't restrict junior engineers. They give them a safe environment to learn and build the judgment that earns greater access over time.