Terraform at Scale — State Management, Module Versioning, and Team Workflows

Introduction

Terraform is the lingua franca of infrastructure as code. But at scale—multiple teams, multiple environments, multiple AWS accounts—Terraform projects become fragile. Shared state corruption, module drift, untracked changes, and the absence of a clear promotion path lead to configuration entropy. This post covers production patterns: remote state with locking, workspace versus directory strategies, Terragrunt for DRY multi-environment setups, module versioning, import for existing resources, and pre-commit hooks to catch issues early.

Remote State with S3 and DynamoDB Locking

Never store Terraform state locally in production. Local state is invisible to team members, vulnerable to data loss, and has no concurrency protection. Use S3 with DynamoDB for distributed locking.

Create the backend infrastructure (terraform/backend/main.tf):

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "org-terraform-state-${data.aws_caller_identity.current.account_id}"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name           = "terraform-locks"
  billing_mode   = "PAY_PER_REQUEST"
  hash_key       = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  tags = {
    Name        = "terraform-locks"
    Environment = "shared"
  }
}

data "aws_caller_identity" "current" {}

output "bucket_name" {
  value = aws_s3_bucket.terraform_state.id
}

output "dynamodb_table" {
  value = aws_dynamodb_table.terraform_locks.name
}

Configure backend in your Terraform project (backends.tf):

terraform {
  backend "s3" {
    bucket         = "org-terraform-state-123456789012" # your 12-digit account ID
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Initialize:

terraform init

Terraform will acquire the DynamoDB lock for any state-modifying operation (plan, apply, destroy), preventing concurrent modifications.
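Recent Terraform releases (1.10 and later) can also lock directly in S3 via a lockfile object, removing the DynamoDB dependency entirely. This is worth evaluating for new setups; a sketch of the backend block, reusing the bucket and key from above:

```hcl
terraform {
  backend "s3" {
    bucket       = "org-terraform-state-123456789012"
    key          = "prod/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # S3-native locking, Terraform >= 1.10; replaces dynamodb_table
  }
}
```

Existing projects can keep DynamoDB locking; the two mechanisms can also run side by side during a migration.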

Workspace vs Directory Strategy

Two approaches exist for separating environments: workspaces and directories. For most teams, directories are the safer choice for multi-environment management.

Workspace approach (not recommended for prod):

terraform workspace new staging
terraform workspace select staging
terraform apply

Workspaces share one configuration but keep separate state. The problem: the active workspace is invisible in the code, so it is easy to forget which one is selected and accidentally apply dev changes to prod.
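If you do use workspaces, a small guard function (a hypothetical helper, not part of Terraform itself) can refuse to apply when the active workspace is not the one you intended:

```shell
#!/bin/bash
# guard-apply.sh — refuse to run apply unless the active workspace matches.
set -uo pipefail

guard_apply() {
  local expected="$1" current
  # Ask Terraform which workspace is currently selected.
  current="$(terraform workspace show)" || return 1
  if [ "$current" != "$expected" ]; then
    echo "Refusing to apply: active workspace is '$current', expected '$expected'" >&2
    return 1
  fi
  terraform apply -input=false
}
```

Call `guard_apply staging` instead of a bare `terraform apply` and a wrong selection fails loudly instead of silently touching the wrong environment.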

Directory approach (recommended):

terraform/
├── shared/
│   ├── main.tf         # VPC, networking, shared resources
│   └── outputs.tf
├── prod/
│   ├── main.tf         # Prod-specific config
│   ├── terraform.tfvars
│   └── backend.tf      # prod/terraform.tfstate
├── staging/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── backend.tf      # staging/terraform.tfstate
└── modules/
    ├── network/
    ├── compute/
    └── database/

Each environment is a separate directory with distinct state. This prevents accidental production changes.
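Environment directories typically consume outputs from shared/ through the terraform_remote_state data source. A sketch, assuming the shared stack exports a `private_subnet_id` output (that name, and the bucket/key, are illustrative):

```hcl
# staging/main.tf — read outputs from the shared stack's remote state
data "terraform_remote_state" "shared" {
  backend = "s3"
  config = {
    bucket = "org-terraform-state-123456789012"
    key    = "shared/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
  # Place the instance in a subnet created by the shared stack.
  subnet_id     = data.terraform_remote_state.shared.outputs.private_subnet_id
}
```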

Terragrunt for DRY Configuration

Terragrunt eliminates duplication across environments: write the Terraform configuration once, and Terragrunt instantiates it per environment.

terragrunt.hcl (root):

locals {
  aws_region = "us-east-1"
  environment = get_env("ENVIRONMENT", "dev")
  project    = "myapp"
}

terraform {
  extra_arguments "common_env" {
    commands = get_terraform_commands_that_need_vars()
    env_vars = {
      AWS_REGION = local.aws_region
    }
  }
}

remote_state {
  backend = "s3"
  config = {
    bucket         = "${local.project}-terraform-${local.environment}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "${local.aws_region}"
      default_tags {
        tags = {
          Environment = "${local.environment}"
          Project     = "${local.project}"
          ManagedBy   = "Terraform"
        }
      }
    }
  EOF
}

Directory structure:

terragrunt/
├── terragrunt.hcl
├── environments/
│   ├── dev/
│   │   ├── terragrunt.hcl
│   │   └── vpc/
│   │       ├── terragrunt.hcl
│   │       └── main.tf
│   │
│   ├── staging/
│   │   ├── terragrunt.hcl
│   │   └── vpc/
│   │       ├── terragrunt.hcl
│   │       └── main.tf
│   │
│   └── prod/
│       ├── terragrunt.hcl
│       └── vpc/
│           ├── terragrunt.hcl
│           └── main.tf
└── modules/
    └── network/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf

environments/prod/terragrunt.hcl (included by every unit under prod/; an included file cannot itself contain include blocks, so the root configuration is pulled in at the unit level instead):

locals {
  environment = "prod"
  cidr_block  = "10.0.0.0/16"
}

inputs = {
  environment = local.environment
  cidr_block  = local.cidr_block
  enable_nat  = true
  nat_count   = 3
}

environments/prod/vpc/terragrunt.hcl:

include "root" {
  # Explicit path to the repo-root config: find_in_parent_folders() would
  # stop at the environment-level terragrunt.hcl first.
  path = "../../../terragrunt.hcl"
}

include "env" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules/network"
}

dependency "iam" {
  config_path = "../iam"
}

inputs = {
  iam_role_arn = dependency.iam.outputs.terraform_role_arn
}

Deploy all prod infrastructure:

cd terragrunt/environments/prod
terragrunt run-all apply

Module Versioning with Semantic Tags

Version modules to prevent unexpected breaking changes. Store modules in a Git repository or Terraform Registry.

Module source with version constraints:

module "network" {
  source = "git::https://github.com/org/terraform-modules.git//network?ref=v2.1.0"

  cidr_block = "10.0.0.0/16"
  environment = "prod"
}

module "database" {
  source = "git::https://github.com/org/terraform-modules.git//database?ref=v1.5.2"

  instance_class = "db.r6i.2xlarge"
  allocated_storage = 100
}

Inside modules, guard inputs with validation blocks so bad values fail at plan time rather than surfacing after a version bump:

variable "database_version" {
  description = "PostgreSQL version"
  type        = string
  default     = "15.2"

  validation {
    condition     = can(regex("^[0-9]+\\.[0-9]+$", var.database_version))
    error_message = "Database version must be in the format MAJOR.MINOR, e.g. 15.2."
  }
}

variable "instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.micro"

  validation {
    condition     = contains(["db.t3.micro", "db.t3.small", "db.r6i.large"], var.instance_class)
    error_message = "Instance class not allowed; use approved list."
  }
}

Tag releases semantically:

cd terraform-modules
git tag -a v2.1.0 -m "network: describe the change here"
git push origin v2.1.0
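Before bumping a pin, it helps to see which versions are actually in use across a repository. Assuming the `ref=vX.Y.Z` pin convention shown above, a quick grep gives a count per version:

```shell
# Count each pinned module version referenced in *.tf files under the
# current directory (assumes the ref=vX.Y.Z pin convention shown above).
grep -rhoE 'ref=v[0-9]+\.[0-9]+\.[0-9]+' --include='*.tf' . \
  | sort | uniq -c | sort -rn
```

Any version that appears only once is a candidate for consolidation; a long tail of old versions means upgrades are lagging.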

Drift Detection and Remediation

Drift occurs when infrastructure changes outside Terraform (manual updates, other tools). Detect and fix regularly.

Detect drift:

terraform plan -refresh-only

(The older terraform refresh command still works but is deprecated; plan -refresh-only reports the same differences without immediately rewriting state.)

If the plan shows changes, drift has occurred. To review the changes in machine-readable form, save the plan and inspect it with terraform show (plan -json streams UI log lines, not the plan document):

terraform plan -refresh-only -out=drift.tfplan
terraform show -json drift.tfplan | jq '.resource_drift[] | select(.change.actions | index("update"))'

Remediate drift:

Option 1: Apply Terraform (overwrite manual changes):

terraform apply

Option 2: Accept the drift by updating state to match reality (keeps the manual changes):

terraform apply -refresh-only

Option 3: Re-import modified resource:

terraform state rm aws_instance.web
aws ec2 describe-instances --query 'Reservations[0].Instances[0].InstanceId' --output text | \
  xargs -I {} terraform import aws_instance.web {}

Automate drift checks:

#!/bin/bash
# ci/detect-drift.sh
set -uo pipefail

terraform init -input=false

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes (drift)
terraform plan -refresh-only -detailed-exitcode -input=false
rc=$?
if [[ $rc -eq 2 ]]; then
  echo "Drift detected!"
  exit 1
fi
exit "$rc"

Import for Existing Resources

Use terraform import to bring existing AWS resources under Terraform management.

# List existing EC2 instances
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].[InstanceId, Tags[?Key==`Name`].Value | [0]]' \
  --output table

# Add a matching resource block to Terraform; attribute values must
# describe the instance as it actually exists, or the first plan will
# propose changes
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

# Import
terraform import aws_instance.web i-1234567890abcdef0

# Verify
terraform state show aws_instance.web
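On Terraform 1.5 and later, config-driven import is a declarative alternative to the imperative command: an import block is planned and applied like any other change, and plan can even draft the resource configuration for you. A sketch, reusing the resource address and instance ID from the example above:

```hcl
# import.tf — config-driven import (Terraform >= 1.5)
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}
```

Running terraform plan -generate-config-out=generated.tf writes a candidate resource block to generated.tf for review, which is far less error-prone than hand-writing configuration to match a live resource.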

Pre-commit Hooks (tflint, tfsec, Checkov)

Catch issues before commits. Install pre-commit framework and add Terraform hooks.

.pre-commit-config.yaml:

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.5.0
  hooks:
  - id: trailing-whitespace
  - id: end-of-file-fixer
  - id: check-yaml
  - id: check-added-large-files
    args: ['--maxkb=1024']

- repo: https://github.com/terraform-linters/tflint
  rev: v0.50.0
  hooks:
  - id: tflint
    args: ["--format", "json"]

- repo: https://github.com/aquasecurity/tfsec
  rev: v1.28.1
  hooks:
  - id: tfsec
    args: ["--format", "json"]

- repo: https://github.com/bridgecrewio/checkov
  rev: 3.1.0
  hooks:
  - id: checkov
    args: ["--framework", "terraform"]

- repo: https://github.com/terraform-docs/terraform-docs
  rev: v0.16.0
  hooks:
  - id: terraform-docs
    args: [--sort-by-required, --format=markdown]

Install and run:

pip install pre-commit
pre-commit install
git add main.tf
git commit -m "Add S3 bucket"  # Hooks run automatically
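tflint only applies AWS-specific rules when the AWS ruleset plugin is enabled. A minimal .tflint.hcl, with the plugin version being illustrative (pin to a release you have vetted):

```hcl
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.30.0" # illustrative version; pin deliberately
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}
```

Run tflint --init once to download the plugin; the pre-commit hook then picks up this file automatically from the repository root.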

TFVar File Management

Use .tfvars files to inject environment-specific values without committing secrets.

# prod.tfvars (non-sensitive, committed to Git)
environment         = "prod"
instance_count      = 3
instance_type       = "t3.medium"
enable_monitoring   = true
log_retention_days  = 30
backup_enabled      = true
backup_retention    = 30

# secrets.auto.tfvars (git-ignored, secrets only)
database_password   = "..."
api_key             = "..."
slack_webhook_url   = "..."

Apply with:

terraform apply -var-file=prod.tfvars

Files matching *.auto.tfvars in the working directory are loaded automatically, so the secrets file never needs to be named on the command line.
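For the git-ignored secrets file to stay out of Git, the repository needs matching .gitignore entries. A typical Terraform ignore list:

```text
# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
crash.log
*.auto.tfvars
# Do NOT ignore .terraform.lock.hcl — commit it so the whole team
# resolves identical provider versions.
```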

Checklist

  • Remote state backend (S3 + DynamoDB) configured with encryption and versioning
  • State locking enabled via DynamoDB (on-demand billing, no capacity tuning needed)
  • Separate directories per environment (dev/staging/prod)
  • Terragrunt configured for DRY multi-environment setup
  • All modules versioned with semantic tags
  • Drift detection running weekly (via CI or cron)
  • tflint, tfsec, and Checkov integrated in pre-commit hooks
  • All secrets in .auto.tfvars (git-ignored)
  • Non-sensitive .tfvars files committed to Git
  • Documentation and runbooks for import, state recovery
  • Terraform state backup strategy tested
  • State access restricted via IAM policies
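The last item can be enforced with a least-privilege IAM policy scoped to one environment's state key, following the permission set the S3 backend requires (bucket, key, and table ARNs below are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::org-terraform-state-123456789012"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::org-terraform-state-123456789012/prod/terraform.tfstate"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/terraform-locks"
    }
  ]
}
```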

Conclusion

Terraform at scale requires discipline. Remote state with locking eliminates state corruption. Directory-per-environment architecture prevents cross-contamination. Terragrunt eliminates duplication. Versioned modules provide stability. Pre-commit hooks catch errors before they reach production. Regular drift detection keeps reality aligned with code. Invest the setup hours upfront, and your infrastructure becomes reproducible, auditable, and maintainable at any scale.