Terraform at Scale — State Management, Module Versioning, and Team Workflows
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Terraform is the lingua franca of infrastructure as code. But at scale—multiple teams, multiple environments, multiple AWS accounts—Terraform projects become fragile. Shared state corruption, module drift, untracked changes, and the absence of a clear promotion path lead to configuration entropy. This post covers production patterns: remote state with locking, workspace versus directory strategies, Terragrunt for DRY multi-environment setups, module versioning, import for existing resources, and pre-commit hooks to catch issues early.
- Remote State with S3 and DynamoDB Locking
- Workspace vs Directory Strategy
- Terragrunt for DRY Configuration
- Module Versioning with Semantic Tags
- Drift Detection and Remediation
- Import for Existing Resources
- Pre-commit Hooks (tflint, tfsec, Checkov)
- TFVar File Management
- Checklist
- Conclusion
Remote State with S3 and DynamoDB Locking
Never store Terraform state locally in production. Local state is invisible to team members, vulnerable to data loss, and has no concurrency protection. Use S3 with DynamoDB for distributed locking.
Create the backend infrastructure (terraform/backend/main.tf):
# Region for the state bucket and lock table (the default is an example value)
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

provider "aws" {
  region = var.aws_region
}

# The account ID keeps the bucket name globally unique
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "org-terraform-state-${data.aws_caller_identity.current.account_id}"
}

# Versioning allows rollback to an earlier state file
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# State files contain secrets; block every form of public access
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table; the S3 backend requires a string hash key named LockID
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  tags = {
    Name        = "terraform-locks"
    Environment = "shared"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.terraform_state.id
}

output "dynamodb_table" {
  value = aws_dynamodb_table.terraform_locks.name
}
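Configure the backend in your Terraform project (backend.tf). Backend blocks cannot reference variables or data sources, so the bucket name is written out literally (the 12-digit account ID here is a placeholder):
terraform {
  backend "s3" {
    bucket         = "org-terraform-state-123456789012"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
Initialize:
terraform init
Terraform acquires the lock for any operation that could write state (plan, apply, destroy), so a second concurrent run fails fast instead of overwriting the first. If a crashed run leaves a stale lock behind, release it with terraform force-unlock LOCK_ID after confirming that no other run is active. Recent Terraform releases (1.10 and later) can also lock through S3 itself with use_lockfile = true, which removes the DynamoDB dependency.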
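When several stacks share one bucket, the key is usually the only backend setting that changes. A minimal sketch of generating it per stack and passing it through terraform init's -backend-config flag follows. It assumes the backend "s3" block omits bucket, key, and dynamodb_table (partial configuration); the helper names state_key and backend_init_args and the bucket name are illustrative, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: build per-stack backend settings for `terraform init`.
# state_key and backend_init_args are hypothetical helper names;
# the bucket name is a placeholder.

# Compose the S3 object key for a stack, e.g. prod/network/terraform.tfstate.
state_key() {
  local environment="$1" component="$2"
  printf '%s/%s/terraform.tfstate\n' "$environment" "$component"
}

# Print one -backend-config flag per line for a partial backend block.
backend_init_args() {
  local environment="$1" component="$2"
  printf '%s\n' "-backend-config=key=$(state_key "$environment" "$component")"
  printf '%s\n' "-backend-config=bucket=org-terraform-state-123456789012"
  printf '%s\n' "-backend-config=dynamodb_table=terraform-locks"
}

# Usage (needs terraform on PATH; run from the stack directory):
#   terraform init -input=false $(backend_init_args prod network)
```

Keeping the key derivation in one function means two stacks cannot end up writing to the same state object because of a copy-paste slip in backend.tf.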
Workspace vs Directory Strategy
Two approaches exist: workspaces and directories. Directories are superior for multi-environment management.
Workspace approach (not recommended for prod):
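terraform workspace new staging      # creates the workspace and switches to it
terraform workspace select staging   # only needed when switching back later
terraform apply
Workspaces share code and backend configuration but keep separate state. The problem is that the active workspace is invisible in the code: whichever workspace was selected last receives the apply, so it is easy to push dev changes into prod by accident.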
Directory approach (recommended):
terraform/
├── shared/
│   ├── main.tf            # VPC, networking, shared resources
│   └── outputs.tf
├── prod/
│   ├── main.tf            # Prod-specific config
│   ├── terraform.tfvars
│   └── backend.tf         # prod/terraform.tfstate
├── staging/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── backend.tf         # staging/terraform.tfstate
└── modules/
    ├── network/
    ├── compute/
    └── database/
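Each environment is a separate directory with its own state and backend key. Changing production then requires deliberately working in the prod directory, which makes accidental production changes far less likely.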
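The directory layout also makes it simple to add a guard in front of apply. The following is a rough sketch of a wrapper that derives the environment from the directory name and asks for confirmation before touching prod. The function names (env_for_dir, needs_confirmation, guarded_apply) are hypothetical, and the rule "only prod needs confirmation" is an assumption to adjust.

```shell
#!/usr/bin/env bash
# Sketch: derive the environment from the directory and gate prod applies.
# env_for_dir, needs_confirmation, and guarded_apply are hypothetical helpers.

# The environment is the last path component, e.g. terraform/prod -> prod.
env_for_dir() {
  local dir="${1%/}"   # drop one trailing slash if present
  printf '%s\n' "${dir##*/}"
}

# Succeeds (exit 0) when the environment requires explicit confirmation.
needs_confirmation() {
  [ "$1" = "prod" ]
}

# Run terraform apply in a directory, asking first when it is prod.
guarded_apply() {
  local dir="$1" environment answer
  environment="$(env_for_dir "$dir")"
  if needs_confirmation "$environment"; then
    printf 'Apply to %s? Type the environment name to continue: ' "$environment" >&2
    read -r answer
    [ "$answer" = "$environment" ] || { echo "Aborted." >&2; return 1; }
  fi
  terraform -chdir="$dir" apply
}

# Usage: guarded_apply terraform/prod
```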
Terragrunt for DRY Configuration
Terragrunt eliminates duplication across environments. Manage Terraform files once; Terragrunt instantiates them per environment.
terragrunt.hcl (root):
locals {
  aws_region  = "us-east-1"
  # Export ENVIRONMENT before running; an unset value falls back to dev
  environment = get_env("ENVIRONMENT", "dev")
  project     = "myapp"
}

terraform {
  # Wait for a held state lock instead of failing immediately
  extra_arguments "retry_lock" {
    commands  = get_terraform_commands_that_need_locking()
    arguments = ["-lock-timeout=20m"]
    env_vars = {
      AWS_REGION = local.aws_region
    }
  }
}

remote_state {
  backend = "s3"
  config = {
    bucket         = "${local.project}-terraform-${local.environment}"
    # One state file per unit, keyed by its path below this root file
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
}

# Write the same provider block into every unit
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "${local.aws_region}"
      default_tags {
        tags = {
          Environment = "${local.environment}"
          Project     = "${local.project}"
          ManagedBy   = "Terraform"
        }
      }
    }
  EOF
}
Directory structure:
terragrunt/
├── terragrunt.hcl
├── environments/
│   ├── dev/
│   │   ├── env.hcl
│   │   └── vpc/
│   │       ├── terragrunt.hcl
│   │       └── main.tf
│   ├── staging/
│   │   ├── env.hcl
│   │   └── vpc/
│   │       ├── terragrunt.hcl
│   │       └── main.tf
│   └── prod/
│       ├── env.hcl
│       └── vpc/
│           ├── terragrunt.hcl
│           └── main.tf
└── modules/
    └── network/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf
Terragrunt allows only one level of include: a file that is itself included cannot include the root file. Per-environment values therefore live in a plain env.hcl that each unit reads, rather than in an intermediate terragrunt.hcl.
environments/prod/env.hcl:
locals {
  environment = "prod"
  cidr_block  = "10.0.0.0/16"
  enable_nat  = true
  nat_count   = 3
}
environments/prod/vpc/terragrunt.hcl:
# Pulls in remote_state and the generated provider from the root file
include "root" {
  path = find_in_parent_folders()
}

locals {
  env = read_terragrunt_config(find_in_parent_folders("env.hcl"))
}

terraform {
  source = "../../../modules/network"
}

# Assumes a sibling iam unit (not shown in the tree) that exports terraform_role_arn
dependency "iam" {
  config_path = "../iam"
}

inputs = {
  environment  = local.env.locals.environment
  cidr_block   = local.env.locals.cidr_block
  enable_nat   = local.env.locals.enable_nat
  nat_count    = local.env.locals.nat_count
  iam_role_arn = dependency.iam.outputs.terraform_role_arn
}
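Deploy all prod infrastructure. The root file reads ENVIRONMENT to choose the state bucket and tags, so set it explicitly; otherwise it falls back to dev:
cd terragrunt/environments/prod
ENVIRONMENT=prod terragrunt run-all apply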
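Because the root remote_state block builds the key from path_relative_to_include(), each unit's state location follows from its directory. The sketch below reproduces that mapping in shell, which can help when locating a state object in S3 by hand. The helper name state_key_for is hypothetical, and the function only mirrors the path logic; it does not call Terragrunt.

```shell
#!/usr/bin/env bash
# Sketch: compute the state key the root remote_state block would derive
# for a unit directory. state_key_for is a hypothetical helper.

state_key_for() {
  local root="${1%/}" unit="${2%/}"
  case "$unit" in
    "$root"/*)
      # Path of the unit relative to the directory holding the root file
      printf '%s/terraform.tfstate\n' "${unit#"$root"/}"
      ;;
    *)
      echo "unit is not below root: $unit" >&2
      return 1
      ;;
  esac
}

# Usage:
#   state_key_for /repo/terragrunt /repo/terragrunt/environments/prod/vpc
```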
Module Versioning with Semantic Tags
Version modules to prevent unexpected breaking changes. Store modules in a Git repository or the Terraform Registry. Git sources pin an exact ref; the version argument with constraints such as ~> 2.1 is available only for registry modules.
Module sources pinned to a tag:
module "network" {
  source      = "git::https://github.com/org/terraform-modules.git//network?ref=v2.1.0"
  cidr_block  = "10.0.0.0/16"
  environment = "prod"
}

module "database" {
  source            = "git::https://github.com/org/terraform-modules.git//database?ref=v1.5.2"
  instance_class    = "db.r6i.large" # must be on the module's approved list
  allocated_storage = 100
}
Inside each module, validate inputs so that a consumer moving to a new version gets a clear error instead of a surprising plan:
variable "database_version" {
  description = "PostgreSQL version"
  type        = string
  default     = "15.2"

  validation {
    # Accepts major versions 10 to 19 with any minor, e.g. 15.2 or 15.10
    condition     = can(regex("^1[0-9]\\.[0-9]+$", var.database_version))
    error_message = "Database version must be in the format MAJOR.MINOR, for example 15.2."
  }
}

variable "instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.micro"

  validation {
    condition     = contains(["db.t3.micro", "db.t3.small", "db.r6i.large"], var.instance_class)
    error_message = "Instance class not allowed; use approved list."
  }
}
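Tag releases semantically. Bump the major version for breaking changes to inputs or outputs, the minor version for backward-compatible additions, and the patch version for fixes:
cd terraform-modules
git tag -a v2.1.0 -m "Release v2.1.0"
git push origin v2.1.0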
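Picking the next tag by hand is error-prone, so it can help to script it. The following is a small sketch that validates a vMAJOR.MINOR.PATCH tag and computes the next one. The helper names is_semver_tag and next_tag are hypothetical, and pre-release suffixes such as -rc1 are deliberately not handled.

```shell
#!/usr/bin/env bash
# Sketch: validate a vMAJOR.MINOR.PATCH tag and compute the next one.
# is_semver_tag and next_tag are hypothetical helper names.

# Succeeds only for tags like v2.1.0 (no leading zeros, no suffixes).
is_semver_tag() {
  printf '%s\n' "$1" | grep -Eq '^v(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'
}

# next_tag v2.1.0 minor prints the tag that follows a minor bump.
next_tag() {
  local tag="$1" part="$2" rest major minor patch
  is_semver_tag "$tag" || { echo "not a semantic tag: $tag" >&2; return 1; }
  rest="${tag#v}"
  major="${rest%%.*}"; rest="${rest#*.}"
  minor="${rest%%.*}"; patch="${rest#*.}"
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "part must be major, minor, or patch" >&2; return 1 ;;
  esac
  printf 'v%s.%s.%s\n' "$major" "$minor" "$patch"
}

# Usage (inside the modules repository):
#   new="$(next_tag "$(git describe --tags --abbrev=0)" minor)"
#   git tag -a "$new" -m "Release $new" && git push origin "$new"
```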
Drift Detection and Remediation
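Drift occurs when infrastructure changes outside Terraform (manual console edits, other tools). Detect and fix it regularly.
Detect drift:
terraform plan -refresh-only -out=drift.tfplan
Do not run terraform refresh first: it writes the drifted values straight into state, so the plan that follows reports nothing. The refresh command is also deprecated in favor of terraform apply -refresh-only.
If the plan shows changes, drift has occurred. Review them as JSON from the saved plan:
terraform show -json drift.tfplan | jq '.resource_drift[]? | {address, actions: .change.actions}'
Remediate drift:
Option 1: Apply Terraform (overwrite manual changes):
terraform apply
Option 2: Accept the manual change by updating state to match reality, then update the configuration as well so the next apply does not revert it:
terraform apply -refresh-only
Option 3: Re-import a resource that was replaced outside Terraform. Look up the ID deliberately; piping the first result of describe-instances into import can attach the wrong instance:
terraform state rm aws_instance.web
terraform import aws_instance.web i-1234567890abcdef0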
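Automate drift checks:
#!/bin/bash
# ci/detect-drift.sh
set -euo pipefail
terraform init -input=false
# Save a refresh-only plan. Running "terraform refresh" first would write
# the drift into state and hide it from this plan.
terraform plan -refresh-only -input=false -out=drift.tfplan
# resource_drift lists objects that changed outside Terraform
drift_count=$(terraform show -json drift.tfplan | jq '(.resource_drift // []) | length')
if [[ "$drift_count" -gt 0 ]]; then
  echo "Drift detected in $drift_count resource(s)!"
  terraform show drift.tfplan
  exit 1
fi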
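Where jq is not available, a CI job can rely on terraform plan's -detailed-exitcode flag instead, which exits 0 for no changes, 1 for an error, and 2 when changes are present. The sketch below maps those codes to a decision. It uses a normal plan, so it reports any difference between configuration and real infrastructure rather than only out-of-band changes. The helper names classify_plan_exit and check_for_changes are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: map `terraform plan -detailed-exitcode` results to a CI decision.
# classify_plan_exit and check_for_changes are hypothetical helper names.

# Documented exit codes: 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# Plan one directory and print clean, changes, or error.
check_for_changes() {
  local dir="$1" code=0
  terraform -chdir="$dir" plan -detailed-exitcode -input=false >/dev/null || code=$?
  classify_plan_exit "$code"
}

# Usage:
#   [ "$(check_for_changes terraform/prod)" = "clean" ] || exit 1
```

Treating every code other than 0 and 2 as an error keeps a failed plan (expired credentials, a held lock) from being mistaken for a clean result.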
Import for Existing Resources
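Use terraform import to bring existing AWS resources under Terraform management. Terraform 1.5 and later also accept declarative import blocks, which can be reviewed in a plan before anything is written to state.
# 1. List existing EC2 instances (shell)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].[InstanceId, Tags[?Key==`Name`].Value[0]]' \
  --output table
# 2. Add a resource block whose arguments match the real instance (main.tf)
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}
# 3. Import (shell)
terraform import aws_instance.web i-1234567890abcdef0
# 4. Verify: the state shows the instance, and a plan should report no changes
terraform state show aws_instance.web
terraform plan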
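When many resources need importing, a reviewed mapping file is safer than typing IDs one at a time. The following sketch turns lines of the form "address id" into terraform import commands and only prints them, so the list can be checked before it is run. The helper name import_commands and the mapping format are assumptions, and values are assumed not to contain single quotes.

```shell
#!/usr/bin/env bash
# Sketch: turn an "address id" mapping into terraform import commands.
# import_commands is a hypothetical helper; the mapping format is an assumption.

# Reads "address id" pairs from stdin and prints one import command per pair.
import_commands() {
  local address id
  while read -r address id || [ -n "$address" ]; do
    # Skip blank lines and comment lines
    [ -z "$address" ] && continue
    case "$address" in \#*) continue ;; esac
    printf "terraform import '%s' '%s'\n" "$address" "$id"
  done
}

# Usage (imports.txt is a hypothetical mapping file):
#   import_commands < imports.txt           # review the commands
#   import_commands < imports.txt | sh -e   # run them, stopping on first failure
```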
Pre-commit Hooks (tflint, tfsec, Checkov)
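Catch issues before they are committed. Install the pre-commit framework and add Terraform hooks. The hook IDs and rev values below must match what each repository publishes in its .pre-commit-hooks.yaml, so check them against the upstream READMEs before pinning.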
.pre-commit-config.yaml:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=1024']
  - repo: https://github.com/terraform-linters/tflint
    rev: v0.50.0
    hooks:
      - id: tflint
        args: ["--format", "json"]
  - repo: https://github.com/aquasecurity/tfsec
    rev: v1.28.1
    hooks:
      - id: tfsec
        args: ["--format", "json"]
  - repo: https://github.com/bridgecrewio/checkov
    rev: 3.1.0
    hooks:
      - id: checkov
        args: ["--framework", "terraform"]
  - repo: https://github.com/terraform-docs/terraform-docs
    rev: v0.16.0
    hooks:
      - id: terraform-docs
        args: [--sort-by-required, --format=markdown]
Install and run:
pip install pre-commit
pre-commit install
pre-commit run --all-files        # one-time check of the existing code
git add main.tf
git commit -m "Add S3 bucket"     # hooks run automatically on staged files
TFVar File Management
Use .tfvars files to inject environment-specific values without committing secrets.
# prod.tfvars (commit to Git)
environment = "prod"
instance_count = 3
instance_type = "t3.medium"
enable_monitoring = true
log_retention_days = 30
backup_enabled = true
backup_retention = 30
# prod.auto.tfvars (git-ignored, secrets only)
database_password = "..."
api_key = "..."
slack_webhook_url = "..."
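Apply with the command below. Terraform loads every *.auto.tfvars file in the working directory automatically, so only the environment file needs a flag; keep one environment per directory so the wrong secrets file is never picked up. Values supplied this way are still stored in plain text in state, which is another reason to restrict access to the state bucket.
terraform apply -var-file=prod.tfvars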
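An alternative that keeps secrets off disk entirely is to pass them through TF_VAR_ environment variables, which Terraform reads for any declared variable of the same name. A minimal sketch follows; the helper name export_tf_var is hypothetical, and the secret ID in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: pass secrets to Terraform through TF_VAR_* environment variables.
# export_tf_var is a hypothetical helper name.

# export_tf_var NAME VALUE exports TF_VAR_NAME for later terraform commands.
export_tf_var() {
  local name="$1" value="$2"
  # Environment variable names allow only letters, digits, and underscores.
  case "$name" in
    ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
      echo "invalid variable name: $name" >&2
      return 1
      ;;
  esac
  export "TF_VAR_${name}=${value}"
}

# Usage (prod/db-password is a placeholder secret ID):
#   export_tf_var database_password \
#     "$(aws secretsmanager get-secret-value --secret-id prod/db-password \
#          --query SecretString --output text)"
#   terraform apply -var-file=prod.tfvars
```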
Checklist
- Remote state backend (S3 + DynamoDB) configured with encryption and versioning
- State locking enabled; DynamoDB lock table on on-demand billing with point-in-time recovery
- Separate directories per environment (dev/staging/prod)
- Terragrunt configured for DRY multi-environment setup
- All modules versioned with semantic tags
- Drift detection running weekly (via CI or cron)
- tflint, tfsec, and Checkov integrated in pre-commit hooks
- All secrets in .auto.tfvars (git-ignored)
- Non-sensitive .tfvars files committed to Git
- Documentation and runbooks for import, state recovery
- Terraform state backup strategy tested
- State access restricted via IAM policies
Conclusion
Terraform at scale requires discipline. Remote state with locking keeps concurrent runs from corrupting state. A directory-per-environment layout limits the blast radius of a mistake. Terragrunt removes duplication. Versioned modules provide stability. Pre-commit hooks catch errors before they reach review. Regular drift detection keeps reality aligned with code. Invest this effort upfront, and your infrastructure stays reproducible, auditable, and maintainable as teams, environments, and accounts multiply.