Container Security — From Dockerfile to Runtime Protection

Introduction

Containers concentrate risk. A single compromised image deployed to thousands of pods is a catastrophic failure. A root-running process with access to the entire filesystem is an attacker's dream.

Container security spans the full lifecycle: build-time (image composition), distribution (registry security), and runtime (pod policies, network rules). This guide covers production-grade practices that fit into your pipeline.

Non-Root User in Dockerfile

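Never run containers as root. Unless user namespaces are enabled, UID 0 inside a container is UID 0 on the host, so a container escape or a misconfigured mount hands the attacker root on the node. Running as root also violates the principle of least privilege.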

# ❌ BAD: Runs as root
FROM node:18
WORKDIR /app
COPY . .
RUN npm install --production
EXPOSE 3000
CMD ["node", "server.js"]

# ✓ GOOD: Runs as unprivileged user
FROM node:18-alpine

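# Create dedicated user (UID 1001; the image's built-in `node` user already holds UID 1000)
RUN addgroup -g 1001 app && \
    adduser -D -u 1001 -G app app

WORKDIR /app

# Copy with correct ownership
COPY --chown=app:app package.json package-lock.json ./
# npm ci installs exactly what the lockfile pins; --omit=dev skips devDependencies
RUN npm ci --omit=dev

COPY --chown=app:app . .

# Switch to the unprivileged user. A numeric UID lets Kubernetes verify runAsNonRoot.
USER 1001:1001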

EXPOSE 3000
CMD ["node", "server.js"]

Kubernetes enforcement:

# k8s-deployment.yaml (bare Pod shown for brevity)
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1001
    runAsGroup: 1001
    fsGroup: 1001

  containers:
    - name: app
      image: myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE # Only if binding a port below 1024; port 3000 does not need it
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/.cache

  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
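
Production workloads are rarely deployed as bare Pods. As a hedged sketch, the same settings would sit in a Deployment's pod template. The names `myapp` and `myapp:latest`, the `app: myapp` label, and the replica count are illustrative assumptions, not values from a real cluster:

```yaml
# Sketch only: security settings placed under spec.template.spec of a Deployment.
# All names, labels, and the replica count are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Pod-level settings apply to every container in the pod
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        runAsGroup: 1001
      containers:
        - name: app
          image: myapp:latest
          ports:
            - containerPort: 3000
          # Container-level settings have no pod-level equivalent
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
```

The pod-level `securityContext` sets the defaults for every container, while `allowPrivilegeEscalation` and `capabilities` can only be set per container.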

Distroless Base Images

Minimize attack surface by using distroless images: only application + runtime, no package manager or shell.

# Traditional image: ~900MB, includes shell, package manager, build tools
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y nodejs npm
COPY . /app
CMD ["node", "/app/server.js"]

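# Distroless image: ~150MB, only application + Node.js runtime + glibc
FROM node:18 AS builder
WORKDIR /app
COPY . .
RUN npm install --production

FROM gcr.io/distroless/nodejs18-debian11
COPY --from=builder /app /app
WORKDIR /app
# Run as a non-root numeric UID, as in the non-root section above
USER 1001:1001
# The distroless nodejs image already sets node as its ENTRYPOINT, so pass only the script
CMD ["server.js"]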

# Approximate base image sizes (they vary by tag and architecture):
# - node:18-alpine: 171MB (not distroless; shown for comparison)
# - gcr.io/distroless/nodejs18-debian11: 97MB (includes Node but no shell)
# - scratch: 0MB (only application, for static binaries)

Distroless tradeoffs:

  • Pro: No shell, no package manager, minimal attack surface
  • Pro: Faster image pulls and rollouts (smaller image)
  • Con: Harder to debug inside the container (no /bin/bash); use the image's :debug tag or an ephemeral debug container instead
  • Con: Runtime dependency mismatches are harder to diagnose

Recommended: use distroless for production, alpine or ubuntu for staging.

Multi-Stage Builds to Minimize Attack Surface

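Build artifacts that could contain vulnerabilities or secrets do not need to exist in the final image. A multi-stage build compiles in one stage and copies only the runtime files into the next.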

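# Multi-stage build: Node.js build tools stay in the builder stage
FROM node:18-alpine AS builder

WORKDIR /app
COPY package.json package-lock.json ./
# Install all dependencies, including the devDependencies needed to build and test
RUN npm ci

COPY . .

# Compile TypeScript, bundle with esbuild (--platform=node keeps Node built-ins external)
RUN npx tsc && npx esbuild dist/server.js --bundle --platform=node --outfile=dist/app.js

# Optional: run tests
RUN npm run test

# Drop devDependencies (TypeScript, test frameworks) before node_modules is copied
RUN npm prune --omit=dev

# Final stage: minimal distroless image
FROM gcr.io/distroless/nodejs18-debian11

COPY --from=builder --chown=1001:1001 /app/dist/app.js /app/
COPY --from=builder --chown=1001:1001 /app/node_modules /app/node_modules

WORKDIR /app
USER 1001:1001
# The distroless nodejs image already sets node as its ENTRYPOINT
CMD ["app.js"]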

Multi-stage benefits:

  • TypeScript compiler: not in production image
  • Build dependencies: npm, node-gyp not included
  • Test frameworks: jest, vitest removed
  • Source code: .ts files excluded, only compiled .js remains

Secrets NOT in Image Layers

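Secrets baked into an image are effectively permanent. Every layer is immutable, so deleting the file in a later layer does not remove it, and anyone who can pull the image can recover it with docker history or by unpacking the layers.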

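# ❌ BAD: Secret in image layer (layer history is permanent!)
FROM node:18
ARG NPM_TOKEN
# The token is written to .npmrc in this layer, and the build arg shows up in `docker history`
RUN npm config set //registry.npmjs.org/:_authToken=$NPM_TOKEN
COPY . .
RUN npm install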

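# ✓ GOOD: Use Docker BuildKit secrets
# Build command:
# DOCKER_BUILDKIT=1 docker build \
#   --secret id=npm_token,src=$HOME/.npmjs/token \
#   -t myapp .

FROM node:18
WORKDIR /app
COPY package.json package-lock.json ./

# The secret is mounted at /run/secrets/npm_token for this RUN only.
# Remove the token from npm's config in the same RUN so it never lands in the layer.
RUN --mount=type=secret,id=npm_token \
    npm config set //registry.npmjs.org/:_authToken="$(cat /run/secrets/npm_token)" && \
    npm ci && \
    npm config delete //registry.npmjs.org/:_authToken

COPY . .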

Runtime secrets (environment variables, mounted secrets):

# Secrets come from Kubernetes Secrets or external vault
FROM gcr.io/distroless/nodejs18-debian11

COPY . /app
WORKDIR /app

# At runtime, Kubernetes injects:
# env:
#   - name: DATABASE_URL
#     valueFrom:
#       secretKeyRef:
#         name: db-secret
#         key: url
#   - name: API_KEY
#     valueFrom:
#       secretKeyRef:
#         name: api-secret
#         key: key

# The distroless nodejs image already sets node as its ENTRYPOINT
CMD ["server.js"]
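
The Dockerfile above covers the environment-variable route. For the mounted-secret route, a minimal sketch of a Pod manifest might look like the following. It assumes the same `db-secret` Secret; the pod name, image tag, volume name, and the mount path `/etc/secrets/db` are hypothetical:

```yaml
# Sketch only: expose a Secret as read-only files instead of environment variables.
# The pod name, image tag, volume name, and mount path are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets/db
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        # Each key in the Secret becomes one file, e.g. /etc/secrets/db/url
        secretName: db-secret
```

With this layout the application reads `/etc/secrets/db/url` at startup. Mounted files are eventually refreshed when the Secret changes, whereas environment variables stay fixed for the life of the container.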

Read-Only Filesystem

Restrict write access to only directories that must be writable.

# k8s-deployment.yaml (bare Pod shown for brevity)
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false

      volumeMounts:
        # Writable temp directory (an emptyDir mounted over /tmp, not the image's own /tmp)
        - name: tmp
          mountPath: /tmp
          readOnly: false

        # Writable cache directory
        - name: cache
          mountPath: /app/.cache
          readOnly: false

        # Everything else is read-only

  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}

Seccomp Profiles

Restrict system calls available to containers, reducing kernel attack surface.

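# k8s-deployment.yaml with seccomp
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault # Use container runtime's default

  containers:
    - name: app
      image: myapp:latest
      securityContext:
        # Container-level setting overrides the pod-level default
        seccompProfile:
          type: Localhost
          # Path is relative to the kubelet's seccomp directory
          # (default /var/lib/kubelet/seccomp) on each node
          localhostProfile: myapp-seccomp.json

---
# Custom seccomp profile, kept in a ConfigMap for versioning only.
# A ConfigMap does not install the file on nodes: something such as a DaemonSet
# or the Security Profiles Operator must copy it into the kubelet's seccomp directory.
# The syscall list is illustrative and incomplete for a real Node.js process
# (it lacks execve, for example); trace the application before enforcing it.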
apiVersion: v1
kind: ConfigMap
metadata:
  name: seccomp-profiles
data:
  myapp-seccomp.json: |
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "defaultErrnoRet": 1,
      "archMap": [
        {
          "architecture": "SCMP_ARCH_X86_64",
          "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
        }
      ],
      "syscalls": [
        {
          "names": [
            "accept4",
            "arch_prctl",
            "bind",
            "brk",
            "clone",
            "close",
            "connect",
            "dup",
            "dup2",
            "epoll_create1",
            "epoll_ctl",
            "epoll_wait",
            "exit",
            "exit_group",
            "fcntl",
            "fstat",
            "fstatfs",
            "futex",
            "getcwd",
            "getegid",
            "getgid",
            "getpeername",
            "getpid",
            "getrandom",
            "getrlimit",
            "getrusage",
            "getsockname",
            "getsockopt",
            "gettimeofday",
            "listen",
            "lseek",
            "madvise",
            "mmap",
            "mprotect",
            "munmap",
            "open",
            "openat",
            "poll",
            "pread64",
            "prlimit64",
            "pselect6",
            "read",
            "readlink",
            "readlinkat",
            "recvfrom",
            "recvmsg",
            "rt_sigaction",
            "rt_sigprocmask",
            "rt_sigreturn",
            "sched_getaffinity",
            "sched_yield",
            "select",
            "sendmsg",
            "sendto",
            "set_robust_list",
            "set_tid_address",
            "setgid",
            "setgroups",
            "setsockopt",
            "setuid",
            "sigaction",
            "sigaltstack",
            "sigprocmask",
            "sigreturn",
            "socket",
            "socketpair",
            "stat",
            "statfs",
            "statx",
            "write",
            "writev"
          ],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }

Network Policies in Kubernetes

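Restrict pod-to-pod communication by default. NetworkPolicy objects are enforced by the cluster's network plugin, so they take effect only when the CNI in use (for example Calico or Cilium) supports them.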

# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress

---
# Allow ingress from ingress controller
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace (Kubernetes 1.22+)
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000

---
# Allow specific egress (outbound)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-to-database
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Egress
  egress:
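    # Allow DNS (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

    # Allow to database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432

    # Allow to external API over HTTPS.
    # ipBlock matches addresses outside the cluster; a namespaceSelector only matches pods.
    # Narrow the CIDR to the API's published ranges where they are known.
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443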
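
The policies above deny ingress for the whole namespace but restrict egress only for pods labelled `app: myapp`. To make "deny by default" hold in both directions, one option is a namespace-wide policy like this sketch (the policy name is an assumption):

```yaml
# Sketch only: deny all ingress and egress for every pod in the namespace.
# The policy name is an illustrative assumption.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  # An empty podSelector selects every pod in the namespace
  podSelector: {}
  # Listing a type with no matching rules denies all traffic of that type
  policyTypes:
    - Ingress
    - Egress
```

NetworkPolicies are additive, so the narrower allow rules keep working alongside this one. Every workload in the namespace then needs its own egress allowance, including DNS.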

Image Scanning With Trivy in CI

Scan images for known vulnerabilities before pushing to registry.

# .github/workflows/container-security.yaml
name: Container Security

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write # Required to upload SARIF results

    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Run Trivy vulnerability scanner
        # Pin to a release tag or commit SHA in production instead of a moving branch
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

      - name: Fail on critical vulnerabilities
        # --exit-code 1 makes Trivy return non-zero on findings, which fails the step
        run: |
          docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
            aquasec/trivy:latest image --severity CRITICAL --exit-code 1 \
            myapp:${{ github.sha }}

SBOM Generation

Generate a Software Bill of Materials (SBOM) for compliance and vulnerability tracking.

# .github/workflows/sbom.yaml
name: Generate SBOM

on:
  push:
    branches: [main]

jobs:
  sbom:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Generate SBOM with Syft
        uses: anchore/sbom-action@v0
        with:
          image: myapp:${{ github.sha }}
          format: spdx-json
          output-file: sbom-${{ github.sha }}.spdx.json

      - name: Upload SBOM
        uses: actions/upload-artifact@v4
        with:
          name: sbom
          path: sbom-${{ github.sha }}.spdx.json

      - name: Check for known vulnerabilities in SBOM
        # Grype is not preinstalled on GitHub-hosted runners; anchore/scan-action wraps it
        uses: anchore/scan-action@v3
        with:
          sbom: sbom-${{ github.sha }}.spdx.json
          fail-build: true
          severity-cutoff: high

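A Dockerfile that keeps the SBOM small (fewer packages in the final image means fewer components to track):

# Syft inspects the final image's contents; no label or annotation is needed
FROM node:18 AS builder

WORKDIR /app
COPY package.json package-lock.json ./
# devDependencies are needed for the build step
RUN npm ci

COPY . .
# Build, then drop devDependencies so they stay out of the final image and the SBOM
RUN npm run build && npm prune --omit=dev

FROM gcr.io/distroless/nodejs18-debian11

COPY --from=builder /app/dist /app
COPY --from=builder /app/node_modules /app/node_modules

WORKDIR /app
USER 1001:1001
# The distroless nodejs image already sets node as its ENTRYPOINT
CMD ["index.js"]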

Container Security Checklist

## Container Security Audit Checklist

### Dockerfile Build-Time
- [ ] Non-root user specified with USER directive
- [ ] Base image is minimal (distroless or alpine)
- [ ] Multi-stage build used to exclude build artifacts
- [ ] No secrets in image layers (using BuildKit secrets if needed)
- [ ] No unnecessary packages installed
- [ ] Image scanned with Trivy, no high/critical vulnerabilities
- [ ] SBOM generated and tracked

### Kubernetes Runtime
- [ ] securityContext.runAsNonRoot: true
- [ ] securityContext.readOnlyRootFilesystem: true
- [ ] securityContext.allowPrivilegeEscalation: false
- [ ] securityContext.capabilities.drop: ["ALL"]
- [ ] Necessary capabilities added explicitly (rare)
- [ ] seccompProfile set to RuntimeDefault or custom
- [ ] Resource limits set (memory, CPU)
- [ ] NetworkPolicy restricts ingress/egress by default
- [ ] Pod Security Standards enforced via Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25)

### Image Registry
- [ ] Registry requires authentication
- [ ] Images signed (cosign, Sigstore)
- [ ] Image push requires approval (only trusted CI/CD)
- [ ] Old images purged after retention period
- [ ] Registry scanned for vulnerabilities on pull

### Runtime Monitoring
- [ ] Kubernetes API server audit logging enabled (audit policy configured)
- [ ] Syscall-level runtime monitoring in place (e.g., Falco or auditd)
- [ ] Alerts for privilege escalation attempts
- [ ] Alerts for unsigned container images deployed
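
Two checklist items, Pod Security Standards and resource limits, have no example earlier in this guide. The sketch below shows one way they might look together. The namespace name `myapp-prod`, the pod and image names, and the CPU and memory figures are illustrative assumptions to be replaced with measured values:

```yaml
# Sketch only: enforce the "restricted" Pod Security Standard on a namespace
# through Pod Security Admission labels. The namespace name is an assumption.
apiVersion: v1
kind: Namespace
metadata:
  name: myapp-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
---
# A pod that satisfies the restricted profile and declares resource limits.
# The CPU and memory numbers are placeholders, not recommendations.
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: myapp-prod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

With the `enforce` label in place, the API server rejects pods in that namespace that run as root, allow privilege escalation, keep extra capabilities, or omit a seccomp profile.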

Conclusion

Container security is defense in depth. At build time, use distroless images, multi-stage builds, and scan for vulnerabilities. At runtime, enforce non-root users, read-only filesystems, seccomp profiles, and network policies.

No single layer guarantees security. Each layer blocks one class of attack; stacked together, they make a system far harder to compromise.