Running LLM Workloads on Kubernetes — GPU Scheduling, vLLM, and Autoscaling

Introduction

Kubernetes orchestrates stateless services beautifully. LLM inference is different: GPUs are scarce, models consume gigabytes of memory, and batching requests yields 10x throughput. Running LLM workloads on Kubernetes requires GPU node pools, careful resource requests, and autoscaling beyond CPU metrics. The payoff: cost-effective, highly available inference at scale.

GPU Node Pools in EKS/GKE/AKS

Create dedicated GPU node pools separate from CPU workloads:

EKS (AWS):

# Create GPU node group with on-demand instances
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-nodes-a100 \
  --subnets subnet-12345 \
  --instance-types p4d.24xlarge \
  --capacity-type ON_DEMAND \
  --scaling-config minSize=1,maxSize=5,desiredSize=2 \
  --tags workload=gpu-inference

# Create spot GPU node group (cheaper, interruptible)
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-spot-v100 \
  --subnets subnet-12345 \
  --instance-types p3.8xlarge \
  --capacity-type SPOT \
  --scaling-config desiredSize=3

GKE (Google Cloud):

gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --num-nodes 2 \
  --machine-type a2-highgpu-2g \
  --accelerator type=nvidia-tesla-a100,count=2 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10

AKS (Azure):

az aks nodepool add \
  --resource-group rg-name \
  --cluster-name aks-cluster \
  --name gpunodes \
  --node-count 2 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10

GPU instances are expensive; dedicated pools keep CPU workloads from occupying GPU capacity, and tainting the GPU nodes keeps ordinary pods off them entirely.
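A node pool label by itself does not stop the scheduler from placing CPU-only pods on GPU nodes; a taint on the pool, plus a matching toleration on GPU workloads, enforces the separation. A minimal sketch (the taint key/value here is a convention, not a requirement):

```yaml
# Taint the GPU nodes once (or use the cloud provider's taint flag
# when creating the pool):
#   kubectl taint nodes -l workload=gpu-inference nvidia.com/gpu=present:NoSchedule
# GPU pods then carry a matching toleration in their pod spec:
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```

Pods without the toleration are rejected by the taint, so the expensive nodes stay free for inference.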

NVIDIA Device Plugin for K8s

Install the NVIDIA Device Plugin to expose GPUs as schedulable resources:

# For NVIDIA GPUs
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPUs are visible
kubectl describe nodes | grep nvidia.com/gpu
# Shows: nvidia.com/gpu: 8 on GPU nodes

The plugin runs as a DaemonSet on every GPU node, discovers the GPUs, and registers them with the kubelet as the `nvidia.com/gpu` extended resource.
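A quick end-to-end check is a throwaway pod that requests one GPU and runs nvidia-smi. A sketch (the CUDA image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"  # schedules only on a node with a free GPU
```

If the plugin is working, `kubectl logs gpu-smoke-test` prints the familiar nvidia-smi table.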

vLLM Deployment on Kubernetes

vLLM serves large language models with batching and KV cache optimization. Deploy as a Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
      model: llama-70b
  template:
    metadata:
      labels:
        app: vllm-server
        model: llama-70b
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload
                operator: In
                values:
                - gpu-inference

      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model=/models/meta-llama/Llama-2-70b-hf"
          - "--tensor-parallel-size=8"
          - "--gpu-memory-utilization=0.9"
          - "--max-model-len=4096"

        ports:
        - name: http
          containerPort: 8000

        resources:
          requests:
            nvidia.com/gpu: "8"
            memory: "100Gi"
            cpu: "16"
          limits:
            nvidia.com/gpu: "8"
            memory: "120Gi"
            cpu: "16"

        env:
        - name: HF_HOME
          value: "/models"
        - name: HUGGINGFACE_HUB_CACHE
          value: "/models"

        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 600  # 70B weights take minutes to load
          periodSeconds: 10
          timeoutSeconds: 5

        volumeMounts:
        - name: model-cache
          mountPath: /models

      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-storage

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b
  namespace: llm-inference
spec:
  selector:
    app: vllm-server
    model: llama-70b
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
    name: http

vLLM's tensor parallelism distributes computation across 8 GPUs. GPU memory utilization at 90% maximizes throughput without OOM errors.
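One practical gotcha with tensor parallelism: the GPU workers communicate via NCCL over shared memory, and a container's default /dev/shm (64Mi) is far too small. A common fix is a memory-backed emptyDir; the 16Gi size here is an assumption to tune for your nodes:

```yaml
# Add to the vLLM pod spec
volumes:
- name: dshm
  emptyDir:
    medium: Memory
    sizeLimit: 16Gi
# ...and to the vLLM container
volumeMounts:
- name: dshm
  mountPath: /dev/shm
```

Without this, multi-GPU startup often fails with NCCL shared-memory errors.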

Horizontal Pod Autoscaling for LLM Inference

Scale inference pods based on request queue depth or latency:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 1
  maxReplicas: 5
  metrics:
  # Scale on GPU utilization, exposed as a custom pod metric via an
  # adapter (HPA's built-in Resource metrics cover only cpu and memory)
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"
  # Scale on request queue depth (custom metric)
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_depth
      target:
        type: AverageValue
        averageValue: "30"

HPA evaluates both metrics and scales to whichever demands more replicas; when average GPU utilization exceeds 70, new pods spawn. Note that GPU utilization is not a native HPA resource metric, so it must reach the custom metrics API through an adapter such as prometheus-adapter.
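HPA can only consume custom metrics that an adapter publishes on the custom metrics API. With prometheus-adapter, a rule along these lines surfaces per-pod GPU utilization; this is a sketch, and the label names depend on how your DCGM exporter is relabeled:

```yaml
# prometheus-adapter rule (values.yaml fragment)
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_pod!=""}'
  resources:
    overrides:
      exported_namespace: {resource: "namespace"}
      exported_pod: {resource: "pod"}
  name:
    matches: "DCGM_FI_DEV_GPU_UTIL"
    as: "gpu_utilization"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Once the adapter serves `gpu_utilization`, the HPA's Pods-type metric above can target it.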

KEDA for Request-Queue-Based Scaling

KEDA (Kubernetes Event Driven Autoscaling) scales based on external metrics:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-llama-70b
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_num_requests_waiting
      threshold: "50"
      query: |
        sum(vllm:num_requests_waiting)

KEDA queries Prometheus for queue depth and scales out once the backlog exceeds 50 waiting requests per replica. KEDA reacts within seconds, but a new replica still needs minutes to load model weights before it can serve traffic, so size thresholds with that lag in mind.
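Under the hood, KEDA feeds the metric into an HPA, which for an AverageValue target works out to ceil(totalMetric / threshold), clamped to the replica bounds. A small illustration of that arithmetic (the function name is ours, not KEDA's):

```typescript
// HPA-style sizing: ceil(metric / threshold), clamped to replica bounds.
function desiredReplicas(
  queueDepth: number,
  threshold: number,
  minR: number,
  maxR: number,
): number {
  const raw = Math.ceil(queueDepth / threshold);
  return Math.max(minR, Math.min(maxR, raw));
}

// 180 queued requests at 50 per replica -> 4 replicas
console.log(desiredReplicas(180, 50, 1, 10)); // 4
// A burst of 900 requests hits the maxReplicaCount cap
console.log(desiredReplicas(900, 50, 1, 10)); // 10
```

An empty queue collapses back to minReplicaCount, so idle periods cost only one replica.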

Resource Requests/Limits for GPU Pods

Specify GPU, memory, and CPU carefully:

resources:
  requests:
    nvidia.com/gpu: "8"        # 8 GPUs required
    memory: "100Gi"            # Model + KV cache
    cpu: "16"                  # Tensor operations
    ephemeral-storage: "50Gi"  # Temp files during inference
  limits:
    nvidia.com/gpu: "8"
    memory: "120Gi"            # 20Gi headroom
    cpu: "16"

Requests ensure scheduling only on nodes with sufficient resources; limits cap what a pod can consume. Note that for extended resources like nvidia.com/gpu, Kubernetes requires requests and limits to be equal — GPUs cannot be overcommitted.

Triton Inference Server for Model Serving

Deploy multiple models (Llama, Mistral, embeddings) on a single Triton instance:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.02-py3
        command: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Metrics

        resources:
          requests:
            nvidia.com/gpu: "2"
            memory: "50Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: "2"
            memory: "60Gi"

        volumeMounts:
        - name: model-repository
          mountPath: /models

        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10

      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: triton-models

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: triton-model-config
  namespace: llm-inference
data:
  # Model configuration files
  ensemble-config.pbtxt: |
    name: "ensemble_model"
    platform: "ensemble"
    input: [...]
    output: [...]

Triton handles dynamic batching, model versioning, and concurrent execution of multiple models on shared GPUs; A/B rollouts can be layered on top of its version policies.
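Batching is opt-in per model. A config.pbtxt fragment enabling dynamic batching looks roughly like this (model name, backend, and the numeric values are illustrative):

```
# config.pbtxt for one model in the repository
name: "llama_13b"
backend: "python"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 500
}
version_policy: { latest { num_versions: 2 } }
```

Triton holds requests up to the queue delay to form larger batches, trading a few hundred microseconds of latency for much higher GPU throughput.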

Multi-Model Serving With Model Routing

Route requests to different models based on payload:

// Model-routing service (runs as its own Deployment in front of the vLLM Services)
import fastify from 'fastify';
import axios from 'axios';

const app = fastify();

app.post('/infer', async (request, reply) => {
  const { prompt, model_type } = request.body;

  // Route to appropriate model
  let service_url;
  if (model_type === 'fast') {
    service_url = 'http://vllm-mistral-7b:8000';
  } else if (model_type === 'powerful') {
    service_url = 'http://vllm-llama-70b:8000';
  } else {
    service_url = 'http://vllm-llama-13b:8000'; // Default
  }

  const response = await axios.post(`${service_url}/v1/completions`, {
    model: 'default',
    prompt,
    max_tokens: 256,
  });

  return reply.send(response.data);
});

app.listen({ port: 3000 });

Clients specify model preference; router dispatches to appropriate service.
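The if/else chain above is easier to test and extend as a lookup table. A sketch, reusing the in-cluster service hostnames from the example:

```typescript
// Map a requested model tier to its in-cluster service URL.
const MODEL_ROUTES: Record<string, string> = {
  fast: 'http://vllm-mistral-7b:8000',
  powerful: 'http://vllm-llama-70b:8000',
};
const DEFAULT_ROUTE = 'http://vllm-llama-13b:8000';

// Unknown or missing tiers fall back to the default model.
function routeFor(modelType?: string): string {
  return (modelType && MODEL_ROUTES[modelType]) || DEFAULT_ROUTE;
}

console.log(routeFor('fast'));    // http://vllm-mistral-7b:8000
console.log(routeFor('unknown')); // http://vllm-llama-13b:8000
```

Adding a model then means adding one table entry rather than another branch.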

Spot GPU Instances for Cost Reduction

Mix on-demand and spot instances to cut costs by up to 70% on the spot portion:

# Spot GPU node group (V100 instances, interruptible)
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-spot \
  --instance-types p3.8xlarge p3.16xlarge p3dn.24xlarge \
  --capacity-type SPOT \
  --scaling-config minSize=0,maxSize=20,desiredSize=5

Spot instances typically cost up to 70% less but can be reclaimed with two minutes' notice. An EKS managed node group is all-spot or all-on-demand, so you mix capacity by running this spot group alongside an on-demand group. Kubernetes PodDisruptionBudgets limit how many replicas go down at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm-inference
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-server

When a spot instance is reclaimed, the two-minute notice gives pods time to drain (a termination handler such as aws-node-termination-handler cordons the node), while the PDB keeps at least one replica serving as replacements schedule onto remaining nodes.

Monitoring GPU Utilization With DCGM

NVIDIA's DCGM (Data Center GPU Manager) exports GPU metrics to Prometheus:

# Install the DCGM exporter
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring

Prometheus scrapes GPU metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s

Grafana dashboards track:

  • GPU utilization (target: 80-90%)
  • GPU memory usage
  • GPU temperature
  • Tensor utilization (compute vs memory-bound)
  • Model inference latency
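The panels above typically sit on DCGM exporter series; the metric names below are DCGM's defaults, though label names can vary with your exporter configuration:

```
# Average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

# Framebuffer (GPU) memory used, in MiB
DCGM_FI_DEV_FB_USED

# GPU temperature in degrees C
DCGM_FI_DEV_GPU_TEMP
```

Sustained utilization well below the 80-90% target usually means batch sizes are too small or traffic is too bursty for the replica count.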

Checklist

  • Create dedicated GPU node pools in EKS/GKE/AKS
  • Install NVIDIA device plugin on cluster
  • Deploy vLLM with tensor parallelism configured
  • Set up PersistentVolume for model storage
  • Configure HPA for GPU utilization-based scaling
  • Install KEDA and configure request-queue-based scaling
  • Set resource requests/limits for GPU pods
  • Deploy Triton for multi-model serving (optional)
  • Mix on-demand and spot GPU instances
  • Install DCGM exporter for GPU monitoring
  • Create Grafana dashboards for GPU metrics
  • Configure pod disruption budgets for graceful shutdowns

Conclusion

Kubernetes excels at orchestrating LLM inference once you account for GPU constraints. vLLM maximizes throughput through batching and KV cache optimization. Autoscaling on queue depth prevents latency spikes, spot instances cut GPU costs by as much as 70%, and monitoring GPU utilization keeps resource usage honest. The result: highly available, cost-effective inference for demanding workloads.