Running LLM Workloads on Kubernetes — GPU Scheduling, vLLM, and Autoscaling

By Sanjeev Sharma (@webcoderspeed1)
Introduction
Kubernetes orchestrates stateless services beautifully. LLM inference is different: GPUs are scarce, models consume tens of gigabytes of memory, and batching requests can improve throughput by an order of magnitude. Running LLM workloads on Kubernetes requires GPU node pools, careful resource requests, and autoscaling on signals beyond CPU metrics. The payoff: cost-effective, highly available inference at scale.
- GPU Node Pools in EKS/GKE/AKS
- NVIDIA Device Plugin for K8s
- vLLM Deployment on Kubernetes
- Horizontal Pod Autoscaling for LLM Inference
- KEDA for Request-Queue-Based Scaling
- Resource Requests/Limits for GPU Pods
- Triton Inference Server for Model Serving
- Multi-Model Serving With Model Routing
- Spot GPU Instances for Cost Reduction
- Monitoring GPU Utilisation With DCGM
- Checklist
- Conclusion
GPU Node Pools in EKS/GKE/AKS
Create dedicated GPU node pools separate from CPU workloads:
EKS (AWS):
# Create GPU node group with on-demand instances
# (--node-role takes the IAM role ARN for your worker nodes)
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-nodes-a100 \
  --node-role arn:aws:iam::<account-id>:role/<node-role> \
  --subnets subnet-12345 \
  --instance-types p4d.24xlarge \
  --scaling-config minSize=1,maxSize=5,desiredSize=2 \
  --tags "workload=gpu-inference"
# Create spot GPU node group (cheaper, interruptible)
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-spot-v100 \
  --node-role arn:aws:iam::<account-id>:role/<node-role> \
  --subnets subnet-12345 \
  --instance-types p3.8xlarge \
  --capacity-type SPOT \
  --scaling-config minSize=1,maxSize=5,desiredSize=3
GKE (Google Cloud):
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --num-nodes 2 \
  --machine-type a2-highgpu-2g \
  --accelerator type=nvidia-tesla-a100,count=2 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10
AKS (Azure):
az aks nodepool add \
  --resource-group rg-name \
  --cluster-name aks-cluster \
  --name gpunodes \
  --node-count 2 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10
GPU instances are expensive; separate pools prevent CPU workloads from consuming GPU capacity.
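The separation is usually enforced with a taint on the GPU pool and a matching toleration on inference pods. A sketch (some managed offerings, e.g. GKE, apply a `nvidia.com/gpu` taint to GPU nodes automatically):

```yaml
# Taint applied to GPU nodes, e.g.:
#   kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
# Pods that should land there carry the matching toleration in their spec:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

CPU-only pods lack the toleration, so the scheduler never places them on GPU capacity.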
NVIDIA Device Plugin for K8s
Install the NVIDIA Device Plugin to expose GPUs as schedulable resources:
# For NVIDIA GPUs
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPUs are visible as allocatable resources
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
# Shows: nvidia.com/gpu: 8 on GPU nodes
The plugin discovers GPUs and registers them with kubelet.
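A quick end-to-end check is a throwaway pod that requests one GPU and runs nvidia-smi (the CUDA image tag here is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1" # schedules only onto a node with a free GPU
```

If `kubectl logs gpu-smoke-test` prints the nvidia-smi table, scheduling and the driver stack are working.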
vLLM Deployment on Kubernetes
vLLM serves large language models with batching and KV cache optimization. Deploy as a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
      model: llama-70b
  template:
    metadata:
      labels:
        app: vllm-server
        model: llama-70b
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload
                    operator: In
                    values:
                      - gpu-inference
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model=/models/meta-llama/Llama-2-70b-hf"
            - "--tensor-parallel-size=8"
            - "--gpu-memory-utilization=0.9"
            - "--max-model-len=4096"
          ports:
            - name: http
              containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "8"
              memory: "100Gi"
              cpu: "16"
            limits:
              nvidia.com/gpu: "8"
              memory: "120Gi"
              cpu: "16"
          env:
            - name: HF_HOME
              value: "/models"
            - name: HUGGINGFACE_HUB_CACHE
              value: "/models"
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            # Loading a 70B model into GPU memory takes minutes;
            # a short delay would crash-loop the pod during startup
            initialDelaySeconds: 600
            periodSeconds: 10
            timeoutSeconds: 5
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-storage
        # Tensor-parallel workers communicate over shared memory;
        # the container default /dev/shm (64Mi) is far too small
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b
  namespace: llm-inference
spec:
  selector:
    app: vllm-server
    model: llama-70b
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
      name: http
vLLM's tensor parallelism distributes computation across 8 GPUs. GPU memory utilization at 90% maximizes throughput without OOM errors.
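These settings can be sanity-checked with quick arithmetic. A sketch in TypeScript (the parameter count and fp16 assumption are approximations, not vLLM's exact memory accounting):

```typescript
// Back-of-envelope: weight memory per GPU under tensor parallelism
// = totalParams * bytesPerParam / tensorParallelSize.
function weightGiBPerGpu(params: number, bytesPerParam: number, tpSize: number): number {
  return (params * bytesPerParam) / tpSize / 1024 ** 3;
}

// Llama-2-70B in fp16 split across 8 GPUs:
const perGpu = weightGiBPerGpu(70e9, 2, 8);
console.log(perGpu.toFixed(1)); // prints "16.3" (GiB per GPU)
```

On 40 GiB A100s, roughly 16 GiB of weights per GPU leaves the remainder of the 90% budget for KV cache, which is what lets vLLM batch aggressively.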
Horizontal Pod Autoscaling for LLM Inference
Scale inference pods based on request queue depth or latency:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale on GPU utilization, exposed as a custom metric via prometheus-adapter
    # (the built-in Resource metric type supports only cpu and memory)
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"
    # Scale on request queue depth (custom metric)
    - type: Pods
      pods:
        metric:
          name: vllm_request_queue_depth
        target:
          type: AverageValue
          averageValue: "30"
The HPA evaluates every listed metric and scales to the highest replica count any of them demands. Exposing GPU utilization and queue depth as Pods metrics requires a custom-metrics adapter such as prometheus-adapter; the built-in resource metrics cover only CPU and memory.
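The rule behind these targets is the documented HPA scaling formula; a small sketch:

```typescript
// Kubernetes HPA core rule:
// desired = ceil(currentReplicas * currentMetric / targetMetric), clamped to [min, max].
function hpaDesiredReplicas(
  current: number,
  metric: number,
  target: number,
  min: number,
  max: number
): number {
  const desired = Math.ceil(current * (metric / target));
  return Math.min(max, Math.max(min, desired));
}

console.log(hpaDesiredReplicas(2, 90, 70, 1, 5)); // prints 3 (90% observed vs 70% target)
```

Because the formula is proportional, a pool running far above target jumps several replicas at once rather than stepping up one at a time.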
KEDA for Request-Queue-Based Scaling
KEDA (Kubernetes Event Driven Autoscaling) scales based on external metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-scaler-config
  namespace: llm-inference
data:
  prometheus-url: "http://prometheus:9090"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-llama-70b
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_request_queue_length
        threshold: "50"
        # Queue depth is a gauge: query its current value, not a rate
        query: |
          sum(vllm_request_queue_length)
KEDA queries Prometheus for queue depth. When queue exceeds 50 requests, replicas scale up in seconds.
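Under the hood KEDA drives a standard HPA with the query result as an AverageValue target, so the replica count it converges on can be sketched as:

```typescript
// KEDA's effective behavior for a threshold trigger:
// desired = ceil(metricValue / threshold), clamped to [minReplicaCount, maxReplicaCount].
function kedaDesiredReplicas(
  queueDepth: number,
  threshold: number,
  min: number,
  max: number
): number {
  const desired = Math.ceil(queueDepth / threshold);
  return Math.min(max, Math.max(min, desired));
}

console.log(kedaDesiredReplicas(230, 50, 1, 10)); // prints 5 (230 queued / 50 per replica)
```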
Resource Requests/Limits for GPU Pods
Specify GPU, memory, and CPU carefully:
resources:
  requests:
    nvidia.com/gpu: "8"       # 8 GPUs required
    memory: "100Gi"           # Model + KV cache
    cpu: "16"                 # Tokenization, scheduling, data movement
    ephemeral-storage: "50Gi" # Temp files during inference
  limits:
    nvidia.com/gpu: "8"       # Must equal the request: GPUs can't be overcommitted
    memory: "120Gi"           # 20Gi headroom
    cpu: "16"
Requests drive scheduling onto nodes with sufficient free resources; limits cap consumption. Note that extended resources like nvidia.com/gpu cannot be overcommitted: the GPU request and limit must be equal, and GPUs are allocated whole.
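The 100Gi figure can be motivated with rough KV-cache arithmetic. The architecture numbers below are for Llama-2-70B (80 layers, 8 grouped-query KV heads, head dimension 128, fp16); this is a sketch, not exact accounting:

```typescript
// KV cache bytes per token = 2 (K and V) * layers * kvHeads * headDim * bytesPerValue.
// Multiply by sequence length to size the cache for one full-context request.
function kvCacheGiB(
  layers: number,
  kvHeads: number,
  headDim: number,
  bytesPerValue: number,
  tokens: number
): number {
  const perToken = 2 * layers * kvHeads * headDim * bytesPerValue;
  return (perToken * tokens) / 1024 ** 3;
}

console.log(kvCacheGiB(80, 8, 128, 2, 4096)); // prints 1.25 (GiB per 4096-token sequence)
```

At roughly 1.25 GiB per full-context sequence, a large concurrent batch quickly consumes tens of gigabytes, which is why the memory request must budget for far more than the weights alone.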
Triton Inference Server for Model Serving
Deploy multiple models (Llama, Mistral, embeddings) on a single Triton instance:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.02-py3
          # The image does not serve by default; point it at the model repository
          command: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000 # HTTP
            - containerPort: 8001 # gRPC
            - containerPort: 8002 # Metrics
          resources:
            requests:
              nvidia.com/gpu: "2"
              memory: "50Gi"
              cpu: "8"
            limits:
              nvidia.com/gpu: "2"
              memory: "60Gi"
          volumeMounts:
            - name: model-repository
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: model-repository
          persistentVolumeClaim:
            claimName: triton-models
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: triton-model-config
  namespace: llm-inference
data:
  # Model configuration files
  ensemble-config.pbtxt: |
    name: "ensemble_model"
    platform: "ensemble"
    input: [...]
    output: [...]
Triton handles dynamic batching, model versioning, and concurrent execution of multiple models; combined with version policies, it also supports canary-style rollouts.
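For a single model, a minimal config.pbtxt enabling dynamic batching might look like this (the model name, backend, and batch sizes are illustrative):

```
name: "llama_7b"
backend: "python"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 500
}
instance_group [ { count: 1, kind: KIND_GPU } ]
```

The `max_queue_delay_microseconds` knob trades a small amount of latency for fuller batches, which is usually the right trade for GPU-bound inference.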
Multi-Model Serving With Model Routing
Route requests to different models based on payload:
// Kubernetes Service with model routing
import fastify from 'fastify';
import axios from 'axios';

const app = fastify();

interface InferRequest {
  prompt: string;
  model_type?: 'fast' | 'powerful';
}

app.post('/infer', async (request, reply) => {
  const { prompt, model_type } = request.body as InferRequest;

  // Route to the appropriate model service
  let service_url: string;
  if (model_type === 'fast') {
    service_url = 'http://vllm-mistral-7b:8000';
  } else if (model_type === 'powerful') {
    service_url = 'http://vllm-llama-70b:8000';
  } else {
    service_url = 'http://vllm-llama-13b:8000'; // Default
  }

  const response = await axios.post(`${service_url}/v1/completions`, {
    model: 'default',
    prompt,
    max_tokens: 256,
  });
  return reply.send(response.data);
});

app.listen({ port: 3000 });
Clients specify model preference; router dispatches to appropriate service.
Spot GPU Instances for Cost Reduction
Mix on-demand and spot instances for up to ~70% cost savings. An EKS managed node group holds a single capacity type, so run a spot group alongside the on-demand group:
# Spot GPU node group (V100 families, interruptible)
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-spot \
  --node-role arn:aws:iam::<account-id>:role/<node-role> \
  --subnets subnet-12345 \
  --instance-types p3.8xlarge p3.16xlarge p3dn.24xlarge \
  --capacity-type SPOT \
  --scaling-config minSize=0,maxSize=20,desiredSize=5
# Keep a smaller on-demand group as the baseline;
# roughly 30% on-demand / 70% spot is a common split
Spot instances typically cost 60-90% less but can be reclaimed with two minutes' notice. Kubernetes PodDisruptionBudgets ensure graceful shutdowns:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm-inference
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-server
When a spot instance is reclaimed, the PDB keeps at least one replica serving while evicted pods reschedule onto surviving (often on-demand) nodes; with graceful termination configured, in-flight requests can drain before shutdown.
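On the client side, a small retry wrapper makes interruptions largely invisible to callers; a hypothetical helper (not part of vLLM or any client SDK):

```typescript
// Retry an inference call with exponential backoff: if a spot node is reclaimed
// mid-request the call fails, and the retry lands on a surviving replica.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}
```

Wrap each POST to the inference Service in `withRetry` so a two-minute spot reclaim shows up as slightly higher tail latency rather than failed requests.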
Monitoring GPU Utilisation With DCGM
NVIDIA's DCGM (Data Center GPU Manager) exports GPU metrics to Prometheus:
# Install the DCGM exporter (the chart moved from gpu-monitoring-tools
# to the dedicated dcgm-exporter repository)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring
Prometheus scrapes GPU metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
Grafana dashboards track:
- GPU utilization (target: 80-90%)
- GPU memory usage
- GPU temperature
- Tensor utilization (compute vs memory-bound)
- Model inference latency
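Example PromQL for those dashboards, using metric names the DCGM exporter publishes (DCGM_FI_DEV_FB_USED is reported in MiB):

```
# Average GPU utilization per pod
avg by (pod) (DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory used, converted to GiB
DCGM_FI_DEV_FB_USED / 1024

# GPU temperature in Celsius (alert threshold candidate)
DCGM_FI_DEV_GPU_TEMP
```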
Checklist
- Create dedicated GPU node pools in EKS/GKE/AKS
- Install NVIDIA device plugin on cluster
- Deploy vLLM with tensor parallelism configured
- Set up PersistentVolume for model storage
- Configure HPA for GPU utilization-based scaling
- Install KEDA and configure request-queue-based scaling
- Set resource requests/limits for GPU pods
- Deploy Triton for multi-model serving (optional)
- Mix on-demand and spot GPU instances
- Install DCGM exporter for GPU monitoring
- Create Grafana dashboards for GPU metrics
- Configure pod disruption budgets for graceful shutdowns
Conclusion
Kubernetes excels at orchestrating LLM inference when you understand GPU constraints. vLLM maximizes throughput through batching and KV cache optimization. Autoscaling on queue depth prevents latency spikes. Spot instances cut costs by up to 70%. Monitoring GPU utilization keeps resource usage efficient. The result: highly available, cost-effective inference for demanding workloads.