Self-Hosting LLMs With vLLM — Running Open-Source Models in Production

Authors:
- Sanjeev Sharma (@webcoderspeed1)
Introduction
Self-hosting LLMs offers cost savings, data privacy, and model flexibility. vLLM, a production-grade inference engine, has become the standard for high-performance deployments. This guide covers framework comparisons, optimization techniques, and when self-hosting beats APIs.
- vLLM vs Ollama vs TGI vs llama.cpp
- PagedAttention for Memory Efficiency
- Continuous Batching
- Quantization: GPTQ, AWQ, bitsandbytes
- GPU Selection for Model Size
- Multi-GPU Tensor Parallelism
- OpenAI-Compatible API Endpoint
- Autoscaling With Kubernetes
- Cost Comparison: Self-Hosted vs APIs
- When Self-Hosting Beats APIs
- Checklist
- Conclusion
vLLM vs Ollama vs TGI vs llama.cpp
Compare the major LLM serving frameworks:
| Framework | Deployment | Latency | Throughput | Ease | Production Ready |
|---|---|---|---|---|---|
| vLLM | GPU | <50ms | 100+ QPS | Moderate | Yes |
| Ollama | Local | <100ms | 10-50 QPS | Very Easy | Development |
| TGI (Text Generation Inference) | GPU | <30ms | 200+ QPS | Hard | Yes |
| llama.cpp | CPU | <200ms | 5-20 QPS | Very Easy | Development |
vLLM advantages:
- OpenAI-compatible API (drop-in replacement)
- Continuous batching for high throughput
- Multi-LoRA support
- Dynamic batching reduces latency
Ollama advantages:
- Single binary, no dependencies
- Perfect for local testing
- Huge model library
TGI advantages:
- Optimized for batch inference
- Token streaming
- Excellent for chatbots
llama.cpp advantages:
- CPU inference (no GPU required)
- Extreme efficiency
For production: vLLM dominates. For prototyping: Ollama. For batch processing: TGI.
PagedAttention for Memory Efficiency
vLLM's PagedAttention is its core innovation. Traditional serving pre-allocates a contiguous KV cache sized for each request's maximum sequence length, so most of the reserved memory sits unused; PagedAttention stores the cache in small, non-contiguous pages, much like virtual memory:
```python
from vllm import LLM, SamplingParams

# Traditional attention: allocates the full KV cache upfront.
# For a 32B model at batch size 256: ~40GB of KV cache.
# vLLM PagedAttention: shared KV cache pages.
# Same scenario: ~6GB (6x memory reduction).

# Initialize with PagedAttention (enabled by default)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,       # 2 GPUs
    gpu_memory_utilization=0.9,   # use 90% of VRAM
)

# vLLM manages KV cache pages automatically;
# you don't need to think about it.
prompts = [
    "What is semantic search?",
    "Explain vector databases",
    "How do embeddings work?",
] * 100  # 300 requests

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)

# Continuous batching: processes all 300 without OOM
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
PagedAttention enables:
- Larger batch sizes (32 → 256)
- Better throughput (50 QPS → 200 QPS)
- Lower latency (less head-of-line blocking between requests)
Critical for production inference at scale.
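To see why KV-cache memory dominates at scale, here is a back-of-the-envelope calculator (a sketch; the layer and head counts below are Llama 2 70B's published architecture, and real servers add paging overhead on top):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Memory for one sequence's KV cache: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Llama 2 70B: 80 layers, 8 KV heads (grouped-query attention), head_dim 128
per_seq = kv_cache_bytes(80, 8, 128, seq_len=4096)
print(f"{per_seq / 1e9:.2f} GB per 4K-token sequence")  # ~1.34 GB

# Without paging, a batch of 64 such sequences would reserve all of this
# upfront; PagedAttention allocates pages only as tokens are generated.
print(f"{64 * per_seq / 1e9:.0f} GB reserved for a batch of 64")
```

Because most requests finish well short of the maximum length, paged allocation is what lets batch sizes grow from 32 to 256 on the same hardware.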
Continuous Batching
Traditional batching waits for slowest request. Continuous batching releases early:
```python
from vllm import LLM, SamplingParams
import time

# Traditional static batching:
#   Request 1: 100 tokens (slow)
#   Request 2: 10 tokens (blocked behind request 1)
#   Request 3: 50 tokens (blocked behind request 1)
#   Every request pays the latency of the slowest one.
#
# vLLM continuous batching:
#   Request 2 finishes first (10 tokens),
#   then request 3 (50 tokens), then request 1 (100 tokens).
#   Requests 2 and 3 don't wait for 1.

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    max_model_len=4096,
    max_num_batched_tokens=8192,  # token budget per scheduler step
    max_num_seqs=256,             # max concurrent sequences
)

prompts = [
    "What is AI?",
    "Explain transformers",
    "Vector databases for semantic search",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# A single generate() call is enough: vLLM's scheduler interleaves all
# sequences with continuous batching, so short ones free their slots early.
# (For true per-request concurrency over HTTP, use the OpenAI-compatible
# server or AsyncLLMEngine rather than the offline LLM class.)
start = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"Batch completed in {time.time() - start:.2f}s")
```
Continuous batching improves throughput (requests/sec) and keeps latency low (no head-of-line blocking).
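The head-of-line effect can be illustrated with a toy model in which a request's latency is simply its output length in decode steps (idealized numbers for illustration, not a benchmark):

```python
# Output lengths (tokens) for three concurrent requests
requests = {"req1": 100, "req2": 10, "req3": 50}

# Static batching: every request waits for the whole batch to finish.
static_latency = {r: max(requests.values()) for r in requests}

# Continuous batching (idealized): each request finishes after its own tokens.
continuous_latency = dict(requests)

for r in requests:
    print(f"{r}: static={static_latency[r]} steps, "
          f"continuous={continuous_latency[r]} steps")
# req2 improves 10x: 100 steps -> 10 steps
```

The freed slots are immediately refilled with waiting requests, which is where the throughput gain comes from.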
Quantization: GPTQ, AWQ, bitsandbytes
Reduce model size with only a small quality loss:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from vllm import LLM

# GPTQ quantization (int4): ~4x memory reduction.
# Load a pre-quantized GPTQ checkpoint.
llm_gptq = LLM(
    model="TheBloke/Llama-2-70B-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.95,
)

# AWQ quantization (int4): often better quality than GPTQ.
# Find pre-quantized AWQ models on Hugging Face.
llm_awq = LLM(
    model="TheBloke/Llama-2-70B-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.95,
)

# bitsandbytes (8-bit or 4-bit): on-the-fly quantization via transformers
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# bitsandbytes 4-bit (NF4)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

# Memory usage comparison (70B parameters):
#   fp32:      280 GB (70B params x 4 bytes)
#   fp16:      140 GB
#   GPTQ int4:  35 GB
#   NF4:        35 GB
#
# Quality trade-off:
#   GPTQ/AWQ: ~1-2% perplexity increase
#   NF4:      ~2-3% perplexity increase
```
When to quantize:
- Model doesn't fit in VRAM unquantized
- Cost per inference matters (>1000 QPS)
- Latency < 50ms requirement
Don't quantize if:
- Quality is critical (legal, medical)
- Budget allows full precision
- Model fits with headroom
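The memory numbers above follow directly from bits per parameter; a quick sketch:

```python
def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (ignores quantization metadata)."""
    return num_params * bits_per_param / 8 / 1e9

# 70B-parameter model at different precisions
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_gb(70e9, bits):.0f} GB")
# fp32: 280 GB, fp16: 140 GB, int8: 70 GB, int4: 35 GB
```

Real quantized checkpoints carry a few percent of extra metadata (scales, zero points), so treat these as lower bounds.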
GPU Selection for Model Size
Match GPU to model size:
```python
# Model size estimates (GB, fp16 weights)
model_sizes = {
    "Llama 2 7B": 14,
    "Llama 2 13B": 26,
    "Llama 2 70B": 140,
    "Mistral 7B": 14,
    "Code Llama 34B": 68,
}

# GPU VRAM (GB)
gpu_vram = {
    "H100": 80,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "L40": 48,
    "RTX 4090": 24,
}

# Rule of thumb: total GPU VRAM should be 1.5-2x the model's weight size
# (headroom for activations and the KV cache).
#
# Recommendations (fp16 weights unless noted):
#   7B model:    RTX 4090 (24GB)
#   13B model:   A100 40GB or 2x RTX 4090
#   70B model:   2x A100 80GB or 2x H100 (or one 80GB GPU with int4)
#   100B+ model: 4x H100 or 8x A100
#
# For cost-effective inference:
#   - Use A100 80GB ($3-4/hour on cloud)
#   - Serve multiple customers per GPU
#   - Quantize to fit more models
```
Avoid over-provisioning GPUs. An A100 40GB is usually the sweet spot for 7-13B models in fp16, and stretches to ~30B with int4 quantization.
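The rule of thumb can be captured in a small helper (a sketch; VRAM figures as in the table above, with a 1.5x headroom factor assumed):

```python
import math

GPU_VRAM_GB = {"RTX 4090": 24, "A100 40GB": 40, "L40": 48,
               "A100 80GB": 80, "H100": 80}

def min_gpus(model_gb: float, vram_gb: float, headroom: float = 1.5) -> int:
    """GPUs needed so total VRAM >= headroom x model weight size."""
    return math.ceil(model_gb * headroom / vram_gb)

print(min_gpus(14, GPU_VRAM_GB["RTX 4090"]))    # 7B fp16  -> 1
print(min_gpus(35, GPU_VRAM_GB["A100 80GB"]))   # 70B int4 -> 1
print(min_gpus(140, GPU_VRAM_GB["A100 80GB"]))  # 70B fp16 -> 3 at full
# headroom; in practice 2x 80GB works with a tighter KV-cache budget
```

Tuning the `headroom` factor down trades batch size (KV-cache room) for fewer GPUs, which is exactly the knob `gpu_memory_utilization` exposes at runtime.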
Multi-GPU Tensor Parallelism
Split large models across GPUs:
```python
from vllm import LLM, SamplingParams

# Llama 2 70B needs ~140GB of VRAM in fp16.
# Split it across 2x A100 80GB with tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,  # use 2 GPUs
    gpu_memory_utilization=0.9,
)

# Tensor parallelism shards the weight matrices *within* each layer, so
# every GPU holds a slice of every layer. (Pipeline parallelism, by
# contrast, would assign whole layers to different GPUs.)

prompts = [
    "Explain neural networks",
    "What is machine learning?",
] * 50

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Inference spans both GPUs transparently
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Generated: {output.outputs[0].text}")

# Throughput scaling (illustrative):
#   1x A100:        ~50 QPS
#   2x A100 (TP=2): ~90 QPS (not a perfect 2x; communication overhead)
#   4x A100 (TP=4): ~150 QPS
```
Tensor parallelism overhead: ~10-20% per additional GPU. Don't over-shard.
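Mechanically, tensor parallelism means each GPU multiplies the input by its column slice of a weight matrix, and the partial outputs concatenate into the full result. A dependency-free sketch with a toy 2x4 weight matrix:

```python
def matmul(A, B):
    """Plain-Python matrix multiply: A (m x k) times B (k x n)."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [[1.0, 2.0]]                 # one token's activations (1 x 2)
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]       # full weight matrix (2 x 4)

# Column-shard W across two "GPUs": each holds half the output features
W0 = [row[:2] for row in W]      # GPU 0: columns 0-1
W1 = [row[2:] for row in W]      # GPU 1: columns 2-3

y_full = matmul(x, W)
y_sharded = [matmul(x, W0)[0] + matmul(x, W1)[0]]  # concat partial outputs

assert y_full == y_sharded       # same result, half the weights per GPU
print(y_full)  # [[11.0, 14.0, 17.0, 20.0]]
```

The concatenation step is where the inter-GPU communication (and the 10-20% overhead) comes from in a real deployment.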
OpenAI-Compatible API Endpoint
vLLM's API is drop-in compatible with OpenAI:
```python
# Start the vLLM server:
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-2-70b-hf \
#     --tensor-parallel-size 2 \
#     --gpu-memory-utilization 0.9 \
#     --port 8000

from openai import OpenAI

# Point the OpenAI client at the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # dummy key; vLLM ignores it by default
)

def chat_completion(
    prompt: str,
    temperature: float = 0.7,
    max_tokens: int = 512,
) -> str:
    completion = client.chat.completions.create(
        model="meta-llama/Llama-2-70b-hf",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content

# Drop-in replacement for the OpenAI API
response = chat_completion("What is semantic search?")
print(response)

# Cost comparison:
#   OpenAI GPT-4: $0.03 per 1K tokens
#   Self-hosted Llama 2 70B: ~$0.001 per 1K tokens (H100 at $3/hour)
#   Savings: ~30x at scale
```
vLLM's API compatibility means zero code changes. Switch providers easily.
Autoscaling With Kubernetes
Deploy and scale vLLM on K8s:
```yaml
# Kubernetes Deployment for vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - meta-llama/Llama-2-70b-hf
            - --tensor-parallel-size
            - "2"
            - --gpu-memory-utilization
            - "0.9"
            - --port
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: 2  # request 2 GPUs
            limits:
              nvidia.com/gpu: 2
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer
---
# Horizontal Pod Autoscaler
# (CPU/memory are rough proxies for GPU-bound services; scaling on a
# custom metric such as request queue depth tracks load more closely)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama2-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Autoscaling with K8s lets you:
- Scale up during high traffic
- Scale down for cost savings
- Run multiple model replicas behind one endpoint
Cost Comparison: Self-Hosted vs APIs
Break down costs:
```python
# Inference cost analysis
class InferenceCostCalculator:
    def __init__(self, daily_tokens: int):
        self.daily_tokens = daily_tokens
        self.monthly_tokens = daily_tokens * 30

    def openai_gpt4_cost(self) -> float:
        # $0.03 per 1K input tokens
        return (self.monthly_tokens / 1000) * 0.03

    def openai_gpt35_cost(self) -> float:
        # $0.0005 per 1K input tokens
        return (self.monthly_tokens / 1000) * 0.0005

    def self_hosted_cost(self, hourly_rate: float = 3.0) -> float:
        # Model runs 24/7 on a GPU (H100 on cloud: ~$3/hour)
        monthly_hours = 730
        return hourly_rate * monthly_hours

    def self_hosted_with_autoscaling(
        self, hourly_rate: float = 3.0, utilization: float = 0.3
    ) -> float:
        # Only pay for the GPU while in use (30% average utilization)
        monthly_hours = 730
        return hourly_rate * monthly_hours * utilization


# Example: 100M tokens per day
calculator = InferenceCostCalculator(daily_tokens=100_000_000)
print(f"OpenAI GPT-4: ${calculator.openai_gpt4_cost():,.2f}/month")
print(f"OpenAI GPT-3.5: ${calculator.openai_gpt35_cost():,.2f}/month")
print(f"Self-hosted (24/7): ${calculator.self_hosted_cost():,.2f}/month")
print(f"Self-hosted (30% util): ${calculator.self_hosted_with_autoscaling():,.2f}/month")

# Output:
#   OpenAI GPT-4:           $90,000.00/month
#   OpenAI GPT-3.5:         $1,500.00/month
#   Self-hosted (24/7):     $2,190.00/month
#   Self-hosted (30% util): $657.00/month
#
# Against GPT-4-class pricing, self-hosting wins at roughly 10M+ tokens/day.
```
Autoscaling makes self-hosting economical at even lower volumes.
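Using the same rates, the break-even point against GPT-4-class pricing follows from equating the monthly bills (a sketch with the article's numbers; real traffic is rarely this uniform, and a single GPU caps out well below 100M tokens/day for large models):

```python
# Monthly self-hosted cost: $3/hour x 730 hours
self_hosted_monthly = 3.0 * 730          # $2,190

# GPT-4-class API: $0.03 per 1K tokens
api_cost_per_token = 0.03 / 1000

# Break-even: daily tokens where API spend matches the GPU bill
breakeven_daily = self_hosted_monthly / api_cost_per_token / 30
print(f"Break-even vs GPT-4 pricing: {breakeven_daily / 1e6:.1f}M tokens/day")
# ~2.4M tokens/day for one always-on GPU; multi-GPU fleets and cheaper
# API tiers push the practical threshold toward the 10-50M/day range.
```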
When Self-Hosting Beats APIs
Self-host if:
- Volume: >50M tokens/day (cost savings >50%)
- Latency: <50ms requirement (API adds roundtrip)
- Data privacy: PII, medical, legal docs
- Customization: Fine-tuned models, custom system prompts
- Control: Need reproducibility, audit logs
Don't self-host if:
- Volume: <10M tokens/day (cost savings minimal)
- DevOps budget: Limited infrastructure expertise
- Flexibility: Need latest GPT-4 model immediately
- Reliability: Can't maintain uptime SLA
Checklist
- Benchmark throughput on your hardware (QPS, latency)
- Select quantization strategy (int4, int8, or full precision)
- Size GPU cluster for peak load
- Set up vLLM with OpenAI API compatibility
- Implement request queuing and timeout handling
- Monitor GPU utilization and throughput
- Test failover across GPU instances
- Calculate ROI: API costs vs self-hosted
- Plan model update strategy
- Set up cost monitoring and alerts
Conclusion
vLLM is the production standard for self-hosted LLMs. PagedAttention and continuous batching unlock high throughput. Quantization reduces cost. Multi-GPU scaling handles large models. OpenAI-compatible API enables zero-friction adoption. At >50M tokens/day, self-hosting saves significantly. At lower volumes, APIs remain cost-effective. Choose based on volume, latency, and data privacy needs.