Self-Hosting LLMs With vLLM — Running Open-Source Models in Production

Introduction

Self-hosting LLMs offers cost savings, data privacy, and model flexibility. vLLM, a production-grade inference engine, has become the de facto standard for high-performance open-model deployments. This guide covers framework comparisons, optimization techniques, and when self-hosting beats APIs.

vLLM vs Ollama vs TGI vs llama.cpp

Compare the major LLM serving frameworks:

Framework       Deployment  Latency  Throughput  Ease       Production Ready
vLLM            GPU         <50ms    100+ QPS    Moderate   Yes
Ollama          Local       <100ms   10-50 QPS   Very Easy  Development
TGI (Text-Gen)  GPU         <30ms    200+ QPS    Hard       Yes
llama.cpp       CPU         <200ms   5-20 QPS    Very Easy  Development

vLLM advantages:

  • OpenAI-compatible API (drop-in replacement)
  • Continuous batching for high throughput and low queueing latency
  • Multi-LoRA support
  • PagedAttention for efficient KV-cache memory

Ollama advantages:

  • Single binary, no dependencies
  • Perfect for local testing
  • Huge model library

TGI advantages:

  • Optimized for batch inference
  • Token streaming
  • Excellent for chatbots

llama.cpp advantages:

  • CPU inference (no GPU required)
  • Extreme efficiency

For production: vLLM dominates. For prototyping: Ollama. For batch processing: TGI.
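As a toy illustration, that decision logic can be encoded in a few lines of Python (the function name and inputs are ours, purely illustrative):

```python
def pick_framework(needs_production: bool, has_gpu: bool,
                   batch_heavy: bool = False) -> str:
    """Illustrative chooser mirroring the comparison table above."""
    if not has_gpu:
        return "llama.cpp"   # CPU-only inference
    if not needs_production:
        return "Ollama"      # easiest path for local prototyping
    if batch_heavy:
        return "TGI"         # optimized for batch workloads
    return "vLLM"            # default production choice
```

In practice teams often mix these: Ollama on laptops, vLLM in the serving cluster.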

PagedAttention for Memory Efficiency

vLLM's PagedAttention changes how the KV cache is allocated. Traditional serving reserves one contiguous KV-cache buffer per request, sized for the maximum sequence length, so most of that memory sits unused or fragmented:

# Traditional: the full max-length KV cache is reserved upfront
# e.g. 32B model at batch 256: ~40GB of cache, much of it unused

# vLLM PagedAttention: cache allocated in small shared pages on demand
# Same scenario: ~6GB (roughly 6× memory reduction; illustrative numbers)

from vllm import LLM, SamplingParams

# Initialize with paged attention (default)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,  # 2 GPUs
    gpu_memory_utilization=0.9,  # Use 90% of VRAM
)

# vLLM manages KV cache pages automatically
# You don't need to think about it

prompts = [
    "What is semantic search?",
    "Explain vector databases",
    "How do embeddings work?",
] * 100  # 300 requests

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)

# Continuous batching: processes all 300 without OOM
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

PagedAttention enables:

  • Larger batch sizes (e.g. 32 → 256 concurrent sequences)
  • Higher throughput (e.g. 50 QPS → 200 QPS)
  • Lower queueing latency (requests spend less time waiting for cache memory)

Critical for production inference at scale.
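To see why paging matters, it helps to estimate how big the KV cache is per token. A back-of-envelope calculator, assuming Llama 2 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # One K vector and one V vector per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(80, 8, 128)  # Llama 2 70B, fp16
print(per_token)                 # 327680 bytes, ~0.31 MB per token
print(per_token * 4096 / 2**30)  # 1.25 GiB for one full 4K context
```

Reserving that full 1.25 GiB upfront for every request is exactly what PagedAttention avoids: pages are allocated only as tokens are actually generated.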

Continuous Batching

Traditional static batching makes every request wait for the slowest one in the batch. Continuous batching returns each request as soon as it finishes and immediately fills the freed slot with a waiting request:

# Traditional static batching:
# Request 1: 100 tokens (slow)
# Request 2: 10 tokens (blocked)
# Request 3: 50 tokens (blocked)
# Total latency: slowest request

# vLLM continuous batching:
# Request 2 finishes (10 tokens)
# Request 3 finishes (50 tokens)
# Request 1 finishes (100 tokens)
# Requests 2,3 don't wait for 1

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    max_model_len=4096,
    max_num_batched_tokens=8192,  # token budget per scheduler step
    max_num_seqs=256,  # max sequences batched concurrently
)

# LLM.generate is synchronous; the engine applies continuous batching
# internally across all submitted prompts. Short prompts finish and
# free their batch slots without waiting for long ones.
prompts = [
    "What is AI?",
    "Explain",
    "Vector databases for semantic search",
]

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    print(f"Request {i}: {output.outputs[0].text[:60]}")

# For truly concurrent request handling, run the OpenAI-compatible
# server (covered below) or use vLLM's AsyncLLMEngine.

Continuous batching improves throughput (requests/sec) and keeps latency low (no head-of-line blocking).
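The head-of-line-blocking effect can be shown with a toy model that assumes a fixed per-request decode rate and ignores shared GPU throughput (numbers are illustrative, not benchmarks):

```python
def static_batch_latencies(token_counts, tokens_per_sec=100):
    # Static batching: the whole batch returns when the slowest finishes.
    batch_time = max(token_counts) / tokens_per_sec
    return [batch_time] * len(token_counts)

def continuous_batch_latencies(token_counts, tokens_per_sec=100):
    # Continuous batching: each request completes on its own schedule.
    return [n / tokens_per_sec for n in token_counts]

tokens = [100, 10, 50]  # the three requests from the example above
print(static_batch_latencies(tokens))      # [1.0, 1.0, 1.0]
print(continuous_batch_latencies(tokens))  # [1.0, 0.1, 0.5]
```

The short requests go from 1.0s to 0.1s and 0.5s: tail latency is unchanged, but median latency drops sharply.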

Quantization: GPTQ, AWQ, bitsandbytes

Quantization shrinks model memory several-fold at a small, usually acceptable quality cost:

from vllm import LLM

# GPTQ quantization (int4): ~4× memory reduction vs fp16
# Load a pre-quantized GPTQ checkpoint
llm_gptq = LLM(
    model="TheBloke/Llama-2-70B-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.95,
)

# AWQ quantization (int4): often comparable or slightly better than GPTQ
# Find pre-quantized AWQ models on HuggingFace
llm_awq = LLM(
    model="TheBloke/Llama-2-70B-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.95,
)

# bitsandbytes (8-bit or 4-bit): on-the-fly quantization via transformers
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# bitsandbytes 4-bit (NF4)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

# Memory usage comparison:
# fp32: 280 GB (70B params × 4 bytes)
# fp16: 140 GB
# GPTQ int4: 35 GB
# NF4: 35 GB

# Quality trade-off:
# GPTQ/AWQ: 1-2% perplexity increase
# NF4: 2-3% perplexity increase

When to quantize:

  • Model doesn't fit in VRAM at full precision
  • Per-token cost matters at high request volume
  • Tight latency budgets (smaller weights speed up decoding)

Don't quantize if:

  • Quality is critical (legal, medical)
  • Budget allows full precision
  • Model fits with headroom
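The memory figures in the comparison above follow directly from parameter count and bit width; a small sketch:

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    # Weight memory only; activations and KV cache need extra headroom.
    return n_params * bits / 8 / 1e9

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"70B @ {name}: {model_memory_gb(70e9, bits):.0f} GB")
# 70B @ fp32: 280 GB, fp16: 140 GB, int8: 70 GB, int4: 35 GB
```

The same arithmetic explains why int4 brings a 70B model within reach of a single 80GB GPU.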

GPU Selection for Model Size

Match GPU to model size:

# Model size estimates (in GB, fp16)

model_sizes = {
    "Llama 2 7B": 14,
    "Llama 2 13B": 26,
    "Llama 2 70B": 140,
    "Mistral 7B": 14,
    "Code Llama 34B": 68,
}

gpu_vram = {
    "H100": 80,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "L40": 48,
    "RTX 4090": 24,
}

# Rule of thumb: GPU VRAM should be 1.5-2× model size
# (headroom for activations, KV cache)

# Recommendations (fp16 unless noted):
# 7B model: RTX 4090 (24GB)
# 13B model: A100 40GB or 2× RTX 4090
# 70B model: 2× A100 80GB or 2× H100 (or one 80GB GPU with int4 quantization)
# 100B+ model: 4× H100 or 8× A100

# For cost-effective inference:
# Use A100 80GB ($3-4/hour on cloud)
# Serve multiple customers per GPU
# Quantize to fit more models

Avoid over-provisioning GPUs. An A100 40GB comfortably serves 13B-class models in fp16; for 30B+ models, move to 80GB cards or quantize.
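The 1.5-2× rule of thumb above can be turned into a quick sizing helper (a sketch; real deployments with a high gpu_memory_utilization can sometimes squeeze tighter than the full headroom factor suggests):

```python
import math

def gpus_needed(model_gb: float, vram_gb: float,
                headroom: float = 1.5) -> int:
    # Headroom covers activations and KV cache, per the rule of thumb.
    return math.ceil(model_gb * headroom / vram_gb)

print(gpus_needed(14, 24))   # 7B fp16 on an RTX 4090 -> 1
print(gpus_needed(26, 40))   # 13B fp16 on an A100 40GB -> 1
print(gpus_needed(140, 80))  # 70B fp16 on A100 80GBs -> 3 with full headroom
```

Treat the output as a starting point and validate against measured KV-cache usage for your batch sizes and context lengths.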

Multi-GPU Tensor Parallelism

Split large models across GPUs:

from vllm import LLM, SamplingParams

# Llama 2 70B needs 140GB vRAM
# Split across 2× A100 80GB with tensor parallelism

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
)

# Tensor parallelism shards each layer's weight matrices across GPUs:
# every GPU holds a slice of every layer and they synchronize each step
# (assigning whole layers to different GPUs would be pipeline parallelism)

prompts = [
    "Explain neural networks",
    "What is machine learning?",
] * 50

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Inference spans both GPUs transparently
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Generated: {output.outputs[0].text}")

# Throughput scaling:
# 1× A100: ~50 QPS
# 2× A100 (TP=2): ~90 QPS (not perfect 2×, comm overhead)
# 4× A100 (TP=4): ~150 QPS

Tensor parallelism overhead: ~10-20% per additional GPU. Don't over-shard.
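A toy scaling model fitted to the illustrative numbers above (each added GPU contributing at roughly 90% efficiency) reproduces them; the function and its efficiency constant are ours, not vLLM's:

```python
def tp_throughput(base_qps: float, n_gpus: int,
                  efficiency: float = 0.9) -> float:
    # Communication overhead compounds with each additional GPU.
    return base_qps * n_gpus * efficiency ** (n_gpus - 1)

print(round(tp_throughput(50, 1)))  # 50
print(round(tp_throughput(50, 2)))  # 90
print(round(tp_throughput(50, 4)))  # 146, close to the ~150 above
```

The compounding term is why over-sharding backfires: past a point, extra GPUs add more synchronization than compute.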

OpenAI-Compatible API Endpoint

vLLM's API is drop-in compatible with OpenAI:

# Start the vLLM server
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-2-70b-hf \
#   --tensor-parallel-size 2 \
#   --gpu-memory-utilization 0.9 \
#   --port 8000

from openai import OpenAI

# Point the OpenAI client at the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't check the key by default
)

def chat_completion(
    prompt: str,
    temperature: float = 0.7,
    max_tokens: int = 512,
) -> str:
    completion = client.chat.completions.create(
        model="meta-llama/Llama-2-70b-hf",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content

# Drop-in replacement for OpenAI API
response = chat_completion("What is semantic search?")
print(response)

# Rough cost comparison (illustrative):
# OpenAI GPT-4: $0.03 per 1K input tokens
# Self-hosted Llama 2 70B: ~$0.001 per 1K tokens (H100 at $3/hour,
#   assuming ~1K tokens/sec sustained throughput)
# Savings: ~30× at scale

vLLM's API compatibility means zero code changes. Switch providers easily.
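Since only the endpoint and model name differ between backends, switching can live entirely in configuration. A hypothetical helper (names and values are illustrative):

```python
def provider_config(provider: str) -> dict:
    """Return client settings for a given backend (illustrative values)."""
    configs = {
        "vllm": {
            "base_url": "http://localhost:8000/v1",
            "api_key": "not-needed",  # vLLM ignores the key by default
            "model": "meta-llama/Llama-2-70b-hf",
        },
        "openai": {
            "base_url": "https://api.openai.com/v1",
            "api_key": "OPENAI_API_KEY",  # read from the environment in real code
            "model": "gpt-4",
        },
    }
    return configs[provider]

print(provider_config("vllm")["base_url"])  # http://localhost:8000/v1
```

Feeding these values into the OpenAI client constructor is all that's needed to flip between self-hosted and hosted inference.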

Autoscaling With Kubernetes

Deploy and scale vLLM on K8s:

# k8s deployment with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - meta-llama/Llama-2-70b-hf
          - --tensor-parallel-size
          - "2"
          - --gpu-memory-utilization
          - "0.9"
          - --port
          - "8000"
        ports:
        - containerPort: 8000
        resources:
          requests:
            nvidia.com/gpu: 2  # Request 2 GPUs
          limits:
            nvidia.com/gpu: 2
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: LoadBalancer
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama2-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Autoscaling with K8s enables:

  • Scaling up during high traffic
  • Scaling down for cost savings
  • Running multiple model replicas

Note that CPU and memory utilization are only rough proxies for GPU load; production setups often scale on custom metrics such as request queue depth or GPU utilization (e.g. exported via NVIDIA DCGM).
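For capacity planning, the HPA bounds above can be mirrored in a small calculator (per-replica QPS should come from benchmarking your own hardware; the numbers here are illustrative):

```python
import math

def replicas_needed(peak_qps: float, qps_per_replica: float,
                    min_replicas: int = 2, max_replicas: int = 10) -> int:
    # Clamp to the same bounds as the HPA manifest above.
    wanted = math.ceil(peak_qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

print(replicas_needed(400, 90))   # 5
print(replicas_needed(50, 90))    # 2 (clamped up to minReplicas)
print(replicas_needed(2000, 90))  # 10 (clamped down to maxReplicas)
```

If peak demand routinely hits maxReplicas, raise the ceiling or add a second GPU node pool before latency degrades.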

Cost Comparison: Self-Hosted vs APIs

Break down costs:

# Inference cost analysis

class InferenceCostCalculator:
    def __init__(self, daily_tokens: int):
        self.daily_tokens = daily_tokens
        self.monthly_tokens = daily_tokens * 30

    def openai_gpt4_cost(self) -> float:
        # $0.03 per 1K input tokens
        return (self.monthly_tokens / 1000) * 0.03

    def openai_gpt35_cost(self) -> float:
        # $0.0005 per 1K input tokens
        return (self.monthly_tokens / 1000) * 0.0005

    def self_hosted_cost(self, hourly_rate: float = 3.0) -> float:
        # Assuming model runs 24/7 on GPU
        # H100 on cloud: $3/hour
        monthly_hours = 730
        return hourly_rate * monthly_hours

    def self_hosted_with_autoscaling(
        self, hourly_rate: float = 3.0, utilization: float = 0.3
    ) -> float:
        # Only pay for GPU when in use
        # 30% average utilization
        monthly_hours = 730
        return hourly_rate * monthly_hours * utilization

# Example: 100M tokens per day
calculator = InferenceCostCalculator(daily_tokens=100_000_000)

print(f"OpenAI GPT-4: ${calculator.openai_gpt4_cost():,.2f}/month")
print(f"OpenAI GPT-3.5: ${calculator.openai_gpt35_cost():,.2f}/month")
print(f"Self-hosted (24/7): ${calculator.self_hosted_cost():,.2f}/month")
print(f"Self-hosted (30% util): ${calculator.self_hosted_with_autoscaling():,.2f}/month")

# Output:
# OpenAI GPT-4: $90,000.00/month
# OpenAI GPT-3.5: $1,500.00/month
# Self-hosted (24/7): $2,190.00/month
# Self-hosted (30% util): $657.00/month

# The break-even point depends on which API you compare against:
# self-hosting beats GPT-4 pricing at a few million tokens/day,
# but beats GPT-3.5 pricing only at far higher volumes

Autoscaling makes self-hosting economical at even lower volumes.
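Using the illustrative prices above, the break-even volume follows from a one-line formula; note how strongly it depends on which API model you compare against:

```python
def break_even_tokens_per_day(monthly_gpu_cost: float,
                              api_price_per_1k: float) -> float:
    # Daily token volume at which GPU rental equals API spend.
    return monthly_gpu_cost / api_price_per_1k * 1000 / 30

print(f"{break_even_tokens_per_day(2190, 0.03):,.0f}")    # vs GPT-4: ~2.4M/day
print(f"{break_even_tokens_per_day(2190, 0.0005):,.0f}")  # vs GPT-3.5: ~146M/day
```

The two orders of magnitude between those thresholds explain why "should we self-host?" has no single answer: it depends on which hosted model your workload actually needs.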

When Self-Hosting Beats APIs

Self-host if:

  • Volume: >50M tokens/day (cost savings >50%)
  • Latency: <50ms requirement (API adds roundtrip)
  • Data privacy: PII, medical, legal docs
  • Customization: Fine-tuned models, custom system prompts
  • Control: Need reproducibility, audit logs

Don't self-host if:

  • Volume: <10M tokens/day (cost savings minimal)
  • DevOps budget: Limited infrastructure expertise
  • Flexibility: Need latest GPT-4 model immediately
  • Reliability: Can't maintain uptime SLA

Checklist

  • Benchmark throughput on your hardware (QPS, latency)
  • Select quantization strategy (int4, int8, or full precision)
  • Size GPU cluster for peak load
  • Set up vLLM with OpenAI API compatibility
  • Implement request queuing and timeout handling
  • Monitor GPU utilization and throughput
  • Test failover across GPU instances
  • Calculate ROI: API costs vs self-hosted
  • Plan model update strategy
  • Set up cost monitoring and alerts

Conclusion

vLLM is the de facto production standard for self-hosted LLMs. PagedAttention and continuous batching unlock high throughput. Quantization reduces cost. Multi-GPU tensor parallelism handles large models. The OpenAI-compatible API enables near-zero-friction adoption. At high volume, self-hosting saves significantly (the exact break-even depends on which API model you compare against); at lower volumes, APIs remain cost-effective. Choose based on volume, latency, and data privacy needs.