Running Open-Source LLMs in Production — Llama 3, Mistral, and Qwen on Your Own Infrastructure

Introduction

Paying OpenAI for every API call adds up, and at scale the cost becomes prohibitive. A 100K request/month workload runs to thousands of dollars on GPT-4 but only hundreds on self-hosted Llama 3.

The question isn't "can I run LLMs myself?" It's "should I?" The answer depends on your scale, privacy requirements, and customization needs. Let's break it down.

When Self-Hosting LLMs Beats Paying OpenAI

Privacy and Compliance: Your data stays on your infrastructure. HIPAA, PCI-DSS, and GDPR become easier. No third-party access.

Cost at Scale: At 1M requests/month, API costs are $1000-5000. Self-hosted costs are roughly $500-2000 (infrastructure + people). The math flips at scale.

Customization: Fine-tune on your domain data. Adjust system prompts. A/B test models easily.

Latency: Direct inference is faster than API calls. Relevant for real-time features.

Control: No rate limits, no provider changes, no surprise price increases.

The trade-off: operational complexity. You manage the infrastructure, monitoring, and updates.
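The cost math above can be sketched as a quick break-even calculation. All rates here (the GPU hourly price, a $500/month ops allowance, the per-request API price) are illustrative assumptions, not quotes:

```python
import math

def monthly_api_cost(requests: int, price_per_request: float) -> float:
    # API spend grows linearly with volume.
    return requests * price_per_request

def monthly_selfhost_cost(gpu_hourly: float, hours: float = 730.0,
                          ops_overhead: float = 500.0) -> float:
    # Self-hosted spend is mostly fixed: GPU rental plus an ops allowance.
    return gpu_hourly * hours + ops_overhead

def breakeven_requests(price_per_request: float, gpu_hourly: float) -> int:
    # Monthly volume at which self-hosting becomes cheaper than the API.
    return math.ceil(monthly_selfhost_cost(gpu_hourly) / price_per_request)
```

With GPT-4-class pricing of $0.03/request and a $2/hour GPU, break-even lands near 65K requests/month; above that volume, the mostly fixed self-hosted bill wins.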

Ollama for Development and Small-Scale

Ollama is the right starting point. It's a single binary that downloads and runs LLMs locally.

Setup:

ollama pull llama3
ollama serve

Use it:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
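The same call from Python, as a minimal standard-library sketch. It assumes an Ollama server on the default port 11434 and a model you have already pulled (here "llama3"):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    # Same JSON body as the curl example: model, prompt, and streaming toggle.
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(prompt: str, model: str = "llama3",
             host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the full completion in "response".
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Why is the sky blue?"))
```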

Pros:

  • Single command setup
  • No Docker complexity
  • Works on Mac, Linux, Windows
  • Supports multiple models concurrently
  • Streaming responses
  • Good for development and small-scale (<100 req/s)

Cons:

  • Not optimized for production scale
  • No batching or request queuing
  • Limited observability
  • No built-in high availability: if the process dies, serving stops

Use Ollama for local development, prototyping, and testing. Move to vLLM for production.

vLLM for Production Scale

vLLM is the production-grade inference engine. It's what companies like Replicate, Anyscale, and Together use.

Key features:

  • PagedAttention (near-elimination of KV-cache memory waste)
  • Continuous batching (higher throughput)
  • Multi-GPU support
  • Tensor parallelism (split models across GPUs)
  • Comprehensive monitoring
  • Quantization support

Setup (roughly):

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-AWQ \
  --tensor-parallel-size 4 \
  --quantization awq

Note: the --quantization awq flag expects a checkpoint that already ships AWQ-quantized weights, hence the AWQ model repo.

Performance: with 4x A100 GPUs, vLLM can sustain on the order of 1,000 req/s per node for short sequences; actual throughput depends heavily on prompt and output lengths.
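Because the server speaks the OpenAI REST format, any OpenAI-compatible client can hit it. A standard-library sketch, assuming vLLM's default port 8000; the model string must match whatever was passed to --model at launch:

```python
import json
import urllib.request

def build_completion_payload(model: str, prompt: str, max_tokens: int = 128) -> bytes:
    # OpenAI-style /v1/completions request body.
    return json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()

def complete(prompt: str, model: str, host: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=build_completion_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # First choice carries the generated text in OpenAI's response shape.
        return json.loads(resp.read())["choices"][0]["text"]
```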

Pros:

  • Production-ready
  • High throughput
  • Low latency
  • Excellent observability
  • Scales out cleanly by adding replicas behind a load balancer

Cons:

  • Higher operational burden
  • Requires GPU infrastructure
  • Monitoring and alerts needed

Hardware Requirements by Model Size

Choose hardware based on model size and throughput needs.

7B Models (Llama 3 8B, Mistral 7B):

  • Development: Single GPU (RTX 4090 or A100 40GB)
  • Production: Single A100 40GB or H100, handles up to ~100 req/s
  • Cost: $0.30-0.50 per hour on cloud

13B Models (Llama 2 13B, Qwen 14B):

  • Development: Single A100 40GB
  • Production: 2x A100 40GB or single H100, handles up to ~200 req/s
  • Cost: $0.60-1.00 per hour

70B Models (Llama 3 70B, Qwen 72B):

  • Development: Single A100 80GB or H100
  • Production: 2x H100 or 8x A100 40GB, handles up to ~500 req/s
  • Cost: $2-4 per hour

Rule of thumb: a single modern GPU is enough to serve a 7B-class model for inference; 70B-class models need multiple GPUs, and production deployments typically run 2-4 GPUs per node.
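The sizing guidance reduces to arithmetic: FP16 weights cost about 2 bytes per parameter, plus headroom for KV cache and activations. The 1.1x overhead factor below is an assumption; treat the result as a floor for fitting the model, not a throughput plan:

```python
import math

def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    # 1B parameters at N bytes each is N GB of weight storage.
    return params_billion * bytes_per_param

def min_gpus(params_billion: float, gpu_vram_gb: float,
             overhead: float = 1.1) -> int:
    # Minimum GPUs just to hold weights plus an assumed KV-cache allowance.
    return math.ceil(weights_gb(params_billion) * overhead / gpu_vram_gb)
```

This reproduces the tiers above: a 7B model at FP16 fits on one 40GB card, while a 70B model needs at least 2x 80GB GPUs.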

Quantization Trade-offs

Quantization reduces model size and memory. The trade-off: slightly lower quality.

Full Precision (FP32): 28GB for 7B model. Highest quality. Slowest.

Half Precision (FP16): 14GB for 7B model. Excellent quality. Standard for inference.

8-bit Quantization: 7GB for 7B model. Quality loss is minimal. 20% faster.

4-bit Quantization (Q4_K_M): 3.5GB for 7B model. Quality loss is small. 50% faster. Recommended for production.

Recommendation: Q4_K_M for most use cases. Measure quality on your domain data. If acceptable, use it. If not, go FP16.

ollama run llama3:8b-instruct-q4_K_M
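The size figures in this section are just bytes-per-parameter arithmetic, which a few lines make explicit:

```python
# Bytes of weight storage per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def model_size_gb(params_billion: float, precision: str) -> float:
    # 1B parameters * bytes/param = GB of weights (excludes KV cache).
    return params_billion * BYTES_PER_PARAM[precision]
```

For a 7B model this gives 28GB (FP32), 14GB (FP16), 7GB (8-bit), and 3.5GB (4-bit), matching the numbers above.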

Serving Multiple Models on Shared Hardware

You can run multiple models on the same hardware using model switching.

Architecture:

  • Load Model A into VRAM
  • When Model B is requested, evict Model A from VRAM (its weights stay on disk)
  • Load Model B
  • When Model A is requested again, swap it back in from disk

Cost: Swap latency is high (seconds) but amortized. Use for low-frequency model switching.

Better approach: Dedicated GPUs per model if you have heterogeneous workloads.
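The swap policy above is essentially an LRU cache over VRAM. A minimal sketch; ModelCache and load_fn are illustrative names, with load_fn standing in for reading weights from disk:

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models resident; evict the least recently used."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # stand-in for a disk-to-VRAM load
        self._resident = OrderedDict()  # insertion order tracks recency

    def get(self, name: str):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # evict least recently used
        self._resident[name] = self.load_fn(name)
        return self._resident[name]
```

The expensive part is load_fn (seconds of swap latency), which is why this only pays off when model switches are infrequent.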

Load Balancing Across GPU Nodes

At scale, run multiple inference servers.

Architecture:

  • Load balancer (nginx, HAProxy)
  • 3-5 inference nodes (each with GPUs)
  • Shared model cache (optional)
  • Monitoring (CPU and GPU utilization, latency)

Request routing:

  • Round-robin for simple cases
  • Least-loaded (monitor queue depth)
  • By model type (if different models on different nodes)
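The first two routing strategies fit in a few lines; node names and queue depths here are illustrative, with depths assumed to come from your monitoring:

```python
from itertools import cycle

def round_robin(nodes: list[str]):
    # Infinite rotation over the node list; fine for homogeneous nodes.
    return cycle(nodes)

def pick_least_loaded(queue_depths: dict[str, int]) -> str:
    # Route to whichever node currently has the shallowest request queue.
    return min(queue_depths, key=queue_depths.get)
```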

Model Warm-up and Cold Start

Cold start is the delay on the first request after a model loads: the request waits while weights stream from disk into VRAM and runtime caches initialize.

Typical cold start times:

  • Ollama: 5-10s
  • vLLM: 2-5s
  • After warmup: 50-200ms

Minimize cold start:

  • Pre-load models on startup
  • Keep models in memory between requests
  • Use model caching
  • Load-test before going to production
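Pre-loading can be as simple as sending a few throwaway prompts right after startup. A sketch against Ollama's /api/generate endpoint (adapt the route for vLLM's OpenAI-style API); the model name is an assumption:

```python
import json
import urllib.request

def warmup_payloads(model: str, n: int = 3) -> list[bytes]:
    # A handful of tiny dummy requests is enough to absorb the cold start.
    return [json.dumps({"model": model, "prompt": "ping",
                        "stream": False}).encode() for _ in range(n)]

def warm_up(model: str = "llama3", host: str = "http://localhost:11434") -> None:
    for payload in warmup_payloads(model):
        req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
```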

Benchmarking Self-Hosted vs API

Compare cost, latency, and quality.

Scenario: 100K requests/month

OpenAI GPT-3.5:

  • Cost: $0.0005 per request, about $50/month
  • Latency: 500-1000ms
  • Quality: High

OpenAI GPT-4:

  • Cost: $0.03 per request, about $3,000/month
  • Latency: 1-2s
  • Quality: Very high

Llama 3 70B (Self-Hosted on H100):

  • Cost: $2/hour, about $1,500/month running 24/7
  • Latency: 100-300ms
  • Quality: High

Llama 3 8B (Self-Hosted on A100):

  • Cost: $0.30/hour, about $220/month running 24/7
  • Latency: 50-150ms
  • Quality: Good

At 100K requests/month, the self-hosted 70B lands at roughly half of GPT-4's bill, and the gap widens with volume: self-hosted cost is mostly fixed, while API spend grows linearly with requests.

Monitoring and Observability

Track these metrics:

  • Throughput: requests/second per model
  • Latency: p50, p95, p99
  • GPU utilization: around 80% is a healthy target
  • VRAM usage: stay below ~90% to leave headroom
  • Queue depth: monitor for bottlenecks
  • Token throughput: tokens/second (more relevant than requests)
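Latency percentiles are easy to compute from a sliding window of samples; a standard-library sketch (the millisecond values are illustrative):

```python
import statistics

def latency_percentiles(samples: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 99 percentile cut points in order,
    # so index 49 is p50, 94 is p95, and 98 is p99.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```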

Use Prometheus + Grafana for dashboards. vLLM exports metrics automatically.

Checklist

  • Determine model size (7B, 13B, 70B based on latency/quality)
  • Calculate hardware cost vs API cost
  • Benchmark on your domain data
  • Test with Ollama locally
  • Set up vLLM for production
  • Configure quantization (recommend Q4_K_M)
  • Load balance across multiple nodes
  • Monitor GPU utilization and latency
  • Set up cold start mitigation
  • Document runbooks for model updates

Conclusion

Self-hosting is practical at scale. Start with Ollama to understand the trade-offs. Move to vLLM for production. Benchmark cost vs API before committing. The math strongly favors self-hosting from roughly 100K requests/month upward, and the advantage grows from there.