Running Open-Source LLMs in Production — Llama 3, Mistral, and Qwen on Your Own Infrastructure
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Paying OpenAI for every API call adds up, and at scale the cost becomes prohibitive. A 100K request/month workload on a GPT-4-class API runs into thousands of dollars a month; the same traffic on self-hosted Llama 3 runs into the hundreds.
The question isn't "can I run LLMs myself?" It's "should I?" The answer depends on your scale, privacy requirements, and customization needs. Let's break it down.
- When Self-Hosting LLMs Beats Paying OpenAI
- Ollama for Development and Small-Scale
- vLLM for Production Scale
- Hardware Requirements by Model Size
- Quantization Trade-offs
- Serving Multiple Models with a Single vLLM Instance
- Load Balancing Across GPU Nodes
- Model Warm-up and Cold Start
- Benchmarking Self-Hosted vs API
- Monitoring and Observability
- Checklist
- Conclusion
When Self-Hosting LLMs Beats Paying OpenAI
Privacy and Compliance: Your data stays on your infrastructure. HIPAA, PCI-DSS, and GDPR become easier. No third-party access.
Cost at Scale: At 1M requests/month, API costs are $1000-5000. Self-hosted costs are roughly $500-2000 (infrastructure + people). The math flips at scale.
Customization: Fine-tune on your domain data. Adjust system prompts. A/B test models easily.
Latency: Direct inference is faster than API calls. Relevant for real-time features.
Control: No rate limits, no provider changes, no surprise price increases.
The trade-off: operational complexity. You manage the infrastructure, monitoring, and updates.
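A quick way to sanity-check "the math flips at scale" is a break-even sketch. The prices below are illustrative assumptions, not quotes: adjust them to your actual API tier and GPU rate.

```python
# Break-even sketch: at what monthly volume does a dedicated GPU undercut
# per-request API pricing? All numbers here are illustrative assumptions.

HOURS_PER_MONTH = 730

def api_cost(requests_per_month: int, price_per_request: float) -> float:
    """API billing scales linearly with volume."""
    return requests_per_month * price_per_request

def self_hosted_cost(gpu_hourly_rate: float, ops_overhead: float = 1.3) -> float:
    """A dedicated GPU costs the same whether it serves 1 or 1M requests;
    ops_overhead pads for monitoring, storage, and engineering time."""
    return gpu_hourly_rate * HOURS_PER_MONTH * ops_overhead

def break_even_requests(price_per_request: float, gpu_hourly_rate: float) -> int:
    """Monthly volume above which self-hosting is the cheaper option."""
    return int(self_hosted_cost(gpu_hourly_rate) / price_per_request)

# GPT-3.5-class pricing vs a single A100 at $0.40/hr:
print(break_even_requests(price_per_request=0.0005, gpu_hourly_rate=0.40))
```

Against a cheap API tier the break-even sits in the hundreds of thousands of requests per month; against GPT-4-class pricing it arrives far earlier.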
Ollama for Development and Small-Scale
Ollama is the right starting point. It's a single binary that downloads and runs LLMs locally.
Setup:
ollama pull llama3
ollama serve
Use it:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}'
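The curl call above sets "stream": false. With streaming enabled, Ollama returns one JSON object per line; a small helper can stitch the chunks back into the full response. This is pure parsing, so it works on captured output too:

```python
# Reassemble an Ollama streamed response. Each streamed line is a JSON
# object with a partial "response" string and a "done" flag.

import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' field of each streamed chunk."""
    text = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

chunks = ['{"response": "The sky ", "done": false}',
          '{"response": "is blue.", "done": true}']
print(join_stream(chunks))  # → The sky is blue.
```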
Pros:
- Single command setup
- No Docker complexity
- Works on Mac, Linux, Windows
- Supports multiple models concurrently
- Streaming responses
- Good for development and small-scale (<100 req/s)
Cons:
- Not optimized for production scale
- No batching or request queuing
- Limited observability
- Stops when the process stops
Use Ollama for local development, prototyping, and testing. Move to vLLM for production.
vLLM for Production Scale
vLLM is the production-grade inference engine. It's what companies like Replicate, Anyscale, and Together use.
Key features:
- PagedAttention (dramatically reduces KV-cache memory waste)
- Continuous batching (higher throughput)
- Multi-GPU support
- Tensor parallelism (split models across GPUs)
- Comprehensive monitoring
- Quantization support
Setup (roughly):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tensor-parallel-size 4
(--quantization awq expects an AWQ-quantized checkpoint, not full-precision weights.)
Performance: on 4x A100 GPUs, vLLM comfortably sustains hundreds of requests per second, though real throughput depends heavily on prompt and output lengths.
Pros:
- Production-ready
- High throughput
- Low latency
- Excellent observability
- Scales well across GPUs and nodes
Cons:
- Higher operational burden
- Requires GPU infrastructure
- Monitoring and alerts needed
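The server launched above speaks an OpenAI-compatible REST API on port 8000, so a plain HTTP POST works without any SDK. A minimal stdlib-only client sketch; the default model name here is an assumption and must match whatever --model was passed at launch:

```python
# Minimal client for a vLLM OpenAI-compatible server. No SDK required;
# the endpoint and model name below are assumptions for a local deployment.

import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"

def completion_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Request body in the OpenAI completions format vLLM accepts."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt: str,
             model: str = "meta-llama/Meta-Llama-3-8B-Instruct") -> str:
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(completion_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Why is the sky blue?"))
```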
Hardware Requirements by Model Size
Choose hardware based on model size and throughput needs.
7B-8B Models (Llama 3 8B, Mistral 7B):
- Development: single GPU (RTX 4090 or A100 40GB)
- Production: single A100 40GB or H100, handles up to ~100 req/s
- Cost: $0.30-0.50 per hour on cloud
13B Models (Llama 2 13B, Qwen 14B):
- Development: single A100 40GB
- Production: 2x A100 40GB or a single H100, handles up to ~200 req/s
- Cost: $0.60-1.00 per hour
70B Models (Llama 3 70B, Qwen 72B):
- Development: single A100 80GB or H100
- Production: 2x H100 or 8x A100 40GB, handles up to ~500 req/s
- Cost: $2-4 per hour
Rule of thumb: a single 80GB GPU can serve one quantized 70B model for inference; plan on 2-4 GPUs for production traffic.
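The GPU counts above follow from a simple capacity rule: weights need (parameters x bits / 8) GB, and each card should keep a slice of VRAM free for KV cache and activations. The 30% reserve below is an assumption; tighter setups squeeze more onto each card.

```python
# Sketch of the sizing rule of thumb: minimum GPU count to hold a model's
# weights at a given precision, reserving part of each card for KV cache.

import math

def gpus_needed(params_billion: float, bits: int, gpu_vram_gb: int,
                reserve: float = 0.3) -> int:
    """Minimum GPUs to hold the weights, keeping `reserve` of each card
    free for KV cache and activations."""
    weights_gb = params_billion * bits / 8
    usable_gb = gpu_vram_gb * (1 - reserve)
    return math.ceil(weights_gb / usable_gb)

print(gpus_needed(70, 16, 80))  # 70B at FP16 on 80GB cards → 3
print(gpus_needed(70, 4, 80))   # 70B at 4-bit fits on one card → 1
```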
Quantization Trade-offs
Quantization reduces model size and memory. The trade-off: slightly lower quality.
Full Precision (FP32): 28GB for 7B model. Highest quality. Slowest.
Half Precision (FP16): 14GB for 7B model. Excellent quality. Standard for inference.
8-bit Quantization: 7GB for 7B model. Quality loss is minimal. 20% faster.
4-bit Quantization (Q4_K_M): 3.5GB for 7B model. Quality loss is small. 50% faster. Recommended for production.
Recommendation: Q4_K_M for most use cases. Measure quality on your domain data. If acceptable, use it. If not, go FP16.
ollama run llama3:8b-instruct-q4_K_M
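The per-precision sizes quoted above follow directly from bits per parameter; a one-liner reproduces them for a 7B model:

```python
# Raw weight size: parameter count times bytes per parameter.
# Reproduces the 28 / 14 / 7 / 3.5 GB figures for a 7B model.

def weights_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

for bits in (32, 16, 8, 4):
    print(f"7B @ {bits}-bit: {weights_gb(7, bits)} GB")
```

Note this is weights only; KV cache and activations sit on top of it at runtime.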
Serving Multiple Models with a Single vLLM Instance
You can run multiple models on the same hardware using model switching.
Architecture:
- Load Model A into VRAM
- When Model B is requested, evict Model A (save weights)
- Load Model B
- When Model A is requested again, swap in from disk
Cost: Swap latency is high (seconds) but amortized. Use for low-frequency model switching.
Better approach: Dedicated GPUs per model if you have heterogeneous workloads.
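The eviction loop above amounts to a one-slot cache. A conceptual sketch; the loader lambdas are stand-ins for the expensive load-weights-into-VRAM step a real server would perform:

```python
# Conceptual sketch of single-GPU model switching: keep one model resident,
# evict it when a different one is requested. Loaders here are stand-ins.

from typing import Callable, Dict

class ModelSwitcher:
    def __init__(self, loaders: Dict[str, Callable[[], object]]):
        self.loaders = loaders          # model name -> function that loads it
        self.resident_name = None       # which model currently occupies VRAM
        self.resident_model = None
        self.swaps = 0                  # count evictions to expose swap cost

    def get(self, name: str):
        if name != self.resident_name:  # cache miss: evict and reload
            self.resident_model = self.loaders[name]()  # seconds in reality
            self.resident_name = name
            self.swaps += 1
        return self.resident_model

switcher = ModelSwitcher({"llama3": lambda: "llama3-weights",
                          "mistral": lambda: "mistral-weights"})
switcher.get("llama3")
switcher.get("llama3")   # hit: no swap
switcher.get("mistral")  # miss: pays the swap latency
print(switcher.swaps)    # → 2
```

The swap counter makes the trade-off visible: every miss costs seconds, which is why this only suits low-frequency switching.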
Load Balancing Across GPU Nodes
At scale, run multiple inference servers.
Architecture:
- Load balancer (nginx, HAProxy)
- 3-5 inference nodes (each with GPUs)
- Shared model cache (optional)
- Monitoring (CPU and GPU utilization, latency)
Request routing:
- Round-robin for simple cases
- Least-loaded (monitor queue depth)
- By model type (if different models on different nodes)
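With nginx, the least-loaded strategy maps to the least_conn directive. A sketch, where the upstream IPs are placeholders for your inference nodes; buffering is disabled so streamed tokens reach clients as they are generated:

```nginx
upstream vllm_pool {
    least_conn;                      # route to the node with fewest open requests
    server 10.0.0.11:8000 max_fails=2 fail_timeout=30s;
    server 10.0.0.12:8000 max_fails=2 fail_timeout=30s;
    server 10.0.0.13:8000 max_fails=2 fail_timeout=30s;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_pool;
        proxy_read_timeout 300s;     # generations can run long
        proxy_http_version 1.1;      # keep-alive for streamed responses
        proxy_buffering off;         # flush tokens as they arrive
    }
}
```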
Model Warm-up and Cold Start
Cold start is the slow first request after a model loads, while weights are read into VRAM.
Typical cold start times:
- Ollama: 5-10s
- vLLM: 2-5s
- After warmup: 50-200ms
Minimize cold start:
- Pre-load models on startup
- Keep models in memory between requests
- Use model caching
- Load-test before going to production
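The pre-load step can be a tiny script run at startup: fire one throwaway single-token generation so the first real user never pays the cold-start penalty. This sketch assumes a local Ollama endpoint; for vLLM, point it at /v1/completions instead.

```python
# Warm-up sketch: one throwaway generation at startup pages the model
# weights into VRAM. Endpoint and model name assume a local Ollama server.

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def warmup_payload(model: str) -> dict:
    """A tiny deterministic request: we only care that weights get loaded."""
    return {"model": model, "prompt": "ok", "stream": False,
            "options": {"num_predict": 1}}  # generate a single token

def warm_up(model: str = "llama3") -> None:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=60).read()  # blocks until model is hot

if __name__ == "__main__":
    warm_up()
```

Run it from your deployment hook or container entrypoint, before the node is added to the load balancer.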
Benchmarking Self-Hosted vs API
Compare cost, latency, and quality.
Scenario: 100K requests/month
OpenAI GPT-3.5:
- Cost: $0.0005 per request, ~$50/month
- Latency: 500-1000ms
- Quality: High
OpenAI GPT-4:
- Cost: $0.03 per request, ~$3,000/month
- Latency: 1-2s
- Quality: Very high
Llama 3 70B (Self-Hosted on H100):
- Cost: $2/hour, ~$700/month (includes overhead)
- Latency: 100-300ms
- Quality: High
Llama 3 8B (Self-Hosted on A100):
- Cost: $0.30/hour, ~$100/month
- Latency: 50-150ms
- Quality: Good
At 100K requests/month, self-hosted Llama 3 8B costs roughly 3% of GPT-4, and the 70B roughly 25%. Against GPT-3.5's ~$50/month, the case at this volume rests on latency, privacy, and control; the pure cost advantage over cheap API tiers only appears at higher volume.
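A small harness makes the latency comparison concrete: time N calls against any backend and report percentiles. The endpoint argument is whatever function performs one request (an API call, a local inference call); a stand-in workload is used below so the sketch runs anywhere.

```python
# Minimal latency benchmark harness: call an endpoint N times, report
# p50/p95/p99 in milliseconds. 'endpoint' is any zero-argument callable.

import time

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    idx = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

def benchmark(endpoint, n=100):
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        endpoint()
        latencies.append((time.perf_counter() - start) * 1000)  # ms
    latencies.sort()
    return {p: percentile(latencies, p) for p in (50, 95, 99)}

# Example with a stand-in workload instead of a real model call:
stats = benchmark(lambda: sum(range(1000)), n=50)
print(sorted(stats))  # → [50, 95, 99]
```

Run the same harness against your API client and your self-hosted endpoint with identical prompts to get an apples-to-apples p95 comparison.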
Monitoring and Observability
Track these metrics:
- Throughput: requests/second per model
- Latency: p50, p95, p99
- GPU utilization: aim for roughly 70-80%; sustained 100% means requests are queueing
- VRAM usage: keep below ~90% to leave headroom for KV-cache growth
- Queue depth: monitor for bottlenecks
- Token throughput: tokens/second (more relevant than requests)
Use Prometheus + Grafana for dashboards. vLLM exports metrics automatically.
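Since vLLM's server exposes Prometheus metrics on its own /metrics path, scraping is a one-job config. A sketch, with placeholder node addresses:

```yaml
# prometheus.yml: scrape vLLM's built-in metrics endpoint on each node
scrape_configs:
  - job_name: vllm
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["10.0.0.11:8000", "10.0.0.12:8000"]
```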
Checklist
- Determine model size (7B, 13B, 70B based on latency/quality)
- Calculate hardware cost vs API cost
- Benchmark on your domain data
- Test with Ollama locally
- Set up vLLM for production
- Configure quantization (recommend Q4_K_M)
- Load balance across multiple nodes
- Monitor GPU utilization and latency
- Set up cold start mitigation
- Document runbooks for model updates
Conclusion
Self-hosting is practical at scale. Start with Ollama to understand the trade-offs, move to vLLM for production, and benchmark cost against the APIs before committing. The math favors self-hosting from roughly 100K requests/month upward, and the advantage grows from there.