Running Open-Source LLMs in Production — Llama 3, Mistral, and Qwen on Your Own Infrastructure

Introduction

Paying OpenAI for every API call adds up, and at scale the cost becomes prohibitive. A 100K request/month workload runs to thousands of dollars on GPT-4 but only hundreds on self-hosted Llama 3.

The question isn't "can I run LLMs myself?" It's "should I?" The answer depends on your scale, privacy requirements, and customization needs. Let's break it down.

When Self-Hosting LLMs Beats Paying OpenAI

Privacy and Compliance: Your data stays on your infrastructure. HIPAA, PCI-DSS, and GDPR become easier. No third-party access.

Cost at Scale: At 1M requests/month, API costs are $1000-5000. Self-hosted costs are roughly $500-2000 (infrastructure + people). The math flips at scale.

Customization: Fine-tune on your domain data. Adjust system prompts. A/B test models easily.

Latency: Direct inference is faster than API calls. Relevant for real-time features.

Control: No rate limits, no provider changes, no surprise price increases.

The trade-off: operational complexity. You manage the infrastructure, monitoring, and updates.
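The cost math above can be sketched as a quick break-even calculation. All rates here (the GPU hourly price, a $500/month ops allowance, the per-request API price) are illustrative assumptions, not quotes:

```python
import math

def monthly_api_cost(requests: int, price_per_request: float) -> float:
    # API spend grows linearly with volume.
    return requests * price_per_request

def monthly_selfhost_cost(gpu_hourly: float, hours: float = 730.0,
                          ops_overhead: float = 500.0) -> float:
    # Self-hosted spend is mostly fixed: GPU rental plus an ops allowance.
    return gpu_hourly * hours + ops_overhead

def breakeven_requests(price_per_request: float, gpu_hourly: float) -> int:
    # Monthly volume at which self-hosting becomes cheaper than the API.
    return math.ceil(monthly_selfhost_cost(gpu_hourly) / price_per_request)
```

With GPT-4-class pricing of $0.03/request and a $2/hour GPU, break-even lands near 65K requests/month; above that volume, the mostly fixed self-hosted bill wins.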

Ollama for Development and Small-Scale

Ollama is the right starting point. It's a single binary that downloads and runs LLMs locally.

Setup:

ollama pull llama3
ollama serve

Use it:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
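The same call from Python, as a minimal standard-library sketch. It assumes an Ollama server on the default port 11434 and a model you have already pulled (here "llama3"):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    # Same JSON body as the curl example: model, prompt, and streaming toggle.
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(prompt: str, model: str = "llama3",
             host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the full completion in "response".
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Why is the sky blue?"))
```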

Pros:

  • Single command setup
  • No Docker complexity
  • Works on Mac, Linux, Windows
  • Supports multiple models concurrently
  • Streaming responses
  • Good for development and small-scale (<100 req/s)

Cons:

  • Not optimized for production scale
  • No batching or request queuing
  • Limited observability
  • No built-in high availability: if the process dies, serving stops

Use Ollama for local development, prototyping, and testing. Move to vLLM for production.

vLLM for Production Scale

vLLM is the production-grade inference engine. It's what companies like Replicate, Anyscale, and Together use.

Key features:

  • PagedAttention (near-elimination of KV-cache memory waste)
  • Continuous batching (higher throughput)
  • Multi-GPU support
  • Tensor parallelism (split models across GPUs)
  • Comprehensive monitoring
  • Quantization support

Setup (roughly):

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-AWQ \
  --tensor-parallel-size 4 \
  --quantization awq

Note: the --quantization awq flag expects a checkpoint that already ships AWQ-quantized weights, hence the AWQ model repo.

Performance: with 4x A100 GPUs, vLLM can sustain on the order of 1,000 req/s per node for short sequences; actual throughput depends heavily on prompt and output lengths.
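Because the server speaks the OpenAI REST format, any OpenAI-compatible client can hit it. A standard-library sketch, assuming vLLM's default port 8000; the model string must match whatever was passed to --model at launch:

```python
import json
import urllib.request

def build_completion_payload(model: str, prompt: str, max_tokens: int = 128) -> bytes:
    # OpenAI-style /v1/completions request body.
    return json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()

def complete(prompt: str, model: str, host: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=build_completion_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # First choice carries the generated text in OpenAI's response shape.
        return json.loads(resp.read())["choices"][0]["text"]
```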

Pros:

  • Production-ready
  • High throughput
  • Low latency
  • Excellent observability
  • Scales out cleanly by adding replicas behind a load balancer

Cons:

  • Higher operational burden
  • Requires GPU infrastructure
  • Monitoring and alerts needed

Hardware Requirements by Model Size

Choose hardware based on model size and throughput needs.

7B Models (Llama 3 8B, Mistral 7B):

  • Development: Single GPU (RTX 4090 or A100 40GB)
  • Production: Single A100 40GB or H100, handles up to ~100 req/s
  • Cost: $0.30-0.50 per hour on cloud

13B Models (Llama 2 13B, Qwen 14B):

  • Development: Single A100 40GB
  • Production: 2x A100 40GB or single H100, handles up to ~200 req/s
  • Cost: $0.60-1.00 per hour

70B Models (Llama 3 70B, Qwen 72B):

  • Development: Single A100 80GB or H100
  • Production: 2x H100 or 8x A100 40GB, handles up to ~500 req/s
  • Cost: $2-4 per hour

Rule of thumb: a single modern GPU is enough to serve a 7B-class model for inference; 70B-class models need multiple GPUs, and production deployments typically run 2-4 GPUs per node.
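The sizing guidance reduces to arithmetic: FP16 weights cost about 2 bytes per parameter, plus headroom for KV cache and activations. The 1.1x overhead factor below is an assumption; treat the result as a floor for fitting the model, not a throughput plan:

```python
import math

def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    # 1B parameters at N bytes each is N GB of weight storage.
    return params_billion * bytes_per_param

def min_gpus(params_billion: float, gpu_vram_gb: float,
             overhead: float = 1.1) -> int:
    # Minimum GPUs just to hold weights plus an assumed KV-cache allowance.
    return math.ceil(weights_gb(params_billion) * overhead / gpu_vram_gb)
```

This reproduces the tiers above: a 7B model at FP16 fits on one 40GB card, while a 70B model needs at least 2x 80GB GPUs.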

Quantization Trade-offs

Quantization reduces model size and memory. The trade-off: slightly lower quality.

Full Precision (FP32): 28GB for 7B model. Highest quality. Slowest.

Half Precision (FP16): 14GB for 7B model. Excellent quality. Standard for inference.

8-bit Quantization: 7GB for 7B model. Quality loss is minimal. 20% faster.

4-bit Quantization (Q4_K_M): 3.5GB for 7B model. Quality loss is small. 50% faster. Recommended for production.

Recommendation: Q4_K_M for most use cases. Measure quality on your domain data. If acceptable, use it. If not, go FP16.

ollama run llama3:8b-instruct-q4_K_M
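The size figures in this section are just bytes-per-parameter arithmetic, which a few lines make explicit:

```python
# Bytes of weight storage per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def model_size_gb(params_billion: float, precision: str) -> float:
    # 1B parameters * bytes/param = GB of weights (excludes KV cache).
    return params_billion * BYTES_PER_PARAM[precision]
```

For a 7B model this gives 28GB (FP32), 14GB (FP16), 7GB (8-bit), and 3.5GB (4-bit), matching the numbers above.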

Serving Multiple Models on Shared Hardware

You can run multiple models on the same hardware using model switching.

Architecture:

  • Load Model A into VRAM
  • When Model B is requested, evict Model A from VRAM (its weights stay on disk)
  • Load Model B
  • When Model A is requested again, swap it back in from disk

Cost: Swap latency is high (seconds) but amortized. Use for low-frequency model switching.

Better approach: Dedicated GPUs per model if you have heterogeneous workloads.
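The swap policy above is essentially an LRU cache over VRAM. A minimal sketch; ModelCache and load_fn are illustrative names, with load_fn standing in for reading weights from disk:

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models resident; evict the least recently used."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # stand-in for a disk-to-VRAM load
        self._resident = OrderedDict()  # insertion order tracks recency

    def get(self, name: str):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # evict least recently used
        self._resident[name] = self.load_fn(name)
        return self._resident[name]
```

The expensive part is load_fn (seconds of swap latency), which is why this only pays off when model switches are infrequent.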

Load Balancing Across GPU Nodes

At scale, run multiple inference servers.

Architecture:

  • Load balancer (nginx, HAProxy)
  • 3-5 inference nodes (each with GPUs)
  • Shared model cache (optional)
  • Monitoring (CPU and GPU utilization, latency)

Request routing:

  • Round-robin for simple cases
  • Least-loaded (monitor queue depth)
  • By model type (if different models on different nodes)
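The first two routing strategies fit in a few lines; node names and queue depths here are illustrative, with depths assumed to come from your monitoring:

```python
from itertools import cycle

def round_robin(nodes: list[str]):
    # Infinite rotation over the node list; fine for homogeneous nodes.
    return cycle(nodes)

def pick_least_loaded(queue_depths: dict[str, int]) -> str:
    # Route to whichever node currently has the shallowest request queue.
    return min(queue_depths, key=queue_depths.get)
```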

Model Warm-up and Cold Start

Cold start is the delay on the first request after a model loads: the request waits while weights stream from disk into VRAM and runtime caches initialize.

Typical cold start times:

  • Ollama: 5-10s
  • vLLM: 2-5s
  • After warmup: 50-200ms

Minimize cold start:

  • Pre-load models on startup
  • Keep models in memory between requests
  • Use model caching
  • Load-test before going to production
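Pre-loading can be as simple as sending a few throwaway prompts right after startup. A sketch against Ollama's /api/generate endpoint (adapt the route for vLLM's OpenAI-style API); the model name is an assumption:

```python
import json
import urllib.request

def warmup_payloads(model: str, n: int = 3) -> list[bytes]:
    # A handful of tiny dummy requests is enough to absorb the cold start.
    return [json.dumps({"model": model, "prompt": "ping",
                        "stream": False}).encode() for _ in range(n)]

def warm_up(model: str = "llama3", host: str = "http://localhost:11434") -> None:
    for payload in warmup_payloads(model):
        req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
```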

Benchmarking Self-Hosted vs API

Compare cost, latency, and quality.

Scenario: 100K requests/month

OpenAI GPT-3.5:

  • Cost: $0.0005 per request, about $50/month
  • Latency: 500-1000ms
  • Quality: High

OpenAI GPT-4:

  • Cost: $0.03 per request, about $3,000/month
  • Latency: 1-2s
  • Quality: Very high

Llama 3 70B (Self-Hosted on H100):

  • Cost: $2/hour, about $1,500/month running 24/7
  • Latency: 100-300ms
  • Quality: High

Llama 3 8B (Self-Hosted on A100):

  • Cost: $0.30/hour, about $220/month running 24/7
  • Latency: 50-150ms
  • Quality: Good

At 100K requests/month, the self-hosted 70B lands at roughly half of GPT-4's bill, and the gap widens with volume: self-hosted cost is mostly fixed, while API spend grows linearly with requests.

Monitoring and Observability

Track these metrics:

  • Throughput: requests/second per model
  • Latency: p50, p95, p99
  • GPU utilization: around 80% is a healthy target
  • VRAM usage: stay below ~90% to leave headroom
  • Queue depth: monitor for bottlenecks
  • Token throughput: tokens/second (more relevant than requests)
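Latency percentiles are easy to compute from a sliding window of samples; a standard-library sketch (the millisecond values are illustrative):

```python
import statistics

def latency_percentiles(samples: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 99 percentile cut points in order,
    # so index 49 is p50, 94 is p95, and 98 is p99.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```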

Use Prometheus + Grafana for dashboards. vLLM exports metrics automatically.

Checklist

  • Determine model size (7B, 13B, 70B based on latency/quality)
  • Calculate hardware cost vs API cost
  • Benchmark on your domain data
  • Test with Ollama locally
  • Set up vLLM for production
  • Configure quantization (recommend Q4_K_M)
  • Load balance across multiple nodes
  • Monitor GPU utilization and latency
  • Set up cold start mitigation
  • Document runbooks for model updates

Conclusion

Self-hosting is practical at scale. Start with Ollama to understand the trade-offs. Move to vLLM for production. Benchmark cost vs API before committing. The math strongly favors self-hosting from roughly 100K requests/month upward, and the advantage grows from there.