AI Gateway With LiteLLM — Unified Interface for 100+ LLM Providers

Authors

- Sanjeev Sharma (@webcoderspeed1)
Introduction
Managing multiple LLM providers at scale becomes complex quickly. LiteLLM abstracts provider differences, enabling seamless fallback, load balancing, and cost control. This guide covers production deployment patterns for unified LLM routing.
- LiteLLM Proxy Setup
- Provider Credential Management
- Routing Rules: Cost, Latency, Model Capability
- Fallback Configuration
- Request/Response Logging
- Rate Limiting Per API Key
- Budget Enforcement
- Load Balancing Across Providers
- Virtual API Keys for Teams
- Custom Provider Integration
- Checklist
- Conclusion
LiteLLM Proxy Setup

Deploy LiteLLM as a reverse proxy for all LLM requests:

# Install
pip install litellm

# Start proxy server with a single model
litellm --model gpt-4 --port 8000

# Or use a config file for multiple models
# config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: cohere-command
    litellm_params:
      model: cohere/command
      api_key: os.environ/COHERE_API_KEY
  - model_name: local-llama
    litellm_params:
      model: vllm/meta-llama/Llama-2-70b-hf
      api_base: "http://vllm-server:8000/v1"

general_settings:
  master_key: "sk-1234"

# Start with config
litellm --config config.yaml --port 8000
Now all requests go through LiteLLM:

from openai import OpenAI

# Point to the LiteLLM proxy instead of OpenAI
client = OpenAI(
    api_key="sk-1234",  # master_key from config
    base_url="http://localhost:8000/v1",
)

# Works exactly like the OpenAI API
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "What is AI?"},
    ],
)
print(response.choices[0].message.content)

LiteLLM translates the request to the appropriate provider. No application code changes are needed.
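Conceptually, the proxy matches each incoming request against the model list by its public alias and swaps in the provider-specific parameters. A simplified sketch of that lookup (illustrative only, not LiteLLM's actual internals):

```python
# Simplified alias table mirroring the config.yaml above
MODEL_LIST = {
    "gpt-4": {"provider": "openai", "model": "gpt-4"},
    "claude-3-opus": {"provider": "anthropic", "model": "claude-3-opus-20240229"},
}

def resolve(alias: str) -> dict:
    """Map a public model alias to provider routing parameters."""
    try:
        return MODEL_LIST[alias]
    except KeyError:
        raise ValueError(f"Unknown model alias: {alias}") from None
```

Because callers only ever see the alias, the provider behind it can be swapped in config without touching application code.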
Provider Credential Management

Store credentials securely:

# config.yaml with environment variables
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY  # loaded from the environment
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

# Export before starting
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export COHERE_API_KEY="..."
litellm --config config.yaml --port 8000
For Kubernetes:

apiVersion: v1
kind: Secret
metadata:
  name: llm-credentials
type: Opaque
stringData:
  OPENAI_API_KEY: sk-...
  ANTHROPIC_API_KEY: sk-ant-...
  COHERE_API_KEY: ...
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          envFrom:
            - secretRef:
                name: llm-credentials
          args:
            - --config
            - /app/config.yaml
            - --port
            - "8000"

Never hardcode API keys. Use a secrets management system.
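A startup-time guard helps catch missing credentials before the proxy serves traffic. A minimal sketch (the key names mirror the config above):

```python
import os

# Provider credentials the gateway expects at startup
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "COHERE_API_KEY"]

def validate_credentials(required=REQUIRED_KEYS):
    """Fail fast if any required provider credential is absent or empty."""
    missing = [k for k in required if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
```

Failing fast at boot beats discovering a missing key on the first production request.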
Routing Rules: Cost, Latency, Model Capability

Implement intelligent routing:

# config.yaml with routing metadata (illustrative)
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    metadata:
      cost: 0.03      # $ per 1K tokens
      latency: 500    # ms
      max_tokens: 8000
      capabilities: ["long-context", "reasoning"]
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
    metadata:
      cost: 0.0005
      latency: 300
      max_tokens: 4000
      capabilities: ["speed"]
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
    metadata:
      cost: 0.015
      latency: 800
      max_tokens: 100000
      capabilities: ["long-context", "analysis"]

router:
  type: latency  # or cost-based, capability-based
  # Cost-based routing example
  routing_strategy: cost_optimized
  # Route <100 token requests to the fast/cheap model
  cost_threshold: 0.01
  # Route >50K tokens to the high-context model
  context_threshold: 50000
Python routing logic:

from litellm import Router

router = Router(config_file="config.yaml")

def route_request(prompt: str, context_size: int, priority: str):
    """Route based on request characteristics"""
    if context_size > 50000:
        # Use the long-context model; context wins over priority
        model = "claude-opus"
    elif priority == "cost":
        # Use the cheapest model
        model = "gpt-3.5-turbo"
    elif priority == "quality":
        # Use the best model
        model = "gpt-4"
    else:
        # Default: balance cost and quality
        model = "gpt-4"
    response = router.completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are helpful"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    return response

# Usage: 60K-token context overrides the cost priority → claude-opus
response = route_request(
    "Analyze this large document",
    context_size=60000,
    priority="cost",
)

Route intelligently to balance cost, latency, and quality.
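The context threshold above requires a token estimate before the request is sent. A rough sketch, assuming the common ~4-characters-per-token heuristic for English text (use a real tokenizer such as tiktoken for accuracy):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def pick_model(prompt: str, priority: str = "balanced") -> str:
    """Mirror the routing rules above using an estimated token count."""
    if estimate_tokens(prompt) > 50_000:
        return "claude-opus"      # long-context model
    if priority == "cost":
        return "gpt-3.5-turbo"    # cheapest model
    return "gpt-4"                # quality / balanced default
```

The heuristic errs on the side of under-counting for code-heavy prompts, so leave headroom below provider context limits.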
Fallback Configuration

Enable automatic failover:

# config.yaml with fallback
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

fallback_config:
  - model_name: gpt-4
    fallback_to: ["claude-opus", "gpt-3.5-turbo"]
    retry_attempts: 3
    retry_delay: 2  # seconds
Python fallback:

from litellm import Router

router = Router(config_file="config.yaml")

try:
    response = router.completion(
        model="gpt-4",
        messages=[...],
        fallback=True,  # Enable automatic fallback
    )
except Exception as e:
    print(f"All fallbacks exhausted: {e}")

Fallback ensures reliability even when the primary provider fails.
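The fallback behavior can be sketched as a plain retry loop, independent of any library; `call(model)` here is a stand-in for whatever client function actually hits the provider:

```python
def complete_with_fallback(call, models, attempts_per_model=1):
    """Try each model in order; return the first successful response.

    `call(model)` is any callable that returns a response or raises
    on provider failure."""
    last_error = None
    for model in models:
        for _ in range(attempts_per_model):
            try:
                return call(model)
            except Exception as e:
                last_error = e  # remember the failure, try the next option
    raise RuntimeError(f"All fallbacks exhausted: {last_error}")
```

In production you would add per-attempt backoff and skip models whose recent error rate is high, which is essentially what the proxy's retry settings encode declaratively.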
Request/Response Logging

Monitor all traffic for debugging and compliance:

# config.yaml with logging
database:
  type: postgres
  connection_string: "postgresql://user:pass@localhost:5432/litellm"

litellm_settings:
  log_level: "DEBUG"
  log_to_file: true
  log_file_path: "/var/log/litellm.log"
  # Log completion token usage
  log_completion_tokens: true

import json
from datetime import datetime

import litellm
from litellm import Router

# Enable verbose logging
litellm.set_verbose = True

# Custom logging handler
def log_request(request, response, latency_ms):
    """Log all LLM requests and responses"""
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "model": request.get("model"),
        "prompt_tokens": response.get("usage", {}).get("prompt_tokens"),
        "completion_tokens": response.get("usage", {}).get("completion_tokens"),
        "latency_ms": latency_ms,
        "cost": calculate_cost(request, response),
    }
    # Send to logging backend
    print(json.dumps(log_entry))

# Configure
router = Router(config_file="config.yaml")

# Attach logging
router.set_callback(log_request)

Store logs in PostgreSQL, Datadog, or CloudWatch for analytics.
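The `calculate_cost` helper referenced in the logging handler can be sketched from a per-model price table. The prices below are illustrative assumptions, not current rates; always check provider pricing pages:

```python
# Illustrative per-1K-token prices (assumed values — verify current pricing)
PRICES = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
}

def calculate_cost(request: dict, response: dict) -> float:
    """Estimate request cost in USD from token usage and the price table."""
    usage = response.get("usage", {})
    price = PRICES.get(request.get("model"), {"prompt": 0, "completion": 0})
    return (
        usage.get("prompt_tokens", 0) / 1000 * price["prompt"]
        + usage.get("completion_tokens", 0) / 1000 * price["completion"]
    )
```

Unknown models fall through to zero cost here; in practice you would log a warning so pricing gaps surface quickly.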
Rate Limiting Per API Key

Prevent abuse and control costs:

# config.yaml with rate limiting
key_settings:
  - api_key: "user-key-123"
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 100000
    budget:
      monthly_limit: 100  # $100/month
  - api_key: "user-key-456"
    rate_limit:
      requests_per_minute: 120
      tokens_per_minute: 500000
    budget:
      monthly_limit: 1000

Python implementation:

from datetime import datetime

import redis

from litellm import Router

class RateLimiter:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)

    def check_limit(
        self,
        api_key: str,
        rpm_limit: int = 60,
        tpm_limit: int = 100000,
    ) -> bool:
        """Check if the request is within per-minute request and token limits"""
        now = datetime.utcnow()
        window = now.strftime("%Y%m%d%H%M")
        request_key = f"ratelimit:req:{api_key}:{window}"
        token_key = f"ratelimit:tok:{api_key}:{window}"
        # Increment request count for the current minute window
        requests = self.redis.incr(request_key)
        if requests == 1:
            self.redis.expire(request_key, 60)
        tokens = int(self.redis.get(token_key) or 0)
        return requests <= rpm_limit and tokens <= tpm_limit

    def track_tokens(self, api_key: str, token_count: int):
        """Track token usage for rate limiting and billing"""
        now = datetime.utcnow()
        window_key = f"ratelimit:tok:{api_key}:{now.strftime('%Y%m%d%H%M')}"
        month_key = f"tokens:{api_key}:{now.strftime('%Y%m')}"
        self.redis.incrby(window_key, token_count)
        self.redis.expire(window_key, 60)
        self.redis.incrby(month_key, token_count)

limiter = RateLimiter()

# Use with router
router = Router(config_file="config.yaml")

def completion_with_limits(api_key: str, messages: list):
    # Check rate limit
    if not limiter.check_limit(api_key):
        raise Exception("Rate limit exceeded")
    # Make request
    response = router.completion(
        model="gpt-4",
        messages=messages,
    )
    # Track tokens (counts toward the next check)
    limiter.track_tokens(
        api_key,
        response.usage.total_tokens,
    )
    return response

Rate limiting protects against abuse and runaway costs.
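For a single-process deployment without Redis, the same fixed-window idea works in memory. A minimal sketch (not safe across multiple proxy replicas):

```python
import time
from collections import defaultdict

class InMemoryRateLimiter:
    """Fixed-window request limiter for single-process deployments."""

    def __init__(self, rpm_limit: int = 60):
        self.rpm_limit = rpm_limit
        self.windows = defaultdict(int)  # (api_key, minute) -> request count

    def allow(self, api_key: str, now: float = None) -> bool:
        """Count one request; return False once the minute's quota is spent."""
        minute = int((time.time() if now is None else now) // 60)
        self.windows[(api_key, minute)] += 1
        return self.windows[(api_key, minute)] <= self.rpm_limit
```

The `now` parameter makes the window testable; in production you would also evict stale window keys periodically.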
Budget Enforcement

Prevent bills from spiraling:

from datetime import datetime

import redis

class BudgetManager:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)

    def set_monthly_budget(self, api_key: str, budget_usd: float):
        """Set max monthly spend in USD"""
        now = datetime.utcnow()
        key = f"budget:{api_key}:{now.strftime('%Y-%m')}"  # current month
        self.redis.set(key, budget_usd)

    def add_cost(self, api_key: str, cost_usd: float) -> bool:
        """Add cost; return True if still within budget"""
        now = datetime.utcnow()
        budget_key = f"budget:{api_key}:{now.strftime('%Y-%m')}"
        spent_key = f"spent:{api_key}:{now.strftime('%Y-%m')}"
        budget = float(self.redis.get(budget_key) or 0)
        spent = float(self.redis.get(spent_key) or 0)
        if spent + cost_usd > budget:
            return False  # Over budget
        self.redis.incrbyfloat(spent_key, cost_usd)
        return True

    def get_usage(self, api_key: str) -> dict:
        """Get current usage for an API key"""
        now = datetime.utcnow()
        budget_key = f"budget:{api_key}:{now.strftime('%Y-%m')}"
        spent_key = f"spent:{api_key}:{now.strftime('%Y-%m')}"
        budget = float(self.redis.get(budget_key) or 0)
        spent = float(self.redis.get(spent_key) or 0)
        return {
            "monthly_budget": budget,
            "spent": spent,
            "remaining": budget - spent,
            "percentage_used": (spent / budget * 100) if budget else 0,
        }

budget_manager = BudgetManager()

# Set budgets
budget_manager.set_monthly_budget("user-key-123", 100)
budget_manager.set_monthly_budget("user-key-456", 1000)

# Check before making the request
if not budget_manager.add_cost("user-key-123", 0.50):
    raise Exception("Monthly budget exceeded")

Enforce budgets to prevent surprise bills.
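On top of hard cutoffs, it helps to alert as spend approaches the limit rather than only rejecting at 100%. A small sketch that works against the `get_usage()` output shape above (the thresholds are illustrative):

```python
def budget_alerts(usage: dict, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds crossed, given get_usage()-style output."""
    budget = usage["monthly_budget"]
    frac = usage["spent"] / budget if budget else 0
    return [t for t in thresholds if frac >= t]
```

Wiring the returned levels to Slack or PagerDuty gives teams time to react before requests start failing.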
Load Balancing Across Providers

Distribute traffic for resilience:

# config.yaml with multiple instances
model_list:
  - model_name: gpt-4-us
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
      api_base: "https://api.openai.com/v1"
  - model_name: gpt-4-eu
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_EU_API_KEY
      api_base: "https://api.openai-eu.com/v1"

router:
  type: weighted_round_robin
  weights:
    gpt-4-us: 0.6
    gpt-4-eu: 0.4

The router distributes requests across instances:

# Request 1 → gpt-4-us
# Request 2 → gpt-4-eu
# Request 3 → gpt-4-us
# Request 4 → gpt-4-eu
# Request 5 → gpt-4-us
# (3 US : 2 EU per five requests ≈ 60% / 40%)

Load balancing improves availability and geographic resilience.
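The weighted distribution can be sketched as a precomputed repeating schedule. This is a simplification of weighted round-robin; a production router would also account for instance health and observed latency:

```python
import itertools

def weighted_schedule(weights: dict, scale: int = 10):
    """Build a repeating schedule proportional to the weights.

    {"us": 0.6, "eu": 0.4} -> 6 "us" slots and 4 "eu" slots per cycle."""
    slots = []
    for name, weight in weights.items():
        slots.extend([name] * round(weight * scale))
    return itertools.cycle(slots)
```

Drawing from the cycle guarantees the long-run ratio matches the configured weights exactly, with no randomness to skew small samples.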
Virtual API Keys for Teams

Issue per-team API keys with independent rate limits:

import uuid
from datetime import datetime

class TeamAPIKeyManager:
    def __init__(self):
        # In-memory store; back this with a database in production
        self.teams = {}

    def create_team_key(
        self,
        team_name: str,
        monthly_budget: float,
        rpm_limit: int = 60,
    ) -> str:
        """Create a team API key"""
        api_key = f"team-{uuid.uuid4().hex[:16]}"
        self.teams[api_key] = {
            "team_name": team_name,
            "budget": monthly_budget,
            "rpm_limit": rpm_limit,
            "created_at": datetime.utcnow(),
        }
        return api_key

    def get_team_config(self, api_key: str) -> dict:
        """Get the configuration for an API key"""
        return self.teams.get(api_key)

key_manager = TeamAPIKeyManager()

# Create keys for teams
data_team_key = key_manager.create_team_key(
    "data-science",
    monthly_budget=500,
    rpm_limit=100,
)
ml_team_key = key_manager.create_team_key(
    "ml-platform",
    monthly_budget=2000,
    rpm_limit=500,
)

# Each team gets isolated rate limiting and budget tracking

Virtual API keys enable multi-team governance without sharing provider credentials.
Custom Provider Integration

Add unsupported providers:

import requests

class CustomLLMProvider:
    """Integrate a custom LLM provider"""

    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url

    def completion(
        self,
        model: str,
        messages: list,
        temperature: float = 0.7,
        max_tokens: int = 1000,
    ) -> dict:
        """Make a completion request to the custom provider"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
        response = requests.post(
            f"{self.base_url}/completions",
            json=payload,
            headers=headers,
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

# Register custom provider
custom_provider = CustomLLMProvider(
    api_key="custom-key",
    base_url="https://custom-llm.example.com",
)

result = custom_provider.completion(
    model="custom-model",
    messages=[
        {"role": "user", "content": "What is AI?"},
    ],
)

Custom providers extend LiteLLM to any API.
Checklist
- Deploy LiteLLM proxy with multiple providers configured
- Store all credentials in environment variables or secrets manager
- Implement routing strategy (cost, latency, or capability-based)
- Configure fallback providers for high availability
- Enable request/response logging to database
- Set rate limits per API key
- Enforce monthly budgets with hard cutoffs and alert thresholds
- Monitor provider status and latency
- Document team API key allocation
- Test failover scenarios
Conclusion
LiteLLM abstracts provider complexity, enabling flexible, cost-optimized LLM routing. Consolidate credentials, implement intelligent routing, enforce budgets, and configure fallbacks for resilience. Monitor costs and usage continuously, and scale from startup to enterprise behind a single proxy layer.