AI Gateway With LiteLLM — Unified Interface for 100+ LLM Providers

Authors

- Sanjeev Sharma (@webcoderspeed1)
Introduction
Managing multiple LLM providers at scale becomes complex quickly. LiteLLM abstracts provider differences, enabling seamless fallback, load balancing, and cost control. This guide covers production deployment patterns for unified LLM routing.
- LiteLLM Proxy Setup
- Provider Credential Management
- Routing Rules: Cost, Latency, Model Capability
- Fallback Configuration
- Request/Response Logging
- Rate Limiting Per API Key
- Budget Enforcement
- Load Balancing Across Providers
- Virtual API Keys for Teams
- Custom Provider Integration
- Checklist
- Conclusion
LiteLLM Proxy Setup

Deploy LiteLLM as a reverse proxy for all LLM requests:

# Install
pip install litellm

# Start proxy server with a single model
litellm --model gpt-4 --port 8000

# Or use a config file for multiple models
# config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: cohere-command
    litellm_params:
      model: cohere/command
      api_key: os.environ/COHERE_API_KEY
  - model_name: local-llama
    litellm_params:
      model: vllm/meta-llama/Llama-2-70b-hf
      api_base: "http://vllm-server:8000/v1"

general_settings:
  master_key: "sk-1234"

# Start with config
litellm --config config.yaml --port 8000
Now all requests go through LiteLLM:

from openai import OpenAI

# Point to the LiteLLM proxy instead of OpenAI
client = OpenAI(
    api_key="sk-1234",  # master_key from config
    base_url="http://localhost:8000/v1",
)

# Works exactly like the OpenAI API
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "What is AI?"},
    ],
)
print(response.choices[0].message.content)

LiteLLM translates the request to the appropriate provider. No application code changes are needed.
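Conceptually, the proxy matches each incoming request against the model list by its public alias and swaps in the provider-specific parameters. A simplified sketch of that lookup (illustrative only, not LiteLLM's actual internals):

```python
# Simplified alias table mirroring the config.yaml above
MODEL_LIST = {
    "gpt-4": {"provider": "openai", "model": "gpt-4"},
    "claude-3-opus": {"provider": "anthropic", "model": "claude-3-opus-20240229"},
}

def resolve(alias: str) -> dict:
    """Map a public model alias to provider routing parameters."""
    try:
        return MODEL_LIST[alias]
    except KeyError:
        raise ValueError(f"Unknown model alias: {alias}") from None
```

Because callers only ever see the alias, the provider behind it can be swapped in config without touching application code.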
Provider Credential Management

Store credentials securely:

# config.yaml with environment variables
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY  # loaded from the environment
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

# Export before starting
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export COHERE_API_KEY="..."
litellm --config config.yaml --port 8000
For Kubernetes:

apiVersion: v1
kind: Secret
metadata:
  name: llm-credentials
type: Opaque
stringData:
  OPENAI_API_KEY: sk-...
  ANTHROPIC_API_KEY: sk-ant-...
  COHERE_API_KEY: ...
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          envFrom:
            - secretRef:
                name: llm-credentials
          args:
            - --config
            - /app/config.yaml
            - --port
            - "8000"

Never hardcode API keys. Use a secrets management system.
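A startup-time guard helps catch missing credentials before the proxy serves traffic. A minimal sketch (the key names mirror the config above):

```python
import os

# Provider credentials the gateway expects at startup
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "COHERE_API_KEY"]

def validate_credentials(required=REQUIRED_KEYS):
    """Fail fast if any required provider credential is absent or empty."""
    missing = [k for k in required if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
```

Failing fast at boot beats discovering a missing key on the first production request.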
Routing Rules: Cost, Latency, Model Capability

Implement intelligent routing:

# config.yaml with routing metadata (illustrative)
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    metadata:
      cost: 0.03      # $ per 1K tokens
      latency: 500    # ms
      max_tokens: 8000
      capabilities: ["long-context", "reasoning"]
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
    metadata:
      cost: 0.0005
      latency: 300
      max_tokens: 4000
      capabilities: ["speed"]
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
    metadata:
      cost: 0.015
      latency: 800
      max_tokens: 100000
      capabilities: ["long-context", "analysis"]

router:
  type: latency  # or cost-based, capability-based
  # Cost-based routing example
  routing_strategy: cost_optimized
  # Route <100 token requests to the fast/cheap model
  cost_threshold: 0.01
  # Route >50K tokens to the high-context model
  context_threshold: 50000
Python routing logic:

from litellm import Router

router = Router(config_file="config.yaml")

def route_request(prompt: str, context_size: int, priority: str):
    """Route based on request characteristics"""
    if context_size > 50000:
        # Use the long-context model; context wins over priority
        model = "claude-opus"
    elif priority == "cost":
        # Use the cheapest model
        model = "gpt-3.5-turbo"
    elif priority == "quality":
        # Use the best model
        model = "gpt-4"
    else:
        # Default: balance cost and quality
        model = "gpt-4"
    response = router.completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are helpful"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    return response

# Usage: 60K-token context overrides the cost priority → claude-opus
response = route_request(
    "Analyze this large document",
    context_size=60000,
    priority="cost",
)

Route intelligently to balance cost, latency, and quality.
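The context threshold above requires a token estimate before the request is sent. A rough sketch, assuming the common ~4-characters-per-token heuristic for English text (use a real tokenizer such as tiktoken for accuracy):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def pick_model(prompt: str, priority: str = "balanced") -> str:
    """Mirror the routing rules above using an estimated token count."""
    if estimate_tokens(prompt) > 50_000:
        return "claude-opus"      # long-context model
    if priority == "cost":
        return "gpt-3.5-turbo"    # cheapest model
    return "gpt-4"                # quality / balanced default
```

The heuristic errs on the side of under-counting for code-heavy prompts, so leave headroom below provider context limits.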
Fallback Configuration

Enable automatic failover:

# config.yaml with fallback
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

fallback_config:
  - model_name: gpt-4
    fallback_to: ["claude-opus", "gpt-3.5-turbo"]
    retry_attempts: 3
    retry_delay: 2  # seconds
Python fallback:

from litellm import Router

router = Router(config_file="config.yaml")

try:
    response = router.completion(
        model="gpt-4",
        messages=[...],
        fallback=True,  # Enable automatic fallback
    )
except Exception as e:
    print(f"All fallbacks exhausted: {e}")

Fallback ensures reliability even when the primary provider fails.
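The fallback behavior can be sketched as a plain retry loop, independent of any library; `call(model)` here is a stand-in for whatever client function actually hits the provider:

```python
def complete_with_fallback(call, models, attempts_per_model=1):
    """Try each model in order; return the first successful response.

    `call(model)` is any callable that returns a response or raises
    on provider failure."""
    last_error = None
    for model in models:
        for _ in range(attempts_per_model):
            try:
                return call(model)
            except Exception as e:
                last_error = e  # remember the failure, try the next option
    raise RuntimeError(f"All fallbacks exhausted: {last_error}")
```

In production you would add per-attempt backoff and skip models whose recent error rate is high, which is essentially what the proxy's retry settings encode declaratively.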
Request/Response Logging

Monitor all traffic for debugging and compliance:

# config.yaml with logging
database:
  type: postgres
  connection_string: "postgresql://user:pass@localhost:5432/litellm"

litellm_settings:
  log_level: "DEBUG"
  log_to_file: true
  log_file_path: "/var/log/litellm.log"
  # Log completion token usage
  log_completion_tokens: true

import json
from datetime import datetime

import litellm
from litellm import Router

# Enable verbose logging
litellm.set_verbose = True

# Custom logging handler
def log_request(request, response, latency_ms):
    """Log all LLM requests and responses"""
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "model": request.get("model"),
        "prompt_tokens": response.get("usage", {}).get("prompt_tokens"),
        "completion_tokens": response.get("usage", {}).get("completion_tokens"),
        "latency_ms": latency_ms,
        "cost": calculate_cost(request, response),
    }
    # Send to logging backend
    print(json.dumps(log_entry))

# Configure
router = Router(config_file="config.yaml")

# Attach logging
router.set_callback(log_request)

Store logs in PostgreSQL, Datadog, or CloudWatch for analytics.
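The `calculate_cost` helper referenced in the logging handler can be sketched from a per-model price table. The prices below are illustrative assumptions, not current rates; always check provider pricing pages:

```python
# Illustrative per-1K-token prices (assumed values — verify current pricing)
PRICES = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
}

def calculate_cost(request: dict, response: dict) -> float:
    """Estimate request cost in USD from token usage and the price table."""
    usage = response.get("usage", {})
    price = PRICES.get(request.get("model"), {"prompt": 0, "completion": 0})
    return (
        usage.get("prompt_tokens", 0) / 1000 * price["prompt"]
        + usage.get("completion_tokens", 0) / 1000 * price["completion"]
    )
```

Unknown models fall through to zero cost here; in practice you would log a warning so pricing gaps surface quickly.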
Rate Limiting Per API Key

Prevent abuse and control costs:

# config.yaml with rate limiting
key_settings:
  - api_key: "user-key-123"
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 100000
    budget:
      monthly_limit: 100  # $100/month
  - api_key: "user-key-456"
    rate_limit:
      requests_per_minute: 120
      tokens_per_minute: 500000
    budget:
      monthly_limit: 1000

Python implementation:

from datetime import datetime

import redis

from litellm import Router

class RateLimiter:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)

    def check_limit(
        self,
        api_key: str,
        rpm_limit: int = 60,
        tpm_limit: int = 100000,
    ) -> bool:
        """Check if the request is within per-minute request and token limits"""
        now = datetime.utcnow()
        window = now.strftime("%Y%m%d%H%M")
        request_key = f"ratelimit:req:{api_key}:{window}"
        token_key = f"ratelimit:tok:{api_key}:{window}"
        # Increment request count for the current minute window
        requests = self.redis.incr(request_key)
        if requests == 1:
            self.redis.expire(request_key, 60)
        tokens = int(self.redis.get(token_key) or 0)
        return requests <= rpm_limit and tokens <= tpm_limit

    def track_tokens(self, api_key: str, token_count: int):
        """Track token usage for rate limiting and billing"""
        now = datetime.utcnow()
        window_key = f"ratelimit:tok:{api_key}:{now.strftime('%Y%m%d%H%M')}"
        month_key = f"tokens:{api_key}:{now.strftime('%Y%m')}"
        self.redis.incrby(window_key, token_count)
        self.redis.expire(window_key, 60)
        self.redis.incrby(month_key, token_count)

limiter = RateLimiter()

# Use with router
router = Router(config_file="config.yaml")

def completion_with_limits(api_key: str, messages: list):
    # Check rate limit
    if not limiter.check_limit(api_key):
        raise Exception("Rate limit exceeded")
    # Make request
    response = router.completion(
        model="gpt-4",
        messages=messages,
    )
    # Track tokens (counts toward the next check)
    limiter.track_tokens(
        api_key,
        response.usage.total_tokens,
    )
    return response

Rate limiting protects against abuse and runaway costs.
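For a single-process deployment without Redis, the same fixed-window idea works in memory. A minimal sketch (not safe across multiple proxy replicas):

```python
import time
from collections import defaultdict

class InMemoryRateLimiter:
    """Fixed-window request limiter for single-process deployments."""

    def __init__(self, rpm_limit: int = 60):
        self.rpm_limit = rpm_limit
        self.windows = defaultdict(int)  # (api_key, minute) -> request count

    def allow(self, api_key: str, now: float = None) -> bool:
        """Count one request; return False once the minute's quota is spent."""
        minute = int((time.time() if now is None else now) // 60)
        self.windows[(api_key, minute)] += 1
        return self.windows[(api_key, minute)] <= self.rpm_limit
```

The `now` parameter makes the window testable; in production you would also evict stale window keys periodically.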
Budget Enforcement

Prevent bills from spiraling:

from datetime import datetime

import redis

class BudgetManager:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)

    def set_monthly_budget(self, api_key: str, budget_usd: float):
        """Set max monthly spend in USD"""
        now = datetime.utcnow()
        key = f"budget:{api_key}:{now.strftime('%Y-%m')}"  # current month
        self.redis.set(key, budget_usd)

    def add_cost(self, api_key: str, cost_usd: float) -> bool:
        """Add cost; return True if still within budget"""
        now = datetime.utcnow()
        budget_key = f"budget:{api_key}:{now.strftime('%Y-%m')}"
        spent_key = f"spent:{api_key}:{now.strftime('%Y-%m')}"
        budget = float(self.redis.get(budget_key) or 0)
        spent = float(self.redis.get(spent_key) or 0)
        if spent + cost_usd > budget:
            return False  # Over budget
        self.redis.incrbyfloat(spent_key, cost_usd)
        return True

    def get_usage(self, api_key: str) -> dict:
        """Get current usage for an API key"""
        now = datetime.utcnow()
        budget_key = f"budget:{api_key}:{now.strftime('%Y-%m')}"
        spent_key = f"spent:{api_key}:{now.strftime('%Y-%m')}"
        budget = float(self.redis.get(budget_key) or 0)
        spent = float(self.redis.get(spent_key) or 0)
        return {
            "monthly_budget": budget,
            "spent": spent,
            "remaining": budget - spent,
            "percentage_used": (spent / budget * 100) if budget else 0,
        }

budget_manager = BudgetManager()

# Set budgets
budget_manager.set_monthly_budget("user-key-123", 100)
budget_manager.set_monthly_budget("user-key-456", 1000)

# Check before making the request
if not budget_manager.add_cost("user-key-123", 0.50):
    raise Exception("Monthly budget exceeded")

Enforce budgets to prevent surprise bills.
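On top of hard cutoffs, it helps to alert as spend approaches the limit rather than only rejecting at 100%. A small sketch that works against the `get_usage()` output shape above (the thresholds are illustrative):

```python
def budget_alerts(usage: dict, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds crossed, given get_usage()-style output."""
    budget = usage["monthly_budget"]
    frac = usage["spent"] / budget if budget else 0
    return [t for t in thresholds if frac >= t]
```

Wiring the returned levels to Slack or PagerDuty gives teams time to react before requests start failing.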
Load Balancing Across Providers

Distribute traffic for resilience:

# config.yaml with multiple instances
model_list:
  - model_name: gpt-4-us
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
      api_base: "https://api.openai.com/v1"
  - model_name: gpt-4-eu
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_EU_API_KEY
      api_base: "https://api.openai-eu.com/v1"

router:
  type: weighted_round_robin
  weights:
    gpt-4-us: 0.6
    gpt-4-eu: 0.4

The router distributes requests across instances:

# Request 1 → gpt-4-us
# Request 2 → gpt-4-eu
# Request 3 → gpt-4-us
# Request 4 → gpt-4-eu
# Request 5 → gpt-4-us
# (3 US : 2 EU per five requests ≈ 60% / 40%)

Load balancing improves availability and geographic resilience.
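The weighted distribution can be sketched as a precomputed repeating schedule. This is a simplification of weighted round-robin; a production router would also account for instance health and observed latency:

```python
import itertools

def weighted_schedule(weights: dict, scale: int = 10):
    """Build a repeating schedule proportional to the weights.

    {"us": 0.6, "eu": 0.4} -> 6 "us" slots and 4 "eu" slots per cycle."""
    slots = []
    for name, weight in weights.items():
        slots.extend([name] * round(weight * scale))
    return itertools.cycle(slots)
```

Drawing from the cycle guarantees the long-run ratio matches the configured weights exactly, with no randomness to skew small samples.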
Virtual API Keys for Teams

Issue per-team API keys with independent rate limits:

import uuid
from datetime import datetime

class TeamAPIKeyManager:
    def __init__(self):
        # In-memory store; back this with a database in production
        self.teams = {}

    def create_team_key(
        self,
        team_name: str,
        monthly_budget: float,
        rpm_limit: int = 60,
    ) -> str:
        """Create a team API key"""
        api_key = f"team-{uuid.uuid4().hex[:16]}"
        self.teams[api_key] = {
            "team_name": team_name,
            "budget": monthly_budget,
            "rpm_limit": rpm_limit,
            "created_at": datetime.utcnow(),
        }
        return api_key

    def get_team_config(self, api_key: str) -> dict:
        """Get the configuration for an API key"""
        return self.teams.get(api_key)

key_manager = TeamAPIKeyManager()

# Create keys for teams
data_team_key = key_manager.create_team_key(
    "data-science",
    monthly_budget=500,
    rpm_limit=100,
)
ml_team_key = key_manager.create_team_key(
    "ml-platform",
    monthly_budget=2000,
    rpm_limit=500,
)

# Each team gets isolated rate limiting and budget tracking

Virtual API keys enable multi-team governance without sharing provider credentials.
Custom Provider Integration

Add unsupported providers:

import requests

class CustomLLMProvider:
    """Integrate a custom LLM provider"""

    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url

    def completion(
        self,
        model: str,
        messages: list,
        temperature: float = 0.7,
        max_tokens: int = 1000,
    ) -> dict:
        """Make a completion request to the custom provider"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
        response = requests.post(
            f"{self.base_url}/completions",
            json=payload,
            headers=headers,
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

# Register custom provider
custom_provider = CustomLLMProvider(
    api_key="custom-key",
    base_url="https://custom-llm.example.com",
)

result = custom_provider.completion(
    model="custom-model",
    messages=[
        {"role": "user", "content": "What is AI?"},
    ],
)

Custom providers extend LiteLLM to any API.
Checklist
- Deploy LiteLLM proxy with multiple providers configured
- Store all credentials in environment variables or secrets manager
- Implement routing strategy (cost, latency, or capability-based)
- Configure fallback providers for high availability
- Enable request/response logging to database
- Set rate limits per API key
- Enforce monthly budgets with hard cutoffs and alert thresholds
- Monitor provider status and latency
- Document team API key allocation
- Test failover scenarios
Conclusion
LiteLLM abstracts provider complexity, enabling flexible, cost-optimized LLM routing. Consolidate credentials, implement intelligent routing, enforce budgets, and configure fallbacks for resilience. Monitor costs and usage continuously, and scale from startup to enterprise behind a single proxy layer.