Hugging Face Inference API — Free LLM Hosting


Introduction

The Hugging Face Inference API gives you hosted access to thousands of models without managing any infrastructure, including a free tier for experimentation. This guide covers using the API, its rate limits, and considerations for production use.

Getting Started

Setup

pip install huggingface-hub requests

from huggingface_hub import InferenceClient

# Create a client with an explicit token
client = InferenceClient(token="your-token")

# Or rely on the HF_TOKEN environment variable:
# export HF_TOKEN="your-token"
# client = InferenceClient()

Get Your API Token

  1. Go to huggingface.co and create an account
  2. Navigate to Settings → Access Tokens
  3. Create a new token with read permissions
  4. Copy the token into your environment (a quick sanity check follows)
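
To confirm the client picks up your token, you can call whoami from huggingface_hub, which validates the token and returns your account info. A minimal sketch, assuming HF_TOKEN is exported as above:

import os
from huggingface_hub import whoami

# Raises if the token is missing or invalid
info = whoami(token=os.environ["HF_TOKEN"])
print(info["name"])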

Basic Text Generation

from huggingface_hub import InferenceClient

client = InferenceClient()

# Generate text
response = client.text_generation(
    prompt="Machine learning is",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    max_new_tokens=100
)

print(response)

Available Models

# Popular free models
models = {
    "Mistral-7B-Instruct": "mistralai/Mistral-7B-Instruct-v0.1",
    "Llama-2-7B-Chat": "meta-llama/Llama-2-7b-chat-hf",
    "Zephyr-7B": "HuggingFaceH4/zephyr-7b-beta",
    "Falcon-7B": "tiiuae/falcon-7b-instruct",
}

# Use any of these with the API
for name, model_id in models.items():
    print(f"{name}: {model_id}")

Text Classification

# Classify sentiment
response = client.text_classification(
    text="I absolutely love this product!",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

print(response)
# [{'label': 'POSITIVE', 'score': 0.9998}]

Summarization

# Summarize text
text = """
Machine learning is a subset of artificial intelligence (AI).
It involves training algorithms on data to make predictions.
Deep learning uses neural networks with multiple layers.
"""

response = client.summarization(
    text=text,
    model="facebook/bart-large-cnn"
)

print(response["summary_text"])

Question Answering

# QA with context
response = client.question_answering(
    question="What is machine learning?",
    context="Machine learning is a subset of AI that learns from data.",
    model="deepset/roberta-base-squad2"
)

print(response)
# {'answer': 'a subset of AI', 'score': 0.95, 'start': 20, 'end': 34}

Named Entity Recognition

# Extract entities
response = client.token_classification(
    text="My name is John and I work at Google.",
    model="dslim/bert-base-NER"
)

print(response)
# [
#   {'entity': 'B-PER', 'score': 0.99, 'word': 'John', ...},
#   {'entity': 'B-ORG', 'score': 0.98, 'word': 'Google', ...}
# ]

Conversational AI

# Chat interface
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

response = client.text_generation(
    prompt="You are a helpful assistant.\n\n" +
           "User: " + messages[0]["content"] + "\n" +
           "Assistant:",
    model="HuggingFaceH4/zephyr-7b-beta",
    max_new_tokens=100
)

print(response)

Building a Chat Application

class ChatBot:
    def __init__(self, model_id="HuggingFaceH4/zephyr-7b-beta"):
        self.client = InferenceClient()
        self.model_id = model_id
        self.conversation_history = []

    def chat(self, user_message):
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })

        # Build the prompt from history, then cue the model to respond
        prompt = "You are a helpful AI assistant.\n\n"
        for msg in self.conversation_history:
            prompt += f"{msg['role'].capitalize()}: {msg['content']}\n"
        prompt += "Assistant: "

        # Generate response
        response = self.client.text_generation(
            prompt=prompt,
            model=self.model_id,
            max_new_tokens=100,
            stop=["\nUser:"]
        )

        # Add to history
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })

        return response

# Usage
bot = ChatBot()
print(bot.chat("What is Python?"))
print(bot.chat("Tell me more about it"))

Streaming Responses

# Stream long responses
client = InferenceClient()

for token in client.text_generation(
    prompt="Write a short story about AI:",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    max_new_tokens=500,
    stream=True
):
    print(token, end="", flush=True)

Rate Limits and Quotas

# Free tier rate limits (figures change over time; check the current docs):
# - ~30,000 requests per month
# - ~1 request per second
# - No concurrent requests

# For production, upgrade to Pro ($9/month):
# - Much higher request limits
# - Faster inference
# - Priority queue

# There is no public endpoint for quota usage; check it under your
# account settings on huggingface.co. You can, however, ask whether
# a model is currently loaded and ready to serve:
from huggingface_hub import InferenceClient

client = InferenceClient()
status = client.get_model_status("mistralai/Mistral-7B-Instruct-v0.1")
print(status)  # ModelStatus(loaded=..., state=..., compute_type=..., framework=...)
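
When you do hit a limit (HTTP 429) or a cold model (HTTP 503), a simple retry with exponential backoff is usually enough. A minimal sketch; the delay and attempt counts are arbitrary starting points:

import time
from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient()

def generate_with_retry(prompt, retries=5, base_delay=2.0):
    # Retry on rate-limit (429) and model-loading (503) errors,
    # doubling the wait after each failed attempt
    for attempt in range(retries):
        try:
            return client.text_generation(
                prompt=prompt,
                model="mistralai/Mistral-7B-Instruct-v0.1",
                max_new_tokens=100,
            )
        except HfHubHTTPError as e:
            code = e.response.status_code if e.response is not None else None
            if code not in (429, 503) or attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)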

Error Handling

from huggingface_hub import InferenceClient, InferenceTimeoutError
from huggingface_hub.utils import HfHubHTTPError

# Note: the timeout is set on the client, not per request
client = InferenceClient(timeout=10)

try:
    response = client.text_generation(
        prompt="Hello",
        model="mistralai/Mistral-7B-Instruct-v0.1",
    )
except InferenceTimeoutError:
    print("Model loading, please retry in a few seconds")
except HfHubHTTPError as e:
    print(f"API error: {e}")

Building a Simple Web App

from flask import Flask, request, jsonify
from huggingface_hub import InferenceClient

app = Flask(__name__)
client = InferenceClient()

@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json(force=True)
    prompt = data.get("prompt")
    if not prompt:
        return jsonify({"error": "prompt is required"}), 400
    max_tokens = data.get("max_tokens", 100)

    try:
        response = client.text_generation(
            prompt=prompt,
            model="mistralai/Mistral-7B-Instruct-v0.1",
            max_new_tokens=max_tokens
        )
        return jsonify({"response": response})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(debug=True)
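
With the server running locally, you can exercise the endpoint using the requests library installed earlier (the port and payload match the Flask defaults above):

import requests

resp = requests.post(
    "http://127.0.0.1:5000/generate",
    json={"prompt": "Machine learning is", "max_tokens": 50},
)
print(resp.json())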

Cost Optimization

# Free tier strategy:
# 1. Cache responses for repeated queries
# 2. Use smaller models when possible
# 3. Batch requests efficiently
# 4. Consider local models for volume

# Cache implementation
from functools import lru_cache
from huggingface_hub import InferenceClient

client = InferenceClient()

@lru_cache(maxsize=1000)
def cached_generation(prompt):
    # Identical prompts are answered from the in-memory cache
    # instead of spending another API request
    response = client.text_generation(
        prompt=prompt,
        model="mistralai/Mistral-7B-Instruct-v0.1",
        max_new_tokens=100
    )
    return response

Conclusion

The Hugging Face Inference API offers free access to powerful models, which makes it a great fit for prototyping, hobby projects, and learning. For production use at scale, consider self-hosting or upgrading to Pro.

FAQ

Q: Can I use the free API in production? A: Not recommended due to rate limits (roughly 30k requests/month on the free tier). Upgrade to Pro ($9/month) for much higher limits.

Q: How do I improve response quality? A: Use better prompts, try different models, and adjust sampling parameters such as temperature and max_new_tokens, as in the sketch below.
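
These parameters go straight into text_generation; the values here are arbitrary starting points, not recommendations:

response = client.text_generation(
    prompt="Machine learning is",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    max_new_tokens=100,
    temperature=0.7,        # lower = more deterministic output
    top_p=0.95,             # nucleus sampling cutoff
    repetition_penalty=1.1  # discourages repeated phrases
)
print(response)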

Q: Can I use the API with LangChain? A: Yes, LangChain has Hugging Face integration for seamless API usage.
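
A minimal sketch with the langchain-huggingface package (assumes pip install langchain-huggingface; class names may differ in older LangChain releases):

from langchain_huggingface import HuggingFaceEndpoint

# Reads HF_TOKEN from the environment, like InferenceClient
llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.1",
    max_new_tokens=100,
)
print(llm.invoke("Machine learning is"))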
