Hugging Face Inference API — Free LLM Hosting
Introduction
The Hugging Face Inference API provides free, hosted access to thousands of models without managing any infrastructure. This guide covers calling the API, its rate limits, and production considerations.
- Getting Started
- Setup
- Get Your API Token
- Basic Text Generation
- Available Models
- Text Classification
- Summarization
- Question Answering
- Named Entity Recognition
- Conversational AI
- Building a Chat Application
- Streaming Responses
- Rate Limits and Quotas
- Error Handling
- Building a Simple Web App
- Cost Optimization
- Conclusion
- FAQ
Getting Started
Setup
pip install huggingface-hub requests
from huggingface_hub import InferenceClient
# Create client
client = InferenceClient(api_key="your-token")
# Or use environment variable
# export HF_TOKEN="your-token"
Get Your API Token
- Go to huggingface.co and create account
- Navigate to Settings → Access Tokens
- Create new token with read permissions
- Copy token to environment
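Before creating a client it is worth checking that the token is actually in the environment; `InferenceClient` picks up `HF_TOKEN` automatically when no `api_key` is passed. A minimal check (the helper name is ours):

```python
import os

def hf_token_available() -> bool:
    """Return True if the HF_TOKEN environment variable is set and non-empty."""
    return bool(os.environ.get("HF_TOKEN"))

if not hf_token_available():
    print("HF_TOKEN is not set; run: export HF_TOKEN=your-token")
```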
Basic Text Generation
from huggingface_hub import InferenceClient
client = InferenceClient()
# Generate text
response = client.text_generation(
    prompt="Machine learning is",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    max_new_tokens=100
)
print(response)
Available Models
# Popular free models
models = {
    "Mistral-7B-Instruct": "mistralai/Mistral-7B-Instruct-v0.1",
    "Llama-2-7B-Chat": "meta-llama/Llama-2-7b-chat-hf",
    "Zephyr-7B": "HuggingFaceH4/zephyr-7b-beta",
    "Falcon-7B": "tiiuae/falcon-7b-instruct",
}
# Use any of these with the API
for name, model_id in models.items():
    print(f"{name}: {model_id}")
Text Classification
# Classify sentiment
response = client.text_classification(
    text="I absolutely love this product!",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
print(response)
# [{'label': 'POSITIVE', 'score': 0.9998}]
Summarization
# Summarize text
text = """
Machine learning is a subset of artificial intelligence (AI).
It involves training algorithms on data to make predictions.
Deep learning uses neural networks with multiple layers.
"""
response = client.summarization(
    text=text,
    model="facebook/bart-large-cnn"
)
print(response["summary_text"])
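facebook/bart-large-cnn truncates long inputs (its encoder handles roughly 1024 tokens), so longer documents are usually split into chunks that are summarized separately. A rough character-based splitter (a sketch; `max_chars` is an assumed budget, and production code would count tokens instead):

```python
def chunk_text(text: str, max_chars: int = 3000) -> list[str]:
    """Split text into chunks of roughly max_chars, breaking between sentences."""
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        candidate = (current + ". " if current else "") + sentence
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Summarize each chunk with `client.summarization(...)` and join the partial summaries (or summarize the joined summaries once more).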
Question Answering
# QA with context
response = client.question_answering(
    question="What is machine learning?",
    context="Machine learning is a subset of AI that learns from data.",
    model="deepset/roberta-base-squad2"
)
print(response)
# {'answer': 'subset of AI', 'score': 0.95, 'start': 22, 'end': 34}
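The `start` and `end` fields are character offsets into the context, so the answer can always be recovered by slicing (the fields below are simulated, matching the sample output above):

```python
context = "Machine learning is a subset of AI that learns from data."

# Simulated response fields: start/end are character offsets into the context
answer, start, end = "subset of AI", 22, 34

print(context[start:end])  # subset of AI
```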
Named Entity Recognition
# Extract entities
response = client.token_classification(
    text="My name is John and I work at Google.",
    model="dslim/bert-base-NER"
)
print(response)
# [
# {'entity': 'B-PER', 'score': 0.99, 'word': 'John', ...},
# {'entity': 'B-ORG', 'score': 0.98, 'word': 'Google', ...}
# ]
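The raw output is one entry per token with IOB tags (`B-` starts an entity, `I-` continues one). Merging them into whole entities takes a few lines of plain Python (a sketch over the fields shown above; real model output may also contain subword pieces like `##gle` that need extra handling):

```python
def group_entities(tokens: list[dict]) -> list[dict]:
    """Merge IOB-tagged tokens (B-X starts an entity, I-X continues it)."""
    entities = []
    for tok in tokens:
        tag = tok["entity"]  # e.g. "B-PER" or "I-PER"
        if tag.startswith("B-") or not entities:
            entities.append({"type": tag.split("-", 1)[-1], "text": tok["word"]})
        else:  # "I-" continuation: extend the current entity
            entities[-1]["text"] += " " + tok["word"]
    return entities

sample = [
    {"entity": "B-PER", "score": 0.99, "word": "John"},
    {"entity": "B-ORG", "score": 0.98, "word": "Google"},
]
print(group_entities(sample))
```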
Conversational AI
# Chat interface
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
response = client.text_generation(
    prompt=("You are a helpful assistant.\n\n"
            "User: " + messages[0]["content"] + "\n"
            "Assistant:"),
    model="HuggingFaceH4/zephyr-7b-beta",
    max_new_tokens=100
)
print(response)
Building a Chat Application
class ChatBot:
    def __init__(self, model_id="HuggingFaceH4/zephyr-7b-beta"):
        self.client = InferenceClient()
        self.model_id = model_id
        self.conversation_history = []

    def chat(self, user_message):
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })
        # Build prompt from history, then cue the model to answer
        prompt = "You are a helpful AI assistant.\n\n"
        for msg in self.conversation_history:
            prompt += f"{msg['role'].capitalize()}: {msg['content']}\n"
        prompt += "Assistant: "
        # Generate response
        response = self.client.text_generation(
            prompt=prompt,
            model=self.model_id,
            max_new_tokens=100,
            stop=["\nUser:"]
        )
        # Add to history
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })
        return response
# Usage
bot = ChatBot()
print(bot.chat("What is Python?"))
print(bot.chat("Tell me more about it"))
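Because the prompt re-sends the whole history each turn, long conversations eventually exceed the model's context window (and use up the request quota faster). A simple mitigation is dropping the oldest messages once a character budget is exceeded (a sketch; the budget is an arbitrary assumption):

```python
def trim_history(history: list, max_chars: int = 4000) -> list:
    """Drop the oldest messages until the total content fits the budget."""
    kept = list(history)
    while kept and sum(len(m["content"]) for m in kept) > max_chars:
        kept.pop(0)  # discard the oldest message first
    return kept

# In ChatBot.chat, trim before building the prompt:
# self.conversation_history = trim_history(self.conversation_history)
```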
Streaming Responses
# Stream long responses
client = InferenceClient()
for token in client.text_generation(
    prompt="Write a short story about AI:",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    max_new_tokens=500,
    stream=True
):
    print(token, end="", flush=True)
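With `stream=True` the call yields tokens one at a time; they can be shown as they arrive and joined into the full text afterwards. The collector below works with any iterable of string tokens, including the streaming generator above:

```python
def collect_stream(token_iter) -> str:
    """Print tokens as they arrive and return the assembled text."""
    pieces = []
    for token in token_iter:
        print(token, end="", flush=True)
        pieces.append(token)
    print()
    return "".join(pieces)

# Same usage with the API: collect_stream(client.text_generation(..., stream=True))
demo = collect_stream(iter(["Once ", "upon ", "a ", "time."]))
```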
Rate Limits and Quotas
# Free tier rate limits (approximate; check the current Hugging Face docs):
# - ~30,000 requests per month
# - ~1 request per second
# - No concurrent requests
# For production, upgrade to Pro ($9/month):
# - Higher rate limits
# - Faster inference
# - Priority queue
# Usage can be checked on huggingface.co under Settings → Billing.
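Because the free tier allows roughly one request per second and models may return errors while loading, a retry wrapper with exponential backoff helps smooth over transient failures (a generic sketch, not specific to huggingface_hub):

```python
import time

def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying on exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, propagate the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: with_retries(lambda: client.text_generation("Hello", model=model_id))
```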
Error Handling
from huggingface_hub import InferenceClient, InferenceTimeoutError
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient(timeout=10)  # seconds to wait before giving up
try:
    response = client.text_generation(
        prompt="Hello",
        model="mistralai/Mistral-7B-Instruct-v0.1"
    )
except InferenceTimeoutError:
    print("Model is loading or timed out; retry in a few seconds")
except HfHubHTTPError as e:
    print(f"API error: {e}")
Building a Simple Web App
from flask import Flask, request, jsonify
from huggingface_hub import InferenceClient

app = Flask(__name__)
client = InferenceClient()

@app.route("/generate", methods=["POST"])
def generate():
    data = request.json
    prompt = data.get("prompt")
    max_tokens = data.get("max_tokens", 100)
    if not prompt:
        return jsonify({"error": "prompt is required"}), 400
    try:
        response = client.text_generation(
            prompt=prompt,
            model="mistralai/Mistral-7B-Instruct-v0.1",
            max_new_tokens=max_tokens
        )
        return jsonify({"response": response})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(debug=True)
Cost Optimization
# Free tier strategy:
# 1. Cache responses for repeated queries
# 2. Use smaller models when possible
# 3. Batch requests efficiently
# 4. Consider local models for volume
# Cache implementation
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_generation(prompt):
    response = client.text_generation(
        prompt=prompt,
        model="mistralai/Mistral-7B-Instruct-v0.1",
        max_new_tokens=100
    )
    return response
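`lru_cache` keys on the exact prompt string, so prompts differing only in case or spacing miss the cache. Normalizing before lookup raises the hit rate (a sketch; whether case folding is safe depends on your prompts):

```python
def normalize_prompt(prompt: str) -> str:
    """Collapse whitespace and case so near-identical prompts share a cache entry."""
    return " ".join(prompt.lower().split())

# Usage: cached_generation(normalize_prompt(user_prompt))
```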
Conclusion
The Hugging Face Inference API enables free access to powerful open models, making it well suited to prototyping, hobby projects, and learning. For production use at scale, consider self-hosting or upgrading to Pro.
FAQ
Q: Can I use the free API in production? A: Not recommended due to rate limits (roughly 30k requests/month). Upgrade to Pro ($9/month) for higher limits.
Q: How do I improve response quality? A: Use better prompts, try different models, adjust temperature and max_tokens parameters.
Q: Can I use the API with LangChain? A: Yes, LangChain has Hugging Face integration for seamless API usage.