Groq — Fastest LLM Inference API Guide
Introduction
Groq offers some of the fastest hosted LLM inference available, with sub-second first-token latencies and throughput in the hundreds of tokens per second. This guide covers setup and optimization for ultra-low-latency applications.
- Why Groq is Fast
- Setup
- Basic Completion
- Chat Completion
- Streaming Responses
- Available Models
- Multi-turn Conversation
- Integration with LangChain
- Real-time Application Example
- Comparison: Groq vs Alternatives
- Caching and Rate Limits
- Building a Real-time Chat App
- Conclusion
- FAQ
Why Groq is Fast
Groq uses specialized LPU (Language Processing Unit) chips instead of GPUs, achieving:
- 500+ tokens/second throughput
- Sub-second first token latency
- 10x faster than typical GPU inference
Setup
pip install groq

from groq import Groq

client = Groq(api_key="your-api-key")

# List available models
models = client.models.list()
for model in models.data:
    print(model.id)
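Most later snippets construct Groq() with no arguments; in that case the SDK reads the key from the GROQ_API_KEY environment variable, which keeps credentials out of source code:

# Set the key once in your shell: export GROQ_API_KEY="your-api-key"
from groq import Groq

client = Groq()  # falls back to the GROQ_API_KEY environment variable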
Basic Completion
Groq's API is OpenAI-compatible but serves chat models only, so even a simple single-prompt request goes through the chat completions endpoint:

# Simple single-prompt generation
completion = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
    temperature=0.7
)
print(completion.choices[0].message.content)
Chat Completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in one paragraph."}
]

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response.choices[0].message.content)
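The response follows the OpenAI chat-completion schema, so token counts are available on response.usage; these become useful later when thinking about cost, caching, and rate limits:

# Token accounting from the OpenAI-compatible usage object
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")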
Streaming Responses
# Stream for real-time output
stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "Write a short story"}],
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Available Models
- mixtral-8x7b-32768: 32k-token context, best for complex reasoning
- llama2-70b-4096: 4k-token context, high quality
- gemma-7b-it: small and efficient
The catalog changes over time; use client.models.list() (shown in Setup) to see what is currently served.
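A quick way to pick between them for your own workload is to time the same prompt against each model. A minimal sketch, assuming the model IDs above are still served and reusing the client from Setup:

import time

prompt = "Summarize the theory of relativity in two sentences."
for model_id in ["mixtral-8x7b-32768", "llama2-70b-4096", "gemma-7b-it"]:
    start = time.time()
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    elapsed = time.time() - start
    # completion_tokens comes from the OpenAI-compatible usage object
    print(f"{model_id}: {elapsed:.2f}s, "
          f"{resp.usage.completion_tokens / elapsed:.0f} tokens/s")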
Multi-turn Conversation
class GroqChatBot:
    def __init__(self):
        self.client = Groq()
        self.messages = []

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        response = self.client.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=self.messages,
            max_tokens=256
        )
        assistant_response = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_response})
        return assistant_response

# Usage
bot = GroqChatBot()
print(bot.chat("What is AI?"))
print(bot.chat("Give examples"))
Integration with LangChain
# pip install langchain-groq langchain-core
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGroq(
    temperature=0.7,
    model_name="mixtral-8x7b-32768"
)

prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in simple terms"
)

chain = prompt | llm
result = chain.invoke({"topic": "neural networks"})
print(result.content)
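The same chain also supports token streaming through the standard Runnable interface, which pairs well with Groq's low first-token latency; a short sketch using the chain defined above:

# Stream tokens through the same prompt | llm chain
for chunk in chain.stream({"topic": "neural networks"}):
    print(chunk.content, end="", flush=True)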
Real-time Application Example
import time
from groq import Groq

def low_latency_qa(question: str) -> dict:
    """Return the answer along with end-to-end latency in milliseconds."""
    client = Groq()
    start = time.time()
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": question}],
        max_tokens=256
    )
    latency = (time.time() - start) * 1000
    return {
        "answer": response.choices[0].message.content,
        "latency_ms": latency
    }

# Example
result = low_latency_qa("What is the capital of France?")
print(f"Answer: {result['answer']}")
print(f"Latency: {result['latency_ms']:.0f}ms")
Comparison: Groq vs Alternatives
| Provider    | Latency | Speed            | Cost          |
|-------------|---------|------------------|---------------|
| Groq        | <500ms  | 500+ tokens/s    | Moderate      |
| OpenAI      | 1-2s    | 50-100 tokens/s  | High          |
| Together    | 1-2s    | 100-200 tokens/s | Low           |
| Local (GPU) | 2-5s    | 50-200 tokens/s  | Hardware cost |
Caching and Rate Limits
# Groq's rate limits are generous but vary by model and plan;
# check the Groq console for current quotas.
# Caching identical queries saves both latency and quota.
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_response(question: str) -> str:
    """Cache identical queries (reuses the client created in Setup)."""
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": question}],
        max_tokens=256
    )
    return response.choices[0].message.content

# First call hits the API
result1 = cached_response("What is AI?")
# Second call is served from the cache
result2 = cached_response("What is AI?")
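If you do exceed your quota, the SDK raises a rate-limit error (HTTP 429) that you can retry with backoff. A sketch assuming the groq SDK's RateLimitError exception, which mirrors the OpenAI client's error classes, and reusing the client from Setup:

import time
import groq

def ask_with_retry(question: str, max_retries: int = 3) -> str:
    """Retry on HTTP 429 responses with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="mixtral-8x7b-32768",
                messages=[{"role": "user", "content": question}],
                max_tokens=256,
            )
            return response.choices[0].message.content
        except groq.RateLimitError:
            # Back off 1s, 2s, 4s, ... before retrying
            time.sleep(2 ** attempt)
    raise RuntimeError("Still rate-limited after retries")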
Building a Real-time Chat App
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from groq import Groq

app = FastAPI()
client = Groq()

@app.get("/stream/{question}")
async def stream_answer(question: str):
    """Stream the answer to the client as it is generated."""
    def generate():
        stream = client.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=[{"role": "user", "content": question}],
            max_tokens=512,
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    # Raw text chunks; switch to "text/event-stream" with "data: ..." framing
    # if the client expects Server-Sent Events.
    return StreamingResponse(generate(), media_type="text/plain")

# Run: uvicorn app:app --reload
# Access: http://localhost:8000/stream/What%20is%20AI
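On the client side, the stream can be consumed incrementally as it arrives. A minimal sketch using httpx; any HTTP client that exposes streaming responses works:

import httpx

# Print the answer chunk by chunk as the server produces it
with httpx.stream("GET", "http://localhost:8000/stream/What%20is%20AI") as r:
    for text in r.iter_text():
        print(text, end="", flush=True)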
Conclusion
Groq is ideal for applications requiring real-time inference. With sub-second first-token latencies and high throughput, it enables new possibilities for interactive LLM applications.
FAQ
Q: When should I use Groq vs other APIs? A: Use Groq for applications prioritizing latency: chatbots, real-time assistants, interactive systems.
Q: Is Groq more expensive? A: Per-token pricing is generally competitive with other hosted APIs, with the inference speed advantage on top.
Q: Can I self-host Groq? A: No. The models run on Groq's own LPU hardware, so for most users it is available only as a hosted API.