Groq — Fastest LLM Inference API Guide

Sanjeev Sharma
4 min read


Introduction

Groq provides some of the fastest LLM inference available, with time-to-first-token measured in milliseconds. This guide covers setup and optimization for ultra-low-latency applications.

Why Groq is Fast

Groq uses specialized LPU (Language Processing Unit) chips instead of GPUs, achieving:

  • 500+ tokens/second throughput
  • Sub-second first token latency
  • 10x faster than typical GPU inference
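
These figures can be checked on your own prompts by timing a request and dividing the generated token count by the elapsed time. A minimal sketch (it assumes GROQ_API_KEY is set in the environment and uses the client shown in the Setup section below):

import time
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

start = time.perf_counter()
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "Summarize the history of computing."}],
    max_tokens=512
)
elapsed = time.perf_counter() - start

# usage.completion_tokens is part of the OpenAI-compatible response schema
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} tokens/s")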

Setup

# Install the SDK
pip install groq

from groq import Groq

client = Groq(api_key="your-api-key")

# List available models
models = client.models.list()
for model in models.data:
    print(model.id)
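
Hard-coding the key is fine for a quick test, but keeping it in an environment variable is safer. The client falls back to GROQ_API_KEY when no api_key argument is passed, so the following is equivalent:

import os
from groq import Groq

# Groq() with no arguments also reads GROQ_API_KEY from the environment
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))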

Basic Completion

# Text generation — Groq exposes an OpenAI-compatible chat completions
# endpoint, so a single user message serves as the prompt
completion = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
    temperature=0.7
)

print(completion.choices[0].message.content)

Chat Completion

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": "Explain quantum computing in one paragraph."
    }
]

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=messages,
    max_tokens=256,
    temperature=0.7
)

print(response.choices[0].message.content)
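
In production it is worth guarding chat calls against transient failures. The groq package exposes OpenAI-style error classes such as RateLimitError and APIConnectionError; a minimal retry wrapper might look like this (sketch):

import time
import groq

def chat_with_retry(client, messages, retries=3):
    """Retry on rate limits and transient connection errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="mixtral-8x7b-32768",
                messages=messages,
                max_tokens=256
            )
        except (groq.RateLimitError, groq.APIConnectionError):
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s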

Streaming Responses

# Stream for real-time output
stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[
        {"role": "user", "content": "Write a short story"}
    ],
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
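
For concurrent workloads, the package also ships an async client, AsyncGroq, that mirrors the sync API. A minimal async streaming sketch:

import asyncio
from groq import AsyncGroq

async def stream_async(question: str) -> None:
    client = AsyncGroq()  # reads GROQ_API_KEY from the environment
    stream = await client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": question}],
        max_tokens=512,
        stream=True
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(stream_async("Write a short story"))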

Available Models

  • mixtral-8x7b-32768: 32k context, best for complex reasoning
  • llama2-70b-4096: 4k context, high quality
  • gemma-7b-it: lightweight and efficient

Model availability changes over time; use client.models.list() (shown in Setup) to see the current list.
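
The suffix in each model name is its context window in tokens. If prompt sizes vary a lot, a rough helper can pick a model accordingly; the 4-characters-per-token ratio below is only an approximation, and the model names are the ones listed above:

def pick_model(prompt: str, reply_budget: int = 512) -> str:
    """Choose a model whose context window fits the prompt plus the expected reply."""
    approx_tokens = len(prompt) // 4 + reply_budget  # ~4 characters per token
    if approx_tokens <= 4096:
        return "llama2-70b-4096"
    return "mixtral-8x7b-32768"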

Multi-turn Conversation

class GroqChatBot:
    def __init__(self):
        self.client = Groq()
        self.messages = []

    def chat(self, user_message: str) -> str:
        self.messages.append({
            "role": "user",
            "content": user_message
        })

        response = self.client.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=self.messages,
            max_tokens=256
        )

        assistant_response = response.choices[0].message.content

        self.messages.append({
            "role": "assistant",
            "content": assistant_response
        })

        return assistant_response

# Usage
bot = GroqChatBot()
print(bot.chat("What is AI?"))
print(bot.chat("Give examples"))

Integration with LangChain

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGroq(
    temperature=0.7,
    model_name="mixtral-8x7b-32768"
)

prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in simple terms"
)

chain = prompt | llm

result = chain.invoke({"topic": "neural networks"})
print(result.content)
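
The chain can also be piped into an output parser so it returns a plain string instead of a message object; this is standard LangChain composition rather than anything Groq-specific:

from langchain_core.output_parsers import StrOutputParser

# prompt | llm | parser yields a str rather than an AIMessage
string_chain = prompt | llm | StrOutputParser()
print(string_chain.invoke({"topic": "gradient descent"}))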

Real-time Application Example

import time
from groq import Groq

def low_latency_qa(question: str) -> dict:
    """Get answer with millisecond latency."""
    client = Groq()

    start = time.time()

    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": question}],
        max_tokens=256
    )

    latency = (time.time() - start) * 1000

    return {
        "answer": response.choices[0].message.content,
        "latency_ms": latency
    }

# Example
result = low_latency_qa("What is the capital of France?")
print(f"Answer: {result['answer']}")
print(f"Latency: {result['latency_ms']:.0f}ms")

Comparison: Groq vs Alternatives

Provider    | Latency | Throughput        | Cost
Groq        | <500ms  | 500+ tokens/s     | Moderate
OpenAI      | 1-2s    | 50-100 tokens/s   | High
Together    | 1-2s    | 100-200 tokens/s  | Low
Local (GPU) | 2-5s    | 50-200 tokens/s   | Hardware cost

Caching and Rate Limits

# Groq rate limits vary by plan and model; check the Groq console
# for the current free-tier and paid quotas.

# Cache repeated queries to avoid unnecessary requests
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_response(question: str) -> str:
    """Cache identical queries."""
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": question}],
        max_tokens=256
    )
    return response.choices[0].message.content

# First call hits API
result1 = cached_response("What is AI?")

# Second call uses cache
result2 = cached_response("What is AI?")
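
Note that lru_cache only hits on byte-identical strings. A light normalization step (lowercasing and collapsing whitespace) raises the hit rate for trivially different phrasings; this sketch reuses the client created in Setup:

from functools import lru_cache

@lru_cache(maxsize=1000)
def _cached_answer(normalized_question: str) -> str:
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": normalized_question}],
        max_tokens=256
    )
    return response.choices[0].message.content

def cached_response_normalized(question: str) -> str:
    """Normalize case and whitespace before the cache lookup."""
    return _cached_answer(" ".join(question.lower().split()))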

Building a Real-time Chat App

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from groq import Groq

app = FastAPI()
client = Groq()

@app.get("/stream/{question}")
async def stream_answer(question: str):
    """Stream answer in real-time."""
    def generate():
        stream = client.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=[{"role": "user", "content": question}],
            max_tokens=512,
            stream=True
        )

        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(generate(), media_type="text/event-stream")

# Run: uvicorn app:app --reload
# Access: http://localhost:8000/stream/What%20is%20AI
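
On the client side, the streamed response can be consumed with requests; the URL below is the local development address from the comments above:

import requests

with requests.get(
    "http://localhost:8000/stream/What%20is%20AI", stream=True
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)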

Conclusion

Groq is ideal for applications requiring real-time inference. With sub-second first-token latencies and high throughput, it enables new possibilities for interactive LLM applications.

FAQ

Q: When should I use Groq vs other APIs? A: Use Groq for applications prioritizing latency: chatbots, real-time assistants, interactive systems.

Q: Is Groq more expensive? A: Pricing is competitive with OpenAI, and you get a significant speed advantage at a comparable cost.

Q: Can I self-host Groq? A: No, Groq is API-only. You need their specialized hardware.


Written by

Sanjeev Sharma

Full Stack Engineer · E-mopro