Gemma — Google Open Source LLM Guide
Introduction
Google's Gemma models are a family of lightweight, open-weight LLMs built from the same research and technology as Gemini, with an emphasis on efficiency and responsible AI. This guide covers setup, usage, and deployment of Gemma across various platforms.
- Gemma Model Variants
- Quick Start
- Chat Implementation
- Quantization
- Safety Features
- Integration with LangChain
- Fine-tuning Gemma
- Deployment
- Benchmarks
- Model Comparison
- Conclusion
- FAQ
Gemma Model Variants
Gemma-2B: Ultra-efficient
- Parameters: 2B
- Context: 8K tokens
- Use: Mobile, embedded systems
Gemma-7B: Balanced
- Parameters: 7B
- Context: 8K tokens
- Use: Production systems
Gemma 2: Improved versions
- 9B and 27B variants
- Better instruction following
- Improved quality
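On Hugging Face, the variants above map to the following model IDs; the -it suffix marks the instruction-tuned (chat) checkpoints, while the plain names are the base pretrained models. A small lookup sketch:
# Hugging Face model IDs for the Gemma variants described above
GEMMA_MODELS = {
    "gemma-2b": "google/gemma-2b",
    "gemma-2b-it": "google/gemma-2b-it",
    "gemma-7b": "google/gemma-7b",
    "gemma-7b-it": "google/gemma-7b-it",
    "gemma-2-9b-it": "google/gemma-2-9b-it",
    "gemma-2-27b-it": "google/gemma-2-27b-it",
}
print(GEMMA_MODELS["gemma-7b-it"])  # google/gemma-7b-it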
Quick Start
# Accept the Gemma license on Hugging Face (or Kaggle) to gain access to the weights
# Then authenticate locally, e.g. with `huggingface-cli login`
pip install transformers torch accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "google/gemma-7b-it" # Instruction-tuned version
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Generate
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Chat Implementation
def chat_with_gemma(messages: list) -> str:
    """Chat with a Gemma instruction-tuned model."""
    # Gemma turn format: <start_of_turn>user\nMessage<end_of_turn>\n<start_of_turn>model\n
    formatted = ""
    for msg in messages:
        role = "user" if msg["role"] == "user" else "model"
        formatted += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"
    formatted += "<start_of_turn>model\n"
    inputs = tokenizer(formatted, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7
    )
    # Decode only the newly generated tokens; skip_special_tokens strips the
    # turn markers, so splitting on them would not work
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Usage
messages = [{"role": "user", "content": "What is AI?"}]
response = chat_with_gemma(messages)
print(response)
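Recent transformers releases can also build this turn format for you via the tokenizer's chat template, which avoids hand-rolling the markers. A minimal sketch reusing the tokenizer and model loaded in Quick Start:
# Same chat call using the tokenizer's built-in chat template
messages = [{"role": "user", "content": "What is AI?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))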
Quantization
from transformers import BitsAndBytesConfig
# 8-bit quantization (requires: pip install bitsandbytes)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    quantization_config=bnb_config,
    device_map="auto"
)
# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    quantization_config=bnb_config,
    device_map="auto"
)
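To see what quantization saves in practice, you can compare the loaded model's memory footprint with the get_memory_footprint helper; the figures in the comment are rough ballpark numbers, not measurements:
# Report how much memory the currently loaded model occupies
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
# Rough expectations for a 7B model: ~14 GB in fp16, ~8 GB in 8-bit, ~5 GB in 4-bit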
Safety Features
Gemma emphasizes responsible AI:
# Gemma's instruction-tuned checkpoints are safety-tuned and may decline some prompts
# Handle refusals gracefully in application code
def safe_generate(prompt: str, max_retries: int = 3):
    """Generate with basic handling for declined or empty responses."""
    for attempt in range(max_retries):
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7
        )
        # Decode only the newly generated tokens
        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )
        # Heuristic: treat an empty completion as a decline and retry with a reframed prompt
        if response.strip():
            return response
        prompt = f"Please provide factual information about: {prompt}"
    return "Unable to generate response"
Integration with LangChain
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
# Use Ollama for convenience
llm = Ollama(model="gemma:7b")
# Or use Transformers directly
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
hf_pipeline = pipeline(
    "text-generation",
    model="google/gemma-7b-it",
    torch_dtype=torch.float16,
    device_map="auto"
)
llm = HuggingFacePipeline(pipeline=hf_pipeline)
# Use in chain
prompt = ChatPromptTemplate.from_template("Explain {topic}")
chain = prompt | llm
result = chain.invoke({"topic": "neural networks"})
Fine-tuning Gemma
from peft import get_peft_model, LoraConfig
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it"
)
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Fine-tune with Trainer (see previous guides)
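As a rough illustration of that step, here is a minimal Trainer sketch. The dataset (Abirate/english_quotes) and the hyperparameters are placeholder choices for demonstration, not recommendations:
from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Small public dataset used purely as an illustration
raw = load_dataset("Abirate/english_quotes", split="train[:200]")
def tokenize(batch):
    return tokenizer(batch["quote"], truncation=True, max_length=128)
train_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

training_args = TrainingArguments(
    output_dir="./gemma-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("./gemma-lora")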
Deployment
# FastAPI deployment (wrap in a Docker image for production if needed)
cat > app.py << 'EOF'
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    torch_dtype=torch.float16,
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response}
EOF
# Run
pip install fastapi uvicorn
uvicorn app:app
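To smoke-test the endpoint, assuming uvicorn is serving on its default port 8000 locally:
import requests

# Hypothetical local test client for the /generate endpoint above
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"prompt": "What is machine learning?", "max_tokens": 128},
)
print(resp.json()["response"])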
Benchmarks
import time

models = [
    "google/gemma-2b",
    "google/gemma-7b-it",
]

for model_id in models:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    prompt = "Explain machine learning"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    outputs = model.generate(**inputs, max_new_tokens=100)
    elapsed = time.time() - start
    print(f"{model_id}: {elapsed:.2f}s")

# Rough expectations for 100 new tokens on a single modern GPU (varies with hardware):
# Gemma-2B: ~1-2s
# Gemma-7B: ~2-3s
Model Comparison
Model      | Size | Speed     | Quality   | Safety
Gemma-2B   | 2B   | Very Fast | Good      | High
Gemma-7B   | 7B   | Fast      | Excellent | High
Mistral-7B | 7B   | Fast      | Excellent | Medium
Llama2-7B  | 7B   | Fast      | Very Good | Medium
Phi-3-mini | 3.8B | Very Fast | Good      | High
Conclusion
Google's Gemma prioritizes responsible AI while maintaining competitive performance, which makes it a strong fit for safety-conscious deployments and applications where ethical considerations matter.
FAQ
Q: When does Gemma decline requests? A: On potentially harmful prompts (violence, illegal content, etc.). This is by design for responsible AI.
Q: Is Gemma better than Mistral? A: Comparable quality. Gemma emphasizes safety; Mistral emphasizes speed. Choose based on your priorities.
Q: Can I use Gemma commercially? A: Yes. Gemma is released under Google's Gemma Terms of Use, which permit commercial use subject to the accompanying prohibited use policy; review the current terms before deploying.