Gemma — Google Open Source LLM Guide

Sanjeev Sharma
4 min read

Introduction

Google's Gemma models represent a new generation of efficient, responsible open-source LLMs. This guide covers setup, usage, and deployment of Gemma across various platforms.

Gemma Model Variants

Gemma-2B: Ultra-efficient
- Parameters: 2B
- Context: 8K tokens
- Use: Mobile, embedded systems

Gemma-7B: Balanced
- Parameters: 7B
- Context: 8K tokens
- Use: Production systems

Gemma 2: Improved versions
- 9B and 27B variants
- Better instruction following
- Improved quality
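
For reference, these variants map to Hugging Face model IDs roughly as follows (the -it suffix marks the instruction-tuned checkpoints):

# Hugging Face model IDs for the Gemma variants above
GEMMA_MODELS = {
    "gemma-2b":       "google/gemma-2b",        # base 2B
    "gemma-2b-it":    "google/gemma-2b-it",     # instruction-tuned 2B
    "gemma-7b":       "google/gemma-7b",        # base 7B
    "gemma-7b-it":    "google/gemma-7b-it",     # instruction-tuned 7B
    "gemma-2-9b-it":  "google/gemma-2-9b-it",   # Gemma 2, 9B instruction-tuned
    "gemma-2-27b-it": "google/gemma-2-27b-it",  # Gemma 2, 27B instruction-tuned
}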

Quick Start

# Accept the Gemma license on the model page (Kaggle or Hugging Face)
# before downloading the weights, then install the dependencies:
pip install transformers torch accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-7b-it"  # Instruction-tuned version

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Implementation

def chat_with_gemma(messages: list) -> str:
    """Chat with Gemma instruction-tuned model."""
    # Format: <start_of_turn>user\nMessage<end_of_turn>\n<start_of_turn>model\n

    formatted = ""
    for msg in messages:
        role = "user" if msg["role"] == "user" else "model"
        formatted += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"

    formatted += "<start_of_turn>model\n"

    inputs = tokenizer(formatted, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7
    )

    # Decode only the newly generated tokens; otherwise the echoed prompt
    # is included (and the turn markers are stripped as special tokens)
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Usage
messages = [{"role": "user", "content": "What is AI?"}]
response = chat_with_gemma(messages)
print(response)

Quantization

from transformers import BitsAndBytesConfig

# 8-bit quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    quantization_config=bnb_config,
    device_map="auto"
)

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    quantization_config=bnb_config,
    device_map="auto"
)
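
To sanity-check the savings, you can query the model's approximate memory usage with get_memory_footprint(). A rough sketch (actual figures depend on the quantization settings):

# Rough check of the memory savings (uses the quantized model loaded above)
footprint_gb = model.get_memory_footprint() / 1e9
print(f"Approximate model memory footprint: {footprint_gb:.1f} GB")
# Expect roughly half the fp16 footprint with 8-bit and around a quarter with 4-bit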

Safety Features

Gemma emphasizes responsible AI:

# Gemma has built-in safety features
# Some prompts may be declined

# Handle safety refusals
def safe_generate(prompt: str, max_retries: int = 3):
    """Generate with simple handling for safety refusals."""
    for _ in range(max_retries):
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7
        )

        # Decode only the newly generated tokens (not the echoed prompt)
        new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # Simple heuristic: treat an empty generation as a decline
        if response.strip():
            return response

        # Rephrase the prompt and retry
        prompt = f"Please provide factual information about: {prompt}"

    return "Unable to generate response"
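
A quick usage check might look like this (the prompt is just an example):

# Example usage
print(safe_generate("How do large language models work?"))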

Integration with LangChain

from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

# Use Ollama for convenience
llm = Ollama(model="gemma:7b")

# Or use Transformers directly
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline

hf_pipeline = pipeline(
    "text-generation",
    model="google/gemma-7b-it",
    torch_dtype=torch.float16,
    device_map="auto"
)

llm = HuggingFacePipeline(pipeline=hf_pipeline)

# Use in chain
prompt = PromptTemplate.from_template(
    "Explain {topic}"
)

chain = prompt | llm

result = chain.invoke({"topic": "neural networks"})

Fine-tuning Gemma

from peft import get_peft_model, LoraConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Fine-tune with Trainer (see previous guides)
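
As a rough sketch of that step (not a full recipe), a minimal Trainer setup could look like the following; train_dataset is assumed to be a tokenized dataset you have prepared separately:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="gemma-lora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True  # assumes bfloat16-capable hardware; drop otherwise
)

trainer = Trainer(
    model=model,                      # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_dataset,      # hypothetical, tokenized elsewhere
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()
model.save_pretrained("gemma-lora-adapter")  # saves only the LoRA adapter weights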

Deployment

# Create the API server (app.py)
cat > app.py << 'EOF'
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response}
EOF

# Run
pip install fastapi uvicorn
uvicorn app:app
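
To containerize the server, a minimal Dockerfile might look like this; the image name and port are arbitrary examples, and GPU inference would need an NVIDIA CUDA base image instead:

# Minimal Dockerfile (CPU-only sketch)
cat > Dockerfile << 'EOF'
FROM python:3.11-slim
WORKDIR /app
RUN pip install --no-cache-dir fastapi uvicorn transformers torch accelerate
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
EOF

# Build and run; Gemma weights are gated on Hugging Face, so pass an access
# token (e.g. via the HF_TOKEN environment variable) for the first download
docker build -t gemma-api .
docker run -p 8000:8000 -e HF_TOKEN=<your-token> gemma-api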

Benchmarks

import time

models = [
    "google/gemma-2b",
    "google/gemma-7b-it",
]

for model_id in models:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    prompt = "Explain machine learning"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.time()
    outputs = model.generate(**inputs, max_new_tokens=100)
    elapsed = time.time() - start

    print(f"{model_id}: {elapsed:.2f}s")

    # Free GPU memory before loading the next model
    del model
    torch.cuda.empty_cache()

# Rough expectations (hardware-dependent):
# Gemma-2B: 1-2s
# Gemma-7B: 2-3s

Model Comparison

Model       | Size | Speed     | Quality   | Safety
Gemma-2B    | 2B   | Very Fast | Good      | High
Gemma-7B    | 7B   | Fast      | Excellent | High
Mistral-7B  | 7B   | Fast      | Excellent | Medium
Llama2-7B   | 7B   | Fast      | Very Good | Medium
Phi-3-mini  | 3.8B | Very Fast | Good      | High

Conclusion

Google's Gemma prioritizes responsible AI while maintaining competitive performance, making it a strong choice for safety-conscious deployments and applications where ethical considerations matter.

FAQ

Q: When does Gemma decline requests? A: On potentially harmful prompts (violence, illegal content, etc.). This is by design for responsible AI.

Q: Is Gemma better than Mistral? A: Comparable quality. Gemma emphasizes safety; Mistral emphasizes speed. Choose based on your priorities.

Q: Can I use Gemma commercially? A: Yes, with proper licensing. Check Google's license for commercial use terms.

Written by Sanjeev Sharma
Full Stack Engineer · E-mopro