Phi-3 — Microsoft Small Language Model Guide
Introduction
Microsoft's Phi-3 family proves that smaller models can be remarkably capable. These 3.8B to 14B parameter models reach quality close to GPT-3.5 on many common benchmarks thanks to careful data curation and training. This guide covers how to run, integrate, fine-tune, and deploy Phi-3.
- Phi-3 Model Variants
- Quick Start with Ollama
- Installation with Transformers
- Structured Output
- Memory Efficient Setup
- Chat Format
- Integration with LangChain
- Deployment on Edge Devices
- Benchmarks
- Fine-tuning Phi-3
- Comparison Table
- Production Deployment
- Conclusion
- FAQ
Phi-3 Model Variants
Phi-3-mini (3.8B): Smallest, fastest
- Context: 4K tokens (a 128K-context variant is also available)
- Use: Mobile, embedded, edge
Phi-3-small (7B): Balanced
- Context: 8K tokens (128K variant available)
- Use: Consumer hardware, servers
Phi-3-medium (14B): Most capable
- Context: 4K tokens (128K variant available)
- Use: High-quality applications
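For reference, these are the corresponding instruct checkpoints on Hugging Face; the long-context builds follow the same naming with "128k" in place of "4k"/"8k":
# Hugging Face model ids for the standard-context instruct variants
PHI3_MODELS = {
    "mini": "microsoft/Phi-3-mini-4k-instruct",
    "small": "microsoft/Phi-3-small-8k-instruct",
    "medium": "microsoft/Phi-3-medium-4k-instruct",
}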
Quick Start with Ollama
# Download and run
ollama pull phi3
ollama run phi3
# Interactive usage
>>> What is quantum computing?
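Ollama also exposes a local REST API (port 11434 by default), so you can call Phi-3 programmatically with nothing more than the requests library; a minimal sketch:
import requests

# Query the local Ollama server (assumes `ollama pull phi3` has already run)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3", "prompt": "What is quantum computing?", "stream": False},
)
print(resp.json()["response"])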
Installation with Transformers
pip install transformers torch accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Phi-3-mini (smallest, most efficient)
model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Generate
inputs = tokenizer("What is AI?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
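If you want tokens to appear as they are generated rather than all at once, you can attach a TextStreamer to the same setup:
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, max_new_tokens=128, streamer=streamer)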
Structured Output
import json

def generate_json(topic: str):
    """Generate structured JSON with Phi-3."""
    prompt = f"""Generate JSON about {topic}.
Fields: name, description, examples (list).
Return only JSON:"""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.3
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the first JSON object from the response
    try:
        json_start = response.find('{')
        json_end = response.rfind('}') + 1
        return json.loads(response[json_start:json_end])
    except ValueError:
        return None
result = generate_json("machine learning")
print(result)
Memory Efficient Setup
# Phi-3-mini (3.8B) is small enough to run without quantization on most machines
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct"
)
# It even runs on CPU (slower, but usable)
model = model.to("cpu")
# Or load in 8-bit if GPU memory is limited (requires bitsandbytes)
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
quantization_config=bnb_config
)
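For even tighter budgets (roughly the ~2GB figure in the comparison table below), 4-bit quantization with bitsandbytes is another option; a sketch, assuming bitsandbytes is installed:
# 4-bit NF4 quantization: fits Phi-3-mini in roughly 2GB of GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto"
)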
Chat Format
def chat_phi3(messages: list) -> str:
    """Chat with Phi-3 using its native prompt format."""
    # Build the prompt with Phi-3's <|user|> / <|assistant|> / <|end|> tags
    chat_prompt = ""
    for msg in messages:
        if msg["role"] == "user":
            chat_prompt += f"<|user|>\n{msg['content']}<|end|>\n"
        elif msg["role"] == "assistant":
            chat_prompt += f"<|assistant|>\n{msg['content']}<|end|>\n"
    chat_prompt += "<|assistant|>\n"
    inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens so the prompt isn't echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Usage
messages = [{"role": "user", "content": "What is Python?"}]
print(chat_phi3(messages))
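Alternatively, Phi-3's tokenizer ships with a chat template, so you can let it build the prompt rather than formatting the tags by hand:
# Equivalent prompt construction via the tokenizer's built-in chat template
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))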
Integration with LangChain
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
# Use Ollama for easy integration
llm = Ollama(model="phi3")
# Create chain
prompt = ChatPromptTemplate.from_template(
"Explain {topic} in one sentence"
)
chain = prompt | llm
result = chain.invoke({"topic": "neural networks"})
print(result)
Deployment on Edge Devices
# Phi-3-mini runs on:
# - iPhones (via ONNX Runtime)
# - Android phones (via TensorFlow Lite)
# - Raspberry Pi 4 (with quantization)
# - AWS Greengrass edge devices
# Example: running on a Raspberry Pi with ONNX Runtime (a minimal single
# forward pass; a full chat loop also needs tokenization, sampling, and
# iterative decoding)
import numpy as np
from onnxruntime import InferenceSession

# Load an exported ONNX model (path is illustrative)
model_path = "phi-3-mini-onnx/model.onnx"
session = InferenceSession(model_path)

# Run inference on already-tokenized input ids; depending on how the model
# was exported, extra inputs such as attention_mask may also be required
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
input_data = np.array([[1, 4521, 338]], dtype=np.int64)  # placeholder token ids
result = session.run([output_name], {input_name: input_data})
Benchmarks
import time
def benchmark_phi3():
    """Benchmark generation latency across Phi-3 models."""
    prompt = "Explain quantum computing in 100 words"
    for model_name in ["microsoft/Phi-3-mini-4k-instruct",
                       "microsoft/Phi-3-small-8k-instruct"]:
        # Load the matching tokenizer for each model
        # (Phi-3-small ships custom modeling code, hence trust_remote_code)
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16, trust_remote_code=True
        ).to("cuda")
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        start = time.time()
        outputs = model.generate(**inputs, max_new_tokens=100)
        elapsed = time.time() - start
        print(f"{model_name}: {elapsed:.2f}s")
benchmark_phi3()
# Expected (rough figures, hardware-dependent):
# Phi-3-mini: 2-3s
# Phi-3-small: 3-4s
Fine-tuning Phi-3
from peft import get_peft_model, LoraConfig
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct"
)
# Apply LoRA (Phi-3 uses a fused QKV projection, so target qkv_proj and o_proj)
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["qkv_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Fine-tune on your data (see previous guides); a minimal training sketch follows
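As a minimal sketch of that step (assuming a hypothetical train.jsonl with a "text" field, and reusing the LoRA-wrapped model and tokenizer from above):
from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Load and tokenize a small instruction dataset (file name is illustrative)
dataset = load_dataset("json", data_files="train.jsonl", split="train")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="phi3-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
model.save_pretrained("phi3-lora")  # saves only the LoRA adapter weights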
Comparison Table
Model         | Size | Speed     | Quality   | Memory (4-bit)
--------------|------|-----------|-----------|---------------
Phi-3-mini    | 3.8B | Very Fast | Good      | ~2GB
Phi-3-small   | 7B   | Fast      | Very Good | ~4GB
Mistral-7B    | 7B   | Fast      | Excellent | ~4GB
Llama2-7B     | 7B   | Fast      | Very Good | ~4GB
GPT-3.5-turbo | N/A  | Slow      | Excellent | N/A (API)
Production Deployment
# Docker deployment
cat > Dockerfile << 'EOF'
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
EOF
# Run
docker build -t phi3-app .
docker run -it phi3-app
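The Dockerfile above copies an app.py that is not shown; a minimal serving script for it might look like this (a sketch assuming fastapi, uvicorn, transformers, and torch are listed in requirements.txt; the /generate endpoint and port 8000 are illustrative choices):
# app.py -- minimal HTTP wrapper around Phi-3-mini
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import uvicorn

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
If app.py serves HTTP like this, publish the port when starting the container, e.g. docker run -p 8000:8000 phi3-app.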
Conclusion
Phi-3 demonstrates that size isn't everything. With careful training, small models can achieve impressive capabilities while remaining deployable on consumer hardware and edge devices.
FAQ
Q: Should I use Phi-3 or Mistral? A: Phi-3-mini for speed and small footprint. Mistral-7B for highest quality in 7B range.
Q: Can Phi-3 replace ChatGPT? A: For specific tasks yes. For general use, GPT-4 is still superior, but Phi-3 is excellent value.
Q: How do I deploy Phi-3 on mobile? A: Use ONNX Runtime for iOS or TensorFlow Lite for Android. Phi-3-mini fits in ~2GB.