Phi-3 — Microsoft Small Language Model Guide
Introduction
Microsoft's Phi-3 family proves that smaller models can be remarkably capable. These 3.8B to 14B parameter models reach quality close to GPT-3.5 on many common benchmarks thanks to careful data curation and training. This guide covers how to run, integrate, fine-tune, and deploy Phi-3.
- Phi-3 Model Variants
- Quick Start with Ollama
- Installation with Transformers
- Structured Output
- Memory Efficient Setup
- Chat Format
- Integration with LangChain
- Deployment on Edge Devices
- Benchmarks
- Fine-tuning Phi-3
- Comparison Table
- Production Deployment
- Conclusion
- FAQ
Phi-3 Model Variants
Phi-3-mini (3.8B): Smallest, fastest
- Context: 4K tokens (a 128K-context variant is also available)
- Use: Mobile, embedded, edge
Phi-3-small (7B): Balanced
- Context: 8K tokens (128K variant available)
- Use: Consumer hardware, servers
Phi-3-medium (14B): Most capable
- Context: 4K tokens (128K variant available)
- Use: High-quality applications
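For reference, these are the corresponding instruct checkpoints on Hugging Face; the long-context builds follow the same naming with "128k" in place of "4k"/"8k":
# Hugging Face model ids for the standard-context instruct variants
PHI3_MODELS = {
    "mini": "microsoft/Phi-3-mini-4k-instruct",
    "small": "microsoft/Phi-3-small-8k-instruct",
    "medium": "microsoft/Phi-3-medium-4k-instruct",
}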
Quick Start with Ollama
# Download and run
ollama pull phi3
ollama run phi3
# Interactive usage
>>> What is quantum computing?
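Ollama also exposes a local REST API (port 11434 by default), so you can call Phi-3 programmatically with nothing more than the requests library; a minimal sketch:
import requests

# Query the local Ollama server (assumes `ollama pull phi3` has already run)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3", "prompt": "What is quantum computing?", "stream": False},
)
print(resp.json()["response"])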
Installation with Transformers
pip install transformers torch accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Phi-3-mini (smallest, most efficient)
model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Generate
inputs = tokenizer("What is AI?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
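If you want tokens to appear as they are generated rather than all at once, you can attach a TextStreamer to the same setup:
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, max_new_tokens=128, streamer=streamer)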
Structured Output
import json

def generate_json(topic: str):
    """Generate structured JSON with Phi-3."""
    prompt = f"""Generate JSON about {topic}.
Fields: name, description, examples (list).
Return only JSON:"""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.3
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the first JSON object from the response
    try:
        json_start = response.find('{')
        json_end = response.rfind('}') + 1
        return json.loads(response[json_start:json_end])
    except ValueError:
        return None
result = generate_json("machine learning")
print(result)
Memory Efficient Setup
# Phi-3-mini (3.8B) is small enough to run without quantization on most machines
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct"
)
# It even runs on CPU (slower, but usable)
model = model.to("cpu")
# Or load in 8-bit if GPU memory is limited (requires bitsandbytes)
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
quantization_config=bnb_config
)
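For even tighter budgets (roughly the ~2GB figure in the comparison table below), 4-bit quantization with bitsandbytes is another option; a sketch, assuming bitsandbytes is installed:
# 4-bit NF4 quantization: fits Phi-3-mini in roughly 2GB of GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto"
)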
Chat Format
def chat_phi3(messages: list) -> str:
    """Chat with Phi-3 using its native prompt format."""
    # Build the prompt with Phi-3's <|user|> / <|assistant|> / <|end|> tags
    chat_prompt = ""
    for msg in messages:
        if msg["role"] == "user":
            chat_prompt += f"<|user|>\n{msg['content']}<|end|>\n"
        elif msg["role"] == "assistant":
            chat_prompt += f"<|assistant|>\n{msg['content']}<|end|>\n"
    chat_prompt += "<|assistant|>\n"
    inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens so the prompt isn't echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Usage
messages = [{"role": "user", "content": "What is Python?"}]
print(chat_phi3(messages))
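Alternatively, Phi-3's tokenizer ships with a chat template, so you can let it build the prompt rather than formatting the tags by hand:
# Equivalent prompt construction via the tokenizer's built-in chat template
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))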
Integration with LangChain
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
# Use Ollama for easy integration
llm = Ollama(model="phi3")
# Create chain
prompt = ChatPromptTemplate.from_template(
"Explain {topic} in one sentence"
)
chain = prompt | llm
result = chain.invoke({"topic": "neural networks"})
print(result)
Deployment on Edge Devices
# Phi-3-mini runs on:
# - iPhones (via ONNX Runtime)
# - Android phones (via TensorFlow Lite)
# - Raspberry Pi 4 (with quantization)
# - AWS Greengrass edge devices
# Example: running on a Raspberry Pi with ONNX Runtime (a minimal single
# forward pass; a full chat loop also needs tokenization, sampling, and
# iterative decoding)
import numpy as np
from onnxruntime import InferenceSession

# Load an exported ONNX model (path is illustrative)
model_path = "phi-3-mini-onnx/model.onnx"
session = InferenceSession(model_path)

# Run inference on already-tokenized input ids; depending on how the model
# was exported, extra inputs such as attention_mask may also be required
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
input_data = np.array([[1, 4521, 338]], dtype=np.int64)  # placeholder token ids
result = session.run([output_name], {input_name: input_data})
Benchmarks
import time
def benchmark_phi3():
    """Benchmark generation latency across Phi-3 models."""
    prompt = "Explain quantum computing in 100 words"
    for model_name in ["microsoft/Phi-3-mini-4k-instruct",
                       "microsoft/Phi-3-small-8k-instruct"]:
        # Load the matching tokenizer for each model
        # (Phi-3-small ships custom modeling code, hence trust_remote_code)
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16, trust_remote_code=True
        ).to("cuda")
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        start = time.time()
        outputs = model.generate(**inputs, max_new_tokens=100)
        elapsed = time.time() - start
        print(f"{model_name}: {elapsed:.2f}s")
benchmark_phi3()
# Expected (rough figures, hardware-dependent):
# Phi-3-mini: 2-3s
# Phi-3-small: 3-4s
Fine-tuning Phi-3
from peft import get_peft_model, LoraConfig
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct"
)
# Apply LoRA (Phi-3 uses a fused QKV projection, so target qkv_proj and o_proj)
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["qkv_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Fine-tune on your data (see previous guides); a minimal training sketch follows
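As a minimal sketch of that step (assuming a hypothetical train.jsonl with a "text" field, and reusing the LoRA-wrapped model and tokenizer from above):
from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Load and tokenize a small instruction dataset (file name is illustrative)
dataset = load_dataset("json", data_files="train.jsonl", split="train")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="phi3-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
model.save_pretrained("phi3-lora")  # saves only the LoRA adapter weights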
Comparison Table
Model         | Size | Speed     | Quality   | Memory (4-bit)
--------------|------|-----------|-----------|---------------
Phi-3-mini    | 3.8B | Very Fast | Good      | ~2GB
Phi-3-small   | 7B   | Fast      | Very Good | ~4GB
Mistral-7B    | 7B   | Fast      | Excellent | ~4GB
Llama2-7B     | 7B   | Fast      | Very Good | ~4GB
GPT-3.5-turbo | N/A  | Slow      | Excellent | N/A (API)
Production Deployment
# Docker deployment
cat > Dockerfile << 'EOF'
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
EOF
# Run
docker build -t phi3-app .
docker run -it phi3-app
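The Dockerfile above copies an app.py that is not shown; a minimal serving script for it might look like this (a sketch assuming fastapi, uvicorn, transformers, and torch are listed in requirements.txt; the /generate endpoint and port 8000 are illustrative choices):
# app.py -- minimal HTTP wrapper around Phi-3-mini
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import uvicorn

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
If app.py serves HTTP like this, publish the port when starting the container, e.g. docker run -p 8000:8000 phi3-app.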
Conclusion
Phi-3 demonstrates that size isn't everything. With careful training, small models can achieve impressive capabilities while remaining deployable on consumer hardware and edge devices.
FAQ
Q: Should I use Phi-3 or Mistral? A: Phi-3-mini for speed and small footprint. Mistral-7B for highest quality in 7B range.
Q: Can Phi-3 replace ChatGPT? A: For specific tasks yes. For general use, GPT-4 is still superior, but Phi-3 is excellent value.
Q: How do I deploy Phi-3 on mobile? A: Use ONNX Runtime for iOS or TensorFlow Lite for Android. Phi-3-mini fits in ~2GB.