Sentence Transformers — Generate Embeddings Locally
Introduction
Sentence Transformers provide state-of-the-art embeddings that run locally on your machine. This guide covers installation, usage, fine-tuning, and optimization for production systems.
- Installation and Setup
- Popular Models
- Choosing a Model
- Batch Encoding
- Semantic Search Implementation
- Similarity Metrics
- Building a RAG System Locally
- Fine-tuning for Your Domain
- Multilingual Embeddings
- GPU Acceleration
- Integration with Vector Stores
- Performance Optimization
- Cost and Performance Comparison
- Conclusion
- FAQ
Installation and Setup
pip install sentence-transformers torch
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Test it
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape) # (2, 384)
Popular Models
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | General purpose |
| all-mpnet-base-v2 | 768 | Fast | Excellent | Best quality/speed balance |
| multi-qa-MiniLM-L6-cos-v1 | 384 | Very Fast | Good | Q&A |
| paraphrase-MiniLM-L6-v2 | 384 | Very Fast | Good | Semantic similarity |
| all-roberta-large-v1 | 1024 | Slow | Excellent | High accuracy |
Choosing a Model
# For speed (prototyping)
model = SentenceTransformer('all-MiniLM-L6-v2')
# For accuracy (production)
model = SentenceTransformer('all-mpnet-base-v2')
# For Q&A tasks
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# Check model info
print(model.get_sentence_embedding_dimension()) # Get embedding dimension
print(model.get_max_seq_length()) # Get max token length
Batch Encoding
# Efficient batch encoding
sentences = [
    "Sentence 1",
    "Sentence 2",
    # ... thousands more
]
# Batch processing with GPU
embeddings = model.encode(
    sentences,
    batch_size=32,
    convert_to_tensor=False,  # Return numpy arrays
    show_progress_bar=True
)
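For corpora with hundreds of thousands of sentences, the library also provides a multi-process encoder that spreads work across CPU workers or multiple GPUs. A minimal sketch, reusing the sentences list above (the __main__ guard is required because worker processes are spawned):
from sentence_transformers import SentenceTransformer
if __name__ == '__main__':
    model = SentenceTransformer('all-MiniLM-L6-v2')
    pool = model.start_multi_process_pool()  # one worker per GPU, else CPU workers
    embeddings = model.encode_multi_process(sentences, pool, batch_size=32)
    model.stop_multi_process_pool(pool)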
Semantic Search Implementation
from sentence_transformers import util
import torch
# Embed corpus
corpus = [
    "The cat sat on the mat",
    "Dogs are loyal animals",
    "Machine learning is AI",
    "Python is a programming language"
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Query
query = "Tell me about animals"
query_embedding = model.encode(query, convert_to_tensor=True)
# Semantic search
hits = util.semantic_search(
    query_embedding,
    corpus_embeddings,
    top_k=2,
    score_function=util.cos_sim
)
for hit in hits[0]:
    print(f"Score: {hit['score']:.4f}")
    print(f"Text: {corpus[hit['corpus_id']]}")
Similarity Metrics
from sentence_transformers import util
import numpy as np
# Compute similarity matrices
similarity = util.pytorch_cos_sim(corpus_embeddings, query_embedding)
# Or with numpy
from scipy.spatial.distance import cosine
emb1 = model.encode("First sentence")
emb2 = model.encode("Second sentence")
# Cosine similarity
cosine_sim = 1 - cosine(emb1, emb2)
print(f"Cosine similarity: {cosine_sim:.4f}")
# Dot product (equal to cosine similarity when embeddings are unit-normalized)
dot_product = np.dot(emb1, emb2)
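If you plan to rely on dot products, you can ask encode() to L2-normalize its output with the normalize_embeddings flag; for unit-length vectors the dot product and cosine similarity are identical. A minimal sketch:
emb1 = model.encode("First sentence", normalize_embeddings=True)
emb2 = model.encode("Second sentence", normalize_embeddings=True)
print(np.dot(emb1, emb2))  # same value as the cosine similarity above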
Building a RAG System Locally
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List
class LocalRAG:
    def __init__(self, documents: List[str]):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.documents = documents
        self.embeddings = self.model.encode(documents)

    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve the most similar documents to the query."""
        query_emb = self.model.encode(query)
        # Cosine similarity between the query and every document
        similarities = np.dot(self.embeddings, query_emb) / (
            np.linalg.norm(self.embeddings, axis=1) *
            np.linalg.norm(query_emb)
        )
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
# Usage
documents = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Deep learning uses neural networks"
]
rag = LocalRAG(documents)
results = rag.retrieve("What is machine learning?", top_k=2)
print(results)
Fine-tuning for Your Domain
from sentence_transformers import SentenceTransformer, losses, models
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Prepare training data (sentence pairs with scores)
train_examples = [
    InputExample(texts=["Python tutorial", "Learn Python"], label=0.9),
    InputExample(texts=["Machine learning", "Deep learning"], label=0.8),
    InputExample(texts=["Cat", "Dog"], label=0.7),
    InputExample(texts=["Quantum physics", "Pizza"], label=0.1),
]
# Create data loader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Use cosine similarity loss
train_loss = losses.CosineSimilarityLoss(model)
# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100
)
# Save
model.save("fine-tuned-model")
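To check whether fine-tuning actually helped, you can score the model on held-out pairs with the library's EmbeddingSimilarityEvaluator. A minimal sketch using hypothetical evaluation pairs (the exact return format varies across library versions):
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Python tutorial", "Quantum physics"],
    sentences2=["Learn Python", "Pizza"],
    scores=[0.9, 0.1]  # gold similarity labels for each pair
)
print(evaluator(model))  # correlation between model scores and gold labels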
Multilingual Embeddings
# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Encode in multiple languages
sentences = [
    "This is an English sentence",
    "Ceci est une phrase française",
    "Dies ist ein deutscher Satz",
    "这是一个中文句子"
]
embeddings = model.encode(sentences)
# Cross-lingual semantic search works out of the box
import numpy as np
query = model.encode("Tell me about programming")  # English
doc_es = model.encode("Programación en Python")  # Spanish
similarity = np.dot(query, doc_es) / (np.linalg.norm(query) * np.linalg.norm(doc_es))
print(f"Cross-lingual similarity: {similarity:.4f}")
GPU Acceleration
# Use GPU if available
model = SentenceTransformer('all-MiniLM-L6-v2')
model.to('cuda') # or 'cpu'
# Batch size can be larger with GPU
embeddings = model.encode(
    sentences,
    batch_size=64,  # larger batches are feasible on GPU
    device='cuda'
)
# Check GPU usage
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
Integration with Vector Stores
import chromadb
from sentence_transformers import SentenceTransformer
# Create Chroma collection with SentenceTransformer
client = chromadb.Client()
collection = client.create_collection("local-embeddings")
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = ["doc1", "doc2", "doc3"]
embeddings = model.encode(documents).tolist()
collection.add(
    ids=["1", "2", "3"],
    embeddings=embeddings,
    documents=documents
)
# Query
query_emb = model.encode(["query"]).tolist()
results = collection.query(
    query_embeddings=query_emb,
    n_results=2
)
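Alternatively, Chroma can call the model for you: its chromadb.utils.embedding_functions module ships a SentenceTransformerEmbeddingFunction, so you add raw documents and Chroma handles encoding. A minimal sketch:
from chromadb.utils import embedding_functions
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
auto_collection = client.create_collection(
    "auto-embeddings", embedding_function=st_ef
)
auto_collection.add(ids=["1", "2"], documents=["doc1", "doc2"])  # embedded automatically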
Performance Optimization
# 1. Use faster models for production
model = SentenceTransformer('all-MiniLM-L6-v2') # 384 dims, fast
# 2. Reduce batch size if out of memory
embeddings = model.encode(sentences, batch_size=8)
# 3. Use fp16 on GPU for faster computation
model.half()  # convert model weights to float16 (GPU only)
embeddings = model.encode(sentences)
# 4. Cache embeddings
import pickle
# Save embeddings
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)
# Load embeddings
with open('embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f)
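Since encode() returns a NumPy array by default, np.save is a simple alternative to pickle that is faster for large arrays and does not execute arbitrary code on load:
import numpy as np
np.save('embeddings.npy', embeddings)
embeddings = np.load('embeddings.npy')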
Cost and Performance Comparison
Local (Sentence Transformers):
- Cost: Free
- Speed: roughly 100-500 sentences/sec on CPU, 1000+ on GPU (model- and hardware-dependent)
- Privacy: All data stays local
- Setup: Simple pip install
OpenAI API:
- Cost: ~$0.02 per 1M tokens (text-embedding-3-small)
- Speed: Rate limited, but no local computation
- Privacy: Data sent to OpenAI
- Setup: Requires API key
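Throughput depends heavily on the model, hardware, and sentence length, so it is worth measuring on your own machine; a quick benchmark sketch:
import time
sentences = ["A short benchmark sentence."] * 1000
start = time.perf_counter()
model.encode(sentences, batch_size=32, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.0f} sentences/sec")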
Conclusion
Sentence Transformers bring powerful embeddings to your laptop. Whether prototyping locally or deploying in air-gapped environments, they provide excellent quality at zero API cost.
FAQ
Q: Should I use local or cloud embeddings? A: Local for privacy, prototyping, and cost savings. Cloud (OpenAI) for simplicity, access to the newest hosted models, and no inference infrastructure to maintain.
Q: Can I fine-tune models for my domain? A: Yes, with labeled pairs. Fine-tuning often improves retrieval quality by roughly 5-15% on domain-specific tasks.
Q: How do I deploy Sentence Transformers? A: Package as a Docker container, use FastAPI for inference, or integrate into your application directly.
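As a concrete starting point for the FastAPI option, here is a minimal sketch (the /embed endpoint name and request shape are illustrative, not a standard):
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("all-MiniLM-L6-v2")  # load once at startup

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # encode() returns a NumPy array; tolist() makes it JSON-serializable
    return {"embeddings": model.encode(req.texts).tolist()}

# Run with: uvicorn app:app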