Embeddings Explained — How Semantic Search Works
Introduction
Embeddings are numerical representations of text that capture semantic meaning. They power semantic search, enabling computers to find documents based on meaning rather than keywords. This guide explains embeddings from first principles and shows how to build semantic search systems.
- What Are Embeddings?
- From Words to Numbers
- Embedding Models
  - Word2Vec and GloVe
  - Transformers and Contextual Embeddings
  - Popular Embedding Models
- Building a Semantic Search System
  - Step 1: Embed Your Documents
  - Step 2: Index for Similarity Search
  - Using OpenAI Embeddings
  - Advanced: Semantic Search with Filtering
- Dimensionality and Efficiency
- Distance Metrics
- Production Tips
- Conclusion
- FAQ
What Are Embeddings?
Embeddings are vectors (lists of numbers) that represent text in high-dimensional space. Semantically similar texts have similar embeddings, enabling distance-based retrieval.
Key insight: The distance between embeddings correlates with semantic distance. Texts about "cats" and "dogs" have closer embeddings than texts about "cats" and "quantum physics".
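This geometric intuition can be demonstrated with plain NumPy on hand-made toy vectors (the values below are invented for illustration, not produced by a real model):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1 means same direction, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-dimensional "embeddings" for illustration only
cat = np.array([0.9, 0.1, 0.0])
dog = np.array([0.8, 0.2, 0.1])
quantum = np.array([0.0, 0.1, 0.9])

print(cos_sim(cat, dog))      # near 1: semantically close
print(cos_sim(cat, quantum))  # near 0: semantically distant
```

Real models produce vectors with hundreds of dimensions, but the retrieval principle is exactly this comparison, repeated against every document.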
From Words to Numbers
# Naive example: one-hot encoding (outdated)
vocabulary = ["cat", "dog", "bird"]
# "cat" -> [1, 0, 0]
# "dog" -> [0, 1, 0]
# "bird" -> [0, 0, 1]
# Problem: loses semantic relationships, high dimensionality for large vocabularies
Modern embeddings are dense and comparatively low-dimensional (typically 384-1536 dimensions), capturing semantic relationships learned from large training corpora.
Embedding Models
Word2Vec and GloVe
Early word embedding models were trained on word co-occurrence statistics:
# Conceptual example (not actual implementation)
# If "king - man + woman ≈ queen" in the embedding space
# The model learned semantic relationships
word2vec_embedding = {
    "cat": [0.2, -0.5, 0.1, 0.8, ...],       # similar to "dog"
    "dog": [0.3, -0.4, 0.15, 0.75, ...],     # similar to "cat"
    "quantum": [-0.9, 0.1, 0.2, -0.5, ...],  # different from cat/dog
}
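The famous "king - man + woman ≈ queen" analogy can be sketched in a few lines. The vectors below are invented so the arithmetic works out; real Word2Vec vectors have hundreds of dimensions learned from data:

```python
import numpy as np

# Toy vectors chosen by hand to illustrate vector arithmetic,
# not output of a real Word2Vec model
vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "banana": np.array([0.4, 0.5, 0.2]),
}

# "king" minus "man" plus "woman" should land near "queen"
target = vectors["king"] - vectors["man"] + vectors["woman"]

def nearest(target: np.ndarray, exclude: set[str]) -> str:
    """Return the vocabulary word whose vector is most similar to target."""
    best, best_sim = "", -2.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```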
Transformers and Contextual Embeddings
Modern models like BERT generate context-aware embeddings:
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings (384 dimensions)
sentences = [
    "A cat is a fluffy pet",
    "Dogs are loyal animals",
    "Quantum computing uses qubits"
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 384)
# Compute similarity
from scipy.spatial.distance import cosine
similarity_cat_dog = 1 - cosine(embeddings[0], embeddings[1])
similarity_cat_quantum = 1 - cosine(embeddings[0], embeddings[2])
print(f"Cat-Dog similarity: {similarity_cat_dog:.2f}") # High (~0.7)
print(f"Cat-Quantum similarity: {similarity_cat_quantum:.2f}") # Low (~0.1)
Popular Embedding Models
| Model | Dimension | Speed | Quality | Cost |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Free |
| all-mpnet-base-v2 | 768 | Fast | Very Good | Free |
| text-embedding-3-small | 1536 | Medium | Excellent | Cheap |
| text-embedding-3-large | 3072 | Medium | Excellent | Moderate |
Building a Semantic Search System
Step 1: Embed Your Documents
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Your documents
# Your documents
documents = [
    "Python is a programming language",
    "Machine learning is a subset of AI",
    "Data science involves statistics",
    "Neural networks process information",
    "A cat is a furry pet"
]
# Generate embeddings
# Generate embeddings (encode returns a NumPy array by default)
embeddings = model.encode(documents)
Step 2: Index for Similarity Search
from sklearn.metrics.pairwise import cosine_similarity
# Store embeddings alongside the original documents
index = {
    'documents': documents,
    'embeddings': embeddings
}
def semantic_search(query: str, top_k: int = 3) -> list:
    """Find the documents most similar to the query."""
    # Embed the query with the same model used for the documents
    query_embedding = model.encode([query])
    # Compute cosine similarity against every document
    similarities = cosine_similarity(query_embedding, index['embeddings'])[0]
    # Indices of the top-k highest similarities, best first
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [
        {
            'document': index['documents'][i],
            'similarity': similarities[i]
        }
        for i in top_indices
    ]
# Test
results = semantic_search("What is machine learning?")
for result in results:
    print(f"Score: {result['similarity']:.3f} - {result['document']}")
Output:
Score: 0.823 - Machine learning is a subset of AI
Score: 0.654 - Python is a programming language
Score: 0.521 - Neural networks process information
Using OpenAI Embeddings
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
def embed_with_openai(texts: list) -> np.ndarray:
    """Generate embeddings using the OpenAI API."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    return np.array([item.embedding for item in response.data])
# Embed documents
doc_embeddings = embed_with_openai(documents)
# Embed query
query = "What is AI?"
query_embedding = embed_with_openai([query])[0]
# Find similar
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) *
    np.linalg.norm(query_embedding)
)
top_idx = np.argmax(similarities)
print(f"Most similar: {documents[top_idx]}")
Advanced: Semantic Search with Filtering
# Add metadata to documents
# Add metadata to documents
indexed_docs = [
    {"text": "Python programming", "category": "programming", "year": 2023},
    {"text": "Machine learning basics", "category": "ai", "year": 2024},
    {"text": "Web development", "category": "programming", "year": 2023},
]
# Embed texts
embeddings = model.encode([doc['text'] for doc in indexed_docs])
def semantic_search_with_filter(
    query: str,
    category_filter: str | None = None,
    top_k: int = 3
) -> list:
    """Search with optional metadata filtering."""
    query_emb = model.encode([query])
    # Compute similarities against all documents
    similarities = cosine_similarity(query_emb, embeddings)[0]
    # Keep only documents matching the category filter, if one is given
    filtered_results = []
    for i, (sim, doc) in enumerate(zip(similarities, indexed_docs)):
        if category_filter is None or doc['category'] == category_filter:
            filtered_results.append((sim, doc, i))
    # Sort by similarity and return the top-k
    filtered_results.sort(key=lambda x: x[0], reverse=True)
    return filtered_results[:top_k]
# Search filtered
results = semantic_search_with_filter(
    "Python tutorial",
    category_filter="programming",
    top_k=2
)
Dimensionality and Efficiency
from sklearn.decomposition import PCA
# Reduce dimensionality for storage and speed
# (PCA can keep at most min(n_samples, n_features) components,
# so 128 components require at least 128 embedded documents)
pca = PCA(n_components=128)
reduced_embeddings = pca.fit_transform(embeddings)
# Trade-off: less storage and faster search, but a slight accuracy loss
# Original: 384 dimensions; reduced: 128 dimensions (67% smaller)
# Use the reduced embeddings for large-scale search
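A useful sanity check before committing to a reduction is how much variance the kept components retain. A sketch on synthetic stand-in data (random vectors in place of a real embedding matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a matrix of 384-dimensional document embeddings;
# in practice, fit on your real embedding matrix instead
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))

pca = PCA(n_components=128).fit(embeddings)
retained = float(pca.explained_variance_ratio_.sum())
print(f"Variance retained by 128 of 384 components: {retained:.1%}")
```

On real embeddings the retained fraction is usually much higher than on random noise, because semantic information concentrates in relatively few directions.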
Distance Metrics
# Assume emb1 and emb2 are embedding vectors (NumPy arrays)
# Cosine distance, i.e. 1 - cosine similarity (most common for embeddings)
from scipy.spatial.distance import cosine
cosine_dist = cosine(emb1, emb2)
# Euclidean distance
euclidean_dist = np.linalg.norm(emb1 - emb2)
# Dot product (fast; widely used in vector databases)
dot_product = np.dot(emb1, emb2)
# For L2-normalized embeddings, cosine similarity equals the dot product
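That last equivalence is easy to verify numerically: after L2-normalizing two random vectors, cosine similarity and dot product coincide:

```python
import numpy as np

rng = np.random.default_rng(42)
emb1 = rng.normal(size=384)
emb2 = rng.normal(size=384)

# L2-normalize both vectors to unit length
emb1 /= np.linalg.norm(emb1)
emb2 /= np.linalg.norm(emb2)

# With unit-length vectors the denominator of cosine similarity is 1
cos_sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
dot = float(np.dot(emb1, emb2))
print(np.isclose(cos_sim, dot))  # True
```

This is why many vector databases normalize embeddings at ingestion time and then use the cheaper dot product at query time.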
Production Tips
- Batch Processing: Embed documents in batches
- Caching: Store embeddings to avoid recomputation
- Monitoring: Track embedding quality and latency
- Model Selection: Choose based on accuracy needs and latency budget
- Regular Reindexing: Update embeddings when documents change
Conclusion
Embeddings are the foundation of semantic search. Understanding how they work and choosing the right model and infrastructure is crucial for building effective retrieval systems.
FAQ
Q: How many dimensions should embeddings have? A: More dimensions capture finer distinctions but are slower and more expensive. 384-768 dimensions work for most applications; 1536+ for highest quality.
Q: Can I use the same embedding model for different languages? A: Some multilingual models exist (e.g., multilingual-BERT), but language-specific models usually perform better.
Q: How often should I update embeddings? A: Update when documents change or monthly to reflect model improvements. Incremental updates are more efficient than full reindexing.