Embeddings Explained — How Semantic Search Works

Sanjeev Sharma
6 min read


Introduction

Embeddings are numerical representations of text that capture semantic meaning. They power semantic search, enabling computers to find documents based on meaning rather than keywords. This guide explains embeddings from first principles and shows how to build semantic search systems.

What Are Embeddings?

Embeddings are vectors (lists of numbers) that represent text in high-dimensional space. Semantically similar texts have similar embeddings, enabling distance-based retrieval.

Key insight: The distance between embeddings correlates with semantic distance. Texts about "cats" and "dogs" have closer embeddings than texts about "cats" and "quantum physics".

From Words to Numbers

# Naive example: one-hot encoding (outdated)
vocabulary = ["cat", "dog", "bird"]

one_hot = {word: [1 if j == i else 0 for j in range(len(vocabulary))]
           for i, word in enumerate(vocabulary)}

# "cat"  -> [1, 0, 0]
# "dog"  -> [0, 1, 0]
# "bird" -> [0, 0, 1]

# Problem: loses all semantic relationships and grows huge for large vocabularies

Modern embeddings are dense and relatively low-dimensional (typically 384-3072 dimensions), capturing semantic relationships learned from training data.

Embedding Models

Word2Vec and GloVe

Early word embedding models trained on word co-occurrence:

# Conceptual example (not actual implementation)
# If "king - man + woman ≈ queen" in the embedding space
# The model learned semantic relationships

word2vec_embedding = {
    "cat": [0.2, -0.5, 0.1, 0.8, ...],
    "dog": [0.3, -0.4, 0.15, 0.75, ...],  # Similar to cat
    "quantum": [-0.9, 0.1, 0.2, -0.5, ...],  # Different from cat/dog
}
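
To make this concrete, here is a minimal sketch of training word vectors on co-occurrence with the gensim library (assuming gensim is installed; the toy corpus and hyperparameters are purely illustrative):

from gensim.models import Word2Vec

# Tiny illustrative corpus: each "document" is a list of tokens
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["quantum", "computers", "use", "qubits"],
]

# Train a small Word2Vec model (toy-sized parameters)
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(w2v.wv["cat"][:5])                # first few values of the learned vector
print(w2v.wv.similarity("cat", "dog"))  # words that share contexts end up closer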

Transformers and Contextual Embeddings

Modern models like BERT generate context-aware embeddings:

from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings (384 dimensions)
sentences = [
    "A cat is a fluffy pet",
    "Dogs are loyal animals",
    "Quantum computing uses qubits"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)

# Compute similarity
from scipy.spatial.distance import cosine

similarity_cat_dog = 1 - cosine(embeddings[0], embeddings[1])
similarity_cat_quantum = 1 - cosine(embeddings[0], embeddings[2])

print(f"Cat-Dog similarity: {similarity_cat_dog:.2f}")  # High (~0.7)
print(f"Cat-Quantum similarity: {similarity_cat_quantum:.2f}")  # Low (~0.1)

A few widely used embedding models:

| Model                   | Dimensions | Speed     | Quality   | Cost                   |
|-------------------------|------------|-----------|-----------|------------------------|
| all-MiniLM-L6-v2        | 384        | Very Fast | Good      | Free                   |
| all-mpnet-base-v2       | 768        | Fast      | Very Good | Free                   |
| text-embedding-3-small  | 1536       | Medium    | Excellent | Cheap                  |
| text-embedding-3-large  | 3072       | Medium    | Excellent | Moderate               |
| OpenAI text-embedding   | 1536       | Medium    | Best      | Cost depends on usage  |

Building a Semantic Search System

Step 1: Embed Your Documents

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Your documents
documents = [
    "Python is a programming language",
    "Machine learning is a subset of AI",
    "Data science involves statistics",
    "Neural networks process information",
    "A cat is a furry pet"
]

# Generate embeddings (NumPy array of shape (5, 384))
embeddings = model.encode(documents)

# Store embeddings and documents together as a simple in-memory index
index = {
    'documents': documents,
    'embeddings': embeddings
}

Step 2: Search by Similarity

def semantic_search(query: str, top_k: int = 3) -> list:
    """Find most similar documents."""
    # Embed the query with the same model
    query_embedding = model.encode([query])

    # Compute cosine similarity against every document embedding
    similarities = cosine_similarity(
        query_embedding,
        index['embeddings']
    )[0]

    # Get top-k
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    results = [
        {
            'document': index['documents'][i],
            'similarity': similarities[i]
        }
        for i in top_indices
    ]

    return results

# Test
results = semantic_search("What is machine learning?")
for result in results:
    print(f"Score: {result['similarity']:.3f} - {result['document']}")

Output:

Score: 0.823 - Machine learning is a subset of AI
Score: 0.654 - Python is a programming language
Score: 0.521 - Neural networks process information

Using OpenAI Embeddings

import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your-api-key")  # or rely on the OPENAI_API_KEY environment variable

def embed_with_openai(texts: list) -> np.ndarray:
    """Generate embeddings using OpenAI."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )

    embeddings = [item.embedding for item in response.data]
    return np.array(embeddings)

# Embed documents
doc_embeddings = embed_with_openai(documents)

# Embed query
query = "What is AI?"
query_embedding = embed_with_openai([query])[0]

# Find similar
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) *
    np.linalg.norm(query_embedding)
)

top_idx = np.argmax(similarities)
print(f"Most similar: {documents[top_idx]}")

Advanced: Semantic Search with Filtering

# Add metadata to documents
indexed_docs = [
    {"text": "Python programming", "category": "programming", "year": 2023},
    {"text": "Machine learning basics", "category": "ai", "year": 2024},
    {"text": "Web development", "category": "programming", "year": 2023},
]

# Embed texts
embeddings = model.encode([doc['text'] for doc in indexed_docs])

def semantic_search_with_filter(
    query: str,
    category_filter: str = None,
    top_k: int = 3
) -> list:
    """Search with optional metadata filtering."""
    query_emb = model.encode([query])

    # Compute similarities
    similarities = cosine_similarity(query_emb, embeddings)[0]

    # Filter by category if specified
    filtered_results = []
    for i, (sim, doc) in enumerate(zip(similarities, indexed_docs)):
        if category_filter is None or doc['category'] == category_filter:
            filtered_results.append((sim, doc, i))

    # Sort and return top-k
    filtered_results.sort(key=lambda x: x[0], reverse=True)
    return filtered_results[:top_k]

# Search restricted to the "programming" category
results = semantic_search_with_filter(
    "Python tutorial",
    category_filter="programming",
    top_k=2
)

for sim, doc, _ in results:
    print(f"Score: {sim:.3f} - {doc['text']}")

Dimensionality and Efficiency

from sklearn.decomposition import PCA

# Reduce dimensionality for efficiency
# Note: PCA needs at least as many samples as components,
# so fit it on the full document collection rather than a handful of examples
pca = PCA(n_components=128)
reduced_embeddings = pca.fit_transform(embeddings)

# Trade-off: less storage and faster search, but a slight accuracy loss
# Original: 384 dimensions
# Reduced: 128 dimensions (67% smaller)

# Use the reduced embeddings for large-scale search

Distance Metrics

# Cosine similarity (most common for embeddings)
from scipy.spatial.distance import cosine

emb1, emb2 = embeddings[0], embeddings[1]  # any two embeddings from earlier

cosine_dist = cosine(emb1, emb2)  # cosine distance = 1 - cosine similarity

# Euclidean distance
euclidean_dist = np.linalg.norm(emb1 - emb2)

# Dot product (fast, used in many vector databases)
dot_product = np.dot(emb1, emb2)

# For normalized (unit-length) embeddings, cosine similarity equals the dot product
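
As a quick sanity check of that last point, the sketch below normalizes two arbitrary vectors to unit length and confirms that their dot product matches the cosine similarity:

import numpy as np

a = np.array([0.2, -0.5, 0.1, 0.8])
b = np.array([0.3, -0.4, 0.15, 0.75])

# Cosine similarity of the raw vectors
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after normalizing each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(cosine_sim, dot_of_units))  # True

If your embedding library can return unit-length vectors directly (sentence-transformers' encode, for example, accepts normalize_embeddings=True), you can use the cheaper dot product at query time.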

Production Tips

  1. Batch Processing: Embed documents in batches rather than one at a time
  2. Caching: Store embeddings to avoid recomputation (tips 1 and 2 are sketched after this list)
  3. Monitoring: Track embedding quality and latency
  4. Model Selection: Choose based on accuracy needs and latency budget
  5. Regular Reindexing: Update embeddings when documents change
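
As a minimal sketch of tips 1 and 2, assuming the sentence-transformers model from earlier, batching plus a simple on-disk cache might look like this (the cache location and batch size are illustrative choices, not requirements):

import hashlib
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
CACHE_DIR = Path("embedding_cache")  # illustrative cache location
CACHE_DIR.mkdir(exist_ok=True)

def embed_with_cache(texts: list, batch_size: int = 64) -> np.ndarray:
    """Embed texts in batches, reusing cached vectors keyed by a hash of the text."""
    results = [None] * len(texts)
    missing_texts, missing_positions = [], []

    # Reuse any embedding we have already computed
    for i, text in enumerate(texts):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cache_file = CACHE_DIR / f"{key}.npy"
        if cache_file.exists():
            results[i] = np.load(cache_file)
        else:
            missing_texts.append(text)
            missing_positions.append(i)

    # Embed the rest in batches and cache the new vectors
    if missing_texts:
        new_embeddings = model.encode(missing_texts, batch_size=batch_size)
        for pos, text, emb in zip(missing_positions, missing_texts, new_embeddings):
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            np.save(CACHE_DIR / f"{key}.npy", emb)
            results[pos] = emb

    return np.array(results)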

Conclusion

Embeddings are the foundation of semantic search. Understanding how they work and choosing the right model and infrastructure is crucial for building effective retrieval systems.

FAQ

Q: How many dimensions should embeddings have? A: More dimensions capture finer distinctions but are slower and more expensive. 384-768 dimensions work for most applications; 1536+ for highest quality.

Q: Can I use the same embedding model for different languages? A: Some multilingual models exist (e.g., multilingual-BERT), but language-specific models usually perform better.

Q: How often should I update embeddings? A: Update when documents change or monthly to reflect model improvements. Incremental updates are more efficient than full reindexing.

Written by Sanjeev Sharma
Full Stack Engineer · E-mopro