Embedding Model Comparison — OpenAI, Cohere, and Open-Source Options
Introduction
Embeddings are the foundation of semantic search, RAG, and recommendation systems. Your embedding model choice directly impacts retrieval quality, latency, and cost. In 2026, you have excellent options: OpenAI's text-embedding-3 family, Cohere's multi-input embeddings, and powerful open-source models that run locally. This guide cuts through the noise.
- OpenAI text-embedding-3 Family
- Cohere Embed v3 With Input Types
- Sentence-Transformers Local Options
- Multilingual Embeddings
- Domain-Specific Fine-Tuned Embeddings
- MTEB Benchmark Interpretation
- Embedding Dimensions vs Performance
- Cost Per Million Tokens
- Batch Embedding for Bulk Processing
- Caching Embeddings
- Checklist
- Conclusion
OpenAI text-embedding-3 Family
OpenAI's latest models set the quality bar. text-embedding-3-small is incredibly efficient:
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# text-embedding-3-small: 1536 dimensions by default, $0.02 per 1M tokens
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "The quick brown fox jumps over the lazy dog",
        "Vector embeddings power semantic search",
        "Transformer models revolutionized NLP",
    ],
)

# Access embeddings
for data in response.data:
    print(f"Index {data.index}: {len(data.embedding)} dimensions")
    print(f"First 5 values: {data.embedding[:5]}")

# text-embedding-3-large: 3072 dimensions, $0.13 per 1M tokens
# Scores several points higher on MTEB retrieval
response_large = client.embeddings.create(
    model="text-embedding-3-large",
    input="Advanced RAG with hybrid search",
)
print(f"Embedding dimensions: {len(response_large.data[0].embedding)}")
3-small vs 3-large trade-off: 3-small is 6.5× cheaper, and its default embeddings are half the size (1536 vs 3072 dims); both models also accept a dimensions parameter to shorten vectors further. For most RAG use cases, 3-small is sufficient. Use 3-large when:
- Retrieval quality is critical (medical, legal docs)
- Your queries are short or ambiguous
- You have storage and compute budget for larger indexes
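OpenAI documents that a full-length embedding can be shortened by truncating it and re-normalizing, which is what the dimensions parameter does server-side. A minimal local sketch of that operation, using a random vector as a stand-in for a real 3-large embedding:

```python
import numpy as np

def shorten_embedding(vec, dims: int) -> np.ndarray:
    """Truncate an embedding to `dims` values and re-normalize to unit
    length -- the post-processing behind the API's `dimensions` parameter."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a 3-large vector
short = shorten_embedding(full, 512)
print(len(short))                                  # 512
print(round(float(np.linalg.norm(short)), 4))      # 1.0
```

Shortened vectors trade a little quality for proportionally smaller indexes, so you can store 512-dim vectors while keeping the option to re-embed at full size later.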
Cohere Embed v3 With Input Types
Cohere's embed-english-v3.0 introduces input type hints, improving embedding quality for specific contexts:
import cohere

co = cohere.ClientV2(api_key="...")

# Specify input_type: search_query, search_document, classification, or clustering
response = co.embed(
    texts=[
        "Find me papers about quantum computing",
        "Quantum computing is a computational paradigm",
    ],
    model="embed-english-v3.0",
    input_type="search_query",  # for queries
    embedding_types=["float"],
)

# Embed documents with search_document type
doc_response = co.embed(
    texts=[
        "A comprehensive guide to quantum algorithms",
        "Quantum error correction techniques",
    ],
    model="embed-english-v3.0",
    input_type="search_document",  # for documents
    embedding_types=["float"],
)

print(f"Query embeddings: {len(response.embeddings.float_)}")
print(f"Document embeddings: {len(doc_response.embeddings.float_)}")

# Cost: $0.10 per 1M tokens, 1024 dimensions
Cohere also supports int8 quantization for a 4× storage reduction. Using the matching input types for queries and documents typically improves retrieval by a few percent.
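The 4× figure follows directly from the datatypes: float32 uses 4 bytes per value, int8 uses 1. A self-contained sketch of symmetric linear quantization (illustrative only; not necessarily Cohere's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.normal(size=1024).astype(np.float32)  # one 1024-dim embedding

# Symmetric linear quantization: map [-max, +max] onto [-127, 127]
scale = float(np.abs(emb).max()) / 127.0
q = np.round(emb / scale).astype(np.int8)

print(emb.nbytes, "->", q.nbytes)  # 4096 -> 1024 bytes: a 4x reduction

# Dequantize to check the round-trip error stays bounded by half a step
deq = q.astype(np.float32) * scale
print(float(np.abs(emb - deq).max()) <= scale / 2 + 1e-6)  # True
```

The per-value error is at most half the quantization step, which is why int8 embeddings lose only a small amount of retrieval quality in practice.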
Sentence-Transformers Local Options
sentence-transformers runs entirely locally, ideal for privacy-sensitive applications:
from sentence_transformers import SentenceTransformer, util

# Load model (auto-downloads on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')

# all-MiniLM-L6-v2: 22M parameters, 384 dimensions
# Inference: ~1ms per sentence on CPU
sentences = [
    "Vector databases accelerate AI",
    "Semantic search finds relevant documents",
    "Embeddings capture meaning in vector space",
]
embeddings = model.encode(sentences, show_progress_bar=True)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Cosine similarity
query = "How do embeddings work?"
query_embedding = model.encode(query)
similarities = util.cos_sim(query_embedding, embeddings)[0]
for idx, score in enumerate(similarities):
    print(f"{sentences[idx]}: {score:.4f}")

# Batch encode for production
batch_size = 32
large_corpus = ["text"] * 10000
all_embeddings = model.encode(
    large_corpus,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(f"Encoded {len(all_embeddings)} documents")

# Fine-tuning for domain-specific quality
from sentence_transformers import losses, InputExample
from torch.utils.data import DataLoader

# Training data: (sentence1, sentence2, similarity_score)
train_examples = [
    InputExample(texts=["Quantum computing", "Quantum algorithms"], label=0.8),
    InputExample(texts=["Vector search", "SQL database"], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("models/custom-embedding-model")
Popular models:
- all-MiniLM-L6-v2: 22M params, 384 dims, best speed
- all-mpnet-base-v2: 109M params, 768 dims, higher quality
- paraphrase-multilingual-mpnet-base-v2: 278M params, 768 dims, 50+ languages
Multilingual Embeddings
For global products, multilingual models keep a single embedding space:
from sentence_transformers import SentenceTransformer
import numpy as np

# Multilingual model handles 50+ languages in one space
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

texts = [
    "The quick brown fox",     # English
    "El rápido zorro marrón",  # Spanish
    "Le rapide renard brun",   # French
    "快速的棕色狐狸",           # Chinese
]
embeddings = model.encode(texts)

# Cross-lingual similarity search
query = "Fast brown animal"
query_emb = model.encode(query)
scores = np.dot(embeddings, query_emb) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_emb)
)
for text, score in zip(texts, scores):
    print(f"{text}: {score:.4f}")
Trade-off: Multilingual models (768 dims) are larger than monolingual options (384 dims). For single-language use cases, specialized models are faster.
Domain-Specific Fine-Tuned Embeddings
Generic embeddings miss domain jargon. Fine-tuning dramatically improves retrieval:
import numpy as np
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.losses import MultipleNegativesRankingLoss
from torch.utils.data import DataLoader

# Example: medical document embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Training triplets: (query, positive_doc, negative_doc)
medical_examples = [
    InputExample(
        texts=[
            "What are symptoms of hypertension?",
            "Hypertension presents with elevated blood pressure readings",
            "The liver produces bile for digestion",
        ]
    ),
    InputExample(
        texts=[
            "How to treat diabetic neuropathy?",
            "Diabetic neuropathy treatment involves glucose control and pain management",
            "Acute myocardial infarction requires immediate intervention",
        ]
    ),
]

train_dataloader = DataLoader(medical_examples, shuffle=True, batch_size=16)
train_loss = MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100,
)

# Evaluate on domain-specific queries
queries = ["diabetes management", "cardiac procedures"]
docs = [
    "Diabetes is managed through insulin and lifestyle changes",
    "Angioplasty restores blood flow in coronary arteries",
]
query_embs = model.encode(queries)
doc_embs = model.encode(docs)
scores = np.dot(query_embs, doc_embs.T)
print("Retrieval scores after fine-tuning:")
print(scores)

model.save("models/medical-embeddings")
Fine-tuning shows 10-40% improvement in domain-specific retrieval. Cost: a few hours of GPU time.
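To measure that improvement on your own corpus, a simple recall@k check before and after fine-tuning is enough. This sketch uses random vectors as stand-ins for real embeddings:

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant, k=2):
    """Fraction of queries whose relevant document index appears in the
    top-k results by dot-product similarity. `relevant[i]` is the index
    of the correct document for query i."""
    sims = query_embs @ doc_embs.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [relevant[i] in topk[i] for i in range(len(relevant))]
    return sum(hits) / len(hits)

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(5, 384))
query_embs = doc_embs[:3] + 0.1 * rng.normal(size=(3, 384))  # queries near docs 0-2
print(recall_at_k(query_embs, doc_embs, relevant=[0, 1, 2], k=1))
```

Run the same held-out query set through the base and fine-tuned models; the recall@k delta is the number that justifies (or doesn't) the GPU time.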
MTEB Benchmark Interpretation
The Massive Text Embedding Benchmark (MTEB) evaluates models across eight task types (retrieval, clustering, classification, semantic similarity, and more) spanning dozens of datasets. How to interpret scores:
# Hypothetical MTEB leaderboard snapshot (2026)
mteb_scores = {
    "text-embedding-3-small": {
        "retrieval": 0.74,
        "clustering": 0.51,
        "semantic_similarity": 0.82,
        "classification": 0.78,
    },
    "text-embedding-3-large": {
        "retrieval": 0.79,
        "clustering": 0.54,
        "semantic_similarity": 0.87,
        "classification": 0.82,
    },
    "all-mpnet-base-v2": {
        "retrieval": 0.69,
        "clustering": 0.48,
        "semantic_similarity": 0.80,
        "classification": 0.74,
    },
}

# Rough rubric for the retrieval score:
#   <0.70: weak for production RAG
#   0.70-0.75: acceptable
#   0.75+: strong

# Your use case matters:
# - RAG with short queries: prioritize retrieval score
# - Clustering for topics: prioritize clustering score
# - Semantic search UI: prioritize semantic_similarity score

Don't optimize for overall score. Pick benchmarks matching your task. A model with 0.80 retrieval but 0.50 clustering is excellent for RAG.
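One way to act on this is to weight the per-task scores by your workload mix and take the argmax. A small helper, using scores like the snapshot above (redefined here so the snippet is self-contained):

```python
mteb_scores = {
    "text-embedding-3-small": {"retrieval": 0.74, "semantic_similarity": 0.82},
    "text-embedding-3-large": {"retrieval": 0.79, "semantic_similarity": 0.87},
    "all-mpnet-base-v2":      {"retrieval": 0.69, "semantic_similarity": 0.80},
}

def best_model(scores: dict, weights: dict) -> str:
    """Return the model maximizing the weighted sum of task scores."""
    return max(scores, key=lambda m: sum(scores[m][t] * w for t, w in weights.items()))

# RAG workload: retrieval dominates
print(best_model(mteb_scores, {"retrieval": 0.8, "semantic_similarity": 0.2}))
# -> text-embedding-3-large (highest on both tasks in this snapshot)
```

The weights are your call; the point is to make the task mix explicit instead of ranking by a single aggregate number.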
Embedding Dimensions vs Performance
More dimensions don't always help. The law of diminishing returns applies:
# Illustrative trade-off (relative quality, float32 storage, search latency)
dimensions_vs_quality = {
    "384":  {"quality_gain": 1.00, "storage_gb_per_1m": 1.5,  "latency_ms": 5},
    "768":  {"quality_gain": 1.15, "storage_gb_per_1m": 3.0,  "latency_ms": 8},
    "1536": {"quality_gain": 1.22, "storage_gb_per_1m": 6.0,  "latency_ms": 12},
    "3072": {"quality_gain": 1.25, "storage_gb_per_1m": 12.0, "latency_ms": 18},
}

# For 1M documents (float32):
#   384 dims: 1.5 GB, fast
#   3072 dims: 12 GB, slower, and only ~2-3% quality gain over 1536 dims

# Recommendation: start with 384-768 dims
# Only go higher if retrieval quality is unacceptable
OpenAI's 3-small is the Goldilocks option: balanced quality and efficiency at 1536 dims, reducible via the dimensions parameter when storage is tight.
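The storage figures above come straight from n_vectors × dims × 4 bytes for float32. A one-liner to sanity-check your own index size (raw vectors only, ignoring ANN index overhead):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage in GiB, excluding ANN index overhead."""
    return n_vectors * dims * bytes_per_value / 1024**3

for dims in (384, 768, 1536, 3072):
    print(f"{dims} dims: {index_size_gb(1_000_000, dims):.2f} GiB")
```

Swap bytes_per_value to 1 to estimate an int8-quantized index.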
Cost Per Million Tokens
Here's 2026 pricing for reference:
# Pricing (as of 2026-03)
embedding_costs = {
    "text-embedding-3-small": 0.02,  # per 1M tokens
    "text-embedding-3-large": 0.13,
    "embed-english-v3.0": 0.10,
    "local (free)": 0.00,
}

# Corpus for a typical RAG app
docs = 100_000
avg_doc_tokens = 500
total_tokens = docs * avg_doc_tokens

# One-time embedding cost
print("One-time embedding cost:")
for model, cost_per_1m in embedding_costs.items():
    total_cost = (total_tokens / 1_000_000) * cost_per_1m
    if total_cost == 0:
        print(f"{model}: $0 (free)")
    else:
        print(f"{model}: ${total_cost:.2f}")

# Query cost (assuming 1000 queries/day, 100 tokens each)
queries_per_month = 1000 * 30
query_tokens = queries_per_month * 100
print("\nMonthly query cost:")
for model, cost_per_1m in embedding_costs.items():
    total_cost = (query_tokens / 1_000_000) * cost_per_1m
    if total_cost == 0:
        print(f"{model}: $0/month (free)")
    else:
        print(f"{model}: ${total_cost:.2f}/month")
At scale, local embeddings (no per-token cost) become attractive. Trade-off: hardware and DevOps overhead.
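A rough break-even calculation makes that trade-off concrete. The $150/month infrastructure figure below is an assumption for illustration, not a quoted price, and it ignores engineering time:

```python
def breakeven_tokens_per_month(api_cost_per_1m: float, infra_cost_per_month: float) -> float:
    """Monthly token volume above which a fixed self-hosting cost beats
    per-token API pricing (ignoring engineering time)."""
    return infra_cost_per_month / api_cost_per_1m * 1_000_000

# Assumed: ~$150/month for a small GPU instance vs text-embedding-3-small
print(f"{breakeven_tokens_per_month(0.02, 150):,.0f} tokens/month")  # 7,500,000,000
```

Billions of tokens per month is a high bar, which is why managed APIs remain the default unless privacy or latency forces local inference.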
Batch Embedding for Bulk Processing
Never embed one-at-a-time in production. Batch processing is 10-50× faster:
import asyncio
from openai import AsyncOpenAI

async def batch_embed_documents(
    documents: list[str],
    batch_size: int = 100,
    model: str = "text-embedding-3-small",
):
    client = AsyncOpenAI(api_key="sk-...")
    embeddings_list = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]
        response = await client.embeddings.create(
            model=model,
            input=batch,
        )
        embeddings_list.extend([data.embedding for data in response.data])
    return embeddings_list

# Usage
async def main():
    docs = [f"Document {i}" for i in range(10000)]
    embeddings = await batch_embed_documents(docs, batch_size=100)
    print(f"Embedded {len(embeddings)} documents")

asyncio.run(main())
Batch sizes of 100-500 inputs per request optimize throughput; larger batches amortize per-request overhead, though per-token pricing is unchanged.
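The loop above issues batches sequentially; overlapping requests usually raises throughput further. A sketch of bounded concurrency with asyncio.gather and a semaphore, using a stub in place of the real API call so the pattern is self-contained:

```python
import asyncio

async def embed_batch(batch: list[str]) -> list[list[float]]:
    # Stub: swap in `await client.embeddings.create(...)` in production
    await asyncio.sleep(0.01)
    return [[0.0] * 4 for _ in batch]

async def embed_all(documents: list[str], batch_size: int = 100, max_concurrency: int = 5):
    sem = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def run(batch):
        async with sem:
            return await embed_batch(batch)

    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    results = await asyncio.gather(*(run(b) for b in batches))
    return [emb for r in results for emb in r]  # flatten; gather preserves order

embs = asyncio.run(embed_all([f"doc {i}" for i in range(250)]))
print(len(embs))  # 250
```

The semaphore matters: unbounded gather over thousands of batches will trip API rate limits, so cap concurrency to what your tier allows.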
Caching Embeddings
Never recompute embeddings. Cache aggressively:
import hashlib
import json
from openai import OpenAI
from redis import Redis

class EmbeddingCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = Redis.from_url(redis_url)
        self.client = OpenAI()

    def get_or_create(self, text: str, model: str = "text-embedding-3-small"):
        # Create cache key from text content
        key = f"emb:{model}:{hashlib.md5(text.encode()).hexdigest()}"

        # Check cache
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Compute and cache
        response = self.client.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding

        # Store for 30 days
        self.redis.setex(key, 30 * 86400, json.dumps(embedding))
        return embedding

# Usage
cache = EmbeddingCache()
emb1 = cache.get_or_create("Vector databases")  # first call: computes
emb2 = cache.get_or_create("Vector databases")  # second call: returns cached

# Verify cache hit
assert emb1 == emb2
Cache hits save 90%+ of embedding API costs. Critical for production.
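The savings claim is straightforward arithmetic: only cache misses hit the API. A quick sketch of blended cost at different hit rates, using the 3-small price from the table above:

```python
def effective_cost_per_1m(base_cost_per_1m: float, hit_rate: float) -> float:
    """Blended cost per 1M tokens when cache hits cost (approximately) nothing."""
    return base_cost_per_1m * (1 - hit_rate)

for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: ${effective_cost_per_1m(0.02, rate):.4f} per 1M tokens")
```

A 90% hit rate cuts the effective price tenfold, which is why deduplicating re-embeds of unchanged documents is usually the first cost optimization worth shipping.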
Checklist
- Define embedding task (retrieval, clustering, classification)
- Check MTEB scores for your task, not overall score
- Benchmark 2-3 models on your corpus
- Calculate 12-month API cost vs local inference cost
- Test multilingual support if applicable
- Implement embedding caching layer
- Set up batch processing pipeline
- Monitor embedding model performance over time
- Plan fine-tuning if domain-specific retrieval is weak
Conclusion
In 2026, OpenAI's text-embedding-3-small is the default choice for most teams: excellent quality, low cost, and fully managed. Cohere embed v3 excels with input type hints. Sentence-transformers dominates self-hosted scenarios. For domain-specific needs, fine-tune or adopt specialized models. Always cache embeddings and batch-process. Choose based on your task (retrieval vs clustering vs similarity), not leaderboard hype.