Embedding Model Comparison — OpenAI, Cohere, and Open-Source Options

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Embeddings are the foundation of semantic search, RAG, and recommendation systems. Your embedding model choice directly impacts retrieval quality, latency, and cost. In 2026, you have excellent options: OpenAI's text-embedding-3 family, Cohere's Embed v3 with input types, and powerful open-source models that run locally. This guide cuts through the noise.
- OpenAI text-embedding-3 Family
- Cohere Embed v3 With Input Types
- Sentence-Transformers Local Options
- Multilingual Embeddings
- Domain-Specific Fine-Tuned Embeddings
- MTEB Benchmark Interpretation
- Embedding Dimensions vs Performance
- Cost Per Million Tokens
- Batch Embedding for Bulk Processing
- Caching Embeddings
- Checklist
- Conclusion
OpenAI text-embedding-3 Family
OpenAI's latest models set the quality bar. text-embedding-3-small is incredibly efficient:
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# text-embedding-3-small: 1536 dimensions by default, $0.02 per 1M tokens
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "The quick brown fox jumps over the lazy dog",
        "Vector embeddings power semantic search",
        "Transformer models revolutionized NLP",
    ],
)

# Access embeddings
for data in response.data:
    print(f"Index {data.index}: {len(data.embedding)} dimensions")
    print(f"First 5 values: {data.embedding[:5]}")

# text-embedding-3-large: 3072 dimensions, $0.13 per 1M tokens
# Higher quality; scores several points higher on MTEB retrieval tasks
response_large = client.embeddings.create(
    model="text-embedding-3-large",
    input="Advanced RAG with hybrid search",
)
print(f"Embedding dimensions: {len(response_large.data[0].embedding)}")
3-small vs 3-large trade-off: 3-small is about 6× cheaper and its vectors are half the size (1536 vs 3072 dims by default; both models accept a dimensions parameter for shorter vectors, shown after the list below). For most RAG use cases, 3-small is sufficient. Use 3-large when:
- Retrieval quality is critical (medical, legal docs)
- Your queries are short, vague, or ambiguous
- You have compute budget for larger indexes
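Both text-embedding-3 models also accept a dimensions parameter that returns a truncated (Matryoshka-style) vector directly from the API, trading a little quality for a much smaller index. A minimal sketch, reusing the client from the snippet above:

# Request a shortened vector straight from the API
short = client.embeddings.create(
    model="text-embedding-3-small",
    input="Advanced RAG with hybrid search",
    dimensions=512,  # truncate from the default 1536
)
print(f"Shortened embedding: {len(short.data[0].embedding)} dimensions")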
Cohere Embed v3 With Input Types
Cohere's embed-english-v3.0 introduces input type hints, improving embedding quality for specific contexts:
import cohere
co = cohere.ClientV2(api_key="...")
# Specify input_type: search_query, search_document, classification, or clustering
response = co.embed(
    texts=[
        "Find me papers about quantum computing",
        "Quantum computing is a computational paradigm",
    ],
    model="embed-english-v3.0",
    input_type="search_query",  # for queries
    embedding_types=["float"],
)

# Embed documents with the search_document type
doc_response = co.embed(
    texts=[
        "A comprehensive guide to quantum algorithms",
        "Quantum error correction techniques",
    ],
    model="embed-english-v3.0",
    input_type="search_document",  # for documents
    embedding_types=["float"],
)

print(f"Query embeddings: {len(response.embeddings.float_)}")
print(f"Document embeddings: {len(doc_response.embeddings.float_)}")

# Cost: $0.10 per 1M tokens, 1024 dimensions
Cohere also supports int8 quantization via embedding_types for roughly 4× storage reduction. Matching input_type on queries and documents typically buys a few percentage points of retrieval quality; the query-side hint is applied at search time, so an index already built with search_document benefits without reindexing.
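To wire the two calls together, score each query against each document with cosine similarity. A minimal numpy sketch, reusing response and doc_response from above:

import numpy as np

q = np.array(response.embeddings.float_)      # query vectors, shape (2, 1024)
d = np.array(doc_response.embeddings.float_)  # document vectors, shape (2, 1024)

# Normalize rows, then cosine similarity is a plain dot product
q = q / np.linalg.norm(q, axis=1, keepdims=True)
d = d / np.linalg.norm(d, axis=1, keepdims=True)
print(q @ d.T)  # (num_queries, num_docs) similarity matrix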
Sentence-Transformers Local Options
sentence-transformers runs entirely locally, ideal for privacy-sensitive applications:
from sentence_transformers import SentenceTransformer, util

# Load model (auto-downloads on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')

# all-MiniLM-L6-v2: 22M parameters, 384 dimensions
# Inference: roughly 1ms per sentence on a modern CPU
sentences = [
    "Vector databases accelerate AI",
    "Semantic search finds relevant documents",
    "Embeddings capture meaning in vector space",
]
embeddings = model.encode(sentences, show_progress_bar=True)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Cosine similarity
query = "How do embeddings work?"
query_embedding = model.encode(query)
similarities = util.cos_sim(query_embedding, embeddings)[0]
for idx, score in enumerate(similarities):
    print(f"{sentences[idx]}: {score:.4f}")

# Batch encode for production
batch_size = 32
large_corpus = ["text"] * 10000
all_embeddings = model.encode(
    large_corpus,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(f"Encoded {len(all_embeddings)} documents")

# Fine-tuning for domain-specific quality
from sentence_transformers import losses, InputExample
from torch.utils.data import DataLoader

# Training data: (sentence1, sentence2, similarity_score)
train_examples = [
    InputExample(texts=["Quantum computing", "Quantum algorithms"], label=0.8),
    InputExample(texts=["Vector search", "SQL database"], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("models/custom-embedding-model")
Popular models (a quick timing sketch follows the list):
- all-MiniLM-L6-v2: 22M params, 384 dims, best speed
- all-mpnet-base-v2: 109M params, 768 dims, higher quality
- paraphrase-multilingual-mpnet-base-v2: 278M params, 768 dims, 50+ languages
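Relative speed is easy to measure on your own hardware. A rough timing sketch (models download on first run; absolute numbers vary widely by CPU):

import time
from sentence_transformers import SentenceTransformer

corpus = ["Semantic search finds relevant documents"] * 1000

for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    start = time.perf_counter()
    model.encode(corpus, batch_size=64, show_progress_bar=False)
    print(f"{name}: {time.perf_counter() - start:.2f}s for {len(corpus)} sentences")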
Multilingual Embeddings
For global products, multilingual models map every language into a single shared embedding space:
from sentence_transformers import SentenceTransformer
import numpy as np

# Multilingual model handles 50+ languages in one space
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

texts = [
    "The quick brown fox",     # English
    "El rápido zorro marrón",  # Spanish
    "Le rapide renard brun",   # French
    "快速的棕色狐狸",            # Chinese
]
embeddings = model.encode(texts)

# Cross-lingual similarity search
query = "Fast brown animal"
query_emb = model.encode(query)
scores = np.dot(embeddings, query_emb) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_emb)
)
for text, score in zip(texts, scores):
    print(f"{text}: {score:.4f}")
Trade-off: multilingual models are heavier (278M params, 768 dims) than compact monolingual options (22M params, 384 dims). For single-language use cases, the specialized models are faster and cheaper to index.
Domain-Specific Fine-Tuned Embeddings
Generic embeddings miss domain jargon. Fine-tuning dramatically improves retrieval:
import numpy as np
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.losses import MultipleNegativesRankingLoss
from torch.utils.data import DataLoader

# Example: medical document embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Training triplets: (query, positive_doc, negative_doc)
medical_examples = [
    InputExample(
        texts=[
            "What are symptoms of hypertension?",
            "Hypertension presents with elevated blood pressure readings",
            "The liver produces bile for digestion",
        ]
    ),
    InputExample(
        texts=[
            "How to treat diabetic neuropathy?",
            "Diabetic neuropathy treatment involves glucose control and pain management",
            "Acute myocardial infarction requires immediate intervention",
        ]
    ),
]
train_dataloader = DataLoader(
    medical_examples,
    shuffle=True,
    batch_size=16,
)
train_loss = MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100,
    evaluation_steps=500,
)

# Evaluate on domain-specific queries
queries = ["diabetes management", "cardiac procedures"]
docs = [
    "Diabetes is managed through insulin and lifestyle changes",
    "Angioplasty restores blood flow in coronary arteries",
]
query_embs = model.encode(queries)
doc_embs = model.encode(docs)
scores = np.dot(query_embs, doc_embs.T)
print("Retrieval scores after fine-tuning:")
print(scores)

model.save("models/medical-embeddings")
Fine-tuning typically yields a 10-40% improvement in domain-specific retrieval for the cost of a few hours of GPU time; measure on your own queries to confirm (see the sketch below).
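sentence-transformers ships an InformationRetrievalEvaluator that makes this measurable on your own data. A minimal sketch with a toy evaluation set (real runs need hundreds of labeled query-document pairs):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("models/medical-embeddings")

# query id -> query, doc id -> doc, query id -> ids of relevant docs
queries = {"q1": "diabetes management"}
corpus = {
    "d1": "Diabetes is managed through insulin and lifestyle changes",
    "d2": "Angioplasty restores blood flow in coronary arteries",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
score = evaluator(model)  # main metric (a dict of metrics in recent versions)
print(score)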
MTEB Benchmark Interpretation
The Massive Text Embedding Benchmark (MTEB) evaluates models across dozens of datasets grouped into task types (retrieval, clustering, classification, semantic similarity, and more). How to interpret scores:
# Hypothetical MTEB leaderboard snapshot (2026)
mteb_scores = {
    "text-embedding-3-small": {
        "retrieval": 0.74,
        "clustering": 0.51,
        "semantic_similarity": 0.82,
        "classification": 0.78,
    },
    "text-embedding-3-large": {
        "retrieval": 0.79,
        "clustering": 0.54,
        "semantic_similarity": 0.87,
        "classification": 0.82,
    },
    "all-mpnet-base-v2": {
        "retrieval": 0.69,
        "clustering": 0.48,
        "semantic_similarity": 0.80,
        "classification": 0.74,
    },
}

# Rough rules of thumb for retrieval scores:
#   <0.70: weak for production RAG
#   0.70-0.75: acceptable
#   0.75+: strong

# Your use case matters:
# - RAG with short queries: prioritize retrieval score
# - Clustering for topics: prioritize clustering score
# - Semantic search UI: prioritize semantic_similarity score
Don't optimize for overall score. Pick benchmarks matching your task. A model with 0.80 retrieval but 0.50 clustering is excellent for RAG.
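A tiny helper makes that concrete: given the mteb_scores dict above, rank candidates by the score that matches your task.

def best_models_for(task: str, scores: dict) -> list[tuple[str, float]]:
    """Rank models by a single task-specific score, best first."""
    return sorted(
        ((name, s[task]) for name, s in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

print(best_models_for("retrieval", mteb_scores))   # for RAG
print(best_models_for("clustering", mteb_scores))  # for topic discovery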
Embedding Dimensions vs Performance
More dimensions don't always help. The law of diminishing returns applies:
# Illustrative trade-off (relative quality vs float32 storage for 1M vectors)
dimensions_vs_quality = {
    "384":  {"quality_gain": 1.00, "storage_gb_per_1m": 1.5,  "latency_ms": 5},
    "768":  {"quality_gain": 1.15, "storage_gb_per_1m": 3.0,  "latency_ms": 8},
    "1536": {"quality_gain": 1.22, "storage_gb_per_1m": 6.0,  "latency_ms": 12},
    "3072": {"quality_gain": 1.25, "storage_gb_per_1m": 12.0, "latency_ms": 18},
}

# For 1M documents:
#   384 dims: 1.5 GB and fast
#   3072 dims: 12 GB, slower, and only ~3% quality gain over 1536 dims

# Recommendation: start with 384-768 dims
# Only go higher if retrieval quality is unacceptable
OpenAI's 3-small (1536 dims by default, truncatable via the dimensions parameter) is the Goldilocks option: balanced quality and efficiency.
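The storage side of the trade-off is easy to verify yourself: a float32 vector costs dims × 4 bytes. A quick estimator (raw vectors only, excluding index overhead):

def index_size_gb(num_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage in GB, excluding index overhead."""
    return num_docs * dims * bytes_per_value / 1e9

for dims in (384, 768, 1536, 3072):
    print(f"{dims} dims, 1M docs: {index_size_gb(1_000_000, dims):.1f} GB")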
Cost Per Million Tokens
Here's 2026 pricing for reference:
# Pricing (as of 2026-03)
embedding_costs = {
    "text-embedding-3-small": 0.02,  # per 1M tokens
    "text-embedding-3-large": 0.13,
    "embed-english-v3.0": 0.10,
    "local (free)": 0.00,
}

# Monthly cost for a typical RAG app
docs = 100_000
avg_doc_tokens = 500
total_tokens = docs * avg_doc_tokens

# One-time embedding cost
print("One-time embedding cost:")
for model, cost_per_1m in embedding_costs.items():
    total_cost = (total_tokens / 1_000_000) * cost_per_1m
    if total_cost == 0:
        print(f"{model}: $0 (free)")
    else:
        print(f"{model}: ${total_cost:.2f}")

# Query cost (assuming 1000 queries/day, 100 tokens each)
queries_per_month = 1000 * 30
query_tokens = queries_per_month * 100
print("\nMonthly query cost:")
for model, cost_per_1m in embedding_costs.items():
    total_cost = (query_tokens / 1_000_000) * cost_per_1m
    if total_cost == 0:
        print(f"{model}: $0/month (free)")
    else:
        print(f"{model}: ${total_cost:.2f}/month")
At scale, local embeddings (no per-token fees) become attractive; the trade-off is paying for compute yourself plus the DevOps overhead of running inference. A rough break-even sketch follows.
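To estimate the break-even point, compare monthly API spend against a fixed self-hosting cost. The $200/month server figure below is an assumption for illustration; substitute your own numbers:

API_COST_PER_1M = 0.02         # text-embedding-3-small, $ per 1M tokens
SERVER_COST_PER_MONTH = 200.0  # assumed self-hosted inference box (hypothetical)

# Monthly token volume above which self-hosting is cheaper
breakeven_tokens = (SERVER_COST_PER_MONTH / API_COST_PER_1M) * 1_000_000
print(f"Break-even: {breakeven_tokens / 1e9:.0f}B tokens/month")

At 3-small prices you need roughly 10B tokens a month before a dedicated box pays for itself, which is why the managed API remains the default for small and mid-size workloads.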
Batch Embedding for Bulk Processing
Never embed one-at-a-time in production. Batch processing is 10-50× faster:
import asyncio
from openai import AsyncOpenAI

async def batch_embed_documents(
    documents: list[str],
    batch_size: int = 100,
    model: str = "text-embedding-3-small",
):
    client = AsyncOpenAI(api_key="sk-...")
    embeddings_list = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]
        response = await client.embeddings.create(
            model=model,
            input=batch,
        )
        embeddings_list.extend([data.embedding for data in response.data])
    return embeddings_list

# Usage
async def main():
    docs = [f"Document {i}" for i in range(10000)]
    embeddings = await batch_embed_documents(docs, batch_size=100)
    print(f"Embedded {len(embeddings)} documents")

asyncio.run(main())
Batch sizes of 100-500 balance throughput against request-size limits. Batching amortizes per-request overhead; the per-token price itself doesn't change. In production, pair it with bounded concurrency and retries, as sketched below.
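A production variant usually adds bounded concurrency and retries on top of batching. A sketch using asyncio.Semaphore with simple exponential backoff (the concurrency limit and retry count are assumptions; tune them to your rate tier):

import asyncio
from openai import AsyncOpenAI

async def embed_batch_with_retry(client, batch, model, semaphore, max_retries=3):
    async with semaphore:  # cap the number of in-flight requests
        for attempt in range(max_retries):
            try:
                response = await client.embeddings.create(model=model, input=batch)
                return [d.embedding for d in response.data]
            except Exception:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # exponential backoff

async def embed_concurrently(documents, batch_size=100, concurrency=5):
    client = AsyncOpenAI(api_key="sk-...")
    semaphore = asyncio.Semaphore(concurrency)
    batches = [
        documents[i : i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    results = await asyncio.gather(*[
        embed_batch_with_retry(client, b, "text-embedding-3-small", semaphore)
        for b in batches
    ])
    return [emb for batch in results for emb in batch]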
Caching Embeddings
Never recompute embeddings. Cache aggressively:
import hashlib
import json
from openai import OpenAI
from redis import Redis

class EmbeddingCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = Redis.from_url(redis_url)

    def get_or_create(self, text: str, model: str = "text-embedding-3-small"):
        # Create cache key from text content
        key = f"emb:{model}:{hashlib.md5(text.encode()).hexdigest()}"

        # Check cache
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Compute and cache
        client = OpenAI()
        response = client.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding

        # Store for 30 days
        self.redis.setex(key, 30 * 86400, json.dumps(embedding))
        return embedding

# Usage
cache = EmbeddingCache()

# First call: computes
emb1 = cache.get_or_create("Vector databases")

# Second call: returns cached
emb2 = cache.get_or_create("Vector databases")

# Verify cache hit
assert emb1 == emb2
For workloads with repeated content, cache hits can eliminate the bulk of embedding API spend, often 90% or more. Critical for production.
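For bulk workloads, extend the same idea to look up many keys in one round trip and embed only the misses. A sketch using redis-py's mget and a pipeline, reusing the imports and EmbeddingCache from above:

def get_or_create_many(cache: EmbeddingCache, texts: list[str],
                       model: str = "text-embedding-3-small"):
    keys = [f"emb:{model}:{hashlib.md5(t.encode()).hexdigest()}" for t in texts]
    cached = cache.redis.mget(keys)  # one round trip for all lookups
    results = [json.loads(c) if c else None for c in cached]

    misses = [i for i, c in enumerate(cached) if c is None]
    if misses:
        client = OpenAI()
        response = client.embeddings.create(
            model=model, input=[texts[i] for i in misses]
        )
        pipe = cache.redis.pipeline()
        for i, data in zip(misses, response.data):
            results[i] = data.embedding
            pipe.setex(keys[i], 30 * 86400, json.dumps(data.embedding))
        pipe.execute()
    return results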
Checklist
- Define embedding task (retrieval, clustering, classification)
- Check MTEB scores for your task, not overall score
- Benchmark 2-3 models on your corpus
- Calculate 12-month API cost vs local inference cost
- Test multilingual support if applicable
- Implement embedding caching layer
- Set up batch processing pipeline
- Monitor embedding model performance over time
- Plan fine-tuning if domain-specific retrieval is weak
Conclusion
In 2026, OpenAI's text-embedding-3-small is the default choice for most teams: excellent quality, low cost, and fully managed. Cohere embed v3 excels with input type hints. Sentence-transformers dominates self-hosted scenarios. For domain-specific needs, fine-tune or adopt specialized models. Always cache embeddings and batch-process. Choose based on your task (retrieval vs clustering vs similarity), not leaderboard hype.