Embedding Model Comparison — OpenAI, Cohere, and Open-Source Options

Introduction

Embeddings are the foundation of semantic search, RAG, and recommendation systems. Your embedding model choice directly impacts retrieval quality, latency, and cost. In 2026, you have excellent options: OpenAI's text-embedding-3 family, Cohere's Embed v3 with input types, and powerful open-source models that run locally. This guide cuts through the noise.

OpenAI text-embedding-3 Family

OpenAI's latest models set the quality bar. text-embedding-3-small is incredibly efficient:

from openai import OpenAI

client = OpenAI(api_key="sk-...")

# text-embedding-3-small: 1536 dimensions by default, $0.02 per 1M tokens
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "The quick brown fox jumps over the lazy dog",
        "Vector embeddings power semantic search",
        "Transformer models revolutionized NLP",
    ],
)

# Access embeddings
for data in response.data:
    print(f"Index {data.index}: {len(data.embedding)} dimensions")
    print(f"First 5 values: {data.embedding[:5]}")

# text-embedding-3-large: 3072 dimensions, $0.13 per 1M tokens
# Higher quality: scores a few points higher than 3-small on MTEB retrieval
response_large = client.embeddings.create(
    model="text-embedding-3-large",
    input="Advanced RAG with hybrid search",
)

print(f"Embedding dimensions: {len(response_large.data[0].embedding)}")

3-small vs 3-large trade-off: 3-small is roughly 6× cheaper, and its default vectors are half the size (1536 vs 3072 dims). For most RAG use cases, 3-small is sufficient (both models also accept a dimensions parameter to shrink vectors further; see the sketch after this list). Use 3-large when:

  • Retrieval quality is critical (medical, legal docs)
  • Your queries are short or ambiguous
  • You have compute budget for larger indexes
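
Both text-embedding-3 models accept a dimensions parameter that truncates vectors server-side (Matryoshka-style) while keeping most of the quality. A minimal sketch, reusing the client above; the 256-dim choice is purely illustrative:

# Request shortened embeddings directly from the API
short = client.embeddings.create(
    model="text-embedding-3-large",
    input="Advanced RAG with hybrid search",
    dimensions=256,  # truncated and re-normalized by the API
)
print(f"Shortened embedding: {len(short.data[0].embedding)} dimensions")  # 256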

Cohere Embed v3 With Input Types

Cohere's embed-english-v3.0 introduces input type hints, improving embedding quality for specific contexts:

import cohere

co = cohere.ClientV2(api_key="...")

# Specify input_type: search_query, search_document, or classification
response = co.embed(
    texts=[
        "Find me papers about quantum computing",
        "Quantum computing is a computational paradigm",
    ],
    model="embed-english-v3.0",
    input_type="search_query",  # for queries
    embedding_types=["float"],
)

# Add documents with search_document type
doc_response = co.embed(
    texts=[
        "A comprehensive guide to quantum algorithms",
        "Quantum error correction techniques",
    ],
    model="embed-english-v3.0",
    input_type="search_document",  # for documents
    embedding_types=["float"],
)

print(f"Query embeddings: {len(response.embeddings.float_)}")
print(f"Document embeddings: {len(doc_response.embeddings.float_)}")

# Cost: $0.10 per 1M tokens, 1024 dimensions

Cohere also supports int8 (and binary) embedding types for roughly 4× storage reduction; a sketch follows. Matching the input_type to queries and documents typically improves retrieval by 3-5%, and choosing the right input_type at query time requires no changes to documents you have already indexed.
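
To get the smaller vectors, request the int8 type when embedding documents. A minimal sketch reusing the client above; the response field name is assumed to follow the same pattern as float_ in the earlier example, so confirm it against your SDK version:

# Request int8 embeddings for ~4x smaller storage than float32
compact = co.embed(
    texts=["Quantum error correction techniques"],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["int8"],
)
print(f"int8 values per document: {len(compact.embeddings.int8[0])}")  # 1024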

Sentence-Transformers Local Options

sentence-transformers runs entirely locally, ideal for privacy-sensitive applications:

from sentence_transformers import SentenceTransformer, util
import torch

# Load model (auto-downloads on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')

# all-MiniLM-L6-v2: 22M parameters, 384 dimensions
# Inference: ~1ms per sentence on CPU
sentences = [
    "Vector databases accelerate AI",
    "Semantic search finds relevant documents",
    "Embeddings capture meaning in vector space",
]

embeddings = model.encode(sentences, show_progress_bar=True)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Cosine similarity
query = "How do embeddings work?"
query_embedding = model.encode(query)

similarities = util.cos_sim(query_embedding, embeddings)[0]
for idx, score in enumerate(similarities):
    print(f"{sentences[idx]}: {score:.4f}")

# Batch encode for production
batch_size = 32
large_corpus = ["text"] * 10000
all_embeddings = model.encode(
    large_corpus,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(f"Encoded {len(all_embeddings)} documents")

# Fine-tuning for domain-specific quality
from sentence_transformers import losses, InputExample
from torch.utils.data import DataLoader

# Training data: (sentence1, sentence2, similarity_score)
train_examples = [
    InputExample(texts=["Quantum computing", "Quantum algorithms"], label=0.8),
    InputExample(texts=["Vector search", "SQL database"], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)

model.save("models/custom-embedding-model")

Popular models:

  • all-MiniLM-L6-v2: 22M params, 384 dims, fastest
  • all-mpnet-base-v2: 109M params, 768 dims, higher quality
  • paraphrase-multilingual-mpnet-base-v2: 278M params, 768 dims, 50+ languages

Multilingual Embeddings

For global products, multilingual models keep a single embedding space:

from sentence_transformers import SentenceTransformer

# Multilingual model handles 50+ languages in one space
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

texts = [
    "The quick brown fox",  # English
    "El rápido zorro marrón",  # Spanish
    "Le rapide renard brun",  # French
    "快速的棕色狐狸",  # Chinese
]

embeddings = model.encode(texts)

# Cross-lingual similarity search
import numpy as np

query = "Fast brown animal"
query_emb = model.encode(query)

scores = np.dot(embeddings, query_emb) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_emb)
)

for text, score in zip(texts, scores):
    print(f"{text}: {score:.4f}")

Trade-off: Multilingual models (768 dims) are larger than monolingual options (384 dims). For single-language use cases, specialized models are faster.

Domain-Specific Fine-Tuned Embeddings

Generic embeddings miss domain jargon. Fine-tuning dramatically improves retrieval:

import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Example: medical document embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate training pairs: (query, positive_doc, negative_doc)
medical_examples = [
    InputExample(
        texts=[
            "What are symptoms of hypertension?",
            "Hypertension presents with elevated blood pressure readings",
            "The liver produces bile for digestion"
        ]
    ),
    InputExample(
        texts=[
            "How to treat diabetic neuropathy?",
            "Diabetic neuropathy treatment involves glucose control and pain management",
            "Acute myocardial infarction requires immediate intervention"
        ]
    ),
]

train_dataloader = DataLoader(
    medical_examples,
    shuffle=True,
    batch_size=16
)

train_loss = MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100,
    evaluation_steps=500,
)

# Evaluate on domain-specific queries
queries = ["diabetes management", "cardiac procedures"]
docs = [
    "Diabetes is managed through insulin and lifestyle changes",
    "Angioplasty restores blood flow in coronary arteries",
]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)

scores = np.dot(query_embs, doc_embs.T)
print("Retrieval scores after fine-tuning:")
print(scores)

model.save("models/medical-embeddings")

Fine-tuning shows 10-40% improvement in domain-specific retrieval. Cost: a few hours of GPU time.

MTEB Benchmark Interpretation

The Massive Text Embedding Benchmark (MTEB) evaluates models across dozens of datasets grouped into task categories such as retrieval, clustering, classification, and semantic similarity. How to interpret scores:

# Hypothetical MTEB leaderboard snapshot (2026)
mteb_scores = {
    "text-embedding-3-small": {
        "retrieval": 0.74,
        "clustering": 0.51,
        "semantic_similarity": 0.82,
        "classification": 0.78,
    },
    "text-embedding-3-large": {
        "retrieval": 0.79,
        "clustering": 0.54,
        "semantic_similarity": 0.87,
        "classification": 0.82,
    },
    "all-mpnet-base-v2": {
        "retrieval": 0.69,
        "clustering": 0.48,
        "semantic_similarity": 0.80,
        "classification": 0.74,
    },
}

# Retrieval score <0.70: poor for production RAG
# 0.70-0.75: acceptable
# 0.75+: strong

# Your use case matters:
# - RAG with short queries: prioritize retrieval score
# - Clustering for topics: prioritize clustering score
# - Semantic search UI: prioritize semantic_similarity score

Don't optimize for overall score. Pick benchmarks matching your task. A model with 0.80 retrieval but 0.50 clustering is excellent for RAG.
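
To benchmark a candidate on just the task types you care about, you can run a subset of MTEB locally with the mteb package. A minimal sketch, assuming mteb and sentence-transformers are installed; SciFact is used here as an example retrieval dataset, and the exact API may differ between mteb versions:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate only on a retrieval task instead of the full benchmark
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)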

Embedding Dimensions vs Performance

More dimensions don't always help. The law of diminishing returns applies:

# Theoretical trade-off
dimensions_vs_quality = {
    "384": {"quality_gain": 1.0, "storage_mb_per_1m": 1.5, "latency_ms": 5},
    "768": {"quality_gain": 1.15, "storage_mb_per_1m": 3.0, "latency_ms": 8},
    "1536": {"quality_gain": 1.22, "storage_mb_per_1m": 6.0, "latency_ms": 12},
    "3072": {"quality_gain": 1.25, "storage_mb_per_1m": 12.0, "latency_ms": 18},
}

# For 1M documents (float32):
# 384 dims: ~1.5 GB, fast
# 3072 dims: ~12 GB, slower, and only a ~2-3% quality gain over 1536 dims in this sketch

# Recommendation: start with 384-768 dims
# Only go higher if retrieval quality is unacceptable

OpenAI's 3-small sits in the sweet spot: 1536 dims by default (fewer via the dimensions parameter), with quality and efficiency balanced for most workloads.
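
If you already have high-dimensional vectors, you can approximate the lower-dimension trade-off locally by truncating and re-normalizing before indexing. This only preserves quality well for models trained for truncation (such as the text-embedding-3 family); the numpy sketch below uses a random vector purely as a stand-in:

import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.rand(3072)            # stand-in for a 3072-dim embedding
short = truncate_embedding(full, 512)
print(short.shape)                     # (512,)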

Cost Per Million Tokens

Here's 2026 pricing for reference:

# Pricing (as of 2026-03)
embedding_costs = {
    "text-embedding-3-small": 0.02,   # per 1M tokens
    "text-embedding-3-large": 0.13,
    "embed-english-v3.0": 0.10,
    "local (free)": 0.00,
}

# Monthly cost for typical RAG app
docs = 100_000
avg_doc_tokens = 500
total_tokens = docs * avg_doc_tokens

# One-time embedding cost
print("One-time embedding cost:")
for model, cost_per_1m in embedding_costs.items():
    total_cost = (total_tokens / 1_000_000) * cost_per_1m
    if total_cost == 0:
        print(f"{model}: $0 (free)")
    else:
        print(f"{model}: ${total_cost:.2f}")

# Query cost (assuming 1000 queries/day, 100 tokens each)
queries_per_month = 1000 * 30
query_tokens = queries_per_month * 100

print("\nMonthly query cost:")
for model, cost_per_1m in embedding_costs.items():
    total_cost = (query_tokens / 1_000_000) * cost_per_1m
    if total_cost == 0:
        print(f"{model}: $0/month (free)")
    else:
        print(f"{model}: ${total_cost:.2f}/month")

At scale, local embeddings (no per-token API cost) become attractive. Trade-off: you pay for compute and take on the DevOps overhead of hosting the model.

Batch Embedding for Bulk Processing

Never embed one-at-a-time in production. Batch processing is 10-50× faster:

import asyncio
from openai import AsyncOpenAI

async def batch_embed_documents(
    documents: list[str],
    batch_size: int = 100,
    model: str = "text-embedding-3-small",
):
    client = AsyncOpenAI(api_key="sk-...")
    embeddings_list = []

    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]

        response = await client.embeddings.create(
            model=model,
            input=batch,
        )

        embeddings_list.extend([data.embedding for data in response.data])

    return embeddings_list

# Usage
async def main():
    docs = [f"Document {i}" for i in range(10000)]
    embeddings = await batch_embed_documents(docs, batch_size=100)
    print(f"Embedded {len(embeddings)} documents")

asyncio.run(main())

A batch size of 100-500 usually maximizes throughput. Larger batches reduce per-request overhead and rate-limit pressure; per-token pricing stays the same. The sketch below shows one way to overlap batches.
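
The loop above awaits one batch at a time. To overlap network round trips, you can dispatch batches concurrently with asyncio.gather, capped by a semaphore to stay under rate limits. A minimal sketch under the same client setup as above; max_concurrency is an assumption to tune against your own limits:

import asyncio
from openai import AsyncOpenAI

async def embed_concurrent(
    documents: list[str],
    batch_size: int = 100,
    max_concurrency: int = 5,
) -> list[list[float]]:
    client = AsyncOpenAI(api_key="sk-...")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def embed_batch(batch: list[str]) -> list[list[float]]:
        # Limit how many requests are in flight at once
        async with semaphore:
            response = await client.embeddings.create(
                model="text-embedding-3-small",
                input=batch,
            )
            return [d.embedding for d in response.data]

    batches = [documents[i : i + batch_size] for i in range(0, len(documents), batch_size)]
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    # Flatten per-batch results back into one ordered list
    return [emb for batch in results for emb in batch]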

Caching Embeddings

Never recompute embeddings. Cache aggressively:

import hashlib
import json
from redis import Redis

class EmbeddingCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = Redis.from_url(redis_url)

    def get_or_create(self, text: str, model: str = "text-embedding-3-small"):
        # Create cache key from text content
        key = f"emb:{model}:{hashlib.md5(text.encode()).hexdigest()}"

        # Check cache
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Compute and cache
        from openai import OpenAI
        client = OpenAI()
        response = client.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding

        # Store for 30 days
        self.redis.setex(key, 30 * 86400, json.dumps(embedding))
        return embedding

# Usage
cache = EmbeddingCache()

# First call: computes
emb1 = cache.get_or_create("Vector databases")

# Second call: returns cached
emb2 = cache.get_or_create("Vector databases")

# Verify cache hit
assert emb1 == emb2

In workloads where the same texts recur, cache hits can save 90%+ of embedding API costs. Critical for production.

Checklist

  • Define embedding task (retrieval, clustering, classification)
  • Check MTEB scores for your task, not overall score
  • Benchmark 2-3 models on your corpus
  • Calculate 12-month API cost vs local inference cost
  • Test multilingual support if applicable
  • Implement embedding caching layer
  • Set up batch processing pipeline
  • Monitor embedding model performance over time
  • Plan fine-tuning if domain-specific retrieval is weak

Conclusion

In 2026, OpenAI's text-embedding-3-small is the default choice for most teams: excellent quality, low cost, and fully managed. Cohere embed v3 excels with input type hints. Sentence-transformers dominates self-hosted scenarios. For domain-specific needs, fine-tune or adopt specialized models. Always cache embeddings and batch-process. Choose based on your task (retrieval vs clustering vs similarity), not leaderboard hype.