OpenAI Embeddings API — Complete Tutorial
Introduction
OpenAI's Embeddings API provides state-of-the-art text embeddings through a simple REST API. This tutorial covers everything from basic usage to advanced techniques for building production applications.
- Getting Started
  - Setup
  - First Embedding
- Embedding Models
  - Model Comparison
- Batch Processing
  - Efficient Batch Requests
  - Batch with Rate Limiting
- Building a Semantic Search System
- Using Dimensionality Reduction
  - Cost and Quality Trade-off
- Integration with Vector Databases
  - Pinecone Integration
  - Chroma Integration
- Building a RAG System with OpenAI Embeddings
- Cost Optimization
  - Cost Reduction Strategies
- Best Practices
- Conclusion
- FAQ
Getting Started
Setup
```bash
pip install openai numpy scipy
```

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```
First Embedding
```python
response = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
```
Embedding Models
OpenAI offers three embedding models:
| Model | Dimensions | Input Limit | Use Case |
|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 tokens | Good balance of speed/quality |
| text-embedding-3-large | 3072 | 8191 tokens | Highest quality, slower |
| text-embedding-ada-002 | 1536 | 8191 tokens | Legacy; use 3-small instead |
Model Comparison
```python
# text-embedding-3-small (recommended default)
response_small = client.embeddings.create(
    input="Machine learning is powerful",
    model="text-embedding-3-small"
)

# text-embedding-3-large (more accurate, more expensive)
response_large = client.embeddings.create(
    input="Machine learning is powerful",
    model="text-embedding-3-large"
)

small_dim = len(response_small.data[0].embedding)  # 1536
large_dim = len(response_large.data[0].embedding)  # 3072
```
Batch Processing
Efficient Batch Requests
```python
from typing import List

def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> list:
    """Embed multiple texts efficiently."""
    # OpenAI accepts up to 2048 inputs per embeddings request
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]

# Embed documents
documents = [
    "The cat sat on the mat",
    "Dogs are loyal companions",
    "Machine learning algorithms learn patterns",
    "Python is a programming language"
]

embeddings = embed_texts(documents)
print(f"Generated {len(embeddings)} embeddings")
```
Batch with Rate Limiting
```python
import time

def embed_large_dataset(texts: List[str], batch_size: int = 100):
    """Embed large datasets in batches, pausing between requests."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        all_embeddings.extend(embed_texts(batch))
        # Rate limits vary by account tier; a fixed pause is a blunt
        # but simple way to stay under low requests-per-minute limits
        if i + batch_size < len(texts):
            time.sleep(20)  # e.g. 20s keeps you at 3 requests/minute
    return all_embeddings
```
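The fixed 20-second pause above is simple but wasteful when you are under the limit. A common alternative is to back off only when the API actually returns a rate-limit error. A minimal sketch using the tenacity library (`pip install tenacity`) and the embed_texts helper defined earlier; the function name and retry settings are illustrative, not prescriptive:

```python
from typing import List

from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(RateLimitError),       # only retry on HTTP 429
    wait=wait_exponential(multiplier=1, min=2, max=60),  # 2s, 4s, 8s, ... capped at 60s
    stop=stop_after_attempt(6),                          # give up after 6 tries
)
def embed_batch_with_backoff(batch: List[str]) -> list:
    """Embed one batch, backing off exponentially on rate-limit errors."""
    return embed_texts(batch)
```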
Building a Semantic Search System
```python
import numpy as np
from scipy.spatial.distance import cosine

class SemanticSearcher:
    def __init__(self, documents: List[str]):
        self.documents = documents
        self.embeddings = np.array(embed_texts(documents))

    def search(self, query: str, top_k: int = 3) -> list:
        """Search documents semantically."""
        query_embedding = np.array(embed_texts([query])[0])
        # Cosine similarity = 1 - cosine distance
        similarities = [1 - cosine(query_embedding, emb) for emb in self.embeddings]
        # Indices of the top-k highest similarities, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [
            {"document": self.documents[i], "score": similarities[i]}
            for i in top_indices
        ]

# Usage
documents = [
    "Python is a versatile programming language",
    "Machine learning requires large datasets",
    "Neural networks mimic the brain",
    "Data science combines statistics and programming"
]

searcher = SemanticSearcher(documents)
results = searcher.search("What is machine learning?", top_k=2)
for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Doc: {result['document']}\n")
```
Using Dimensionality Reduction
The text-embedding-3 models support shortening embeddings natively via the `dimensions` parameter. This does not change the per-token API price, but smaller vectors are cheaper to store and faster to search:
```python
response = client.embeddings.create(
    input="Sample text",
    model="text-embedding-3-small",
    dimensions=256  # Shorten from the default 1536 to 256
)

reduced_embedding = response.data[0].embedding
print(f"Reduced dimension: {len(reduced_embedding)}")  # 256
```
Cost and Quality Trade-off
The API bills per input token, so shortening vectors does not reduce what you pay OpenAI; the savings appear downstream. A 256-dimensional vector needs one sixth the storage of a 1536-dimensional one and makes similarity search proportionally faster, and shortened text-embedding-3 vectors retain most of their retrieval quality for typical applications. A sensible default: start at the full 1536 dimensions and shorten only after verifying quality on your own data.
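Embeddings you already have in storage can also be shortened after the fact rather than re-embedded: OpenAI documents that a full-size vector can be truncated and then re-normalized to unit length (the `dimensions` parameter does this server-side). A minimal sketch using the embedding from the first example; the helper name is illustrative:

```python
import numpy as np

def shorten_embedding(full_embedding: list, dim: int = 256) -> np.ndarray:
    """Truncate a full-size embedding, then re-normalize to unit length."""
    truncated = np.array(full_embedding[:dim])
    return truncated / np.linalg.norm(truncated)

short = shorten_embedding(embedding, 256)
print(len(short))  # 256
```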
Integration with Vector Databases
Pinecone Integration
```python
from pinecone import Pinecone

# Initialize Pinecone (assumes an index named "semantic-search" already exists)
pc = Pinecone(api_key="your-key")
index = pc.Index("semantic-search")

# Embed and upsert
documents = ["doc1 content", "doc2 content"]
embeddings = embed_texts(documents)

vectors = [
    (f"id{i}", emb, {"text": doc})
    for i, (emb, doc) in enumerate(zip(embeddings, documents))
]
index.upsert(vectors=vectors)

# Query
query = "Find documents about machine learning"
query_embedding = embed_texts([query])[0]
results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)
```
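The query response carries a matches list with id, score, and metadata fields; a short read-back sketch (attribute access follows the current Pinecone Python client):

```python
for match in results.matches:
    print(f"{match.score:.3f}  {match.metadata['text']}")
```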
Chroma Integration
```python
import chromadb

# Use a separate name so we don't shadow the OpenAI client used by embed_texts
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="documents")

# Embed and add
documents = ["doc1", "doc2", "doc3"]
embeddings = embed_texts(documents)

collection.add(
    ids=[f"id{i}" for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents
)

# Query
query_embedding = embed_texts(["query"])[0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)
```
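Chroma returns results as parallel lists, one entry per query. A short read-back sketch; note that chromadb.Client() is in-memory only, while chromadb.PersistentClient(path=...) persists between runs:

```python
# Lists are indexed by query; [0] is the first (and only) query above
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.3f}  {doc}")
```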
Building a RAG System with OpenAI Embeddings
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Create embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store
vectorstore = Chroma.from_texts(
    texts=documents,
    embedding=embeddings,
    persist_directory="./chroma_data"
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
answer = qa.run("What is the main topic?")
print(answer)
```
Cost Optimization
The API bills by input tokens, not by the number of embeddings, so estimate from a token count:

```python
# Monitor usage
def estimate_cost(num_tokens: int, model: str = "text-embedding-3-small") -> float:
    """Estimate embeddings API cost from an input token count."""
    # text-embedding-3-small: $0.02 per 1M tokens
    # text-embedding-3-large: $0.13 per 1M tokens
    cost_per_1m = 0.02 if model == "text-embedding-3-small" else 0.13
    return (num_tokens / 1_000_000) * cost_per_1m

# Example: 1M tokens (roughly 750,000 English words)
cost = estimate_cost(1_000_000, "text-embedding-3-small")
print(f"Estimated cost: ${cost:.2f}")  # $0.02
```
Cost Reduction Strategies
- Use text-embedding-3-small: About 6.5x cheaper than 3-large ($0.02 vs $0.13 per 1M tokens)
- Reduce dimensions: Use the `dimensions` parameter to cut vector storage and search costs
- Cache embeddings: Don't re-embed identical texts
- Batch requests: Send multiple texts per request
- Lazy embedding: Only embed when necessary
```python
# Caching implementation
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_embed(text: str) -> tuple:
    """Cache embeddings to avoid redundant API calls.

    lru_cache needs a hashable return value, hence the tuple.
    The cache is in-memory and per-process; use a database or
    on-disk store if embeddings must survive restarts.
    """
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return tuple(response.data[0].embedding)

# Repeated calls with the same text hit the cache, not the API
embedding = cached_embed("query text")
```
Best Practices
- Handle Rate Limits: Implement exponential backoff (see the retry example below)
- Error Handling: Catch API errors gracefully
- Input Validation: Truncate texts exceeding the 8,191-token input limit (see the truncation sketch below)
- Monitor Costs: Track API usage regularly
- Batch Requests: Combine multiple texts per API call
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(5))
def embed_with_retry(text: str) -> list:
    """Embed with automatic exponential-backoff retry, up to 5 attempts."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
```
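The same cl100k_base encoding can enforce the 8,191-token input limit before a request is ever sent. A minimal truncation sketch; the helper name is illustrative, and since naive truncation drops trailing content, chunking long documents is usually the better fix:

```python
import tiktoken

MAX_EMBEDDING_TOKENS = 8191  # input limit for the text-embedding-3 models
encoding = tiktoken.get_encoding("cl100k_base")

def truncate_to_limit(text: str, max_tokens: int = MAX_EMBEDDING_TOKENS) -> str:
    """Cut text down to the model's token limit before embedding."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])
```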
Conclusion
OpenAI's Embeddings API simplifies building semantic search and RAG systems. With proper batching, caching, and optimization, you can build production-grade applications cost-effectively.
FAQ
Q: Should I use 3-small or 3-large? A: Start with 3-small; it's about 6.5x cheaper and sufficient for most applications. Upgrade to 3-large only if accuracy matters more than cost.
Q: Can I reduce embeddings to save money? A: The dimensions parameter doesn't lower the per-token API price, but shorter vectors are cheaper to store and search. Start at 1536 and reduce only if quality remains acceptable for your use case.
Q: How often should I re-embed documents? A: Only when documents change, or when you migrate to a newer embedding model, since vectors from different models aren't comparable. Caching eliminates redundant embedding calls.