OpenAI Embeddings API — Complete Tutorial

Sanjeev Sharma
6 min read

Introduction

OpenAI's Embeddings API provides state-of-the-art text embeddings through a simple REST API. This tutorial covers everything from basic usage to advanced techniques for building production applications.

Getting Started

Setup

pip install openai numpy scipy

from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

First Embedding

response = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Embedding Models

OpenAI offers three embedding models:

Model                   Dimensions  Input Limit   Use Case
text-embedding-3-small  1536        8191 tokens   Good balance of speed/quality
text-embedding-3-large  3072        8191 tokens   Highest quality, slower
text-embedding-ada-002  1536        8191 tokens   Deprecated; use 3-small

Model Comparison

# text-embedding-3-small (recommended)
response_small = client.embeddings.create(
    input="Machine learning is powerful",
    model="text-embedding-3-small"
)

# text-embedding-3-large (more accurate, slower)
response_large = client.embeddings.create(
    input="Machine learning is powerful",
    model="text-embedding-3-large"
)

small_dim = len(response_small.data[0].embedding)  # 1536
large_dim = len(response_large.data[0].embedding)  # 3072
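Whichever model you choose, embeddings are compared with cosine similarity. A minimal helper using numpy (OpenAI embeddings come back normalized to unit length, so a plain dot product gives the same value):

import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Note that vectors from different models (1536 vs 3072 dimensions) live in different spaces and cannot be compared with each other.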

Batch Processing

Efficient Batch Requests

from typing import List

def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> list:
    """Embed multiple texts efficiently."""
    # OpenAI allows up to 2048 texts per request
    response = client.embeddings.create(
        input=texts,
        model=model
    )

    embeddings = [item.embedding for item in response.data]
    return embeddings

# Embed documents
documents = [
    "The cat sat on the mat",
    "Dogs are loyal companions",
    "Machine learning algorithms learn patterns",
    "Python is a programming language"
]

embeddings = embed_texts(documents)
print(f"Generated {len(embeddings)} embeddings")

Batch with Rate Limiting

import time

def embed_large_dataset(texts: List[str], batch_size: int = 100):
    """Embed large datasets with rate limiting."""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings = embed_texts(batch)
        all_embeddings.extend(embeddings)

        # Rate limit: 3 requests per minute for free tier
        if i + batch_size < len(texts):
            time.sleep(20)  # Wait 20 seconds between batches

    return all_embeddings

Building a Semantic Search System

import numpy as np
from scipy.spatial.distance import cosine

class SemanticSearcher:
    def __init__(self, documents: List[str]):
        self.documents = documents
        self.embeddings = np.array(embed_texts(documents))

    def search(self, query: str, top_k: int = 3) -> list:
        """Search documents semantically."""
        query_embedding = np.array(embed_texts([query])[0])

        # Compute cosine similarities
        similarities = []
        for emb in self.embeddings:
            sim = 1 - cosine(query_embedding, emb)
            similarities.append(sim)

        # Get top-k results
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        results = [
            {
                "document": self.documents[i],
                "score": similarities[i]
            }
            for i in top_indices
        ]

        return results

# Usage
documents = [
    "Python is a versatile programming language",
    "Machine learning requires large datasets",
    "Neural networks mimic the brain",
    "Data science combines statistics and programming"
]

searcher = SemanticSearcher(documents)
results = searcher.search("What is machine learning?", top_k=2)

for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Doc: {result['document']}\n")

Using Dimensionality Reduction

The text-embedding-3 models support shortening embeddings via the dimensions parameter, which reduces vector storage and downstream compute costs:

response = client.embeddings.create(
    input="Sample text",
    model="text-embedding-3-small",
    dimensions=256  # Reduce from 1536 to 256
)

reduced_embedding = response.data[0].embedding
print(f"Reduced dimension: {len(reduced_embedding)}")  # 256

Cost and Quality Trade-off

# Full dimensions (1536)
# Storage: ~6 KB per vector (float32)
# Quality: Highest

# Reduced dimensions (256)
# Storage: Reduced proportionally (~1 KB per vector)
# Quality: Still very good for most applications

# Note: API billing is per input token, so shortening does not change the
# API cost — the savings are in vector storage and similarity computation.

# Recommendation: Start with 1536; reduce if storage cost is a concern and
# retrieval quality remains acceptable

Integration with Vector Databases

Pinecone Integration

import pinecone

# Initialize Pinecone
pc = pinecone.Pinecone(api_key="your-key")
index = pc.Index("semantic-search")

# Embed and upsert
documents = ["doc1 content", "doc2 content"]
embeddings = embed_texts(documents)

vectors = [
    (f"id{i}", emb, {"text": doc})
    for i, (emb, doc) in enumerate(zip(embeddings, documents))
]

index.upsert(vectors=vectors)

# Query
query = "Find documents about machine learning"
query_embedding = embed_texts([query])[0]

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)
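The query response contains the matched vectors with scores and metadata. With the current Pinecone client, iterating over them looks like this (attribute names may differ in older client versions):

for match in results.matches:
    print(f"{match.score:.3f}  {match.metadata['text']}")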

Chroma Integration

import chromadb

# Use a separate name so the OpenAI client defined earlier is not shadowed
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="documents")

# Embed and add
documents = ["doc1", "doc2", "doc3"]
embeddings = embed_texts(documents)

collection.add(
    ids=[f"id{i}" for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents
)

# Query
query_embedding = embed_texts(["query"])[0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)
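Chroma returns the results as a dict of lists, with one inner list per query embedding; lower distances mean closer matches:

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.3f}  {doc}")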

Building a RAG System with OpenAI Embeddings

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Create embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store
vectorstore = Chroma.from_texts(
    texts=documents,
    embedding=embeddings,
    persist_directory="./chroma_data"
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
answer = qa.invoke({"query": "What is the main topic?"})
print(answer["result"])
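To inspect which chunks the answer was grounded in, RetrievalQA can also return the retrieved source documents:

qa_with_sources = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

result = qa_with_sources.invoke({"query": "What is the main topic?"})
for doc in result["source_documents"]:
    print(doc.page_content)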

Cost Optimization

# Monitor usage
def estimate_cost(num_tokens: int, model: str = "text-embedding-3-small") -> float:
    """Estimate API cost (billing is per input token)."""
    # text-embedding-3-small: $0.02 per 1M tokens
    # text-embedding-3-large: $0.13 per 1M tokens

    if model == "text-embedding-3-small":
        cost_per_1m = 0.02
    else:
        cost_per_1m = 0.13

    return (num_tokens / 1_000_000) * cost_per_1m

# Example: 1M tokens with the small model
cost = estimate_cost(1_000_000, "text-embedding-3-small")
print(f"Estimated cost: ${cost:.2f}")  # $0.02

Cost Reduction Strategies

  1. Use text-embedding-3-small: about 6.5x cheaper than 3-large ($0.02 vs $0.13 per 1M tokens)
  2. Reduce dimensions: Use the dimensions parameter to shrink stored vectors
  3. Cache embeddings: Don't re-embed identical texts (see the sketch below)
  4. Batch requests: Send multiple texts per request
  5. Lazy embedding: Only embed when necessary

A simple in-memory cache with functools.lru_cache (the embedding is returned as a tuple so the cached value is hashable):

# Caching implementation
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_embed(text: str) -> tuple:
    """Cache embeddings to avoid redundant API calls."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return tuple(response.data[0].embedding)

# Use cached version
embedding = cached_embed("query text")
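lru_cache only lives for the lifetime of the process. For a cache that survives restarts, here is a minimal sketch using a local JSON file (the file name is an arbitrary choice for illustration):

import json
import os

CACHE_PATH = "embedding_cache.json"  # illustrative file name

def persistent_embed(text: str) -> list:
    """Embed with a simple on-disk JSON cache keyed by the input text."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    if text not in cache:
        cache[text] = list(cached_embed(text))
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[text]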

Best Practices

  1. Handle Rate Limits: Implement exponential backoff (see the retry sketch below)
  2. Error Handling: Catch API errors gracefully
  3. Input Validation: Truncate texts exceeding 8191 tokens
  4. Monitor Costs: Track API usage regularly
  5. Batch Requests: Combine multiple texts per API call

A retry wrapper using the tenacity library (the stop condition and reraise=True keep a persistent failure from retrying forever and let the original exception propagate):

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=10), reraise=True)
def embed_with_retry(text: str) -> list:
    """Embed with automatic retry on failure."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
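For the error-handling point, the openai package raises typed exceptions that can be caught explicitly; a minimal sketch building on embed_with_retry above:

import openai
from typing import Optional

def safe_embed(text: str) -> Optional[list]:
    """Embed text, returning None if the API call ultimately fails."""
    try:
        return embed_with_retry(text)
    except openai.RateLimitError:
        print("Rate limit still exceeded after retries")
    except openai.APIError as e:
        print(f"OpenAI API error: {e}")
    return None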

Conclusion

OpenAI's Embeddings API simplifies building semantic search and RAG systems. With proper batching, caching, and optimization, you can build production-grade applications cost-effectively.

FAQ

Q: Should I use 3-small or 3-large? A: Start with 3-small; it's about 6.5x cheaper and sufficient for most applications. Upgrade to 3-large if accuracy matters more than cost.

Q: Can I reduce embeddings to save money? A: The dimensions parameter shrinks stored vectors, which cuts storage and compute costs (API billing is per input token either way). Start at 1536 and only reduce if quality remains acceptable for your use case.

Q: How often should I re-embed documents? A: Only re-embed when documents change or annually for model improvements. Caching eliminates redundant embeddings.

Written by Sanjeev Sharma

Full Stack Engineer · E-mopro