OpenAI Embeddings API — Complete Tutorial

Sanjeev Sharma
6 min read

Introduction

OpenAI's Embeddings API provides state-of-the-art text embeddings through a simple REST API. This tutorial covers everything from basic usage to advanced techniques for building production applications.

Getting Started

Setup

pip install openai numpy scipy

from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

First Embedding

response = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Embedding Models

OpenAI offers three embedding models:

Model                   Dimensions  Input Limit   Use Case
text-embedding-3-small  1536        8191 tokens   Good balance of speed/quality
text-embedding-3-large  3072        8191 tokens   Highest quality, slower
text-embedding-ada-002  1536        8191 tokens   Deprecated; use 3-small

Model Comparison

# text-embedding-3-small (recommended)
response_small = client.embeddings.create(
    input="Machine learning is powerful",
    model="text-embedding-3-small"
)

# text-embedding-3-large (more accurate, slower)
response_large = client.embeddings.create(
    input="Machine learning is powerful",
    model="text-embedding-3-large"
)

small_dim = len(response_small.data[0].embedding)  # 1536
large_dim = len(response_large.data[0].embedding)  # 3072
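Whichever model you choose, embeddings are compared with cosine similarity. A minimal helper using numpy (OpenAI embeddings come back normalized to unit length, so a plain dot product gives the same value):

import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Note that vectors from different models (1536 vs 3072 dimensions) live in different spaces and cannot be compared with each other.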

Batch Processing

Efficient Batch Requests

from typing import List

def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> list:
    """Embed multiple texts efficiently."""
    # OpenAI allows up to 2048 texts per request
    response = client.embeddings.create(
        input=texts,
        model=model
    )

    embeddings = [item.embedding for item in response.data]
    return embeddings

# Embed documents
documents = [
    "The cat sat on the mat",
    "Dogs are loyal companions",
    "Machine learning algorithms learn patterns",
    "Python is a programming language"
]

embeddings = embed_texts(documents)
print(f"Generated {len(embeddings)} embeddings")

Batch with Rate Limiting

import time

def embed_large_dataset(texts: List[str], batch_size: int = 100):
    """Embed large datasets with rate limiting."""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings = embed_texts(batch)
        all_embeddings.extend(embeddings)

        # Rate limit: 3 requests per minute for free tier
        if i + batch_size < len(texts):
            time.sleep(20)  # Wait 20 seconds between batches

    return all_embeddings

Building a Semantic Search System

import numpy as np
from scipy.spatial.distance import cosine

class SemanticSearcher:
    def __init__(self, documents: List[str]):
        self.documents = documents
        self.embeddings = np.array(embed_texts(documents))

    def search(self, query: str, top_k: int = 3) -> list:
        """Search documents semantically."""
        query_embedding = np.array(embed_texts([query])[0])

        # Compute cosine similarities
        similarities = []
        for emb in self.embeddings:
            sim = 1 - cosine(query_embedding, emb)
            similarities.append(sim)

        # Get top-k results
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        results = [
            {
                "document": self.documents[i],
                "score": similarities[i]
            }
            for i in top_indices
        ]

        return results

# Usage
documents = [
    "Python is a versatile programming language",
    "Machine learning requires large datasets",
    "Neural networks mimic the brain",
    "Data science combines statistics and programming"
]

searcher = SemanticSearcher(documents)
results = searcher.search("What is machine learning?", top_k=2)

for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Doc: {result['document']}\n")

Using Dimensionality Reduction

The text-embedding-3 models support shortening embeddings via the dimensions parameter, which reduces vector storage and downstream compute costs:

response = client.embeddings.create(
    input="Sample text",
    model="text-embedding-3-small",
    dimensions=256  # Reduce from 1536 to 256
)

reduced_embedding = response.data[0].embedding
print(f"Reduced dimension: {len(reduced_embedding)}")  # 256

Cost and Quality Trade-off

# Full dimensions (1536)
# Storage: ~6 KB per vector (float32)
# Quality: Highest

# Reduced dimensions (256)
# Storage: Reduced proportionally (~1 KB per vector)
# Quality: Still very good for most applications

# Note: API billing is per input token, so shortening does not change the
# API cost — the savings are in vector storage and similarity computation.

# Recommendation: Start with 1536; reduce if storage cost is a concern and
# retrieval quality remains acceptable

Integration with Vector Databases

Pinecone Integration

import pinecone

# Initialize Pinecone
pc = pinecone.Pinecone(api_key="your-key")
index = pc.Index("semantic-search")

# Embed and upsert
documents = ["doc1 content", "doc2 content"]
embeddings = embed_texts(documents)

vectors = [
    (f"id{i}", emb, {"text": doc})
    for i, (emb, doc) in enumerate(zip(embeddings, documents))
]

index.upsert(vectors=vectors)

# Query
query = "Find documents about machine learning"
query_embedding = embed_texts([query])[0]

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)
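The query response contains the matched vectors with scores and metadata. With the current Pinecone client, iterating over them looks like this (attribute names may differ in older client versions):

for match in results.matches:
    print(f"{match.score:.3f}  {match.metadata['text']}")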

Chroma Integration

import chromadb

# Use a separate name so the OpenAI client defined earlier is not shadowed
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="documents")

# Embed and add
documents = ["doc1", "doc2", "doc3"]
embeddings = embed_texts(documents)

collection.add(
    ids=[f"id{i}" for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents
)

# Query
query_embedding = embed_texts(["query"])[0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)
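Chroma returns the results as a dict of lists, with one inner list per query embedding; lower distances mean closer matches:

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.3f}  {doc}")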

Building a RAG System with OpenAI Embeddings

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Create embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store
vectorstore = Chroma.from_texts(
    texts=documents,
    embedding=embeddings,
    persist_directory="./chroma_data"
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
answer = qa.invoke({"query": "What is the main topic?"})
print(answer["result"])
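To inspect which chunks the answer was grounded in, RetrievalQA can also return the retrieved source documents:

qa_with_sources = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

result = qa_with_sources.invoke({"query": "What is the main topic?"})
for doc in result["source_documents"]:
    print(doc.page_content)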

Cost Optimization

# Monitor usage
def estimate_cost(num_tokens: int, model: str = "text-embedding-3-small") -> float:
    """Estimate API cost (billing is per input token)."""
    # text-embedding-3-small: $0.02 per 1M tokens
    # text-embedding-3-large: $0.13 per 1M tokens

    if model == "text-embedding-3-small":
        cost_per_1m = 0.02
    else:
        cost_per_1m = 0.13

    return (num_tokens / 1_000_000) * cost_per_1m

# Example: 1M tokens with the small model
cost = estimate_cost(1_000_000, "text-embedding-3-small")
print(f"Estimated cost: ${cost:.2f}")  # $0.02

Cost Reduction Strategies

  1. Use text-embedding-3-small: about 6.5x cheaper than 3-large ($0.02 vs $0.13 per 1M tokens)
  2. Reduce dimensions: Use the dimensions parameter to shrink stored vectors
  3. Cache embeddings: Don't re-embed identical texts (see the sketch below)
  4. Batch requests: Send multiple texts per request
  5. Lazy embedding: Only embed when necessary

A simple in-memory cache with functools.lru_cache (the embedding is returned as a tuple so the cached value is hashable):

# Caching implementation
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_embed(text: str) -> tuple:
    """Cache embeddings to avoid redundant API calls."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return tuple(response.data[0].embedding)

# Use cached version
embedding = cached_embed("query text")
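lru_cache only lives for the lifetime of the process. For a cache that survives restarts, here is a minimal sketch using a local JSON file (the file name is an arbitrary choice for illustration):

import json
import os

CACHE_PATH = "embedding_cache.json"  # illustrative file name

def persistent_embed(text: str) -> list:
    """Embed with a simple on-disk JSON cache keyed by the input text."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    if text not in cache:
        cache[text] = list(cached_embed(text))
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[text]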

Best Practices

  1. Handle Rate Limits: Implement exponential backoff (see the retry sketch below)
  2. Error Handling: Catch API errors gracefully
  3. Input Validation: Truncate texts exceeding 8191 tokens
  4. Monitor Costs: Track API usage regularly
  5. Batch Requests: Combine multiple texts per API call

A retry wrapper using the tenacity library (the stop condition and reraise=True keep a persistent failure from retrying forever and let the original exception propagate):

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=10), reraise=True)
def embed_with_retry(text: str) -> list:
    """Embed with automatic retry on failure."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
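For the error-handling point, the openai package raises typed exceptions that can be caught explicitly; a minimal sketch building on embed_with_retry above:

import openai
from typing import Optional

def safe_embed(text: str) -> Optional[list]:
    """Embed text, returning None if the API call ultimately fails."""
    try:
        return embed_with_retry(text)
    except openai.RateLimitError:
        print("Rate limit still exceeded after retries")
    except openai.APIError as e:
        print(f"OpenAI API error: {e}")
    return None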

Conclusion

OpenAI's Embeddings API simplifies building semantic search and RAG systems. With proper batching, caching, and optimization, you can build production-grade applications cost-effectively.

FAQ

Q: Should I use 3-small or 3-large? A: Start with 3-small; it's about 6.5x cheaper and sufficient for most applications. Upgrade to 3-large if accuracy matters more than cost.

Q: Can I reduce embeddings to save money? A: The dimensions parameter shrinks stored vectors, which cuts storage and compute costs (API billing is per input token either way). Start at 1536 and only reduce if quality remains acceptable for your use case.

Q: How often should I re-embed documents? A: Only re-embed when documents change or annually for model improvements. Caching eliminates redundant embeddings.

Written by Sanjeev Sharma

Full Stack Engineer · E-mopro