Pinecone in Production — Namespaces, Metadata Filtering, and Cost Optimization

Introduction

Pinecone's managed infrastructure eliminates DevOps overhead, but production deployments require strategic choices about namespaces, filtering, and cost management. This guide covers real-world patterns for scaling Pinecone in production.

Serverless vs Pod-Based Architecture

Choose based on your query volume and cost model:

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY,
});

// Serverless: pay per query + storage
// Ideal for: variable traffic, startups, <10K QPS
// Illustrative rates (check current Pinecone pricing):
// ~$0.40 per 100K queries + ~$0.10/GB-month storage

const serverlessIndex = pc.Index("my-serverless-index");

// Pod-based: fixed monthly cost + compute units
// Ideal for: predictable traffic >10K QPS, enterprises
// Illustrative rate: $0.50-$2.00 per pod per month + storage

// Cost calculation for a typical use case:
// 1M vectors, 1536 dims (~6 GB at 4 bytes/dim) + 100K queries/day
const monthlyQueries = 100_000 * 30; // 3M queries
const storageGB = 6;

const serverlessCost = (monthlyQueries / 100_000) * 0.40 + storageGB * 0.10;
// = 12 + 0.60 = $12.60/month

const podBasedCost = 2.0 * 1 + storageGB * 0.10; // 1 pod
// = 2.00 + 0.60 = $2.60/month

console.log(`Serverless: $${serverlessCost}/month`);
console.log(`Pod-based: $${podBasedCost}/month`);

// At these illustrative rates, serverless is cheaper below ~500K queries/month;
// pods win at higher sustained volume

Rule of thumb: serverless for <5K queries/day, pods for >100K queries/day; benchmark anything in between.
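The break-even point behind that rule can be computed directly. A minimal sketch using the illustrative rates from the comments above (real Pinecone pricing differs; the function name and rates here are assumptions):

```typescript
// Break-even estimate: storage cost is identical in both models above,
// so it cancels out and the comparison reduces to query cost vs pod cost.
// Rates are illustrative, matching the comments above.
function breakEvenQueriesPerMonth(
  podMonthlyCost: number, // e.g. $2.00 per pod-month
  serverlessRatePer100K: number, // e.g. $0.40 per 100K queries
): number {
  return (podMonthlyCost / serverlessRatePer100K) * 100_000;
}

console.log(breakEvenQueriesPerMonth(2.0, 0.4)); // 500000 queries/month
```

Below the break-even volume, pay-per-query wins; above it, the fixed pod cost amortizes better.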

Namespace Strategy: Per-Tenant and Per-Environment

Namespaces isolate data within an index, enabling multi-tenancy:

const pc = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY,
});

const index = pc.Index("shared-index");

// Namespace per tenant (SaaS multi-tenancy)
const tenantId = "org-12345";
const namespace = `tenant-${tenantId}`;

// Upsert to tenant namespace
await index.namespace(namespace).upsert([
  {
    id: "doc-1",
    values: [0.1, 0.2, 0.3, 0.4],
    metadata: {
      title: "AI Infrastructure",
      author: "alice",
      domain: "ai",
    },
  },
  {
    id: "doc-2",
    values: [0.2, 0.3, 0.4, 0.5],
    metadata: {
      title: "Vector Databases",
      author: "bob",
      domain: "db",
    },
  },
]);

// Query only tenant's data
const results = await index.namespace(namespace).query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  includeMetadata: true,
});

console.log(`Found ${results.matches.length} results for tenant ${tenantId}`);

// Namespace per environment (dev, staging, prod)
async function queryEnvironment(
  env: "dev" | "staging" | "prod",
  query: number[],
) {
  return index.namespace(env).query({
    vector: query,
    topK: 10,
  });
}

const prodResults = await queryEnvironment("prod", [0.1, 0.2, 0.3, 0.4]);

// Hybrid strategy: tenant + environment
const multiTenantNamespace = `${tenantId}-prod`;
await index.namespace(multiTenantNamespace).upsert([
  {
    id: "prod-doc-1",
    values: [0.1, 0.2, 0.3, 0.4],
    metadata: { env: "prod" },
  },
]);

Per-tenant namespaces: Let many tenants share a single index while staying logically isolated, reducing index count and cost.

Per-environment namespaces: Run prod, staging, and dev on a single index with logical isolation between environments.
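Namespaces also simplify offboarding: deleting a tenant's data is one call against their namespace. A sketch of a naming convention that encodes tenant and environment (the helper and its validation rule are assumptions, not a Pinecone API):

```typescript
// Build a predictable namespace name from tenant + environment.
// The `tenant-` prefix and the id format are conventions, not requirements.
function tenantNamespace(
  tenantId: string,
  env: "dev" | "staging" | "prod",
): string {
  if (!/^[a-z0-9-]+$/.test(tenantId)) {
    throw new Error(`Invalid tenant id: ${tenantId}`);
  }
  return `tenant-${tenantId}-${env}`;
}

// Offboarding then becomes a single namespace-scoped delete:
// await index.namespace(tenantNamespace("org-12345", "prod")).deleteAll();
```

Validating the tenant id before string interpolation prevents one tenant's id from colliding with (or escaping into) another's namespace.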

Metadata Filtering Best Practices

Metadata filtering adds precision to vector search:

const index = pc.Index("documents");

// Insert with rich metadata
await index.upsert([
  {
    id: "1",
    values: [0.1, 0.2, 0.3, 0.4],
    metadata: {
      title: "Advanced RAG",
      category: "ai",
      author_id: "user-123",
      date_published: "2026-03-15",
      confidence: 0.95,
      is_public: true,
      tags: ["retrieval", "llm", "production"],
    },
  },
]);

// Query with metadata filter: exact match (string)
const exactMatch = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  filter: {
    category: { $eq: "ai" },
  },
  includeMetadata: true,
});

// Query with numeric range
const rangeFilter = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  filter: {
    confidence: { $gte: 0.9 },
  },
});

// Complex filter: AND logic
const complexFilter = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  filter: {
    $and: [
      { category: { $eq: "ai" } },
      { confidence: { $gte: 0.9 } },
      { is_public: { $eq: true } },
    ],
  },
});

// Filter with $in operator
const multiMatch = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  filter: {
    author_id: { $in: ["user-123", "user-456", "user-789"] },
  },
});

// NOT filter
const exclude = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  filter: {
    category: { $ne: "spam" },
  },
});

Filtering best practices:

  • Index only metadata you filter on frequently
  • Use $in for many values instead of multiple conditions
  • Combine vector similarity with filters for precise results
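In application code it helps to build filters from optional parameters instead of hand-assembling $and clauses at each call site. A sketch (the helper and its option names are assumptions):

```typescript
// Compose a Pinecone metadata filter from optional search parameters.
// Returns undefined when no filter applies (pure vector search).
type MetadataFilter = Record<string, unknown>;

function buildFilter(opts: {
  category?: string;
  minConfidence?: number;
  authorIds?: string[];
  publicOnly?: boolean;
}): MetadataFilter | undefined {
  const clauses: MetadataFilter[] = [];
  if (opts.category) clauses.push({ category: { $eq: opts.category } });
  if (opts.minConfidence !== undefined) {
    clauses.push({ confidence: { $gte: opts.minConfidence } });
  }
  if (opts.authorIds?.length) {
    clauses.push({ author_id: { $in: opts.authorIds } });
  }
  if (opts.publicOnly) clauses.push({ is_public: { $eq: true } });
  if (clauses.length === 0) return undefined;
  return clauses.length === 1 ? clauses[0] : { $and: clauses };
}
```

Pass the result straight to `query({ ..., filter })`; a single clause skips the $and wrapper, which keeps filters readable in logs.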

Upsert Batching: ~100 Vectors per Batch

Pinecone caps upsert request size, and its guidance is to upsert in batches of roughly 100 vectors. Implement smart batching:

import { Pinecone, Vector } from "@pinecone-database/pinecone";

const pc = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY,
});

const index = pc.Index("documents");

async function batchUpsert(
  vectors: Vector[],
  batchSize: number = 100,
): Promise<void> {
  for (let i = 0; i < vectors.length; i += batchSize) {
    const batch = vectors.slice(i, i + batchSize);
    try {
      await index.upsert(batch);
      console.log(`Upserted ${batch.length} vectors (${i + batch.length}/${vectors.length})`);
    } catch (error) {
      console.error(`Batch failed at index ${i}:`, error);
      // Retry or handle error
    }
  }
}

// Generate documents for bulk ingestion
async function ingestDocuments(documents: { id: string; embedding: number[]; text: string }[]) {
  const vectors: Vector[] = documents.map((doc) => ({
    id: doc.id,
    values: doc.embedding,
    metadata: {
      text: doc.text,
      ingested_at: new Date().toISOString(),
    },
  }));

  await batchUpsert(vectors);
}

// Usage
const documents = Array.from({ length: 10000 }, (_, i) => ({
  id: `doc-${i}`,
  embedding: Array(1536).fill(0.1 + (i % 256) / 256.0),
  text: `Document ${i}`,
}));

await ingestDocuments(documents);

Batch upserts are dramatically faster than upserting one vector at a time. Keep batches at or under the recommended 100 vectors.
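The `// Retry or handle error` branch above can be filled in with a generic retry wrapper. The attempt count and backoff schedule below are assumptions to tune for your workload:

```typescript
// Retry an async operation with exponential backoff.
// Defaults: 3 attempts, delays of 200ms then 400ms between them.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxAttempts) throw error; // exhausted: surface the error
      // Exponential backoff: baseDelayMs * 2^(attempt - 1)
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)),
      );
    }
  }
}

// In batchUpsert: await withRetries(() => index.upsert(batch));
```

Transient network errors and rate limits usually clear within a retry or two; anything that survives all attempts is rethrown for the caller to handle.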

Hybrid Search: Dense + Sparse

Combine vector embeddings (dense) with keyword search (sparse):

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY,
});

const index = pc.Index("hybrid-index");

// Insert with sparse vectors (keyword matches)
// Sparse vectors have few non-zero dimensions

const documents = [
  {
    id: "doc-1",
    values: [0.1, 0.2, 0.3, 0.4], // dense embedding
    sparseValues: {
      indices: [0, 42, 128, 512], // keyword indices
      values: [1, 1, 0.8, 0.9], // keyword importance
    },
    metadata: {
      keywords: ["vector", "database", "retrieval"],
      title: "Vector Databases 101",
    },
  },
];

await index.upsert(documents);

// Dense query (semantic search)
const denseResults = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 10,
  includeMetadata: true,
});

// Sparse query (keyword search)
// Requires preprocessing text to sparse indices
function textToSparseVector(text: string) {
  const words = text.toLowerCase().split(/\s+/);
  // Simplified: in production, use vocabulary hash
  const indices = words.map((w) => Math.abs(w.charCodeAt(0) + w.length) % 512);
  return { indices, values: Array(indices.length).fill(1.0) };
}

const sparseQuery = textToSparseVector("vector retrieval");

// Hybrid query: pass both dense and sparse vectors
// (sparse support requires a dotproduct index)
const hybridResults = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  sparseVector: sparseQuery,
  topK: 10,
  includeMetadata: true,
});

// Hybrid results blend both signals:
// score = alpha * denseScore + (1 - alpha) * sparseScore

Hybrid search excels for:

  • Exact phrase matching + semantic relevance
  • Technical documents where keywords matter
  • Reducing hallucinations from pure semantic search
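Because the index computes a single dot product, the alpha blend is commonly realized client-side by scaling the two query vectors before sending them. A sketch of that convention (the helper name is an assumption):

```typescript
// Scale dense by alpha and sparse by (1 - alpha) so the index's
// dot product yields: alpha * denseScore + (1 - alpha) * sparseScore.
function hybridScale(
  dense: number[],
  sparse: { indices: number[]; values: number[] },
  alpha: number, // 1.0 = pure semantic, 0.0 = pure keyword
) {
  if (alpha < 0 || alpha > 1) throw new Error("alpha must be in [0, 1]");
  return {
    vector: dense.map((v) => v * alpha),
    sparseVector: {
      indices: sparse.indices,
      values: sparse.values.map((v) => v * (1 - alpha)),
    },
  };
}

// const { vector, sparseVector } = hybridScale(denseQuery, sparseQuery, 0.8);
// await index.query({ vector, sparseVector, topK: 10 });
```

Start with alpha around 0.7-0.8 for semantic-leaning workloads and lower it when exact keyword matches matter more.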

Index Freshness and Async Upsert

Pinecone is eventually consistent: upserts become visible to queries after a short delay, which you need to account for:

const index = pc.Index("documents");

// Async upsert: faster, might return stale data briefly
await index.upsert([
  {
    id: "doc-1",
    values: [0.1, 0.2, 0.3, 0.4],
    metadata: { version: 2 },
  },
]);

// Immediately query might return old version (briefly)
const staleResults = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 1,
  includeMetadata: true,
});

// For consistency-critical operations, add delay
await new Promise((resolve) => setTimeout(resolve, 500));

const freshResults = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 1,
  includeMetadata: true,
});

// For real-time requirements, use shorter freshness windows
// Typical indexing latency: <100ms

Eventual consistency: Accept <1 second staleness for faster ingest.
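Instead of a fixed sleep, consistency-sensitive paths can poll index stats until the expected record count appears. The polling loop and timeout values below are assumptions; the stats call itself is the SDK's `describeIndexStats` (the count field name varies by SDK version: `totalRecordCount` in newer releases, `totalVectorCount` in older ones):

```typescript
// Poll a count-reporting function until it reaches the expected value
// or the timeout expires. Returns true if the count was reached in time.
async function waitForVectorCount(
  getCount: () => Promise<number>,
  expected: number,
  timeoutMs = 5000,
  intervalMs = 250,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if ((await getCount()) >= expected) return true;
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage against Pinecone (count field depends on SDK version):
// const ok = await waitForVectorCount(
//   async () => (await index.describeIndexStats()).totalRecordCount ?? 0,
//   expectedCount,
// );
```

Note that stats counts only confirm arrival, not ranking freshness; treat this as a best-effort guard, not a strict consistency barrier.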

Query With Score Threshold

Confidence filtering prevents low-quality results:

const index = pc.Index("documents");

const results = await index.query({
  vector: [0.1, 0.2, 0.3, 0.4],
  topK: 100, // Fetch many, filter by confidence
  includeMetadata: true,
});

const threshold = 0.7;
const filteredResults = results.matches.filter(
  (match) => match.score >= threshold,
);

console.log(
  `Retrieved ${results.matches.length} results, ${filteredResults.length} above threshold`,
);

// Keep only the top high-confidence results
const topResults = filteredResults.slice(0, 10);

Always apply a score threshold to filter out weak matches. For cosine similarity, 0.7 is a common starting point; tune it for your metric and dataset.

Cost Calculation: Storage + Reads + Writes

Understand your bill:

interface PineconeUsage {
  storageGB: number;
  queriesPerMonth: number;
  vectorsUpsertedPerMonth: number;
  podCount?: number; // For pod-based
}

function calculateServerlessCost(usage: PineconeUsage): number {
  // Query cost: $0.40 per 100K queries
  const queryCost = (usage.queriesPerMonth / 100_000) * 0.40;

  // Storage cost: $0.10 per GB-month
  const storageCost = usage.storageGB * 0.10;

  // Upsert cost: $0.10 per 1M vectors
  const upsertCost = (usage.vectorsUpsertedPerMonth / 1_000_000) * 0.10;

  return queryCost + storageCost + upsertCost;
}

function calculatePodCost(usage: PineconeUsage): number {
  // Pod cost: $2.00 per pod-month
  const podCost = (usage.podCount || 1) * 2.0;

  // Storage cost: $0.10 per GB-month
  const storageCost = usage.storageGB * 0.10;

  // No per-query cost with pods
  return podCost + storageCost;
}

// Example: 500K vectors at 1536 dims (~3 GB), 1M queries/month, 100M upserts/month
const usage: PineconeUsage = {
  storageGB: 3,
  queriesPerMonth: 1_000_000,
  vectorsUpsertedPerMonth: 100_000_000,
  podCount: 2,
};

const serverlessCost = calculateServerlessCost(usage);
const podCost = calculatePodCost(usage);

console.log(`Serverless: $${serverlessCost.toFixed(2)}/month`);
console.log(`Pod-based (2 pods): $${podCost.toFixed(2)}/month`);

// Typical breakdown:
// - Storage dominates for read-heavy workloads
// - Queries matter for high-throughput systems
// - Upserts negligible unless >1B vectors/month

Optimize by understanding your bottleneck: queries, storage, or upserts.
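Using the same illustrative rates as the cost functions above, a small helper can name the dominant line item. This is a sketch, not a Pinecone API:

```typescript
// Report which serverless line item dominates the bill, using the
// illustrative rates from calculateServerlessCost above.
function dominantCost(usage: {
  storageGB: number;
  queriesPerMonth: number;
  vectorsUpsertedPerMonth: number;
}): "queries" | "storage" | "upserts" {
  const query = (usage.queriesPerMonth / 100_000) * 0.4;
  const storage = usage.storageGB * 0.1;
  const upsert = (usage.vectorsUpsertedPerMonth / 1_000_000) * 0.1;
  if (query >= storage && query >= upsert) return "queries";
  return storage >= upsert ? "storage" : "upserts";
}
```

Run it against the example usage above and it flags upserts, a reminder that "upserts are negligible" only holds while ingest volume stays low.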

Pinecone Canopy for RAG

Canopy is Pinecone's open-source RAG framework (Python). The same retrieve-then-generate pattern, hand-rolled in TypeScript with the Pinecone and OpenAI SDKs:

import { Pinecone } from "@pinecone-database/pinecone";
import { OpenAI } from "openai";

const pc = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY,
});

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Embed the question, retrieve context, then generate an answer
async function ragQuery(userQuestion: string): Promise<string> {
  const index = pc.Index("rag-docs");

  // Retrieve relevant documents
  const queryEmbedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: userQuestion,
  });

  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 3,
    includeMetadata: true,
  });

  // Build context
  const context = results.matches
    .map((match) => match.metadata?.text || "")
    .join("\n");

  // Generate response with context
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "system",
        content: `You are a helpful assistant. Answer based on the provided context.

Context:
${context}`,
      },
      {
        role: "user",
        content: userQuestion,
      },
    ],
  });

  return completion.choices[0].message.content || "";
}

// Usage
const answer = await ragQuery("What is semantic search?");
console.log(answer);

Canopy packages this retrieve-then-generate flow in Python; hand-rolling it as above gives you the same control in TypeScript.
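One practical detail the pipeline above glosses over: retrieved matches can overflow the model's context window. A character-budget cap is a cheap guard (the helper and budget value are assumptions; a real token counter is more precise):

```typescript
// Join retrieved chunks into a prompt context, stopping before the
// character budget is exceeded. Chunks are assumed to be pre-sorted
// by relevance, so the most relevant ones are kept.
function buildContext(chunks: string[], maxChars = 6000): string {
  const parts: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.length > maxChars) break;
    parts.push(chunk);
    used += chunk.length + 1; // +1 for the joining newline
  }
  return parts.join("\n");
}

// In ragQuery, replace the plain join with:
// const context = buildContext(
//   results.matches.map((m) => String(m.metadata?.text ?? "")),
// );
```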

Checklist

  • Decide serverless vs pod-based based on query volume
  • Design namespace strategy (per-tenant vs per-environment)
  • Index metadata fields for filtering
  • Implement batch upsert (<= 100 vectors per batch)
  • Set up monitoring for query latency and costs
  • Test hybrid search on production queries
  • Calculate 12-month cost and set alerts
  • Implement score threshold filtering
  • Document namespace and metadata schema
  • Plan backup and recovery procedures

Conclusion

Pinecone's managed architecture removes operational burden. Focus on smart namespace design for multi-tenancy, strategic metadata filtering, and cost optimization. Understand serverless vs pods trade-offs. Master batching, filtering, and hybrid search. Deploy with confidence at any scale.