RAG Metadata Filtering — Using Structured Data to Sharpen Retrieval

Introduction

Vector similarity alone is blunt. A query about "engineering best practices from 2024" will retrieve 2019 documents if they're semantically close. Metadata filtering sharpens RAG by combining semantic similarity with structured constraints. When done right, filtering can reduce retrieval noise by 50% or more while respecting tenant isolation and regulatory requirements.

Metadata Schema Design

Before filtering, design your schema. What metadata will each document carry?

interface DocumentMetadata {
  documentId: string;
  date: string; // ISO 8601
  category: string;
  author: string;
  tenantId: string; // Multi-tenancy
  dataClassification: 'public' | 'internal' | 'restricted';
  version: number;
  source: 'internal' | 'web' | 'customer-feedback';
  tags: string[];
  language: string;
}

interface DocumentWithMetadata {
  id: string;
  content: string;
  embedding: number[]; // 1536-dim (OpenAI)
  metadata: DocumentMetadata;
}

Metadata must balance specificity with query flexibility. Too many fields bloat every document; too few lose filtering power. Start with date, category, author, and tenantId. Add others as use cases demand.
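
A schema only helps if every ingested document actually conforms to it. The following is a minimal sketch of an ingestion-time check for the starter fields, assuming dates are stored as plain YYYY-MM-DD strings; `validateMetadata` and `isIsoDate` are hypothetical helper names, not part of any library.

```typescript
// Hypothetical ingestion-time validator for the starter fields
// (date, category, author, tenantId) plus dataClassification.
const CLASSIFICATIONS: readonly string[] = ['public', 'internal', 'restricted'];
const ISO_DATE = /^(\d{4})-(\d{2})-(\d{2})$/;

// Shape check only: accepts YYYY-MM-DD with month 1-12 and day 1-31.
// It does not reject impossible dates such as February 30.
function isIsoDate(value: string): boolean {
  const m = ISO_DATE.exec(value);
  if (!m) return false;
  const month = Number(m[2]);
  const day = Number(m[3]);
  return month >= 1 && month <= 12 && day >= 1 && day <= 31;
}

// Returns a list of problems; an empty list means the record is usable.
function validateMetadata(raw: Record<string, unknown>): string[] {
  const errors: string[] = [];

  for (const field of ['date', 'category', 'author', 'tenantId']) {
    const value = raw[field];
    if (typeof value !== 'string' || value.trim() === '') {
      errors.push(`${field} must be a non-empty string`);
    }
  }

  const date = raw['date'];
  if (typeof date === 'string' && date.trim() !== '' && !isIsoDate(date)) {
    errors.push('date must be formatted as YYYY-MM-DD');
  }

  const classification = raw['dataClassification'];
  if (typeof classification !== 'string' || !CLASSIFICATIONS.includes(classification)) {
    errors.push('dataClassification must be public, internal, or restricted');
  }

  return errors;
}
```

Rejecting or quarantining documents that fail this check at ingestion keeps later filters predictable, since a filter on a missing or malformed field silently drops the document from results.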

Self-Querying Retrieval

Self-querying uses the LLM to convert natural language into structured filters. A user asks "Show me docs from marketing in 2024" and the system extracts { category: 'marketing', dateMin: '2024-01-01', dateMax: '2024-12-31' }.

interface FilterExpression {
  operator: 'and' | 'or';
  conditions: FilterCondition[];
}

interface FilterCondition {
  field: string; // e.g., 'metadata.date', 'metadata.category'
  operator: '<' | '>' | '=' | '!=' | 'in' | 'between';
  value: string | number | string[];
}

async function selfQueryRetrieval(
  userQuery: string,
  metadataSchema: DocumentMetadata
): Promise<Document[]> {
  const schemaJson = JSON.stringify(metadataSchema, null, 2);

  const extractedFilter = await llm.generate({
    prompt: `Given this metadata schema:
${schemaJson}

User query: "${userQuery}"

Extract filter conditions as JSON.
Return: { operator, conditions: [{ field, operator, value }] }`,
  });

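  // LLM output is untrusted text: parsing can fail, and the parsed result
  // should still be validated against allowed fields and operators.
  let filter: FilterExpression;
  try {
    filter = JSON.parse(extractedFilter);
  } catch {
    throw new Error('Could not parse a filter from the LLM response');
  }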

  return await vectorDb.searchWithFilter(userQuery, filter);
}

Self-querying works well for users accustomed to natural language. It reduces the need for custom domain-specific query languages.
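
Because the filter comes from model output, it is worth validating before it reaches the database. The sketch below parses the raw response and accepts only allow-listed fields and operators. The allow-list contents and the name `parseLlmFilter` are assumptions for illustration; note that `metadata.tenantId` is deliberately left off the list so the model can never choose the tenant.

```typescript
type Operator = '<' | '>' | '=' | '!=' | 'in' | 'between';

interface FilterCondition {
  field: string;
  operator: Operator;
  value: string | number | string[];
}

interface FilterExpression {
  operator: 'and' | 'or';
  conditions: FilterCondition[];
}

// Assumed allow-lists; adjust to your own schema.
const ALLOWED_FIELDS = new Set<string>([
  'metadata.date',
  'metadata.category',
  'metadata.author',
  'metadata.tags',
  'metadata.language',
]);
const ALLOWED_OPERATORS = new Set<string>(['<', '>', '=', '!=', 'in', 'between']);

// Returns a validated filter, or null if the LLM output cannot be trusted.
function parseLlmFilter(raw: string): FilterExpression | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const obj = parsed as { operator?: unknown; conditions?: unknown };
  if (obj.operator !== 'and' && obj.operator !== 'or') return null;
  const operator: 'and' | 'or' = obj.operator === 'and' ? 'and' : 'or';
  if (!Array.isArray(obj.conditions)) return null;

  const conditions: FilterCondition[] = [];
  for (const item of obj.conditions as unknown[]) {
    if (typeof item !== 'object' || item === null) return null;
    const { field, operator: op, value } = item as Record<string, unknown>;
    if (typeof field !== 'string' || !ALLOWED_FIELDS.has(field)) return null;
    if (typeof op !== 'string' || !ALLOWED_OPERATORS.has(op)) return null;
    const validValue =
      typeof value === 'string' ||
      typeof value === 'number' ||
      (Array.isArray(value) && value.every(v => typeof v === 'string'));
    if (!validValue) return null;
    conditions.push({
      field,
      operator: op as Operator,
      value: value as string | number | string[],
    });
  }
  return { operator, conditions };
}
```

When validation fails, one reasonable choice is to fall back to a search with only the mandatory server-side filters rather than retrying the model indefinitely.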

Filter Push-Down Architecture

Vector databases vary in filter performance. Some (Pinecone, Weaviate) apply filters inside the index search itself, so only matching vectors are ranked. Others (naive implementations) fetch the top-k first and filter afterward, which wastes work and can return too few results.

For optimal performance, design your pipeline to push filters down:

interface SearchRequest {
  query: string;
  embedding: number[];
  filter: FilterExpression;
  topK: number;
}

async function efficientVectorSearch(
  request: SearchRequest
): Promise<Document[]> {
  // Good: the vector DB applies the filter while finding nearest neighbors
  // Returns up to topK results, all of which match the filter
  return await vectorDb.search(request);
}

async function inefficientSearch(
  query: string,
  filter: FilterExpression
): Promise<Document[]> {
  // Bad: fetch all top-K, then filter in application
  // May return far fewer results than requested
  const allResults = await vectorDb.search(query, { topK: 1000 });
  return allResults.filter(doc => matchesFilter(doc, filter));
}

For large-scale systems, use vector databases with native filter support (Pinecone, Milvus, Weaviate). With selective filters they typically outperform post-hoc filtering by a wide margin, because post-filtering has to over-fetch candidates just to fill topK.
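
Pushing a filter down usually means translating the application's filter structure into the database's own syntax. The sketch below converts a FilterExpression into the MongoDB-style operator form ($eq, $ne, $lt, $gt, $in, $gte, $lte, $and, $or) that several vector stores accept, Pinecone among them. Stripping the `metadata.` prefix is an assumption about how keys are stored, and some databases accept range operators only on numbers, so dates may need to be stored as epoch timestamps; check your database's documentation.

```typescript
type Operator = '<' | '>' | '=' | '!=' | 'in' | 'between';

interface FilterCondition {
  field: string;
  operator: Operator;
  value: string | number | string[];
}

interface FilterExpression {
  operator: 'and' | 'or';
  conditions: FilterCondition[];
}

type NativeFilter = Record<string, unknown>;

// Translate one condition into MongoDB-style operator syntax.
function toNativeCondition(c: FilterCondition): NativeFilter {
  // Assumption: the store keeps metadata keys at the top level.
  const key = c.field.replace(/^metadata\./, '');
  switch (c.operator) {
    case '=':
      return { [key]: { $eq: c.value } };
    case '!=':
      return { [key]: { $ne: c.value } };
    case '<':
      return { [key]: { $lt: c.value } };
    case '>':
      return { [key]: { $gt: c.value } };
    case 'in':
      return { [key]: { $in: Array.isArray(c.value) ? c.value : [c.value] } };
    case 'between': {
      if (!Array.isArray(c.value) || c.value.length !== 2) {
        throw new Error(`'between' on ${c.field} needs a [min, max] pair`);
      }
      const [min, max] = c.value;
      return { [key]: { $gte: min, $lte: max } }; // inclusive on both ends
    }
  }
}

// Translate the whole expression into a single $and / $or clause.
function toNativeFilter(expr: FilterExpression): NativeFilter {
  const combinator = expr.operator === 'and' ? '$and' : '$or';
  return { [combinator]: expr.conditions.map(toNativeCondition) };
}
```

Keeping this translation in one place means the rest of the pipeline can stay database-agnostic, and switching stores only requires a new translator.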

Combining Vector Similarity and Metadata Filters

The best retrieval combines two signals: semantic relevance and metadata matching.

interface CombinedSearchRequest {
  query: string;
  filter: FilterExpression;
  vectorWeight: number; // 0-1
  filterWeight: number; // 0-1
}

async function combinedSearch(request: CombinedSearchRequest): Promise<Document[]> {
  // Retrieve with filter to reduce candidate set
  const candidates = await vectorDb.searchWithFilter(
    request.query,
    request.filter,
    { topK: 100 } // Larger candidate set for re-ranking
  );

  // Re-rank by combined score
  const scored = candidates.map(doc => ({
    doc,
    score:
      request.vectorWeight * doc.similarityScore +
      request.filterWeight * computeMetadataScore(doc.metadata),
  }));

  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 10)
    .map(s => s.doc);
}

function computeMetadataScore(metadata: DocumentMetadata): number {
  let score = 1.0;

  // Boost recent documents
  const daysSincePublish = daysSince(metadata.date);
  score *= Math.max(0, 1 - daysSincePublish / 365);

  // Boost verified sources
  if (metadata.source === 'internal') score *= 1.2;

  // Clamp so the result stays in the 0-1 range the weights assume
  return Math.min(score, 1.0);
}

This two-stage ranking—first filter, then score—balances accuracy and performance.
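
The `daysSince` helper is called in `computeMetadataScore` but never defined. The sketch below shows one way to write it, together with the linear recency decay and a weighted blend. `recencyScore` and `blendScores` are hypothetical names, and normalizing by the sum of the weights is a design choice added here so the blended score stays comparable when the weights do not sum to 1.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole or fractional days between an ISO 8601 date and `now` (never negative).
function daysSince(isoDate: string, now: Date = new Date()): number {
  const then = new Date(isoDate).getTime();
  if (Number.isNaN(then)) {
    throw new Error(`Invalid date: ${isoDate}`);
  }
  return Math.max(0, (now.getTime() - then) / MS_PER_DAY);
}

// Linear decay: 1.0 for a document published today, 0.0 at one year or older.
function recencyScore(isoDate: string, now: Date = new Date()): number {
  return Math.max(0, 1 - daysSince(isoDate, now) / 365);
}

// Weighted average of similarity and metadata score, both expected in 0-1.
function blendScores(
  similarity: number,
  metadataScore: number,
  vectorWeight: number,
  filterWeight: number
): number {
  const total = vectorWeight + filterWeight;
  if (total <= 0) {
    throw new Error('At least one weight must be positive');
  }
  return (vectorWeight * similarity + filterWeight * metadataScore) / total;
}
```

Note that linear decay sends every document older than a year to a metadata score of zero. If older documents should still be distinguishable from each other, a slower decay curve may fit better.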

Tenant Isolation via Metadata

In multi-tenant SaaS systems, metadata filtering is your security boundary.

interface TenantContext {
  tenantId: string;
  userId: string;
  roles: string[];
}

async function tenantSafeSearch(
  query: string,
  tenant: TenantContext
): Promise<Document[]> {
  // Always add tenantId to filter—non-negotiable
  const filter: FilterExpression = {
    operator: 'and',
    conditions: [
      {
        field: 'metadata.tenantId',
        operator: '=',
        value: tenant.tenantId,
      },
      // Optional: filter by dataClassification based on user role
      ...(tenant.roles.includes('admin')
        ? []
        : [
            {
              field: 'metadata.dataClassification',
              operator: '!=',
              value: 'restricted',
            },
          ]),
    ],
  };

  return await vectorDb.searchWithFilter(query, filter);
}

Never rely on callers or post-retrieval checks for tenant isolation. Build the tenant condition server-side from the authenticated context and embed it in every query.
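
When a user-supplied or LLM-extracted filter also has to be applied, the tenant condition must wrap it rather than sit beside it, otherwise a top-level 'or' could widen the search beyond the tenant. The sketch below assumes a nested filter type, `FilterGroup`, which is a hypothetical extension of the flat FilterExpression; `withTenantScope` is likewise an illustrative name.

```typescript
interface FilterCondition {
  field: string;
  operator: '<' | '>' | '=' | '!=' | 'in' | 'between';
  value: string | number | string[];
}

// Hypothetical nested variant of FilterExpression: groups may contain groups.
interface FilterGroup {
  operator: 'and' | 'or';
  conditions: Array<FilterCondition | FilterGroup>;
}

// Wraps any user filter in an outer AND with the tenant condition.
// The user filter becomes a single nested child, so its own 'or'
// can never bypass the tenant restriction.
function withTenantScope(tenantId: string, userFilter?: FilterGroup): FilterGroup {
  if (tenantId.trim() === '') {
    throw new Error('tenantId is required for every search');
  }
  const tenantCondition: FilterCondition = {
    field: 'metadata.tenantId',
    operator: '=',
    value: tenantId,
  };
  return {
    operator: 'and',
    conditions: userFilter ? [tenantCondition, userFilter] : [tenantCondition],
  };
}
```

The tenantId passed in should come from the authenticated session, never from the request body or from model output.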

Date-Range Filtering for Temporal Queries

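Many queries are temporal, for example "What was the quarterly revenue in Q3 2024?" Assuming calendar quarters, that question maps to a date range of 2024-07-01 through 2024-09-30.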

interface DateRangeFilter {
  field: 'metadata.date';
  operator: 'between';
  value: [string, string]; // [minDate, maxDate]
}

async function dateRangeSearch(
  query: string,
  startDate: string,
  endDate: string
): Promise<Document[]> {
  const filter: FilterExpression = {
    operator: 'and',
    conditions: [
      {
        field: 'metadata.date',
        operator: 'between',
        value: [startDate, endDate],
      },
    ],
  };

  return await vectorDb.searchWithFilter(query, filter);
}

For time-series queries ("revenue trends over the last 2 years"), retrieve documents progressively, chunking by quarter or month.
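
Chunking by quarter can be sketched as a helper that splits an overall date span into calendar-quarter ranges, each of which can then be passed to a date-range search. `quarterRanges` and `DateRange` are hypothetical names; the sketch assumes YYYY-MM-DD inputs, UTC dates, and inclusive range ends.

```typescript
interface DateRange {
  start: string; // inclusive, YYYY-MM-DD
  end: string;   // inclusive, YYYY-MM-DD
}

function toIsoDay(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Splits [startDate, endDate] into calendar-quarter chunks (UTC).
// The first and last chunks are trimmed to the requested span.
function quarterRanges(startDate: string, endDate: string): DateRange[] {
  const start = new Date(`${startDate}T00:00:00Z`).getTime();
  const end = new Date(`${endDate}T00:00:00Z`).getTime();
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error('Dates must be formatted as YYYY-MM-DD');
  }
  if (start > end) {
    throw new Error('startDate must not be after endDate');
  }

  const ranges: DateRange[] = [];
  const first = new Date(start);
  let year = first.getUTCFullYear();
  let quarter = Math.floor(first.getUTCMonth() / 3); // 0 = Q1 ... 3 = Q4

  while (true) {
    const quarterStart = Date.UTC(year, quarter * 3, 1);
    if (quarterStart > end) break;
    // Day 0 of the following month is the last day of this quarter.
    const quarterEnd = Date.UTC(year, quarter * 3 + 3, 0);
    ranges.push({
      start: toIsoDay(new Date(Math.max(quarterStart, start))),
      end: toIsoDay(new Date(Math.min(quarterEnd, end))),
    });
    quarter += 1;
    if (quarter === 4) {
      quarter = 0;
      year += 1;
    }
  }
  return ranges;
}
```

Each returned range can be searched separately, for example with `dateRangeSearch(query, range.start, range.end)`, so that every period contributes its own top results instead of the most recent quarter crowding out the rest.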

Multi-Value Filter Handling

Tags and categories often have multiple values. "Find docs in [engineering, sales] categories from [2023, 2024]."

async function multiValueFilters(
  query: string,
  categories: string[],
  years: number[]
): Promise<Document[]> {
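  // Note: this spans the earliest to the latest year, so it is only exact when
  // the requested years are contiguous ([2022, 2024] would also match 2023).
  const yearRange = [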
    `${Math.min(...years)}-01-01`,
    `${Math.max(...years)}-12-31`,
  ];

  const filter: FilterExpression = {
    operator: 'and',
    conditions: [
      {
        field: 'metadata.category',
        operator: 'in',
        value: categories,
      },
      {
        field: 'metadata.date',
        operator: 'between',
        value: yearRange,
      },
    ],
  };

  return await vectorDb.searchWithFilter(query, filter);
}

The in operator for multi-value matching is essential for usability. Without it, users can't easily combine options.
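
A single min-to-max date range also matches any years in between that the user did not ask for. One way to handle non-contiguous years, sketched below, is to group the requested years into contiguous runs and emit one date range per run. `yearsToDateRanges` is a hypothetical helper; combining the resulting ranges requires an 'or' of 'between' conditions, which in turn assumes your filter structure supports nesting.

```typescript
// Groups years into contiguous runs and returns one inclusive
// [start, end] date pair per run. Duplicates and ordering are ignored.
function yearsToDateRanges(years: number[]): Array<[string, string]> {
  const sorted = Array.from(new Set(years)).sort((a, b) => a - b);
  const ranges: Array<[string, string]> = [];

  let runStart: number | undefined;
  let previous: number | undefined;

  for (const year of sorted) {
    if (runStart === undefined || previous === undefined) {
      runStart = year; // first year opens the first run
    } else if (year !== previous + 1) {
      // Gap found: close the current run and open a new one.
      ranges.push([`${runStart}-01-01`, `${previous}-12-31`]);
      runStart = year;
    }
    previous = year;
  }

  if (runStart !== undefined && previous !== undefined) {
    ranges.push([`${runStart}-01-01`, `${previous}-12-31`]);
  }
  return ranges;
}
```

For [2021, 2023, 2024] this yields two ranges, one covering 2021 and one covering 2023 through 2024, so documents from 2022 are no longer matched.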

Filter Cardinality and Index Strategy

High-cardinality fields (unique IDs, timestamps) are expensive to index and filter, since every distinct value needs its own index entry. Low-cardinality fields (categories, statuses) are cheap.

// Good: filter by low-cardinality field first
// 10-100 unique values
{
  field: 'metadata.category',
  operator: 'in',
  value: ['engineering', 'product'],
}

// Expensive: filter by high-cardinality field (potentially millions of values)
{
  field: 'metadata.author',
  operator: '=',
  value: 'user-12345',
}

For large datasets:

  • Index low-cardinality filters at the vector DB level
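  • For high-cardinality fields outside the security boundary, consider over-fetching and filtering in the application; always keep tenantId in the pushed-down filter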
  • Avoid OR across high-cardinality fields
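
Before choosing an index strategy, it helps to measure how many distinct values each field actually has. The sketch below counts distinct values per metadata field over a sample of documents. `fieldCardinality` and `MetadataRecord` are hypothetical names, and where to draw the line between low and high cardinality is a judgment call that depends on your database.

```typescript
type MetadataRecord = Record<string, string | number | string[]>;

// Counts distinct values per field across a sample of metadata records.
// Array fields (such as tags) contribute each element separately.
function fieldCardinality(docs: MetadataRecord[]): Map<string, number> {
  const distinct = new Map<string, Set<string>>();

  for (const doc of docs) {
    for (const field of Object.keys(doc)) {
      const value = doc[field];
      if (value === undefined) continue;

      let seen = distinct.get(field);
      if (!seen) {
        seen = new Set<string>();
        distinct.set(field, seen);
      }
      const values = Array.isArray(value) ? value : [value];
      for (const v of values) {
        seen.add(String(v));
      }
    }
  }

  const counts = new Map<string, number>();
  distinct.forEach((seen, field) => counts.set(field, seen.size));
  return counts;
}
```

Running this on a representative sample gives a quick picture of which fields are cheap to index and which ones deserve a closer look before they are exposed as filters.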

Checklist

  • Design metadata schema capturing date, category, author, tenant, classification
  • Implement self-querying to extract filters from natural language
  • Use vector database with native filter push-down
  • Combine vector similarity with metadata scoring
  • Always enforce tenant isolation at the query level
  • Support date-range and multi-value filtering
  • Monitor query performance; optimize high-cardinality filters

Conclusion

Metadata filtering transforms RAG from blunt similarity matching into precise, secure retrieval. By designing a thoughtful schema and leveraging database-level filtering, you'll reduce noise, respect tenant boundaries, and improve answer relevance. Start simple with date, category, and tenant—expand only as needed.