RAG Metadata Filtering — Using Structured Data to Sharpen Retrieval
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Vector similarity alone is blunt. A query about "engineering best practices from 2024" will retrieve 2019 documents if they're semantically close. Metadata filtering sharpens RAG by combining semantic similarity with structured constraints. When done right, filtering can cut retrieval noise sharply, by half or more in favorable cases, while enforcing tenant isolation and helping meet regulatory requirements.
- Metadata Schema Design
- Self-Querying Retrieval
- Filter Push-Down Architecture
- Combining Vector Similarity and Metadata Filters
- Tenant Isolation via Metadata
- Date-Range Filtering for Temporal Queries
- Multi-Value Filter Handling
- Filter Cardinality and Index Strategy
- Checklist
- Conclusion
Metadata Schema Design
Before filtering, design your schema. What metadata will each document carry?
interface DocumentMetadata {
documentId: string;
date: string; // ISO 8601
category: string;
author: string;
tenantId: string; // Multi-tenancy
dataClassification: 'public' | 'internal' | 'restricted';
version: number;
source: 'internal' | 'web' | 'customer-feedback';
tags: string[];
language: string;
}
interface DocumentWithMetadata {
id: string;
content: string;
embedding: number[]; // 1536-dim (OpenAI)
metadata: DocumentMetadata;
}
Metadata must balance specificity with query flexibility. Too many fields bloat every document; too few lose filtering power. Start with date, category, author, and tenantId. Add others as use cases demand.
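A schema only helps if every indexed record actually conforms to it. As a minimal sketch, assuming the four starter fields named above, an ingestion-time check could look like the following. `StarterMetadata`, `isIsoCalendarDate`, and `validateStarterMetadata` are hypothetical names introduced for illustration, not part of any library.

```typescript
// Hypothetical helper (not from the article): validates the four starter fields.
interface StarterMetadata {
  date: string; // ISO 8601 calendar date, e.g. "2024-03-15"
  category: string;
  author: string;
  tenantId: string;
}

const REQUIRED_FIELDS: (keyof StarterMetadata)[] = ['date', 'category', 'author', 'tenantId'];

// True only for real calendar dates written as YYYY-MM-DD (an optional time part is ignored).
function isIsoCalendarDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})(T.*)?$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Round-trip through Date.UTC so that overflow such as 2024-02-30 is rejected.
  const utc = new Date(Date.UTC(year, month - 1, day));
  return (
    utc.getUTCFullYear() === year &&
    utc.getUTCMonth() === month - 1 &&
    utc.getUTCDate() === day
  );
}

// Returns a list of problems; an empty list means the record is safe to index.
function validateStarterMetadata(input: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = input[field];
    if (typeof value !== 'string' || value.trim() === '') {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  const date = input['date'];
  if (typeof date === 'string' && date.trim() !== '' && !isIsoCalendarDate(date)) {
    errors.push('date must be a valid ISO 8601 calendar date (YYYY-MM-DD)');
  }
  return errors;
}
```

Rejecting malformed dates at ingestion matters because range filters on ISO strings compare lexicographically, so one badly formatted value can silently fall outside every range.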
Self-Querying Retrieval
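Self-querying uses the LLM to convert natural language into structured filters. User says "Show me docs from marketing in 2024" → the system extracts { category: 'marketing', dateMin: '2024-01-01', dateMax: '2024-12-31' }. Without the upper bound, the filter would also match anything published after 2024.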
interface FilterExpression {
operator: 'and' | 'or';
conditions: FilterCondition[];
}
interface FilterCondition {
field: string; // e.g., 'metadata.date', 'metadata.category'
operator: '<' | '>' | '=' | '!=' | 'in' | 'between'; // '!=' is needed by the tenant filter later on
value: string | number | string[];
}
async function selfQueryRetrieval(
userQuery: string,
metadataSchema: DocumentMetadata
): Promise<Document[]> {
const schemaJson = JSON.stringify(metadataSchema, null, 2);
const extractedFilter = await llm.generate({
prompt: `Given this metadata schema:
${schemaJson}
User query: "${userQuery}"
Extract filter conditions as JSON.
Return: { operator, conditions: [{ field, operator, value }] }`,
});
const filter: FilterExpression = JSON.parse(extractedFilter);
return await vectorDb.searchWithFilter(userQuery, filter);
}
Self-querying works well for users accustomed to natural language. It reduces the need for custom domain-specific query languages.
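LLM output is untrusted text, so it is worth validating the extracted filter before it reaches the database. The sketch below is one possible guard, assuming the `FilterExpression` shape from this section (with `'!='` added to the operator list). `USER_FILTERABLE_FIELDS`, `parseCondition`, and `parseLlmFilter` are hypothetical names, and the choice of allowed fields is an assumption to adapt.

```typescript
type Operator = '<' | '>' | '=' | '!=' | 'in' | 'between';
interface FilterCondition { field: string; operator: Operator; value: string | number | string[]; }
interface FilterExpression { operator: 'and' | 'or'; conditions: FilterCondition[]; }

// Assumption: only these fields may be filtered by end users. tenantId and
// dataClassification are deliberately absent so the LLM cannot widen access.
const USER_FILTERABLE_FIELDS = [
  'metadata.date', 'metadata.category', 'metadata.author', 'metadata.tags', 'metadata.language',
];
const OPERATORS: Operator[] = ['<', '>', '=', '!=', 'in', 'between'];

function isOperator(value: string): value is Operator {
  return OPERATORS.some(op => op === value);
}

// Returns a typed condition, or null if anything about the item is unexpected.
function parseCondition(item: unknown): FilterCondition | null {
  if (typeof item !== 'object' || item === null) return null;
  const raw = item as { field?: unknown; operator?: unknown; value?: unknown };
  const field = raw.field;
  const operator = raw.operator;
  const value = raw.value;
  if (typeof field !== 'string' || USER_FILTERABLE_FIELDS.indexOf(field) === -1) return null;
  if (typeof operator !== 'string' || !isOperator(operator)) return null;
  if (typeof value === 'string' || typeof value === 'number') {
    // Scalars are not valid for the list-valued operators.
    return operator === 'in' || operator === 'between' ? null : { field, operator, value };
  }
  if (!Array.isArray(value) || !value.every(v => typeof v === 'string')) return null;
  const list: string[] = value;
  if (operator === 'in' && list.length > 0) return { field, operator, value: list };
  if (operator === 'between' && list.length === 2) return { field, operator, value: list };
  return null;
}

// Parses raw LLM output; null means "do not trust this filter".
function parseLlmFilter(rawJson: string): FilterExpression | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;
  const raw = parsed as { operator?: unknown; conditions?: unknown };
  const operator = raw.operator === 'and' ? 'and' : raw.operator === 'or' ? 'or' : null;
  const rawConditions = raw.conditions;
  if (operator === null || !Array.isArray(rawConditions)) return null;
  const conditions: FilterCondition[] = [];
  for (const item of rawConditions) {
    const condition = parseCondition(item);
    if (condition === null) return null;
    conditions.push(condition);
  }
  return { operator, conditions };
}
```

When `parseLlmFilter` returns `null`, the caller can fall back to an unfiltered (but still tenant-scoped) search or ask the user to rephrase; which fallback is right depends on the product.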
Filter Push-Down Architecture
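Vector databases vary in filter performance. Some (Pinecone, Weaviate) apply metadata filters during the index search itself, so only matching vectors are ranked. Others (naive implementations) fetch the top-k results and then filter, which wastes work and can return far fewer results than requested.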
For optimal performance, design your pipeline to push filters down:
interface SearchRequest {
query: string;
embedding: number[];
filter: FilterExpression;
topK: number;
}
async function efficientVectorSearch(
request: SearchRequest
): Promise<Document[]> {
// Good: the vector DB applies the filter during the nearest-neighbor search
// Returns up to topK results, all matching the filter
return await vectorDb.search(request);
}
async function inefficientSearch(
query: string,
filter: FilterExpression
): Promise<Document[]> {
// Bad: fetch all top-K, then filter in application
// May return far fewer results than requested
const allResults = await vectorDb.search(query, { topK: 1000 });
return allResults.filter(doc => matchesFilter(doc, filter));
}
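For large-scale systems, use vector databases with native filter support (Pinecone, Milvus, Weaviate). Because they rank only documents that already satisfy the filter, they typically outperform post-hoc filtering by a wide margin and return a full topK whenever enough matches exist.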
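Pushing a filter down means translating the application's `FilterExpression` into whatever syntax the database evaluates natively. As a sketch, the function below targets a Mongo-style operator dictionary (`$eq`, `$in`, `$gte`, and so on), a convention several vector databases accept in some form. The exact operator set is an assumption here: support differs per product, and some only allow range comparisons on numbers, in which case dates would need to be stored as numeric timestamps. `toNativeFilter` and `NativeFilter` are hypothetical names.

```typescript
type Operator = '<' | '>' | '=' | '!=' | 'in' | 'between';
interface FilterCondition { field: string; operator: Operator; value: string | number | string[]; }
interface FilterExpression { operator: 'and' | 'or'; conditions: FilterCondition[]; }

// Assumed target shape: a Mongo-style filter object. Check your database's docs.
type NativeFilter = Record<string, unknown>;

// 'metadata.category' -> 'category' (assumes the DB stores metadata keys unprefixed)
function stripPrefix(field: string): string {
  return field.replace(/^metadata\./, '');
}

function toNativeClause(c: FilterCondition): NativeFilter {
  const key = stripPrefix(c.field);
  switch (c.operator) {
    case '=':
      return { [key]: { $eq: c.value } };
    case '!=':
      return { [key]: { $ne: c.value } };
    case '<':
      return { [key]: { $lt: c.value } };
    case '>':
      return { [key]: { $gt: c.value } };
    case 'in':
      if (!Array.isArray(c.value)) throw new Error(`'in' needs a list for ${c.field}`);
      return { [key]: { $in: c.value } };
    case 'between':
      if (!Array.isArray(c.value) || c.value.length !== 2) {
        throw new Error(`'between' needs [min, max] for ${c.field}`);
      }
      // Inclusive on both ends, matching the date-range examples in this article.
      return { [key]: { $gte: c.value[0], $lte: c.value[1] } };
    default:
      throw new Error(`Unsupported operator for ${c.field}`);
  }
}

// Converts the whole expression into one object the DB can evaluate during search.
function toNativeFilter(expr: FilterExpression): NativeFilter {
  const logicalKey = expr.operator === 'and' ? '$and' : '$or';
  return { [logicalKey]: expr.conditions.map(toNativeClause) };
}
```

Keeping this translation in one function also gives a single place to reject conditions the database cannot execute natively, instead of silently falling back to application-side filtering.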
Combining Vector Similarity and Metadata Filters
The best retrieval combines two signals: semantic relevance and metadata matching.
interface CombinedSearchRequest {
query: string;
filter: FilterExpression;
vectorWeight: number; // 0-1
filterWeight: number; // 0-1
}
async function combinedSearch(request: CombinedSearchRequest): Promise<Document[]> {
// Retrieve with filter to reduce candidate set
const candidates = await vectorDb.searchWithFilter(
request.query,
request.filter,
{ topK: 100 } // Larger candidate set for re-ranking
);
// Re-rank by combined score
const scored = candidates.map(doc => ({
doc,
score:
request.vectorWeight * doc.similarityScore +
request.filterWeight * computeMetadataScore(doc.metadata),
}));
return scored
.sort((a, b) => b.score - a.score)
.slice(0, 10)
.map(s => s.doc);
}
function computeMetadataScore(metadata: DocumentMetadata): number {
let score = 1.0;
// Favor recent documents: linear decay that reaches 0 for anything older than one year
const daysSincePublish = daysSince(metadata.date);
score *= Math.max(0, 1 - daysSincePublish / 365);
// Boost verified sources
if (metadata.source === 'internal') score *= 1.2;
return score;
}
This two-stage ranking—first filter, then score—balances accuracy and performance.
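To make the scoring stage concrete, here is a self-contained sketch of the re-rank step on in-memory candidates. It deliberately swaps the linear one-year cutoff for an exponential half-life decay, so that older documents are down-weighted rather than zeroed out; that substitution, the 0.8 factor for non-internal sources, and the names `Candidate`, `recencyScore`, and `rerank` are all assumptions made for illustration.

```typescript
interface Candidate {
  id: string;
  similarityScore: number; // assumed to be normalized to [0, 1]
  date: string; // ISO 8601
  source: 'internal' | 'web' | 'customer-feedback';
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Exponential decay: 1.0 for today, 0.5 after one half-life, 0.25 after two, ...
function recencyScore(date: string, now: Date, halfLifeDays: number = 365): number {
  const ageDays = Math.max(0, (now.getTime() - Date.parse(date)) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Stays within [0, 1] so it is comparable with the similarity score.
function metadataScore(candidate: Candidate, now: Date): number {
  const sourceFactor = candidate.source === 'internal' ? 1.0 : 0.8;
  return recencyScore(candidate.date, now) * sourceFactor;
}

// Weighted sum of the two signals, highest score first.
function rerank(
  candidates: Candidate[],
  vectorWeight: number,
  filterWeight: number,
  now: Date,
  limit: number = 10
): Candidate[] {
  return candidates
    .map(candidate => ({
      candidate,
      score:
        vectorWeight * candidate.similarityScore +
        filterWeight * metadataScore(candidate, now),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(scored => scored.candidate);
}
```

Passing `now` explicitly keeps the ranking deterministic in tests. With weights of 0.7 and 0.3, a slightly less similar but recent internal document can outrank a closer match from several years ago; with weights of 1 and 0 the order falls back to pure similarity.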
Tenant Isolation via Metadata
In multi-tenant SaaS systems, metadata filtering is your security boundary.
interface TenantContext {
tenantId: string;
userId: string;
roles: string[];
}
async function tenantSafeSearch(
query: string,
tenant: TenantContext
): Promise<Document[]> {
// Always add tenantId to filter—non-negotiable
const filter: FilterExpression = {
operator: 'and',
conditions: [
{
field: 'metadata.tenantId',
operator: '=',
value: tenant.tenantId,
},
// Optional: filter by dataClassification based on user role
...(tenant.roles.includes('admin')
? []
: [
{
field: 'metadata.dataClassification',
operator: '!=',
value: 'restricted',
},
]),
],
};
return await vectorDb.searchWithFilter(query, filter);
}
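Never rely on post-retrieval checks in application code for tenant isolation. Make the tenant condition part of every query, and build it from the authenticated session rather than from user input or LLM-extracted filters.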
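One way to keep that guarantee when user-supplied or LLM-extracted filters are also in play is to wrap them, never merge them. The sketch below assumes a nested filter type (`FilterGroup`), which goes beyond the flat `FilterExpression` used in this article; `withTenantScope` and `referencesTenant` are hypothetical names.

```typescript
type Operator = '<' | '>' | '=' | '!=' | 'in' | 'between';
interface FilterCondition { field: string; operator: Operator; value: string | number | string[]; }

// Assumption: the database (or a translation layer) accepts nested groups.
interface FilterGroup { operator: 'and' | 'or'; conditions: FilterNode[]; }
type FilterNode = FilterCondition | FilterGroup;

const TENANT_FIELD = 'metadata.tenantId';

// True if any condition, at any depth, touches the tenant field.
function referencesTenant(node: FilterNode): boolean {
  if ('field' in node) return node.field === TENANT_FIELD;
  return node.conditions.some(referencesTenant);
}

// Wraps an untrusted filter in an outer AND whose first condition is the tenant check.
function withTenantScope(userFilter: FilterGroup | null, tenantId: string): FilterGroup {
  if (tenantId.trim() === '') throw new Error('tenantId is required');
  if (userFilter !== null && referencesTenant(userFilter)) {
    throw new Error('User-supplied filters may not reference tenantId');
  }
  const tenantCondition: FilterCondition = {
    field: TENANT_FIELD,
    operator: '=',
    value: tenantId,
  };
  return {
    operator: 'and',
    conditions: userFilter === null ? [tenantCondition] : [tenantCondition, userFilter],
  };
}
```

The outer `and` is the important design choice: if the untrusted filter uses `or`, appending the tenant condition to its own condition list would turn the tenant check into just one alternative among several.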
Date-Range Filtering for Temporal Queries
Many queries are temporal. "What was the quarterly revenue in Q3 2024?"
interface DateRangeFilter {
field: 'metadata.date';
operator: 'between';
value: [string, string]; // [minDate, maxDate]
}
async function dateRangeSearch(
query: string,
startDate: string,
endDate: string
): Promise<Document[]> {
const filter: FilterExpression = {
operator: 'and',
conditions: [
{
field: 'metadata.date',
operator: 'between',
value: [startDate, endDate],
},
],
};
return await vectorDb.searchWithFilter(query, filter);
}
For time-series queries ("revenue trends over the last 2 years"), split the range into quarterly or monthly windows and retrieve from each window separately, so that no single period crowds the others out of the top-k.
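The windowing step can be sketched as a pure function that cuts an inclusive date range at calendar-quarter boundaries. `splitIntoQuarters` and `DateWindow` are hypothetical names; the function assumes `YYYY-MM-DD` inputs and works in UTC to avoid time-zone drift.

```typescript
interface DateWindow {
  start: string; // inclusive, YYYY-MM-DD
  end: string; // inclusive, YYYY-MM-DD
}

function toIsoDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Splits [startDate, endDate] into windows that never cross a calendar quarter.
function splitIntoQuarters(startDate: string, endDate: string): DateWindow[] {
  const start = new Date(`${startDate}T00:00:00Z`);
  const end = new Date(`${endDate}T00:00:00Z`);
  if (isNaN(start.getTime()) || isNaN(end.getTime()) || start.getTime() > end.getTime()) {
    throw new Error('Invalid date range');
  }
  const windows: DateWindow[] = [];
  let cursor = start;
  while (cursor.getTime() <= end.getTime()) {
    const quarterStartMonth = Math.floor(cursor.getUTCMonth() / 3) * 3;
    // Day 0 of the month after the quarter is the last day of the quarter.
    const quarterEnd = new Date(Date.UTC(cursor.getUTCFullYear(), quarterStartMonth + 3, 0));
    const windowEnd = quarterEnd.getTime() < end.getTime() ? quarterEnd : end;
    windows.push({ start: toIsoDate(cursor), end: toIsoDate(windowEnd) });
    // Move to the day after this window; Date.UTC handles month and year rollover.
    cursor = new Date(
      Date.UTC(windowEnd.getUTCFullYear(), windowEnd.getUTCMonth(), windowEnd.getUTCDate() + 1)
    );
  }
  return windows;
}
```

Each window can then be passed as the `startDate` and `endDate` of a separate range-filtered search, with a small per-window `topK`, and the results concatenated in chronological order.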
Multi-Value Filter Handling
Tags and categories often have multiple values. "Find docs in [engineering, sales] categories from [2023, 2024]."
async function multiValueFilters(
query: string,
categories: string[],
years: number[]
): Promise<Document[]> {
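// Caveat: a single min..max range also matches the years in between,
// so [2021, 2024] would include 2022 and 2023. Use one range per year if that matters.
const yearRange = [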
`${Math.min(...years)}-01-01`,
`${Math.max(...years)}-12-31`,
];
const filter: FilterExpression = {
operator: 'and',
conditions: [
{
field: 'metadata.category',
operator: 'in',
value: categories,
},
{
field: 'metadata.date',
operator: 'between',
value: yearRange,
},
],
};
return await vectorDb.searchWithFilter(query, filter);
}
The in operator for multi-value matching is essential for usability. Without it, users can't easily combine options.
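The semantics of `in` and `between` are easiest to pin down with a small in-memory evaluator, which is also useful for unit-testing filters without a database. This is a sketch under assumptions: `in` on an array field such as `tags` is treated as "any overlap", `between` is inclusive, and values of mixed types never match. A real vector database may define these cases differently.

```typescript
type Operator = '<' | '>' | '=' | '!=' | 'in' | 'between';
interface FilterCondition { field: string; operator: Operator; value: string | number | string[]; }
interface FilterExpression { operator: 'and' | 'or'; conditions: FilterCondition[]; }

type Scalar = string | number;
type Metadata = Record<string, Scalar | string[]>;

// Negative, zero, or positive like a sort comparator; null for mixed types.
function compareScalars(a: Scalar, b: Scalar): number | null {
  if (typeof a === 'number' && typeof b === 'number') return a - b;
  if (typeof a === 'string' && typeof b === 'string') return a < b ? -1 : a > b ? 1 : 0;
  return null;
}

function matchesCondition(metadata: Metadata, c: FilterCondition): boolean {
  const actual = metadata[c.field.replace(/^metadata\./, '')];
  if (actual === undefined) return false;
  // Treat scalar fields as one-element lists so tags and category share one code path.
  const actualValues: Scalar[] = Array.isArray(actual) ? actual : [actual];
  switch (c.operator) {
    case 'in': {
      if (!Array.isArray(c.value)) return false;
      const allowed = c.value;
      return actualValues.some(v => allowed.some(a => a === v));
    }
    case 'between': {
      if (!Array.isArray(c.value) || c.value.length !== 2) return false;
      const [min, max] = c.value;
      return actualValues.some(v => {
        const low = compareScalars(v, min);
        const high = compareScalars(v, max);
        return low !== null && high !== null && low >= 0 && high <= 0;
      });
    }
    default: {
      if (Array.isArray(c.value)) return false;
      const expected = c.value;
      if (c.operator === '=') return actualValues.some(v => compareScalars(v, expected) === 0);
      if (c.operator === '!=') return actualValues.every(v => compareScalars(v, expected) !== 0);
      const wantLess = c.operator === '<';
      return actualValues.some(v => {
        const result = compareScalars(v, expected);
        return result !== null && (wantLess ? result < 0 : result > 0);
      });
    }
  }
}

function matchesFilter(metadata: Metadata, filter: FilterExpression): boolean {
  return filter.operator === 'and'
    ? filter.conditions.every(c => matchesCondition(metadata, c))
    : filter.conditions.some(c => matchesCondition(metadata, c));
}
```

Note that `between` on ISO date strings relies on lexicographic order, which only works when every stored date uses the same zero-padded format.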
Filter Cardinality and Index Strategy
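High-cardinality fields (unique IDs, timestamps) are typically more expensive to index and filter, and an equality match on one of them is highly selective, which can starve an approximate nearest-neighbor search of candidates. Low-cardinality fields (categories, statuses) are cheap.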
// Good: filter by low-cardinality field first
// 10-100 unique values
{
field: 'metadata.category',
operator: 'in',
value: ['engineering', 'product'],
}
// Expensive: filter by high-cardinality field (potentially millions of values)
{
field: 'metadata.authorId',
operator: '=',
value: 'user-12345',
}
For large datasets:
- Index low-cardinality filters at the vector DB level
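- For high-cardinality fields the database cannot index efficiently, over-fetch and filter in the application, accepting that fewer than topK results may survive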
- Avoid OR across high-cardinality fields
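When application-side filtering is unavoidable, the amount to over-fetch follows from the filter's selectivity, meaning the fraction of documents it lets through. The sketch below is a rough heuristic, not a guarantee: it assumes matching documents are spread evenly among the nearest neighbors, which real data often violates. `estimateSelectivity`, `overFetchSize`, and the `maxFetch` cap are hypothetical.

```typescript
// Looks up a count safely, ignoring inherited object properties.
function countFor(valueCounts: Record<string, number>, value: string): number {
  return Object.prototype.hasOwnProperty.call(valueCounts, value) ? valueCounts[value] : 0;
}

// Fraction of documents whose field value is one of `wanted`, given per-value counts.
function estimateSelectivity(valueCounts: Record<string, number>, wanted: string[]): number {
  const total = Object.keys(valueCounts).reduce((sum, key) => sum + valueCounts[key], 0);
  if (total === 0) return 0;
  const unique = wanted.filter((value, index) => wanted.indexOf(value) === index);
  const matched = unique.reduce((sum, value) => sum + countFor(valueCounts, value), 0);
  return matched / total;
}

// How many neighbors to fetch so that roughly topK survive the post-filter.
function overFetchSize(topK: number, selectivity: number, maxFetch: number = 10000): number {
  if (selectivity <= 0) return maxFetch; // nothing is expected to match; fetch the cap
  return Math.min(maxFetch, Math.ceil(topK / selectivity));
}
```

For example, a filter that keeps a quarter of the corpus needs about four times `topK` candidates, while a filter matching one document in a million would hit the cap immediately, which is a sign that the condition belongs in the database rather than in the application.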
Checklist
- Design metadata schema capturing date, category, author, tenant, classification
- Implement self-querying to extract filters from natural language
- Use vector database with native filter push-down
- Combine vector similarity with metadata scoring
- Always enforce tenant isolation at the query level
- Support date-range and multi-value filtering
- Monitor query performance; optimize high-cardinality filters
Conclusion
Metadata filtering transforms RAG from blunt similarity matching into precise, secure retrieval. By designing a thoughtful schema and leveraging database-level filtering, you'll reduce noise, respect tenant boundaries, and improve answer relevance. Start simple with date, category, and tenant—expand only as needed.