Reranking for RAG — Why Your Top-K Retrieved Chunks Are Wrong
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Vector search is fast but biased. Your top-5 retrieved chunks often don't contain the answer. The culprit: symmetric similarity search treats query and document embeddings equally, missing query-specific relevance signals.
Reranking fixes this by using task-specific models that understand answer relevance.
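The fix is a two-stage pipeline: cheap vector retrieval casts a wide net, then an accurate reranker reorders only that small candidate set. A minimal sketch of the pattern, with hypothetical stand-in scoring functions in place of a real embedding model and cross-encoder:

```typescript
type Doc = { id: string; text: string };

// Two-stage retrieval: score everything cheaply, rerank only the survivors.
// `vectorScore` and `rerankScore` are hypothetical stand-ins for a real
// embedding model and cross-encoder.
function retrieveThenRerank(
  docs: Doc[],
  vectorScore: (d: Doc) => number, // fast, approximate
  rerankScore: (d: Doc) => number, // slow, accurate
  fanout: number = 20,
  topK: number = 5
): Doc[] {
  const candidates = [...docs]
    .sort((a, b) => vectorScore(b) - vectorScore(a))
    .slice(0, fanout); // wide net
  return candidates
    .sort((a, b) => rerankScore(b) - rerankScore(a))
    .slice(0, topK); // accurate final ordering
}
```

The expensive scorer only ever sees `fanout` documents, so its cost is bounded regardless of corpus size.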
- Why Vector Search Ranks Poorly
- Cross-Encoder Rerankers
- Cohere Rerank API
- Local Reranking with FlashRank
- Lost-in-the-Middle Problem
- Reranking with Metadata Signals
- Compression After Reranking
- Reranker Evaluation
- Checklist
- Conclusion
Why Vector Search Ranks Poorly
Vector similarity uses symmetric distance: both query and document get embedded, then cosine similarity determines rank. This ignores critical signals:
// Vector search: symmetric similarity (often wrong)
async function vectorSearchRanking(
  query: string,
  candidates: Array<{ id: string; text: string }>
): Promise<Array<{ id: string; score: number; text: string }>> {
  const queryEmbedding = await embedModel.embed(query);
  // Embed candidates concurrently; `await` is not allowed inside a synchronous map callback
  const scored = await Promise.all(
    candidates.map(async candidate => ({
      id: candidate.id,
      score: cosineSimilarity(queryEmbedding, await embedModel.embed(candidate.text)),
      text: candidate.text,
    }))
  );
  return scored.sort((a, b) => b.score - a.score);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dotProduct / (normA * normB);
}

// Problem: this approach misses:
// - Exact answer specificity (for "John's age?", chunks about "people" can outrank "John is 25")
// - Answer position bias (an answer at the end of a chunk scores the same as one at the beginning)
// - Chunk length bias (longer chunks tend to score higher by default)
// - Out-of-domain embeddings (the embedding model was never trained on your corpus)
Cross-Encoder Rerankers
Cross-encoders take the query and document together as a single input and output a relevance score:
interface CrossEncoderReranker {
  rank(query: string, candidates: string[]): Promise<number[]>;
}

// Sketch of a local cross-encoder. The model/tokenizer calls below mirror the
// Python transformers API and are illustrative, not runnable TypeScript.
class LocalCrossEncoderReranker implements CrossEncoderReranker {
  private model: any; // Hugging Face model
  private tokenizer: any;
  private device: string = 'cpu';

  constructor(modelName: string = 'cross-encoder/ms-marco-MiniLM-L-12-v2') {
    // In real code, load from Hugging Face:
    // this.model = AutoModel.from_pretrained(modelName);
    // this.tokenizer = AutoTokenizer.from_pretrained(modelName);
  }

  async rank(query: string, candidates: string[]): Promise<number[]> {
    // Encode [query, document] pairs
    const pairs = candidates.map(doc => [query, doc]);
    // Get logits from the cross-encoder
    const encodings = this.tokenizer(pairs, {
      padding: true,
      truncation: true,
      returnTensors: 'pt',
    });
    const outputs = this.model(encodings);
    return outputs.logits.flatten().tolist();
  }
}

async function rerankerRanking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  reranker: CrossEncoderReranker
): Promise<Array<{ id: string; score: number; text: string }>> {
  const scores = await reranker.rank(query, candidates.map(c => c.text));
  return candidates
    .map((candidate, idx) => ({
      ...candidate,
      score: scores[idx],
    }))
    .sort((a, b) => b.score - a.score);
}

// Cross-encoder advantages:
// - Task-specific ranking (trained on relevance pairs)
// - Asymmetric scoring (query and document are not treated identically)
// - Better than vector search on NDCG@10
// - Disadvantage: slow (one forward pass per query-document pair)
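The per-pair cost can be bounded by scoring pairs in fixed-size batches rather than all at once; a minimal sketch, where `scoreBatch` is a hypothetical stand-in for one model forward pass over a batch of pairs:

```typescript
// Rank documents by scoring [query, doc] pairs in fixed-size batches.
// Batching bounds peak memory and lets a GPU amortize each forward pass.
async function rankInBatches(
  query: string,
  docs: string[],
  scoreBatch: (pairs: Array<[string, string]>) => Promise<number[]>,
  batchSize: number = 8
): Promise<number[]> {
  const scores: number[] = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    const pairs = docs
      .slice(i, i + batchSize)
      .map(d => [query, d] as [string, string]);
    scores.push(...(await scoreBatch(pairs)));
  }
  return scores; // same order as `docs`
}
```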
Cohere Rerank API
Use a managed cross-encoder API (the simplest option):
import { CohereClient } from 'cohere-ai';

async function cohereRanking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query: query,
    documents: candidates.map(c => c.text),
    topN: topK,
    returnDocuments: false,
  });
  // Map results back to candidates
  return response.results
    .map(result => ({
      ...candidates[result.index],
      score: result.relevanceScore,
    }))
    .slice(0, topK);
}

// Usage in a RAG pipeline
async function ragWithCohere(
  query: string,
  vectorDB: VectorStore,
  llm: LLMClient
): Promise<string> {
  // Step 1: Initial retrieval (fast, many candidates)
  const candidates = await vectorDB.search(query, { topK: 20 });
  // Step 2: Rerank with Cohere (accurate, few documents)
  const reranked = await cohereRanking(query, candidates, 5);
  // Step 3: Generate with top results
  const context = reranked.map(r => r.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });
  return answer.text;
}
Local Reranking with FlashRank
Lightweight local reranking (no API calls):
interface FlashRankReranker {
  rank(query: string, passages: string[]): Promise<Array<{ score: number; index: number }>>;
}

async function flashrankReranking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  // Mock FlashRank implementation. FlashRank itself is a Python library:
  //   pip install flashrank
  //   from flashrank import Ranker
  //   ranker = Ranker(model_name="ce-onnx-small", cache_dir="/tmp/flashrank_cache")
  interface RankResult {
    score: number;
    index: number;
  }
  const flashrank = {
    rank: async (q: string, passages: string[]): Promise<RankResult[]> => {
      // Placeholder: in production, call the actual FlashRank ranker
      return passages.map((_, idx) => ({
        score: Math.random(), // Replace with real scores
        index: idx,
      }));
    },
  };
  const results = await flashrank.rank(query, candidates.map(c => c.text));
  return results
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(result => ({
      ...candidates[result.index],
      score: result.score,
    }));
}

// Benchmarking local vs API reranking
async function benchmarkReranking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  topK: number = 5
): Promise<{
  cohere: { time: number; results: any[] };
  local: { time: number; results: any[] };
}> {
  // Cohere API call
  const cohereStart = Date.now();
  const cohereResults = await cohereRanking(query, candidates, topK);
  const cohereTime = Date.now() - cohereStart;
  // Local reranking
  const localStart = Date.now();
  const localResults = await flashrankReranking(query, candidates, topK);
  const localTime = Date.now() - localStart;
  return {
    cohere: { time: cohereTime, results: cohereResults },
    local: { time: localTime, results: localResults },
  };
}
Lost-in-the-Middle Problem
Answers buried in the middle of a context window often get ignored by the LLM. Reranking can exacerbate this when the reranker carries its own position bias:
// Problem: middle chunks downranked despite containing the answer
interface MiddleChunk {
  position: 'beginning' | 'middle' | 'end';
  chunkIndex: number;
  originalRank: number;
  rerankerScore: number;
}

async function analyzePositionBias(
  rerankedResults: Array<{ id: string; text: string; score: number }>,
  originalOrder: Array<{ id: string }>
): Promise<{
  positionBias: Record<string, number>;
  middleDropRate: number;
}> {
  const totalChunks = originalOrder.length;
  // First 20% of the original order = beginning, last 20% = end, the rest = middle
  const positionBias: Record<string, number> = {
    beginning: 0,
    middle: 0,
    end: 0,
  };
  // Count where each reranked result sat in the original order
  for (const result of rerankedResults) {
    const originalIdx = originalOrder.findIndex(r => r.id === result.id);
    if (originalIdx < totalChunks * 0.2) {
      positionBias.beginning++;
    } else if (originalIdx < totalChunks * 0.8) {
      positionBias.middle++;
    } else {
      positionBias.end++;
    }
  }
  const middleDropRate = 1 - positionBias.middle / rerankedResults.length;
  return {
    positionBias,
    middleDropRate,
  };
}
// Solution: position-aware reranking
async function positionAwareReranking(
  query: string,
  candidates: Array<{ id: string; text: string; originalRank: number }>,
  reranker: CrossEncoderReranker,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  const scores = await reranker.rank(query, candidates.map(c => c.text));
  const adjusted = candidates.map((candidate, idx) => {
    let score = scores[idx];
    // Distance from the middle of the candidate list (0 = dead center, 0.5 = edge)
    const middleRatio = Math.abs(idx / candidates.length - 0.5);
    // Slight boost for peripheral chunks to counteract lost-in-the-middle
    if (middleRatio > 0.3) {
      score *= 1.1; // 10% boost for beginning/end chunks
    }
    return {
      ...candidate,
      score,
    };
  });
  return adjusted
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
Reranking with Metadata Signals
Enhance reranking with document metadata:
interface ChunkWithMetadata {
  id: string;
  text: string;
  metadata: {
    source?: string;
    authority?: number; // 0-1, higher = more trustworthy
    recency?: number; // timestamp (ms) of the last update
    relevantKeywords?: string[];
  };
}

async function metadataAwareReranking(
  query: string,
  candidates: ChunkWithMetadata[],
  reranker: CrossEncoderReranker,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  const rerankerScores = await reranker.rank(
    query,
    candidates.map(c => c.text)
  );
  const now = Date.now();
  const adjusted = candidates.map((candidate, idx) => {
    let score = rerankerScores[idx];
    // Boost by authority
    if (candidate.metadata.authority) {
      score *= 1 + candidate.metadata.authority * 0.3; // Up to 30% boost
    }
    // Boost for recency (linear decay over 30 days)
    if (candidate.metadata.recency) {
      const daysSinceUpdate = (now - candidate.metadata.recency) / (1000 * 60 * 60 * 24);
      const recencyBoost = Math.max(0, 1 - daysSinceUpdate / 30);
      score *= 1 + recencyBoost * 0.2;
    }
    // Boost for keyword matches
    if (candidate.metadata.relevantKeywords) {
      const queryWords = query.toLowerCase().split(/\s+/);
      const keywordMatches = candidate.metadata.relevantKeywords.filter(kw =>
        queryWords.some(qw => qw.includes(kw) || kw.includes(qw))
      ).length;
      score *= 1 + keywordMatches * 0.15;
    }
    return {
      ...candidate,
      score,
    };
  });
  return adjusted
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
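As a sanity check on the boost arithmetic above, the recency multiplier in isolation: a full 20% boost for a just-updated document, decaying linearly to no boost at 30 days:

```typescript
// Recency multiplier matching the reranking boost above:
// 1.2x when just updated, 1.0x at 30+ days old.
function recencyMultiplier(lastUpdatedMs: number, nowMs: number): number {
  const days = (nowMs - lastUpdatedMs) / (1000 * 60 * 60 * 24);
  const boost = Math.max(0, 1 - days / 30);
  return 1 + boost * 0.2;
}
```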
Compression After Reranking
Reduce context size while preserving answer relevance. Tools like LLMLingua do this with a dedicated compressor model; the sketch below approximates the idea by asking an LLM to compress the context:
async function llmCompression(
  context: string,
  query: string,
  compressionRatio: number = 0.5 // Keep ~50% of tokens
): Promise<string> {
  // Assumes an `llm` client (LLMClient) is available in module scope
  const tokenCount = context.split(/\s+/).length; // rough whitespace token count
  const targetTokens = Math.floor(tokenCount * compressionRatio);
  const compressionPrompt = `
Your task is to compress the following context while preserving the information
needed to answer the query.
Query: "${query}"
Original context:
${context}
Compressed context (keep only the most relevant ~${targetTokens} tokens):`;
  const compressed = await llm.generate({
    messages: [{ role: 'user', content: compressionPrompt }],
    maxTokens: Math.floor(targetTokens * 1.2),
  });
  return compressed.text;
}
// Production pipeline with reranking + compression
async function optimizedRAG(
  query: string,
  vectorDB: VectorStore,
  reranker: CrossEncoderReranker,
  llm: LLMClient
): Promise<string> {
  // Step 1: Fast retrieval (many candidates)
  const candidates = await vectorDB.search(query, { topK: 30 });
  // Step 2: Accurate reranking (few top results)
  const reranked = await rerankerRanking(query, candidates, reranker);
  // Step 3: Compress context
  const contextToCompress = reranked.slice(0, 5).map(r => r.text).join('\n\n');
  const compressed = await llmCompression(contextToCompress, query, 0.6);
  // Step 4: Generate with compressed context
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${compressed}\n\nQuestion: ${query}`,
      },
    ],
  });
  return answer.text;
}
Reranker Evaluation
Measure reranking effectiveness:
interface RerankerMetrics {
  ndcgImprovement: number; // relative improvement over vector search
  latency: number; // ms per query
  costPerQuery: number; // $ (for API rerankers)
  relativeRank: number; // average rank improvement of the first relevant doc
}

async function evaluateReranker(
  testQueries: Array<{
    query: string;
    candidates: Array<{ id: string; text: string }>;
    relevant: Set<string>;
  }>,
  vectorRetriever: (q: string, docs: string[]) => Promise<number[]>,
  reranker: CrossEncoderReranker
): Promise<RerankerMetrics> {
  const improvements: number[] = [];
  const latencies: number[] = [];
  const rankDeltas: number[] = [];
  let totalCost = 0;
  for (const test of testQueries) {
    // Vector ranking
    const vectorScores = await vectorRetriever(test.query, test.candidates.map(c => c.text));
    const vectorRanked = test.candidates
      .map((c, i) => ({ id: c.id, score: vectorScores[i] }))
      .sort((a, b) => b.score - a.score);
    // Reranker ranking
    const start = Date.now();
    const rerankerScores = await reranker.rank(test.query, test.candidates.map(c => c.text));
    latencies.push(Date.now() - start);
    const rerankerRanked = test.candidates
      .map((c, i) => ({ id: c.id, score: rerankerScores[i] }))
      .sort((a, b) => b.score - a.score);
    // NDCG improvement relative to the vector baseline
    const vectorNDCG = computeNDCG(vectorRanked.map(r => r.id), test.relevant, 5);
    const rerankerNDCG = computeNDCG(rerankerRanked.map(r => r.id), test.relevant, 5);
    improvements.push((rerankerNDCG - vectorNDCG) / vectorNDCG);
    // Rank improvement of the first relevant document (positive = moved up)
    const firstRelevant = (ranked: Array<{ id: string }>) =>
      ranked.findIndex(r => test.relevant.has(r.id));
    rankDeltas.push(firstRelevant(vectorRanked) - firstRelevant(rerankerRanked));
    // Rough per-document cost estimate; check your provider's current pricing
    totalCost += test.candidates.length * 0.000001;
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    ndcgImprovement: mean(improvements),
    latency: mean(latencies),
    costPerQuery: totalCost / testQueries.length,
    relativeRank: mean(rankDeltas),
  };
}

function computeNDCG(rankedIds: string[], relevant: Set<string>, k: number = 5): number {
  const dcg = rankedIds.slice(0, k).reduce((sum, id, i) => {
    const rel = relevant.has(id) ? 1 : 0;
    return sum + rel / Math.log2(i + 2);
  }, 0);
  const idcg = Array(Math.min(k, relevant.size))
    .fill(1)
    .reduce((sum, _, i) => sum + 1 / Math.log2(i + 2), 0);
  return idcg === 0 ? 0 : dcg / idcg;
}
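A quick numeric check of the metric (re-stating the NDCG function so the snippet is self-contained): moving a single relevant chunk from rank 3 to rank 1 lifts NDCG@5 from 0.5 to 1.0.

```typescript
// Self-contained copy of the NDCG@k computation above, for a worked example.
function ndcg(rankedIds: string[], relevant: Set<string>, k: number = 5): number {
  const dcg = rankedIds.slice(0, k).reduce(
    (sum, id, i) => sum + (relevant.has(id) ? 1 : 0) / Math.log2(i + 2),
    0
  );
  const idcg = Array(Math.min(k, relevant.size))
    .fill(1)
    .reduce((sum, _, i) => sum + 1 / Math.log2(i + 2), 0);
  return idcg === 0 ? 0 : dcg / idcg;
}

// One relevant chunk ranked third: DCG = 1/log2(4) = 0.5, ideal DCG = 1/log2(2) = 1,
// so NDCG@5 = 0.5. Reranked to first: DCG = 1, NDCG@5 = 1.0.
```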
Checklist
- Measure baseline vector search NDCG@5
- Implement cross-encoder reranking (Cohere or local)
- Track reranking latency (target <100ms for top-5)
- Evaluate position bias in your reranked results
- Add metadata signals (authority, recency, keywords)
- Test compression with reranked context
- Cost-benefit analysis (improved quality vs API costs)
- Set up A/B test vector-only vs reranked pipeline
- Monitor top-k position distribution (avoid middle collapse)
- Build golden relevance dataset for evaluation
Conclusion
Reranking is often the single highest-impact optimization in a RAG pipeline. A 10-20% improvement in retrieval quality compounds: shorter, more relevant context means faster generation, fewer hallucinations, and a better user experience. Start with API-based cross-encoders (the simplest path), measure the NDCG@5 improvement, then migrate to local models if latency or cost becomes the bottleneck. The key question: does reranking place relevant chunks in the top 3? Everything else is optimization.