Reranking for RAG — Why Your Top-K Retrieved Chunks Are Wrong

Introduction

Vector search is fast but biased. Your top-5 retrieved chunks often don't contain the answer. The culprit: symmetric similarity search treats query and document embeddings equally, missing query-specific relevance signals.

Reranking fixes this by using task-specific models that understand answer relevance.

Why Vector Search Ranks Poorly

Vector similarity uses symmetric distance: both query and document get embedded, then cosine similarity determines rank. This ignores critical signals:

// Vector search: symmetric similarity (often wrong)
// Assumes `embedModel` is an embedding client with an async `embed(text)` method
async function vectorSearchRanking(
  query: string,
  candidates: Array<{ id: string; text: string }>
): Promise<Array<{ id: string; score: number; text: string }>> {
  const queryEmbedding = await embedModel.embed(query);

  // Embed candidates concurrently; `await` requires an async map callback
  const scored = await Promise.all(
    candidates.map(async candidate => ({
      id: candidate.id,
      score: cosineSimilarity(queryEmbedding, await embedModel.embed(candidate.text)),
      text: candidate.text,
    }))
  );

  return scored.sort((a, b) => b.score - a.score);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dotProduct / (normA * normB);
}

// Problem: This misses:
// - Exact answer specificity (for "John's age?", chunks about "people" can outrank "John is 25")
// - Answer position (an answer buried mid-chunk scores the same as one up front)
// - Chunk length effects (embedding averaging dilutes specific facts in long chunks)
// - Out-of-domain embeddings (the model was never trained on your corpus)
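As a quick sanity check, the cosine helper itself behaves as expected on toy vectors (the function is repeated here so the snippet runs standalone):

```typescript
// Cosine similarity: 1 for same direction, 0 for orthogonal, -1 for opposite
function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dotProduct / (normA * normB);
}

console.log(cosineSimilarity([1, 0], [2, 0]));  // → 1 (scale-invariant)
console.log(cosineSimilarity([1, 0], [0, 1]));  // → 0
console.log(cosineSimilarity([1, 0], [-1, 0])); // → -1
```

Note that cosine is already length-normalized; the length effects above come from how embedding models average long texts, not from the similarity function itself.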

Cross-Encoder Rerankers

Cross-encoders take the query and document together as a single input and output a relevance score for the pair:

interface CrossEncoderReranker {
  rank(query: string, candidates: string[]): Promise<number[]>;
}

class LocalCrossEncoderReranker implements CrossEncoderReranker {
  // Sketch only: the calls below follow the Python transformers API; in
  // TypeScript you would use a binding such as Transformers.js instead
  private model: any;
  private tokenizer: any;

  constructor(modelName: string = 'cross-encoder/ms-marco-MiniLM-L-12-v2') {
    // In real code, load from Hugging Face, e.g. (Python):
    // model = AutoModelForSequenceClassification.from_pretrained(model_name)
    // tokenizer = AutoTokenizer.from_pretrained(model_name)
  }

  async rank(query: string, candidates: string[]): Promise<number[]> {
    // Encode [query, document] pairs in a single batch
    const pairs = candidates.map(doc => [query, doc]);

    // Tokenize and run the cross-encoder; one relevance logit per pair
    const encodings = this.tokenizer(pairs, {
      padding: true,
      truncation: true,
      returnTensors: 'pt',
    });

    const outputs = this.model(encodings);
    return outputs.logits.flatten().tolist();
  }
}

async function rerankerRanking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  reranker: CrossEncoderReranker
): Promise<Array<{ id: string; score: number; text: string }>> {
  const scores = await reranker.rank(query, candidates.map(c => c.text));

  return candidates
    .map((candidate, idx) => ({
      ...candidate,
      score: scores[idx],
    }))
    .sort((a, b) => b.score - a.score);
}

// Cross-encoder advantages:
// - Task-specific ranking (trained on query–document relevance pairs)
// - Asymmetric scoring (query and document attend to each other jointly)
// - Typically large NDCG@10 gains over bi-encoder vector search
// - Disadvantage: slow (one forward pass per query–document pair)

Cohere Rerank API

Use managed cross-encoder API (simplest):

import { CohereClient } from 'cohere-ai';

async function cohereRanking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: candidates.map(c => c.text),
    topN: topK,
    returnDocuments: false,
  });

  // Results come back sorted by relevance; map indices back to candidates
  return response.results.map(result => ({
    ...candidates[result.index],
    score: result.relevanceScore,
  }));
}

// Usage in a RAG pipeline (VectorStore and LLMClient are assumed interfaces from your stack)
async function ragWithCohere(
  query: string,
  vectorDB: VectorStore,
  llm: LLMClient
): Promise<string> {
  // Step 1: Initial retrieval (fast, many candidates)
  const candidates = await vectorDB.search(query, { topK: 20 });

  // Step 2: Rerank with Cohere (accurate, few documents)
  const reranked = await cohereRanking(query, candidates, 5);

  // Step 3: Generate with top results
  const context = reranked.map(r => r.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });

  return answer.text;
}

Local Reranking with FlashRank

Lightweight local reranking (no API calls):

interface FlashRankReranker {
  rank(query: string, passages: string[]): Promise<Array<{ score: number; index: number }>>;
}

async function flashrankReranking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  // Mock FlashRank implementation, for illustration only.
  // FlashRank itself is a Python library: pip install flashrank
  //   from flashrank import Ranker, RerankRequest
  //   ranker = Ranker()  # loads a small ONNX cross-encoder by default

  interface RankResult {
    score: number;
    index: number;
  }

  const flashrank = {
    rank: async (q: string, passages: string[]): Promise<RankResult[]> => {
      // Placeholder: in production, call actual flashrank
      return passages.map((_, idx) => ({
        score: Math.random(), // Replace with real scores
        index: idx,
      }));
    },
  };

  const results = await flashrank.rank(query, candidates.map(c => c.text));

  return results
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(result => ({
      ...candidates[result.index],
      score: result.score,
    }));
}

// Benchmarking local vs API reranking
async function benchmarkReranking(
  query: string,
  candidates: Array<{ id: string; text: string }>,
  topK: number = 5
): Promise<{
  cohere: { time: number; results: any[] };
  local: { time: number; results: any[] };
}> {
  // Cohere API call
  const cohereStart = Date.now();
  const cohereResults = await cohereRanking(query, candidates, topK);
  const cohereTime = Date.now() - cohereStart;

  // Local reranking
  const localStart = Date.now();
  const localResults = await flashrankReranking(query, candidates, topK);
  const localTime = Date.now() - localStart;

  return {
    cohere: { time: cohereTime, results: cohereResults },
    local: { time: localTime, results: localResults },
  };
}

Lost-in-the-Middle Problem

LLMs often ignore answers buried in the middle of long context windows (the lost-in-the-middle effect). Reranking can interact badly with this when relevant middle-position chunks are systematically downranked:

// Problem: middle chunks downranked despite containing answer
interface MiddleChunk {
  position: 'beginning' | 'middle' | 'end';
  chunkIndex: number;
  originalRank: number;
  rerankerScore: number;
}

async function analyzePositionBias(
  rerankedResults: Array<{ id: string; text: string; score: number }>,
  originalOrder: Array<{ id: string }>
): Promise<{
  positionBias: Record<string, number>;
  middleDropRate: number;
}> {
  const totalChunks = originalOrder.length;
  const positionBias: Record<string, number> = {
    beginning: 0,
    middle: 0,
    end: 0,
  };

  // Count reranked positions
  for (const result of rerankedResults) {
    const originalIdx = originalOrder.findIndex(r => r.id === result.id);

    if (originalIdx < totalChunks * 0.2) {
      positionBias.beginning++;
    } else if (originalIdx < totalChunks * 0.8) {
      positionBias.middle++;
    } else {
      positionBias.end++;
    }
  }

  // Share of surviving chunks that did NOT come from the middle of the original order
  const middleDropRate = 1 - positionBias.middle / rerankedResults.length;

  return {
    positionBias,
    middleDropRate,
  };
}
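A minimal worked example of the middle-drop idea (a simplified standalone helper, not the full function above): with 10 original chunks, positions 2–7 count as the middle; if only one of five reranked survivors came from there, the drop rate is 1 − 1/5 = 0.8.

```typescript
// Simplified position-bias check: given the original indices of the chunks
// that survived reranking, compute the share NOT drawn from the middle region
function middleDropRate(survivorOriginalIdxs: number[], totalChunks: number): number {
  const middleCount = survivorOriginalIdxs.filter(
    idx => idx >= totalChunks * 0.2 && idx < totalChunks * 0.8
  ).length;
  return 1 - middleCount / survivorOriginalIdxs.length;
}

// 10 original chunks; survivors came from original positions 0, 1, 3, 8, 9
console.log(middleDropRate([0, 1, 3, 8, 9], 10)); // → 0.8 (only index 3 is "middle")
```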

// Solution: position-aware reranking
async function positionAwareReranking(
  query: string,
  candidates: Array<{ id: string; text: string; originalRank: number }>,
  reranker: CrossEncoderReranker,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  const scores = await reranker.rank(query, candidates.map(c => c.text));

  const adjusted = candidates.map((candidate, idx) => {
    let score = scores[idx];

    // Distance of this chunk from the middle of the original order
    // (0 = dead center, ~0.5 = first or last chunk)
    const edgeDistance = Math.abs(idx / candidates.length - 0.5);

    // Slight boost for peripheral chunks (beginning/end) to counteract
    // the tendency to drop middle-position chunks
    if (edgeDistance > 0.3) {
      score *= 1.1; // 10% boost
    }

    return {
      ...candidate,
      score,
    };
  });

  return adjusted
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

Reranking with Metadata Signals

Enhance reranking with document metadata:

interface ChunkWithMetadata {
  id: string;
  text: string;
  metadata: {
    source?: string;
    authority?: number; // 0-1, higher = more trustworthy
    recency?: number; // epoch timestamp (ms) of last update
    relevantKeywords?: string[];
  };
}

async function metadataAwareReranking(
  query: string,
  candidates: ChunkWithMetadata[],
  reranker: CrossEncoderReranker,
  topK: number = 5
): Promise<Array<{ id: string; score: number; text: string }>> {
  const rerankerScores = await reranker.rank(
    query,
    candidates.map(c => c.text)
  );

  const now = Date.now();

  const adjusted = candidates.map((candidate, idx) => {
    let score = rerankerScores[idx];

    // Boost by authority
    if (candidate.metadata.authority) {
      score *= (1 + candidate.metadata.authority * 0.3); // Up to 30% boost
    }

    // Boost for recency (decay over 30 days)
    if (candidate.metadata.recency) {
      const daysSinceUpdate = (now - candidate.metadata.recency) / (1000 * 60 * 60 * 24);
      const recencyBoost = Math.max(0, 1 - daysSinceUpdate / 30);
      score *= (1 + recencyBoost * 0.2);
    }

    // Boost for keyword match
    if (candidate.metadata.relevantKeywords) {
      const queryWords = query.toLowerCase().split(/\s+/);
      const keywordMatches = candidate.metadata.relevantKeywords.filter(kw =>
        queryWords.some(qw => qw.includes(kw) || kw.includes(qw))
      ).length;

      score *= (1 + keywordMatches * 0.15);
    }

    return {
      ...candidate,
      score,
    };
  });

  return adjusted
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
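The keyword boost above relies on loose two-way substring matching between query words and metadata keywords. A standalone check of that heuristic (the helper name `countKeywordMatches` is illustrative, extracted from the function above):

```typescript
// Count metadata keywords that loosely match any query word
// (either string containing the other), as in metadataAwareReranking
function countKeywordMatches(query: string, keywords: string[]): number {
  const queryWords = query.toLowerCase().split(/\s+/);
  return keywords.filter(kw =>
    queryWords.some(qw => qw.includes(kw) || kw.includes(qw))
  ).length;
}

console.log(countKeywordMatches('How do rerankers work?', ['rerank', 'ndcg']));
// → 1: "rerankers" contains "rerank"; "ndcg" matches nothing
```

Two-way containment is deliberately permissive; very short keywords can over-match, so consider a minimum keyword length or stemming in production.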

Compression After Reranking

Reduce context size while preserving answer relevance. Dedicated tools like LLMLingua use a small compression model for this; the sketch below approximates the idea with a plain LLM prompt:

// Assumes a module-level `llm` client (the same LLMClient used elsewhere in this post)
async function llmCompression(
  context: string,
  query: string,
  compressionRatio: number = 0.5 // keep ~50% of tokens
): Promise<string> {
  // Rough token estimate via whitespace splitting
  const tokenCount = context.split(/\s+/).length;
  const targetTokens = Math.floor(tokenCount * compressionRatio);

  const compressionPrompt = `
Your task is to compress the following context while preserving information
needed to answer the query.

Query: "${query}"

Original context:
${context}

Compressed context (keep only most relevant ~${targetTokens} tokens):`;

  const compressed = await llm.generate({
    messages: [{ role: 'user', content: compressionPrompt }],
    maxTokens: Math.floor(targetTokens * 1.2),
  });

  return compressed.text;
}

// Production pipeline with reranking + compression
async function optimizedRAG(
  query: string,
  vectorDB: VectorStore,
  reranker: CrossEncoderReranker,
  llm: LLMClient
): Promise<string> {
  // Step 1: Fast retrieval (many candidates)
  const candidates = await vectorDB.search(query, { topK: 30 });

  // Step 2: Accurate reranking (few top results)
  const reranked = await rerankerRanking(query, candidates, reranker);

  // Step 3: Compress context
  const contextToCompress = reranked.slice(0, 5).map(r => r.text).join('\n\n');
  const compressed = await llmCompression(contextToCompress, query, 0.6);

  // Step 4: Generate with compressed context
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${compressed}\n\nQuestion: ${query}`,
      },
    ],
  });

  return answer.text;
}

Reranker Evaluation

Measure reranking effectiveness:

interface RerankerMetrics {
  ndcgImprovement: number; // % improvement over vector search
  latency: number; // ms per query
  costPerQuery: number; // $ (for API rerankers)
  relativeRank: number; // Average position improvement
}

async function evaluateReranker(
  testQueries: Array<{ query: string; candidates: Array<{ id: string; text: string }>; relevant: Set<string> }>,
  vectorRetriever: (q: string, docs: string[]) => Promise<number[]>,
  reranker: CrossEncoderReranker
): Promise<RerankerMetrics> {
  const improvements: number[] = [];
  const latencies: number[] = [];
  let totalCost = 0;

  for (const test of testQueries) {
    // Vector ranking
    const vectorScores = await vectorRetriever(test.query, test.candidates.map(c => c.text));
    const vectorRanked = test.candidates
      .map((c, i) => ({ id: c.id, score: vectorScores[i] }))
      .sort((a, b) => b.score - a.score);

    // Reranker ranking
    const start = Date.now();
    const rerankerScores = await reranker.rank(test.query, test.candidates.map(c => c.text));
    latencies.push(Date.now() - start);

    const rerankerRanked = test.candidates
      .map((c, i) => ({ id: c.id, score: rerankerScores[i] }))
      .sort((a, b) => b.score - a.score);

    // Calculate NDCG improvement
    const vectorNDCG = computeNDCG(vectorRanked.map(r => r.id), test.relevant, 5);
    const rerankerNDCG = computeNDCG(rerankerRanked.map(r => r.id), test.relevant, 5);
    improvements.push((rerankerNDCG - vectorNDCG) / vectorNDCG);

    // Illustrative cost only: API rerankers typically bill per search
    // request (query + batch of documents), not per token; check current pricing
    totalCost += 0.002; // assumed ~$2 per 1k rerank requests
  }

  return {
    ndcgImprovement: improvements.reduce((a, b) => a + b, 0) / improvements.length,
    latency: latencies.reduce((a, b) => a + b, 0) / latencies.length,
    costPerQuery: totalCost / testQueries.length,
    relativeRank: 0, // Average position improvement (left uncomputed in this sketch)
  };
}

function computeNDCG(rankedIds: string[], relevant: Set<string>, k: number = 5): number {
  const dcg = rankedIds
    .slice(0, k)
    .reduce((sum, id, i) => {
      const rel = relevant.has(id) ? 1 : 0;
      return sum + rel / Math.log2(i + 2);
    }, 0);

  const idcg = Array(Math.min(k, relevant.size))
    .fill(1)
    .reduce((sum, _, i) => sum + 1 / Math.log2(i + 2), 0);

  return idcg === 0 ? 0 : dcg / idcg;
}
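To make the metric concrete, a worked example (the helper is repeated so the snippet runs standalone): with relevant docs {a, c}, ranking them 1st and 3rd gives DCG = 1/log2(2) + 1/log2(4) = 1.5 against an ideal 1/log2(2) + 1/log2(3) ≈ 1.631, so NDCG@5 ≈ 0.92.

```typescript
// Binary-relevance NDCG, as defined above
function computeNDCG(rankedIds: string[], relevant: Set<string>, k: number = 5): number {
  const dcg = rankedIds
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 : 0) / Math.log2(i + 2), 0);
  const idcg = Array(Math.min(k, relevant.size))
    .fill(1)
    .reduce((sum, _, i) => sum + 1 / Math.log2(i + 2), 0);
  return idcg === 0 ? 0 : dcg / idcg;
}

const relevant = new Set(['a', 'c']);
console.log(computeNDCG(['a', 'b', 'c', 'd', 'e'], relevant)); // relevant at ranks 1 and 3 → ≈0.92
console.log(computeNDCG(['a', 'c', 'b', 'd', 'e'], relevant)); // perfect ranking → 1
```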

Checklist

  • Measure baseline vector search NDCG@5
  • Implement cross-encoder reranking (Cohere or local)
  • Track reranking latency (target <100ms for top-5)
  • Evaluate position bias in your reranked results
  • Add metadata signals (authority, recency, keywords)
  • Test compression with reranked context
  • Cost-benefit analysis (improved quality vs API costs)
  • Set up A/B test vector-only vs reranked pipeline
  • Monitor top-k position distribution (avoid middle collapse)
  • Build golden relevance dataset for evaluation

Conclusion

Reranking is often the single highest-impact optimization in a RAG pipeline. A 10-20% improvement in retrieval quality compounds downstream: tighter context, fewer hallucinations, better answers. Start with an API-based cross-encoder (simplest), measure the NDCG@5 improvement, then migrate to a local model if latency or cost becomes the bottleneck. The key question: does reranking place the relevant chunks in the top 3? Everything else is optimization.