RAG Architecture Deep Dive — From Naive Retrieval to Production-Grade Pipelines

Introduction

Retrieval-Augmented Generation has become the standard approach for building knowledge-aware AI systems. However, many teams start with a basic "retrieve-then-generate" pipeline that fails at scale: duplicate context, irrelevant chunks, and hallucinations plague early implementations.

This post explores production-grade RAG architectures that solve these fundamental problems.

Naive RAG and Its Pitfalls

The simplest RAG flow looks deceptively clean:

// Naive RAG: embed query → find top-k → stuff context → generate
async function naiveRAG(query: string): Promise<string> {
  // 1. Embed user query
  const queryEmbedding = await embedModel.embed(query);

  // 2. Search vector database
  const topChunks = await vectorDB.search(queryEmbedding, {
    topK: 5,
  });

  // 3. Stuff all chunks into context window
  const context = topChunks
    .map(chunk => chunk.text)
    .join('\n\n');

  // 4. Generate answer
  const answer = await llm.generate({
    system: 'You are a helpful assistant.',
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });

  return answer.text;
}

This approach has three critical weaknesses:

  1. Retrieval Quality: Vector similarity ≠ relevance. A chunk can share the query's surface wording yet miss the semantic relationship the question actually asks about
  2. Context Stuffing: All top-k chunks get stuffed into the prompt, wasting tokens and creating "lost-in-the-middle" effects
  3. No Adaptation: The pipeline doesn't adjust based on query type or content difficulty
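
The second weakness has a cheap mitigation: reorder retrieved chunks so the strongest sit at the edges of the context, where models attend most reliably, and the weakest sit in the middle. A minimal sketch (the RankedChunk shape is illustrative; any best-first ranked list works):

```typescript
interface RankedChunk {
  id: string;
  text: string;
}

// Counter the lost-in-the-middle effect: alternate chunks to the front
// and back of the context, so the best-ranked land at the edges and the
// weakest end up in the middle.
function orderForContext(chunks: RankedChunk[]): RankedChunk[] {
  const front: RankedChunk[] = [];
  const back: RankedChunk[] = [];
  chunks.forEach((chunk, i) => {
    if (i % 2 === 0) front.push(chunk); // ranks 0, 2, 4, … → front
    else back.unshift(chunk);           // ranks 1, 3, 5, … → back, reversed
  });
  return [...front, ...back];
}
```

For five chunks ranked a through e, this yields a, c, e, d, b: the top two results frame the context while the fifth sits in the middle.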

Advanced Retrieval Strategies

Query Rewriting

Rewrite user queries to optimize for vector similarity before retrieval:

async function queryRewritingRAG(originalQuery: string): Promise<string> {
  // Step 1: Rewrite query for better retrieval
  const rewritePrompt = `
Given the user query, rewrite it to be more specific and optimized for
vector similarity search. Focus on key entities and concepts.

Original query: "${originalQuery}"

Rewritten query:`;

  const rewrittenQuery = await llm.generate({
    messages: [{ role: 'user', content: rewritePrompt }],
    maxTokens: 100,
  });

  // Step 2: Search with rewritten query
  const rewriteEmbedding = await embedModel.embed(
    rewrittenQuery.text.trim()
  );
  const chunks = await vectorDB.search(rewriteEmbedding, { topK: 10 });

  // Step 3: Generate answer
  const context = chunks.map(c => c.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nOriginal question: ${originalQuery}`,
      },
    ],
  });

  return answer.text;
}

HyDE (Hypothetical Document Embeddings)

Generate hypothetical documents that answer the query, then search for similar real documents:

async function hydeRAG(query: string): Promise<string> {
  // Step 1: Generate hypothetical document answering the query
  const hydePrompt = `
Please write a detailed document that would answer the following question:
"${query}"

The document should be comprehensive and well-structured.`;

  const hypotheticalDoc = await llm.generate({
    messages: [{ role: 'user', content: hydePrompt }],
    maxTokens: 300,
  });

  // Step 2: Embed both query and hypothetical document
  const [queryEmbed, hydeEmbed] = await Promise.all([
    embedModel.embed(query),
    embedModel.embed(hypotheticalDoc.text),
  ]);

  // Step 3: Search with both embeddings, then fuse the ranked results
  const queryChunks = await vectorDB.search(queryEmbed, { topK: 5 });
  const hydeChunks = await vectorDB.search(hydeEmbed, { topK: 5 });

  // Score by reciprocal rank, deduplicating chunks found by both searches
  const combined = new Map<string, number>();
  queryChunks.forEach((c, i) => combined.set(c.id, (combined.get(c.id) || 0) + 1 / (i + 1)));
  hydeChunks.forEach((c, i) => combined.set(c.id, (combined.get(c.id) || 0) + 1 / (i + 1)));

  const topChunks = Array.from(combined.entries())
    .sort(([, a], [, b]) => b - a)
    .slice(0, 8)
    .map(([id]) => queryChunks.find(c => c.id === id) || hydeChunks.find(c => c.id === id)!);

  // Step 4: Generate final answer
  const context = topChunks.map(c => c.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });

  return answer.text;
}
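
The inline scoring above is a simplified reciprocal-rank fusion. The standard formulation (not shown in the pipeline) adds a smoothing constant k, commonly 60, which damps the advantage of top-ranked items; a standalone sketch:

```typescript
// Reciprocal-rank fusion: score(d) = sum over rankings of 1 / (k + rank),
// with rank 1-based and k (commonly 60) smoothing the top-rank advantage.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id]) => id);
}
```

Documents that rank highly in several lists float to the top without any single ranking dominating.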

Modular and Adaptive RAG

Self-RAG: Let the Model Decide When to Retrieve

Self-RAG makes the model explicitly decide whether retrieval is needed:

async function selfRAG(query: string): Promise<string> {
  interface RAGDecision {
    needsRetrieval: boolean;
    reason: string;
    retrievalQuery?: string;
  }

  // Step 1: Decide whether retrieval is needed
  const decisionPrompt = `
Analyze the question: "${query}"

Respond in JSON with:
- needsRetrieval: boolean (true if this question requires external knowledge)
- reason: string (explanation)
- retrievalQuery: string (if needsRetrieval=true, what to search for)`;

  const decisionResponse = await llm.generate({
    messages: [{ role: 'user', content: decisionPrompt }],
    maxTokens: 200,
  });

  // Assumes the model returns bare JSON; parse defensively in production
  const decision: RAGDecision = JSON.parse(decisionResponse.text);

  let context = '';
  if (decision.needsRetrieval && decision.retrievalQuery) {
    const embedding = await embedModel.embed(decision.retrievalQuery);
    const chunks = await vectorDB.search(embedding, { topK: 5 });
    context = chunks.map(c => c.text).join('\n\n');
  }

  // Step 2: Generate answer (with optional context)
  const generatePrompt = decision.needsRetrieval
    ? `Context:\n${context}\n\nQuestion: ${query}`
    : `Question: ${query}`;

  const answer = await llm.generate({
    messages: [{ role: 'user', content: generatePrompt }],
  });

  // Step 3: Validate retrieved context matches the answer (optional)
  if (decision.needsRetrieval) {
    const validationPrompt = `
Context:
${context}

Answer: "${answer.text}"

Is the answer grounded in the provided context? Respond YES or NO.`;

    const validation = await llm.generate({
      messages: [{ role: 'user', content: validationPrompt }],
      maxTokens: 5,
    });

    if (!validation.text.includes('YES')) {
      console.warn('Answer not grounded in context');
    }
  }

  return answer.text;
}
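
Both this pipeline and the corrective one below call JSON.parse on raw model output, which in practice often arrives wrapped in code fences or prose. A forgiving extraction helper (an illustrative sketch, not a library API) makes the pattern sturdier:

```typescript
// Pull the first JSON object out of model output that may contain
// code fences or surrounding prose; return null if nothing parses.
function parseModelJSON<T>(raw: string): T | null {
  const stripped = raw.replace(/`{3}(?:json)?/g, '').trim();
  const start = stripped.indexOf('{');
  const end = stripped.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(stripped.slice(start, end + 1)) as T;
  } catch {
    return null;
  }
}
```

Returning null instead of throwing lets the caller fall back to a safe default, such as treating the query as needing retrieval.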

CRAG: Corrective RAG with Feedback Loop

Corrective RAG evaluates retrieval quality and adjusts the strategy:

async function correctiveRAG(query: string): Promise<string> {
  interface RetrievalEval {
    relevance: 'relevant' | 'partially_relevant' | 'irrelevant';
    confidence: number;
    action: 'proceed' | 'expand' | 'rewrite';
  }

  // Step 1: Initial retrieval
  const queryEmbedding = await embedModel.embed(query);
  let chunks = await vectorDB.search(queryEmbedding, { topK: 5 });

  // Step 2: Evaluate retrieval quality
  const evalPrompt = `
Query: "${query}"
Retrieved documents: ${chunks.map(c => c.text).join('\n---\n')}

Assess the relevance of these documents. Respond in JSON:
{
  "relevance": "relevant" | "partially_relevant" | "irrelevant",
  "confidence": 0-100,
  "action": "proceed" | "expand" | "rewrite"
}`;

  const evalResponse = await llm.generate({
    messages: [{ role: 'user', content: evalPrompt }],
    maxTokens: 150,
  });

  const evaluation: RetrievalEval = JSON.parse(evalResponse.text);

  // Step 3: Take corrective action
  if (evaluation.action === 'rewrite') {
    // Rewrite query and retrieve again
    const rewritePrompt = `Rewrite for better search: "${query}"`;
    const rewritten = await llm.generate({
      messages: [{ role: 'user', content: rewritePrompt }],
      maxTokens: 100,
    });

    const newEmbedding = await embedModel.embed(rewritten.text.trim());
    chunks = await vectorDB.search(newEmbedding, { topK: 8 });
  } else if (evaluation.action === 'expand') {
    // Expand search
    chunks = await vectorDB.search(queryEmbedding, { topK: 15 });
  }

  // Step 4: Generate answer
  const context = chunks.map(c => c.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });

  return answer.text;
}

Decision: RAG vs Long Context vs Fine-Tuning

When should you use RAG? Here's a production decision matrix:

type ArchitectureStrategy = 'rag' | 'long_context' | 'fine_tuning' | 'hybrid';

interface DecisionFactors {
  dataSize: 'small' | 'medium' | 'large'; // GB of knowledge
  updateFrequency: 'static' | 'weekly' | 'daily' | 'realtime'; // How often knowledge changes
  latencyRequirement: number; // ms
  costPerQuery: number; // $ budget
  domainSpecialization: 'general' | 'niche'; // Domain focus
}

function recommendArchitecture(factors: DecisionFactors): ArchitectureStrategy {
  // Small, static data → fine-tuning (best quality, lowest latency)
  if (factors.dataSize === 'small' && factors.updateFrequency === 'static') {
    return 'fine_tuning';
  }

  // Large, frequently updated → RAG (flexible, scalable)
  if (factors.dataSize === 'large' &&
      (factors.updateFrequency === 'daily' || factors.updateFrequency === 'realtime')) {
    return 'rag';
  }

  // Medium data, niche domain → hybrid (fine-tune base, RAG for specifics)
  if (factors.dataSize === 'medium' && factors.domainSpecialization === 'niche') {
    return 'hybrid';
  }

  // Default to RAG for flexibility
  return 'rag';
}

Checklist

  • Implement query rewriting or HyDE for improved retrieval
  • Add retrieval quality evaluation before generation
  • Implement modular pipeline allowing swappable retrievers
  • Track retrieval quality metrics (hit rate, MRR)
  • Set up reranking step after initial retrieval
  • Implement fallback strategies for low-confidence results
  • Monitor token usage per query
  • Add context attribution and source tracking
  • Test multi-hop query handling
  • Establish decision criteria for your architecture choice
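
For the metrics item in the checklist, hit rate and MRR take only a few lines. This sketch assumes each query has a single labelled relevant chunk; multi-label evaluation needs only a small extension:

```typescript
interface EvalRecord {
  retrievedIds: string[]; // ids the retriever returned, in ranked order
  relevantId: string;     // the id a human labelled as the right chunk
}

// Hit rate: fraction of queries where the relevant chunk was retrieved at
// all. MRR: mean of 1/rank of the relevant chunk (0 when it was missed).
function retrievalMetrics(records: EvalRecord[]): { hitRate: number; mrr: number } {
  let hits = 0;
  let reciprocalSum = 0;
  for (const { retrievedIds, relevantId } of records) {
    const rank = retrievedIds.indexOf(relevantId); // 0-based, -1 if missed
    if (rank >= 0) {
      hits += 1;
      reciprocalSum += 1 / (rank + 1);
    }
  }
  return { hitRate: hits / records.length, mrr: reciprocalSum / records.length };
}
```

Run these over a held-out evaluation set after every retriever or chunking change; a drop in MRR with a flat hit rate usually means ranking, not recall, regressed.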

Conclusion

Production RAG systems require moving beyond naive retrieve-and-generate patterns. By implementing advanced retrieval strategies, adaptive routing, and corrective feedback loops, you build systems that reliably scale to complex question-answering tasks. The key insight: treat retrieval as a learned, iterative process, not a static similarity search.