RAG Architecture Deep Dive — From Naive Retrieval to Production-Grade Pipelines
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Retrieval-Augmented Generation (RAG) has become a standard approach for building knowledge-aware AI systems. However, many teams start with a basic "retrieve-then-generate" pipeline that breaks down at scale: duplicate context, irrelevant chunks, and hallucinations plague early implementations.
This post explores production-grade RAG architectures that solve these fundamental problems.
- Naive RAG and Its Pitfalls
- Advanced Retrieval Strategies
- Query Rewriting for Better Search
- HyDE (Hypothetical Document Embeddings)
- Modular and Adaptive RAG
- Self-RAG: Let the Model Decide When to Retrieve
- CRAG: Corrective RAG with Feedback Loop
- Decision: RAG vs Long Context vs Fine-Tuning
- Checklist
- Conclusion
Naive RAG and Its Pitfalls
The simplest RAG flow looks deceptively clean:
// Naive RAG: embed query → find top-k → stuff context → generate
async function naiveRAG(query: string): Promise<string> {
  // 1. Embed user query
  const queryEmbedding = await embedModel.embed(query);
  // 2. Search vector database
  const topChunks = await vectorDB.search(queryEmbedding, {
    topK: 5,
  });
  // 3. Stuff all chunks into context window
  const context = topChunks
    .map(chunk => chunk.text)
    .join('\n\n');
  // 4. Generate answer
  const answer = await llm.generate({
    system: 'You are a helpful assistant.',
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });
  return answer.text;
}
This approach has three critical weaknesses:
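- Retrieval Quality: Vector similarity ≠ relevance. A chunk can sit close to the query in embedding space, for example because it shares the query's vocabulary, without actually containing the answer
- Context Stuffing: Every top-k chunk goes into the prompt whether or not it helps, which wastes tokens and invites "lost-in-the-middle" effects, where the model overlooks material buried in the middle of a long prompt
- No Adaptation: The pipeline runs the same steps for every query, regardless of query type or difficulty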
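The context-stuffing weakness can be softened even before the retriever changes. The sketch below is an illustration only, not part of the pipeline above: it drops duplicate chunks and stops adding chunks once a rough token budget is spent. The `Chunk` shape, the `buildContext` and `estimateTokens` names, and the four-characters-per-token estimate are all assumptions; a real system would count tokens with the model's own tokenizer.

```typescript
// Hypothetical chunk shape; the snippets in this post only rely on `id` and `text`.
interface Chunk {
  id: string;
  text: string;
}

// Rough token estimate (assumption: about 4 characters per token for English text).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Normalize case and whitespace so trivially different copies compare equal.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Walk the chunks in ranked order, skipping duplicates and chunks that do not fit.
function buildContext(chunks: Chunk[], maxTokens: number): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  let used = 0;

  for (const chunk of chunks) {
    const key = normalize(chunk.text);
    if (seen.has(key)) continue; // same content already included
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxTokens) continue; // too large; a later, smaller chunk may still fit
    seen.add(key);
    kept.push(chunk.text);
    used += cost;
  }
  return kept.join('\n\n');
}
```

In `naiveRAG`, a call such as `buildContext(topChunks, 2000)` could stand in for the `map(...).join(...)` step; the budget value is a placeholder to tune per model.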
Advanced Retrieval Strategies
Query Rewriting for Better Search
Rewrite user queries to optimize for vector similarity before retrieval:
async function queryRewritingRAG(originalQuery: string): Promise<string> {
  // Step 1: Rewrite query for better retrieval
  const rewritePrompt = `
Given the user query, rewrite it to be more specific and optimized for
vector similarity search. Focus on key entities and concepts.
Original query: "${originalQuery}"
Rewritten query:`;
  const rewrittenQuery = await llm.generate({
    messages: [{ role: 'user', content: rewritePrompt }],
    maxTokens: 100,
  });
  // Step 2: Search with rewritten query
  const rewriteEmbedding = await embedModel.embed(
    rewrittenQuery.text.trim()
  );
  const chunks = await vectorDB.search(rewriteEmbedding, { topK: 10 });
  // Step 3: Generate answer (the model still sees the user's original wording)
  const context = chunks.map(c => c.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nOriginal question: ${originalQuery}`,
      },
    ],
  });
  return answer.text;
}
HyDE (Hypothetical Document Embeddings)
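Generate a hypothetical document that answers the query, then search for real documents similar to it. An answer-shaped passage usually lands closer in embedding space to real answer passages than a short question does. The variant below also searches with the original query and fuses the two rankings: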
async function hydeRAG(query: string): Promise<string> {
  // Step 1: Generate hypothetical document answering the query
  const hydePrompt = `
Please write a detailed document that would answer the following question:
"${query}"
The document should be comprehensive and well-structured.`;
  const hypotheticalDoc = await llm.generate({
    messages: [{ role: 'user', content: hydePrompt }],
    maxTokens: 300,
  });
  // Step 2: Embed both query and hypothetical document
  const [queryEmbed, hydeEmbed] = await Promise.all([
    embedModel.embed(query),
    embedModel.embed(hypotheticalDoc.text),
  ]);
  // Step 3: Search with each embedding separately, in parallel
  const [queryChunks, hydeChunks] = await Promise.all([
    vectorDB.search(queryEmbed, { topK: 5 }),
    vectorDB.search(hydeEmbed, { topK: 5 }),
  ]);
  // Deduplicate by id and fuse the two rankings by summed reciprocal rank
  const combined = new Map<string, number>();
  queryChunks.forEach((c, i) => combined.set(c.id, (combined.get(c.id) || 0) + (1 / (i + 1))));
  hydeChunks.forEach((c, i) => combined.set(c.id, (combined.get(c.id) || 0) + (1 / (i + 1))));
  const topChunks = Array.from(combined.entries())
    .sort(([, a], [, b]) => b - a)
    .slice(0, 8)
    .map(([id]) => queryChunks.find(c => c.id === id) || hydeChunks.find(c => c.id === id)!);
  // Step 4: Generate final answer
  const context = topChunks.map(c => c.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });
  return answer.text;
}
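The fusion step in `hydeRAG` sums `1 / rank` for each chunk across the two result lists, which is a form of reciprocal rank fusion. As a sketch under assumptions, the same idea can be pulled into a reusable helper that accepts any number of ranked lists. The `reciprocalRankFusion` name and the `Ranked` shape are hypothetical. The smoothing constant `k` is an addition: setting `k = 0` reproduces the scoring used above, while larger values (60 is a common choice) flatten the gap between adjacent ranks.

```typescript
// Minimal shape needed for fusion; anything with a stable `id` works.
interface Ranked {
  id: string;
}

// Fuse several ranked lists into one list, best first.
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
function reciprocalRankFusion<T extends Ranked>(
  lists: T[][],
  k = 60,
  limit = 8
): T[] {
  const scores = new Map<string, number>();
  const byId = new Map<string, T>();

  for (const list of lists) {
    list.forEach((item, index) => {
      const rank = index + 1;
      scores.set(item.id, (scores.get(item.id) ?? 0) + 1 / (k + rank));
      // Remember the first object seen for each id so it can be returned later.
      if (!byId.has(item.id)) byId.set(item.id, item);
    });
  }

  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a)
    .slice(0, limit)
    .map(([id]) => byId.get(id) as T);
}
```

Keeping a `byId` map avoids the repeated `find` calls over both lists, and the helper extends naturally if a keyword search or a second embedding model is added as a third list.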
Modular and Adaptive RAG
Self-RAG: Let the Model Decide When to Retrieve
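Self-RAG lets the model decide whether retrieval is needed at all. The original Self-RAG work trains the model to emit special reflection tokens for this; the sketch below approximates the idea with plain prompting: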
async function selfRAG(query: string): Promise<string> {
  interface RAGDecision {
    needsRetrieval: boolean;
    reason: string;
    retrievalQuery?: string;
  }
  // Step 1: Decide whether retrieval is needed
  const decisionPrompt = `
Analyze the question: "${query}"
Respond in JSON with:
- needsRetrieval: boolean (true if this question requires external knowledge)
- reason: string (explanation)
- retrievalQuery: string (if needsRetrieval=true, what to search for)`;
  const decisionResponse = await llm.generate({
    messages: [{ role: 'user', content: decisionPrompt }],
    maxTokens: 200,
  });
  // Model output is not guaranteed to be valid JSON; fall back to retrieving
  let decision: RAGDecision;
  try {
    decision = JSON.parse(decisionResponse.text);
  } catch {
    decision = { needsRetrieval: true, reason: 'unparseable decision' };
  }
  let context = '';
  if (decision.needsRetrieval) {
    // Use the original query if the model did not supply a search query
    const embedding = await embedModel.embed(decision.retrievalQuery || query);
    const chunks = await vectorDB.search(embedding, { topK: 5 });
    context = chunks.map(c => c.text).join('\n\n');
  }
  // Step 2: Generate answer (with optional context)
  const generatePrompt = context
    ? `Context:\n${context}\n\nQuestion: ${query}`
    : `Question: ${query}`;
  const answer = await llm.generate({
    messages: [{ role: 'user', content: generatePrompt }],
  });
  // Step 3: Check that the answer is grounded in the retrieved context (optional)
  if (context) {
    // The validator must see the context itself, not just a mention of it
    const validationPrompt = `
Context:
${context}

Answer: "${answer.text}"
Is the answer grounded in the provided context? Respond YES or NO.`;
    const validation = await llm.generate({
      messages: [{ role: 'user', content: validationPrompt }],
      maxTokens: 5,
    });
    if (!validation.text.trim().toUpperCase().startsWith('YES')) {
      console.warn('Answer not grounded in context');
    }
  }
  return answer.text;
}
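`selfRAG` parses the model's decision with `JSON.parse`, which throws when the model wraps the JSON in a Markdown fence or adds a sentence around it. The following is one possible way to recover the object in those cases and to check field types before trusting them. `extractJsonObject` and `parseDecision` are hypothetical helper names, and the choice to fall back to retrieval with the original query is an assumption (an unnecessary search is usually cheaper than an ungrounded answer).

```typescript
// Try to recover a JSON object from free-form model output.
// Returns null instead of throwing so the caller can choose a fallback.
function extractJsonObject(raw: string): Record<string, unknown> | null {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return parsed as Record<string, unknown>;
    }
    return null;
  } catch {
    return null;
  }
}

// Mirrors the RAGDecision interface declared inside selfRAG.
interface RAGDecision {
  needsRetrieval: boolean;
  reason: string;
  retrievalQuery?: string;
}

// Validate field types; when the decision is unusable, retrieve with the original query.
function parseDecision(raw: string, originalQuery: string): RAGDecision {
  const obj = extractJsonObject(raw);
  const needs = obj ? obj['needsRetrieval'] : undefined;
  if (typeof needs !== 'boolean') {
    return { needsRetrieval: true, reason: 'unparseable decision', retrievalQuery: originalQuery };
  }
  const reason = obj ? obj['reason'] : undefined;
  const rq = obj ? obj['retrievalQuery'] : undefined;
  const decision: RAGDecision = {
    needsRetrieval: needs,
    reason: typeof reason === 'string' ? reason : '',
  };
  if (needs) {
    // A missing or blank search query falls back to the user's own wording.
    decision.retrievalQuery =
      typeof rq === 'string' && rq.trim() !== '' ? rq.trim() : originalQuery;
  }
  return decision;
}
```

If the LLM provider offers a structured-output or JSON mode, that is usually the more robust route; a helper like this is a fallback for providers or models without one.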
CRAG: Corrective RAG with Feedback Loop
Corrective RAG evaluates retrieval quality and adjusts the strategy:
async function correctiveRAG(query: string): Promise<string> {
  interface RetrievalEval {
    relevance: 'relevant' | 'partially_relevant' | 'irrelevant';
    confidence: number;
    action: 'proceed' | 'expand' | 'rewrite';
  }
  // Step 1: Initial retrieval
  const queryEmbedding = await embedModel.embed(query);
  let chunks = await vectorDB.search(queryEmbedding, { topK: 5 });
  // Step 2: Evaluate retrieval quality
  const evalPrompt = `
Query: "${query}"
Retrieved documents:
${chunks.map(c => c.text).join('\n---\n')}
Assess the relevance of these documents. Respond in JSON:
{
"relevance": "relevant" | "partially_relevant" | "irrelevant",
"confidence": 0-100,
"action": "proceed" | "expand" | "rewrite"
}`;
  const evalResponse = await llm.generate({
    messages: [{ role: 'user', content: evalPrompt }],
    maxTokens: 150,
  });
  let evaluation: RetrievalEval;
  try {
    evaluation = JSON.parse(evalResponse.text);
  } catch {
    // Unparseable evaluation: keep the initial retrieval rather than fail
    evaluation = { relevance: 'partially_relevant', confidence: 0, action: 'proceed' };
  }
  // Step 3: Take corrective action
  if (evaluation.action === 'rewrite') {
    // Rewrite query and retrieve again
    const rewritePrompt = `Rewrite for better search: "${query}"`;
    const rewritten = await llm.generate({
      messages: [{ role: 'user', content: rewritePrompt }],
      maxTokens: 100,
    });
    const newEmbedding = await embedModel.embed(rewritten.text.trim());
    chunks = await vectorDB.search(newEmbedding, { topK: 8 });
  } else if (evaluation.action === 'expand') {
    // Expand search: same query, more candidates
    chunks = await vectorDB.search(queryEmbedding, { topK: 15 });
  }
  // Step 4: Generate answer
  const context = chunks.map(c => c.text).join('\n\n');
  const answer = await llm.generate({
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });
  return answer.text;
}
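`correctiveRAG` trusts the `action` field the model returns and never looks at `confidence`. One alternative, sketched here under assumptions, is to derive the action deterministically from `relevance` and `confidence`, and to cap the number of corrective passes so latency and cost stay bounded. The `chooseAction` name, the `MIN_CONFIDENCE` cutoff of 60, and the default of two attempts are hypothetical values for illustration, not recommendations.

```typescript
type Relevance = 'relevant' | 'partially_relevant' | 'irrelevant';
type Action = 'proceed' | 'expand' | 'rewrite';

// Only the fields the model is asked to judge; the action is computed locally.
interface RetrievalJudgement {
  relevance: Relevance;
  confidence: number; // 0-100, as requested in the evaluation prompt
}

// Hypothetical cutoff: below this, a "relevant" verdict is not trusted.
const MIN_CONFIDENCE = 60;

// `attempt` counts corrective passes already made (0 on the first evaluation).
function chooseAction(
  judgement: RetrievalJudgement,
  attempt: number,
  maxAttempts = 2
): Action {
  // Once the retry budget is spent, answer with what we have.
  if (attempt >= maxAttempts) return 'proceed';
  if (judgement.relevance === 'irrelevant') return 'rewrite';
  if (judgement.relevance === 'partially_relevant') return 'expand';
  // "relevant" with low confidence: widen the search rather than trust it.
  return judgement.confidence < MIN_CONFIDENCE ? 'expand' : 'proceed';
}
```

Computing the action in code keeps the model's job to a single judgement and makes the routing rules testable without an LLM call.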
Decision: RAG vs Long Context vs Fine-Tuning
When should you use RAG? Here is a rough decision heuristic, expressed as code:
type ArchitectureStrategy = 'rag' | 'long_context' | 'fine_tuning' | 'hybrid';
interface DecisionFactors {
  dataSize: 'small' | 'medium' | 'large'; // Rough size of the knowledge base (GB)
  updateFrequency: 'static' | 'weekly' | 'daily' | 'realtime'; // How often knowledge changes
  latencyRequirement: number; // ms
  costPerQuery: number; // $ budget
  domainSpecialization: 'general' | 'niche'; // Domain focus
}
function recommendArchitecture(factors: DecisionFactors): ArchitectureStrategy {
  // Small, static data → fine-tuning (no retrieval step at query time, so lowest latency)
  if (factors.dataSize === 'small' && factors.updateFrequency === 'static') {
    return 'fine_tuning';
  }
  // Large, frequently updated → RAG (flexible, scalable)
  if (factors.dataSize === 'large' &&
      (factors.updateFrequency === 'daily' || factors.updateFrequency === 'realtime')) {
    return 'rag';
  }
  // Medium data, niche domain → hybrid (fine-tune base, RAG for specifics)
  if (factors.dataSize === 'medium' && factors.domainSpecialization === 'niche') {
    return 'hybrid';
  }
  // Default to RAG for flexibility.
  // Note: this sketch never returns 'long_context' and does not yet use
  // latencyRequirement or costPerQuery; extend the rules for your own constraints.
  return 'rag';
}
Checklist
- Implement query rewriting or HyDE for improved retrieval
- Add retrieval quality evaluation before generation
- Implement a modular pipeline that allows swappable retrievers
- Track retrieval quality metrics (hit rate, MRR)
- Set up a reranking step after initial retrieval
- Implement fallback strategies for low-confidence results
- Monitor token usage per query
- Add context attribution and source tracking
- Test multi-hop query handling
- Establish decision criteria for your architecture choice
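The two metrics named in the checklist are simple to compute once you have a small labeled set of queries. A minimal sketch, assuming each evaluation case records the ranked ids the retriever returned and the ids a human judged relevant (the `EvalCase` shape and function names are hypothetical):

```typescript
// One evaluation case: ranked retrieved ids plus the ids judged relevant.
interface EvalCase {
  retrievedIds: string[];
  relevantIds: string[];
}

// Hit rate@k: fraction of queries with at least one relevant id in the top k.
function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c =>
    c.retrievedIds.slice(0, k).some(id => c.relevantIds.indexOf(id) !== -1)
  ).length;
  return hits / cases.length;
}

// MRR: mean over queries of 1 / rank of the first relevant id (0 if none was retrieved).
function meanReciprocalRank(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => {
    const index = c.retrievedIds.findIndex(id => c.relevantIds.indexOf(id) !== -1);
    return sum + (index === -1 ? 0 : 1 / (index + 1));
  }, 0);
  return total / cases.length;
}
```

Running both metrics before and after a change such as query rewriting or HyDE shows whether the change actually moved relevant chunks up the ranking.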
Conclusion
Production RAG systems require moving beyond naive retrieve-and-generate patterns. By implementing advanced retrieval strategies, adaptive routing, and corrective feedback loops, you build systems that scale more reliably to complex question-answering tasks. The key insight: treat retrieval as an adaptive, iterative process, not a one-shot similarity search.