Agentic RAG — When Your RAG Pipeline Thinks Before It Retrieves

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Traditional RAG systems retrieve documents and then generate answers. Agentic RAG flips this: the LLM decides whether to retrieve, what to retrieve, and whether results are sufficient before finalizing an answer. This shift transforms RAG from a static pipeline into an intelligent decision-making system that mirrors how humans research problems.
- Query Analysis Before Retrieval
- Adaptive Retrieval and Confidence Scoring
- Iterative Retrieval: Retrieve-Assess-Re-retrieve
- FLARE: Forward-Looking Active Retrieval
- Self-Ask Decomposition
- Corrective RAG (CRAG)
- Comparing Agentic vs Static RAG
- Checklist
- Conclusion
Query Analysis Before Retrieval
Not every user query requires retrieval. Agentic RAG starts by asking: does this question need external knowledge?
```typescript
interface QueryAnalysisResult {
  needsRetrieval: boolean;
  confidence: number;
  reasoning: string;
  suggestedSources: 'internal' | 'web' | 'database';
}

async function analyzeQuery(query: string): Promise<QueryAnalysisResult> {
  const analysis = await llm.generate({
    prompt: `Analyze if this query needs external retrieval or can be answered from general knowledge:
Query: "${query}"
Respond with JSON: { needsRetrieval, confidence, reasoning, suggestedSources }`,
    model: 'gpt-4',
  });
  return JSON.parse(analysis);
}
```
This reduces unnecessary vector database calls. If the LLM is confident about general knowledge (e.g., "What is photosynthesis?"), skip retrieval entirely. This saves embedding costs and latency.
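The routing decision itself is ordinary control flow once the analysis result is in hand. A minimal sketch (the `routeQuery` helper and the 0.8 cutoff are illustrative assumptions, not part of the original pipeline):

```typescript
// Mirrors the QueryAnalysisResult interface above, repeated here so the
// snippet is self-contained.
interface QueryAnalysisResult {
  needsRetrieval: boolean;
  confidence: number;
  reasoning: string;
  suggestedSources: 'internal' | 'web' | 'database';
}

function routeQuery(analysis: QueryAnalysisResult): 'direct' | 'rag' {
  // Only skip retrieval when the model is both unwilling to retrieve
  // AND confident — a low-confidence "no retrieval" still goes to RAG.
  if (!analysis.needsRetrieval && analysis.confidence >= 0.8) {
    return 'direct'; // answer from the LLM's parametric knowledge
  }
  return 'rag'; // fall through to the retrieval pipeline
}

// A general-knowledge question the model can answer directly:
const generalKnowledge: QueryAnalysisResult = {
  needsRetrieval: false,
  confidence: 0.95,
  reasoning: 'Photosynthesis is textbook general knowledge.',
  suggestedSources: 'internal',
};
console.log(routeQuery(generalKnowledge)); // → "direct"
```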
Adaptive Retrieval and Confidence Scoring
Agentic systems assess retrieval results in real-time. If confidence is low, they modify the query and re-retrieve.
```typescript
interface RetrievalState {
  query: string;
  results: Document[];
  confidence: number;
  attempts: number;
}

async function adaptiveRetrieval(
  query: string,
  maxAttempts: number = 3
): Promise<Document[]> {
  let state: RetrievalState = {
    query,
    results: [],
    confidence: 0,
    attempts: 0,
  };
  while (state.confidence < 0.75 && state.attempts < maxAttempts) {
    state.results = await vectorDb.search(state.query);
    // Pass the retrieved documents to the assessor alongside the prompt
    const assessment = await llm.generate({
      prompt: `Do these documents answer the query "${query}"? Score 0-1.`,
      context: state.results,
    });
    state.confidence = parseFloat(assessment);
    if (state.confidence < 0.75) {
      state.query = await llm.generate({
        prompt: `Rephrase to broaden search: "${state.query}"`,
      });
    }
    state.attempts++;
  }
  return state.results;
}
```
This mimics how humans refine searches. Low-confidence results trigger query expansion or related concept exploration.
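One fragile spot in the loop above is `parseFloat` on free-form LLM output: models sometimes wrap the score in a sentence or answer "85" instead of "0.85". A small defensive parser (hypothetical helper, not from the original pipeline) keeps the retry loop from reading `NaN` as confidence:

```typescript
// Extract the first number from an LLM reply and normalize it to [0, 1].
function parseConfidence(raw: string): number {
  const match = raw.match(/\d+(\.\d+)?/); // first number in the reply
  if (!match) return 0; // no number at all: treat as lowest confidence
  const value = parseFloat(match[0]);
  // Some models answer as a percentage ("85"); normalize it.
  const normalized = value > 1 ? value / 100 : value;
  return Math.min(1, Math.max(0, normalized));
}

console.log(parseConfidence('I would score these documents 0.85.')); // → 0.85
console.log(parseConfidence('Relevance: 85%'));                      // → 0.85
console.log(parseConfidence('Not relevant at all.'));                // → 0
```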
Iterative Retrieval: Retrieve-Assess-Re-retrieve
Rather than a single retrieval pass, agentic systems loop: retrieve documents, assess their relevance, then retrieve more context if needed.
```typescript
async function iterativeRetrieval(
  question: string,
  rounds: number = 3
): Promise<{ docs: Document[]; reasoning: string[] }> {
  const allDocs = new Set<string>();
  const reasoning: string[] = [];
  let currentQuestion = question;
  for (let i = 0; i < rounds; i++) {
    const docs = await vectorDb.search(currentQuestion);
    docs.forEach(d => allDocs.add(d.id));
    const assessment = await llm.generate({
      prompt: `Given these docs, what still needs answering about "${question}"? Reply "sufficient" if nothing is missing.`,
      context: docs,
    });
    reasoning.push(assessment);
    if (assessment.includes('sufficient')) break;
    currentQuestion = `${question}. Still need: ${assessment}`;
  }
  // getById is async, so resolve all lookups before returning
  return {
    docs: await Promise.all(Array.from(allDocs).map(id => vectorDb.getById(id))),
    reasoning,
  };
}
```
This approach is especially valuable for multi-hop questions requiring evidence from multiple documents.
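Note that stopping on `assessment.includes('sufficient')` is brittle: the model might say "insufficient" and still match. A stricter (hypothetical) prompt contract asks the model to begin its reply with a sentinel word and branches on the prefix:

```typescript
// Assumes the assessment prompt instructs: "Begin your reply with
// SUFFICIENT or MISSING." Then the stop check is an exact prefix test.
function isSufficient(assessment: string): boolean {
  return assessment.trim().toUpperCase().startsWith('SUFFICIENT');
}

console.log(isSufficient('SUFFICIENT: all sub-claims are covered.')); // → true
console.log(isSufficient('MISSING: no evidence on founding date.'));  // → false
console.log(isSufficient('The docs are insufficient.'));              // → false
```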
FLARE: Forward-Looking Active Retrieval
FLARE (Forward-Looking Active REtrieval augmented generation) generates the response incrementally, pausing to retrieve whenever the model expresses uncertainty about what comes next.
```typescript
async function flareGeneration(
  query: string,
  vectorDb: VectorStore
): Promise<string> {
  let response = '';
  const uncertaintyThreshold = 0.4;
  while (true) {
    // Tentatively generate the next sentence of the answer
    const sentence = await llm.generateNextSentence(query, response);
    if (!sentence) break; // generation finished
    const uncertainty = await llm.scoreUncertainty(sentence);
    if (uncertainty > uncertaintyThreshold) {
      // Uncertain: use the tentative sentence as the search query,
      // then regenerate it grounded in the retrieved context
      const relevant = await vectorDb.search(sentence);
      const context = relevant.map(d => d.content).join('\n');
      response += await llm.generateNextSentence(query, response, context);
    } else {
      response += sentence; // confident: keep the tentative sentence
    }
  }
  return response;
}
```
FLARE is more cost-effective than fixed retrieval because it retrieves only when needed during generation.
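In the FLARE paper, the uncertainty trigger is token-level: retrieval fires when any token in the tentative sentence has a generation probability below a threshold. A sketch, assuming the model API exposes per-token log-probabilities (the `TokenLogProb` shape here is an illustrative assumption):

```typescript
interface TokenLogProb {
  token: string;
  logprob: number; // natural-log probability of the generated token
}

// Retrieve when the least-confident token in the tentative sentence
// falls below the probability threshold.
function needsRetrieval(tokens: TokenLogProb[], minProb = 0.4): boolean {
  return tokens.some(t => Math.exp(t.logprob) < minProb);
}

const confident: TokenLogProb[] = [
  { token: 'Paris', logprob: Math.log(0.9) },
  { token: ' is', logprob: Math.log(0.95) },
];
const uncertain: TokenLogProb[] = [
  { token: ' 1889', logprob: Math.log(0.2) }, // low-probability fact token
];
console.log(needsRetrieval(confident)); // → false
console.log(needsRetrieval(uncertain)); // → true
```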
Self-Ask Decomposition
Self-ask breaks complex questions into simpler sub-questions, each with its own retrieval.
```typescript
interface QuestionDecomposition {
  mainQuestion: string;
  subQuestions: string[];
  // Plain object rather than Map so it survives JSON.parse
  dependencies: Record<string, string[]>;
}

async function selfAskDecompose(
  question: string
): Promise<QuestionDecomposition> {
  const decomposition = await llm.generate({
    prompt: `Break into sub-questions: "${question}"
Format as JSON with mainQuestion, subQuestions array, and dependencies map.`,
  });
  return JSON.parse(decomposition);
}

async function answerWithSelfAsk(question: string): Promise<string> {
  const decomp = await selfAskDecompose(question);
  const answers = new Map<string, string>();
  for (const subQ of decomp.subQuestions) {
    const docs = await vectorDb.search(subQ);
    const answer = await llm.generate({
      prompt: subQ,
      context: docs,
    });
    answers.set(subQ, answer);
  }
  return await llm.generate({
    prompt: decomp.mainQuestion,
    context: Array.from(answers.entries())
      .map(([q, a]) => `Q: ${q}\nA: ${a}`)
      .join('\n'),
  });
}
```
Self-ask improves accuracy on questions requiring synthesized knowledge from multiple sources.
Corrective RAG (CRAG)
CRAG adds a retrieval evaluator that grades relevance and triggers web search fallback when document quality is poor.
```typescript
interface RetrievalEvaluation {
  isRelevant: boolean;
  score: number;
  hasHallucination: boolean;
}

async function correctiveRag(query: string): Promise<string> {
  const docs = await vectorDb.search(query);
  // The LLM returns a JSON string, so parse it into the typed shape
  const evaluation: RetrievalEvaluation = JSON.parse(
    await llm.generate({
      prompt: `Rate document relevance for "${query}".
Respond JSON: { isRelevant, score, hasHallucination }`,
      context: docs,
    })
  );
  if (!evaluation.isRelevant || evaluation.score < 0.5) {
    // Fall back to web search when the corpus can't answer
    const webResults = await webSearch(query);
    return await llm.generate({
      prompt: query,
      context: webResults,
    });
  }
  return await llm.generate({
    prompt: query,
    context: docs,
  });
}
```
CRAG reduces hallucinations by validating retrieval quality before answer generation.
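The CRAG paper actually grades retrieval into three actions, not two: Correct (use the documents), Incorrect (discard them and search the web), and Ambiguous (combine both sources). A sketch of that three-way branch; the 0.7/0.3 cutoffs are illustrative assumptions:

```typescript
type CragAction = 'use_docs' | 'web_search' | 'combine';

function decideAction(score: number): CragAction {
  if (score >= 0.7) return 'use_docs';   // confident the corpus covers it
  if (score <= 0.3) return 'web_search'; // corpus is clearly off-topic
  return 'combine';                      // uncertain: merge both sources
}

console.log(decideAction(0.9)); // → "use_docs"
console.log(decideAction(0.5)); // → "combine"
console.log(decideAction(0.1)); // → "web_search"
```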
Comparing Agentic vs Static RAG
The tradeoff: agentic RAG increases latency (multiple LLM calls, retries) but improves accuracy and reduces hallucinations. Static RAG is faster but can fail on complex queries.
Measure both. Track:
- Answer accuracy on test queries
- Retrieval quality (precision, recall)
- Total latency (including retrieval + generation)
- Token usage and cost
For customer-facing applications where accuracy > speed, agentic RAG wins. For high-throughput scenarios, optimize the static pipeline.
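The measurement loop above can be sketched as a small harness that runs both pipelines over a shared test set. The `Pipeline` type and the exact-match scorer are illustrative assumptions; in practice you would swap in an LLM judge for free-form answers:

```typescript
type Pipeline = (query: string) => Promise<string>;

interface EvalResult {
  accuracy: number;
  avgLatencyMs: number;
}

async function evaluate(
  pipeline: Pipeline,
  testSet: { query: string; expected: string }[]
): Promise<EvalResult> {
  let correct = 0;
  let totalMs = 0;
  for (const { query, expected } of testSet) {
    const start = Date.now();
    const answer = await pipeline(query);
    totalMs += Date.now() - start; // includes retrieval + generation
    // Exact match is the simplest scorer; use an LLM judge for prose
    if (answer.trim() === expected.trim()) correct++;
  }
  return {
    accuracy: correct / testSet.length,
    avgLatencyMs: totalMs / testSet.length,
  };
}
```

Run `evaluate` once with the static pipeline and once with the agentic one, then compare accuracy against the latency and token cost you are paying for it.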
Checklist
- Implement query analysis to skip unnecessary retrievals
- Add confidence scoring to retrieval results
- Build adaptive re-querying when confidence is low
- Consider FLARE for streaming use cases
- Use self-ask for complex, multi-hop questions
- Add a retrieval evaluator for quality control
- Measure accuracy gains vs latency costs
Conclusion
Agentic RAG systems think before they retrieve, refine queries when results disappoint, and validate information quality. This mirrors human research behavior and dramatically improves answer quality on challenging questions. Start with query analysis and confidence scoring—measure impact before adding complexity.