Agentic RAG — When Your RAG Pipeline Thinks Before It Retrieves

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Traditional RAG systems retrieve documents and then generate answers. Agentic RAG flips this: the LLM decides whether to retrieve, what to retrieve, and whether results are sufficient before finalizing an answer. This shift transforms RAG from a static pipeline into an intelligent decision-making system that mirrors how humans research problems.
- Query Analysis Before Retrieval
- Adaptive Retrieval and Confidence Scoring
- Iterative Retrieval: Retrieve-Assess-Re-retrieve
- FLARE: Forward-Looking Active Retrieval
- Self-Ask Decomposition
- Corrective RAG (CRAG)
- Comparing Agentic vs Static RAG
- Checklist
- Conclusion
Query Analysis Before Retrieval
Not every user query requires retrieval. Agentic RAG starts by asking: does this question need external knowledge?
```typescript
interface QueryAnalysisResult {
  needsRetrieval: boolean;
  confidence: number;
  reasoning: string;
  suggestedSources: 'internal' | 'web' | 'database';
}

async function analyzeQuery(query: string): Promise<QueryAnalysisResult> {
  const analysis = await llm.generate({
    prompt: `Analyze if this query needs external retrieval or can be answered from general knowledge:
Query: "${query}"
Respond with JSON: { needsRetrieval, confidence, reasoning, suggestedSources }`,
    model: 'gpt-4',
  });
  return JSON.parse(analysis);
}
```
This reduces unnecessary vector database calls. If the LLM is confident about general knowledge (e.g., "What is photosynthesis?"), skip retrieval entirely. This saves embedding costs and latency.
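The routing decision itself is ordinary control flow once the analysis result is in hand. A minimal sketch (the `routeQuery` helper and the 0.8 cutoff are illustrative assumptions, not part of the original pipeline):

```typescript
// Mirrors the QueryAnalysisResult interface above, repeated here so the
// snippet is self-contained.
interface QueryAnalysisResult {
  needsRetrieval: boolean;
  confidence: number;
  reasoning: string;
  suggestedSources: 'internal' | 'web' | 'database';
}

function routeQuery(analysis: QueryAnalysisResult): 'direct' | 'rag' {
  // Only skip retrieval when the model is both unwilling to retrieve
  // AND confident — a low-confidence "no retrieval" still goes to RAG.
  if (!analysis.needsRetrieval && analysis.confidence >= 0.8) {
    return 'direct'; // answer from the LLM's parametric knowledge
  }
  return 'rag'; // fall through to the retrieval pipeline
}

// A general-knowledge question the model can answer directly:
const generalKnowledge: QueryAnalysisResult = {
  needsRetrieval: false,
  confidence: 0.95,
  reasoning: 'Photosynthesis is textbook general knowledge.',
  suggestedSources: 'internal',
};
console.log(routeQuery(generalKnowledge)); // → "direct"
```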
Adaptive Retrieval and Confidence Scoring
Agentic systems assess retrieval results in real-time. If confidence is low, they modify the query and re-retrieve.
```typescript
interface RetrievalState {
  query: string;
  results: Document[];
  confidence: number;
  attempts: number;
}

async function adaptiveRetrieval(
  query: string,
  maxAttempts: number = 3
): Promise<Document[]> {
  let state: RetrievalState = {
    query,
    results: [],
    confidence: 0,
    attempts: 0,
  };
  while (state.confidence < 0.75 && state.attempts < maxAttempts) {
    state.results = await vectorDb.search(state.query);
    // Pass the retrieved documents to the assessor alongside the prompt
    const assessment = await llm.generate({
      prompt: `Do these documents answer the query "${query}"? Score 0-1.`,
      context: state.results,
    });
    state.confidence = parseFloat(assessment);
    if (state.confidence < 0.75) {
      state.query = await llm.generate({
        prompt: `Rephrase to broaden search: "${state.query}"`,
      });
    }
    state.attempts++;
  }
  return state.results;
}
```
This mimics how humans refine searches. Low-confidence results trigger query expansion or related concept exploration.
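One fragile spot in the loop above is `parseFloat` on free-form LLM output: models sometimes wrap the score in a sentence or answer "85" instead of "0.85". A small defensive parser (hypothetical helper, not from the original pipeline) keeps the retry loop from reading `NaN` as confidence:

```typescript
// Extract the first number from an LLM reply and normalize it to [0, 1].
function parseConfidence(raw: string): number {
  const match = raw.match(/\d+(\.\d+)?/); // first number in the reply
  if (!match) return 0; // no number at all: treat as lowest confidence
  const value = parseFloat(match[0]);
  // Some models answer as a percentage ("85"); normalize it.
  const normalized = value > 1 ? value / 100 : value;
  return Math.min(1, Math.max(0, normalized));
}

console.log(parseConfidence('I would score these documents 0.85.')); // → 0.85
console.log(parseConfidence('Relevance: 85%'));                      // → 0.85
console.log(parseConfidence('Not relevant at all.'));                // → 0
```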
Iterative Retrieval: Retrieve-Assess-Re-retrieve
Rather than a single retrieval pass, agentic systems loop: retrieve documents, assess their relevance, then retrieve more context if needed.
```typescript
async function iterativeRetrieval(
  question: string,
  rounds: number = 3
): Promise<{ docs: Document[]; reasoning: string[] }> {
  const allDocs = new Set<string>();
  const reasoning: string[] = [];
  let currentQuestion = question;
  for (let i = 0; i < rounds; i++) {
    const docs = await vectorDb.search(currentQuestion);
    docs.forEach(d => allDocs.add(d.id));
    const assessment = await llm.generate({
      prompt: `Given these docs, what still needs answering about "${question}"? Reply "sufficient" if nothing is missing.`,
      context: docs,
    });
    reasoning.push(assessment);
    if (assessment.includes('sufficient')) break;
    currentQuestion = `${question}. Still need: ${assessment}`;
  }
  // getById is async, so resolve all lookups before returning
  return {
    docs: await Promise.all(Array.from(allDocs).map(id => vectorDb.getById(id))),
    reasoning,
  };
}
```
This approach is especially valuable for multi-hop questions requiring evidence from multiple documents.
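Note that stopping on `assessment.includes('sufficient')` is brittle: the model might say "insufficient" and still match. A stricter (hypothetical) prompt contract asks the model to begin its reply with a sentinel word and branches on the prefix:

```typescript
// Assumes the assessment prompt instructs: "Begin your reply with
// SUFFICIENT or MISSING." Then the stop check is an exact prefix test.
function isSufficient(assessment: string): boolean {
  return assessment.trim().toUpperCase().startsWith('SUFFICIENT');
}

console.log(isSufficient('SUFFICIENT: all sub-claims are covered.')); // → true
console.log(isSufficient('MISSING: no evidence on founding date.'));  // → false
console.log(isSufficient('The docs are insufficient.'));              // → false
```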
FLARE: Forward-Looking Active Retrieval
FLARE (Forward-Looking Active REtrieval augmented generation) generates the response incrementally, pausing to retrieve whenever the model expresses uncertainty about what comes next.
```typescript
async function flareGeneration(
  query: string,
  vectorDb: VectorStore
): Promise<string> {
  let response = '';
  const uncertaintyThreshold = 0.4;
  while (true) {
    // Tentatively generate the next sentence of the answer
    const sentence = await llm.generateNextSentence(query, response);
    if (!sentence) break; // generation finished
    const uncertainty = await llm.scoreUncertainty(sentence);
    if (uncertainty > uncertaintyThreshold) {
      // Uncertain: use the tentative sentence as the search query,
      // then regenerate it grounded in the retrieved context
      const relevant = await vectorDb.search(sentence);
      const context = relevant.map(d => d.content).join('\n');
      response += await llm.generateNextSentence(query, response, context);
    } else {
      response += sentence; // confident: keep the tentative sentence
    }
  }
  return response;
}
```
FLARE is more cost-effective than fixed retrieval because it retrieves only when needed during generation.
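In the FLARE paper, the uncertainty trigger is token-level: retrieval fires when any token in the tentative sentence has a generation probability below a threshold. A sketch, assuming the model API exposes per-token log-probabilities (the `TokenLogProb` shape here is an illustrative assumption):

```typescript
interface TokenLogProb {
  token: string;
  logprob: number; // natural-log probability of the generated token
}

// Retrieve when the least-confident token in the tentative sentence
// falls below the probability threshold.
function needsRetrieval(tokens: TokenLogProb[], minProb = 0.4): boolean {
  return tokens.some(t => Math.exp(t.logprob) < minProb);
}

const confident: TokenLogProb[] = [
  { token: 'Paris', logprob: Math.log(0.9) },
  { token: ' is', logprob: Math.log(0.95) },
];
const uncertain: TokenLogProb[] = [
  { token: ' 1889', logprob: Math.log(0.2) }, // low-probability fact token
];
console.log(needsRetrieval(confident)); // → false
console.log(needsRetrieval(uncertain)); // → true
```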
Self-Ask Decomposition
Self-ask breaks complex questions into simpler sub-questions, each with its own retrieval.
```typescript
interface QuestionDecomposition {
  mainQuestion: string;
  subQuestions: string[];
  // Plain object rather than Map so it survives JSON.parse
  dependencies: Record<string, string[]>;
}

async function selfAskDecompose(
  question: string
): Promise<QuestionDecomposition> {
  const decomposition = await llm.generate({
    prompt: `Break into sub-questions: "${question}"
Format as JSON with mainQuestion, subQuestions array, and dependencies map.`,
  });
  return JSON.parse(decomposition);
}

async function answerWithSelfAsk(question: string): Promise<string> {
  const decomp = await selfAskDecompose(question);
  const answers = new Map<string, string>();
  for (const subQ of decomp.subQuestions) {
    const docs = await vectorDb.search(subQ);
    const answer = await llm.generate({
      prompt: subQ,
      context: docs,
    });
    answers.set(subQ, answer);
  }
  return await llm.generate({
    prompt: decomp.mainQuestion,
    context: Array.from(answers.entries())
      .map(([q, a]) => `Q: ${q}\nA: ${a}`)
      .join('\n'),
  });
}
```
Self-ask improves accuracy on questions requiring synthesized knowledge from multiple sources.
Corrective RAG (CRAG)
CRAG adds a retrieval evaluator that grades relevance and triggers web search fallback when document quality is poor.
```typescript
interface RetrievalEvaluation {
  isRelevant: boolean;
  score: number;
  hasHallucination: boolean;
}

async function correctiveRag(query: string): Promise<string> {
  const docs = await vectorDb.search(query);
  // The LLM returns a JSON string, so parse it into the typed shape
  const evaluation: RetrievalEvaluation = JSON.parse(
    await llm.generate({
      prompt: `Rate document relevance for "${query}".
Respond JSON: { isRelevant, score, hasHallucination }`,
      context: docs,
    })
  );
  if (!evaluation.isRelevant || evaluation.score < 0.5) {
    // Fall back to web search when the corpus can't answer
    const webResults = await webSearch(query);
    return await llm.generate({
      prompt: query,
      context: webResults,
    });
  }
  return await llm.generate({
    prompt: query,
    context: docs,
  });
}
```
CRAG reduces hallucinations by validating retrieval quality before answer generation.
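The CRAG paper actually grades retrieval into three actions, not two: Correct (use the documents), Incorrect (discard them and search the web), and Ambiguous (combine both sources). A sketch of that three-way branch; the 0.7/0.3 cutoffs are illustrative assumptions:

```typescript
type CragAction = 'use_docs' | 'web_search' | 'combine';

function decideAction(score: number): CragAction {
  if (score >= 0.7) return 'use_docs';   // confident the corpus covers it
  if (score <= 0.3) return 'web_search'; // corpus is clearly off-topic
  return 'combine';                      // uncertain: merge both sources
}

console.log(decideAction(0.9)); // → "use_docs"
console.log(decideAction(0.5)); // → "combine"
console.log(decideAction(0.1)); // → "web_search"
```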
Comparing Agentic vs Static RAG
The tradeoff: agentic RAG increases latency (multiple LLM calls, retries) but improves accuracy and reduces hallucinations. Static RAG is faster but can fail on complex queries.
Measure both. Track:
- Answer accuracy on test queries
- Retrieval quality (precision, recall)
- Total latency (including retrieval + generation)
- Token usage and cost
For customer-facing applications where accuracy > speed, agentic RAG wins. For high-throughput scenarios, optimize the static pipeline.
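The measurement loop above can be sketched as a small harness that runs both pipelines over a shared test set. The `Pipeline` type and the exact-match scorer are illustrative assumptions; in practice you would swap in an LLM judge for free-form answers:

```typescript
type Pipeline = (query: string) => Promise<string>;

interface EvalResult {
  accuracy: number;
  avgLatencyMs: number;
}

async function evaluate(
  pipeline: Pipeline,
  testSet: { query: string; expected: string }[]
): Promise<EvalResult> {
  let correct = 0;
  let totalMs = 0;
  for (const { query, expected } of testSet) {
    const start = Date.now();
    const answer = await pipeline(query);
    totalMs += Date.now() - start; // includes retrieval + generation
    // Exact match is the simplest scorer; use an LLM judge for prose
    if (answer.trim() === expected.trim()) correct++;
  }
  return {
    accuracy: correct / testSet.length,
    avgLatencyMs: totalMs / testSet.length,
  };
}
```

Run `evaluate` once with the static pipeline and once with the agentic one, then compare accuracy against the latency and token cost you are paying for it.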
Checklist
- Implement query analysis to skip unnecessary retrievals
- Add confidence scoring to retrieval results
- Build adaptive re-querying when confidence is low
- Consider FLARE for streaming use cases
- Use self-ask for complex, multi-hop questions
- Add a retrieval evaluator for quality control
- Measure accuracy gains vs latency costs
Conclusion
Agentic RAG systems think before they retrieve, refine queries when results disappoint, and validate information quality. This mirrors human research behavior and dramatically improves answer quality on challenging questions. Start with query analysis and confidence scoring—measure impact before adding complexity.