Continuous RAG Improvement — Using Production Data to Make Your Pipeline Better

Introduction

Most RAG systems are static after deployment. But the best teams treat RAG as a learning system: every user interaction is a data point. A user corrects an answer—that's a signal. A query gets zero retrieval results—that's a problem. A user rephrases their question—that's guidance. Mining production data transforms RAG from fixed to continuously improving. This post covers the infrastructure and techniques to capture, evaluate, and act on real-world signals.

Logging Retrieval Quality Signals

Instrument every retrieval with metadata to evaluate quality later.

interface RetrievalLog {
  timestamp: number;
  queryId: string;
  userQuery: string;
  retrievedChunks: Array<{
    id: string;
    content: string;
    similarity: number;
  }>;
  generatedAnswer: string;
  userFeedback?: 'helpful' | 'incorrect' | 'irrelevant' | 'incomplete';
  userCorrection?: string; // If user provided the right answer
  sessionId: string;
  userId: string;
}

async function logRetrieval(
  query: string,
  chunks: DocumentChunk[],
  answer: string,
  sessionId: string,
  userId: string
): Promise<string> {
  const queryId = generateId();

  const log: RetrievalLog = {
    timestamp: Date.now(),
    queryId,
    userQuery: query,
    retrievedChunks: chunks.map(c => ({
      id: c.id,
      content: c.content,
      similarity: c.similarity,
    })),
    generatedAnswer: answer,
    sessionId,
    userId,
  };

  // Store in database for analysis
  await logsDb.insert('retrievals', log);

  return queryId;
}

// Collect explicit feedback from user
async function recordUserFeedback(
  queryId: string,
  feedback: 'helpful' | 'incorrect' | 'irrelevant' | 'incomplete',
  correction?: string
): Promise<void> {
  await logsDb.update('retrievals', { queryId }, {
    userFeedback: feedback,
    userCorrection: correction,
  });
}

Store all logs in a queryable database. Over time, logs become your ground truth for evaluation.
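As a sketch of how logs become ground truth, queries a user explicitly marked helpful can be distilled into a regression-test set for future retriever versions. The `LoggedQuery` shape below mirrors the `RetrievalLog` interface above; `buildEvalSet` is a hypothetical helper, not part of any library:

```typescript
interface LoggedQuery {
  queryId: string;
  userQuery: string;
  retrievedChunks: Array<{ id: string; content: string; similarity: number }>;
  userFeedback?: 'helpful' | 'incorrect' | 'irrelevant' | 'incomplete';
}

interface EvalCase {
  query: string;
  expectedChunkIds: string[]; // chunks the user implicitly endorsed
}

// Queries marked 'helpful' become regression tests: future retriever
// versions should still surface the same chunks for the same query.
function buildEvalSet(logs: LoggedQuery[]): EvalCase[] {
  return logs
    .filter(l => l.userFeedback === 'helpful')
    .map(l => ({
      query: l.userQuery,
      expectedChunkIds: l.retrievedChunks.map(c => c.id),
    }));
}
```

Run this over a week of logs and you have a free, continuously growing evaluation set grounded in real user judgments.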

Identifying Bad Retrievals From Conversation Patterns

Watch for red flags in conversation flow.

interface ConversationPattern {
  queryId: string;
  pattern: 'rephrasing' | 'clarification' | 'complaint' | 'correction';
  confidence: number;
}

async function detectConversationPatterns(
  messages: Array<{ role: string; content: string }>
): Promise<ConversationPattern[]> {
  const patterns: ConversationPattern[] = [];

  for (let i = 1; i < messages.length; i++) {
    const prev = messages[i - 1];
    const curr = messages[i];

    if (prev.role === 'assistant' && curr.role === 'user') {
      // Check for rephrasing: user asks a similar question to their
      // previous one (compare user turns, not the assistant's answer)
      const prevUserMessage = i >= 2 ? messages[i - 2] : undefined;
      if (
        prevUserMessage?.role === 'user' &&
        isSimilarQuery(prevUserMessage.content, curr.content)
      ) {
        patterns.push({
          queryId: generateId(),
          pattern: 'rephrasing',
          confidence: computeSimilarity(prevUserMessage.content, curr.content),
        });
      }

      // Check for clarification: "What do you mean by..."
      if (
        curr.content.toLowerCase().includes('what do you mean') ||
        curr.content.toLowerCase().includes('can you explain')
      ) {
        patterns.push({
          queryId: generateId(),
          pattern: 'clarification',
          confidence: 0.9,
        });
      }

      // Check for complaints: "That's wrong" or "That doesn't make sense"
      if (
        curr.content.toLowerCase().includes('wrong') ||
        curr.content.toLowerCase().includes('incorrect') ||
        curr.content.toLowerCase().includes("doesn't make sense")
      ) {
        patterns.push({
          queryId: generateId(),
          pattern: 'complaint',
          confidence: 0.85,
        });
      }

      // Check for correction: "Actually, it's..." or "No, I meant..."
      if (
        curr.content.toLowerCase().includes('actually') ||
        curr.content.toLowerCase().includes('i meant')
      ) {
        patterns.push({
          queryId: generateId(),
          pattern: 'correction',
          confidence: 0.8,
        });
      }
    }
  }

  return patterns;
}

// Alert when patterns indicate retrieval failure
async function monitorPatterns(
  sessionId: string,
  messages: Array<{ role: string; content: string }>
): Promise<void> {
  const patterns = await detectConversationPatterns(messages);

  const negativeCount = patterns.filter(
    p => p.pattern === 'complaint' || p.pattern === 'correction'
  ).length;

  if (negativeCount >= 2) {
    console.warn(`Session ${sessionId} shows retrieval issues; flagged for review`);
    await flagSessionForReview(sessionId, 'high_failure_rate');
  }
}

Rephrasing and corrections are strong signals that initial retrieval failed.
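The helpers `isSimilarQuery` and `computeSimilarity` are left undefined above. One cheap, dependency-free sketch is token-level Jaccard overlap; embedding cosine similarity would be the stronger choice in production:

```typescript
// Jaccard similarity over lowercase word sets: |A ∩ B| / |A ∪ B|.
function computeSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const setA = tokens(a);
  const setB = tokens(b);
  if (setA.size === 0 || setB.size === 0) return 0;
  let intersection = 0;
  for (const t of setA) if (setB.has(t)) intersection++;
  return intersection / (setA.size + setB.size - intersection);
}

// Treat two queries as rephrasings when they share most of their vocabulary.
function isSimilarQuery(a: string, b: string, threshold = 0.5): boolean {
  return computeSimilarity(a, b) >= threshold;
}
```

The 0.5 threshold is a starting point; tune it against sessions your team has hand-labeled as rephrasings.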

Automatic Evaluation on Production Queries

Evaluate retrieval quality using LLM judgment, without requiring user labels.

interface RetrievalQualityScore {
  relevance: number; // 0-1: Does the chunk answer the query?
  completeness: number; // 0-1: Is the chunk sufficient?
  clarity: number; // 0-1: Is it clear and coherent?
  overallScore: number;
}

async function evaluateRetrievalQuality(
  query: string,
  chunks: Array<{ content: string }>, // accepts DocumentChunk[] or logged chunks
  generatedAnswer: string,
  llm: LanguageModel
): Promise<RetrievalQualityScore> {
  const evaluation = await llm.generate({
    prompt: `Rate the quality of this retrieval for the given query.

Query: "${query}"

Retrieved chunks:
${chunks.map(c => c.content).join('\n---\n')}

Generated answer: "${generatedAnswer}"

Score each dimension 0-10:
- Relevance: Does the chunk directly address the query?
- Completeness: Does the chunk fully answer the question?
- Clarity: Is the information clear and coherent?

Return JSON: { relevance, completeness, clarity }`,
  });

  const scores = JSON.parse(evaluation);

  return {
    relevance: scores.relevance / 10,
    completeness: scores.completeness / 10,
    clarity: scores.clarity / 10,
    overallScore:
      (scores.relevance + scores.completeness + scores.clarity) / 30,
  };
}

async function evaluateBatch(logs: RetrievalLog[], llm: LanguageModel): Promise<void> {
  for (const log of logs) {
    const quality = await evaluateRetrievalQuality(
      log.userQuery,
      log.retrievedChunks,
      log.generatedAnswer,
      llm
    );

    // Store evaluation
    await logsDb.update('retrievals', { queryId: log.queryId }, {
      llmQualityScore: quality.overallScore,
      relevanceScore: quality.relevance,
      completenessScore: quality.completeness,
    });
  }
}

LLM-based evaluation is fast and flexible; batch it overnight so its cost and latency stay out of the request path.

A/B Testing Retrieval Changes

Test retrieval improvements before rolling out.

interface ABTestConfig {
  testId: string;
  controlRetriever: Retriever; // Current production
  experimentalRetriever: Retriever; // New approach
  splitRatio: number; // % of traffic to experiment
  duration: number; // milliseconds
  metric: 'relevance_score' | 'click_through_rate' | 'user_feedback';
}

async function abTestRetrieval(
  query: string,
  config: ABTestConfig
): Promise<{
  retrievedChunks: DocumentChunk[];
  variant: 'control' | 'experimental';
  testId: string;
}> {
  const assignedVariant = Math.random() < config.splitRatio
    ? 'experimental'
    : 'control';

  const retriever =
    assignedVariant === 'control'
      ? config.controlRetriever
      : config.experimentalRetriever;

  const chunks = await retriever.retrieve(query);

  // Log for analysis
  await logsDb.insert('ab_tests', {
    testId: config.testId,
    query,
    variant: assignedVariant,
    timestamp: Date.now(),
  });

  return {
    retrievedChunks: chunks,
    variant: assignedVariant,
    testId: config.testId,
  };
}

// Analyze results after test duration
async function analyzeABTest(testId: string): Promise<{
  controlMetric: number;
  experimentalMetric: number;
  improvement: number;
  pValue: number; // Statistical significance
  recommendation: 'rollout' | 'continue_test' | 'rollback';
}> {
  const logs = await logsDb.query({
    collection: 'ab_tests',
    filter: { testId },
  });

  const controlLogs = logs.filter(l => l.variant === 'control');
  const experimentalLogs = logs.filter(l => l.variant === 'experimental');

  // Compute metric: average relevance score, assuming the nightly
  // evaluation job has already written relevanceScore onto each ab_tests record
  const controlMetric =
    controlLogs.reduce((sum, l) => sum + l.relevanceScore, 0) /
    controlLogs.length;

  const experimentalMetric =
    experimentalLogs.reduce((sum, l) => sum + l.relevanceScore, 0) /
    experimentalLogs.length;

  const improvement = experimentalMetric - controlMetric;

  // Compute statistical significance (Welch's t-test on per-query scores)
  const pValue = await computeTTest(
    controlLogs.map(l => l.relevanceScore),
    experimentalLogs.map(l => l.relevanceScore)
  );

  // Recommend action
  let recommendation: 'rollout' | 'continue_test' | 'rollback';

  if (pValue < 0.05 && improvement > 0.02) {
    recommendation = 'rollout'; // Statistically significant improvement
  } else if (pValue < 0.1) {
    recommendation = 'continue_test'; // Trending but not significant yet
  } else {
    recommendation = 'rollback'; // No improvement or worse
  }

  return {
    controlMetric,
    experimentalMetric,
    improvement,
    pValue,
    recommendation,
  };
}

A/B testing prevents regressions and validates improvements before deployment.
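The `computeTTest` helper is assumed above. As a dependency-free sketch, Welch's t-test over the two arms' per-query scores, with a normal approximation for the two-sided p-value (reasonable once each arm has a few hundred observations), could look like:

```typescript
// Welch's t-test on two samples of per-query scores. Returns a two-sided
// p-value using the standard normal approximation to the t distribution.
function computeTTest(control: number[], experimental: number[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[], m: number) =>
    xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);

  const m1 = mean(control);
  const m2 = mean(experimental);
  const se = Math.sqrt(
    variance(control, m1) / control.length +
      variance(experimental, m2) / experimental.length
  );
  const t = Math.abs(m2 - m1) / se;

  // Abramowitz-Stegun polynomial approximation of erf, accurate to ~1e-7.
  const erf = (x: number) => {
    const sign = x < 0 ? -1 : 1;
    const a = Math.abs(x);
    const tt = 1 / (1 + 0.3275911 * a);
    const y =
      1 -
      (((((1.061405429 * tt - 1.453152027) * tt + 1.421413741) * tt -
        0.284496736) * tt + 0.254829592) * tt) * Math.exp(-a * a);
    return sign * y;
  };

  // p = 2 * (1 - Phi(|t|)) where Phi is the standard normal CDF.
  return 2 * (1 - 0.5 * (1 + erf(t / Math.SQRT2)));
}
```

For small samples or heavily skewed score distributions, a proper t-distribution CDF or a permutation test would be more defensible.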

Dataset Flywheel: Bad Examples → Training Data

Turn retrieval failures into training data for fine-tuning.

interface TrainingExample {
  query: string;
  goldDocuments: DocumentChunk[];
  negativeExamples: DocumentChunk[];
  label: 'bad_retrieval' | 'good_retrieval';
}

async function harvestTrainingData(
  logsDb: LogsDatabase
): Promise<TrainingExample[]> {
  // Find queries with low evaluation scores
  const badRetrievals = await logsDb.query({
    collection: 'retrievals',
    filter: { llmQualityScore: { $lt: 0.5 } },
    limit: 1000,
  });

  // Find queries with high evaluation scores
  const goodRetrievals = await logsDb.query({
    collection: 'retrievals',
    filter: { llmQualityScore: { $gt: 0.8 } },
    limit: 1000,
  });

  const trainingData: TrainingExample[] = [];

  // For bad retrievals, create hard examples
  for (const log of badRetrievals) {
    if (log.userCorrection) {
      // User provided correct answer; use as gold document
      trainingData.push({
        query: log.userQuery,
        goldDocuments: [
          { id: generateId(), content: log.userCorrection, similarity: 1.0 },
        ],
        negativeExamples: log.retrievedChunks,
        label: 'bad_retrieval',
      });
    }
  }

  // For good retrievals, use as positive examples
  for (const log of goodRetrievals) {
    trainingData.push({
      query: log.userQuery,
      goldDocuments: log.retrievedChunks,
      negativeExamples: [], // No hard negatives yet
      label: 'good_retrieval',
    });
  }

  return trainingData;
}

This flywheel continuously improves retrieval: failures become training data, that data improves your model, and a better model produces fewer failures.
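To feed a bi-encoder fine-tune, the harvested examples can be flattened into (query, positive, hard-negative) triples, the shape that contrastive losses such as triplet loss or InfoNCE consume. This is a sketch assuming the `TrainingExample` shape above; `toTriples` is a hypothetical helper:

```typescript
interface Triple {
  query: string;
  positive: string; // text the retriever should rank highly
  negative: string; // text it previously ranked highly but should not have
}

// One triple per (gold, negative) pair. The negatives here are "hard"
// by construction: the production retriever actually surfaced them.
function toTriples(examples: Array<{
  query: string;
  goldDocuments: Array<{ content: string }>;
  negativeExamples: Array<{ content: string }>;
}>): Triple[] {
  const triples: Triple[] = [];
  for (const ex of examples) {
    for (const gold of ex.goldDocuments) {
      for (const neg of ex.negativeExamples) {
        triples.push({ query: ex.query, positive: gold.content, negative: neg.content });
      }
    }
  }
  return triples;
}
```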

Chunk Quality Scoring

Not all chunks are equally useful. Score them to prioritize high-quality content.

interface ChunkQuality {
  chunkId: string;
  relevanceFrequency: number; // How often this chunk is retrieved
  userFeedbackRate: number; // % of retrievals where user marked helpful
  overallScore: number;
}

async function scoreChunkQuality(
  vectorDb: VectorStore,
  logsDb: LogsDatabase
): Promise<ChunkQuality[]> {
  const allChunks = await vectorDb.getAllChunks();

  const qualities: ChunkQuality[] = [];

  for (const chunk of allChunks) {
    // Count retrievals for this chunk
    const retrievalLogs = await logsDb.query({
      collection: 'retrievals',
      filter: { 'retrievedChunks.id': chunk.id },
    });

    // Count helpful feedback
    const helpfulFeedback = retrievalLogs.filter(
      l => l.userFeedback === 'helpful'
    ).length;

    const relevanceFrequency = retrievalLogs.length;
    const userFeedbackRate =
      relevanceFrequency > 0 ? helpfulFeedback / relevanceFrequency : 0;

    qualities.push({
      chunkId: chunk.id,
      relevanceFrequency,
      userFeedbackRate,
      overallScore:
        Math.min(relevanceFrequency / 100, 1) * 0.3 + // Cap the frequency term at 1
        userFeedbackRate * 0.7, // Weight user feedback higher
    });
  }

  return qualities.sort((a, b) => b.overallScore - a.overallScore);
}

// Remove low-quality chunks
async function removeChunksBelow(
  vectorDb: VectorStore,
  logsDb: LogsDatabase,
  threshold: number = 0.3
): Promise<number> {
  const qualities = await scoreChunkQuality(vectorDb, logsDb);

  let deleted = 0;
  for (const quality of qualities) {
    if (quality.overallScore < threshold) {
      await vectorDb.delete(quality.chunkId);
      deleted++;
    }
  }

  return deleted;
}

Low-quality chunks add noise. Removing them improves retrieval precision.

Embedding Model Upgrade Strategy

When a new embedding model is released, how do you migrate?

async function evaluateNewEmbeddingModel(
  testQueries: string[],
  testChunks: DocumentChunk[],
  currentEmbedder: Embedder,
  newEmbedder: Embedder
): Promise<{ currentRecall: number; newRecall: number; improvement: number }> {
  let currentHits = 0;
  let newHits = 0;

  for (const query of testQueries) {
    // Embed the query once per model, not once per chunk
    const currentQueryVec = await currentEmbedder.embed(query);
    const newQueryVec = await newEmbedder.embed(query);

    // Rank chunks with current embedder
    const currentScores = await Promise.all(
      testChunks.map(async chunk => ({
        chunk,
        score: cosineSimilarity(
          currentQueryVec,
          await currentEmbedder.embed(chunk.content)
        ),
      }))
    );

    // Rank chunks with new embedder
    const newScores = await Promise.all(
      testChunks.map(async chunk => ({
        chunk,
        score: cosineSimilarity(
          newQueryVec,
          await newEmbedder.embed(chunk.content)
        ),
      }))
    );

    // Check top-10 recall
    const currentTop10 = currentScores
      .sort((a, b) => b.score - a.score)
      .slice(0, 10);

    const newTop10 = newScores
      .sort((a, b) => b.score - a.score)
      .slice(0, 10);

    // Simplification for this sketch: treat testChunks[0] as the gold chunk
    if (currentTop10.some(s => s.chunk === testChunks[0])) currentHits++;
    if (newTop10.some(s => s.chunk === testChunks[0])) newHits++;
  }

  const currentRecall = currentHits / testQueries.length;
  const newRecall = newHits / testQueries.length;

  return {
    currentRecall,
    newRecall,
    improvement: newRecall - currentRecall,
  };
}

// Gradual rollout
async function gradualEmbeddingRollout(
  vectorDb: VectorStore,
  newEmbedder: Embedder,
  batchSize: number = 10_000
): Promise<void> {
  const allChunks = await vectorDb.getAllChunks();
  const totalChunks = allChunks.length;

  for (let i = 0; i < totalChunks; i += batchSize) {
    const batch = allChunks.slice(i, i + batchSize);

    // Re-embed batch
    const reembedded = await Promise.all(
      batch.map(async (chunk) => ({
        ...chunk,
        embedding: await newEmbedder.embed(chunk.content),
      }))
    );

    // Store new embeddings
    await vectorDb.updateEmbeddings(reembedded);

    console.log(
      `Re-embedded ${Math.min(i + batchSize, totalChunks)} / ${totalChunks}`
    );
  }
}

Test new embedders on your production queries before full migration.
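The `cosineSimilarity` helper used above is assumed; for completeness, a standard implementation over raw embedding vectors:

```typescript
// Cosine similarity: dot(a, b) / (||a|| * ||b||). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If your vector store returns pre-normalized embeddings, a plain dot product gives the same ranking at lower cost.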

Prompt Optimization From Production Failures

Analyze generation failures to improve prompts.

interface PromptOptimizationSignal {
  queryId: string;
  userQuery: string;
  retrievedContext: string;
  generatedAnswer: string;
  userCorrection: string;
  issue: 'hallucination' | 'incompleteness' | 'incorrectness' | 'confused_output';
}

async function analyzeGenerationFailures(
  logsDb: LogsDatabase
): Promise<PromptOptimizationSignal[]> {
  const failedQueries = await logsDb.query({
    collection: 'retrievals',
    filter: { userFeedback: 'incorrect' },
  });

  const signals: PromptOptimizationSignal[] = [];

  for (const log of failedQueries) {
    const issue = classifyIssue(log.generatedAnswer, log.userCorrection);

    signals.push({
      queryId: log.queryId,
      userQuery: log.userQuery,
      retrievedContext: log.retrievedChunks.map(c => c.content).join('\n'),
      generatedAnswer: log.generatedAnswer,
      userCorrection: log.userCorrection ?? '',
      issue,
    });
  }

  return signals;
}

function classifyIssue(
  generated: string,
  correction: string
): 'hallucination' | 'incompleteness' | 'incorrectness' | 'confused_output' {
  // Model gave up while the user had a substantive answer: incompleteness
  if (
    generated.includes("I don't know") &&
    correction.length > 100
  ) {
    return 'incompleteness';
  }

  // Check if generated answer contradicts context
  if (hasFactualContradiction(generated, correction)) {
    return 'incorrectness';
  }

  // Check if answer is incoherent
  if (generated.length > 2000 || generated.split('\n').length > 20) {
    return 'confused_output';
  }

  // Assume hallucination
  return 'hallucination';
}
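The `hasFactualContradiction` call above is a placeholder. A cheap heuristic sketch is to compare the numbers each text mentions; an NLI model is the more robust production choice:

```typescript
// Heuristic: if both texts cite numbers but none of the correction's
// figures appear in the generated answer, flag a likely contradiction.
function hasFactualContradiction(generated: string, correction: string): boolean {
  const numbers = (s: string) =>
    (s.match(/\d+(?:\.\d+)?/g) ?? []).map(Number);
  const genNums = new Set(numbers(generated));
  const corrNums = numbers(correction);
  return (
    genNums.size > 0 &&
    corrNums.length > 0 &&
    !corrNums.some(n => genNums.has(n))
  );
}
```

This catches mismatched quantities, dates, and limits; it says nothing about contradictions expressed in prose, which is where an entailment model earns its keep.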

// Iterate on system prompt
async function optimizeSystemPrompt(
  signals: PromptOptimizationSignal[]
): Promise<string> {
  const currentPrompt = getSystemPrompt();

  const issues = signals.reduce(
    (acc, s) => {
      acc[s.issue] = (acc[s.issue] ?? 0) + 1;
      return acc;
    },
    {} as Record<string, number>
  );

  // Generate improvement prompt
  const improvementPrompt = `
Current system prompt has issues:
${Object.entries(issues)
  .map(([issue, count]) => `- ${issue}: ${count} cases`)
  .join('\n')}

Current prompt: ${currentPrompt}

Suggest improvements to reduce these issues.`;

  const improved = await llm.generate({ prompt: improvementPrompt });

  return improved;
}

Production failures are your best teachers. Use them to refine prompts systematically.

Checklist

  • Log every retrieval with metadata (query, chunks, answer, feedback)
  • Detect conversation patterns indicating retrieval failure
  • Evaluate all retrievals with LLM-based quality scoring
  • Run A/B tests on major retrieval changes before rollout
  • Convert low-quality retrievals to training data for improvement
  • Score chunks by relevance frequency and user feedback; remove low-quality chunks
  • Test new embedding models on production queries before migration
  • Analyze generation failures; iterate on system prompts
  • Monitor feedback loop; look for improvement trends
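For that last item, a minimal trend check is a weekly average of the LLM quality scores logged earlier. This sketch assumes each record carries a timestamp and an overall score; `weeklyAverages` is a hypothetical helper:

```typescript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Bucket scored retrievals into weekly averages, oldest week first.
// A flat or declining series means the improvement loop has stalled.
function weeklyAverages(
  records: Array<{ timestamp: number; score: number }>
): number[] {
  if (records.length === 0) return [];
  const start = Math.min(...records.map(r => r.timestamp));
  const buckets = new Map<number, { sum: number; count: number }>();
  for (const r of records) {
    const week = Math.floor((r.timestamp - start) / WEEK_MS);
    const b = buckets.get(week) ?? { sum: 0, count: 0 };
    b.sum += r.score;
    b.count++;
    buckets.set(week, b);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([, b]) => b.sum / b.count);
}
```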

Conclusion

Continuous improvement transforms RAG from a static system into a learning one. The key is capturing production signals—user feedback, conversation patterns, automatic evaluation—and using them to drive systematic improvements. A/B tests prevent regressions. The training data flywheel compounds improvements. After a few months of this disciplined approach, you'll have a RAG system that's 20-30% better than day-one performance, all driven by real-world data.