Continuous RAG Improvement — Using Production Data to Make Your Pipeline Better

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Most RAG systems are static after deployment, but the best teams treat RAG as a learning system in which every user interaction is a data point. A user corrects an answer: that's a signal. A query retrieves zero results: that's a problem. A user rephrases their question: that's guidance. Mining production data turns RAG from a fixed artifact into a continuously improving pipeline. This post covers the infrastructure and techniques needed to capture, evaluate, and act on these real-world signals.
- Logging Retrieval Quality Signals
- Identifying Bad Retrievals From Conversation Patterns
- Automatic Evaluation on Production Queries
- A/B Testing Retrieval Changes
- Dataset Flywheel: Bad Examples → Training Data
- Chunk Quality Scoring
- Embedding Model Upgrade Strategy
- Prompt Optimization From Production Failures
- Checklist
- Conclusion
Logging Retrieval Quality Signals
Instrument every retrieval with metadata to evaluate quality later.
interface RetrievalLog {
timestamp: number;
queryId: string;
userQuery: string;
retrievedChunks: Array<{
id: string;
content: string;
similarity: number;
}>;
generatedAnswer: string;
userFeedback?: 'helpful' | 'incorrect' | 'irrelevant' | 'incomplete';
userCorrection?: string; // If user provided the right answer
sessionId: string;
userId: string;
}
async function logRetrieval(
query: string,
chunks: DocumentChunk[],
answer: string,
sessionId: string,
userId: string
): Promise<string> {
const queryId = generateId();
const log: RetrievalLog = {
timestamp: Date.now(),
queryId,
userQuery: query,
retrievedChunks: chunks.map(c => ({
id: c.id,
content: c.content,
similarity: c.similarity,
})),
generatedAnswer: answer,
sessionId,
userId,
};
// Store in database for analysis
await logsDb.insert('retrievals', log);
return queryId;
}
// Collect explicit feedback from user
async function recordUserFeedback(
queryId: string,
feedback: 'helpful' | 'incorrect' | 'irrelevant' | 'incomplete',
correction?: string
): Promise<void> {
await logsDb.update('retrievals', { queryId }, {
userFeedback: feedback,
userCorrection: correction,
});
}
Store all logs in a queryable database. Over time, logs become your ground truth for evaluation.
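As a concrete sketch of the analysis those logs enable, here is a small aggregation over rows shaped like `RetrievalLog`. The field names mirror the interface above; the 0.7 similarity cutoff for "zero-result" queries is an illustrative assumption, not a value from the post.

```typescript
// Summarize feedback and retrieval health over a batch of logged queries.
// LogRow is a trimmed-down view of RetrievalLog; the 0.7 cutoff is illustrative.
interface LogRow {
  userQuery: string;
  retrievedChunks: { similarity: number }[];
  userFeedback?: 'helpful' | 'incorrect' | 'irrelevant' | 'incomplete';
}

function summarizeLogs(rows: LogRow[]): {
  feedbackRate: number;   // share of queries with any explicit feedback
  helpfulRate: number;    // helpful / all feedback received
  zeroResultRate: number; // queries where nothing cleared the similarity bar
} {
  const withFeedback = rows.filter(r => r.userFeedback !== undefined);
  const helpful = withFeedback.filter(r => r.userFeedback === 'helpful');
  const zeroResult = rows.filter(
    r => !r.retrievedChunks.some(c => c.similarity >= 0.7)
  );
  return {
    feedbackRate: rows.length ? withFeedback.length / rows.length : 0,
    helpfulRate: withFeedback.length ? helpful.length / withFeedback.length : 0,
    zeroResultRate: rows.length ? zeroResult.length / rows.length : 0,
  };
}
```

Tracking these three numbers week over week is often enough to tell whether a pipeline change helped before any formal A/B test.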
Identifying Bad Retrievals From Conversation Patterns
Watch for red flags in conversation flow.
interface ConversationPattern {
queryId: string;
pattern: 'rephrasing' | 'clarification' | 'complaint' | 'correction';
confidence: number;
}
async function detectConversationPatterns(
messages: Array<{ role: string; content: string }>
): Promise<ConversationPattern[]> {
const patterns: ConversationPattern[] = [];
for (let i = 1; i < messages.length; i++) {
const prev = messages[i - 1];
const curr = messages[i];
if (prev.role === 'assistant' && curr.role === 'user') {
      // Check for rephrasing: user asks a question similar to their PREVIOUS
      // query (compare against messages[i - 2], not the assistant's answer)
      const prevUserMsg =
        i >= 2 && messages[i - 2].role === 'user' ? messages[i - 2] : null;
      if (prevUserMsg && isSimilarQuery(prevUserMsg.content, curr.content)) {
        patterns.push({
          queryId: generateId(),
          pattern: 'rephrasing',
          confidence: computeSimilarity(prevUserMsg.content, curr.content),
        });
      }
// Check for clarification: "What do you mean by..."
if (
curr.content.toLowerCase().includes('what do you mean') ||
curr.content.toLowerCase().includes('can you explain')
) {
patterns.push({
queryId: generateId(),
pattern: 'clarification',
confidence: 0.9,
});
}
// Check for complaints: "That's wrong" or "That doesn't make sense"
if (
curr.content.toLowerCase().includes('wrong') ||
curr.content.toLowerCase().includes('incorrect') ||
        curr.content.toLowerCase().includes("doesn't make sense")
) {
patterns.push({
queryId: generateId(),
pattern: 'complaint',
confidence: 0.85,
});
}
// Check for correction: "Actually, it's..." or "No, I meant..."
if (
curr.content.toLowerCase().includes('actually') ||
curr.content.toLowerCase().includes('i meant')
) {
patterns.push({
queryId: generateId(),
pattern: 'correction',
confidence: 0.8,
});
}
}
}
return patterns;
}
// Alert when patterns indicate retrieval failure
async function monitorPatterns(
sessionId: string,
messages: Array<{ role: string; content: string }>
): Promise<void> {
const patterns = await detectConversationPatterns(messages);
const negativeCount = patterns.filter(
p => p.pattern === 'complaint' || p.pattern === 'correction'
).length;
if (negativeCount >= 2) {
console.warn(`Session ${sessionId} shows retrieval issues; flagged for review`);
await flagSessionForReview(sessionId, 'high_failure_rate');
}
}
Rephrasing and corrections are strong signals that initial retrieval failed.
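The helpers `isSimilarQuery` and `computeSimilarity` are assumed above but never defined. One cheap, dependency-free way to implement them is token-level Jaccard overlap; the tokenizer and the 0.5 threshold below are illustrative choices, and an embedding-based similarity would be more robust.

```typescript
// Token-level Jaccard similarity: a lightweight stand-in for the similarity
// helpers used in detectConversationPatterns. Thresholds are illustrative.
function tokenize(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/\W+/).filter(t => t.length > 2) // drop short filler tokens
  );
}

function computeSimilarity(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let overlap = 0;
  for (const t of ta) if (tb.has(t)) overlap++;
  return overlap / (ta.size + tb.size - overlap); // Jaccard index
}

function isSimilarQuery(a: string, b: string, threshold = 0.5): boolean {
  return computeSimilarity(a, b) >= threshold;
}
```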
Automatic Evaluation on Production Queries
Evaluate retrieval quality using LLM judgment, without requiring user labels.
interface RetrievalQualityScore {
relevance: number; // 0-1: Does the chunk answer the query?
completeness: number; // 0-1: Is the chunk sufficient?
clarity: number; // 0-1: Is it clear and coherent?
overallScore: number;
}
async function evaluateRetrievalQuality(
query: string,
chunks: DocumentChunk[],
generatedAnswer: string,
llm: LanguageModel
): Promise<RetrievalQualityScore> {
const evaluation = await llm.generate({
prompt: `Rate the quality of this retrieval for the given query.
Query: "${query}"
Retrieved chunks:
${chunks.map(c => c.content).join('\n---\n')}
Generated answer: "${generatedAnswer}"
Score each dimension 0-10:
- Relevance: Does the chunk directly address the query?
- Completeness: Does the chunk fully answer the question?
- Clarity: Is the information clear and coherent?
Return JSON: { relevance, completeness, clarity }`,
});
const scores = JSON.parse(evaluation);
return {
relevance: scores.relevance / 10,
completeness: scores.completeness / 10,
clarity: scores.clarity / 10,
overallScore:
(scores.relevance + scores.completeness + scores.clarity) / 30,
};
}
async function evaluateBatch(logs: RetrievalLog[]): Promise<void> {
for (const log of logs) {
const quality = await evaluateRetrievalQuality(
log.userQuery,
log.retrievedChunks,
log.generatedAnswer,
llm
);
// Store evaluation
await logsDb.update('retrievals', { queryId: log.queryId }, {
llmQualityScore: quality.overallScore,
relevanceScore: quality.relevance,
completenessScore: quality.completeness,
});
}
}
LLM-based evaluation is cheap, flexible, and needs no human labels: a batch job can score every production query overnight.
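One fragile spot in `evaluateRetrievalQuality` above is `JSON.parse` on raw model output: judge models often wrap JSON in a markdown fence or return out-of-range numbers. A defensive parser keeps the overnight batch from crashing on one bad response. This is a sketch; the function name is mine, not from the post.

```typescript
// Defensively extract and clamp judge scores from raw LLM output.
function parseJudgeScores(raw: string): {
  relevance: number;
  completeness: number;
  clarity: number;
} {
  // Pull out the first {...} span, tolerating markdown fences around it
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error(`No JSON object in judge output: ${raw}`);
  const parsed = JSON.parse(match[0]);
  const clamp = (v: unknown): number => {
    const n = typeof v === 'number' ? v : Number(v);
    if (Number.isNaN(n)) return 0;
    return Math.min(10, Math.max(0, n)); // force into the 0-10 rubric
  };
  return {
    relevance: clamp(parsed.relevance),
    completeness: clamp(parsed.completeness),
    clarity: clamp(parsed.clarity),
  };
}
```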
A/B Testing Retrieval Changes
Test retrieval improvements before rolling out.
interface ABTestConfig {
testId: string;
controlRetriever: Retriever; // Current production
experimentalRetriever: Retriever; // New approach
splitRatio: number; // % of traffic to experiment
duration: number; // milliseconds
metric: 'relevance_score' | 'click_through_rate' | 'user_feedback';
}
async function abTestRetrieval(
query: string,
config: ABTestConfig
): Promise<{
retrievedChunks: DocumentChunk[];
variant: 'control' | 'experimental';
testId: string;
}> {
const assignedVariant = Math.random() < config.splitRatio
? 'experimental'
: 'control';
const retriever =
assignedVariant === 'control'
? config.controlRetriever
: config.experimentalRetriever;
const chunks = await retriever.retrieve(query);
  // Log the assignment for analysis; relevanceScore is joined in later
  // by the LLM evaluation job so analyzeABTest can read it per row
await logsDb.insert('ab_tests', {
testId: config.testId,
query,
variant: assignedVariant,
timestamp: Date.now(),
});
return {
retrievedChunks: chunks,
variant: assignedVariant,
testId: config.testId,
};
}
// Analyze results after test duration
async function analyzeABTest(testId: string): Promise<{
controlMetric: number;
experimentalMetric: number;
improvement: number;
pValue: number; // Statistical significance
recommendation: 'rollout' | 'continue_test' | 'rollback';
}> {
const logs = await logsDb.query({
collection: 'ab_tests',
filter: { testId },
});
const controlLogs = logs.filter(l => l.variant === 'control');
const experimentalLogs = logs.filter(l => l.variant === 'experimental');
// Compute metric (e.g., average relevance score)
const controlMetric =
controlLogs.reduce((sum, l) => sum + l.relevanceScore, 0) /
controlLogs.length;
const experimentalMetric =
experimentalLogs.reduce((sum, l) => sum + l.relevanceScore, 0) /
experimentalLogs.length;
const improvement = experimentalMetric - controlMetric;
// Compute statistical significance (t-test)
const pValue = await computeTTest(controlLogs, experimentalLogs);
// Recommend action
let recommendation: 'rollout' | 'continue_test' | 'rollback';
  if (pValue < 0.05 && improvement > 0.02) {
    recommendation = 'rollout'; // Statistically significant improvement
  } else if (pValue >= 0.05 && improvement > 0) {
    recommendation = 'continue_test'; // Trending positive but not yet significant
  } else {
    recommendation = 'rollback'; // Worse, or improvement too small to justify rollout
  }
return {
controlMetric,
experimentalMetric,
improvement,
pValue,
recommendation,
};
}
A/B testing prevents regressions and validates improvements before deployment.
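`computeTTest` is referenced above but not shown. One dependency-free way to implement it is Welch's two-sample t-test with a normal approximation for the p-value, which is reasonable once each arm has a few hundred observations. This is a sketch operating on raw metric arrays rather than log objects.

```typescript
// Welch's t-test with a normal-approximation p-value (Abramowitz-Stegun
// erf approximation). Suitable for large-sample A/B comparisons.
function welchTTestPValue(a: number[], b: number[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[], m: number) =>
    xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  const ma = mean(a), mb = mean(b);
  const se = Math.sqrt(variance(a, ma) / a.length + variance(b, mb) / b.length);
  if (se === 0) return 1;
  const t = Math.abs(ma - mb) / se;
  // erf approximation, max error ~1.5e-7
  const erf = (x: number): number => {
    const sign = x < 0 ? -1 : 1;
    x = Math.abs(x);
    const u = 1 / (1 + 0.3275911 * x);
    const y =
      1 -
      (((((1.061405429 * u - 1.453152027) * u + 1.421413741) * u -
        0.284496736) * u + 0.254829592) * u) * Math.exp(-x * x);
    return sign * y;
  };
  // Two-sided p-value from the standard normal CDF
  return 2 * (1 - 0.5 * (1 + erf(t / Math.SQRT2)));
}
```

For small samples or heavy-tailed metrics, a proper t-distribution CDF or a bootstrap would be safer than the normal approximation.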
Dataset Flywheel: Bad Examples → Training Data
Turn retrieval failures into training data for fine-tuning.
interface TrainingExample {
query: string;
goldDocuments: DocumentChunk[];
negativeExamples: DocumentChunk[];
label: 'bad_retrieval' | 'good_retrieval';
}
async function harvestTrainingData(
logsDb: LogsDatabase
): Promise<TrainingExample[]> {
// Find queries with low evaluation scores
const badRetrievals = await logsDb.query({
collection: 'retrievals',
filter: { llmQualityScore: { $lt: 0.5 } },
limit: 1000,
});
// Find queries with high evaluation scores
const goodRetrievals = await logsDb.query({
collection: 'retrievals',
filter: { llmQualityScore: { $gt: 0.8 } },
limit: 1000,
});
const trainingData: TrainingExample[] = [];
// For bad retrievals, create hard examples
for (const log of badRetrievals) {
if (log.userCorrection) {
// User provided correct answer; use as gold document
trainingData.push({
query: log.userQuery,
goldDocuments: [
{ id: generateId(), content: log.userCorrection, similarity: 1.0 },
],
negativeExamples: log.retrievedChunks,
label: 'bad_retrieval',
});
}
}
// For good retrievals, use as positive examples
for (const log of goodRetrievals) {
trainingData.push({
query: log.userQuery,
goldDocuments: log.retrievedChunks,
negativeExamples: [], // No hard negatives yet
label: 'good_retrieval',
});
}
return trainingData;
}
This flywheel compounds: failures become training data, the training data fine-tunes your retrieval model, and the improved model produces fewer failures.
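One refinement worth sketching: the `good_retrieval` examples above carry no hard negatives, but the most informative negatives for contrastive fine-tuning are high-similarity chunks that were retrieved yet not relevant. The function below mines them from a ranked candidate list; the shapes and name are illustrative.

```typescript
// Mine hard negatives: high-ranking chunks that were NOT confirmed relevant.
interface RankedChunk {
  id: string;
  content: string;
  similarity: number;
}

function mineHardNegatives(
  ranked: RankedChunk[],  // candidates sorted by similarity, descending
  goldIds: Set<string>,   // chunk ids confirmed relevant
  maxNegatives = 3
): RankedChunk[] {
  // The nearest non-relevant chunks teach the model the finest distinctions
  return ranked.filter(c => !goldIds.has(c.id)).slice(0, maxNegatives);
}
```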
Chunk Quality Scoring
Not all chunks are equally useful. Score them to prioritize high-quality content.
interface ChunkQuality {
chunkId: string;
relevanceFrequency: number; // How often this chunk is retrieved
userFeedbackRate: number; // % of retrievals where user marked helpful
overallScore: number;
}
async function scoreChunkQuality(
vectorDb: VectorStore,
logsDb: LogsDatabase
): Promise<ChunkQuality[]> {
const allChunks = await vectorDb.getAllChunks();
const qualities: ChunkQuality[] = [];
for (const chunk of allChunks) {
// Count retrievals for this chunk
const retrievalLogs = await logsDb.query({
collection: 'retrievals',
filter: { 'retrievedChunks.id': chunk.id },
});
// Count helpful feedback
const helpfulFeedback = retrievalLogs.filter(
l => l.userFeedback === 'helpful'
).length;
const relevanceFrequency = retrievalLogs.length;
const userFeedbackRate =
relevanceFrequency > 0 ? helpfulFeedback / relevanceFrequency : 0;
qualities.push({
chunkId: chunk.id,
relevanceFrequency,
userFeedbackRate,
overallScore:
(relevanceFrequency / 100) * 0.3 +
userFeedbackRate * 0.7, // Weight user feedback higher
});
}
return qualities.sort((a, b) => b.overallScore - a.overallScore);
}
// Remove low-quality chunks
async function removeChunksBelow(
  vectorDb: VectorStore,
  logsDb: LogsDatabase,
  threshold: number = 0.3
): Promise<number> {
  const qualities = await scoreChunkQuality(vectorDb, logsDb);
let deleted = 0;
for (const quality of qualities) {
if (quality.overallScore < threshold) {
await vectorDb.delete(quality.chunkId);
deleted++;
}
}
return deleted;
}
Low-quality chunks add noise. Removing them improves retrieval precision: fewer irrelevant chunks compete for the top-k slots.
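A caveat on the deletion pass above: a raw helpful-rate is unreliable for chunks with only a handful of retrievals (one unlucky retrieval should not doom a chunk). A common guard, assumed here rather than taken from the post, is to rank by the Wilson score lower bound of the feedback rate, which penalizes small samples.

```typescript
// Lower bound of the Wilson score interval for a Bernoulli rate.
// Small samples get pulled strongly toward 0, protecting rarely-seen chunks
// from deletion on thin evidence. z = 1.96 corresponds to 95% confidence.
function wilsonLowerBound(helpful: number, total: number, z = 1.96): number {
  if (total === 0) return 0;
  const p = helpful / total;
  const z2 = z * z;
  return (
    (p + z2 / (2 * total) -
      z * Math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)) /
    (1 + z2 / total)
  );
}
```

With this scoring, a chunk marked helpful once in one retrieval scores far below a chunk marked helpful 90 times in 100, even though its raw rate is higher.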
Embedding Model Upgrade Strategy
When a new embedding model is released, how do you migrate?
async function evaluateNewEmbeddingModel(
testQueries: string[],
testChunks: DocumentChunk[],
currentEmbedder: Embedder,
newEmbedder: Embedder
): Promise<{ currentRecall: number; newRecall: number; improvement: number }> {
let currentHits = 0;
let newHits = 0;
  for (const query of testQueries) {
    // Embed the query once per model (embed is async, as elsewhere in this post)
    const currentQueryVec = await currentEmbedder.embed(query);
    const newQueryVec = await newEmbedder.embed(query);
    // Rank chunks with current embedder (in production, embed and cache the
    // chunks once outside this loop instead of re-embedding per query)
    const currentScores = await Promise.all(
      testChunks.map(async chunk => ({
        chunk,
        score: cosineSimilarity(
          currentQueryVec,
          await currentEmbedder.embed(chunk.content)
        ),
      }))
    );
    // Rank chunks with new embedder
    const newScores = await Promise.all(
      testChunks.map(async chunk => ({
        chunk,
        score: cosineSimilarity(
          newQueryVec,
          await newEmbedder.embed(chunk.content)
        ),
      }))
    );
// Check top-10 recall
const currentTop10 = currentScores
.sort((a, b) => b.score - a.score)
.slice(0, 10);
const newTop10 = newScores
.sort((a, b) => b.score - a.score)
.slice(0, 10);
// Assume gold chunk is first chunk
if (currentTop10.some(s => s.chunk === testChunks[0])) currentHits++;
if (newTop10.some(s => s.chunk === testChunks[0])) newHits++;
}
const currentRecall = currentHits / testQueries.length;
const newRecall = newHits / testQueries.length;
return {
currentRecall,
newRecall,
improvement: newRecall - currentRecall,
};
}
// Gradual rollout
async function gradualEmbeddingRollout(
vectorDb: VectorStore,
newEmbedder: Embedder,
batchSize: number = 10_000
): Promise<void> {
const allChunks = await vectorDb.getAllChunks();
const totalChunks = allChunks.length;
for (let i = 0; i < totalChunks; i += batchSize) {
const batch = allChunks.slice(i, i + batchSize);
// Re-embed batch
const reembedded = await Promise.all(
batch.map(async (chunk) => ({
...chunk,
embedding: await newEmbedder.embed(chunk.content),
}))
);
// Store new embeddings
await vectorDb.updateEmbeddings(reembedded);
console.log(
`Re-embedded ${Math.min(i + batchSize, totalChunks)} / ${totalChunks}`
);
}
}
Test new embedders on your production queries before full migration.
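During a gradual re-embed, the index briefly holds vectors from two models, so monitoring the rollout matters. One lightweight health check, sketched here with an illustrative name, is the top-k overlap between the old and new rankings for the same query: a sudden drop flags a batch embedded with the wrong model or a dimension mismatch.

```typescript
// Fraction of the old top-k result ids that survive into the new top-k.
// 1.0 means identical heads; values near 0 warrant investigation.
function topKOverlap(oldIds: string[], newIds: string[], k = 10): number {
  const oldTop = new Set(oldIds.slice(0, k));
  if (oldTop.size === 0) return 0;
  const shared = newIds.slice(0, k).filter(id => oldTop.has(id));
  return shared.length / oldTop.size;
}
```

Note that some reshuffling is expected (a better embedder should change rankings); the signal to alert on is overlap collapsing toward zero, not modest movement.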
Prompt Optimization From Production Failures
Analyze generation failures to improve prompts.
interface PromptOptimizationSignal {
queryId: string;
userQuery: string;
retrievedContext: string;
generatedAnswer: string;
userCorrection: string;
issue: 'hallucination' | 'incompleteness' | 'incorrectness' | 'confused_output';
}
async function analyzeGenerationFailures(
logsDb: LogsDatabase
): Promise<PromptOptimizationSignal[]> {
const failedQueries = await logsDb.query({
collection: 'retrievals',
filter: { userFeedback: 'incorrect' },
});
const signals: PromptOptimizationSignal[] = [];
for (const log of failedQueries) {
const issue = classifyIssue(log.generatedAnswer, log.userCorrection);
signals.push({
queryId: log.queryId,
userQuery: log.userQuery,
retrievedContext: log.retrievedChunks.map(c => c.content).join('\n'),
generatedAnswer: log.generatedAnswer,
userCorrection: log.userCorrection ?? '',
issue,
});
}
return signals;
}
function classifyIssue(
generated: string,
correction: string
): 'hallucination' | 'incompleteness' | 'incorrectness' | 'confused_output' {
  // Model declined or under-answered while the user had a substantive correction
  if (
    generated.includes("I don't know") &&
    correction.length > 100
  ) {
return 'incompleteness';
}
// Check if generated answer contradicts context
if (hasFactualContradiction(generated, correction)) {
return 'incorrectness';
}
// Check if answer is incoherent
if (generated.length > 2000 || generated.split('\n').length > 20) {
return 'confused_output';
}
// Assume hallucination
return 'hallucination';
}
// Iterate on system prompt
async function optimizeSystemPrompt(
signals: PromptOptimizationSignal[]
): Promise<string> {
const currentPrompt = getSystemPrompt();
const issues = signals.reduce(
(acc, s) => {
acc[s.issue] = (acc[s.issue] ?? 0) + 1;
return acc;
},
{} as Record<string, number>
);
// Generate improvement prompt
const improvementPrompt = `
Current system prompt has issues:
${Object.entries(issues)
.map(([issue, count]) => `- ${issue}: ${count} cases`)
.join('\n')}
Current prompt: ${currentPrompt}
Suggest improvements to reduce these issues.`;
const improved = await llm.generate({ prompt: improvementPrompt });
return improved;
}
Production failures are your best teachers. Use them to refine prompts systematically.
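A practical companion to prompt iteration: freeze each production failure as a regression case and re-run the set after every prompt change, so a fix for hallucinations does not quietly reintroduce incompleteness. The sketch below shows the grading side; `expectedPoints` would come from the user corrections harvested above, and the grader here is a simple substring check standing in for an LLM judge.

```typescript
// Grade a regenerated answer against the key points a failed query was
// supposed to cover. A real pipeline would use an LLM judge; substring
// matching is the simplest deterministic stand-in.
interface RegressionCase {
  query: string;
  context: string;
  expectedPoints: string[]; // facts the answer must mention
}

function gradeAgainstExpectedPoints(answer: string, expected: string[]): boolean {
  const lower = answer.toLowerCase();
  return expected.every(point => lower.includes(point.toLowerCase()));
}
```

Each prompt candidate then gets a pass rate over the frozen cases, making "did this prompt change actually help" a measurable question instead of a judgment call.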
Checklist
- Log every retrieval with metadata (query, chunks, answer, feedback)
- Detect conversation patterns indicating retrieval failure
- Evaluate all retrievals with LLM-based quality scoring
- Run A/B tests on major retrieval changes before rollout
- Convert low-quality retrievals to training data for improvement
- Score chunks by relevance frequency and user feedback; remove low-quality chunks
- Test new embedding models on production queries before migration
- Analyze generation failures; iterate on system prompts
- Monitor feedback loop; look for improvement trends
Conclusion
Continuous improvement transforms RAG from a static system into a learning one. The key is capturing production signals—user feedback, conversation patterns, automatic evaluation—and using them to drive systematic improvements. A/B tests prevent regressions. The training data flywheel compounds improvements. After a few months of this disciplined approach, you'll have a RAG system that's 20-30% better than day-one performance, all driven by real-world data.