Evaluating Your RAG Pipeline — RAGAS, Faithfulness, and Answer Quality Metrics
Introduction
You can't improve what you don't measure. Yet most RAG systems ship without proper evaluation infrastructure, leading to silent quality degradation in production.
This post covers RAGAS (Retrieval-Augmented Generation Assessment), LLM-as-judge evaluation, and automated quality monitoring.
- RAGAS Framework
- Building a Golden Q&A Dataset
- Retrieval Evaluation
- Generation Quality Evaluation
- Continuous Evaluation in CI/CD
- Production Monitoring
- Checklist
- Conclusion
RAGAS Framework
RAGAS measures four critical dimensions of RAG quality:
interface RAGASMetrics {
faithfulness: number; // 0-1: Is answer grounded in context?
answerRelevancy: number; // 0-1: Does answer address question?
contextPrecision: number; // 0-1: Are retrieved chunks relevant?
contextRecall: number; // 0-1: Are all relevant chunks retrieved?
}
interface RAGSample {
query: string;
groundTruth: string; // Expected answer
retrievedContext: string[]; // Retrieved documents
generatedAnswer: string; // Model output
}
// Faithfulness: Is the answer supported by retrieved context?
async function computeFaithfulness(
sample: RAGSample,
llm: LLMClient
): Promise<number> {
const faithfulnessPrompt = `
Given the question, generated answer, and context chunks, determine if the answer
is grounded in the provided context.
Question: "${sample.query}"
Answer: "${sample.generatedAnswer}"
Context:
${sample.retrievedContext.join('\n---\n')}
Is the answer entirely supported by the context? Respond with a JSON:
{
"faithfulness_score": 0.0 to 1.0,
"reasoning": "explanation"
}`;
const response = await llm.generate({
messages: [{ role: 'user', content: faithfulnessPrompt }],
maxTokens: 200,
});
const parsed = JSON.parse(response.text);
return parsed.faithfulness_score;
}
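The bare `JSON.parse` on judge output is the most fragile line in these evaluators: LLM judges sometimes wrap the JSON in prose or code fences. A tolerant parser is a cheap safeguard. The sketch below is a hypothetical helper (`parseScore` is not part of RAGAS); it extracts the first JSON object it finds and clamps the score to [0, 1]:

```typescript
// Hypothetical helper: pull the first JSON object out of judge output and
// clamp the named score to [0, 1]; return 0 if nothing parseable is found.
function parseScore(text: string, field: string): number {
  const match = text.match(/\{[\s\S]*\}/); // greedy: first "{" to last "}"
  if (!match) return 0;
  try {
    const value = Number(JSON.parse(match[0])[field]);
    return Number.isFinite(value) ? Math.min(1, Math.max(0, value)) : 0;
  } catch {
    return 0;
  }
}

// Handles judge output that wraps the JSON in prose:
const raw = 'Sure! {"faithfulness_score": 0.9, "reasoning": "grounded"}';
console.log(parseScore(raw, 'faithfulness_score')); // 0.9
```

The same helper drops in for the relevancy, precision, and recall parsers below.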
// Answer Relevancy: Does the answer address the question?
async function computeAnswerRelevancy(
sample: RAGSample,
llm: LLMClient
): Promise<number> {
const relevancyPrompt = `
Evaluate how well the generated answer addresses the question.
Question: "${sample.query}"
Generated Answer: "${sample.generatedAnswer}"
Expected Answer: "${sample.groundTruth}"
Rate relevancy 0-1:
- 1.0: Answers the question completely and accurately
- 0.5: Partially answers or has some irrelevant content
- 0.0: Doesn't address the question
Respond with JSON:
{
"relevancy_score": 0.0 to 1.0,
"reasoning": "explanation"
}`;
const response = await llm.generate({
messages: [{ role: 'user', content: relevancyPrompt }],
maxTokens: 200,
});
const parsed = JSON.parse(response.text);
return parsed.relevancy_score;
}
// Context Precision: Are retrieved chunks relevant?
async function computeContextPrecision(
sample: RAGSample,
llm: LLMClient
): Promise<number> {
const precisionPrompt = `
For each retrieved context chunk, determine if it's relevant to answering the question.
Question: "${sample.query}"
Context chunks:
${sample.retrievedContext.map((ctx, i) => `[${i}] ${ctx}`).join('\n\n')}
Respond with JSON:
{
"chunk_relevance": [0.0 to 1.0 for each chunk],
"relevant_count": number,
"total_count": number
}`;
const response = await llm.generate({
messages: [{ role: 'user', content: precisionPrompt }],
maxTokens: 300,
});
const parsed = JSON.parse(response.text);
return parsed.relevant_count / parsed.total_count;
}
// Context Recall: Did we retrieve all relevant chunks?
async function computeContextRecall(
sample: RAGSample,
llm: LLMClient
): Promise<number> {
const recallPrompt = `
Given the expected answer and retrieved context, estimate how many relevant chunks
were retrieved.
Expected Answer: "${sample.groundTruth}"
Retrieved Context:
${sample.retrievedContext.map((ctx, i) => `[${i}] ${ctx}`).join('\n\n')}
What fraction of information needed to construct the expected answer is present
in the retrieved context?
Respond with JSON:
{
"recall_score": 0.0 to 1.0,
"missing_information": "what's missing",
"confidence": 0.0 to 1.0
}`;
const response = await llm.generate({
messages: [{ role: 'user', content: recallPrompt }],
maxTokens: 200,
});
const parsed = JSON.parse(response.text);
return parsed.recall_score;
}
// Compute all RAGAS metrics
async function computeRAGAS(
sample: RAGSample,
llm: LLMClient
): Promise<RAGASMetrics> {
const [faithfulness, answerRelevancy, contextPrecision, contextRecall] =
await Promise.all([
computeFaithfulness(sample, llm),
computeAnswerRelevancy(sample, llm),
computeContextPrecision(sample, llm),
computeContextRecall(sample, llm),
]);
return {
faithfulness,
answerRelevancy,
contextPrecision,
contextRecall,
};
}
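`computeRAGAS` scores a single sample; over a full dataset you typically average the per-sample results into one report. A small aggregation helper (an assumed convenience, not part of RAGAS itself) might look like:

```typescript
interface RAGASMetrics {
  faithfulness: number;
  answerRelevancy: number;
  contextPrecision: number;
  contextRecall: number;
}

// Average per-sample RAGAS scores into a single dataset-level report.
function averageRAGAS(results: RAGASMetrics[]): RAGASMetrics {
  const n = Math.max(1, results.length);
  const mean = (key: keyof RAGASMetrics) =>
    results.reduce((sum, r) => sum + r[key], 0) / n;
  return {
    faithfulness: mean('faithfulness'),
    answerRelevancy: mean('answerRelevancy'),
    contextPrecision: mean('contextPrecision'),
    contextRecall: mean('contextRecall'),
  };
}

const report = averageRAGAS([
  { faithfulness: 1.0, answerRelevancy: 0.5, contextPrecision: 1.0, contextRecall: 0.25 },
  { faithfulness: 0.5, answerRelevancy: 0.5, contextPrecision: 0.0, contextRecall: 0.75 },
]);
console.log(report.faithfulness); // 0.75
```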
Building a Golden Q&A Dataset
Create a validation set for evaluation:
interface GoldenSample {
id: string;
question: string;
expectedAnswer: string;
sources: string[]; // Which documents contain the answer
difficulty: 'easy' | 'medium' | 'hard';
category: string;
}
async function buildGoldenDataset(
documentStore: DocumentStore,
sampleSize: number = 100
): Promise<GoldenSample[]> {
const samples: GoldenSample[] = [];
// Sample diverse documents
const docs = await documentStore.sampleDocuments(sampleSize);
for (const doc of docs) {
    // Use an LLM to generate questions from the document
    // (assumes a module-scoped `llm: LLMClient`, as in the snippets above)
const generatePrompt = `
Read this document and generate 2-3 challenging questions that require specific
information from this text.
Document:
${doc.text}
Respond with JSON:
{
"questions": [
{
"question": "...",
"answer": "...",
"difficulty": "easy|medium|hard"
}
]
}`;
const response = await llm.generate({
messages: [{ role: 'user', content: generatePrompt }],
maxTokens: 500,
});
const parsed = JSON.parse(response.text);
for (const q of parsed.questions) {
samples.push({
id: `golden_${samples.length}`,
question: q.question,
expectedAnswer: q.answer,
sources: [doc.id],
difficulty: q.difficulty,
category: doc.metadata?.category || 'general',
});
}
}
return samples.slice(0, sampleSize);
}
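LLM-generated questions from similar documents tend to repeat themselves, which inflates metrics on the easy duplicates. A cheap token-overlap dedup pass keeps the dataset diverse. This is a suggested post-processing step, not part of the pipeline above, and names like `dedupQuestions` are illustrative:

```typescript
// Jaccard similarity over lowercase token sets.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...ta].filter(t => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Drop questions whose token sets nearly duplicate an earlier question.
function dedupQuestions(questions: string[], threshold = 0.8): string[] {
  const kept: string[] = [];
  for (const q of questions) {
    if (!kept.some(k => jaccard(k, q) >= threshold)) kept.push(q);
  }
  return kept;
}

console.log(dedupQuestions([
  'What is RAG?',
  'what is RAG?',
  'How are embeddings trained?',
])); // keeps the first and the embeddings question
```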
// Store and version golden dataset
interface GoldenDatasetVersion {
version: string;
timestamp: number;
samples: GoldenSample[];
stats: {
totalSamples: number;
byDifficulty: Record<string, number>;
byCategory: Record<string, number>;
};
}
async function saveGoldenDataset(
samples: GoldenSample[],
version: string
): Promise<void> {
const stats = {
totalSamples: samples.length,
byDifficulty: {
easy: samples.filter(s => s.difficulty === 'easy').length,
medium: samples.filter(s => s.difficulty === 'medium').length,
hard: samples.filter(s => s.difficulty === 'hard').length,
},
byCategory: {} as Record<string, number>,
};
for (const sample of samples) {
stats.byCategory[sample.category] =
(stats.byCategory[sample.category] || 0) + 1;
}
const dataset: GoldenDatasetVersion = {
version,
timestamp: Date.now(),
samples,
stats,
};
  // Save to disk/database (assumes `import { promises as fs } from 'node:fs'`)
await fs.writeFile(
`/data/golden_dataset_${version}.json`,
JSON.stringify(dataset, null, 2)
);
}
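The CI/CD section below calls `loadGoldenDataset`, so here is a matching loader for the files written by `saveGoldenDataset`. The `/data` path and file naming mirror the save function, and `GoldenSample` is repeated so the snippet stands alone:

```typescript
import { promises as fs } from 'node:fs';

interface GoldenSample {
  id: string;
  question: string;
  expectedAnswer: string;
  sources: string[];
  difficulty: 'easy' | 'medium' | 'hard';
  category: string;
}

// Load a versioned golden dataset written by saveGoldenDataset.
async function loadGoldenDataset(
  version: string,
  dir: string = '/data'
): Promise<GoldenSample[]> {
  const raw = await fs.readFile(`${dir}/golden_dataset_${version}.json`, 'utf-8');
  const dataset: { samples: GoldenSample[] } = JSON.parse(raw);
  return dataset.samples;
}
```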
Retrieval Evaluation
Measure if your retriever finds relevant documents:
interface RetrievalEvaluationResult {
hitRate: number; // % of queries where relevant doc is in top-k
mrr: number; // Mean Reciprocal Rank
map: number; // Mean Average Precision
ndcg: number; // Normalized Discounted Cumulative Gain
}
async function evaluateRetrieval(
goldenSamples: GoldenSample[],
retriever: (query: string, topK: number) => Promise<Array<{ id: string; score: number }>>,
topK: number = 5
): Promise<RetrievalEvaluationResult> {
const hits: number[] = [];
const ranks: number[] = [];
const aps: number[] = [];
const ndcgs: number[] = [];
for (const sample of goldenSamples) {
// Retrieve documents
const retrieved = await retriever(sample.question, topK);
const retrievedIds = retrieved.map(r => r.id);
// Check if any source appears in top-k
const hit = sample.sources.some(source => retrievedIds.includes(source));
hits.push(hit ? 1 : 0);
// Compute MRR (position of first relevant)
const firstRelevantRank = retrievedIds.findIndex(id =>
sample.sources.includes(id)
);
ranks.push(firstRelevantRank === -1 ? 0 : 1 / (firstRelevantRank + 1));
// Compute MAP (average precision at each position)
let relevantCount = 0;
let apSum = 0;
for (let i = 0; i < retrievedIds.length; i++) {
if (sample.sources.includes(retrievedIds[i])) {
relevantCount++;
apSum += relevantCount / (i + 1);
}
}
aps.push(relevantCount === 0 ? 0 : apSum / Math.min(topK, sample.sources.length));
// Compute NDCG
const dcg = retrievedIds.reduce((sum, id, i) => {
const rel = sample.sources.includes(id) ? 1 : 0;
return sum + rel / Math.log2(i + 2);
}, 0);
const idcg = Array(Math.min(topK, sample.sources.length))
.fill(1)
.reduce((sum, _, i) => sum + 1 / Math.log2(i + 2), 0);
ndcgs.push(idcg === 0 ? 0 : dcg / idcg);
}
return {
hitRate: hits.reduce((a, b) => a + b, 0) / hits.length,
mrr: ranks.reduce((a, b) => a + b, 0) / ranks.length,
map: aps.reduce((a, b) => a + b, 0) / aps.length,
ndcg: ndcgs.reduce((a, b) => a + b, 0) / ndcgs.length,
};
}
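A worked example makes the rank math concrete. Suppose one query has a single relevant document `d2` and the retriever returns `d7, d2, d9`:

```typescript
// One query: the only relevant doc is d2; retrieved top-3 is d7, d2, d9.
const relevant = new Set(['d2']);
const retrieved = ['d7', 'd2', 'd9'];

// Hit@3: d2 appears somewhere in the top 3.
const hit = retrieved.some(id => relevant.has(id));

// Reciprocal rank: the first relevant doc sits at rank 2, so RR = 1/2.
const firstIdx = retrieved.findIndex(id => relevant.has(id));
const rr = firstIdx === -1 ? 0 : 1 / (firstIdx + 1);

// DCG with binary relevance: sum of rel_i / log2(i + 2).
// Only position 2 contributes here: 1 / log2(3) ≈ 0.631.
const dcg = retrieved.reduce(
  (sum, id, i) => sum + (relevant.has(id) ? 1 : 0) / Math.log2(i + 2),
  0
);
// The ideal ordering puts d2 first: IDCG = 1 / log2(2) = 1.
const ndcg = dcg / 1;

console.log(hit, rr, ndcg.toFixed(3)); // true 0.5 "0.631"
```

Averaged over all golden queries, these per-query values become the `hitRate`, `mrr`, and `ndcg` fields above.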
Generation Quality Evaluation
Measure answer quality using multiple approaches:
interface GenerationEvaluationMetrics {
rougeL: number; // Lexical overlap with expected answer
bleu: number; // Precision of n-gram matches
bertScore: number; // Semantic similarity
  exactMatch: number;     // Fraction of answers matching the expected answer exactly
parseability: number; // Can answer be parsed (if structured)
}
// ROUGE-L: Longest common subsequence ratio
function computeRougeL(generated: string, expected: string): number {
const gWords = generated.toLowerCase().split(/\s+/);
const eWords = expected.toLowerCase().split(/\s+/);
const lcs = longestCommonSubsequence(gWords, eWords);
return (2 * lcs) / (gWords.length + eWords.length);
}
// Simplified BLEU: single-n n-gram precision. Full BLEU takes a geometric
// mean over n = 1..4 and applies a brevity penalty.
function computeBLEU(generated: string, expected: string, n: number = 4): number {
const gNgrams = ngramsFromText(generated, n);
const eNgrams = ngramsFromText(expected, n);
const matches = Array.from(gNgrams).filter(ngram => eNgrams.has(ngram)).length;
return matches / Math.max(1, gNgrams.size);
}
// Helper: extract n-grams from text
function ngramsFromText(text: string, n: number): Set<string> {
const tokens = text.toLowerCase().split(/\s+/);
const ngrams = new Set<string>();
for (let i = 0; i <= tokens.length - n; i++) {
ngrams.add(tokens.slice(i, i + n).join(' '));
}
return ngrams;
}
// Helper: longest common subsequence
function longestCommonSubsequence(a: string[], b: string[]): number {
const dp: number[][] = Array(a.length + 1)
.fill(null)
.map(() => Array(b.length + 1).fill(0));
for (let i = 1; i <= a.length; i++) {
for (let j = 1; j <= b.length; j++) {
if (a[i - 1] === b[j - 1]) {
dp[i][j] = dp[i - 1][j - 1] + 1;
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
return dp[a.length][b.length];
}
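To sanity-check the lexical metrics, run them on a pair where the overlap is easy to count by hand. The helper bodies are repeated from above so the snippet runs on its own:

```typescript
// Repeated from the helpers above so this snippet is standalone.
function longestCommonSubsequence(a: string[], b: string[]): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    Array(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}

function computeRougeL(generated: string, expected: string): number {
  const g = generated.toLowerCase().split(/\s+/);
  const e = expected.toLowerCase().split(/\s+/);
  return (2 * longestCommonSubsequence(g, e)) / (g.length + e.length);
}

// "the cat sat on the mat" vs "the cat is on the mat": the LCS is
// "the cat on the mat" (5 tokens), both strings have 6 tokens,
// so ROUGE-L = 2*5 / (6+6) = 10/12 ≈ 0.833.
console.log(
  computeRougeL('the cat sat on the mat', 'the cat is on the mat').toFixed(3)
); // "0.833"
```

The same pair scores 0 on 4-gram BLEU (no shared 4-gram), which is why lexical metrics are best read together rather than in isolation.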
async function evaluateGeneration(
goldenSamples: GoldenSample[],
generator: (query: string, context: string) => Promise<string>,
retriever: (query: string, topK: number) => Promise<string[]>,
topK: number = 5
): Promise<GenerationEvaluationMetrics> {
const metrics = {
rougeL: [] as number[],
bleu: [] as number[],
exactMatch: [] as number[],
};
for (const sample of goldenSamples) {
// Retrieve context
const context = (await retriever(sample.question, topK)).join('\n\n');
// Generate answer
const generated = await generator(sample.question, context);
// Compute metrics
metrics.rougeL.push(computeRougeL(generated, sample.expectedAnswer));
metrics.bleu.push(computeBLEU(generated, sample.expectedAnswer));
metrics.exactMatch.push(
generated.toLowerCase() === sample.expectedAnswer.toLowerCase() ? 1 : 0
);
}
return {
rougeL: metrics.rougeL.reduce((a, b) => a + b, 0) / metrics.rougeL.length,
bleu: metrics.bleu.reduce((a, b) => a + b, 0) / metrics.bleu.length,
bertScore: 0, // Implement via sentence-transformers
exactMatch: metrics.exactMatch.reduce((a, b) => a + b, 0) / metrics.exactMatch.length,
parseability: 1.0, // Implement domain-specific validation
};
}
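Raw exact match is brittle: a trailing period or a casing difference zeroes it out. A common refinement (an assumed addition, not part of the evaluator above) is to normalize both strings before comparing:

```typescript
// Normalize case, punctuation, and whitespace before exact-match comparison.
function normalizeAnswer(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, '') // strip punctuation
    .replace(/\s+/g, ' ')    // collapse whitespace
    .trim();
}

const match =
  normalizeAnswer('The answer is: 42!') === normalizeAnswer('the answer is 42');
console.log(match); // true
```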
Continuous Evaluation in CI/CD
Monitor quality regressions automatically:
interface EvaluationResult {
timestamp: number;
version: string;
metrics: RAGASMetrics & RetrievalEvaluationResult & GenerationEvaluationMetrics;
regressions: string[];
}
async function continuousEvaluation(
goldenDataset: GoldenSample[],
ragPipeline: {
retriever: (q: string, k: number) => Promise<Array<{ id: string; text: string }>>;
generator: (q: string, c: string) => Promise<string>;
},
thresholds: Record<string, number> = {
faithfulness: 0.8,
answerRelevancy: 0.75,
contextPrecision: 0.7,
hitRate: 0.85,
}
): Promise<EvaluationResult> {
  const regressions: string[] = [];
  // Evaluate every golden sample and average the per-sample RAGAS scores
  // (assumes a module-scoped `llm: LLMClient`)
  const perSample: RAGASMetrics[] = [];
  for (const sample of goldenDataset) {
    const retrieved = await ragPipeline.retriever(sample.question, 5);
    const context = retrieved.map(r => r.text).join('\n\n');
    const generated = await ragPipeline.generator(sample.question, context);
    const ragasSample: RAGSample = {
      query: sample.question,
      groundTruth: sample.expectedAnswer,
      retrievedContext: retrieved.map(r => r.text),
      generatedAnswer: generated,
    };
    perSample.push(await computeRAGAS(ragasSample, llm));
  }
  const avg = (key: keyof RAGASMetrics) =>
    perSample.reduce((sum, m) => sum + m[key], 0) / Math.max(1, perSample.length);
  const metrics: RAGASMetrics = {
    faithfulness: avg('faithfulness'),
    answerRelevancy: avg('answerRelevancy'),
    contextPrecision: avg('contextPrecision'),
    contextRecall: avg('contextRecall'),
  };
  // Check thresholds
  if (metrics.faithfulness < thresholds.faithfulness) {
    regressions.push(
      `Faithfulness ${metrics.faithfulness.toFixed(2)} < threshold ${thresholds.faithfulness}`
    );
  }
  if (metrics.answerRelevancy < thresholds.answerRelevancy) {
    regressions.push(
      `Answer relevancy ${metrics.answerRelevancy.toFixed(2)} < threshold ${thresholds.answerRelevancy}`
    );
  }
  if (metrics.contextPrecision < thresholds.contextPrecision) {
    regressions.push(
      `Context precision ${metrics.contextPrecision.toFixed(2)} < threshold ${thresholds.contextPrecision}`
    );
  }
  return {
    timestamp: Date.now(),
    version: process.env.COMMIT_SHA || 'unknown',
    // Only RAGAS metrics are computed here; merge in the retrieval and
    // generation results from the evaluators above as needed.
    metrics: metrics as any,
    regressions,
  };
}
// Integration with GitHub Actions
async function cicdEvaluation(): Promise<void> {
const goldenDataset = await loadGoldenDataset('v1.0');
const result = await continuousEvaluation(goldenDataset, {
retriever: async (q: string, k: number) => {
// Call deployed retriever
return [];
},
generator: async (q: string, c: string) => {
// Call deployed generator
return '';
},
});
if (result.regressions.length > 0) {
console.error('Quality regressions detected:');
result.regressions.forEach(r => console.error(` - ${r}`));
process.exit(1);
}
console.log('✓ All quality thresholds passed');
}
Production Monitoring
Track quality metrics continuously:
interface ProductionMetrics {
timestamp: number;
queryCount: number;
avgFaithfulness: number;
avgContextPrecision: number;
avgRetrievalLatency: number;
avgGenerationLatency: number;
  hallucinations: number; // Flagged when sampled faithfulness falls below a threshold
noResultQueries: number; // Queries with empty retrieval
}
async function monitorProduction(
sampleRate: number = 0.1 // Evaluate 10% of production queries
): Promise<void> {
const metrics: Partial<ProductionMetrics> = {
timestamp: Date.now(),
queryCount: 0,
hallucinations: 0,
noResultQueries: 0,
};
  // Listen to production queries (assumes an application-level `eventBus`
  // and a module-scoped `llm: LLMClient`)
  eventBus.on('rag.query', async (event: {
query: string;
context: string[];
answer: string;
latencies: { retrieval: number; generation: number };
}) => {
metrics.queryCount = (metrics.queryCount || 0) + 1;
    // Count empty retrievals on every query, not just sampled ones
    if (event.context.length === 0) {
      metrics.noResultQueries = (metrics.noResultQueries || 0) + 1;
    }
    // Sample-based evaluation
    if (Math.random() < sampleRate) {
      const sample: RAGSample = {
        query: event.query,
        groundTruth: '', // Not available in production
        retrievedContext: event.context,
        generatedAnswer: event.answer,
      };
      const faithfulness = await computeFaithfulness(sample, llm);
      if (faithfulness < 0.5) {
        metrics.hallucinations = (metrics.hallucinations || 0) + 1;
      }
metrics.avgFaithfulness = (metrics.avgFaithfulness || 0) + faithfulness;
metrics.avgRetrievalLatency =
(metrics.avgRetrievalLatency || 0) + event.latencies.retrieval;
metrics.avgGenerationLatency =
(metrics.avgGenerationLatency || 0) + event.latencies.generation;
}
});
  // Periodic reporting. The sampled-query count is estimated from the total
  // query count and sample rate; track it explicitly if you need exact averages.
  setInterval(() => {
    const queriesSampled = Math.max(1, Math.floor((metrics.queryCount || 0) * sampleRate));
    console.log('Production Metrics:', {
      ...metrics,
      avgFaithfulness: (metrics.avgFaithfulness || 0) / queriesSampled,
      avgRetrievalLatency: (metrics.avgRetrievalLatency || 0) / queriesSampled,
      avgGenerationLatency: (metrics.avgGenerationLatency || 0) / queriesSampled,
      hallucinationRate: (metrics.hallucinations || 0) / queriesSampled,
    });
  }, 60000); // Every minute
}
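The accumulator above stores raw sums and derives the sample count at report time; a small running-mean class (a sketch with assumed naming) keeps each average self-contained and sidesteps the divide-by-zero bookkeeping:

```typescript
// Incremental mean: update in O(1) per observation, no stored sums.
class RunningMean {
  private count = 0;
  private mean = 0;

  add(x: number): void {
    this.count++;
    this.mean += (x - this.mean) / this.count;
  }

  get value(): number {
    return this.mean; // 0 until the first observation
  }

  get n(): number {
    return this.count;
  }
}

const m = new RunningMean();
[0.8, 0.9, 0.7].forEach(x => m.add(x));
console.log(m.value.toFixed(2)); // "0.80"
```

One instance per tracked metric (faithfulness, retrieval latency, generation latency) replaces the manual `|| 0` accumulation.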
Checklist
- Build golden dataset with 100+ Q&A pairs
- Implement RAGAS metrics (faithfulness, relevancy, precision, recall)
- Set quality thresholds for each metric
- Measure retrieval hit rate baseline
- Track ROUGE-L and BLEU scores for generation
- Add CI/CD evaluation on every commit
- Implement production sampling (5-10%)
- Set up alerts for metric regressions
- Version your golden dataset
- Monitor hallucination rate via faithfulness score
Conclusion
Quality measurement is the foundation of RAG improvement. Without RAGAS or similar metrics, you're flying blind. Start with a golden dataset and automated evaluation, then layer in production monitoring. The investment pays dividends: you'll catch regressions before users do and make data-driven decisions about where to optimize next.