- Published on
AI Evaluation Frameworks — LLM-as-Judge, DeepEval, and Automated Testing
- Authors

- Name
- Sanjeev Sharma
- @webcoderspeed1
Introduction
You ship a new prompt version to production. Within hours, user complaints flood in. The response quality degraded, but your tests didn't catch it. Without systematic evaluation, you're flying blind on LLM quality.
This post covers modern evaluation frameworks that use LLMs as judges, metrics libraries like DeepEval, and regression detection in CI pipelines.
- LLM-as-Judge: G-Eval and Prometheus
- DeepEval Metrics for Automated Assessment
- RAGAS for RAG Evaluation
- Building and Maintaining a Golden Dataset
- Automated Eval in CI Pipeline
- Conclusion
LLM-as-Judge: G-Eval and Prometheus
The LLM-as-judge pattern uses a strong model to evaluate responses from another model. Build a golden dataset of <prompt, expected_response> pairs, then score your model's output:
import Anthropic from "@anthropic-ai/sdk";
interface EvaluationCriteria {
name: string;
description: string;
weight: number; // 0-1
}
interface EvaluationResult {
score: number; // 0-10
reasoning: string;
passed: boolean;
threshold: number;
}
const judgeClient = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
async function evaluateResponseWithLLMJudge(
prompt: string,
generatedResponse: string,
expectedResponse: string,
criteria: EvaluationCriteria[]
): Promise<EvaluationResult[]> {
const criteriaText = criteria
.map(
(c) =>
`- ${c.name} (weight: ${c.weight}): ${c.description}`
)
.join("\n");
const evaluationPrompt = `
You are an expert evaluator. Score the generated response against the criteria.
Return a JSON object with score (0-10), reasoning, and passed (boolean).
Prompt: ${prompt}
Expected Response: ${expectedResponse}
Generated Response: ${generatedResponse}
Evaluation Criteria:
${criteriaText}
Return ONLY valid JSON in this format:
{
"evaluations": [
{
"criterion": "criterion_name",
"score": 8,
"reasoning": "why this score",
"passed": true
}
],
"overall_score": 8.5,
"recommendation": "pass or fail"
}
`;
const message = await judgeClient.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
messages: [{ role: "user", content: evaluationPrompt }],
});
const responseText =
message.content[0].type === "text" ? message.content[0].text : "";
// Models sometimes wrap JSON in markdown fences; strip them before parsing
const parsed = JSON.parse(
responseText.replace(/^```(?:json)?\s*/, "").replace(/\s*```$/, "")
);
return parsed.evaluations.map(
// Note: "eval" is a reserved word in strict mode, so the parameter is "e"
(e: {
score: number;
reasoning: string;
passed: boolean;
}) => ({
score: e.score,
reasoning: e.reasoning,
passed: e.passed,
threshold: 7, // Configurable threshold
})
);
}
export { evaluateResponseWithLLMJudge, EvaluationCriteria, EvaluationResult };
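The `weight` field on `EvaluationCriteria` is declared above but never applied. A small helper (a sketch, not part of the snippet above) can fold per-criterion scores into a weighted overall score:

```typescript
interface CriterionScore {
  name: string;
  score: number; // 0-10
  weight: number; // 0-1
}

// Weighted average of per-criterion scores; weights are normalized,
// so they do not need to sum to exactly 1.
function weightedOverallScore(scores: CriterionScore[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight === 0) return 0;
  return scores.reduce((sum, s) => sum + s.score * s.weight, 0) / totalWeight;
}

// Example: correctness dominates, style counts for less
const overall = weightedOverallScore([
  { name: "correctness", score: 9, weight: 0.7 },
  { name: "style", score: 6, weight: 0.3 },
]);
console.log(overall.toFixed(2)); // 8.10
```

Normalizing by the weight sum keeps the result on the same 0-10 scale even when criteria weights are tweaked independently.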
DeepEval Metrics for Automated Assessment
DeepEval provides pre-built metrics for common evaluation scenarios. Integrate it into your test suite:
interface DeepEvalConfig {
modelName: string;
apiKey: string;
}
class DeepEvalMetrics {
private config: DeepEvalConfig;
constructor(config: DeepEvalConfig) {
this.config = config;
}
async evaluateAnswerRelevancy(
question: string,
answer: string
): Promise<number> {
// Checks if answer is relevant to the question
// Score: 0-1, measures how directly the answer addresses the question
const relevancyScore = await this.scoreRelevancy(question, answer);
return relevancyScore;
}
async evaluateFaithfulness(
context: string,
answer: string
): Promise<number> {
// Checks if answer is faithful to context
// Score: 0-1, measures if answer contradicts or goes beyond context
const faithfulnessScore = await this.scoreFaithfulness(
context,
answer
);
return faithfulnessScore;
}
async evaluateContextualPrecision(
question: string,
answer: string,
context: string
): Promise<number> {
// Checks whether the relevant parts of the context were actually used
// Score: 0-1, higher is better
const precisionScore = await this.scorePrecision(
question,
answer,
context
);
return precisionScore;
}
async evaluateCitation(
answer: string,
context: string
): Promise<number> {
// Checks if answer properly cites the context
// Score: 0-1
const citationScore = await this.scoreCitation(answer, context);
return citationScore;
}
private async scoreRelevancy(
question: string,
answer: string
): Promise<number> {
// Placeholder: a real implementation would call an evaluator model
return Math.random();
}
private async scoreFaithfulness(
context: string,
answer: string
): Promise<number> {
// Placeholder: a real implementation would call an evaluator model
return Math.random();
}
private async scorePrecision(
question: string,
answer: string,
context: string
): Promise<number> {
// Placeholder: a real implementation would call an evaluator model
return Math.random();
}
private async scoreCitation(
answer: string,
context: string
): Promise<number> {
// Placeholder: a real implementation would call an evaluator model
return Math.random();
}
}
export { DeepEvalMetrics, DeepEvalConfig };
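The private scorers above are stubs. As a rough, model-free stand-in (an illustrative heuristic, not DeepEval's actual metric), token overlap can approximate relevancy and is useful for smoke-testing the pipeline before wiring up an evaluator model:

```typescript
// Crude lexical relevancy: fraction of question tokens that also appear
// in the answer. A real implementation would call an evaluator model.
function lexicalRelevancy(question: string, answer: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter((t) => t.length > 2);
  const questionTokens = new Set(tokenize(question));
  if (questionTokens.size === 0) return 0;
  const answerTokens = new Set(tokenize(answer));
  let overlap = 0;
  for (const t of questionTokens) {
    if (answerTokens.has(t)) overlap++;
  }
  return overlap / questionTokens.size;
}

console.log(
  lexicalRelevancy(
    "What is the capital of France?",
    "The capital of France is Paris."
  )
); // 0.75
```

Swap this out for a judge-model call in production; lexical overlap misses paraphrases entirely.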
RAGAS for RAG Evaluation
If you're building retrieval-augmented generation (RAG), RAGAS (Retrieval-Augmented Generation Assessment) provides metrics for retrieval quality and answer generation:
interface RAGSTestCase {
question: string;
groundTruthAnswer: string;
retrievedContexts: string[];
generatedAnswer: string;
}
interface RAGSMetrics {
retrievalPrecision: number;
retrievalRecall: number;
answerRelevancy: number;
answerFaithfulness: number;
contextRelevancy: number;
averageScore: number;
}
class RAGSEvaluator {
async evaluateRAGPipeline(
testCase: RAGSTestCase
): Promise<RAGSMetrics> {
// Measure how well retrieval selected relevant documents
const retrievalPrecision = await this.calculateRetrievalPrecision(
testCase.retrievedContexts,
testCase.groundTruthAnswer
);
const retrievalRecall = await this.calculateRetrievalRecall(
testCase.retrievedContexts,
testCase.groundTruthAnswer
);
// Measure if generated answer matches ground truth
const answerRelevancy = await this.compareAnswers(
testCase.generatedAnswer,
testCase.groundTruthAnswer
);
// Measure if answer is faithful to retrieved context
const answerFaithfulness = await this.checkFaithfulness(
testCase.generatedAnswer,
testCase.retrievedContexts
);
// Measure if contexts are relevant to question
const contextRelevancy = await this.checkContextRelevance(
testCase.question,
testCase.retrievedContexts
);
const averageScore =
(retrievalPrecision +
retrievalRecall +
answerRelevancy +
answerFaithfulness +
contextRelevancy) /
5;
return {
retrievalPrecision,
retrievalRecall,
answerRelevancy,
answerFaithfulness,
contextRelevancy,
averageScore,
};
}
private async calculateRetrievalPrecision(
contexts: string[],
answer: string
): Promise<number> {
// Count how many retrieved docs are actually relevant
return 0.85;
}
private async calculateRetrievalRecall(
contexts: string[],
answer: string
): Promise<number> {
// Of all relevant docs, what percentage did we retrieve?
return 0.72;
}
private async compareAnswers(
generated: string,
groundTruth: string
): Promise<number> {
// Placeholder score: compare semantic similarity in a real implementation
return 0.88;
}
private async checkFaithfulness(
answer: string,
contexts: string[]
): Promise<number> {
// Placeholder score: verify each claim against the contexts in a real implementation
return 0.92;
}
private async checkContextRelevance(
question: string,
contexts: string[]
): Promise<number> {
// Placeholder score: rate each context against the question in a real implementation
return 0.81;
}
}
export { RAGSEvaluator, RAGSTestCase, RAGSMetrics };
Building and Maintaining a Golden Dataset
Your evaluation pipeline is only as good as your golden dataset. Structure it for scale:
interface GoldenDatasetExample {
id: string;
prompt: string;
expectedOutput: string;
category: string; // e.g., "summarization", "qa", "reasoning"
difficulty: "easy" | "medium" | "hard";
metadata: Record<string, unknown>;
}
class GoldenDatasetManager {
async createGoldenDataset(
examples: GoldenDatasetExample[]
): Promise<void> {
// Store in database (PostgreSQL, Mongo, etc)
// Ensure examples are representative of production traffic
console.log(`Created golden dataset with ${examples.length} examples`);
}
async evaluatePromptVersion(
newPrompt: string,
examples: GoldenDatasetExample[]
): Promise<{ passedCount: number; failedCount: number; score: number }> {
let passedCount = 0;
let failedCount = 0;
for (const example of examples) {
// In rapid iteration, run "hard" examples only ~10% of the time
if (example.difficulty === "hard" && Math.random() > 0.1) {
continue;
}
// Evaluate this example with the new prompt
const result = await evaluateResponseWithLLMJudge(
example.prompt,
"mock_response", // Placeholder: generate a real response with newPrompt here
example.expectedOutput,
[{ name: "correctness", description: "Answer is correct", weight: 1 }]
);
if (result.every((r) => r.passed)) {
passedCount++;
} else {
failedCount++;
}
}
const total = passedCount + failedCount;
const score = total > 0 ? passedCount / total : 0;
return { passedCount, failedCount, score };
}
}
export { GoldenDatasetManager, GoldenDatasetExample };
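"Representative of production traffic" is easy to claim and easy to drift away from. A small helper (illustrative, not part of the manager above) makes category skew visible before a run:

```typescript
interface DatasetExample {
  category: string; // e.g., "summarization", "qa", "reasoning"
  difficulty: "easy" | "medium" | "hard";
}

// Counts examples per category so a skewed dataset is easy to spot
// and rebalance against observed production traffic.
function categoryDistribution(
  examples: DatasetExample[]
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const ex of examples) {
    counts[ex.category] = (counts[ex.category] ?? 0) + 1;
  }
  return counts;
}

console.log(
  categoryDistribution([
    { category: "qa", difficulty: "easy" },
    { category: "qa", difficulty: "hard" },
    { category: "summarization", difficulty: "medium" },
  ])
); // { qa: 2, summarization: 1 }
```

Comparing this distribution against production request categories tells you which slices of the dataset need more examples.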
Automated Eval in CI Pipeline
Wire evaluation into your CI/CD so every prompt change is tested:
interface CIPipelineConfig {
goldenDatasetPath: string;
regressionThreshold: number; // e.g., 0.95 (5% regression allowed)
performanceThresholdMs: number;
}
class EvaluationCIPipeline {
constructor(private config: CIPipelineConfig) {}
async runEvaluationGate(
currentBranchVersion: string,
mainBranchVersion: string
): Promise<boolean> {
// Load golden dataset (placeholder: read from this.config.goldenDatasetPath)
const examples: GoldenDatasetExample[] = [];
// Evaluate both versions
const currentScore = await this.evaluatePromptVersion(
currentBranchVersion,
examples
);
const mainScore = await this.evaluatePromptVersion(
mainBranchVersion,
examples
);
// Check for regression relative to main (guard against a zero baseline)
const regression =
mainScore.score > 0 ? 1 - currentScore.score / mainScore.score : 0;
const isRegression = regression > 1 - this.config.regressionThreshold;
if (isRegression) {
console.error(
`Regression detected: ${(regression * 100).toFixed(2)}% drop in score`
);
return false;
}
console.log(`Evaluation passed. Score: ${currentScore.score.toFixed(3)}`);
return true;
}
private async evaluatePromptVersion(
version: string,
examples: GoldenDatasetExample[]
): Promise<{ score: number }> {
// Placeholder: run the golden dataset through the judge for this version
return { score: 0.92 };
}
}
export { EvaluationCIPipeline, CIPipelineConfig };
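The regression check reduces to a pure function, which makes the gate itself easy to unit test. A sketch using the same convention as the config above (regressionThreshold = 0.95 allows a 5% drop):

```typescript
// Returns true when the candidate score has dropped by more than the
// allowed fraction relative to baseline (threshold 0.95 => 5% drop allowed).
function isRegression(
  candidateScore: number,
  baselineScore: number,
  regressionThreshold: number
): boolean {
  if (baselineScore === 0) return false; // nothing to regress from
  const drop = 1 - candidateScore / baselineScore;
  return drop > 1 - regressionThreshold;
}

console.log(isRegression(0.9, 0.92, 0.95)); // false (~2.2% drop, within 5%)
console.log(isRegression(0.85, 0.92, 0.95)); // true (~7.6% drop)
```

Using a relative drop rather than an absolute score keeps the gate meaningful as the baseline improves over time.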
Conclusion
Systematic evaluation catches quality regressions before users see them. Use LLM-as-judge for custom criteria, DeepEval for standard metrics, RAGAS for RAG systems, and golden datasets for reproducibility. Integrate evaluation into CI so every prompt change is validated.
Build evaluation early. The cost of scoring 100 examples is negligible compared to shipping a bad prompt to production.