AI Evaluation Frameworks — LLM-as-Judge, DeepEval, and Automated Testing

Introduction

You ship a new prompt version to production. Within hours, user complaints flood in. Response quality degraded, but your tests didn't catch it. Without systematic evaluation, you're flying blind on LLM quality.

This post covers modern evaluation frameworks that use LLMs as judges, metrics libraries like DeepEval, and regression detection in CI pipelines.

LLM-as-Judge: G-Eval and Prometheus

The LLM-as-judge pattern uses a strong model to evaluate responses from another model. Build a golden dataset of <prompt, expected_response> pairs, then score your model's output:

import Anthropic from "@anthropic-ai/sdk";

interface EvaluationCriteria {
  name: string;
  description: string;
  weight: number; // 0-1
}

interface EvaluationResult {
  score: number; // 0-10
  reasoning: string;
  passed: boolean;
  threshold: number;
}

const judgeClient = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function evaluateResponseWithLLMJudge(
  prompt: string,
  generatedResponse: string,
  expectedResponse: string,
  criteria: EvaluationCriteria[]
): Promise<EvaluationResult[]> {
  const criteriaText = criteria
    .map(
      (c) =>
        `- ${c.name} (weight: ${c.weight}): ${c.description}`
    )
    .join("\n");

  const evaluationPrompt = `
You are an expert evaluator. Score the generated response against the criteria.
Return a JSON object with score (0-10), reasoning, and passed (boolean) for each criterion.

Prompt: ${prompt}

Expected Response: ${expectedResponse}

Generated Response: ${generatedResponse}

Evaluation Criteria:
${criteriaText}

Return ONLY valid JSON in this format:
{
  "evaluations": [
    {
      "criterion": "criterion_name",
      "score": 8,
      "reasoning": "why this score",
      "passed": true
    }
  ],
  "overall_score": 8.5,
  "recommendation": "pass or fail"
}
`;

  const message = await judgeClient.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    messages: [{ role: "user", content: evaluationPrompt }],
  });

  const responseText =
    message.content[0].type === "text" ? message.content[0].text : "";
  // Judge models sometimes wrap JSON in prose; extract the outermost object
  const jsonMatch = responseText.match(/\{[\s\S]*\}/);
  const parsed = JSON.parse(jsonMatch ? jsonMatch[0] : responseText);

  return parsed.evaluations.map(
    (item: {
      score: number;
      reasoning: string;
      passed: boolean;
    }) => ({
      score: item.score,
      reasoning: item.reasoning,
      passed: item.passed,
      threshold: 7, // Configurable threshold
    })
  );
}

export { evaluateResponseWithLLMJudge, EvaluationCriteria, EvaluationResult };
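The `EvaluationCriteria` weights above are defined but never applied. A minimal sketch of combining per-criterion judge scores into one weighted overall score; the helper name and the normalization scheme are assumptions, not part of any SDK:

```typescript
interface WeightedScore {
  score: number;  // 0-10 per-criterion score from the judge
  weight: number; // 0-1 relative importance
}

// Combine per-criterion judge scores into a single weighted overall score.
// Weights are normalized, so they need not sum to exactly 1.
function aggregateWeightedScore(results: WeightedScore[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weightedSum / totalWeight;
}

// Example: correctness weighted twice as heavily as style
const overall = aggregateWeightedScore([
  { score: 9, weight: 0.6 }, // correctness
  { score: 6, weight: 0.3 }, // style
]);
// (9 * 0.6 + 6 * 0.3) / 0.9, which works out to 8
```

Normalizing by the weight sum keeps the result on the same 0-10 scale as the per-criterion scores even when weights don't sum to 1.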

DeepEval Metrics for Automated Assessment

DeepEval is a Python library that provides pre-built metrics for common evaluation scenarios: answer relevancy, faithfulness, contextual precision, and more. The TypeScript sketch below mirrors its core metric interfaces so you can see what each metric measures:

interface DeepEvalConfig {
  modelName: string;
  apiKey: string;
}

class DeepEvalMetrics {
  private config: DeepEvalConfig;

  constructor(config: DeepEvalConfig) {
    this.config = config;
  }

  async evaluateAnswerRelevancy(
    question: string,
    answer: string
  ): Promise<number> {
    // Checks if answer is relevant to the question
    // Score: 0-1, measures how directly the answer addresses the question

    const relevancyScore = await this.scoreRelevancy(question, answer);
    return relevancyScore;
  }

  async evaluateFaithfulness(
    context: string,
    answer: string
  ): Promise<number> {
    // Checks if answer is faithful to context
    // Score: 0-1, measures if answer contradicts or goes beyond context

    const faithfulnessScore = await this.scoreFaithfulness(
      context,
      answer
    );
    return faithfulnessScore;
  }

  async evaluateContextualPrecision(
    question: string,
    answer: string,
    context: string
  ): Promise<number> {
    // Checks if the answer relies only on the necessary context
    // Score: 0-1, higher is better (less irrelevant context used)

    const precisionScore = await this.scorePrecision(
      question,
      answer,
      context
    );
    return precisionScore;
  }

  async evaluateCitation(
    answer: string,
    context: string
  ): Promise<number> {
    // Checks if answer properly cites the context
    // Score: 0-1

    const citationScore = await this.scoreCitation(answer, context);
    return citationScore;
  }

  private async scoreRelevancy(
    question: string,
    answer: string
  ): Promise<number> {
    // Implementation would call evaluator model
    return Math.random();
  }

  private async scoreFaithfulness(
    context: string,
    answer: string
  ): Promise<number> {
    // Placeholder: real implementation calls the evaluator model
    return Math.random();
  }

  private async scorePrecision(
    question: string,
    answer: string,
    context: string
  ): Promise<number> {
    // Placeholder: real implementation calls the evaluator model
    return Math.random();
  }

  private async scoreCitation(
    answer: string,
    context: string
  ): Promise<number> {
    // Placeholder: real implementation calls the evaluator model
    return Math.random();
  }
}

export { DeepEvalMetrics, DeepEvalConfig };
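In a test suite, these metric scores are typically gated against per-metric thresholds. A minimal sketch, assuming thresholds you choose yourself; the `MetricResult` shape and `gateMetrics` helper are illustrative, not DeepEval APIs:

```typescript
// One verdict per metric: a score (0-1) checked against its own threshold.
interface MetricResult {
  name: string;
  score: number;     // 0-1, from the metric
  threshold: number; // minimum acceptable score
}

// Fail the test case if any metric falls below its threshold,
// and report which ones failed for debugging.
function gateMetrics(results: MetricResult[]): {
  passed: boolean;
  failures: string[];
} {
  const failures = results
    .filter((r) => r.score < r.threshold)
    .map((r) => `${r.name}: ${r.score.toFixed(2)} < ${r.threshold}`);
  return { passed: failures.length === 0, failures };
}

const verdict = gateMetrics([
  { name: "answerRelevancy", score: 0.91, threshold: 0.8 },
  { name: "faithfulness", score: 0.64, threshold: 0.7 },
]);
// verdict.passed is false; verdict.failures names faithfulness
```

Collecting all failures, instead of stopping at the first, makes CI logs far more useful when several metrics regress at once.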

RAGAS for RAG Evaluation

If you're building retrieval-augmented generation (RAG), RAGAS (Retrieval-Augmented Generation Assessment) provides metrics for retrieval quality and answer generation. The sketch below outlines what each metric measures:

interface RAGSTestCase {
  question: string;
  groundTruthAnswer: string;
  retrievedContexts: string[];
  generatedAnswer: string;
}

interface RAGSMetrics {
  retrievalPrecision: number;
  retrievalRecall: number;
  answerRelevancy: number;
  answerFaithfulness: number;
  contextRelevancy: number;
  averageScore: number;
}

class RAGSEvaluator {
  async evaluateRAGPipeline(
    testCase: RAGSTestCase
  ): Promise<RAGSMetrics> {
    // Measure how well retrieval selected relevant documents
    const retrievalPrecision = await this.calculateRetrievalPrecision(
      testCase.retrievedContexts,
      testCase.groundTruthAnswer
    );

    const retrievalRecall = await this.calculateRetrievalRecall(
      testCase.retrievedContexts,
      testCase.groundTruthAnswer
    );

    // Measure if generated answer matches ground truth
    const answerRelevancy = await this.compareAnswers(
      testCase.generatedAnswer,
      testCase.groundTruthAnswer
    );

    // Measure if answer is faithful to retrieved context
    const answerFaithfulness = await this.checkFaithfulness(
      testCase.generatedAnswer,
      testCase.retrievedContexts
    );

    // Measure if contexts are relevant to question
    const contextRelevancy = await this.checkContextRelevance(
      testCase.question,
      testCase.retrievedContexts
    );

    const averageScore =
      (retrievalPrecision +
        retrievalRecall +
        answerRelevancy +
        answerFaithfulness +
        contextRelevancy) /
      5;

    return {
      retrievalPrecision,
      retrievalRecall,
      answerRelevancy,
      answerFaithfulness,
      contextRelevancy,
      averageScore,
    };
  }

  private async calculateRetrievalPrecision(
    contexts: string[],
    answer: string
  ): Promise<number> {
    // Count how many retrieved docs are actually relevant
    return 0.85;
  }

  private async calculateRetrievalRecall(
    contexts: string[],
    answer: string
  ): Promise<number> {
    // Of all relevant docs, what percentage did we retrieve?
    return 0.72;
  }

  private async compareAnswers(
    generated: string,
    groundTruth: string
  ): Promise<number> {
    // Semantic similarity between generated and ground-truth answers
    return 0.88;
  }

  private async checkFaithfulness(
    answer: string,
    contexts: string[]
  ): Promise<number> {
    // Fraction of answer claims supported by the retrieved contexts
    return 0.92;
  }

  private async checkContextRelevance(
    question: string,
    contexts: string[]
  ): Promise<number> {
    // How relevant the retrieved contexts are to the original question
    return 0.81;
  }
}

export { RAGSEvaluator, RAGSTestCase, RAGSMetrics };
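The stubbed scorers above return constants. As one illustration of what a real scorer might compute, here is a naive lexical approximation of retrieval precision. Real RAGAS uses LLM-based claim extraction, so treat this purely as a cheap stand-in; the function name and overlap heuristic are assumptions:

```typescript
// Naive lexical stand-in for retrieval precision: a retrieved context counts
// as relevant if it shares enough content words (longer than 3 characters)
// with the ground-truth answer. Precision = relevant contexts / all contexts.
function lexicalRetrievalPrecision(
  contexts: string[],
  groundTruth: string,
  minOverlap = 2
): number {
  if (contexts.length === 0) return 0;
  const truthWords = new Set(
    groundTruth
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w.length > 3)
  );
  const relevant = contexts.filter((ctx) => {
    const overlap = ctx
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => truthWords.has(w)).length;
    return overlap >= minOverlap;
  });
  return relevant.length / contexts.length;
}
```

A lexical heuristic like this misses paraphrases, which is exactly why RAGAS and DeepEval lean on an LLM judge for these metrics.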

Building and Maintaining a Golden Dataset

Your evaluation pipeline is only as good as your golden dataset. Structure it for scale:

interface GoldenDatasetExample {
  id: string;
  prompt: string;
  expectedOutput: string;
  category: string; // e.g., "summarization", "qa", "reasoning"
  difficulty: "easy" | "medium" | "hard";
  metadata: Record<string, unknown>;
}

class GoldenDatasetManager {
  async createGoldenDataset(
    examples: GoldenDatasetExample[]
  ): Promise<void> {
    // Store in database (PostgreSQL, Mongo, etc)
    // Ensure examples are representative of production traffic
    console.log(`Created golden dataset with ${examples.length} examples`);
  }

  async evaluatePromptVersion(
    newPrompt: string,
    examples: GoldenDatasetExample[]
  ): Promise<{ passedCount: number; failedCount: number; score: number }> {
    let passedCount = 0;
    let failedCount = 0;

    for (const example of examples) {
      // During rapid iteration, run "hard" examples only ~10% of the time
      if (example.difficulty === "hard" && Math.random() < 0.9) {
        continue;
      }

      // In practice, generate a response with the new prompt here;
      // "mock_response" is a stand-in for that model call
      const result = await evaluateResponseWithLLMJudge(
        example.prompt,
        "mock_response",
        example.expectedOutput,
        [{ name: "correctness", description: "Answer is correct", weight: 1 }]
      );

      if (result.every((r) => r.passed)) {
        passedCount++;
      } else {
        failedCount++;
      }
    }

    const total = passedCount + failedCount;
    const score = total === 0 ? 0 : passedCount / total;
    return { passedCount, failedCount, score };
  }
}

export { GoldenDatasetManager, GoldenDatasetExample };
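As a golden dataset grows, quick evaluation runs stay representative if you sample evenly across categories so no single category dominates. A minimal stratified-sampling sketch; the helper name and the deterministic slice are assumptions (in practice you would shuffle each bucket first):

```typescript
// Cap the number of examples drawn from each category so quick runs
// cover every category instead of over-sampling the largest one.
interface SampleExample {
  id: string;
  category: string;
}

function stratifiedSample<T extends SampleExample>(
  examples: T[],
  perCategory: number
): T[] {
  // Group examples by category
  const byCategory = new Map<string, T[]>();
  for (const ex of examples) {
    const bucket = byCategory.get(ex.category) ?? [];
    bucket.push(ex);
    byCategory.set(ex.category, bucket);
  }
  // Take up to perCategory examples from each bucket
  const sample: T[] = [];
  for (const bucket of Array.from(byCategory.values())) {
    sample.push(...bucket.slice(0, perCategory));
  }
  return sample;
}
```

The same grouping also gives you per-category pass rates for free, which is usually more actionable than one aggregate score.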

Automated Eval in CI Pipeline

Wire evaluation into your CI/CD so every prompt change is tested:

interface CIPipelineConfig {
  goldenDatasetPath: string;
  regressionThreshold: number; // e.g., 0.95 (5% regression allowed)
  performanceThresholdMs: number;
}

class EvaluationCIPipeline {
  constructor(private config: CIPipelineConfig) {}

  async runEvaluationGate(
    currentBranchVersion: string,
    mainBranchVersion: string
  ): Promise<boolean> {
    // Stub: in practice, load examples from this.config.goldenDatasetPath
    const examples: GoldenDatasetExample[] = [];

    // Evaluate both versions
    const currentScore = await this.evaluatePromptVersion(
      currentBranchVersion,
      examples
    );
    const mainScore = await this.evaluatePromptVersion(
      mainBranchVersion,
      examples
    );

    // Check for regression
    const regression = 1 - currentScore.score / mainScore.score;
    const isRegression = regression > (1 - this.config.regressionThreshold);

    if (isRegression) {
      console.error(
        `Regression detected: ${(regression * 100).toFixed(2)}% drop in score`
      );
      return false;
    }

    console.log(`Evaluation passed. Score: ${currentScore.score.toFixed(3)}`);
    return true;
  }

  private async evaluatePromptVersion(
    version: string,
    examples: GoldenDatasetExample[]
  ): Promise<{ score: number }> {
    return { score: 0.92 };
  }
}

export { EvaluationCIPipeline, CIPipelineConfig };
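The regression math inside runEvaluationGate is easy to get subtly wrong, so it helps to pull it into a pure, unit-testable helper. A sketch that keeps the same convention as CIPipelineConfig, where a regressionThreshold of 0.95 tolerates up to a 5% relative score drop; the helper name is an assumption:

```typescript
// Pure regression check, unit-testable without any model calls.
// regressionThreshold of 0.95 means a relative score drop of up to
// 5% is tolerated before the gate fails.
function checkRegression(
  candidateScore: number,
  baselineScore: number,
  regressionThreshold: number
): { relativeDrop: number; isRegression: boolean } {
  const relativeDrop = 1 - candidateScore / baselineScore;
  return {
    relativeDrop,
    isRegression: relativeDrop > 1 - regressionThreshold,
  };
}

// In a CI entrypoint, fail the build on regression:
// if (checkRegression(current, main, 0.95).isRegression) process.exit(1);
```

Keeping the check pure means the threshold logic gets its own fast unit tests, separate from the slow, model-calling evaluation runs.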

Conclusion

Systematic evaluation catches quality regressions before users see them. Use LLM-as-judge for custom criteria, DeepEval for standard metrics, RAGAS for RAG systems, and golden datasets for reproducibility. Integrate evaluation into CI so every prompt change is validated.

Build evaluation early. The cost of scoring 100 examples is negligible compared to shipping a bad prompt to production.