Benchmarking LLMs for Your Use Case — Custom Evals Beyond MMLU and HumanEval

Introduction

MMLU, HumanEval, and other standard benchmarks don't measure whether your model performs your specific task. A model scoring 90% on HumanEval might fail basic SQL generation. This guide covers building domain-specific benchmarks that actually predict production performance.

Why Standard Benchmarks Mislead
Building Domain-Specific Benchmarks
Task-Based Evaluation
Multilingual Benchmark Design
Adversarial Examples
Benchmark Contamination Detection
Running Evals with LM Evaluation Harness
Sharing Benchmarks
Checklist
Conclusion

Why Standard Benchmarks Mislead

Standard benchmarks optimize for paper citations, not production utility. MMLU tests world knowledge but not instruction following. HumanEval tests code generation on trivial problems — nothing like real production code. A model might excel at standard benchmarks while failing your specific use case.

The gap between benchmark performance and production performance can be >30%. This happens because:

Distribution shift: your queries don't match benchmark format
Task specificity: standard benchmarks test general capabilities, not your domain
Contamination: models have seen benchmark data during training
Evaluation biases: multiple-choice tests favor memorization over reasoning

Building Domain-Specific Benchmarks

Create benchmarks that mirror production queries exactly:

interface DomainSpecificBenchmark {
  name: string;
  domain: string;
  examples: BenchmarkExample[];
  metrics: MetricDefinition[];
  baselineModel: string;
  baselineScore: number;
  version: string;
}

interface BenchmarkExample {
  id: string;
  input: string;
  expectedOutput: string;
  metadata: {
    difficulty: 'easy' | 'medium' | 'hard';
    category: string;
    source: string;
  };
}

interface MetricDefinition {
  name: string;
  compute: (output: string, expected: string) =&gt; number;
  threshold: number; // Min acceptable score
}

class BenchmarkBuilder {
  private benchmark: DomainSpecificBenchmark;

  constructor(domain: string) {
    this.benchmark = {
      name: '',
      domain,
      examples: [],
      metrics: [],
      baselineModel: '',
      baselineScore: 0,
      version: '1.0'
    };
  }

  addExample(
    input: string,
    expectedOutput: string,
    difficulty: 'easy' | 'medium' | 'hard',
    category: string
  ): this {
    this.benchmark.examples.push({
      id: `${Date.now()}-${Math.random()}`,
      input,
      expectedOutput,
      metadata: {
        difficulty,
        category,
        source: 'custom'
      }
    });
    return this;
  }

  addMetric(
    name: string,
    compute: (output: string, expected: string) =&gt; number,
    threshold: number = 0.8
  ): this {
    this.benchmark.metrics.push({
      name,
      compute,
      threshold
    });
    return this;
  }

  build(): DomainSpecificBenchmark {
    return this.benchmark;
  }
}

// Example: SQL generation benchmark
const sqlBenchmark = new BenchmarkBuilder('SQL')
  .addExample(
    'Find all customers from France who spent &gt;1000 in Q1',
    'SELECT c.* FROM customers c WHERE c.country = $1 AND c.total_spent &gt; $2',
    'medium',
    'filtering'
  )
  .addExample(
    'Count orders per product category',
    'SELECT pc.name, COUNT(o.id) FROM order_items oi JOIN products p ON oi.product_id = p.id JOIN product_categories pc ON p.category_id = pc.id GROUP BY pc.name',
    'hard',
    'aggregation'
  )
  .addMetric(
    'executable',
    (output, expected) =&gt; {
      try {
        validateSQL(output);
        return 1.0;
      } catch {
        return 0.0;
      }
    }
  )
  .addMetric(
    'semantic_match',
    (output, expected) =&gt; {
      // Parse both queries and compare their structure
      const query1 = parseSQL(output);
      const query2 = parseSQL(expected);
      return computeQuerySimilarity(query1, query2);
    }
  )
  .build();

function validateSQL(sql: string): void {
  // Parse and validate SQL syntax
  if (!sql.includes('SELECT')) throw new Error('Invalid SQL');
}

function parseSQL(sql: string): object {
  // Simple parsing logic
  return { raw: sql };
}

function computeQuerySimilarity(q1: object, q2: object): number {
  // Compare query structure
  return 0.85;
}

Task-Based Evaluation

Measure whether the model completes the actual task, not just generates text:

interface TaskEvaluation {
  taskId: string;
  input: string;
  output: string;
  taskCompleted: boolean;
  successReason?: string;
  failureReason?: string;
}

class TaskEvaluator {
  async evaluateCodeGeneration(
    prompt: string,
    generatedCode: string
  ): Promise&lt;TaskEvaluation&gt; {
    // Test 1: Does it parse?
    try {
      eval(generatedCode); // Security warning: only for trusted code
    } catch (e) {
      return {
        taskId: 'codegen',
        input: prompt,
        output: generatedCode,
        taskCompleted: false,
        failureReason: 'Syntax error: ' + e
      };
    }

    // Test 2: Does it have the required function?
    const hasFunction = /function|const.*=/g.test(generatedCode);
    if (!hasFunction) {
      return {
        taskId: 'codegen',
        input: prompt,
        output: generatedCode,
        taskCompleted: false,
        failureReason: 'No function definition found'
      };
    }

    // Test 3: Run against test cases
    const testCases = this.extractTestCases(prompt);
    const allPass = await this.runTestCases(generatedCode, testCases);

    return {
      taskId: 'codegen',
      input: prompt,
      output: generatedCode,
      taskCompleted: allPass,
      successReason: allPass ? 'All tests passed' : undefined,
      failureReason: !allPass ? 'Some tests failed' : undefined
    };
  }

  private extractTestCases(prompt: string): Array&lt;{ input: any; expected: any }&gt; {
    // Extract test cases from prompt comments
    return [];
  }

  private async runTestCases(
    code: string,
    testCases: Array&lt;{ input: any; expected: any }&gt;
  ): Promise&lt;boolean&gt; {
    // Execute code with test inputs and verify outputs
    return true;
  }
}

Multilingual Benchmark Design

If serving global users, test multiple languages:

interface MultilingualBenchmark {
  examples: Array&lt;{
    id: string;
    language: string;
    input: string;
    expectedOutput: string;
  }&gt;;
  languages: string[];
}

class MultilingualBenchmarkBuilder {
  private examples: MultilingualBenchmark['examples'] = [];
  private languages: Set&lt;string&gt; = new Set();

  addExample(
    id: string,
    language: string,
    input: string,
    expectedOutput: string
  ): this {
    this.examples.push({
      id: `${id}-${language}`,
      language,
      input,
      expectedOutput
    });
    this.languages.add(language);
    return this;
  }

  build(): MultilingualBenchmark {
    return {
      examples: this.examples,
      languages: Array.from(this.languages)
    };
  }
}

const multilingBench = new MultilingualBenchmarkBuilder()
  .addExample('greet-1', 'en', 'Say hello', 'Hello!')
  .addExample('greet-1', 'fr', 'Dites bonjour', 'Bonjour!')
  .addExample('greet-1', 'es', 'Saludos', 'Hola!')
  .build();

Adversarial Examples

Test how models fail. Create adversarial cases to find weaknesses:

interface AdversarialTest {
  name: string;
  attackType: 'typo' | 'paraphrase' | 'constraint_violation' | 'injection';
  baselineExample: BenchmarkExample;
  adversarialExample: BenchmarkExample;
  expectedRobustness: number; // &lt;1.0 is OK, indicates how much performance should degrade
}

class AdversarialTestGenerator {
  // Typo perturbations
  addTypos(text: string, typoRate: number = 0.05): string {
    const words = text.split(/\s+/);
    return words
      .map((word) =&gt; {
        if (Math.random() &lt; typoRate &amp;&amp; word.length &gt; 3) {
          // Swap adjacent characters
          const chars = word.split('');
          const idx = Math.floor(Math.random() * (chars.length - 1));
          [chars[idx], chars[idx + 1]] = [chars[idx + 1], chars[idx]];
          return chars.join('');
        }
        return word;
      })
      .join(' ');
  }

  // Paraphrase: rephrase while keeping meaning
  async paraphrase(text: string): Promise&lt;string&gt; {
    // Call LLM to paraphrase
    return text;
  }

  // Constraint violation: try to break model rules
  addConstraintViolation(
    text: string,
    constraint: string
  ): string {
    // If constraint is &quot;under 50 words&quot;, make input way longer
    // If constraint is &quot;no profanity&quot;, add edgy content
    return text;
  }

  // Prompt injection: try to override instructions
  promptInjection(text: string, injectionPayload: string): string {
    return `${text}\n\nIGNORE PREVIOUS INSTRUCTIONS: ${injectionPayload}`;
  }
}

async function evaluateRobustness(
  model: LLMModel,
  benchmark: DomainSpecificBenchmark
): Promise&lt;RobustnessScore&gt; {
  const adversarialGen = new AdversarialTestGenerator();
  let robustnessScores: number[] = [];

  for (const example of benchmark.examples) {
    const baselineOutput = await model.generate(example.input);
    const baselineScore = benchmark.metrics[0].compute(
      baselineOutput,
      example.expectedOutput
    );

    // Test with typos
    const tippyInput = adversarialGen.addTypos(example.input);
    const typoOutput = await model.generate(tippyInput);
    const typoScore = benchmark.metrics[0].compute(typoOutput, example.expectedOutput);

    // Robustness: how much did score degrade?
    const degradation = 1 - typoScore / baselineScore;
    robustnessScores.push(degradation);
  }

  const avgDegradation = robustnessScores.reduce((a, b) =&gt; a + b) / robustnessScores.length;

  return {
    averageDegradation: avgDegradation,
    assessment: avgDegradation &lt; 0.2 ? 'robust' : 'fragile'
  };
}

interface RobustnessScore {
  averageDegradation: number;
  assessment: 'robust' | 'fragile';
}

Benchmark Contamination Detection

Models might have seen your benchmark data during training. Detect this:

class ContaminationDetector {
  async checkForContamination(
    benchmark: DomainSpecificBenchmark,
    model: LLMModel
  ): Promise&lt;ContaminationReport&gt; {
    const suspiciousCases: Array&lt;{
      exampleId: string;
      probability: number;
      evidence: string[];
    }&gt; = [];

    for (const example of benchmark.examples) {
      // Test 1: Perfect reproduction
      const output = await model.generate(example.input);
      if (output === example.expectedOutput) {
        suspiciousCases.push({
          exampleId: example.id,
          probability: 0.95,
          evidence: ['Perfect match to expected output']
        });
        continue;
      }

      // Test 2: Excessive confidence
      const confidence = await this.estimateConfidence(model, example.input);
      if (confidence &gt; 0.95) {
        suspiciousCases.push({
          exampleId: example.id,
          probability: 0.7,
          evidence: ['Unusually high confidence']
        });
      }

      // Test 3: Substring containment
      const chunks = example.expectedOutput.split(' ');
      const substringsFound = chunks.filter((chunk) =&gt;
        output.includes(chunk)
      ).length;
      const matchRate = substringsFound / chunks.length;

      if (matchRate &gt; 0.8) {
        suspiciousCases.push({
          exampleId: example.id,
          probability: 0.6,
          evidence: [`${(matchRate * 100).toFixed(1)}% substring overlap`]
        });
      }
    }

    const contaminationRate = suspiciousCases.length / benchmark.examples.length;

    return {
      contaminationRate,
      assessment:
        contaminationRate &gt; 0.2 ? 'likely contaminated' : 'probably clean',
      suspiciousCases
    };
  }

  private async estimateConfidence(model: LLMModel, input: string): Promise&lt;number&gt; {
    // Generate multiple times and measure consistency
    const outputs: string[] = [];
    for (let i = 0; i &lt; 3; i++) {
      outputs.push(await model.generate(input, { temperature: 0.3 }));
    }

    // If all outputs are identical, confidence is high
    const unanimous = outputs.every((o) =&gt; o === outputs[0]);
    return unanimous ? 0.99 : 0.5;
  }
}

interface ContaminationReport {
  contaminationRate: number;
  assessment: string;
  suspiciousCases: Array&lt;{
    exampleId: string;
    probability: number;
    evidence: string[];
  }&gt;;
}

Running Evals with LM Evaluation Harness

EleutherAI's LM Evaluation Harness automates benchmark execution:

interface HarnessConfig {
  model: string;
  tasks: string[];
  batchSize: number;
  numFewShot: number;
  device: 'cuda' | 'cpu';
}

async function runWithHarness(config: HarnessConfig): Promise&lt;HarnessResult&gt; {
  // Command: lm_eval --model openai --model_args engine=text-davinci-003 --tasks mmlu,hellaswag --num_fewshot 0

  // In TypeScript, wrap CLI or use Python binding
  const results = {
    tasks: {
      mmlu: {
        acc: 0.65,
        acc_stderr: 0.01
      },
      hellaswag: {
        acc_norm: 0.72,
        acc_norm_stderr: 0.02
      }
    }
  };

  return results;
}

interface HarnessResult {
  tasks: Record&lt;string, Record&lt;string, number&gt;&gt;;
}

Publish benchmarks for community contribution:

interface BenchmarkMetadata {
  name: string;
  description: string;
  license: string;
  contributors: string[];
  lastUpdated: Date;
  downloadUrl: string;
  githubUrl?: string;
}

async function publishBenchmark(
  benchmark: DomainSpecificBenchmark,
  metadata: BenchmarkMetadata
): Promise&lt;void&gt; {
  // Upload to Hugging Face Datasets
  const datasetCard = `
# ${metadata.name}

${metadata.description}

## License
${metadata.license}

## Contributors
${metadata.contributors.join(', ')}

## Examples
${benchmark.examples.slice(0, 3).map((ex) =&gt; `- ${ex.input}`).join('\n')}
`;

  console.log('Dataset card:', datasetCard);
  // In production: upload to huggingface_hub
}

Checklist

Conclusion

Standard benchmarks optimize for papers, not production. Build domain-specific benchmarks that measure whether your model solves your actual problem. Include adversarial cases to expose failure modes. Detect contamination to ensure fair evaluation. Task-based metrics beat string similarity. Multilingual evaluation catches regional issues. Shared benchmarks enable reproducibility and community contribution. A weak benchmark can hide poor real-world performance; invest time in evaluation infrastructure early.