Benchmarking LLMs for Your Use Case — Custom Evals Beyond MMLU and HumanEval
Introduction
MMLU, HumanEval, and other standard benchmarks don't measure whether your model performs your specific task. A model scoring 90% on HumanEval might fail basic SQL generation. This guide covers building domain-specific benchmarks that actually predict production performance.
- Why Standard Benchmarks Mislead
- Building Domain-Specific Benchmarks
- Task-Based Evaluation
- Multilingual Benchmark Design
- Adversarial Examples
- Benchmark Contamination Detection
- Running Evals with LM Evaluation Harness
- Sharing Benchmarks
- Checklist
- Conclusion
Why Standard Benchmarks Mislead
Standard benchmarks optimize for paper citations, not production utility. MMLU tests world knowledge but not instruction following. HumanEval tests code generation on trivial problems — nothing like real production code. A model might excel at standard benchmarks while failing your specific use case.
The gap between benchmark performance and production performance can exceed 30 percentage points. This happens because:
- Distribution shift: your queries don't match benchmark format
- Task specificity: standard benchmarks test general capabilities, not your domain
- Contamination: models have seen benchmark data during training
- Evaluation biases: multiple-choice tests favor memorization over reasoning
Building Domain-Specific Benchmarks
Create benchmarks that mirror production queries exactly:
interface DomainSpecificBenchmark {
name: string;
domain: string;
examples: BenchmarkExample[];
metrics: MetricDefinition[];
baselineModel: string;
baselineScore: number;
version: string;
}
interface BenchmarkExample {
id: string;
input: string;
expectedOutput: string;
metadata: {
difficulty: 'easy' | 'medium' | 'hard';
category: string;
source: string;
};
}
interface MetricDefinition {
name: string;
compute: (output: string, expected: string) => number;
threshold: number; // Min acceptable score
}
class BenchmarkBuilder {
private benchmark: DomainSpecificBenchmark;
constructor(domain: string) {
this.benchmark = {
name: '',
domain,
examples: [],
metrics: [],
baselineModel: '',
baselineScore: 0,
version: '1.0'
};
}
addExample(
input: string,
expectedOutput: string,
difficulty: 'easy' | 'medium' | 'hard',
category: string
): this {
this.benchmark.examples.push({
id: `${Date.now()}-${Math.random()}`,
input,
expectedOutput,
metadata: {
difficulty,
category,
source: 'custom'
}
});
return this;
}
addMetric(
name: string,
compute: (output: string, expected: string) => number,
threshold: number = 0.8
): this {
this.benchmark.metrics.push({
name,
compute,
threshold
});
return this;
}
build(): DomainSpecificBenchmark {
return this.benchmark;
}
}
// Example: SQL generation benchmark
const sqlBenchmark = new BenchmarkBuilder('SQL')
.addExample(
  'Find all customers from France who spent >1000 in Q1',
  "SELECT c.* FROM customers c JOIN orders o ON o.customer_id = c.id WHERE c.country = 'France' AND o.created_at >= '2024-01-01' AND o.created_at < '2024-04-01' GROUP BY c.id HAVING SUM(o.amount) > 1000",
'medium',
'filtering'
)
.addExample(
'Count orders per product category',
'SELECT pc.name, COUNT(o.id) FROM order_items oi JOIN products p ON oi.product_id = p.id JOIN product_categories pc ON p.category_id = pc.id GROUP BY pc.name',
'hard',
'aggregation'
)
.addMetric(
'executable',
(output, expected) => {
try {
validateSQL(output);
return 1.0;
} catch {
return 0.0;
}
}
)
.addMetric(
'semantic_match',
(output, expected) => {
// Parse both queries and compare their structure
const query1 = parseSQL(output);
const query2 = parseSQL(expected);
return computeQuerySimilarity(query1, query2);
}
)
.build();
function validateSQL(sql: string): void {
  // Placeholder syntax check; use a real SQL parser in production
  if (!/\bselect\b/i.test(sql)) throw new Error('Invalid SQL');
}
function parseSQL(sql: string): object {
// Simple parsing logic
return { raw: sql };
}
function computeQuerySimilarity(q1: object, q2: object): number {
// Compare query structure
return 0.85;
}
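Once a benchmark is built, a small runner can score any model against it. Below is a minimal, synchronous sketch; `generate` stands in for your model call, and the local interfaces mirror the ones defined above:

```typescript
// Minimal benchmark runner (sketch). Scores a model against every metric
// and flags metrics whose mean score falls below the metric's threshold.
interface RunnerMetric {
  name: string;
  compute: (output: string, expected: string) => number;
  threshold: number;
}

interface RunnerExample {
  input: string;
  expectedOutput: string;
}

function scoreBenchmark(
  examples: RunnerExample[],
  metrics: RunnerMetric[],
  generate: (input: string) => string // stand-in for the model call
): Record<string, { mean: number; passed: boolean }> {
  const report: Record<string, { mean: number; passed: boolean }> = {};
  // Generate once per example, then score the same outputs under every metric
  const outputs = examples.map((ex) => generate(ex.input));
  for (const metric of metrics) {
    const total = examples.reduce(
      (sum, ex, i) => sum + metric.compute(outputs[i], ex.expectedOutput),
      0
    );
    const mean = examples.length > 0 ? total / examples.length : 0;
    report[metric.name] = { mean, passed: mean >= metric.threshold };
  }
  return report;
}
```

Generating once per example and reusing the outputs matters in practice: model calls dominate eval cost, while metric computation is nearly free.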
Task-Based Evaluation
Measure whether the model completes the actual task, not just generates text:
interface TaskEvaluation {
taskId: string;
input: string;
output: string;
taskCompleted: boolean;
successReason?: string;
failureReason?: string;
}
class TaskEvaluator {
async evaluateCodeGeneration(
prompt: string,
generatedCode: string
): Promise<TaskEvaluation> {
// Test 1: Does it parse?
try {
      new Function(generatedCode); // Parses without executing; still only for trusted code
} catch (e) {
return {
taskId: 'codegen',
input: prompt,
output: generatedCode,
taskCompleted: false,
failureReason: 'Syntax error: ' + e
};
}
// Test 2: Does it have the required function?
    const hasFunction = /function\s+\w+|const\s+\w+\s*=/.test(generatedCode); // no 'g' flag: a stateful lastIndex breaks repeated .test() calls
if (!hasFunction) {
return {
taskId: 'codegen',
input: prompt,
output: generatedCode,
taskCompleted: false,
failureReason: 'No function definition found'
};
}
// Test 3: Run against test cases
const testCases = this.extractTestCases(prompt);
const allPass = await this.runTestCases(generatedCode, testCases);
return {
taskId: 'codegen',
input: prompt,
output: generatedCode,
taskCompleted: allPass,
successReason: allPass ? 'All tests passed' : undefined,
failureReason: !allPass ? 'Some tests failed' : undefined
};
}
private extractTestCases(prompt: string): Array<{ input: any; expected: any }> {
// Extract test cases from prompt comments
return [];
}
private async runTestCases(
code: string,
testCases: Array<{ input: any; expected: any }>
): Promise<boolean> {
// Execute code with test inputs and verify outputs
return true;
}
}
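The `runTestCases` stub above can be sketched with the `Function` constructor, which carries the same trust caveat as `eval`. This assumes the generated code defines a function named `solve` (a convention you would enforce in the prompt); in production, execution belongs in an isolated sandbox (worker, container, or VM):

```typescript
// Sketch: run generated code against test cases. Assumes the prompt asks
// the model to define a function named `solve`. Trusted code only.
interface CodegenTestCase {
  input: unknown;
  expected: unknown;
}

function runGeneratedTests(code: string, testCases: CodegenTestCase[]): boolean {
  // Evaluate the generated code, then hand back its `solve` function.
  // `typeof` is safe even when `solve` was never declared.
  const factory = new Function(
    `${code}; return typeof solve === 'function' ? solve : null;`
  );
  const solve = factory() as ((input: unknown) => unknown) | null;
  if (!solve) return false;
  // Structural comparison via JSON covers primitives, arrays, and plain objects
  return testCases.every(
    (tc) => JSON.stringify(solve(tc.input)) === JSON.stringify(tc.expected)
  );
}
```

JSON comparison is a deliberate simplification; it misses `undefined`, `NaN`, and key-order differences, which a real evaluator would handle with a deep-equality helper.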
Multilingual Benchmark Design
If serving global users, test multiple languages:
interface MultilingualBenchmark {
examples: Array<{
id: string;
language: string;
input: string;
expectedOutput: string;
}>;
languages: string[];
}
class MultilingualBenchmarkBuilder {
private examples: MultilingualBenchmark['examples'] = [];
private languages: Set<string> = new Set();
addExample(
id: string,
language: string,
input: string,
expectedOutput: string
): this {
this.examples.push({
id: `${id}-${language}`,
language,
input,
expectedOutput
});
this.languages.add(language);
return this;
}
build(): MultilingualBenchmark {
return {
examples: this.examples,
languages: Array.from(this.languages)
};
}
}
const multilingBench = new MultilingualBenchmarkBuilder()
.addExample('greet-1', 'en', 'Say hello', 'Hello!')
.addExample('greet-1', 'fr', 'Dites bonjour', 'Bonjour!')
  .addExample('greet-1', 'es', 'Di hola', '¡Hola!')
.build();
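Aggregate results per language, not just overall: a model can look fine on average while failing one locale. A sketch, with `generate` again standing in for the model call and exact match as the (deliberately simple) metric:

```typescript
// Sketch: per-language accuracy for a multilingual benchmark.
interface LocalizedExample {
  language: string;
  input: string;
  expectedOutput: string;
}

function perLanguageAccuracy(
  examples: LocalizedExample[],
  generate: (input: string) => string
): Record<string, number> {
  const tally: Record<string, { hits: number; total: number }> = {};
  for (const ex of examples) {
    if (!tally[ex.language]) tally[ex.language] = { hits: 0, total: 0 };
    const bucket = tally[ex.language];
    bucket.total += 1;
    if (generate(ex.input) === ex.expectedOutput) bucket.hits += 1;
  }
  const accuracy: Record<string, number> = {};
  for (const [lang, { hits, total }] of Object.entries(tally)) {
    accuracy[lang] = hits / total;
  }
  return accuracy;
}
```

A gap of more than a few points between languages usually warrants a dedicated fix: more in-language examples, or a language-specific prompt.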
Adversarial Examples
Test how models fail. Create adversarial cases to find weaknesses:
interface AdversarialTest {
name: string;
attackType: 'typo' | 'paraphrase' | 'constraint_violation' | 'injection';
baselineExample: BenchmarkExample;
adversarialExample: BenchmarkExample;
  expectedRobustness: number; // Minimum fraction of baseline score the model should retain (1.0 = no degradation allowed)
}
class AdversarialTestGenerator {
// Typo perturbations
addTypos(text: string, typoRate: number = 0.05): string {
const words = text.split(/\s+/);
return words
.map((word) => {
if (Math.random() < typoRate && word.length > 3) {
// Swap adjacent characters
const chars = word.split('');
const idx = Math.floor(Math.random() * (chars.length - 1));
[chars[idx], chars[idx + 1]] = [chars[idx + 1], chars[idx]];
return chars.join('');
}
return word;
})
.join(' ');
}
// Paraphrase: rephrase while keeping meaning
async paraphrase(text: string): Promise<string> {
// Call LLM to paraphrase
return text;
}
// Constraint violation: try to break model rules
addConstraintViolation(
text: string,
constraint: string
): string {
// If constraint is "under 50 words", make input way longer
// If constraint is "no profanity", add edgy content
return text;
}
// Prompt injection: try to override instructions
promptInjection(text: string, injectionPayload: string): string {
return `${text}\n\nIGNORE PREVIOUS INSTRUCTIONS: ${injectionPayload}`;
}
}
async function evaluateRobustness(
model: LLMModel,
benchmark: DomainSpecificBenchmark
): Promise<RobustnessScore> {
const adversarialGen = new AdversarialTestGenerator();
let robustnessScores: number[] = [];
for (const example of benchmark.examples) {
const baselineOutput = await model.generate(example.input);
const baselineScore = benchmark.metrics[0].compute(
baselineOutput,
example.expectedOutput
);
// Test with typos
    const typoInput = adversarialGen.addTypos(example.input);
    const typoOutput = await model.generate(typoInput);
const typoScore = benchmark.metrics[0].compute(typoOutput, example.expectedOutput);
    // Robustness: how much did the score degrade? Guard against a zero baseline
    const degradation = baselineScore > 0 ? 1 - typoScore / baselineScore : 0;
    robustnessScores.push(degradation);
}
  const avgDegradation = robustnessScores.reduce((a, b) => a + b, 0) / robustnessScores.length;
return {
averageDegradation: avgDegradation,
assessment: avgDegradation < 0.2 ? 'robust' : 'fragile'
};
}
interface RobustnessScore {
averageDegradation: number;
assessment: 'robust' | 'fragile';
}
Benchmark Contamination Detection
Models might have seen your benchmark data during training. Detect this:
class ContaminationDetector {
async checkForContamination(
benchmark: DomainSpecificBenchmark,
model: LLMModel
): Promise<ContaminationReport> {
const suspiciousCases: Array<{
exampleId: string;
probability: number;
evidence: string[];
}> = [];
for (const example of benchmark.examples) {
// Test 1: Perfect reproduction
const output = await model.generate(example.input);
if (output === example.expectedOutput) {
suspiciousCases.push({
exampleId: example.id,
probability: 0.95,
evidence: ['Perfect match to expected output']
});
continue;
}
// Test 2: Excessive confidence
const confidence = await this.estimateConfidence(model, example.input);
if (confidence > 0.95) {
suspiciousCases.push({
exampleId: example.id,
probability: 0.7,
evidence: ['Unusually high confidence']
});
}
// Test 3: Substring containment
const chunks = example.expectedOutput.split(' ');
const substringsFound = chunks.filter((chunk) =>
output.includes(chunk)
).length;
const matchRate = substringsFound / chunks.length;
if (matchRate > 0.8) {
suspiciousCases.push({
exampleId: example.id,
probability: 0.6,
evidence: [`${(matchRate * 100).toFixed(1)}% substring overlap`]
});
}
}
    // Count unique examples: tests 2 and 3 can both fire for the same example
    const contaminationRate =
      new Set(suspiciousCases.map((c) => c.exampleId)).size / benchmark.examples.length;
return {
contaminationRate,
assessment:
contaminationRate > 0.2 ? 'likely contaminated' : 'probably clean',
suspiciousCases
};
}
private async estimateConfidence(model: LLMModel, input: string): Promise<number> {
// Generate multiple times and measure consistency
const outputs: string[] = [];
for (let i = 0; i < 3; i++) {
outputs.push(await model.generate(input, { temperature: 0.3 }));
}
// If all outputs are identical, confidence is high
const unanimous = outputs.every((o) => o === outputs[0]);
return unanimous ? 0.99 : 0.5;
}
}
interface ContaminationReport {
contaminationRate: number;
assessment: string;
suspiciousCases: Array<{
exampleId: string;
probability: number;
evidence: string[];
}>;
}
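The substring test above matches individual words, which common words satisfy too easily. A common refinement is contiguous n-gram overlap: measure what fraction of the expected output's word n-grams (n of 5 or more is typical) appear verbatim in the model output. A sketch:

```typescript
// Sketch: fraction of the expected output's contiguous word n-grams that
// appear verbatim in the model output. Long n-grams rarely match by chance,
// so high overlap is stronger contamination evidence than single-word hits.
function ngramOverlap(output: string, expected: string, n: number = 5): number {
  const words = expected.split(/\s+/).filter(Boolean);
  // Too short for any n-gram: fall back to whole-string containment
  if (words.length < n) return output.includes(expected) ? 1 : 0;
  let found = 0;
  let total = 0;
  for (let i = 0; i + n <= words.length; i++) {
    const gram = words.slice(i, i + n).join(' ');
    total += 1;
    if (output.includes(gram)) found += 1;
  }
  return found / total;
}
```

This drops straight into the detector above as a replacement for the word-level Test 3, with a threshold around 0.8 serving the same role.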
Running Evals with LM Evaluation Harness
EleutherAI's LM Evaluation Harness automates benchmark execution:
interface HarnessConfig {
model: string;
tasks: string[];
batchSize: number;
numFewShot: number;
device: 'cuda' | 'cpu';
}
async function runWithHarness(config: HarnessConfig): Promise<HarnessResult> {
  // CLI example: lm_eval --model hf --model_args pretrained=<model> --tasks mmlu,hellaswag --num_fewshot 0
  // In TypeScript, wrap the CLI (child_process) or call the Python harness directly
  // The values below are illustrative; a real wrapper would parse lm_eval's JSON output
  const results = {
tasks: {
mmlu: {
acc: 0.65,
acc_stderr: 0.01
},
hellaswag: {
acc_norm: 0.72,
acc_norm_stderr: 0.02
}
}
};
return results;
}
interface HarnessResult {
tasks: Record<string, Record<string, number>>;
}
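Rather than hard-coding results, a real wrapper can shell out to the `lm_eval` CLI. The flag names below follow the current harness CLI but may differ across versions; `CliHarnessConfig`, `buildHarnessArgs`, and `runHarness` are hypothetical helpers, and choosing a valid `--model` backend (e.g. `hf`, `openai-completions`) is left to you:

```typescript
import { spawnSync } from 'node:child_process';

interface CliHarnessConfig {
  model: string;     // harness backend, e.g. 'hf'
  modelArgs: string; // e.g. 'pretrained=EleutherAI/pythia-1b'
  tasks: string[];
  batchSize: number;
  numFewShot: number;
  device: 'cuda' | 'cpu';
}

// Pure function: easy to unit-test without the harness installed
function buildHarnessArgs(config: CliHarnessConfig): string[] {
  return [
    '--model', config.model,
    '--model_args', config.modelArgs,
    '--tasks', config.tasks.join(','),
    '--batch_size', String(config.batchSize),
    '--num_fewshot', String(config.numFewShot),
    '--device', config.device,
  ];
}

function runHarness(config: CliHarnessConfig): void {
  const result = spawnSync('lm_eval', buildHarnessArgs(config), { stdio: 'inherit' });
  if (result.status !== 0) {
    throw new Error(`lm_eval exited with status ${result.status}`);
  }
}
```

Separating argument construction from process spawning keeps the testable logic free of side effects, so config bugs surface before you burn GPU time.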
Sharing Benchmarks
Publish benchmarks for community contribution:
interface BenchmarkMetadata {
name: string;
description: string;
license: string;
contributors: string[];
lastUpdated: Date;
downloadUrl: string;
githubUrl?: string;
}
async function publishBenchmark(
benchmark: DomainSpecificBenchmark,
metadata: BenchmarkMetadata
): Promise<void> {
// Upload to Hugging Face Datasets
const datasetCard = `
# ${metadata.name}
${metadata.description}
## License
${metadata.license}
## Contributors
${metadata.contributors.join(', ')}
## Examples
${benchmark.examples.slice(0, 3).map((ex) => `- ${ex.input}`).join('\n')}
`;
console.log('Dataset card:', datasetCard);
// In production: upload to huggingface_hub
}
Checklist
- Build domain-specific benchmarks matching production queries
- Include easy, medium, hard examples in each category
- Define task-based metrics (does it work?) not just string similarity
- Create adversarial test cases (typos, paraphrases, injections)
- Test across languages if serving multilingual users
- Detect benchmark contamination in baseline models
- Use LM Evaluation Harness for reproducible runs
- Compare models using domain-specific benchmarks, not MMLU/HumanEval
- Version benchmarks and track metric changes over time
- Share benchmarks publicly for community validation
Conclusion
Standard benchmarks optimize for papers, not production. Build domain-specific benchmarks that measure whether your model solves your actual problem. Include adversarial cases to expose failure modes. Detect contamination to ensure fair evaluation. Task-based metrics beat string similarity. Multilingual evaluation catches regional issues. Shared benchmarks enable reproducibility and community contribution. A weak benchmark can hide poor real-world performance; invest time in evaluation infrastructure early.