AI Training Data Quality — Cleaning, Deduplication, and Quality Scoring

Introduction

"Garbage in, garbage out" applies to LLMs. Training data quality directly determines model quality. Duplicates waste compute. Errors propagate to outputs. Biases embed in model behavior. This guide covers systematically measuring and improving training data quality.

Data Quality Dimensions

Data quality isn't binary. Measure across dimensions:

interface DataQualityMetrics {
  accuracy: number; // Factual correctness
  completeness: number; // All required fields present
  consistency: number; // No contradictions within document
  timeliness: number; // How recent is the data?
  validity: number; // Conforms to expected format
  uniqueness: number; // No near-duplicates
}

class DataQualityAssessor {
  async assessQuality(document: Document): Promise<DataQualityMetrics> {
    return {
      accuracy: await this.checkAccuracy(document),
      completeness: await this.checkCompleteness(document),
      consistency: await this.checkConsistency(document),
      timeliness: this.checkTimeliness(document),
      validity: this.checkValidity(document),
      uniqueness: 1.0 // Placeholder; measured against the full corpus during deduplication
    };
  }

  private async checkAccuracy(doc: Document): Promise<number> {
    // Run fact-checking model or compare with trusted sources
    const facts = this.extractFacts(doc);
    if (facts.length === 0) return 1.0; // No checkable claims; avoid division by zero

    const checkedFacts = await Promise.all(
      facts.map((f) => this.verifyFact(f))
    );

    const correctFacts = checkedFacts.filter((c) => c).length;
    return correctFacts / facts.length;
  }

  private async checkCompleteness(doc: Document): Promise<number> {
    // Check if required fields are present
    const requiredFields = ['title', 'content', 'date'];
    const presentFields = requiredFields.filter((f) => doc[f] !== undefined && doc[f] !== '');

    return presentFields.length / requiredFields.length;
  }

  private async checkConsistency(doc: Document): Promise<number> {
    // Look for contradictions
    // E.g., "John is 30 years old" vs "John was born in 1995"
    const contradictions = await this.findContradictions(doc);
    return Math.max(0, 1 - contradictions.length * 0.1);
  }

  private checkTimeliness(doc: Document): number {
    // How old is this data?
    const docDate = new Date(doc.date);
    const daysOld = (Date.now() - docDate.getTime()) / (1000 * 60 * 60 * 24);

    if (daysOld < 30) return 1.0;
    if (daysOld < 365) return 0.8;
    if (daysOld < 1825) return 0.5; // 5 years
    return 0.2;
  }

  private checkValidity(doc: Document): number {
    // Check schema compliance
    try {
      validateDocumentSchema(doc);
      return 1.0;
    } catch {
      return 0.0;
    }
  }

  private extractFacts(doc: Document): string[] {
    // Named entity recognition + relation extraction
    return [];
  }

  private async verifyFact(fact: string): Promise<boolean> {
    // Query knowledge base or fact-checking service
    return true;
  }

  private async findContradictions(doc: Document): Promise<string[]> {
    // Parse sentences, extract claims, check for logical conflicts
    return [];
  }
}

interface Document {
  id: string;
  title: string;
  content: string;
  date: string;
  [key: string]: any;
}
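
To turn these per-dimension scores into a single number for filtering or ranking, a weighted average is a reasonable starting point. The weights below are illustrative, not prescribed by the assessor above:

// Illustrative weights; tune them for your corpus and downstream task
function overallQuality(m: DataQualityMetrics): number {
  return (
    0.3 * m.accuracy +
    0.15 * m.completeness +
    0.15 * m.consistency +
    0.1 * m.timeliness +
    0.15 * m.validity +
    0.15 * m.uniqueness
  );
}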

Deduplication with MinHash

Exact deduplication (checking for identical documents) is simple but misses near-duplicates. MinHash enables fast approximate deduplication:

class MinHashDeduplicator {
  private numHashes: number = 128;

  getSignature(document: string): number[] {
    const shingles = this.getShingles(document);
    const signature: number[] = [];

    for (let i = 0; i < this.numHashes; i++) {
      let minHash = Number.MAX_SAFE_INTEGER;

      for (const shingle of shingles) {
        const hash = this.hashShingle(shingle, i);
        minHash = Math.min(minHash, hash);
      }

      signature.push(minHash);
    }

    return signature;
  }

  private getShingles(text: string, k: number = 5): Set<string> {
    // k-shingles: each shingle is k consecutive words
    const tokens = text.toLowerCase().split(/\s+/);
    const shingles = new Set<string>();

    for (let i = 0; i <= tokens.length - k; i++) {
      shingles.add(tokens.slice(i, i + k).join(' '));
    }

    return shingles;
  }

  private hashShingle(shingle: string, seed: number): number {
    // Simple hash function
    let hash = seed;
    for (let i = 0; i < shingle.length; i++) {
      hash = (hash << 5) - hash + shingle.charCodeAt(i);
      hash = hash & hash; // Keep as 32-bit integer
    }
    return Math.abs(hash);
  }

  deduplicateCorpus(documents: Document[]): Document[] {
    const unique: Document[] = [];
    const seenSignatures: string[] = [];

    for (const doc of documents) {
      const signature = this.getSignature(doc.content);
      const signatureStr = signature.join(',');

      // Check for duplicates
      let isDuplicate = false;
      for (const seenSig of seenSignatures) {
        const similarity = this.computeJaccardSimilarity(signatureStr, seenSig);
        if (similarity > 0.9) {
          isDuplicate = true;
          break;
        }
      }

      if (!isDuplicate) {
        unique.push(doc);
        seenSignatures.push(signatureStr);
      }
    }

    return unique;
  }

  private computeJaccardSimilarity(sig1: string, sig2: string): number {
    const hashes1 = sig1.split(',').map(Number);
    const hashes2 = sig2.split(',').map(Number);

    const matching = hashes1.filter((h, i) => h === hashes2[i]).length;
    return matching / hashes1.length;
  }
}
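
The quantity these signatures estimate is the Jaccard similarity of the underlying shingle sets: the probability that two sets share the same minimum hash equals their Jaccard similarity. For reference, the exact value can be computed directly on the shingle sets (a small sketch, useful for spot-checking the estimate):

// Exact Jaccard similarity |A ∩ B| / |A ∪ B| between two shingle sets.
// MinHash approximates this without materializing both sets for every pair.
function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  let intersection = 0;
  for (const item of a) {
    if (b.has(item)) intersection++;
  }
  const union = a.size + b.size - intersection;
  return union === 0 ? 1 : intersection / union;
}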

Locality-Sensitive Hashing (LSH)

For large-scale deduplication, LSH groups similar documents efficiently:

class LSHDeduplicator {
  // 16 bands x 8 rows = 128, matching the 128-hash MinHash signature length
  private numBands: number = 16;
  private rowsPerBand: number = 8;
  private buckets: Map<string, string[]> = new Map();

  async deduplicateLargeCorpus(documents: Document[]): Promise<Document[]> {
    // Step 1: Compute MinHash signatures
    const minhasher = new MinHashDeduplicator();
    const signatures = new Map<string, number[]>();

    for (const doc of documents) {
      signatures.set(doc.id, minhasher.getSignature(doc.content));
    }

    // Step 2: LSH: hash signatures into bands/buckets
    for (const [docId, signature] of signatures.entries()) {
      for (let band = 0; band < this.numBands; band++) {
        const bandHash = this.hashBand(
          signature.slice(
            band * this.rowsPerBand,
            (band + 1) * this.rowsPerBand
          ),
          band
        );

        const bucketKey = `band-${band}-${bandHash}`;
        if (!this.buckets.has(bucketKey)) {
          this.buckets.set(bucketKey, []);
        }
        this.buckets.get(bucketKey)!.push(docId);
      }
    }

    // Step 3: Find duplicate candidates within buckets
    const duplicateGroups: string[][] = [];
    for (const bucket of this.buckets.values()) {
      if (bucket.length > 1) {
        // Docs sharing a bucket are only candidates; production pipelines
        // should verify them by comparing full signatures before discarding
        duplicateGroups.push(bucket);
      }
    }

    // Step 4: Keep only one document from each candidate group
    const keep = new Set<string>();
    for (const group of duplicateGroups) {
      keep.add(group[0]); // Keep the first, discard the rest
    }

    // Add all non-duplicate docs
    for (const doc of documents) {
      if (!this.isInDuplicateGroup(doc.id, duplicateGroups)) {
        keep.add(doc.id);
      }
    }

    return documents.filter((d) => keep.has(d.id));
  }

  private hashBand(band: number[], bandId: number): string {
    return band.join('-') + '-' + bandId;
  }

  private isInDuplicateGroup(docId: string, groups: string[][]): boolean {
    return groups.some((group) => group.includes(docId));
  }
}
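
The band/row split acts as a tunable similarity threshold: with b bands of r rows each, two documents whose signatures agree on a fraction s of positions land in at least one common bucket with probability 1 - (1 - s^r)^b. A small helper (illustrative, not part of the original pipeline) makes it easy to sanity-check a configuration:

// Probability that two documents with MinHash similarity s become LSH candidates,
// given `bands` bands of `rows` rows each
function candidateProbability(s: number, bands: number, rows: number): number {
  return 1 - Math.pow(1 - Math.pow(s, rows), bands);
}

// With 16 bands of 8 rows, near-duplicates are caught almost always while
// moderately similar documents rarely collide:
// candidateProbability(0.9, 16, 8) ≈ 0.9999
// candidateProbability(0.5, 16, 8) ≈ 0.06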

Data Cleaning Pipeline
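
Normalization and validation steps compose naturally into a pipeline that runs over every document: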

interface CleaningStep {
  name: string;
  apply: (doc: Document) => Document;
  severity: 'error' | 'warning';
}

class DataCleaningPipeline {
  private steps: CleaningStep[] = [];

  addStep(step: CleaningStep): this {
    this.steps.push(step);
    return this;
  }

  async clean(documents: Document[]): Promise<Document[]> {
    const cleaned: Document[] = [];

    for (const doc of documents) {
      let cleanedDoc = doc;
      let dropped = false;

      for (const step of this.steps) {
        try {
          cleanedDoc = step.apply(cleanedDoc);
        } catch (error) {
          if (step.severity === 'error') {
            // Drop the whole document when a hard validation step fails
            console.log(`Skipping document due to ${step.name}: ${error}`);
            dropped = true;
            break;
          }
        }
      }

      if (!dropped) {
        cleaned.push(cleanedDoc);
      }
    }

    return cleaned;
  }
}

const cleaningPipeline = new DataCleaningPipeline()
  .addStep({
    name: 'remove-html',
    apply: (doc) => ({
      ...doc,
      content: doc.content.replace(/<[^>]*>/g, '')
    }),
    severity: 'warning'
  })
  .addStep({
    name: 'normalize-whitespace',
    apply: (doc) => ({
      ...doc,
      content: doc.content.replace(/\s+/g, ' ').trim()
    }),
    severity: 'warning'
  })
  .addStep({
    name: 'remove-control-characters',
    apply: (doc) => ({
      ...doc,
      content: doc.content.replace(/[\x00-\x08\x0B-\x0C\x0E-\x1F]/g, '')
    }),
    severity: 'warning'
  })
  .addStep({
    name: 'validate-length',
    apply: (doc) => {
      if (doc.content.length < 10) {
        throw new Error('Content too short');
      }
      return doc;
    },
    severity: 'error'
  });
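
Running the pipeline is then a single call (a sketch, assuming the raw corpus has already been loaded):

// Hypothetical usage, assuming `documents` holds the raw corpus
async function runCleaning(documents: Document[]): Promise<Document[]> {
  const cleanedDocuments = await cleaningPipeline.clean(documents);
  console.log(`Kept ${cleanedDocuments.length} of ${documents.length} documents`);
  return cleanedDocuments;
}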

Quality Scoring with LLM-as-Judge

Use an LLM to score quality across multiple dimensions:

class LLMQualityScorer {
  private judgeModel: LLMModel;

  async scoreQuality(document: Document): Promise<QualityScore> {
    const rubric = `
Rate this document on:
1. Factual Accuracy (0-100): Are claims accurate?
2. Clarity (0-100): Is it well-written and clear?
3. Completeness (0-100): Does it address the topic fully?
4. Relevance (0-100): Is it relevant to the domain?
5. Coherence (0-100): Is the logic consistent?

Document:
${document.content}

Provide scores as JSON: { accuracy, clarity, completeness, relevance, coherence }`;

    const response = await this.judgeModel.generate(rubric);
    // Assumes the judge returns bare JSON; in practice, validate the output
    // and strip code fences before parsing
    const scores = JSON.parse(response);

    return {
      accuracy: scores.accuracy / 100,
      clarity: scores.clarity / 100,
      completeness: scores.completeness / 100,
      relevance: scores.relevance / 100,
      coherence: scores.coherence / 100,
      overall:
        (scores.accuracy + scores.clarity + scores.completeness + scores.relevance + scores.coherence) / 500
    };
  }
}

interface QualityScore {
  accuracy: number;
  clarity: number;
  completeness: number;
  relevance: number;
  coherence: number;
  overall: number;
}
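
The judge model is typed only loosely here; the sketches in this guide assume a minimal LLMModel interface along these lines (an assumption, not any specific vendor SDK):

// Minimal generation/training interface assumed by the examples in this guide
interface LLMModel {
  generate(prompt: string): Promise<string>;
  train(documents: Document[]): Promise<LLMModel>;
}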

Toxic and Biased Content Filtering

Filter dangerous or biased content:

class ContentFilter {
  private toxicityModel: ToxicityDetector;
  private biasDetector: BiasDetector;

  async filterContent(documents: Document[]): Promise<Document[]> {
    const filtered: Document[] = [];

    for (const doc of documents) {
      const toxicity = await this.toxicityModel.detect(doc.content);
      if (toxicity.score > 0.7) {
        console.log(`Removing toxic document: ${doc.id}`);
        continue;
      }

      const biases = await this.biasDetector.detect(doc.content);
      if (biases.length > 0) {
        // Flag but don't remove (bias is contextual)
        console.log(`Document ${doc.id} contains potential bias: ${biases.join(', ')}`);
      }

      filtered.push(doc);
    }

    return filtered;
  }
}

interface ToxicityDetector {
  detect(text: string): Promise<{ score: number; labels: string[] }>;
}

interface BiasDetector {
  detect(text: string): Promise<string[]>;
}

PII Removal

Remove personally identifiable information:

class PIIRemover {
  async removePII(document: Document): Promise<Document> {
    let content = document.content;

    // Email addresses
    content = content.replace(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, '[EMAIL]');

    // Phone numbers
    content = content.replace(/\+?1?\s*\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g, '[PHONE]');

    // Social security numbers
    content = content.replace(/\d{3}-\d{2}-\d{4}/g, '[SSN]');

    // Credit card numbers
    content = content.replace(/\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}/g, '[CARD]');

    // Names (using NER); replace every occurrence, not just the first match
    const entities = await this.extractNamedEntities(content);
    for (const entity of entities) {
      if (entity.type === 'PERSON') {
        content = content.split(entity.text).join('[PERSON]');
      }
    }

    return { ...document, content };
  }

  private async extractNamedEntities(text: string): Promise<NamedEntity[]> {
    // Use NER model to detect persons, locations, etc.
    return [];
  }
}

interface NamedEntity {
  text: string;
  type: 'PERSON' | 'LOCATION' | 'ORGANIZATION';
  start: number;
  end: number;
}
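
A quick sanity check of the regex-based redaction (illustrative input; with the NER stub above, person names pass through untouched):

// Hypothetical document; only the regex-based rules fire here
async function demoRedaction(): Promise<void> {
  const remover = new PIIRemover();
  const redacted = await remover.removePII({
    id: 'doc-1',
    title: 'Contact info',
    content: 'Reach Jane at jane.doe@example.com or 555-123-4567.',
    date: '2024-01-01'
  });
  console.log(redacted.content); // "Reach Jane at [EMAIL] or [PHONE]."
}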

Data Versioning with DVC

Track data versions alongside code:

class DataVersionControl {
  async trackDataset(
    datasetPath: string,
    metadata: {
      name: string;
      description: string;
      version: string;
      size: number;
      hash: string;
    }
  ): Promise<void> {
    // DVC: dvc add data/training-set.csv
    // Creates data/training-set.csv.dvc with metadata

    // In TypeScript:
    const dvcFile = {
      outs: [
        {
          path: datasetPath,
          md5: metadata.hash,
          size: metadata.size,
          hash: 'md5'
        }
      ],
      meta: metadata
    };

    // In a real setup, serialize dvcFile to YAML at `${datasetPath}.dvc`
    // and commit that pointer file to git
    console.log(`Dataset ${metadata.name}@${metadata.version} tracked`);
  }

  async compareVersions(
    version1: string,
    version2: string
  ): Promise<ComparisonReport> {
    // dvc diff v1 v2
    return {
      version1,
      version2,
      sizeDiff: 1024, // Placeholder values; a real implementation parses dvc diff output
      duplicatesDiff: 50,
      qualityMetricsDiff: {}
    };
  }
}

interface ComparisonReport {
  version1: string;
  version2: string;
  sizeDiff: number;
  duplicatesDiff: number;
  qualityMetricsDiff: Record<string, number>;
}
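
Usage might look like the following (a sketch with a hypothetical dataset path and version label; the hash is computed locally so the tracked entry is content-addressed):

import { createHash } from 'crypto';
import { promises as fs } from 'fs';

async function trackTrainingSet(): Promise<void> {
  const path = 'data/training-set.csv'; // hypothetical dataset location
  const data = await fs.readFile(path);

  await new DataVersionControl().trackDataset(path, {
    name: 'training-set',
    description: 'Cleaned and deduplicated training corpus',
    version: 'v2',
    size: data.length,
    hash: createHash('md5').update(data).digest('hex')
  });
}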

Quality vs Quantity Trade-off

More data isn't always better:

async function findOptimalDatasetSize(
  model: LLMModel,
  fullDataset: Document[]
): Promise<{
  optimalSize: number;
  qualityScore: number;
  performanceGain: number;
}> {
  const results: Array<{ size: number; performance: number }> = [];

  // Test different sizes
  for (let size = 100; size <= fullDataset.length; size += Math.floor(fullDataset.length / 20)) {
    const subset = fullDataset.slice(0, size);
    const trainedModel = await model.train(subset);
    const performance = await evaluateModel(trainedModel);

    results.push({ size, performance });
  }

  // Find the point of diminishing returns: keep growing the dataset only while
  // each additional document still yields a meaningful marginal gain
  const minEfficiency = 1e-5; // marginal performance gain per document; tune per task
  let optimalSize = results[0].size;

  for (let i = 1; i < results.length; i++) {
    const efficiency =
      (results[i].performance - results[i - 1].performance) /
      (results[i].size - results[i - 1].size);

    if (efficiency < minEfficiency) break;
    optimalSize = results[i].size;
  }

  return {
    optimalSize,
    qualityScore: 0.85, // Placeholder; derive from the quality metrics above in practice
    performanceGain: results[results.length - 1].performance - results[0].performance
  };
}

async function evaluateModel(model: LLMModel): Promise<number> {
  // Placeholder: run a held-out evaluation benchmark and return its score
  return 0.8;
}

Checklist

  • Measure data quality across all dimensions (accuracy, completeness, consistency, timeliness, validity, uniqueness)
  • Implement MinHash/LSH for scalable deduplication
  • Build data cleaning pipeline with multiple steps
  • Use LLM-as-judge for multi-dimensional quality scoring
  • Filter toxic and potentially harmful content
  • Remove PII before training
  • Version datasets with DVC
  • Implement data quality monitoring in CI/CD
  • Find optimal dataset size (diminishing returns)
  • Document data provenance and quality metrics

Conclusion

Data quality directly determines model quality. Deduplication with MinHash removes costly duplicates. LLM-as-judge provides multi-dimensional quality scores. Cleaning pipelines normalize format and content. Filtering removes toxic/biased content. PII removal ensures privacy. Data versioning tracks lineage. Together, these practices build high-quality training corpora. Invest in data quality infrastructure — it pays dividends across the entire machine learning pipeline.