Synthetic Data Generation With LLMs — Building Training Datasets at Scale

Introduction

Generating synthetic training data with LLMs accelerates model development, covers edge cases, and creates virtuous cycles where better models generate better data. This guide covers practical techniques and quality filters for production datasets.

When Synthetic Data Works Best

Synthetic data generation is most effective for:

  • Bootstrapping new domains with limited real examples
  • Covering rare edge cases and corner scenarios
  • Augmenting small datasets (100-1000 examples) to 10k+
  • Creating specific formatting or style variations
  • Reducing costs vs manual data collection

Real data should still be included for grounding, but synthetic data can safely make up 70-80% of the final dataset volume.
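As a sketch of that mixing policy, a hypothetical `mixDatasets` helper (the names and the 80% cap here are illustrative, not from any library) caps the synthetic share of the final dataset:

```typescript
// Illustrative sketch: cap the synthetic share of a mixed dataset.
interface MixExample {
  text: string;
  source: 'real' | 'synthetic';
}

function mixDatasets(
  real: MixExample[],
  synthetic: MixExample[],
  maxSyntheticRatio = 0.8 // synthetic examples as a fraction of the total
): MixExample[] {
  // Given the real examples, the largest synthetic count that keeps
  // synthetic / (real + synthetic) <= maxSyntheticRatio.
  const maxSynthetic = Math.floor(
    (real.length * maxSyntheticRatio) / (1 - maxSyntheticRatio)
  );
  return [...real, ...synthetic.slice(0, maxSynthetic)];
}
```

With 200 real examples and an 80% cap, at most 800 synthetic examples are kept, yielding a 1000-example dataset that is exactly 80% synthetic.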

GPT-4 for Instruction Generation

Use GPT-4 as a generator for high-diversity synthetic examples:

interface SyntheticDataRequest {
  domain: string;
  topic: string;
  count: number;
  constraints: string[];
}

interface GeneratedExample {
  instruction: string;
  input: string;
  output: string;
  metadata: {
    generationModel: string;
    domain: string;
    difficulty: 'easy' | 'medium' | 'hard';
  };
}

async function generateSyntheticExamples(
  request: SyntheticDataRequest
): Promise<GeneratedExample[]> {
  const prompt = `You are an expert data annotator for ${request.domain} tasks.
Generate ${request.count} diverse, realistic examples for the following topic: ${request.topic}

Constraints:
${request.constraints.map((c, i) => `${i + 1}. ${c}`).join('\n')}

Output format: JSON array of objects with fields: instruction, input, output.
Ensure high diversity in phrasing and content.
Make outputs realistic and representative of real use cases.`;

  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }],
      temperature: 1.0, // High temperature for diversity
      top_p: 0.95
    })
  });

  const data = await response.json();
  const content = data.choices[0].message.content;

  // Parse JSON from response
  const jsonMatch = content.match(/\[[\s\S]*\]/);
  if (!jsonMatch) throw new Error('Failed to parse generated examples');

  const baseExamples = JSON.parse(jsonMatch[0]);

  return baseExamples.map((ex: any, i: number) => ({
    instruction: ex.instruction,
    input: ex.input,
    output: ex.output,
    metadata: {
      generationModel: 'gpt-4',
      domain: request.domain,
      difficulty: getDifficulty(ex.instruction, i)
    }
  }));
}

function getDifficulty(
  instruction: string,
  _index: number // unused for now; kept for position-aware heuristics later
): 'easy' | 'medium' | 'hard' {
  // Crude length-based heuristic; swap in a model-based rating if needed.
  if (instruction.length < 50) return 'easy';
  if (instruction.length > 200) return 'hard';
  return 'medium';
}

// Example usage
const examples = await generateSyntheticExamples({
  domain: 'customer support',
  topic: 'billing inquiries',
  count: 50,
  constraints: [
    'Mix questions about invoices, payments, and subscriptions',
    'Include common typos and informal phrasing',
    'Generate diverse customer scenarios'
  ]
});

Evol-Instruct Technique

Iteratively evolve instructions to increase complexity:

interface EvolutionStep {
  originalInstruction: string;
  evolvedInstructions: string[];
  depth: number;
}

async function evolveInstruction(
  instruction: string,
  depth: number = 3
): Promise<EvolutionStep> {
  const evolvedInstructions: string[] = [instruction];

  for (let i = 0; i < depth; i++) {
    const currentInstruction = evolvedInstructions[evolvedInstructions.length - 1];

    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [
          {
            role: 'user',
            content: `Make this instruction more complex and diverse while maintaining the core task:
Original: ${currentInstruction}

Provide ONE evolved instruction that adds constraints, requires deeper reasoning, or increases specificity.
Output only the new instruction, nothing else.`
          }
        ],
        temperature: 0.8,
        max_tokens: 200
      })
    });

    const data = await response.json();
    const evolved = data.choices[0].message.content.trim();
    evolvedInstructions.push(evolved);
  }

  return {
    originalInstruction: instruction,
    evolvedInstructions,
    depth
  };
}

// Creates instruction chains:
// Simple → Medium → Hard → Expert
// Builds diverse difficulty levels automatically
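The evolution prompt above is fixed; the original Evol-Instruct work instead cycles through several evolution operations (add constraints, deepen reasoning, concretize, in-breadth mutation). A minimal sketch of such a template picker — the template wordings here are assumptions, not the paper's exact prompts:

```typescript
// Hypothetical evolution operations inspired by Evol-Instruct.
const EVOLUTION_TEMPLATES = [
  'Add one new constraint or requirement to this instruction: ',
  'Rewrite this instruction so it requires multi-step reasoning: ',
  'Replace general concepts in this instruction with more specific ones: ',
  'Write a new instruction in the same domain but on a different topic: '
];

function buildEvolutionPrompt(instruction: string, round: number): string {
  // Cycle through the operations so successive rounds vary the evolution type.
  const template = EVOLUTION_TEMPLATES[round % EVOLUTION_TEMPLATES.length];
  return `${template}${instruction}\nOutput only the new instruction, nothing else.`;
}
```

Swapping this into `evolveInstruction` in place of the fixed prompt varies how each round hardens the instruction.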

Self-Play for Diverse Examples

Generate diverse outputs for the same input by having the model critique its own responses:

interface SelfPlayRound {
  input: string;
  responses: Array<{
    text: string;
    score: number;
    feedback: string;
  }>;
  selected: string;
}

async function selfPlayGeneration(
  input: string,
  rounds: number = 3
): Promise<SelfPlayRound> {
  const responses: Array<{
    text: string;
    score: number;
    feedback: string;
  }> = [];

  for (let round = 0; round < rounds; round++) {
    // Generate diverse response
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [
          { role: 'user', content: input }
        ],
        temperature: 0.8 + round * 0.1, // Increase randomness each round
        max_tokens: 500
      })
    });

    const data = await response.json();
    const text = data.choices[0].message.content;

    // Score response quality
    const scoreResponse = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [
          {
            role: 'user',
            content: `Rate this response on a scale of 1-10:\n${text}\n\nProvide score and brief feedback.`
          }
        ],
        temperature: 0
      })
    });

    const scoreData = await scoreResponse.json();
    const scoreText = scoreData.choices[0].message.content;
    const scoreMatch = scoreText.match(/(\d+)/);
    const score = scoreMatch ? parseInt(scoreMatch[1], 10) : 5;

    responses.push({
      text,
      score,
      feedback: scoreText
    });
  }

  // Select best response
  const selected = responses.reduce((best, current) =>
    current.score > best.score ? current : best
  ).text;

  return {
    input,
    responses,
    selected
  };
}

Quality Filtering Strategy

interface QualityFilter {
  name: string;
  evaluate: (example: GeneratedExample) => { pass: boolean; reason: string };
}

function createQualityFilters(): QualityFilter[] {
  return [
    {
      name: 'Length Check',
      evaluate: (ex: GeneratedExample) => {
        const minLen = 10;
        const maxLen = 2000;
        const outputLen = ex.output.length;

        if (outputLen < minLen || outputLen > maxLen) {
          return { pass: false, reason: `Output length ${outputLen} outside range` };
        }
        return { pass: true, reason: 'Length valid' };
      }
    },
    {
      name: 'Deduplication',
      evaluate: (ex: GeneratedExample) => {
        // In production: check against existing dataset
        const isDuplicate = false; // Simplified
        return {
          pass: !isDuplicate,
          reason: isDuplicate ? 'Duplicate found' : 'Unique'
        };
      }
    },
    {
      name: 'Perplexity Filter',
      evaluate: (ex: GeneratedExample) => {
        // Word-level entropy serves as a cheap proxy for perplexity here:
        // repetitive, low-entropy outputs are rejected.
        const entropy = calculateEntropy(ex.output);
        const minEntropy = 2.0;

        if (entropy < minEntropy) {
          return { pass: false, reason: `Low entropy: ${entropy.toFixed(2)}` };
        }
        return { pass: true, reason: `Entropy: ${entropy.toFixed(2)}` };
      }
    },
    {
      name: 'Semantic Coherence',
      evaluate: (ex: GeneratedExample) => {
        // Check if output answers instruction
        const coherence = true; // Would use embedding similarity
        return {
          pass: coherence,
          reason: coherence ? 'Coherent' : 'Incoherent'
        };
      }
    }
  ];
}

function calculateEntropy(text: string): number {
  const words = text.toLowerCase().split(/\s+/);
  const freq: { [key: string]: number } = {};

  words.forEach(word => {
    freq[word] = (freq[word] || 0) + 1;
  });

  let entropy = 0;
  const total = words.length;

  Object.values(freq).forEach(count => {
    const p = count / total;
    entropy -= p * Math.log2(p);
  });

  return entropy;
}

async function filterDataset(
  examples: GeneratedExample[]
): Promise<{
  passed: GeneratedExample[];
  rejected: GeneratedExample[];
  report: { [key: string]: number };
}> {
  const filters = createQualityFilters();
  const passed: GeneratedExample[] = [];
  const rejected: GeneratedExample[] = [];
  const report: { [key: string]: number } = {};

  filters.forEach(f => {
    report[f.name] = 0;
  });

  examples.forEach(ex => {
    let shouldKeep = true;

    for (const filter of filters) {
      const result = filter.evaluate(ex);
      if (!result.pass) {
        shouldKeep = false;
        report[filter.name]++;
        break;
      }
    }

    if (shouldKeep) {
      passed.push(ex);
    } else {
      rejected.push(ex);
    }
  });

  return { passed, rejected, report };
}

Diversity Metrics

interface DiversityMetrics {
  tokenCoverage: number; // % of vocabulary used
  lexicalDiversity: number; // Type/token ratio
  instructionVariety: number; // % unique instruction templates
  responseVariety: number; // % unique response patterns
}

function calculateDiversityMetrics(
  examples: GeneratedExample[]
): DiversityMetrics {
  // Collect all tokens
  const allTokens: Set<string> = new Set();
  const instructionPatterns: Set<string> = new Set();
  const responsePatterns: Set<string> = new Set();
  let totalTokens = 0;

  examples.forEach(ex => {
    const instrTokens = ex.instruction.toLowerCase().split(/\s+/);
    const respTokens = ex.output.toLowerCase().split(/\s+/);

    instrTokens.forEach(t => {
      allTokens.add(t);
      totalTokens++;
    });
    respTokens.forEach(t => {
      allTokens.add(t);
      totalTokens++;
    });

    // Extract pattern (first 5 words)
    const instrPattern = instrTokens.slice(0, 5).join(' ');
    const respPattern = respTokens.slice(0, 5).join(' ');

    instructionPatterns.add(instrPattern);
    responsePatterns.add(respPattern);
  });

  return {
    tokenCoverage: (allTokens.size / 50000) * 100, // Assume 50k vocab
    lexicalDiversity: allTokens.size / totalTokens,
    instructionVariety: (instructionPatterns.size / examples.length) * 100,
    responseVariety: (responsePatterns.size / examples.length) * 100
  };
}

// Target metrics:
// - Token coverage: >15%
// - Lexical diversity: >0.5 (high type/token ratio)
// - Instruction variety: &gt;60%
// - Response variety: &gt;70%
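The targets above can be enforced as a simple gate before training. The thresholds mirror the comments, and `DiversityMetrics` is restated so the sketch stands alone:

```typescript
interface DiversityMetrics {
  tokenCoverage: number;      // % of vocabulary used
  lexicalDiversity: number;   // type/token ratio
  instructionVariety: number; // % unique instruction templates
  responseVariety: number;    // % unique response patterns
}

// Reject a batch that misses any diversity target.
function meetsDiversityTargets(m: DiversityMetrics): boolean {
  return (
    m.tokenCoverage > 15 &&
    m.lexicalDiversity > 0.5 &&
    m.instructionVariety > 60 &&
    m.responseVariety > 70
  );
}
```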

Synthetic Data for Classification Tasks

interface ClassificationDataset {
  classes: string[];
  examples: Array<{
    text: string;
    label: string;
    confidence: number;
  }>;
}

async function generateClassificationData(
  classes: string[],
  examplesPerClass: number
): Promise<ClassificationDataset> {
  const examples: Array<{
    text: string;
    label: string;
    confidence: number;
  }> = [];

  for (const label of classes) {
    const prompt = `Generate ${examplesPerClass} realistic examples of text that belong to the "${label}" category.
Make them diverse in length, style, and content.
Output as JSON array with field "text" only.`;

    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [{ role: 'user', content: prompt }],
        temperature: 0.9
      })
    });

    const data = await response.json();
    const content = data.choices[0].message.content;
    const jsonMatch = content.match(/\[[\s\S]*\]/);

    if (jsonMatch) {
      const texts = JSON.parse(jsonMatch[0]);
      texts.forEach((item: any) => {
        examples.push({
          text: item.text,
          label,
          confidence: 0.95 // Synthetic data gets high confidence initially
        });
      });
    }
  }

  return {
    classes,
    examples
  };
}
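One failure mode of the loop above: when the JSON parse fails for a label, the `if (jsonMatch)` guard silently drops that entire class. A hypothetical balance check (the `tolerance` value is an assumption) catches this before training:

```typescript
// Hypothetical sanity check: verify generated labels are roughly balanced.
function checkClassBalance(labels: string[], tolerance = 0.1): boolean {
  if (labels.length === 0) return false;

  const counts = new Map<string, number>();
  labels.forEach(l => counts.set(l, (counts.get(l) ?? 0) + 1));

  // Compare each class count against a uniform split.
  const expected = labels.length / counts.size;
  return [...counts.values()].every(
    c => Math.abs(c - expected) / expected <= tolerance
  );
}
```

Calling `checkClassBalance(dataset.examples.map(e => e.label))` would flag a dropped or under-generated class.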

Data Flywheel: Product → Data → Model → Product

interface DataFlywheel {
  phase: 'bootstrap' | 'iterate' | 'scale';
  sourceData: 'synthetic' | 'real' | 'mixed';
  modelPerformance: number;
  productUse: number; // Percentage of production using this model
}

async function implementDataFlywheel(): Promise<void> {
  // Phase 1: Bootstrap with synthetic data
  console.log('Phase 1: Generate 1000 synthetic examples with GPT-4');

  // Phase 2: Deploy and collect user feedback
  console.log(
    'Phase 2: Deploy model v1, collect corrections on 10% errors'
  );

  // Phase 3: Use corrections to improve synthetic generator
  console.log(
    'Phase 3: Fine-tune prompt using real user corrections'
  );

  // Phase 4: Generate higher-quality synthetic data
  console.log('Phase 4: Generate v2 dataset (2000 examples) with improved prompt');

  // Phase 5: Train improved model
  console.log('Phase 5: Train model v2 on mixed real + synthetic data');

  // Phase 6: Scale
  console.log('Phase 6: Gradual rollout: 5% → 25% → 100% traffic');

  // Each iteration:
  // Errors from production → corrections → better synthetic generator
  // Synthetic data quality improves as product matures
}

// Real example: Anthropic's Constitutional AI
// Started with synthetic examples, iterated on human feedback
// Later generations of synthetic data were much higher quality

Legal and Compliance Checklist

interface LegalChecklistItem {
  category: string;
  requirement: string;
  status: 'complete' | 'pending' | 'inapplicable';
}

function generateLegalChecklist(): LegalChecklistItem[] {
  return [
    {
      category: 'Data Rights',
      requirement:
        'Base model training does not violate copyright (use commercially-licensed models)',
      status: 'complete'
    },
    {
      category: 'Bias',
      requirement:
        'Synthetic data does not amplify harmful stereotypes or biases',
      status: 'pending'
    },
    {
      category: 'Attribution',
      requirement:
        'Document synthetic data sources and generation methodology',
      status: 'complete'
    },
    {
      category: 'PII',
      requirement:
        'Generated examples do not contain real personally identifiable information',
      status: 'complete'
    },
    {
      category: 'GDPR/Privacy',
      requirement:
        'If deployed in the EU: ensure synthetic data doesn\'t recreate personal data',
      status: 'pending'
    },
    {
      category: 'Disclosure',
      requirement:
        'If publishing results: disclose percentage of synthetic vs real data',
      status: 'pending'
    }
  ];
}

// Best practices:
// - Document all synthetic data sources
// - Audit for bias before production
// - Be transparent about synthetic data usage
// - Keep real data for validation and grounding

Checklist

  • Identified domain-specific requirements for synthetic data
  • Generated 1000+ synthetic examples with GPT-4 (temperature=0.9-1.0)
  • Applied Evol-Instruct to create difficulty gradients
  • Used self-play to generate diverse outputs per prompt
  • Filtered dataset with length, deduplication, and perplexity checks
  • Verified diversity metrics (lexical diversity >0.5, variety >60%)
  • Mixed synthetic with 10-20% real examples
  • Tested on benchmark tasks vs baseline
  • Set up data flywheel: production errors → synthetic improvement
  • Documented synthetic data sources and generation methodology
  • Obtained legal review for bias and copyright concerns

Conclusion

Synthetic data generation with GPT-4 accelerates dataset creation and enables data flywheels in which production errors improve future training data. Combine evolution techniques, self-play, and rigorous quality filtering to reach roughly 70-80% of hand-annotated quality at a fraction of the cost.