Synthetic Data Generation With LLMs — Building Training Datasets at Scale
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Generating synthetic training data with LLMs accelerates model development, covers edge cases, and creates virtuous cycles where better models generate better data. This guide covers practical techniques and quality filters for production datasets.
- When Synthetic Data Works Best
- GPT-4 for Instruction Generation
- Evol-Instruct Technique
- Self-Play for Diverse Examples
- Quality Filtering Strategy
- Diversity Metrics
- Synthetic Data for Classification Tasks
- Data Flywheel: Product → Data → Model → Product
- Legal and Ethical Considerations
- Checklist
- Conclusion
When Synthetic Data Works Best
Synthetic data generation is most effective for:
- Bootstrapping new domains with limited real examples
- Covering rare edge cases and corner scenarios
- Augmenting small datasets (100-1000 examples) to 10k+
- Creating specific formatting or style variations
- Reducing costs vs manual data collection
Real data should still be included for grounding, but synthetic data can safely make up 70-80% of the final dataset volume.
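That 70-80% ceiling can be encoded as a small planning helper. This is an illustrative sketch, not part of any library; `planMix` and the 0.8 default cap are assumptions matching the ratio above:

```typescript
interface MixPlan {
  synthetic: number;
  real: number;
}

// Split a target dataset size into synthetic and real portions,
// capping synthetic volume at a configurable fraction (default 80%).
function planMix(targetSize: number, syntheticCap: number = 0.8): MixPlan {
  const synthetic = Math.floor(targetSize * syntheticCap);
  return { synthetic, real: targetSize - synthetic };
}

// e.g. planMix(10_000) → { synthetic: 8000, real: 2000 }
```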
GPT-4 for Instruction Generation
Use GPT-4 as a generator for high-diversity synthetic examples:
interface SyntheticDataRequest {
domain: string;
topic: string;
count: number;
constraints: string[];
}
interface GeneratedExample {
instruction: string;
input: string;
output: string;
metadata: {
generationModel: string;
domain: string;
difficulty: 'easy' | 'medium' | 'hard';
};
}
async function generateSyntheticExamples(
request: SyntheticDataRequest
): Promise<GeneratedExample[]> {
const prompt = `You are an expert data annotator for ${request.domain} tasks.
Generate ${request.count} diverse, realistic examples for the following topic: ${request.topic}
Constraints:
${request.constraints.map((c, i) => `${i + 1}. ${c}`).join('\n')}
Output format: JSON array of objects with fields: instruction, input, output.
Ensure high diversity in phrasing and content.
Make outputs realistic and representative of real use cases.`;
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }],
temperature: 1.0, // High temperature for diversity
top_p: 0.95
})
});
const data = await response.json();
const content = data.choices[0].message.content;
// Parse JSON from response
const jsonMatch = content.match(/\[[\s\S]*\]/);
if (!jsonMatch) throw new Error('Failed to parse generated examples');
const baseExamples = JSON.parse(jsonMatch[0]);
return baseExamples.map((ex: any, i: number) => ({
instruction: ex.instruction,
input: ex.input,
output: ex.output,
metadata: {
generationModel: 'gpt-4',
domain: request.domain,
difficulty: getDifficulty(ex.instruction, i)
}
}));
}
function getDifficulty(
instruction: string,
_index: number // position in batch; unused by this length-based heuristic
): 'easy' | 'medium' | 'hard' {
if (instruction.length < 50) return 'easy';
if (instruction.length > 200) return 'hard';
return 'medium';
}
// Example usage
const examples = await generateSyntheticExamples({
domain: 'customer support',
topic: 'billing inquiries',
count: 50,
constraints: [
'Mix questions about invoices, payments, and subscriptions',
'Include common typos and informal phrasing',
'Generate diverse customer scenarios'
]
});
Evol-Instruct Technique
Iteratively evolve instructions to increase complexity:
interface EvolutionStep {
originalInstruction: string;
evolvedInstructions: string[];
depth: number;
}
async function evolveInstruction(
instruction: string,
depth: number = 3
): Promise<EvolutionStep> {
const evolvedInstructions: string[] = [instruction];
for (let i = 0; i < depth; i++) {
const currentInstruction = evolvedInstructions[evolvedInstructions.length - 1];
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [
{
role: 'user',
content: `Make this instruction more complex and diverse while maintaining the core task:
Original: ${currentInstruction}
Provide ONE evolved instruction that adds constraints, requires deeper reasoning, or increases specificity.
Output only the new instruction, nothing else.`
}
],
temperature: 0.8,
max_tokens: 200
})
});
const data = await response.json();
const evolved = data.choices[0].message.content.trim();
evolvedInstructions.push(evolved);
}
return {
originalInstruction: instruction,
evolvedInstructions,
depth
};
}
// Creates instruction chains:
// Simple → Medium → Hard → Expert
// Builds diverse difficulty levels automatically
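Since each chain position corresponds to one evolution pass, a natural way to label the results is by depth rather than by instruction length. This pairing helper is a hypothetical sketch built on the `EvolutionStep` shape above:

```typescript
interface EvolutionStep {
  originalInstruction: string;
  evolvedInstructions: string[];
  depth: number;
}

type Difficulty = 'easy' | 'medium' | 'hard' | 'expert';

// Map each instruction in an evolution chain to a difficulty tier by its
// position: index 0 is the seed, later indices are progressively harder.
function labelByDepth(
  step: EvolutionStep
): Array<{ instruction: string; difficulty: Difficulty }> {
  const tiers: Difficulty[] = ['easy', 'medium', 'hard', 'expert'];
  return step.evolvedInstructions.map((instruction, i) => ({
    instruction,
    difficulty: tiers[Math.min(i, tiers.length - 1)],
  }));
}
```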
Self-Play for Diverse Examples
Generate diverse outputs for the same input by having the model critique itself:
interface SelfPlayRound {
input: string;
responses: Array<{
text: string;
score: number;
feedback: string;
}>;
selected: string;
}
async function selfPlayGeneration(
input: string,
rounds: number = 3
): Promise<SelfPlayRound> {
const responses: Array<{
text: string;
score: number;
feedback: string;
}> = [];
for (let round = 0; round < rounds; round++) {
// Generate diverse response
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [
{ role: 'user', content: input }
],
temperature: 0.8 + round * 0.1, // Increase randomness each round
max_tokens: 500
})
});
const data = await response.json();
const text = data.choices[0].message.content;
// Score response quality
const scoreResponse = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [
{
role: 'user',
content: `Rate this response on a scale of 1-10:\n${text}\n\nProvide score and brief feedback.`
}
],
temperature: 0
})
});
const scoreData = await scoreResponse.json();
const scoreText = scoreData.choices[0].message.content;
const scoreMatch = scoreText.match(/(\d+)/);
const score = scoreMatch ? parseInt(scoreMatch[1], 10) : 5;
responses.push({
text,
score,
feedback: scoreText
});
}
// Select best response
const selected = responses.reduce((best, current) =>
current.score > best.score ? current : best
).text;
return {
input,
responses,
selected
};
}
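One caveat: the single `\d+` match above grabs the first number in the judge's reply, which can be the echoed "1-10" scale rather than the rating. A slightly more defensive parser might look like this (`parseScore` is a hypothetical helper, not part of any API):

```typescript
// Extract a 1-10 rating from judge output, preferring explicit
// "Score: N" phrasing and ignoring an echoed "1-10" scale;
// falls back to a neutral 5 when no number is found.
function parseScore(judgeText: string): number {
  // Prefer labeled forms like "Score: 8" or "rating: 8/10"
  const labeled = judgeText.match(/(?:score|rating)\s*[:=]?\s*(\d{1,2})/i);
  const candidate = labeled
    ? parseInt(labeled[1], 10)
    : (() => {
        // Otherwise take the first standalone number after stripping "1-10"
        const m = judgeText.replace(/\b1\s*-\s*10\b/g, '').match(/\b(\d{1,2})\b/);
        return m ? parseInt(m[1], 10) : 5;
      })();
  // Clamp to the valid rating range
  return Math.min(10, Math.max(1, candidate));
}
```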
Quality Filtering Strategy
interface QualityFilter {
name: string;
evaluate: (example: GeneratedExample) => { pass: boolean; reason: string };
}
function createQualityFilters(): QualityFilter[] {
return [
{
name: 'Length Check',
evaluate: (ex: GeneratedExample) => {
const minLen = 10;
const maxLen = 2000;
const outputLen = ex.output.length;
if (outputLen < minLen || outputLen > maxLen) {
return { pass: false, reason: `Output length ${outputLen} outside range` };
}
return { pass: true, reason: 'Length valid' };
}
},
{
name: 'Deduplication',
evaluate: (ex: GeneratedExample) => {
// In production: check against existing dataset
const isDuplicate = false; // Simplified
return {
pass: !isDuplicate,
reason: isDuplicate ? 'Duplicate found' : 'Unique'
};
}
},
{
name: 'Perplexity Filter',
evaluate: (ex: GeneratedExample) => {
// Reject repetitive outputs; word-level entropy serves as a cheap proxy for perplexity
const entropy = calculateEntropy(ex.output);
const minEntropy = 2.0;
if (entropy < minEntropy) {
return { pass: false, reason: `Low entropy: ${entropy.toFixed(2)}` };
}
}
return { pass: true, reason: `Entropy: ${entropy.toFixed(2)}` };
}
},
{
name: 'Semantic Coherence',
evaluate: (ex: GeneratedExample) => {
// Check if output answers instruction
const coherence = true; // Would use embedding similarity
return {
pass: coherence,
reason: coherence ? 'Coherent' : 'Incoherent'
};
}
}
];
}
function calculateEntropy(text: string): number {
const words = text.toLowerCase().split(/\s+/);
const freq: { [key: string]: number } = {};
words.forEach(word => {
freq[word] = (freq[word] || 0) + 1;
});
let entropy = 0;
const total = words.length;
Object.values(freq).forEach(count => {
const p = count / total;
entropy -= p * Math.log2(p);
});
return entropy;
}
async function filterDataset(
examples: GeneratedExample[]
): Promise<{
passed: GeneratedExample[];
rejected: GeneratedExample[];
report: { [key: string]: number };
}> {
const filters = createQualityFilters();
const passed: GeneratedExample[] = [];
const rejected: GeneratedExample[] = [];
const report: { [key: string]: number } = {};
filters.forEach(f => {
report[f.name] = 0;
});
examples.forEach(ex => {
let shouldKeep = true;
for (const filter of filters) {
const result = filter.evaluate(ex);
if (!result.pass) {
shouldKeep = false;
report[filter.name]++;
break;
}
}
if (shouldKeep) {
passed.push(ex);
} else {
rejected.push(ex);
}
});
return { passed, rejected, report };
}
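The deduplication filter above is stubbed out (`isDuplicate = false`). One lightweight way to fill it in is a Jaccard-similarity check over word sets; this is a sketch, and `isNearDuplicate` plus the 0.9 threshold are illustrative choices:

```typescript
// Jaccard similarity between two texts' word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  let intersection = 0;
  for (const t of setA) if (setB.has(t)) intersection++;
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 1 : intersection / union;
}

// Flag a candidate as a near-duplicate of any already-accepted example.
function isNearDuplicate(
  candidate: string,
  accepted: string[],
  threshold: number = 0.9
): boolean {
  return accepted.some(existing => jaccard(candidate, existing) >= threshold);
}
```

In production you would compare each new example against the accepted set (or an embedding index for fuzzier matches) instead of returning a constant.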
Diversity Metrics
interface DiversityMetrics {
tokenCoverage: number; // % of vocabulary used
lexicalDiversity: number; // Type/token ratio
instructionVariety: number; // % unique instruction templates
responseVariety: number; // % unique response patterns
}
function calculateDiversityMetrics(
examples: GeneratedExample[]
): DiversityMetrics {
// Collect all tokens
const allTokens: Set<string> = new Set();
const instructionPatterns: Set<string> = new Set();
const responsePatterns: Set<string> = new Set();
let totalTokens = 0;
examples.forEach(ex => {
const instrTokens = ex.instruction.toLowerCase().split(/\s+/);
const respTokens = ex.output.toLowerCase().split(/\s+/);
instrTokens.forEach(t => {
allTokens.add(t);
totalTokens++;
});
respTokens.forEach(t => {
allTokens.add(t);
totalTokens++;
});
// Extract pattern (first 5 words)
const instrPattern = instrTokens.slice(0, 5).join(' ');
const respPattern = respTokens.slice(0, 5).join(' ');
instructionPatterns.add(instrPattern);
responsePatterns.add(respPattern);
});
return {
tokenCoverage: (allTokens.size / 50000) * 100, // Assume 50k vocab
lexicalDiversity: allTokens.size / totalTokens,
instructionVariety: (instructionPatterns.size / examples.length) * 100,
responseVariety: (responsePatterns.size / examples.length) * 100
};
}
// Target metrics:
// - Token coverage: >15%
// - Lexical diversity: >0.5 (high type/token ratio)
// - Instruction variety: >60%
// - Response variety: >70%
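The targets above can be turned into a simple pass/fail gate over the `DiversityMetrics` shape, so a batch that misses a threshold can be flagged for regeneration. A minimal sketch (`checkDiversityTargets` is hypothetical; the thresholds are the ones listed above):

```typescript
interface DiversityMetrics {
  tokenCoverage: number;
  lexicalDiversity: number;
  instructionVariety: number;
  responseVariety: number;
}

// Compare measured diversity against the target thresholds and
// report which checks failed.
function checkDiversityTargets(
  m: DiversityMetrics
): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  if (m.tokenCoverage <= 15) failures.push('tokenCoverage');
  if (m.lexicalDiversity <= 0.5) failures.push('lexicalDiversity');
  if (m.instructionVariety <= 60) failures.push('instructionVariety');
  if (m.responseVariety <= 70) failures.push('responseVariety');
  return { pass: failures.length === 0, failures };
}
```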
Synthetic Data for Classification Tasks
interface ClassificationDataset {
classes: string[];
examples: Array<{
text: string;
label: string;
confidence: number;
}>;
}
async function generateClassificationData(
classes: string[],
examplesPerClass: number
): Promise<ClassificationDataset> {
const examples: Array<{
text: string;
label: string;
confidence: number;
}> = [];
for (const label of classes) {
const prompt = `Generate ${examplesPerClass} realistic examples of text that belong to the "${label}" category.
Make them diverse in length, style, and content.
Output as JSON array with field "text" only.`;
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }],
temperature: 0.9
})
});
const data = await response.json();
const content = data.choices[0].message.content;
const jsonMatch = content.match(/\[[\s\S]*\]/);
if (jsonMatch) {
const texts = JSON.parse(jsonMatch[0]);
texts.forEach((item: any) => {
examples.push({
text: item.text,
label,
confidence: 0.95 // Synthetic data gets high confidence initially
});
});
}
}
return {
classes,
examples
};
}
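Because generation can silently under-deliver for some labels (a failed JSON parse skips an entire class above), it is worth checking class balance before training. A sketch, assuming a hypothetical `checkClassBalance` helper:

```typescript
// Count examples per label and flag classes that fall below a
// minimum share of a perfectly balanced split (default: half).
function checkClassBalance(
  examples: Array<{ label: string }>,
  classes: string[],
  minShareOfBalanced: number = 0.5
): { counts: Record<string, number>; underrepresented: string[] } {
  const counts: Record<string, number> = {};
  classes.forEach(c => (counts[c] = 0));
  examples.forEach(ex => {
    if (ex.label in counts) counts[ex.label]++;
  });
  const balanced = examples.length / classes.length;
  const underrepresented = classes.filter(
    c => counts[c] < balanced * minShareOfBalanced
  );
  return { counts, underrepresented };
}
```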
Data Flywheel: Product → Data → Model → Product
interface DataFlywheel {
phase: 'bootstrap' | 'iterate' | 'scale';
sourceData: 'synthetic' | 'real' | 'mixed';
modelPerformance: number;
productUse: number; // Percentage of production using this model
}
async function implementDataFlywheel(): Promise<void> {
// Phase 1: Bootstrap with synthetic data
console.log('Phase 1: Generate 1000 synthetic examples with GPT-4');
// Phase 2: Deploy and collect user feedback
console.log(
'Phase 2: Deploy model v1, collect corrections on 10% errors'
);
// Phase 3: Use corrections to improve synthetic generator
console.log(
'Phase 3: Fine-tune prompt using real user corrections'
);
// Phase 4: Generate higher-quality synthetic data
console.log('Phase 4: Generate v2 dataset (2000 examples) with improved prompt');
// Phase 5: Train improved model
console.log('Phase 5: Train model v2 on mixed real + synthetic data');
// Phase 6: Scale
console.log('Phase 6: Gradual rollout: 5% → 25% → 100% traffic');
// Each iteration:
// Errors from production → corrections → better synthetic generator
// Synthetic data quality improves as product matures
}
// Real example: Anthropic's Constitutional AI
// Started with synthetic examples, iterated on human feedback
// Later generations of synthetic data much higher quality
Legal and Ethical Considerations
interface LegalChecklistItem {
category: string;
requirement: string;
status: 'complete' | 'pending' | 'inapplicable';
}
function generateLegalChecklist(): LegalChecklistItem[] {
return [
{
category: 'Data Rights',
requirement:
'Base model training does not violate copyright (use commercially-licensed models)',
status: 'complete'
},
{
category: 'Bias',
requirement:
'Synthetic data does not amplify harmful stereotypes or biases',
status: 'pending'
},
{
category: 'Attribution',
requirement:
'Document synthetic data sources and generation methodology',
status: 'complete'
},
{
category: 'PII',
requirement:
'Generated examples do not contain real personally identifiable information',
status: 'complete'
},
{
category: 'GDPR/Privacy',
requirement:
'If deployed in EU: ensure synthetic data does not recreate personal data',
status: 'pending'
},
{
category: 'Disclosure',
requirement:
'If publishing results: disclose percentage of synthetic vs real data',
status: 'pending'
}
];
}
// Best practices:
// - Document all synthetic data sources
// - Audit for bias before production
// - Be transparent about synthetic data usage
// - Keep real data for validation and grounding
Checklist
- Identified domain-specific requirements for synthetic data
- Generated 1000+ synthetic examples with GPT-4 (temperature=0.9-1.0)
- Applied Evol-Instruct to create difficulty gradients
- Used self-play to generate diverse outputs per prompt
- Filtered dataset with length, deduplication, and perplexity checks
- Verified diversity metrics (lexical diversity >0.5, variety >60%)
- Mixed synthetic with 10-20% real examples
- Tested on benchmark tasks vs baseline
- Set up data flywheel: production errors → synthetic improvement
- Documented synthetic data sources and generation methodology
- Obtained legal review for bias and copyright concerns
Conclusion
Synthetic data generation with GPT-4 accelerates dataset creation and enables data flywheels in which production errors improve future training data. Combine evolution techniques, self-play, and rigorous quality filtering to reach 70-80% of hand-annotated quality at a fraction of the cost.