Prompt Optimization — Automatic and Manual Techniques to Improve LLM Performance

Introduction

In many scenarios, prompt quality matters more for output quality than model size: a weak prompt sent to GPT-4 can lose to a well-crafted prompt sent to a 7B model. This guide covers both manual techniques and automated approaches for systematically improving prompt performance.

Structured Prompt Templates

Unstructured prompts leave the model guessing. Use templates that make the role, task, format, and examples explicit:

interface Example {
  input: string;
  output: string;
}

interface StructuredPrompt {
  role: string;
  task: string;
  format: string;
  constraints: string[];
  examples: Example[];
  input: string;
}

class PromptBuilder {
  private role: string = '';
  private task: string = '';
  private format: string = '';
  private constraints: string[] = [];
  private examples: Array<{ input: string; output: string }> = [];

  setRole(role: string): this {
    this.role = role;
    return this;
  }

  setTask(task: string): this {
    this.task = task;
    return this;
  }

  setFormat(format: string): this {
    this.format = format;
    return this;
  }

  addConstraint(constraint: string): this {
    this.constraints.push(constraint);
    return this;
  }

  addExample(input: string, output: string): this {
    this.examples.push({ input, output });
    return this;
  }

  build(): string {
    let prompt = '';
    if (this.role) {
      prompt += `You are a ${this.role}.\n\n`;
    }

    prompt += `Task: ${this.task}\n\n`;

    if (this.format) {
      prompt += `Output format: ${this.format}\n\n`;
    }

    if (this.constraints.length > 0) {
      prompt += `Constraints:\n`;
      this.constraints.forEach((c, i) => {
        prompt += `${i + 1}. ${c}\n`;
      });
      prompt += '\n';
    }

    if (this.examples.length > 0) {
      prompt += `Examples:\n`;
      this.examples.forEach((ex) => {
        prompt += `Input: ${ex.input}\nOutput: ${ex.output}\n\n`;
      });
    }

    return prompt;
  }
}

const sqlGenPrompt = new PromptBuilder()
  .setRole('Expert SQL database administrator')
  .setTask('Convert natural language queries to PostgreSQL')
  .setFormat('Return only the SQL query, no explanation')
  .addConstraint('Use parameterized queries to prevent injection')
  .addConstraint('Prefer explicit joins over subqueries')
  .addExample(
    'Show all customers from France who bought in Q1',
    'SELECT c.* FROM customers c JOIN orders o ON o.customer_id = c.id WHERE c.country = $1 AND EXTRACT(QUARTER FROM o.order_date) = 1'
  )
  .build();

Chain-of-Thought Elicitation

Models perform better when they reason through problems step by step. Simply appending "Let's think step by step" has been shown to substantially improve accuracy on math and logic benchmarks, often by double-digit percentage points.

interface ChainOfThoughtPrompt {
  problem: string;
  stepInstructions: string;
  outputFormat: string;
}

function buildChainOfThoughtPrompt(problem: string): string {
  return `Problem: ${problem}

Let's break this down step by step:
1. Identify what we know and what we need to find
2. Plan the approach
3. Execute the plan
4. Verify the answer

Reasoning:`;
}

// For complex reasoning, structure explicitly:
function buildExplicitReasoningPrompt(query: string): string {
  return `Answer the following query by working through it systematically.

Query: ${query}

Work through the following steps:

Step 1 - Clarify the question:
[What exactly are we being asked?]

Step 2 - Identify relevant information:
[What information is needed to answer?]

Step 3 - Develop reasoning:
[Walk through the logic]

Step 4 - Conclusion:
[Final answer]`;
}

Few-Shot Example Selection

More examples aren't always better. Quality and diversity matter more than quantity.

Poor selection: pick first 3 examples from training data.

Better selection: choose examples that are:

  • Representative of input distribution
  • Diverse (cover different sub-problems, edge cases, answer types)
  • Demonstrative (clearly show how to solve, not just cherry-picked correct answers)
  • Non-contaminated (don't include the test example)

interface TrainingExample {
  input: string;
  output: string;
}

interface FewShotSelector {
  examples: TrainingExample[];
  testInput: string;
}

class DiverseFewShotSelector {
  private examples: TrainingExample[] = [];

  loadExamples(examples: TrainingExample[]): void {
    this.examples = examples;
  }

  selectDiverse(testInput: string, k: number = 3): TrainingExample[] {
    const embeddings = this.embedExamples();
    const testEmbedding = this.embed(testInput);

    // Start with the 20 candidates most similar to the test input
    const candidates = this.examples
      .map((ex, idx) => ({
        example: ex,
        embedding: embeddings[idx],
        similarity: cosineSimilarity(embeddings[idx], testEmbedding)
      }))
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, 20);

    // Greedily select a diverse subset, skipping near-duplicates
    const selected: TrainingExample[] = [];
    const selectedEmbeddings: number[][] = [];

    while (selected.length < k && candidates.length > 0) {
      const candidate = candidates.shift()!;

      // Check diversity against already selected examples
      const tooSimilar = selectedEmbeddings.some(
        (emb) => cosineSimilarity(emb, candidate.embedding) > 0.85
      );

      if (!tooSimilar) {
        selected.push(candidate.example);
        selectedEmbeddings.push(candidate.embedding);
      }
    }

    // May return fewer than k if all remaining candidates are near-duplicates
    return selected;
  }

  private embedExamples(): number[][] {
    return this.examples.map((ex) => this.embed(ex.input + ' ' + ex.output));
  }

  private embed(text: string): number[] {
    // Stub: call your embedding API here and return the vector
    return [];
  }
}
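The selector above calls a cosineSimilarity helper that is never defined. A minimal version over dense embedding vectors, with a guard for zero-length or zero-norm vectors (the stubbed embed() above returns an empty array):

```typescript
// Cosine similarity between two dense vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```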

Prompt Compression

Long prompts cost more and slow down inference. Compress without losing meaning:

interface CompressionConfig {
  maxTokens: number;
  preserveExamples: boolean;
  strategy: 'aggressive' | 'conservative';
}

async function compressPrompt(
  prompt: string,
  config: CompressionConfig
): Promise<string> {
  // Strategy 1: Collapse accidental consecutive word repeats ("the the")
  let compressed = prompt
    .split(/\s+/)
    .filter((word, idx, arr) => {
      return idx === 0 || arr[idx - 1] !== word;
    })
    .join(' ');

  // Strategy 2: Summarize instructions
  // (extractInstructions and summarizeInstructions are assumed helpers)
  if (config.strategy === 'aggressive') {
    const instructions = extractInstructions(prompt);
    const summary = await summarizeInstructions(instructions);
    compressed = compressed.replace(instructions.join('\n'), summary);
  }

  // Strategy 3: Hard truncate, using word count as a rough proxy for tokens
  const tokens = compressed.split(/\s+/);
  if (tokens.length > config.maxTokens) {
    compressed = tokens.slice(0, config.maxTokens).join(' ') + '...';
  }

  return compressed;
}
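The truncation step above counts whitespace-separated words, which only approximates tokens. A common rough heuristic for English text is about four characters per token; the ratio here is an assumption to validate against your model's actual tokenizer before relying on it for budgeting:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// The ratio is a heuristic; verify it against your model's real tokenizer.
function estimateTokens(text: string, charsPerToken: number = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Decide whether a prompt is worth compressing at all
function needsCompression(prompt: string, maxTokens: number): boolean {
  return estimateTokens(prompt) > maxTokens;
}
```

This lets you skip the compression pass entirely for prompts already under budget, which avoids degrading prompts that didn't need it.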

DSPy for Automatic Prompt Optimization

DSPy, which grew out of the Demonstrate-Search-Predict framework, automates prompt engineering. Rather than tweaking prompts by hand, it searches over candidate prompts and few-shot examples against a metric.

interface DSPyOptimizer {
  model: LLMModel;
  trainSet: Example[];
  valSet: Example[];
  metric: (pred: string, gold: string) => number;
}

async function optimizeWithDSPy(
  task: string,
  examples: Example[],
  metric: (output: string, expected: string) => number
): Promise<OptimizedPrompt> {
  // A real framework handles this loop; the sketch below shows the idea:
  // 1. Generate candidate prompts
  // 2. Evaluate each on a validation set
  // 3. Keep the best-performing variants
  // 4. Iterate with local search

  interface PromptVariant {
    prompt: string;
    score: number;
    examples: Example[];
  }

  const variants: PromptVariant[] = [];

  // Generate variations (generateInitialPrompt and evaluateVariant are assumed helpers)
  const basePrompt = generateInitialPrompt(task);
  variants.push(await evaluateVariant(basePrompt, examples, metric));

  // Add a chain-of-thought variant
  variants.push(
    await evaluateVariant(basePrompt + "\nLet's think step by step.", examples, metric)
  );

  // Try different example selections
  for (let i = 0; i < 5; i++) {
    const selectedExamples = selectRandomExamples(examples, 3);
    const withExamples = basePrompt + '\n\nExamples:\n' +
      selectedExamples.map((ex) => `Q: ${ex.input}\nA: ${ex.output}`).join('\n\n');
    variants.push(await evaluateVariant(withExamples, examples, metric));
  }

  // Return the best variant (in practice, score candidates on a held-out
  // validation set, not the same examples used to build them)
  return variants.reduce((best, v) => (v.score > best.score ? v : best));
}

Meta-Prompting

Ask the model to generate its own instructions. Model-written prompts can match or beat manually engineered ones, particularly for tasks that are tedious to specify by hand.

async function generatePromptWithMetaPrompting(
  task: string,
  examples: Example[]
): Promise<string> {
  const metaPrompt = `You are an expert prompt engineer. Your task is to create an excellent prompt for the following task:

Task: ${task}

Examples of expected behavior:
${examples.map((ex) => `Input: ${ex.input}\nOutput: ${ex.output}`).join('\n\n')}

Generate a detailed, well-structured prompt that would help an LLM reliably solve this task. Include:
1. A clear role/persona
2. Explicit instructions
3. Output format specifications
4. Any relevant constraints
5. Examples if helpful`;

  // `llm` is an assumed client wrapping your model API
  const generatedPrompt = await llm.generate(metaPrompt);
  return generatedPrompt;
}

Evaluating Prompt Changes Systematically

Never deploy prompt changes without measurement:

interface PromptEvaluation {
  oldPrompt: string;
  newPrompt: string;
  testSet: Example[];
  results: EvalResult[];
}

async function comparePrompts(
  oldPrompt: string,
  newPrompt: string,
  testSet: Example[],
  metric: (output: string, expected: string) => number
): Promise<PromptEvaluation> {
  // evaluatePrompt is an assumed helper that runs each test example
  const oldResults = await evaluatePrompt(oldPrompt, testSet, metric);
  const newResults = await evaluatePrompt(newPrompt, testSet, metric);

  const oldScore = oldResults.reduce((a, b) => a + b.score, 0) / oldResults.length;
  const newScore = newResults.reduce((a, b) => a + b.score, 0) / newResults.length;

  const improvement = ((newScore - oldScore) / oldScore) * 100;
  console.log(`Score change: ${improvement.toFixed(2)}%`);

  // A fixed threshold is only a heuristic, not a significance test;
  // run a proper test (e.g. a paired bootstrap) before shipping
  if (improvement > 2) {
    console.log('Improvement exceeds the 2% heuristic threshold.');
  }

  return {
    oldPrompt,
    newPrompt,
    testSet,
    results: newResults
  };
}
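Comparing only the two mean scores can mislead on small test sets. A paired bootstrap over per-example score differences gives an actual significance estimate; this sketch assumes the two score arrays are aligned by test example index:

```typescript
// Paired bootstrap: resample per-example score differences with replacement
// and estimate the probability that the new prompt is genuinely better.
function pairedBootstrap(
  oldScores: number[],
  newScores: number[],
  iterations: number = 10000
): number {
  const diffs = newScores.map((s, i) => s - oldScores[i]);
  let wins = 0;
  for (let iter = 0; iter < iterations; iter++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      // Sample one difference with replacement
      sum += diffs[Math.floor(Math.random() * diffs.length)];
    }
    if (sum > 0) wins++;
  }
  // Fraction of resamples where the new prompt comes out ahead
  return wins / iterations;
}
```

A value near 1.0 (conventionally above 0.95) suggests the improvement is robust to resampling; values near 0.5 mean the change is indistinguishable from noise on this test set.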

Checklist

  • Use structured prompt templates (role, task, format, constraints)
  • Add chain-of-thought instructions for reasoning tasks
  • Select diverse few-shot examples rather than similar ones
  • Measure compression trade-offs before optimizing for cost
  • Explore DSPy for automatic prompt learning
  • Try meta-prompting for complex tasks
  • Augment prompts with retrieved context when available
  • A/B test all prompt changes on held-out test set
  • Track prompt versions alongside model versions
  • Document prompt decisions and findings

Conclusion

Prompt optimization is an underrated lever for improving model performance. Structured templates provide consistency. Chain-of-thought improves reasoning. Few-shot example selection beats random selection. Automatic techniques like DSPy and meta-prompting often exceed manual engineering. Systematic evaluation prevents deploying changes that feel good but don't deliver results. Treat prompts as code: version them, test them, and document decisions.