RLHF and DPO in Practice — Aligning Open-Source LLMs With Preference Data
Introduction
Base LLMs can be unsafe, unhelpful, or poorly matched to user intent. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) fine-tune models to follow instructions and reflect human preferences. This guide covers the practical implementation of both approaches.
- RLHF Pipeline Overview
- Stage 2: Reward Model Training
- Stage 3: Policy Optimization with PPO
- DPO: Direct Preference Optimization
- Preference Data Collection
- ORPO: Simpler Alternative
- Evaluation: Win Rate and Alignment Metrics
- Preventing Catastrophic Forgetting
- Checklist
- Conclusion
RLHF Pipeline Overview
RLHF requires three stages: supervised fine-tuning, reward modeling, and policy optimization.
Stage 1: Supervised Fine-Tuning (SFT) trains the model on high-quality instruction-response pairs:
interface TrainingExample {
instruction: string;
response: string;
}
interface SFTTrainer {
model: LLMModel;
trainingData: TrainingExample[];
epochs: number;
learningRate: number;
}
async function trainSupervisedFinetuning(
baseModel: LLMModel,
examples: TrainingExample[]
): Promise<LLMModel> {
const trainer = {
model: baseModel.clone(),
trainingData: examples,
epochs: 3,
learningRate: 2e-5
};
for (let epoch = 0; epoch < trainer.epochs; epoch++) {
let totalLoss = 0;
for (const example of trainer.trainingData) {
const prompt = `${example.instruction}\n\n`;
const loss = await trainer.model.computeLoss(
prompt,
example.response
);
totalLoss += loss;
await trainer.model.backpropagate(loss, trainer.learningRate);
}
console.log(`Epoch ${epoch + 1} loss: ${totalLoss}`);
}
return trainer.model;
}
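One detail the loop above glosses over: the SFT loss is usually computed only on response tokens, with prompt tokens masked out. A self-contained sketch of that masking, with plain numbers standing in for per-token losses (the helper is hypothetical, not part of the trainer):

```typescript
// Average per-token loss over response tokens only; prompt tokens are masked
function maskedMeanLoss(tokenLosses: number[], promptLength: number): number {
  const responseLosses = tokenLosses.slice(promptLength);
  if (responseLosses.length === 0) return 0;
  return responseLosses.reduce((a, b) => a + b, 0) / responseLosses.length;
}

// The first 2 per-token losses belong to the prompt and are excluded
console.log(maskedMeanLoss([9.0, 9.0, 1.0, 2.0, 3.0], 2)); // 2
```

Without masking, the model spends capacity memorizing prompts rather than learning to respond to them.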
Stage 2: Reward Model Training
The reward model learns to score responses according to human preference. It is trained on (prompt, chosen_response, rejected_response) triples:
interface PreferencePair {
prompt: string;
chosenResponse: string;
rejectedResponse: string;
}
class RewardModel {
private model: LLMModel;
async train(preferences: PreferencePair[]): Promise<void> {
for (const pair of preferences) {
const chosenScore = await this.scoreResponse(pair.prompt, pair.chosenResponse);
const rejectedScore = await this.scoreResponse(pair.prompt, pair.rejectedResponse);
// Bradley–Terry ranking loss: -log(sigmoid(chosen - rejected)); chosen should score higher
const rankingLoss = Math.log(1 + Math.exp(-(chosenScore - rejectedScore)));
await this.model.backward(rankingLoss);
}
}
async scoreResponse(prompt: string, response: string): Promise<number> {
const combinedText = `${prompt}\n${response}`;
const embedding = await this.model.getEmbedding(combinedText);
// Placeholder scalar score; a real reward head projects the final hidden state to a scalar
return Math.tanh(embedding[0]);
}
async getRewardScore(prompt: string, response: string): Promise<number> {
return this.scoreResponse(prompt, response);
}
}
async function trainRewardModel(
baseModel: LLMModel,
preferences: PreferencePair[]
): Promise<RewardModel> {
const rewardModel = new RewardModel();
rewardModel['model'] = baseModel.clone(); // Bracket access bypasses `private` for this sketch
await rewardModel.train(preferences);
return rewardModel;
}
Stage 3: Policy Optimization with PPO
PPO (Proximal Policy Optimization) fine-tunes the SFT model using reward signals from the reward model:
interface PPOConfig {
epochs: number;
batchSize: number;
learningRate: number;
clipRange: number; // e.g., 0.2
valueCoef: number; // weight of value loss
entropyCoef: number; // weight of entropy bonus
}
class PPOTrainer {
private policy: LLMModel;
private rewardModel: RewardModel;
private referenceModel: LLMModel;
private config: PPOConfig;
constructor(
policy: LLMModel,
rewardModel: RewardModel,
config: PPOConfig
) {
this.policy = policy;
this.rewardModel = rewardModel;
this.referenceModel = policy.clone(); // Frozen copy for KL divergence
this.config = config;
}
async train(prompts: string[]): Promise<void> {
for (let epoch = 0; epoch < this.config.epochs; epoch++) {
let totalReward = 0;
for (const prompt of prompts) {
// Generate response from policy
const response = await this.policy.generate(prompt);
// Get reward score
const reward = await this.rewardModel.getRewardScore(prompt, response);
// Compute policy loss with KL divergence penalty
const logprobPolicy = await this.policy.computeLogProb(prompt, response);
const logprobReference = await this.referenceModel.computeLogProb(prompt, response);
const klDivergence = logprobPolicy - logprobReference; // Per-sample estimate of KL(policy ‖ reference)
const adjustedReward = reward - 0.02 * klDivergence; // KL penalty coefficient
const policyLoss = -adjustedReward * logprobPolicy;
// Update policy
await this.policy.backward(policyLoss, this.config.learningRate);
totalReward += reward;
}
console.log(`Epoch ${epoch + 1} avg reward: ${totalReward / prompts.length}`);
}
}
}
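The KL-penalized reward can be sanity-checked with plain scalars. A minimal sketch (hypothetical helper; the 0.02 coefficient mirrors the trainer above):

```typescript
// Reward minus a KL penalty; the per-sample KL(policy ‖ reference) is
// estimated by the log-prob gap between the two models on the sampled response.
function klPenalizedReward(
  reward: number,
  logprobPolicy: number,
  logprobReference: number,
  klCoef: number = 0.02
): number {
  const klEstimate = logprobPolicy - logprobReference; // positive when the policy drifts
  return reward - klCoef * klEstimate;
}

// No drift, no penalty; drifting from the reference reduces the effective reward
console.log(klPenalizedReward(1.0, -5.0, -5.0)); // 1
console.log(klPenalizedReward(1.0, -2.0, -5.0) < 1.0); // true
```

The penalty keeps the policy from collapsing onto degenerate high-reward outputs the reward model was never trained to score.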
DPO: Direct Preference Optimization
DPO skips the separate reward model and PPO loop entirely: it optimizes the policy directly on preference pairs, using a frozen reference model to define an implicit reward (the policy-vs-reference log-prob ratio):
class DPOTrainer {
  private model: LLMModel;
  private referenceModel: LLMModel; // Frozen copy, typically the SFT model
  private betaCoef: number = 0.1; // Controls deviation from the reference; 0.1–0.5 is typical
  constructor(model: LLMModel, referenceModel: LLMModel) {
    this.model = model;
    this.referenceModel = referenceModel;
  }
  async train(preferences: PreferencePair[]): Promise<void> {
    for (const pair of preferences) {
      const logprobChosen = await this.model.computeLogProb(pair.prompt, pair.chosenResponse);
      const logprobRejected = await this.model.computeLogProb(pair.prompt, pair.rejectedResponse);
      const refLogprobChosen = await this.referenceModel.computeLogProb(pair.prompt, pair.chosenResponse);
      const refLogprobRejected = await this.referenceModel.computeLogProb(pair.prompt, pair.rejectedResponse);
      // DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
      const logRatio =
        (logprobChosen - refLogprobChosen) - (logprobRejected - refLogprobRejected);
      const loss = -Math.log(1.0 / (1.0 + Math.exp(-this.betaCoef * logRatio)));
      await this.model.backward(loss);
    }
  }
  getModel(): LLMModel {
    return this.model;
  }
}
async function trainWithDPO(
  baseModel: LLMModel,
  preferences: PreferencePair[]
): Promise<LLMModel> {
  const policy = baseModel.clone();
  const reference = baseModel.clone(); // Frozen reference for the implicit reward
  const trainer = new DPOTrainer(policy, reference);
  await trainer.train(preferences);
  return trainer.getModel();
}
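The β-scaled DPO loss can be checked numerically with scalar log-probs. A self-contained sketch; the implicit reward of each response is its policy-vs-reference log-prob ratio:

```typescript
const sigmoidOf = (x: number): number => 1 / (1 + Math.exp(-x));

// DPO loss on one preference pair from scalar sequence log-probs
function dpoPairLoss(
  policyChosen: number,
  policyRejected: number,
  refChosen: number,
  refRejected: number,
  beta: number = 0.1
): number {
  const chosenRatio = policyChosen - refChosen; // implicit reward of the chosen response
  const rejectedRatio = policyRejected - refRejected; // implicit reward of the rejected one
  return -Math.log(sigmoidOf(beta * (chosenRatio - rejectedRatio)));
}

// Lifting the chosen response above the reference lowers the loss
console.log(dpoPairLoss(-1.0, -3.0, -2.0, -2.0) < dpoPairLoss(-2.0, -2.0, -2.0, -2.0)); // true
```

Larger β values penalize deviation from the reference more sharply; β around 0.1 is a common starting point.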
Preference Data Collection
Quality preference data is critical. Design sampling carefully:
interface PreferenceCollectionConfig {
samplingStrategy: 'stratified' | 'random' | 'hard_negatives';
labelsPerExample: number;
disagreementThreshold: number; // Flag if raters disagree > this
}
class PreferenceDataCollector {
async collectPreferences(
prompts: string[],
candidateGenerators: LLMModel[],
config: PreferenceCollectionConfig
): Promise<PreferencePair[]> {
const preferences: PreferencePair[] = [];
for (const prompt of prompts) {
// Generate candidates
const responses = await Promise.all(
candidateGenerators.map((gen) => gen.generate(prompt))
);
// Sample pairs for labeling
const pairs = this.samplePairs(prompt, responses, config);
for (const pair of pairs) {
// Collect labels from multiple raters
const labels = await this.collectLabels(
prompt,
pair.response1,
pair.response2,
config.labelsPerExample
);
// Determine preference (majority vote)
const chosenIdx = this.computeMajorityPreference(labels);
const rejectedIdx = 1 - chosenIdx;
preferences.push({
prompt,
chosenResponse: [pair.response1, pair.response2][chosenIdx],
rejectedResponse: [pair.response1, pair.response2][rejectedIdx]
});
}
}
return preferences;
}
private samplePairs(
prompt: string,
responses: string[],
config: PreferenceCollectionConfig
): Array<{ response1: string; response2: string }> {
const pairs: Array<{ response1: string; response2: string }> = [];
if (config.samplingStrategy === 'random') {
  // Stand-in for random sampling: pair adjacent candidates
  for (let i = 0; i < responses.length - 1; i++) {
    pairs.push({
      response1: responses[i],
      response2: responses[i + 1]
    });
  }
} else if (config.samplingStrategy === 'hard_negatives') {
// Pair good responses with near-miss responses
const quality = responses.map((r) => this.scoreQuality(r));
const sorted = responses
.map((r, i) => ({ response: r, quality: quality[i] }))
.sort((a, b) => b.quality - a.quality);
// Pair top response with mid-range responses
for (let i = 1; i < sorted.length && i < 5; i++) {
pairs.push({
response1: sorted[0].response,
response2: sorted[i].response
});
}
}
// Note: the 'stratified' strategy is left unimplemented in this sketch
return pairs;
}
private async collectLabels(
prompt: string,
response1: string,
response2: string,
count: number
): Promise<number[]> {
// In production: send to labeling service
// Returns array of 0 or 1 indicating preference
return [1, 1, 1]; // Placeholder: all prefer response2
}
private computeMajorityPreference(labels: number[]): number {
const sum = labels.reduce((a, b) => a + b, 0);
return sum > labels.length / 2 ? 1 : 0;
}
private scoreQuality(response: string): number {
// Simple heuristic: prefer longer, more detailed responses
return response.split(' ').length;
}
}
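The config declares a `disagreementThreshold` that the collector never uses. One plausible use, sketched below as a hypothetical standalone helper, is to flag pairs whose minority-vote share is too high to trust:

```typescript
// Majority preference plus an ambiguity flag based on rater disagreement.
// labels: 0 = first response preferred, 1 = second response preferred.
function resolveLabels(
  labels: number[],
  disagreementThreshold: number
): { chosenIdx: number; ambiguous: boolean } {
  const ones = labels.reduce((a, b) => a + b, 0);
  const chosenIdx = ones > labels.length / 2 ? 1 : 0;
  const minorityShare = Math.min(ones, labels.length - ones) / labels.length;
  return { chosenIdx, ambiguous: minorityShare > disagreementThreshold };
}

console.log(resolveLabels([1, 1, 1, 0, 1], 0.3).ambiguous); // false — 4 of 5 raters agree
console.log(resolveLabels([1, 0, 1, 0, 1], 0.3).ambiguous); // true — a 3-to-2 split
```

Flagged pairs can be dropped, relabeled, or down-weighted; training on near-ties mostly adds noise to the preference signal.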
ORPO: Simpler Alternative
ORPO (Odds Ratio Preference Optimization) folds preference optimization into the SFT stage itself. In practice the SFT and odds-ratio losses are summed per batch; the sketch below loops over them separately for clarity:
class ORPOTrainer {
private model: LLMModel;
private lambda: number = 0.1; // Weight of the odds-ratio term relative to the SFT loss (the ORPO paper uses λ ≈ 0.1)
async train(
supervisedExamples: TrainingExample[],
preferenceExamples: PreferencePair[]
): Promise<void> {
// SFT loss
for (const example of supervisedExamples) {
const sftLoss = await this.model.computeLoss(
example.instruction,
example.response
);
await this.model.backward(sftLoss);
}
// Preference loss
for (const pref of preferenceExamples) {
const logprobChosen = await this.model.computeLogProb(
pref.prompt,
pref.chosenResponse
);
const logprobRejected = await this.model.computeLogProb(
pref.prompt,
pref.rejectedResponse
);
// Odds-ratio loss, weighted by lambda (the log-prob difference stands in for the full log-odds ratio)
const oddsRatioLoss =
  -this.lambda * Math.log(sigmoid(logprobChosen - logprobRejected));
await this.model.backward(oddsRatioLoss);
}
}
}
function sigmoid(x: number): number {
return 1 / (1 + Math.exp(-x));
}
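Strictly, ORPO's preference term is built from log-odds of sequence probabilities, log(p / (1 − p)), rather than raw log-probs; the class above uses the log-prob difference as a simplification. The exact quantity, assuming per-sequence probabilities are available:

```typescript
// Log-odds of a probability: log(p / (1 - p))
function logOdds(p: number): number {
  return Math.log(p / (1 - p));
}

// ORPO-style log-odds ratio between chosen and rejected sequence probabilities
function logOddsRatio(pChosen: number, pRejected: number): number {
  return logOdds(pChosen) - logOdds(pRejected);
}

// Positive when the model already favors the chosen response
console.log(logOddsRatio(0.6, 0.2) > 0); // true
console.log(logOddsRatio(0.5, 0.5)); // 0
```

The odds formulation penalizes the rejected response more aggressively as its probability approaches 1, which is the behavior the simplified log-prob difference loses.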
Evaluation: Win Rate and Alignment Metrics
interface AlignmentEvaluation {
winRate: number; // % of comparisons where aligned model wins
instructionFollowing: number; // Does model follow explicit constraints?
harmlessness: number; // Safety metrics
helpfulness: number; // Usefulness to users
}
async function evaluateAlignment(
alignedModel: LLMModel,
baselineModel: LLMModel,
testSet: { prompt: string; expectedBehavior: string }[]
): Promise<AlignmentEvaluation> {
let wins = 0;
let instructionFollowingScore = 0;
let harmlessScore = 0;
let helpfulnessScore = 0;
for (const test of testSet) {
const alignedResponse = await alignedModel.generate(test.prompt);
const baselineResponse = await baselineModel.generate(test.prompt);
// Pairwise comparison via an LLM judge (preferLLM is an assumed helper)
const alignedWins = await preferLLM(
  test.prompt,
  alignedResponse,
  baselineResponse
);
if (alignedWins) wins++;
// Instruction following (checkInstructionFollowing is an assumed helper)
const followsInstructions = checkInstructionFollowing(
  test.prompt,
  alignedResponse
);
instructionFollowingScore += followsInstructions ? 1 : 0;
// Harmlessness (isHarmful is an assumed safety classifier)
const harmless = !(await isHarmful(alignedResponse));
harmlessScore += harmless ? 1 : 0;
// Helpfulness (isHelpful is an assumed classifier)
const helpful = await isHelpful(alignedResponse);
helpfulnessScore += helpful ? 1 : 0;
}
const n = testSet.length;
return {
winRate: wins / n,
instructionFollowing: instructionFollowingScore / n,
harmlessness: harmlessScore / n,
helpfulness: helpfulnessScore / n
};
}
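Win rates measured on small test sets carry substantial sampling noise. A rough normal-approximation standard error, sketched here as an assumed addition to the evaluation above, shows how many comparisons a conclusion needs:

```typescript
// Standard error of a win rate under the normal approximation: sqrt(p(1-p)/n)
function winRateStandardError(winRate: number, n: number): number {
  return Math.sqrt((winRate * (1 - winRate)) / n);
}

// A 55% win rate over 100 comparisons has ~5% standard error —
// not clearly distinguishable from a 50/50 tie
console.log(winRateStandardError(0.55, 100).toFixed(3)); // prints 0.050
// The same rate over 2,000 comparisons is far more conclusive
console.log(winRateStandardError(0.55, 2000) < 0.012); // true
```

As a rule of thumb, halving the standard error requires quadrupling the number of pairwise comparisons.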
Preventing Catastrophic Forgetting
Alignment training can degrade the base model's capabilities (catastrophic forgetting). One mitigation is to mix SFT data back into preference training:
interface MixedTrainingConfig {
sftDataRatio: number; // 0.2 = 20% SFT, 80% preference
mixingStrategy: 'interleaved' | 'sequential';
evaluationFrequency: number; // epochs
}
async function trainWithForgettingPrevention(
baseModel: LLMModel,
sftExamples: TrainingExample[],
preferences: PreferencePair[],
config: MixedTrainingConfig
): Promise<LLMModel> {
const model = baseModel.clone();
if (config.mixingStrategy === 'interleaved') {
  // Interleave so that roughly sftDataRatio of steps use SFT examples
  let sftIdx = 0;
  let prefIdx = 0;
  const totalSteps = sftExamples.length + preferences.length;
  for (let step = 0; step < totalSteps; step++) {
    const takeSft =
      sftIdx < sftExamples.length &&
      (prefIdx >= preferences.length || Math.random() < config.sftDataRatio);
    if (takeSft) {
      await model.trainStep(sftExamples[sftIdx++]);
    } else if (prefIdx < preferences.length) {
      await model.trainStep(preferences[prefIdx++]);
    }
  }
}
return model;
}
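A deterministic variant of the interleaving is easier to reproduce than random mixing. A sketch of a schedule builder (hypothetical helper) that honors `sftDataRatio` exactly via an error accumulator:

```typescript
// Build a deterministic step schedule in which 'sft' steps appear
// with frequency sftDataRatio and 'pref' steps fill the remainder.
function buildMixedSchedule(
  totalSteps: number,
  sftDataRatio: number
): Array<'sft' | 'pref'> {
  const schedule: Array<'sft' | 'pref'> = [];
  let sftCredit = 0;
  for (let i = 0; i < totalSteps; i++) {
    sftCredit += sftDataRatio; // accumulate fractional SFT "credit" each step
    if (sftCredit >= 1) {
      schedule.push('sft');
      sftCredit -= 1;
    } else {
      schedule.push('pref');
    }
  }
  return schedule;
}

const schedule = buildMixedSchedule(10, 0.2);
console.log(schedule.filter((s) => s === 'sft').length); // 2
```

The same schedule can then drive `model.trainStep` calls, keeping the SFT fraction fixed regardless of dataset sizes.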
Checklist
- Collect high-quality supervised fine-tuning data
- Start with SFT baseline before preference tuning
- Design preference data collection with stratified sampling
- Collect multiple labels per example and resolve disagreement
- Choose between RLHF (complex but powerful) and DPO (simple, effective)
- Train reward model on preference pairs if using RLHF
- Monitor for catastrophic forgetting of base capabilities
- Evaluate alignment using win rate comparisons
- Test instruction following and harmlessness explicitly
- Consider ORPO as faster single-stage alternative
Conclusion
Alignment is the bridge between capable models and production-safe systems. RLHF provides fine-grained control through reward models but adds complexity. DPO skips the reward model, making it simpler and faster. Both require quality preference data and careful evaluation. Mixing SFT with preference training prevents capability degradation. Robust evaluation across alignment dimensions — win rate, instruction following, harmlessness, helpfulness — ensures safe deployment.