RLHF and DPO in Practice — Aligning Open-Source LLMs With Preference Data
Introduction
Base LLMs can be unsafe, unhelpful, or poorly matched to user intent. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) fine-tune models to follow instructions and reflect human preferences. This guide covers the practical implementation of both approaches.
- RLHF Pipeline Overview
- Stage 2: Reward Model Training
- Stage 3: Policy Optimization with PPO
- DPO: Direct Preference Optimization
- Preference Data Collection
- ORPO: Simpler Alternative
- Evaluation: Win Rate and Alignment Metrics
- Preventing Catastrophic Forgetting
- Checklist
- Conclusion
RLHF Pipeline Overview
RLHF requires three stages: supervised fine-tuning, reward modeling, and policy optimization.
Stage 1: Supervised Fine-Tuning (SFT) trains the model on high-quality instruction-response pairs:
interface TrainingExample {
instruction: string;
response: string;
}
interface SFTTrainer {
model: LLMModel;
trainingData: TrainingExample[];
epochs: number;
learningRate: number;
}
async function trainSupervisedFinetuning(
baseModel: LLMModel,
examples: TrainingExample[]
): Promise<LLMModel> {
const trainer = {
model: baseModel.clone(),
trainingData: examples,
epochs: 3,
learningRate: 2e-5
};
for (let epoch = 0; epoch < trainer.epochs; epoch++) {
let totalLoss = 0;
for (const example of trainer.trainingData) {
const prompt = `${example.instruction}\n\n`;
const loss = await trainer.model.computeLoss(
prompt,
example.response
);
totalLoss += loss;
await trainer.model.backpropagate(loss, trainer.learningRate);
}
console.log(`Epoch ${epoch + 1} loss: ${totalLoss}`);
}
return trainer.model;
}
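One detail the loop above glosses over: the SFT loss is usually computed only on response tokens, with prompt tokens masked out. A self-contained sketch of that masking, with plain numbers standing in for per-token losses (the helper is hypothetical, not part of the trainer):

```typescript
// Average per-token loss over response tokens only; prompt tokens are masked
function maskedMeanLoss(tokenLosses: number[], promptLength: number): number {
  const responseLosses = tokenLosses.slice(promptLength);
  if (responseLosses.length === 0) return 0;
  return responseLosses.reduce((a, b) => a + b, 0) / responseLosses.length;
}

// The first 2 per-token losses belong to the prompt and are excluded
console.log(maskedMeanLoss([9.0, 9.0, 1.0, 2.0, 3.0], 2)); // 2
```

Without masking, the model spends capacity memorizing prompts rather than learning to respond to them.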
Stage 2: Reward Model Training
The reward model learns to score responses according to human preference. It is trained on (prompt, chosen_response, rejected_response) triples:
interface PreferencePair {
prompt: string;
chosenResponse: string;
rejectedResponse: string;
}
class RewardModel {
private model: LLMModel;
async train(preferences: PreferencePair[]): Promise<void> {
for (const pair of preferences) {
const chosenScore = await this.scoreResponse(pair.prompt, pair.chosenResponse);
const rejectedScore = await this.scoreResponse(pair.prompt, pair.rejectedResponse);
// Bradley–Terry ranking loss: -log(sigmoid(chosen - rejected)); chosen should score higher
const rankingLoss = Math.log(1 + Math.exp(-(chosenScore - rejectedScore)));
await this.model.backward(rankingLoss);
}
}
async scoreResponse(prompt: string, response: string): Promise<number> {
const combinedText = `${prompt}\n${response}`;
const embedding = await this.model.getEmbedding(combinedText);
// Placeholder scalar score; a real reward head projects the final hidden state to a scalar
return Math.tanh(embedding[0]);
}
async getRewardScore(prompt: string, response: string): Promise<number> {
return this.scoreResponse(prompt, response);
}
}
async function trainRewardModel(
baseModel: LLMModel,
preferences: PreferencePair[]
): Promise<RewardModel> {
const rewardModel = new RewardModel();
rewardModel['model'] = baseModel.clone(); // Bracket access bypasses `private` for this sketch
await rewardModel.train(preferences);
return rewardModel;
}
Stage 3: Policy Optimization with PPO
PPO (Proximal Policy Optimization) fine-tunes the SFT model using reward signals from the reward model:
interface PPOConfig {
epochs: number;
batchSize: number;
learningRate: number;
clipRange: number; // e.g., 0.2
valueCoef: number; // weight of value loss
entropyCoef: number; // weight of entropy bonus
}
class PPOTrainer {
private policy: LLMModel;
private rewardModel: RewardModel;
private referenceModel: LLMModel;
private config: PPOConfig;
constructor(
policy: LLMModel,
rewardModel: RewardModel,
config: PPOConfig
) {
this.policy = policy;
this.rewardModel = rewardModel;
this.referenceModel = policy.clone(); // Frozen copy for KL divergence
this.config = config;
}
async train(prompts: string[]): Promise<void> {
for (let epoch = 0; epoch < this.config.epochs; epoch++) {
let totalReward = 0;
for (const prompt of prompts) {
// Generate response from policy
const response = await this.policy.generate(prompt);
// Get reward score
const reward = await this.rewardModel.getRewardScore(prompt, response);
// Compute policy loss with KL divergence penalty
const logprobPolicy = await this.policy.computeLogProb(prompt, response);
const logprobReference = await this.referenceModel.computeLogProb(prompt, response);
const klDivergence = logprobPolicy - logprobReference; // Per-sample estimate of KL(policy ‖ reference)
const adjustedReward = reward - 0.02 * klDivergence; // KL penalty coefficient
const policyLoss = -adjustedReward * logprobPolicy;
// Update policy
await this.policy.backward(policyLoss, this.config.learningRate);
totalReward += reward;
}
console.log(`Epoch ${epoch + 1} avg reward: ${totalReward / prompts.length}`);
}
}
}
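The KL-penalized reward can be sanity-checked with plain scalars. A minimal sketch (hypothetical helper; the 0.02 coefficient mirrors the trainer above):

```typescript
// Reward minus a KL penalty; the per-sample KL(policy ‖ reference) is
// estimated by the log-prob gap between the two models on the sampled response.
function klPenalizedReward(
  reward: number,
  logprobPolicy: number,
  logprobReference: number,
  klCoef: number = 0.02
): number {
  const klEstimate = logprobPolicy - logprobReference; // positive when the policy drifts
  return reward - klCoef * klEstimate;
}

// No drift, no penalty; drifting from the reference reduces the effective reward
console.log(klPenalizedReward(1.0, -5.0, -5.0)); // 1
console.log(klPenalizedReward(1.0, -2.0, -5.0) < 1.0); // true
```

The penalty keeps the policy from collapsing onto degenerate high-reward outputs the reward model was never trained to score.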
DPO: Direct Preference Optimization
DPO skips the separate reward model and PPO loop entirely: it optimizes the policy directly on preference pairs, using a frozen reference model to define an implicit reward (the policy-vs-reference log-prob ratio):
class DPOTrainer {
  private model: LLMModel;
  private referenceModel: LLMModel; // Frozen copy, typically the SFT model
  private betaCoef: number = 0.1; // Controls deviation from the reference; 0.1–0.5 is typical
  constructor(model: LLMModel, referenceModel: LLMModel) {
    this.model = model;
    this.referenceModel = referenceModel;
  }
  async train(preferences: PreferencePair[]): Promise<void> {
    for (const pair of preferences) {
      const logprobChosen = await this.model.computeLogProb(pair.prompt, pair.chosenResponse);
      const logprobRejected = await this.model.computeLogProb(pair.prompt, pair.rejectedResponse);
      const refLogprobChosen = await this.referenceModel.computeLogProb(pair.prompt, pair.chosenResponse);
      const refLogprobRejected = await this.referenceModel.computeLogProb(pair.prompt, pair.rejectedResponse);
      // DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
      const logRatio =
        (logprobChosen - refLogprobChosen) - (logprobRejected - refLogprobRejected);
      const loss = -Math.log(1.0 / (1.0 + Math.exp(-this.betaCoef * logRatio)));
      await this.model.backward(loss);
    }
  }
  getModel(): LLMModel {
    return this.model;
  }
}
async function trainWithDPO(
  baseModel: LLMModel,
  preferences: PreferencePair[]
): Promise<LLMModel> {
  const policy = baseModel.clone();
  const reference = baseModel.clone(); // Frozen reference for the implicit reward
  const trainer = new DPOTrainer(policy, reference);
  await trainer.train(preferences);
  return trainer.getModel();
}
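The β-scaled DPO loss can be checked numerically with scalar log-probs. A self-contained sketch; the implicit reward of each response is its policy-vs-reference log-prob ratio:

```typescript
const sigmoidOf = (x: number): number => 1 / (1 + Math.exp(-x));

// DPO loss on one preference pair from scalar sequence log-probs
function dpoPairLoss(
  policyChosen: number,
  policyRejected: number,
  refChosen: number,
  refRejected: number,
  beta: number = 0.1
): number {
  const chosenRatio = policyChosen - refChosen; // implicit reward of the chosen response
  const rejectedRatio = policyRejected - refRejected; // implicit reward of the rejected one
  return -Math.log(sigmoidOf(beta * (chosenRatio - rejectedRatio)));
}

// Lifting the chosen response above the reference lowers the loss
console.log(dpoPairLoss(-1.0, -3.0, -2.0, -2.0) < dpoPairLoss(-2.0, -2.0, -2.0, -2.0)); // true
```

Larger β values penalize deviation from the reference more sharply; β around 0.1 is a common starting point.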
Preference Data Collection
Quality preference data is critical. Design sampling carefully:
interface PreferenceCollectionConfig {
samplingStrategy: 'stratified' | 'random' | 'hard_negatives';
labelsPerExample: number;
disagreementThreshold: number; // Flag if raters disagree > this
}
class PreferenceDataCollector {
async collectPreferences(
prompts: string[],
candidateGenerators: LLMModel[],
config: PreferenceCollectionConfig
): Promise<PreferencePair[]> {
const preferences: PreferencePair[] = [];
for (const prompt of prompts) {
// Generate candidates
const responses = await Promise.all(
candidateGenerators.map((gen) => gen.generate(prompt))
);
// Sample pairs for labeling
const pairs = this.samplePairs(prompt, responses, config);
for (const pair of pairs) {
// Collect labels from multiple raters
const labels = await this.collectLabels(
prompt,
pair.response1,
pair.response2,
config.labelsPerExample
);
// Determine preference (majority vote)
const chosenIdx = this.computeMajorityPreference(labels);
const rejectedIdx = 1 - chosenIdx;
preferences.push({
prompt,
chosenResponse: [pair.response1, pair.response2][chosenIdx],
rejectedResponse: [pair.response1, pair.response2][rejectedIdx]
});
}
}
return preferences;
}
private samplePairs(
prompt: string,
responses: string[],
config: PreferenceCollectionConfig
): Array<{ response1: string; response2: string }> {
const pairs: Array<{ response1: string; response2: string }> = [];
if (config.samplingStrategy === 'random') {
  // Stand-in for random sampling: pair adjacent candidates
  for (let i = 0; i < responses.length - 1; i++) {
    pairs.push({
      response1: responses[i],
      response2: responses[i + 1]
    });
  }
} else if (config.samplingStrategy === 'hard_negatives') {
// Pair good responses with near-miss responses
const quality = responses.map((r) => this.scoreQuality(r));
const sorted = responses
.map((r, i) => ({ response: r, quality: quality[i] }))
.sort((a, b) => b.quality - a.quality);
// Pair top response with mid-range responses
for (let i = 1; i < sorted.length && i < 5; i++) {
pairs.push({
response1: sorted[0].response,
response2: sorted[i].response
});
}
}
// Note: the 'stratified' strategy is left unimplemented in this sketch
return pairs;
}
private async collectLabels(
prompt: string,
response1: string,
response2: string,
count: number
): Promise<number[]> {
// In production: send to labeling service
// Returns array of 0 or 1 indicating preference
return [1, 1, 1]; // Placeholder: all prefer response2
}
private computeMajorityPreference(labels: number[]): number {
const sum = labels.reduce((a, b) => a + b, 0);
return sum > labels.length / 2 ? 1 : 0;
}
private scoreQuality(response: string): number {
// Simple heuristic: prefer longer, more detailed responses
return response.split(' ').length;
}
}
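The config declares a `disagreementThreshold` that the collector never uses. One plausible use, sketched below as a hypothetical standalone helper, is to flag pairs whose minority-vote share is too high to trust:

```typescript
// Majority preference plus an ambiguity flag based on rater disagreement.
// labels: 0 = first response preferred, 1 = second response preferred.
function resolveLabels(
  labels: number[],
  disagreementThreshold: number
): { chosenIdx: number; ambiguous: boolean } {
  const ones = labels.reduce((a, b) => a + b, 0);
  const chosenIdx = ones > labels.length / 2 ? 1 : 0;
  const minorityShare = Math.min(ones, labels.length - ones) / labels.length;
  return { chosenIdx, ambiguous: minorityShare > disagreementThreshold };
}

console.log(resolveLabels([1, 1, 1, 0, 1], 0.3).ambiguous); // false — 4 of 5 raters agree
console.log(resolveLabels([1, 0, 1, 0, 1], 0.3).ambiguous); // true — a 3-to-2 split
```

Flagged pairs can be dropped, relabeled, or down-weighted; training on near-ties mostly adds noise to the preference signal.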
ORPO: Simpler Alternative
ORPO (Odds Ratio Preference Optimization) folds preference optimization into the SFT stage itself. In practice the SFT and odds-ratio losses are summed per batch; the sketch below loops over them separately for clarity:
class ORPOTrainer {
private model: LLMModel;
private lambda: number = 0.1; // Weight of the odds-ratio term relative to the SFT loss (the ORPO paper uses λ ≈ 0.1)
async train(
supervisedExamples: TrainingExample[],
preferenceExamples: PreferencePair[]
): Promise<void> {
// SFT loss
for (const example of supervisedExamples) {
const sftLoss = await this.model.computeLoss(
example.instruction,
example.response
);
await this.model.backward(sftLoss);
}
// Preference loss
for (const pref of preferenceExamples) {
const logprobChosen = await this.model.computeLogProb(
pref.prompt,
pref.chosenResponse
);
const logprobRejected = await this.model.computeLogProb(
pref.prompt,
pref.rejectedResponse
);
// Odds-ratio loss, weighted by lambda (the log-prob difference stands in for the full log-odds ratio)
const oddsRatioLoss =
  -this.lambda * Math.log(sigmoid(logprobChosen - logprobRejected));
await this.model.backward(oddsRatioLoss);
}
}
}
function sigmoid(x: number): number {
return 1 / (1 + Math.exp(-x));
}
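Strictly, ORPO's preference term is built from log-odds of sequence probabilities, log(p / (1 − p)), rather than raw log-probs; the class above uses the log-prob difference as a simplification. The exact quantity, assuming per-sequence probabilities are available:

```typescript
// Log-odds of a probability: log(p / (1 - p))
function logOdds(p: number): number {
  return Math.log(p / (1 - p));
}

// ORPO-style log-odds ratio between chosen and rejected sequence probabilities
function logOddsRatio(pChosen: number, pRejected: number): number {
  return logOdds(pChosen) - logOdds(pRejected);
}

// Positive when the model already favors the chosen response
console.log(logOddsRatio(0.6, 0.2) > 0); // true
console.log(logOddsRatio(0.5, 0.5)); // 0
```

The odds formulation penalizes the rejected response more aggressively as its probability approaches 1, which is the behavior the simplified log-prob difference loses.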
Evaluation: Win Rate and Alignment Metrics
interface AlignmentEvaluation {
winRate: number; // % of comparisons where aligned model wins
instructionFollowing: number; // Does model follow explicit constraints?
harmlessness: number; // Safety metrics
helpfulness: number; // Usefulness to users
}
async function evaluateAlignment(
alignedModel: LLMModel,
baselineModel: LLMModel,
testSet: { prompt: string; expectedBehavior: string }[]
): Promise<AlignmentEvaluation> {
let wins = 0;
let instructionFollowingScore = 0;
let harmlessScore = 0;
let helpfulnessScore = 0;
for (const test of testSet) {
const alignedResponse = await alignedModel.generate(test.prompt);
const baselineResponse = await baselineModel.generate(test.prompt);
// Pairwise comparison via an LLM judge (preferLLM is an assumed helper)
const alignedWins = await preferLLM(
  test.prompt,
  alignedResponse,
  baselineResponse
);
if (alignedWins) wins++;
// Instruction following (checkInstructionFollowing is an assumed helper)
const followsInstructions = checkInstructionFollowing(
  test.prompt,
  alignedResponse
);
instructionFollowingScore += followsInstructions ? 1 : 0;
// Harmlessness (isHarmful is an assumed safety classifier)
const harmless = !(await isHarmful(alignedResponse));
harmlessScore += harmless ? 1 : 0;
// Helpfulness (isHelpful is an assumed classifier)
const helpful = await isHelpful(alignedResponse);
helpfulnessScore += helpful ? 1 : 0;
}
const n = testSet.length;
return {
winRate: wins / n,
instructionFollowing: instructionFollowingScore / n,
harmlessness: harmlessScore / n,
helpfulness: helpfulnessScore / n
};
}
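Win rates measured on small test sets carry substantial sampling noise. A rough normal-approximation standard error, sketched here as an assumed addition to the evaluation above, shows how many comparisons a conclusion needs:

```typescript
// Standard error of a win rate under the normal approximation: sqrt(p(1-p)/n)
function winRateStandardError(winRate: number, n: number): number {
  return Math.sqrt((winRate * (1 - winRate)) / n);
}

// A 55% win rate over 100 comparisons has ~5% standard error —
// not clearly distinguishable from a 50/50 tie
console.log(winRateStandardError(0.55, 100).toFixed(3)); // prints 0.050
// The same rate over 2,000 comparisons is far more conclusive
console.log(winRateStandardError(0.55, 2000) < 0.012); // true
```

As a rule of thumb, halving the standard error requires quadrupling the number of pairwise comparisons.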
Preventing Catastrophic Forgetting
Alignment training can degrade the base model's capabilities (catastrophic forgetting). One mitigation is to mix SFT data back into preference training:
interface MixedTrainingConfig {
sftDataRatio: number; // 0.2 = 20% SFT, 80% preference
mixingStrategy: 'interleaved' | 'sequential';
evaluationFrequency: number; // epochs
}
async function trainWithForgettingPrevention(
baseModel: LLMModel,
sftExamples: TrainingExample[],
preferences: PreferencePair[],
config: MixedTrainingConfig
): Promise<LLMModel> {
const model = baseModel.clone();
if (config.mixingStrategy === 'interleaved') {
  // Interleave so that roughly sftDataRatio of steps use SFT examples
  let sftIdx = 0;
  let prefIdx = 0;
  const totalSteps = sftExamples.length + preferences.length;
  for (let step = 0; step < totalSteps; step++) {
    const takeSft =
      sftIdx < sftExamples.length &&
      (prefIdx >= preferences.length || Math.random() < config.sftDataRatio);
    if (takeSft) {
      await model.trainStep(sftExamples[sftIdx++]);
    } else if (prefIdx < preferences.length) {
      await model.trainStep(preferences[prefIdx++]);
    }
  }
}
return model;
}
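A deterministic variant of the interleaving is easier to reproduce than random mixing. A sketch of a schedule builder (hypothetical helper) that honors `sftDataRatio` exactly via an error accumulator:

```typescript
// Build a deterministic step schedule in which 'sft' steps appear
// with frequency sftDataRatio and 'pref' steps fill the remainder.
function buildMixedSchedule(
  totalSteps: number,
  sftDataRatio: number
): Array<'sft' | 'pref'> {
  const schedule: Array<'sft' | 'pref'> = [];
  let sftCredit = 0;
  for (let i = 0; i < totalSteps; i++) {
    sftCredit += sftDataRatio; // accumulate fractional SFT "credit" each step
    if (sftCredit >= 1) {
      schedule.push('sft');
      sftCredit -= 1;
    } else {
      schedule.push('pref');
    }
  }
  return schedule;
}

const schedule = buildMixedSchedule(10, 0.2);
console.log(schedule.filter((s) => s === 'sft').length); // 2
```

The same schedule can then drive `model.trainStep` calls, keeping the SFT fraction fixed regardless of dataset sizes.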
Checklist
- Collect high-quality supervised fine-tuning data
- Start with SFT baseline before preference tuning
- Design preference data collection with stratified sampling
- Collect multiple labels per example and resolve disagreement
- Choose between RLHF (complex but powerful) and DPO (simple, effective)
- Train reward model on preference pairs if using RLHF
- Monitor for catastrophic forgetting of base capabilities
- Evaluate alignment using win rate comparisons
- Test instruction following and harmlessness explicitly
- Consider ORPO as faster single-stage alternative
Conclusion
Alignment is the bridge between capable models and production-safe systems. RLHF provides fine-grained control through reward models but adds complexity. DPO skips the reward model, making it simpler and faster. Both require quality preference data and careful evaluation. Mixing SFT with preference training prevents capability degradation. Robust evaluation across alignment dimensions — win rate, instruction following, harmlessness, helpfulness — ensures safe deployment.