LoRA Fine-Tuning — Adapting Open-Source LLMs Without Full GPU Clusters

Introduction

Low-Rank Adaptation (LoRA) enables fine-tuning large language models with a fraction of the hardware and memory of full fine-tuning. This guide covers the LoRA architecture, QLoRA for 4-bit quantization, and production training and evaluation with Hugging Face PEFT.

LoRA Fundamentals

LoRA represents weight updates as products of trainable low-rank matrices, leaving the base model parameters frozen:

interface LoRAConfig {
  rank: number;
  alpha: number;
  targetModules: string[];
  modules: {
    [key: string]: {
      inFeatures: number;
      outFeatures: number;
      A: number[][]; // Rank x inFeatures
      B: number[][]; // outFeatures x Rank
    };
  };
}

function initializeLoRA(config: LoRAConfig): LoRAConfig {
  // For attention layers in transformer
  const modules: { [key: string]: any } = {};

  config.targetModules.forEach(moduleName => {
    // Example: q_proj has 4096 input features, 4096 output features
    const inFeatures = 4096;
    const outFeatures = 4096;

    // Matrix A: small random init (Gaussian in the original paper), rank x input
    const A = Array(config.rank)
      .fill(0)
      .map(() =>
        Array(inFeatures)
          .fill(0)
          .map(() => (Math.random() - 0.5) * 0.02)
      );

    // Matrix B: Zero initialization, output x rank
    const B = Array(outFeatures)
      .fill(0)
      .map(() => Array(config.rank).fill(0));

    modules[moduleName] = {
      inFeatures,
      outFeatures,
      A,
      B
    };
  });

  return { ...config, modules };
}

// Forward pass: output = weight @ input + scale * (B @ A) @ input,
//   where scale = alpha / rank
// Training only updates A and B (2 * rank * dim parameters per module)
// vs full fine-tune: dim * dim parameters

Key insight: LoRA cuts trainable parameters from billions to millions. A 7B model with rank=8 on all attention projections needs only ~8.4M trainable parameters instead of 7B.
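The arithmetic behind that claim can be checked with a small helper (a sketch; the module count and hidden size are illustrative values for Llama 2 7B, which has 32 layers and hidden dimension 4096):

```typescript
// Count trainable LoRA parameters, assuming square projections of
// dimension `dim`. Each adapted module adds A (rank x dim) and
// B (dim x rank), i.e. 2 * rank * dim parameters.
function countLoRAParams(
  rank: number,
  dim: number,
  numModules: number
): number {
  return numModules * 2 * rank * dim;
}

// Llama 2 7B: 32 layers x 4 attention projections (q, k, v, o),
// hidden dim 4096, rank 8
const trainable = countLoRAParams(8, 4096, 32 * 4);
console.log(trainable); // 8388608, i.e. ~8.4M
```

Adapting only q_proj and v_proj halves this again, to roughly 4.2M parameters.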

QLoRA for 4-Bit Quantization

QLoRA quantizes the base model to 4-bit while keeping LoRA in full precision:

interface QLoRAConfig {
  baseModelBits: 4;
  loraRank: number;
  loraAlpha: number;
  doubleQuantization: boolean; // Quantize quantization constants
  nfBits: 4; // NormalFloat4 for better precision
  computeDtype: 'float16' | 'float32';
}

async function quantizeAndLoadModel(
  modelName: string,
  config: QLoRAConfig
): Promise<any> {
  // Pseudocode for Hugging Face integration
  const bnb_config = {
    load_in_4bit: true,
    bnb_4bit_quant_type: 'nf4',
    bnb_4bit_use_double_quant: config.doubleQuantization,
    bnb_4bit_compute_dtype: config.computeDtype
  };

  // Load model with quantization
  const model = await loadPretrainedModel(modelName, {
    device_map: 'auto',
    quantization_config: bnb_config
  });

  return model;
}

// Memory savings:
// Llama 2 7B full precision: 28 GB
// Llama 2 7B 4-bit + LoRA: 6 GB total
// Fits on single consumer GPU!
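The memory figures above follow from simple arithmetic on bits per parameter (weights only; activations, gradients, and optimizer state add more on top):

```typescript
// Rough weight memory for a model of `params` parameters stored at
// `bitsPerParam` bits each, in GB (decimal, 1e9 bytes).
function weightMemoryGB(params: number, bitsPerParam: number): number {
  return (params * bitsPerParam) / 8 / 1e9;
}

console.log(weightMemoryGB(7e9, 32)); // 28 GB at full (fp32) precision
console.log(weightMemoryGB(7e9, 4)); // 3.5 GB at 4-bit
```

The remaining gap to the ~6 GB figure above comes from the LoRA weights, quantization constants, and runtime overhead.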

Rank and Alpha Hyperparameters

The scaling factor alpha / rank controls how strongly the LoRA update contributes to the output:

interface LoRAHyperparameters {
  rank: number;
  alpha: number;
  scalingFactor: number;
}

function selectLoRAHyperparameters(
  modelSize: 'small' | 'medium' | 'large'
): LoRAHyperparameters {
  const configs: { [key: string]: LoRAHyperparameters } = {
    small: {
      rank: 4,
      alpha: 8,
      scalingFactor: 0 // Will compute
    },
    medium: {
      rank: 8,
      alpha: 16,
      scalingFactor: 0
    },
    large: {
      rank: 16,
      alpha: 32,
      scalingFactor: 0
    }
  };

  const config = configs[modelSize];
  config.scalingFactor = config.alpha / config.rank;

  return config;
}

// Rule of thumb:
// - rank=4-8 for most tasks (minimal impact on latency)
// - rank=16 for complex domain adaptation
// - alpha = 2 * rank (scaling factor ≈ 2)
// - Too high alpha: unstable training
// - Too low alpha: minimal LoRA effect
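To make the scaling concrete, here is a minimal sketch of how the LoRA delta is applied in a forward pass (plain arrays for illustration; real implementations use GPU tensor ops):

```typescript
// delta = (alpha / rank) * B @ (A @ x)
function loraDelta(
  A: number[][], // rank x inFeatures
  B: number[][], // outFeatures x rank
  x: number[],   // inFeatures
  alpha: number,
  rank: number
): number[] {
  const scale = alpha / rank;
  // Rank-dimensional intermediate: h = A @ x
  const h = A.map(row => row.reduce((s, w, i) => s + w * x[i], 0));
  // Scaled low-rank update: scale * (B @ h)
  return B.map(row => scale * row.reduce((s, w, i) => s + w * h[i], 0));
}

// With B zero-initialized (as in initializeLoRA above), the delta
// starts at zero, so training begins from the unmodified base output.
```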

Target Modules Selection

Which modules to adapt affects convergence and final quality:

interface AdapterTargeting {
  moduleName: string;
  updateAttention: boolean;
  updateMLP: boolean;
  updateEmbedding: boolean;
}

function selectTargetModules(
  modelArchitecture: 'llama' | 'mistral' | 'phi',
  budget: 'low' | 'medium' | 'high'
): string[] {
  const targetModules: { [key: string]: { [key: string]: string[] } } = {
    llama: {
      low: ['q_proj', 'v_proj'], // Only query and value
      medium: ['q_proj', 'v_proj', 'down_proj'], // Add MLP down projection
      high: ['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj'] // All
    },
    mistral: {
      low: ['q_proj', 'v_proj'],
      medium: ['q_proj', 'v_proj', 'down_proj'],
      high: ['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj']
    },
    phi: {
      low: ['q_proj', 'v_proj'],
      medium: ['q_proj', 'v_proj', 'fc2'],
      high: ['q_proj', 'k_proj', 'v_proj', 'fc1', 'fc2']
    }
  };

  return targetModules[modelArchitecture][budget];
}

// Empirical findings:
// - q_proj (query): captures task-specific patterns
// - v_proj (value): captures factual knowledge
// - Adapting both recovers roughly 80-95% of full fine-tuning quality with ~1% of the parameters

Dataset Preparation with Alpaca Format

interface AlpacaExample {
  instruction: string;
  input: string;
  output: string;
}

function convertToAlpacaFormat(examples: AlpacaExample[]): string[] {
  return examples.map(example => {
    let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n`;
    prompt += `### Instruction:\n${example.instruction}\n\n`;

    if (example.input) {
      prompt += `### Input:\n${example.input}\n\n`;
    }

    prompt += `### Response:\n${example.output}`;

    return prompt;
  });
}

function formatForTraining(
  examples: AlpacaExample[]
): Array<{ text: string }> {
  const formatted = convertToAlpacaFormat(examples);
  return formatted.map(text => ({ text }));
}

// Example dataset preparation
const trainingExamples: AlpacaExample[] = [
  {
    instruction: 'Classify the sentiment of the following review',
    input: 'This product exceeded my expectations. Highly recommend!',
    output: 'positive'
  },
  {
    instruction: 'Classify the sentiment of the following review',
    input: 'Arrived broken and customer service was unhelpful.',
    output: 'negative'
  }
];

const formattedData = formatForTraining(trainingExamples);
console.log(formattedData[0].text);

Training with Hugging Face PEFT

interface TrainingConfig {
  outputDir: string;
  numEpochs: number;
  batchSize: number;
  learningRate: number;
  warmupSteps: number;
  loggingSteps: number;
}

async function trainLoRAModel(
  modelName: string,
  datasetPath: string,
  config: TrainingConfig
): Promise<void> {
  // Pseudocode for actual Hugging Face Trainer setup
  const trainingArguments = {
    output_dir: config.outputDir,
    num_train_epochs: config.numEpochs,
    per_device_train_batch_size: config.batchSize,
    learning_rate: config.learningRate,
    warmup_steps: config.warmupSteps,
    logging_steps: config.loggingSteps,
    save_strategy: 'epoch',
    remove_unused_columns: false,
    report_to: ['wandb'], // For monitoring
    gradient_accumulation_steps: 4,
    optim: 'paged_adamw_32bit' // Memory efficient
  };

  const peftConfig = {
    r: 8,
    lora_alpha: 16,
    target_modules: ['q_proj', 'v_proj'],
    lora_dropout: 0.05,
    bias: 'none',
    task_type: 'CAUSAL_LM'
  };

  // In production:
  // from peft import get_peft_model, LoraConfig
  // from transformers import Trainer, TrainingArguments
  // peft_config = LoraConfig(**peft_config)
  // model = get_peft_model(base_model, peft_config)
  // trainer = Trainer(
  //   model=model,
  //   args=TrainingArguments(**training_arguments),
  //   train_dataset=train_dataset,
  //   callbacks=[EarlyStoppingCallback()]
  // )
  // trainer.train()

  console.log(
    `Training ${modelName} with config:`,
    trainingArguments,
    peftConfig
  );
}

const config: TrainingConfig = {
  outputDir: './lora_checkpoints',
  numEpochs: 3,
  batchSize: 4,
  learningRate: 2e-4,
  warmupSteps: 100,
  loggingSteps: 50
};

trainLoRAModel('meta-llama/Llama-2-7b', './data', config);

Memory Requirements Per Model Size

interface MemoryEstimate {
  modelSize: string;
  baseMemory: number; // GB
  with4BitQuantization: number;
  withLoRA: number;
  gpuRequired: 'consumer' | 'pro' | 'server';
}

function estimateMemoryRequirements(): MemoryEstimate[] {
  return [
    {
      modelSize: '3B',
      baseMemory: 12,
      with4BitQuantization: 3,
      withLoRA: 4.5,
      gpuRequired: 'consumer' as const
    },
    {
      modelSize: '7B',
      baseMemory: 28,
      with4BitQuantization: 6,
      withLoRA: 8,
      gpuRequired: 'consumer' as const
    },
    {
      modelSize: '13B',
      baseMemory: 52,
      with4BitQuantization: 10,
      withLoRA: 14,
      gpuRequired: 'pro' as const
    },
    {
      modelSize: '70B',
      baseMemory: 280,
      with4BitQuantization: 48,
      withLoRA: 64,
      gpuRequired: 'server' as const
    }
  ];
}

// Makes training accessible!
// Llama 2 7B QLoRA fits on RTX 4090 (24GB) with batch_size=1
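A quick helper can turn the table above into a VRAM check (footprints duplicated inline so the sketch stands alone; numbers are the estimates from this section, not measurements):

```typescript
// 4-bit base + LoRA training footprint per model size, in GB
interface FitEntry {
  modelSize: string;
  withLoRA: number;
}

const qloraFootprints: FitEntry[] = [
  { modelSize: '3B', withLoRA: 4.5 },
  { modelSize: '7B', withLoRA: 8 },
  { modelSize: '13B', withLoRA: 14 },
  { modelSize: '70B', withLoRA: 64 }
];

// Largest model whose QLoRA footprint fits the available VRAM,
// or null if even the smallest does not fit
function largestModelForVRAM(vramGB: number): string | null {
  const fits = qloraFootprints.filter(e => e.withLoRA <= vramGB);
  return fits.length > 0 ? fits[fits.length - 1].modelSize : null;
}

console.log(largestModelForVRAM(24)); // RTX 4090 → '13B'
```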

Merging LoRA Weights

Convert the adapter into a standalone model for deployment:

async function mergeLoRAWeights(
  basePath: string,
  adapterPath: string
): Promise<void> {
  // In production Python:
  // from peft import AutoPeftModelForCausalLM
  // model = AutoPeftModelForCausalLM.from_pretrained(
  //   adapter_path, device_map='auto'
  // )
  // merged = model.merge_and_unload()
  // merged.save_pretrained('merged_model')

  // TypeScript wrapper
  const mergedModelPath = `${adapterPath}_merged`;

  console.log(`Merged LoRA adapter into base model`);
  console.log(`Output: ${mergedModelPath}`);

  // Post-merge: single model file, no adapter overhead
  // File size: original + small LoRA parameters
  // Inference latency: no change (merged into weights)
}

Serving: Merged vs Adapter

interface ServingStrategy {
  approach: 'merged' | 'adapter';
  memoryOverhead: number; // MB
  inferenceLatency: number; // ms
  deploymentComplexity: 'simple' | 'complex';
  supportMultipleAdapters: boolean;
}

function compareServingStrategies(): ServingStrategy[] {
  return [
    {
      approach: 'merged',
      memoryOverhead: 0,
      inferenceLatency: 100,
      deploymentComplexity: 'simple',
      supportMultipleAdapters: false
    },
    {
      approach: 'adapter',
      memoryOverhead: 50, // LoRA weights only
      inferenceLatency: 101, // Minimal overhead
      deploymentComplexity: 'complex',
      supportMultipleAdapters: true
    }
  ];
}

// Merged: Single-task APIs, simplest deployment
// Adapter: Multi-tenant systems where one base model + many specialized adapters
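In a multi-tenant setup, adapters are typically keyed by tenant and resolved per request. A minimal routing sketch (tenant names and adapter paths are hypothetical):

```typescript
// Map each tenant to its adapter path; fall back to the base model
// when a tenant has no specialized adapter.
interface AdapterRegistry {
  [tenant: string]: string;
}

function resolveAdapter(
  registry: AdapterRegistry,
  tenant: string,
  fallback: string
): string {
  return registry[tenant] ?? fallback;
}

const registry: AdapterRegistry = {
  acme: './adapters/acme-support',
  globex: './adapters/globex-legal'
};

console.log(resolveAdapter(registry, 'acme', './base')); // ./adapters/acme-support
console.log(resolveAdapter(registry, 'other', './base')); // ./base
```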

Evaluation with Eleuther LM Eval

interface EvaluationTask {
  name: string;
  metric: string;
  shots: number;
}

interface EvalResults {
  task: string;
  score: number;
  stderr: number;
}

async function evaluateWithEleuther(
  modelPath: string,
  tasks: string[]
): Promise<EvalResults[]> {
  // Pseudocode for lm-evaluation-harness integration;
  // scores below are illustrative placeholders, not real results
  const results: EvalResults[] = [
    {
      task: 'arc_challenge',
      score: 0.72,
      stderr: 0.02
    },
    {
      task: 'hellaswag',
      score: 0.68,
      stderr: 0.01
    },
    {
      task: 'mmlu',
      score: 0.65,
      stderr: 0.015
    },
    {
      task: 'truthfulqa',
      score: 0.58,
      stderr: 0.025
    }
  ];

  return results;
}

// Standard benchmarks for comparing LoRA fine-tuned models
// Track improvements vs base model
const baseline = {
  arc_challenge: 0.60,
  hellaswag: 0.64,
  mmlu: 0.55,
  truthfulqa: 0.42
};

async function reportImprovement(
  modelPath: string
): Promise<{ [key: string]: number }> {
  const results = await evaluateWithEleuther(modelPath, [
    'arc_challenge',
    'hellaswag',
    'mmlu',
    'truthfulqa'
  ]);

  const improvements: { [key: string]: number } = {};
  results.forEach(result => {
    const baseScore = baseline[result.task as keyof typeof baseline];
    const pct = ((result.score - baseScore) / baseScore) * 100;
    improvements[result.task] = Number(pct.toFixed(1));
  });

  return improvements;
}

Checklist

  • Selected LoRA rank (4-16) and alpha based on model size
  • Identified target modules (q_proj, v_proj minimum)
  • Prepared dataset in Alpaca format with clear instructions
  • Configured QLoRA for 4-bit quantization
  • Verified model fits in available GPU memory
  • Set learning rate to 2e-4 or lower
  • Trained for 2-3 epochs with early stopping
  • Evaluated on benchmark tasks (MMLU, ARC, HellaSwag)
  • Merged weights for production deployment
  • Tested inference latency (should be unchanged)
  • Created model card documenting LoRA configuration

Conclusion

LoRA enables efficient fine-tuning on consumer hardware. QLoRA + Hugging Face PEFT makes adapting 7B-13B models accessible without enterprise infrastructure. Standardize on rank=8, alpha=16, and merge adapters for production to keep deployment simple.