LoRA and QLoRA — Efficient LLM Fine-tuning
Introduction
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) enable fine-tuning of large models on modest hardware. This guide explains the techniques and shows practical implementations.
- Understanding LoRA
- Core Concept
- Memory Savings
- Installation and Setup
- Basic LoRA Setup
- Fine-tuning with LoRA
- QLoRA: Quantized LoRA
- How QLoRA Works
- QLoRA Implementation
- Loading and Using Fine-tuned Models
- Load LoRA Weights
- Merge LoRA Weights
- Hyperparameter Tuning
- Evaluation and Benchmarking
- Comparison: Full vs LoRA vs QLoRA
- Production Best Practices
- Conclusion
- FAQ
Understanding LoRA
LoRA reduces the number of trainable parameters by decomposing weight updates into two low-rank matrices. Instead of fine-tuning all weights (billions), you train only a small set (millions).
Core Concept
Full Fine-tuning:
W_new = W_original + ΔW (billions of parameters)
LoRA:
W_new = W_original + (alpha / r) × B × A (millions of trainable parameters)
Where B (d_out × r) and A (r × d_in) are low-rank matrices with a small rank r (typically 8-64), and alpha / r scales the learned update
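To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. This is illustrative only; the peft library's internals differ, but the math is the same:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W_original
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W_new·x = W_original·x + (alpha/r) · B·(A·x)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 vs 16.8M in the full layer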
Memory Savings
Full Fine-tuning (7B model, fp32 weights + Adam):
- Model weights: 7B * 4 bytes = 28 GB
- Gradients and Adam optimizer states: ~84 GB (3 × 28 GB)
- Total: ~112 GB before activations (more than a single 80 GB A100; needs multiple GPUs or sharding)
LoRA (7B model, r=8):
- Model weights: ~14 GB, frozen (fp16; ~7 GB if loaded in 8-bit)
- LoRA adapters: ~16 MB total (about 4M trainable parameters, see the output below)
- Activations, gradients, and optimizer states: ~8 GB
- Total: ~22 GB (fits on a 24 GB RTX 4090)
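The back-of-envelope arithmetic behind these figures (treating 1 GB as 10^9 bytes; activation memory varies with batch size and sequence length, so it is excluded):

params = 7e9
fp32, fp16 = 4, 2  # bytes per parameter

# Full fine-tuning: fp32 weights + fp32 gradients + 2 Adam states per parameter
full = params * fp32 * 4
print(f"Full fine-tuning states: ~{full / 1e9:.0f} GB")    # ~112 GB

# LoRA: frozen fp16 base + ~4.2M trainable adapter params in fp32
print(f"LoRA frozen base: ~{params * fp16 / 1e9:.0f} GB")  # ~14 GB
print(f"LoRA adapters:    ~{4.2e6 * fp32 / 1e6:.0f} MB")   # ~17 MB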
Installation and Setup
pip install peft torch transformers datasets bitsandbytes accelerate
Basic LoRA Setup
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model in 8-bit to keep the frozen base weights small
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Prepare the quantized model for training (casts layer norms, enables input grads)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank
    lora_alpha=32,                        # Scaling factor (effective scale = alpha / r)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which attention projections get adapters
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, peft_config)

# Check trainable parameters
model.print_trainable_parameters()
Output:
trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06
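As a sanity check, you can list exactly which parameters were left trainable; only the injected lora_A and lora_B matrices should appear:

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, tuple(param.shape))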
Fine-tuning with LoRA
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Tokenize, dropping the raw columns so the Trainer only sees model inputs
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )

tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)

# The causal-LM collator copies input_ids into labels so the Trainer can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    learning_rate=2e-4,
    save_strategy="epoch"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    data_collator=data_collator
)

# Train
trainer.train()

# Save only the LoRA adapter weights (not the full model)
model.save_pretrained("./lora-model")
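save_pretrained writes only the adapter: typically an adapter_config.json plus an adapter weights file totaling tens of MB rather than tens of GB. A quick check:

from pathlib import Path

for f in sorted(Path("./lora-model").iterdir()):
    print(f"{f.name}: {f.stat().st_size / 1e6:.1f} MB")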
QLoRA: Quantized LoRA
QLoRA combines quantization with LoRA for even better efficiency:
How QLoRA Works
QLoRA: 4-bit quantization + LoRA
- Store the frozen base weights in 4-bit NF4 (~0.5 bytes per parameter)
- Dequantize on the fly and run forward/backward passes in 16-bit (fp16/bf16)
- LoRA adapters, kept in 16-bit, learn the task-specific update
Benefits (the weight-footprint arithmetic behind these limits is sketched below):
- 65B model on a single 48 GB GPU (e.g., RTX A6000)
- 33B model on an RTX 3090/4090 (24 GB)
- 13B model on an RTX 2080 Ti (11 GB)
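A rough sketch of the 4-bit weight footprints behind these limits (0.5 bytes per parameter; adapters, activations, and quantization constants add overhead on top):

for name, params in [("7B", 7e9), ("13B", 13e9), ("33B", 33e9), ("65B", 65e9)]:
    print(f"{name}: ~{params * 0.5 / 1e9:.1f} GB of 4-bit weights")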
QLoRA Implementation
No extra package is required: QLoRA support comes from the bitsandbytes and peft libraries installed earlier.
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, Trainer

# Quantization config (4-bit NF4 with double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, peft_config)

# Train normally with the Trainer, reusing training_args, the tokenized
# splits, and the data collator from the LoRA section above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    data_collator=data_collator
)
trainer.train()
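To verify the savings on your own hardware, you can print the peak GPU memory after training (CUDA-only sketch):

import torch

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU memory during training: {peak_gb:.1f} GB")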
Loading and Using Fine-tuned Models
Load LoRA Weights
from peft import AutoPeftModelForCausalLM

# Load the base model and apply the saved LoRA weights in one call
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-model",
    device_map="auto"
)

# Generate text (move inputs to the model's device)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
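An equivalent two-step load, useful when the base model is already in memory (same "./lora-model" path as above):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-model")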
Merge LoRA Weights
# Merge LoRA weights back into model
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./merged-model")
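After merging, the result is a plain transformers checkpoint with no inference-time adapter overhead; it loads without peft:

from transformers import AutoModelForCausalLM

merged = AutoModelForCausalLM.from_pretrained(
    "./merged-model", device_map="auto"
)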
Hyperparameter Tuning
# LoRA hyperparameters to experiment with
configs = [
{"r": 8, "lora_alpha": 16, "lora_dropout": 0.05}, # Conservative
{"r": 16, "lora_alpha": 32, "lora_dropout": 0.1}, # Balanced
{"r": 32, "lora_alpha": 64, "lora_dropout": 0.15}, # Aggressive
]
# Typical recommendations:
# - r (rank): 8-64 (higher = more expressive, more trainable params)
# - lora_alpha: often set to 2 * r (scaling factor)
# - lora_dropout: 0.05-0.1 (regularization against overfitting)
# - target_modules: q_proj, v_proj is a common default; targeting more
#   linear layers (k_proj, o_proj, MLP projections) can improve quality
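A minimal sweep sketch over these configs, assuming the model_name, training_args, tokenized_datasets, and data_collator defined earlier; the small subset sizes here are arbitrary choices to keep each run short:

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, Trainer

sweep_results = {}
for cfg in configs:
    base = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    peft_model = get_peft_model(base, LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "v_proj"],
        **cfg,
    ))
    trainer = Trainer(
        model=peft_model,
        args=training_args,
        train_dataset=tokenized_datasets["train"].select(range(2000)),
        eval_dataset=tokenized_datasets["test"].select(range(500)),
        data_collator=data_collator,
    )
    trainer.train()
    sweep_results[f"r={cfg['r']}"] = trainer.evaluate()["eval_loss"]

best = min(sweep_results, key=sweep_results.get)
print(f"Best config: {best} (eval loss {sweep_results[best]:.4f})")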
Evaluation and Benchmarking
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import DataCollatorForLanguageModeling

# Batches need labels for the model to return a loss;
# the causal-LM collator provides them from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
eval_dataloader = DataLoader(
    tokenized_datasets["test"].select(range(1000)),
    batch_size=8,
    collate_fn=data_collator
)

def evaluate_model(model, dataloader):
    """Compute the mean loss of the fine-tuned model over a dataloader."""
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in tqdm(dataloader):
            inputs = {k: v.to(model.device) for k, v in batch.items()}
            outputs = model(**inputs)
            total_loss += outputs.loss.item()
    return total_loss / len(dataloader)

eval_loss = evaluate_model(model, eval_dataloader)
print(f"Evaluation loss: {eval_loss:.4f}")
Comparison: Full vs LoRA vs QLoRA
# Illustrative memory and time comparison (7B model, single GPU; numbers are rough)
results = {
"Full Fine-tuning": {
"memory_gb": 112,
"time_hours": 4,
"quality": "Excellent"
},
"LoRA": {
"memory_gb": 36,
"time_hours": 1.5,
"quality": "Very Good"
},
"QLoRA": {
"memory_gb": 12,
"time_hours": 3,
"quality": "Good"
}
}
# Trade-offs:
# - Full: highest quality ceiling, most memory
# - LoRA: good balance, ~70% less memory than full fine-tuning
# - QLoRA: fits consumer hardware, ~90% less memory, but slower per step
#   (dequantizing 4-bit weights on the fly adds compute overhead)
Production Best Practices
- Start with LoRA for most cases
- Use QLoRA when the model won't fit in 16-bit on your GPU (e.g., 33B+ on 24 GB)
- Validate on a held-out test set
- Save the adapter weights separately and a merged model for deployment
- Monitor training loss and validation metrics
# Logging and monitoring
import wandb

training_args = TrainingArguments(
    output_dir="./lora-results",
    report_to="wandb",
    logging_steps=10,
    # ... other args
)
# Monitor runs at https://wandb.ai
Conclusion
LoRA and QLoRA democratize LLM fine-tuning. With a single consumer or workstation GPU, you can now fine-tune models with tens of billions of parameters in hours to days.
FAQ
Q: Should I use LoRA or full fine-tuning? A: Use LoRA unless you need maximum accuracy. It trains substantially faster and requires far less memory while maintaining quality on most tasks.
Q: Can I combine multiple LoRA adapters? A: Yes, PEFT supports adapter stacking. Load multiple LoRA weights and route inputs through different adapters.
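A sketch of the adapter-stacking answer above, using hypothetical adapter paths "./lora-task-a" and "./lora-task-b":

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-task-a", adapter_name="task_a")
model.load_adapter("./lora-task-b", adapter_name="task_b")
model.set_adapter("task_a")  # subsequent generate() calls use task A's adapter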
Q: How do I choose LoRA rank? A: Start with r=8 or r=16. Higher rank increases expressiveness but also training time. Experiment based on your accuracy vs speed needs.