LoRA and QLoRA — Efficient LLM Fine-tuning
Introduction
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) enable fine-tuning of large models on modest hardware. This guide explains the techniques and shows practical implementations.
- Understanding LoRA
- Core Concept
- Memory Savings
- Installation and Setup
- Basic LoRA Setup
- Fine-tuning with LoRA
- QLoRA: Quantized LoRA
- How QLoRA Works
- QLoRA Implementation
- Loading and Using Fine-tuned Models
- Load LoRA Weights
- Merge LoRA Weights
- Hyperparameter Tuning
- Evaluation and Benchmarking
- Comparison: Full vs LoRA vs QLoRA
- Production Best Practices
- Conclusion
- FAQ
Understanding LoRA
LoRA reduces the number of trainable parameters by decomposing weight updates into two low-rank matrices. Instead of fine-tuning all weights (billions), you train only a small set (millions).
Core Concept
Full Fine-tuning:
W_new = W_original + ΔW (billions of parameters)
LoRA:
W_new = W_original + (alpha / r) × B × A (millions of trainable parameters)
Where B (d_out × r) and A (r × d_in) are low-rank matrices with a small rank r (typically 8-64), and alpha / r scales the learned update
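To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. This is illustrative only; the peft library's internals differ, but the math is the same:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W_original
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W_new·x = W_original·x + (alpha/r) · B·(A·x)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 vs 16.8M in the full layer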
Memory Savings
Full Fine-tuning (7B model, fp32 weights + Adam):
- Model weights: 7B * 4 bytes = 28 GB
- Gradients and Adam optimizer states: ~84 GB (3 × 28 GB)
- Total: ~112 GB before activations (more than a single 80 GB A100; needs multiple GPUs or sharding)
LoRA (7B model, r=8):
- Model weights: ~14 GB, frozen (fp16; ~7 GB if loaded in 8-bit)
- LoRA adapters: ~16 MB total (about 4M trainable parameters, see the output below)
- Activations, gradients, and optimizer states: ~8 GB
- Total: ~22 GB (fits on a 24 GB RTX 4090)
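The back-of-envelope arithmetic behind these figures (treating 1 GB as 10^9 bytes; activation memory varies with batch size and sequence length, so it is excluded):

params = 7e9
fp32, fp16 = 4, 2  # bytes per parameter

# Full fine-tuning: fp32 weights + fp32 gradients + 2 Adam states per parameter
full = params * fp32 * 4
print(f"Full fine-tuning states: ~{full / 1e9:.0f} GB")    # ~112 GB

# LoRA: frozen fp16 base + ~4.2M trainable adapter params in fp32
print(f"LoRA frozen base: ~{params * fp16 / 1e9:.0f} GB")  # ~14 GB
print(f"LoRA adapters:    ~{4.2e6 * fp32 / 1e6:.0f} MB")   # ~17 MB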
Installation and Setup
pip install peft torch transformers datasets bitsandbytes accelerate
Basic LoRA Setup
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model in 8-bit to keep the frozen base weights small
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Prepare the quantized model for training (casts layer norms, enables input grads)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank
    lora_alpha=32,                        # Scaling factor (effective scale = alpha / r)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which attention projections get adapters
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, peft_config)

# Check trainable parameters
model.print_trainable_parameters()
Output:
trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06
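As a sanity check, you can list exactly which parameters were left trainable; only the injected lora_A and lora_B matrices should appear:

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, tuple(param.shape))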
Fine-tuning with LoRA
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Tokenize, dropping the raw columns so the Trainer only sees model inputs
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )

tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)

# The causal-LM collator copies input_ids into labels so the Trainer can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    learning_rate=2e-4,
    save_strategy="epoch"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    data_collator=data_collator
)

# Train
trainer.train()

# Save only the LoRA adapter weights (not the full model)
model.save_pretrained("./lora-model")
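save_pretrained writes only the adapter: typically an adapter_config.json plus an adapter weights file totaling tens of MB rather than tens of GB. A quick check:

from pathlib import Path

for f in sorted(Path("./lora-model").iterdir()):
    print(f"{f.name}: {f.stat().st_size / 1e6:.1f} MB")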
QLoRA: Quantized LoRA
QLoRA combines quantization with LoRA for even better efficiency:
How QLoRA Works
QLoRA: 4-bit quantization + LoRA
- Store the frozen base weights in 4-bit NF4 (~0.5 bytes per parameter)
- Dequantize on the fly and run forward/backward passes in 16-bit (fp16/bf16)
- LoRA adapters, kept in 16-bit, learn the task-specific update
Benefits (the weight-footprint arithmetic behind these limits is sketched below):
- 65B model on a single 48 GB GPU (e.g., RTX A6000)
- 33B model on an RTX 3090/4090 (24 GB)
- 13B model on an RTX 2080 Ti (11 GB)
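A rough sketch of the 4-bit weight footprints behind these limits (0.5 bytes per parameter; adapters, activations, and quantization constants add overhead on top):

for name, params in [("7B", 7e9), ("13B", 13e9), ("33B", 33e9), ("65B", 65e9)]:
    print(f"{name}: ~{params * 0.5 / 1e9:.1f} GB of 4-bit weights")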
QLoRA Implementation
No extra package is required: QLoRA support comes from the bitsandbytes and peft libraries installed earlier.
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, Trainer

# Quantization config (4-bit NF4 with double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, peft_config)

# Train normally with the Trainer, reusing training_args, the tokenized
# splits, and the data collator from the LoRA section above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    data_collator=data_collator
)
trainer.train()
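To verify the savings on your own hardware, you can print the peak GPU memory after training (CUDA-only sketch):

import torch

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU memory during training: {peak_gb:.1f} GB")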
Loading and Using Fine-tuned Models
Load LoRA Weights
from peft import AutoPeftModelForCausalLM

# Load the base model and apply the saved LoRA weights in one call
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-model",
    device_map="auto"
)

# Generate text (move inputs to the model's device)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
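An equivalent two-step load, useful when the base model is already in memory (same "./lora-model" path as above):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-model")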
Merge LoRA Weights
# Merge LoRA weights back into model
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./merged-model")
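After merging, the result is a plain transformers checkpoint with no inference-time adapter overhead; it loads without peft:

from transformers import AutoModelForCausalLM

merged = AutoModelForCausalLM.from_pretrained(
    "./merged-model", device_map="auto"
)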
Hyperparameter Tuning
# LoRA hyperparameters to experiment with
configs = [
{"r": 8, "lora_alpha": 16, "lora_dropout": 0.05}, # Conservative
{"r": 16, "lora_alpha": 32, "lora_dropout": 0.1}, # Balanced
{"r": 32, "lora_alpha": 64, "lora_dropout": 0.15}, # Aggressive
]
# Typical recommendations:
# - r (rank): 8-64 (higher = more expressive, more trainable params)
# - lora_alpha: often set to 2 * r (scaling factor)
# - lora_dropout: 0.05-0.1 (regularization against overfitting)
# - target_modules: q_proj, v_proj is a common default; targeting more
#   linear layers (k_proj, o_proj, MLP projections) can improve quality
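A minimal sweep sketch over these configs, assuming the model_name, training_args, tokenized_datasets, and data_collator defined earlier; the small subset sizes here are arbitrary choices to keep each run short:

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, Trainer

sweep_results = {}
for cfg in configs:
    base = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    peft_model = get_peft_model(base, LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "v_proj"],
        **cfg,
    ))
    trainer = Trainer(
        model=peft_model,
        args=training_args,
        train_dataset=tokenized_datasets["train"].select(range(2000)),
        eval_dataset=tokenized_datasets["test"].select(range(500)),
        data_collator=data_collator,
    )
    trainer.train()
    sweep_results[f"r={cfg['r']}"] = trainer.evaluate()["eval_loss"]

best = min(sweep_results, key=sweep_results.get)
print(f"Best config: {best} (eval loss {sweep_results[best]:.4f})")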
Evaluation and Benchmarking
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import DataCollatorForLanguageModeling

# Batches need labels for the model to return a loss;
# the causal-LM collator provides them from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
eval_dataloader = DataLoader(
    tokenized_datasets["test"].select(range(1000)),
    batch_size=8,
    collate_fn=data_collator
)

def evaluate_model(model, dataloader):
    """Compute the mean loss of the fine-tuned model over a dataloader."""
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in tqdm(dataloader):
            inputs = {k: v.to(model.device) for k, v in batch.items()}
            outputs = model(**inputs)
            total_loss += outputs.loss.item()
    return total_loss / len(dataloader)

eval_loss = evaluate_model(model, eval_dataloader)
print(f"Evaluation loss: {eval_loss:.4f}")
Comparison: Full vs LoRA vs QLoRA
# Illustrative memory and time comparison (7B model, single GPU; numbers are rough)
results = {
"Full Fine-tuning": {
"memory_gb": 112,
"time_hours": 4,
"quality": "Excellent"
},
"LoRA": {
"memory_gb": 36,
"time_hours": 1.5,
"quality": "Very Good"
},
"QLoRA": {
"memory_gb": 12,
"time_hours": 3,
"quality": "Good"
}
}
# Trade-offs:
# - Full: highest quality ceiling, most memory
# - LoRA: good balance, ~70% less memory than full fine-tuning
# - QLoRA: fits consumer hardware, ~90% less memory, but slower per step
#   (dequantizing 4-bit weights on the fly adds compute overhead)
Production Best Practices
- Start with LoRA for most cases
- Use QLoRA when the model won't fit in 16-bit on your GPU (e.g., 33B+ on 24 GB)
- Validate on a held-out test set
- Save the adapter weights separately and a merged model for deployment
- Monitor training loss and validation metrics
# Logging and monitoring
import wandb

training_args = TrainingArguments(
    output_dir="./lora-results",
    report_to="wandb",
    logging_steps=10,
    # ... other args
)
# Monitor runs at https://wandb.ai
Conclusion
LoRA and QLoRA democratize LLM fine-tuning. With a single consumer or workstation GPU, you can now fine-tune models with tens of billions of parameters in hours to days.
FAQ
Q: Should I use LoRA or full fine-tuning? A: Use LoRA unless you need maximum accuracy. It trains substantially faster and requires far less memory while maintaining quality on most tasks.
Q: Can I combine multiple LoRA adapters? A: Yes, PEFT supports adapter stacking. Load multiple LoRA weights and route inputs through different adapters.
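A sketch of the adapter-stacking answer above, using hypothetical adapter paths "./lora-task-a" and "./lora-task-b":

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-task-a", adapter_name="task_a")
model.load_adapter("./lora-task-b", adapter_name="task_b")
model.set_adapter("task_a")  # subsequent generate() calls use task A's adapter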
Q: How do I choose LoRA rank? A: Start with r=8 or r=16. Higher rank increases expressiveness but also training time. Experiment based on your accuracy vs speed needs.