LoRA and QLoRA — Efficient LLM Fine-tuning

Introduction

LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) enable fine-tuning of large models on modest hardware. This guide explains the techniques and shows practical implementations.

Understanding LoRA

LoRA reduces the number of trainable parameters by decomposing weight updates into two low-rank matrices. Instead of fine-tuning all weights (billions), you train only a small set (millions).

Core Concept

Full Fine-tuning:
W_new = W_original + ΔW (ΔW has as many entries as W: billions of parameters)

LoRA:
W_new = W_original + A × B (only millions of trainable parameters)

Where A (d × r) and B (r × k) are low-rank matrices with a small rank r (typically 8 or 16), so A × B has the same shape as the full weight update but far fewer trainable entries.
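As a minimal sketch in plain PyTorch (not the PEFT internals), here is the shape arithmetic for a single adapted weight matrix; the dimensions and scaling mirror the configuration used later in this guide:

import torch

d, k, r = 4096, 4096, 8      # typical attention projection size, rank 8
alpha = 32                   # scaling factor (lora_alpha below)

W = torch.randn(d, k)                  # frozen pretrained weight
A = torch.randn(d, r) * 0.01           # trainable low-rank factor
B = torch.zeros(r, k)                  # zero-init, so training starts exactly at W

# Effective weight used in the forward pass
W_new = W + (alpha / r) * (A @ B)

print(d * k)          # 16,777,216 entries in the full update
print(d * r + r * k)  # 65,536 trainable parameters (~0.4% of the full update)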

Memory Savings

Full Fine-tuning (7B model, fp32):
- Model weights: 7B * 4 bytes = 28 GB
- Gradients, optimizer state, and activations: ~84 GB
- Total: ~112 GB (needs multiple large GPUs or heavy offloading)

LoRA (7B model, r=8):
- Model weights: 28 GB (frozen; no gradients or optimizer state)
- LoRA matrices: a few MB per adapted layer (negligible)
- Activations and adapter gradients: ~8 GB
- Total: ~36 GB, dropping into RTX 4090 (24 GB) territory once the base weights are loaded in 8-bit as in the code below
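The weight-storage part of those totals is just parameter count times bytes per parameter. A quick, illustrative calculator (activations and optimizer state vary with batch size and are not modeled here):

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, ignoring activations and optimizer state."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9, 4))    # fp32 base: 28.0 GB
print(weight_memory_gb(7e9, 1))    # 8-bit base: 7.0 GB
print(weight_memory_gb(7e9, 0.5))  # 4-bit (QLoRA) base: 3.5 GB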

Installation and Setup

pip install peft torch transformers datasets bitsandbytes accelerate

Basic LoRA Setup

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # LoRA rank
    lora_alpha=32,  # Scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which modules LoRA is applied to
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, peft_config)

# Check trainable parameters
model.print_trainable_parameters()

Output:

trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06

Fine-tuning with LoRA

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Tokenize (dropping the raw text/label columns so only model inputs remain)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# Collator that copies input_ids into labels, so the model returns a causal LM loss
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    learning_rate=2e-4,
    save_strategy="epoch"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    data_collator=data_collator
)

# Train
trainer.train()

# Save LoRA weights (not the full model)
model.save_pretrained("./lora-model")

QLoRA: Quantized LoRA

QLoRA combines quantization with LoRA for even better efficiency:

How QLoRA Works

QLoRA: 4-bit quantization + LoRA
- Store the frozen base weights in 4-bit NF4 (~0.5 bytes per parameter)
- Dequantize blocks to 16-bit on the fly for each forward and backward pass
- Train only the LoRA adapters, which stay in 16-bit precision

Roughly what fits on a single GPU:
- 65B model on a 48 GB GPU (the QLoRA paper's headline result)
- 33B model on an RTX 3090/4090 (24 GB)
- 13B model on an RTX 2080 Ti (11 GB)

QLoRA Implementation

# No extra package is needed for QLoRA; it reuses peft, transformers, and bitsandbytes
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantization config (4-bit)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, peft_config)

# Train normally with Trainer (reusing training_args, the tokenized
# datasets, and the data_collator from the LoRA example above)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    data_collator=data_collator
)

trainer.train()

Loading and Using Fine-tuned Models

Load LoRA Weights

from peft import AutoPeftModelForCausalLM

# Load model with LoRA weights
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-model",
    device_map="auto"
)

# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merge LoRA Weights

# Merge LoRA weights back into model
model = model.merge_and_unload()

# Save merged model
model.save_pretrained("./merged-model")

Hyperparameter Tuning

# LoRA hyperparameters to experiment with

configs = [
    {"r": 8, "lora_alpha": 16, "lora_dropout": 0.05},    # Conservative
    {"r": 16, "lora_alpha": 32, "lora_dropout": 0.1},   # Balanced
    {"r": 32, "lora_alpha": 64, "lora_dropout": 0.15},  # Aggressive
]

# Typical recommendations:
# - r (rank): 8-64 (higher = more expressive, but more parameters and slower)
# - lora_alpha: often set to 2 * r (the effective scale is lora_alpha / r)
# - lora_dropout: 0.05-0.1 (helps prevent overfitting)
# - target_modules: q_proj and v_proj work for most transformer models;
#   adding k_proj, o_proj, or the MLP projections can help at higher ranks
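A sketch of sweeping these configs, assuming base_model is a freshly loaded base model for each run (re-wrapping the same instance would stack adapters):

from peft import LoraConfig, TaskType, get_peft_model

for cfg in configs:
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "v_proj"],
        **cfg,  # r, lora_alpha, lora_dropout from the list above
    )
    # Wrap a fresh copy of the base model, train, and compare validation loss
    run_model = get_peft_model(base_model, peft_config)
    run_model.print_trainable_parameters()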

Evaluation and Benchmarking

from torch.utils.data import DataLoader
from tqdm import tqdm
import torch

def evaluate_model(model, eval_dataloader):
    """Compute the average loss of the fine-tuned model over an eval DataLoader."""
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for batch in tqdm(eval_dataloader):
            inputs = {k: v.to(model.device) for k, v in batch.items()}
            outputs = model(**inputs)
            total_loss += outputs.loss.item()

    return total_loss / len(eval_dataloader)

# Wrap the tokenized dataset in a DataLoader so batches arrive as tensors
eval_dataloader = DataLoader(
    tokenized_datasets["test"].select(range(1000)),
    batch_size=8,
    collate_fn=data_collator
)

eval_loss = evaluate_model(model, eval_dataloader)
print(f"Evaluation loss: {eval_loss:.4f}")

Comparison: Full vs LoRA vs QLoRA

# Illustrative memory and time comparison (7B model, single GPU; actual numbers vary)

results = {
    "Full Fine-tuning": {
        "memory_gb": 112,
        "time_hours": 4,
        "quality": "Excellent"
    },
    "LoRA": {
        "memory_gb": 36,
        "time_hours": 1.5,
        "quality": "Very Good"
    },
    "QLoRA": {
        "memory_gb": 12,
        "time_hours": 3,
        "quality": "Good"
    }
}

# Trade-offs:
# - Full: Highest quality, most memory
# - LoRA: Good balance, 70% less memory
# - QLoRA: Consumer hardware, 90% less memory, slower

Production Best Practices

  1. Start with LoRA for most cases
  2. Use QLoRA for 70B+ models
  3. Validate on held-out test set
  4. Save both base and merged models
  5. Monitor training loss and validation metrics (for example with Weights & Biases, as below)

# Logging and monitoring (requires: pip install wandb, then wandb login)
import wandb

training_args = TrainingArguments(
    output_dir="./lora-results",
    report_to="wandb",
    logging_steps=10,
    # ... other args
)

# Monitor on https://wandb.ai

Conclusion

LoRA and QLoRA democratize LLM fine-tuning. On a single workstation GPU, you can now fine-tune models in the tens of billions of parameters in hours to days.

FAQ

Q: Should I use LoRA or full fine-tuning? A: Use LoRA unless you need maximum accuracy. It trains substantially faster and requires far less memory while retaining most of the quality.

Q: Can I combine multiple LoRA adapters? A: Yes, PEFT supports adapter stacking. Load multiple LoRA weights and route inputs through different adapters.
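As a sketch of that with the PEFT adapter API (the adapter paths ./adapter-a and ./adapter-b are hypothetical):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach one adapter, then load a second alongside it
model = PeftModel.from_pretrained(base, "./adapter-a", adapter_name="adapter_a")
model.load_adapter("./adapter-b", adapter_name="adapter_b")

# Route subsequent forward passes through the chosen adapter
model.set_adapter("adapter_b")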

Q: How do I choose LoRA rank? A: Start with r=8 or r=16. Higher rank increases expressiveness but also training time. Experiment based on your accuracy vs speed needs.

Written by Sanjeev Sharma
Full Stack Engineer · E-mopro