LoRA Fine-Tuning — Adapting Open-Source LLMs Without Full GPU Clusters
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Low-Rank Adaptation (LoRA) enables fine-tuning large language models with a fraction of the hardware and memory of full fine-tuning. This guide covers the LoRA architecture, QLoRA for 4-bit quantization, and production evaluation with Hugging Face PEFT.
- LoRA Fundamentals
- QLoRA for 4-Bit Quantization
- Rank and Alpha Hyperparameters
- Target Modules Selection
- Dataset Preparation with Alpaca Format
- Training with Hugging Face PEFT
- Memory Requirements Per Model Size
- Merging LoRA Weights
- Serving: Merged vs Adapter
- Evaluation with Eleuther LM Eval
- Checklist
- Conclusion
LoRA Fundamentals
LoRA adds trainable low-rank matrices to weight updates without modifying base model parameters:
interface LoRAConfig {
  rank: number;
  alpha: number;
  targetModules: string[];
  modules: {
    [key: string]: {
      inFeatures: number;
      outFeatures: number;
      A: number[][]; // rank x inFeatures
      B: number[][]; // outFeatures x rank
    };
  };
}

function initializeLoRA(config: LoRAConfig): LoRAConfig {
  // For attention layers in a transformer
  const modules: LoRAConfig['modules'] = {};
  config.targetModules.forEach(moduleName => {
    // Example: q_proj has 4096 input features, 4096 output features
    const inFeatures = 4096;
    const outFeatures = 4096;
    // Matrix A: small random initialization (the LoRA paper uses Gaussian), rank x input
    const A = Array(config.rank)
      .fill(0)
      .map(() =>
        Array(inFeatures)
          .fill(0)
          .map(() => (Math.random() - 0.5) * 0.02)
      );
    // Matrix B: zero initialization, output x rank — so B @ A starts at zero
    const B = Array(outFeatures)
      .fill(0)
      .map(() => Array(config.rank).fill(0));
    modules[moduleName] = { inFeatures, outFeatures, A, B };
  });
  return { ...config, modules };
}

// Forward pass: output = weight @ input + (B @ A) @ input
// Training updates only A and B (2 * rank * dim parameters per module)
// vs. full fine-tuning: dim * dim parameters per module
Key insight: LoRA shrinks the trainable parameter count from billions to millions. A 7B model with rank=8 trains only ~8.4M parameters instead of all 7B.
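The arithmetic behind that figure can be sketched with a small helper (hypothetical, not library code; the exact count depends on which modules you target and their shapes):

```typescript
// Sketch: trainable LoRA parameter count, assuming every targeted
// module is a square dim x dim projection (real models mix shapes).
function countLoRAParams(
  rank: number,
  dim: number,
  modulesPerLayer: number,
  numLayers: number
): number {
  // Each adapted module adds A (rank x dim) and B (dim x rank)
  return 2 * rank * dim * modulesPerLayer * numLayers;
}

// Llama-2-7B-style shape: hidden size 4096, 32 layers, 4 target modules
console.log(countLoRAParams(8, 4096, 4, 32)); // 8388608 ≈ 8.4M
```

Note the count grows linearly with rank, so doubling rank only doubles the adapter size — still a rounding error next to the frozen 7B base.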
QLoRA for 4-Bit Quantization
QLoRA quantizes the base model to 4-bit while keeping LoRA in full precision:
interface QLoRAConfig {
  baseModelBits: 4;
  loraRank: number;
  loraAlpha: number;
  doubleQuantization: boolean; // Quantize the quantization constants
  nfBits: 4; // NormalFloat4 for better precision
  computeDtype: 'float16' | 'float32';
}

async function quantizeAndLoadModel(
  modelName: string,
  config: QLoRAConfig
): Promise<unknown> {
  // Pseudocode for Hugging Face integration
  const bnbConfig = {
    load_in_4bit: true,
    bnb_4bit_quant_type: 'nf4',
    bnb_4bit_use_double_quant: config.doubleQuantization,
    bnb_4bit_compute_dtype: config.computeDtype
  };
  // Load the model with quantization applied
  const model = await loadPretrainedModel(modelName, {
    device_map: 'auto',
    quantization_config: bnbConfig
  });
  return model;
}

// Memory savings:
// Llama 2 7B, full precision: ~28 GB
// Llama 2 7B, 4-bit + LoRA: ~6 GB total
// Fits on a single consumer GPU!
Rank and Alpha Hyperparameters
The scaling factor alpha / rank controls how strongly the LoRA update contributes to the output:
interface LoRAHyperparameters {
  rank: number;
  alpha: number;
  scalingFactor: number;
}

function selectLoRAHyperparameters(
  modelSize: 'small' | 'medium' | 'large'
): LoRAHyperparameters {
  const configs: { [key: string]: LoRAHyperparameters } = {
    small: { rank: 4, alpha: 8, scalingFactor: 0 }, // scalingFactor computed below
    medium: { rank: 8, alpha: 16, scalingFactor: 0 },
    large: { rank: 16, alpha: 32, scalingFactor: 0 }
  };
  const config = configs[modelSize];
  config.scalingFactor = config.alpha / config.rank;
  return config;
}

// Rule of thumb:
// - rank=4-8 for most tasks (minimal impact on latency)
// - rank=16 for complex domain adaptation
// - alpha = 2 * rank (scaling factor ≈ 2)
// - Too high an alpha: unstable training
// - Too low an alpha: minimal LoRA effect
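To make the scaling concrete, here is a toy forward pass applying output = W·x + (alpha/rank)·B·A·x; matVec is a hypothetical dense matrix-vector helper, not library code:

```typescript
// Toy LoRA forward pass showing where alpha / rank enters.
// matVec is a hypothetical helper for M @ x.
function matVec(M: number[][], x: number[]): number[] {
  return M.map(row => row.reduce((sum, w, i) => sum + w * x[i], 0));
}

function loraForward(
  W: number[][], // outFeatures x inFeatures (frozen base weight)
  A: number[][], // rank x inFeatures
  B: number[][], // outFeatures x rank
  x: number[],
  alpha: number,
  rank: number
): number[] {
  const base = matVec(W, x);
  const delta = matVec(B, matVec(A, x));
  const scale = alpha / rank; // the scaling factor discussed above
  return base.map((v, i) => v + scale * delta[i]);
}
```

With B initialized to zero, the delta term vanishes at the start of training, so the adapted model initially behaves exactly like the base model.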
Target Modules Selection
Which modules you adapt affects both convergence speed and final quality:
interface AdapterTargeting {
  moduleName: string;
  updateAttention: boolean;
  updateMLP: boolean;
  updateEmbedding: boolean;
}

function selectTargetModules(
  modelArchitecture: 'llama' | 'mistral' | 'phi',
  budget: 'low' | 'medium' | 'high'
): string[] {
  const targetModules: { [key: string]: { [key: string]: string[] } } = {
    llama: {
      low: ['q_proj', 'v_proj'], // Only query and value
      medium: ['q_proj', 'v_proj', 'down_proj'], // Add the MLP output projection
      high: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] // All attention + MLP projections
    },
    mistral: {
      low: ['q_proj', 'v_proj'],
      medium: ['q_proj', 'v_proj', 'down_proj'],
      high: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
    },
    phi: {
      low: ['q_proj', 'v_proj'],
      medium: ['q_proj', 'v_proj', 'fc2'],
      high: ['q_proj', 'k_proj', 'v_proj', 'fc1', 'fc2']
    }
  };
  return targetModules[modelArchitecture][budget];
}

// Empirical findings:
// - q_proj (query): captures task-specific patterns
// - v_proj (value): captures factual knowledge
// - Adapting both vs. full fine-tuning: 80-95% of the quality at ~1% of the parameters
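For illustration, PEFT-style libraries resolve target_modules by matching the given names against each module's full path; a minimal sketch of that idea, with hypothetical paths:

```typescript
// Sketch: PEFT-style target-module resolution by suffix matching.
// The module paths below are hypothetical examples.
function matchesTarget(modulePath: string, targets: string[]): boolean {
  return targets.some(t => modulePath === t || modulePath.endsWith(`.${t}`));
}

const paths = [
  'model.layers.0.self_attn.q_proj',
  'model.layers.0.self_attn.o_proj',
  'model.layers.0.mlp.down_proj'
];
const adapted = paths.filter(p => matchesTarget(p, ['q_proj', 'v_proj']));
console.log(adapted); // only the q_proj path survives
```

This is why a short name like 'q_proj' adapts the query projection in every layer at once.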
Dataset Preparation with Alpaca Format
interface AlpacaExample {
  instruction: string;
  input: string;
  output: string;
}

function convertToAlpacaFormat(examples: AlpacaExample[]): string[] {
  return examples.map(example => {
    let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n`;
    prompt += `### Instruction:\n${example.instruction}\n\n`;
    if (example.input) {
      prompt += `### Input:\n${example.input}\n\n`;
    }
    prompt += `### Response:\n${example.output}`;
    return prompt;
  });
}

function formatForTraining(
  examples: AlpacaExample[]
): Array<{ text: string }> {
  const formatted = convertToAlpacaFormat(examples);
  return formatted.map(text => ({ text }));
}

// Example dataset preparation
const trainingExamples: AlpacaExample[] = [
  {
    instruction: 'Classify the sentiment of the following review',
    input: 'This product exceeded my expectations. Highly recommend!',
    output: 'positive'
  },
  {
    instruction: 'Classify the sentiment of the following review',
    input: 'Arrived broken and customer service was unhelpful.',
    output: 'negative'
  }
];
const formattedData = formatForTraining(trainingExamples);
console.log(formattedData[0].text);
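One step the formatting above glosses over: for causal-LM fine-tuning, loss is usually computed only on the response tokens. A minimal sketch, using the -100 ignore-index convention from PyTorch's cross-entropy loss and a stand-in tokenize helper:

```typescript
// Sketch: build training labels that mask the prompt so loss is
// computed only on response tokens. tokenize is a stand-in, not a
// real tokenizer; -100 is PyTorch's CrossEntropyLoss ignore_index.
const IGNORE_INDEX = -100;

function tokenize(text: string): number[] {
  // Stand-in: one fake token id per whitespace-separated word
  return text.split(/\s+/).filter(Boolean).map((_, i) => i + 1);
}

function buildLabels(
  prompt: string,
  response: string
): { inputIds: number[]; labels: number[] } {
  const promptIds = tokenize(prompt);
  const responseIds = tokenize(response);
  return {
    inputIds: [...promptIds, ...responseIds],
    // Prompt positions contribute no loss; response positions do
    labels: [...promptIds.map(() => IGNORE_INDEX), ...responseIds]
  };
}
```

Trainer utilities such as TRL's SFTTrainer can do this masking for you, but it is worth knowing what the labels look like.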
Training with Hugging Face PEFT
interface TrainingConfig {
  outputDir: string;
  numEpochs: number;
  batchSize: number;
  learningRate: number;
  warmupSteps: number;
  loggingSteps: number;
}

async function trainLoRAModel(
  modelName: string,
  datasetPath: string,
  config: TrainingConfig
): Promise<void> {
  // Pseudocode for the actual Hugging Face Trainer setup
  const trainingArguments = {
    output_dir: config.outputDir,
    num_train_epochs: config.numEpochs,
    per_device_train_batch_size: config.batchSize,
    learning_rate: config.learningRate,
    warmup_steps: config.warmupSteps,
    logging_steps: config.loggingSteps,
    save_strategy: 'epoch',
    remove_unused_columns: false,
    report_to: ['wandb'], // For monitoring
    gradient_accumulation_steps: 4,
    optim: 'paged_adamw_32bit' // Memory-efficient optimizer
  };
  const peftConfig = {
    r: 8,
    lora_alpha: 16,
    target_modules: ['q_proj', 'v_proj'],
    lora_dropout: 0.05,
    bias: 'none',
    task_type: 'CAUSAL_LM'
  };
  // In production (Python):
  // from peft import get_peft_model, LoraConfig
  // from transformers import Trainer, TrainingArguments
  // peft_config = LoraConfig(**peft_config)
  // model = get_peft_model(base_model, peft_config)
  // trainer = Trainer(
  //   model=model,
  //   args=TrainingArguments(**training_arguments),
  //   train_dataset=train_dataset,
  //   callbacks=[EarlyStoppingCallback()]
  // )
  // trainer.train()
  console.log(
    `Training ${modelName} with config:`,
    trainingArguments,
    peftConfig
  );
}

const config: TrainingConfig = {
  outputDir: './lora_checkpoints',
  numEpochs: 3,
  batchSize: 4,
  learningRate: 2e-4,
  warmupSteps: 100,
  loggingSteps: 50
};
trainLoRAModel('meta-llama/Llama-2-7b', './data', config);
Memory Requirements Per Model Size
interface MemoryEstimate {
  modelSize: string;
  baseMemory: number; // GB, full precision
  with4BitQuantization: number;
  withLoRA: number;
  gpuRequired: 'consumer' | 'pro' | 'server';
}

function estimateMemoryRequirements(): MemoryEstimate[] {
  return [
    {
      modelSize: '3B',
      baseMemory: 12,
      with4BitQuantization: 3,
      withLoRA: 4.5,
      gpuRequired: 'consumer'
    },
    {
      modelSize: '7B',
      baseMemory: 28,
      with4BitQuantization: 6,
      withLoRA: 8,
      gpuRequired: 'consumer'
    },
    {
      modelSize: '13B',
      baseMemory: 52,
      with4BitQuantization: 10,
      withLoRA: 14,
      gpuRequired: 'pro'
    },
    {
      modelSize: '70B',
      baseMemory: 280,
      with4BitQuantization: 48,
      withLoRA: 64,
      gpuRequired: 'server'
    }
  ];
}

// Makes training accessible!
// Llama 2 7B QLoRA fits on an RTX 4090 (24 GB) with batch_size=1
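A back-of-the-envelope helper reproduces the table's 4-bit figures; the overhead value is an assumption standing in for activations, LoRA weights, and optimizer state, and real usage also depends on batch size and sequence length:

```typescript
// Rough VRAM estimate: quantized weights plus a fixed overhead.
// The overhead parameter is an assumption, not a measured number.
function estimateQLoRAMemoryGB(
  paramsBillions: number,
  bits: number,
  overheadGB: number
): number {
  const weightsGB = (paramsBillions * 1e9 * bits) / 8 / 1e9;
  return weightsGB + overheadGB;
}

// 7B at 4-bit: 3.5 GB of weights + ~2.5 GB overhead ≈ 6 GB
console.log(estimateQLoRAMemoryGB(7, 4, 2.5)); // 6
```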
Merging LoRA Weights
Convert adapter to standalone model for deployment:
async function mergeLoRAWeights(
  basePath: string,
  adapterPath: string
): Promise<void> {
  // In production (Python):
  // from peft import AutoPeftModelForCausalLM
  // model = AutoPeftModelForCausalLM.from_pretrained(
  //   adapter_path, device_map='auto'
  // )
  // merged = model.merge_and_unload()
  // merged.save_pretrained('merged_model')

  // TypeScript wrapper (pseudocode)
  const mergedModelPath = `${adapterPath}_merged`;
  console.log(`Merged LoRA adapter ${adapterPath} into base model ${basePath}`);
  console.log(`Output: ${mergedModelPath}`);
  // Post-merge: a single model artifact, no adapter overhead
  // File size: same as the base model (the update is folded into existing weights)
  // Inference latency: unchanged (merged into the weights)
}
Serving: Merged vs Adapter
interface ServingStrategy {
  approach: 'merged' | 'adapter';
  memoryOverhead: number; // MB
  inferenceLatency: number; // ms
  deploymentComplexity: 'simple' | 'complex';
  supportMultipleAdapters: boolean;
}

function compareServingStrategies(): ServingStrategy[] {
  return [
    {
      approach: 'merged',
      memoryOverhead: 0,
      inferenceLatency: 100,
      deploymentComplexity: 'simple',
      supportMultipleAdapters: false
    },
    {
      approach: 'adapter',
      memoryOverhead: 50, // LoRA weights only
      inferenceLatency: 101, // Minimal overhead
      deploymentComplexity: 'complex',
      supportMultipleAdapters: true
    }
  ];
}

// Merged: single-task APIs, simplest deployment
// Adapter: multi-tenant systems where one base model serves many specialized adapters
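For the adapter approach, per-request routing can be as simple as a tenant-to-adapter lookup; everything below (registry shape, tenant ids, paths) is hypothetical:

```typescript
// Sketch: tenant-to-adapter routing for multi-adapter serving.
// One base model stays resident; a small adapter is picked per request.
interface AdapterRegistry {
  [tenantId: string]: string; // tenant id -> adapter path
}

function resolveAdapter(
  registry: AdapterRegistry,
  tenantId: string,
  fallback: string
): string {
  return registry[tenantId] ?? fallback;
}

const registry: AdapterRegistry = {
  'acme-support': './adapters/acme_support',
  'acme-legal': './adapters/acme_legal'
};
console.log(resolveAdapter(registry, 'acme-support', './adapters/base'));
```

Serving stacks with native multi-LoRA support (e.g. vLLM) apply the same idea, swapping adapters without reloading the base weights.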
Evaluation with Eleuther LM Eval
interface EvaluationTask {
  name: string;
  metric: string;
  shots: number;
}

interface EvalResults {
  task: string;
  score: number;
  stderr: number;
}

async function evaluateWithEleuther(
  modelPath: string,
  tasks: string[]
): Promise<EvalResults[]> {
  // Pseudocode for lm-evaluation-harness integration;
  // these are placeholder scores, not real results
  const results: EvalResults[] = [
    { task: 'arc_challenge', score: 0.72, stderr: 0.02 },
    { task: 'hellaswag', score: 0.68, stderr: 0.01 },
    { task: 'mmlu', score: 0.65, stderr: 0.015 },
    { task: 'truthfulqa', score: 0.58, stderr: 0.025 }
  ];
  return results;
}

// Standard benchmarks for comparing LoRA fine-tuned models
// Track improvements vs. the base model
const baseline = {
  arc_challenge: 0.60,
  hellaswag: 0.64,
  mmlu: 0.55,
  truthfulqa: 0.42
};

async function reportImprovement(
  modelPath: string
): Promise<{ [key: string]: number }> {
  const results = await evaluateWithEleuther(modelPath, [
    'arc_challenge',
    'hellaswag',
    'mmlu',
    'truthfulqa'
  ]);
  const improvements: { [key: string]: number } = {};
  results.forEach(result => {
    const baseScore = baseline[result.task as keyof typeof baseline];
    // toFixed returns a string, so convert back to a number
    improvements[result.task] = Number(
      (((result.score - baseScore) / baseScore) * 100).toFixed(1)
    );
  });
  return improvements;
}
Checklist
- Selected LoRA rank (4-16) and alpha based on model size
- Identified target modules (q_proj, v_proj minimum)
- Prepared dataset in Alpaca format with clear instructions
- Configured QLoRA for 4-bit quantization
- Verified model fits in available GPU memory
- Set learning rate to 2e-4 or lower
- Trained for 2-3 epochs with early stopping
- Evaluated on benchmark tasks (MMLU, ARC, HellaSwag)
- Merged weights for production deployment
- Tested inference latency (should be unchanged)
- Created model card documenting LoRA configuration
Conclusion
LoRA enables efficient fine-tuning on consumer hardware. QLoRA plus Hugging Face PEFT makes adapting 7B-13B models accessible without enterprise infrastructure. Standardize on rank=8, alpha=16, and merge adapters for production to keep deployment simple.