Fine-tune LLMs in 2026: From Zero to Custom Model
Fine-tuning adapts a general LLM (Llama 3, Mistral, Gemma) to your needs: answering in your company's style, specializing in a particular domain, or following custom instruction formats consistently.
- When to Fine-tune vs Prompt Engineering
- QLoRA: Fine-tune on Consumer Hardware
- Prepare Your Dataset
- Training Loop
- Inference with Fine-tuned Model
- Deploy to HuggingFace Hub
- Cost Estimate
- Tips for Better Results
When to Fine-tune vs Prompt Engineering
| Situation | Solution |
|---|---|
| Need specific response format | Prompt engineering (cheaper) |
| Specialized domain knowledge | Fine-tuning |
| Consistent tone/persona | Fine-tuning |
| Very long system prompts (slow) | Fine-tune to "bake in" the behavior |
| Reduce hallucinations on your data | Fine-tuning + RAG |
| Need 10x faster inference | Fine-tune a smaller model |
QLoRA: Fine-tune on Consumer Hardware
QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a 7B-8B model on a single 16GB GPU:
pip install transformers peft datasets trl bitsandbytes accelerate
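Before starting a run, it helps to confirm the GPU actually has the headroom QLoRA needs (roughly 12-16 GB for an 8B model at these settings). A quick sanity check:

import torch

if not torch.cuda.is_available():
    raise SystemExit("QLoRA fine-tuning needs a CUDA GPU")

# Report the device name and total VRAM in GB
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB VRAM")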
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
# ── 1. Load base model in 4-bit quantization ──────────────────────────────
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "meta-llama/Llama-3.2-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# ── 2. Configure LoRA adapters ─────────────────────────────────────────────
lora_config = LoraConfig(
r=16, # Rank: higher = more params, better quality
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints the trainable vs. total parameter counts; expect well under 1% trainable
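The fraction is small because a LoRA adapter on a projection of shape (d_in, d_out) only adds r * (d_in + d_out) trainable parameters. A back-of-envelope estimate, assuming Llama-3-style 8B shapes (32 layers, hidden size 4096, grouped-query attention with 1024-dim K/V projections):

r = 16
layers = 32
shapes = {"q_proj": (4096, 4096), "k_proj": (4096, 1024),
          "v_proj": (4096, 1024), "o_proj": (4096, 4096)}

# r * (d_in + d_out) parameters per adapted projection, summed over all layers
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
print(f"{lora_params:,} trainable LoRA params (~{lora_params / 8e9:.2%} of 8B)")
# -> roughly 13.6M, i.e. well under 1% of the base model, under these assumed shapes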
Prepare Your Dataset
# Format your data as conversation pairs
training_data = [
{
"instruction": "What is a Python list comprehension?",
"response": "A list comprehension creates a new list by applying an expression to each item in an iterable.\n\nExample:\n```python\nsquares = [x**2 for x in range(10)]\n```\nThis is equivalent to a for loop but more concise."
},
{
"instruction": "Explain the difference between == and is in Python",
"response": "== compares values (are they equal?). is compares identity (are they the same object in memory?).\n\n```python\na = [1, 2, 3]\nb = [1, 2, 3]\nprint(a == b) # True — same values\nprint(a is b) # False — different objects\n```"
},
# ... hundreds or thousands more examples
]
def format_instruction(sample: dict) -> str:
    """Format a sample with the Llama 3 chat template (each header is followed by a blank line)."""
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Python tutor.<|eot_id|><|start_header_id|>user<|end_header_id|>

{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{sample['response']}<|eot_id|>"""
dataset = Dataset.from_list(training_data)
dataset = dataset.map(lambda x: {"text": format_instruction(x)})
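Hand-writing special tokens is error-prone; if the tokenizer ships a chat template (the Llama 3 Instruct tokenizers do), it is safer to let it build the string. A sketch of the same formatting step using apply_chat_template:

def format_with_template(sample: dict) -> dict:
    messages = [
        {"role": "system", "content": "You are an expert Python tutor."},
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["response"]},
    ]
    # tokenize=False returns the fully formatted string instead of token ids
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = Dataset.from_list(training_data).map(format_with_template)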
Training Loop
training_args = SFTConfig(
output_dir="./fine-tuned-model",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
    bf16=True,  # matches bnb_4bit_compute_dtype above; use fp16=True on pre-Ampere GPUs (e.g. T4)
logging_steps=10,
save_strategy="epoch",
warmup_ratio=0.03,
lr_scheduler_type="cosine",
    max_seq_length=2048,
    dataset_text_field="text",  # column produced by the formatting step above
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,  # older trl releases call this argument `tokenizer`
)
trainer.train()
trainer.save_model("./fine-tuned-model")
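To follow the eval-set tip at the end of this guide, hold out 10% of the data before training and pass it as an evaluation split. A sketch using the same config:

split = dataset.train_test_split(test_size=0.1, seed=42)  # 90% train / 10% eval

trainer = SFTTrainer(
    model=model,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    args=training_args,
    processing_class=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # loss on the held-out 10%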
Inference with Fine-tuned Model
from peft import PeftModel
# Load base model + LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
model_name, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
# Generate
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is a decorator?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    return_tensors="pt",
    add_special_tokens=False,  # the prompt already starts with <|begin_of_text|>
).to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=300,
temperature=0.1,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
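The same chat-template approach works at inference time; add_generation_prompt=True appends the assistant header so the model starts a fresh reply. A sketch, assuming the fine-tuned model and tokenizer from above:

messages = [{"role": "user", "content": "What is a decorator?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=300, temperature=0.1,
                             do_sample=True, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))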
Deploy to HuggingFace Hub
# Merge LoRA weights into the base model for deployment
# (reload the base in bf16 first; merging into a 4-bit quantized model degrades quality)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
merged_model = PeftModel.from_pretrained(base_model, "./fine-tuned-model").merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
# Push to HuggingFace Hub
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
folder_path="./merged-model",
repo_id="your-username/my-python-tutor",
repo_type="model"
)
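After the upload, the merged model loads like any other Hub checkpoint (your-username/my-python-tutor is the placeholder repo name from above):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/my-python-tutor", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/my-python-tutor")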
Cost Estimate
| Setup | Time | Cost |
|---|---|---|
| Google Colab T4 (free) | 2-3 hours | Free |
| Google Colab A100 | 30 min | ~$3 |
| Runpod A100 80GB | 20 min | ~$2 |
| Modal.com H100 | 15 min | ~$5 |
For scale: fine-tuning a 7B-8B model on roughly 1,000 examples typically finishes in under 30 minutes on a single A100.
Tips for Better Results
- Quality over quantity: 500 high-quality examples beat 5,000 mediocre ones
- Format matters: Use the model's native chat template exactly
- Diverse examples: Cover all your use cases in training data
- Eval set: Keep 10% of data for evaluation, not training
- Iterative: Fine-tune → evaluate → add more data → repeat