Fine-tune LLMs in 2026: From Zero to Custom Model
Fine-tuning adapts a general LLM (Llama 3, Mistral, Gemma) to your needs: answering in your company's style, specializing in a particular domain, or following custom instruction formats consistently.
- When to Fine-tune vs Prompt Engineering
- QLoRA: Fine-tune on Consumer Hardware
- Prepare Your Dataset
- Training Loop
- Inference with Fine-tuned Model
- Deploy to HuggingFace Hub
- Cost Estimate
- Tips for Better Results
When to Fine-tune vs Prompt Engineering
| Situation | Solution |
|---|---|
| Need specific response format | Prompt engineering (cheaper) |
| Specialized domain knowledge | Fine-tuning |
| Consistent tone/persona | Fine-tuning |
| Very long system prompts (slow) | Fine-tune to "bake in" the behavior |
| Reduce hallucinations on your data | Fine-tuning + RAG |
| Need 10x faster inference | Fine-tune a smaller model |
QLoRA: Fine-tune on Consumer Hardware
QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a 7B-8B model on a single 16GB GPU:
pip install transformers peft datasets trl bitsandbytes accelerate
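Before starting a run, it helps to confirm the GPU actually has the headroom QLoRA needs (roughly 12-16 GB for an 8B model at these settings). A quick sanity check:

import torch

if not torch.cuda.is_available():
    raise SystemExit("QLoRA fine-tuning needs a CUDA GPU")

# Report the device name and total VRAM in GB
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB VRAM")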
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
# ── 1. Load base model in 4-bit quantization ──────────────────────────────
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "meta-llama/Llama-3.2-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# ── 2. Configure LoRA adapters ─────────────────────────────────────────────
lora_config = LoraConfig(
r=16, # Rank: higher = more params, better quality
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints the trainable vs. total parameter counts; expect well under 1% trainable
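The fraction is small because a LoRA adapter on a projection of shape (d_in, d_out) only adds r * (d_in + d_out) trainable parameters. A back-of-envelope estimate, assuming Llama-3-style 8B shapes (32 layers, hidden size 4096, grouped-query attention with 1024-dim K/V projections):

r = 16
layers = 32
shapes = {"q_proj": (4096, 4096), "k_proj": (4096, 1024),
          "v_proj": (4096, 1024), "o_proj": (4096, 4096)}

# r * (d_in + d_out) parameters per adapted projection, summed over all layers
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
print(f"{lora_params:,} trainable LoRA params (~{lora_params / 8e9:.2%} of 8B)")
# -> roughly 13.6M, i.e. well under 1% of the base model, under these assumed shapes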
Prepare Your Dataset
# Format your data as conversation pairs
training_data = [
{
"instruction": "What is a Python list comprehension?",
"response": "A list comprehension creates a new list by applying an expression to each item in an iterable.\n\nExample:\n```python\nsquares = [x**2 for x in range(10)]\n```\nThis is equivalent to a for loop but more concise."
},
{
"instruction": "Explain the difference between == and is in Python",
"response": "== compares values (are they equal?). is compares identity (are they the same object in memory?).\n\n```python\na = [1, 2, 3]\nb = [1, 2, 3]\nprint(a == b) # True — same values\nprint(a is b) # False — different objects\n```"
},
# ... hundreds or thousands more examples
]
def format_instruction(sample: dict) -> str:
    """Format a sample with the Llama 3 chat template (each header is followed by a blank line)."""
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Python tutor.<|eot_id|><|start_header_id|>user<|end_header_id|>

{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{sample['response']}<|eot_id|>"""
dataset = Dataset.from_list(training_data)
dataset = dataset.map(lambda x: {"text": format_instruction(x)})
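Hand-writing special tokens is error-prone; if the tokenizer ships a chat template (the Llama 3 Instruct tokenizers do), it is safer to let it build the string. A sketch of the same formatting step using apply_chat_template:

def format_with_template(sample: dict) -> dict:
    messages = [
        {"role": "system", "content": "You are an expert Python tutor."},
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["response"]},
    ]
    # tokenize=False returns the fully formatted string instead of token ids
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = Dataset.from_list(training_data).map(format_with_template)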
Training Loop
training_args = SFTConfig(
output_dir="./fine-tuned-model",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
    bf16=True,  # matches bnb_4bit_compute_dtype above; use fp16=True on pre-Ampere GPUs (e.g. T4)
logging_steps=10,
save_strategy="epoch",
warmup_ratio=0.03,
lr_scheduler_type="cosine",
    max_seq_length=2048,
    dataset_text_field="text",  # column produced by the formatting step above
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,  # older trl releases call this argument `tokenizer`
)
trainer.train()
trainer.save_model("./fine-tuned-model")
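To follow the eval-set tip at the end of this guide, hold out 10% of the data before training and pass it as an evaluation split. A sketch using the same config:

split = dataset.train_test_split(test_size=0.1, seed=42)  # 90% train / 10% eval

trainer = SFTTrainer(
    model=model,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    args=training_args,
    processing_class=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # loss on the held-out 10%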
Inference with Fine-tuned Model
from peft import PeftModel
# Load base model + LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
model_name, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
# Generate
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is a decorator?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    return_tensors="pt",
    add_special_tokens=False,  # the prompt already starts with <|begin_of_text|>
).to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=300,
temperature=0.1,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
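The same chat-template approach works at inference time; add_generation_prompt=True appends the assistant header so the model starts a fresh reply. A sketch, assuming the fine-tuned model and tokenizer from above:

messages = [{"role": "user", "content": "What is a decorator?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=300, temperature=0.1,
                             do_sample=True, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))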
Deploy to HuggingFace Hub
# Merge LoRA weights into the base model for deployment
# (reload the base in bf16 first; merging into a 4-bit quantized model degrades quality)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
merged_model = PeftModel.from_pretrained(base_model, "./fine-tuned-model").merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
# Push to HuggingFace Hub
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
folder_path="./merged-model",
repo_id="your-username/my-python-tutor",
repo_type="model"
)
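After the upload, the merged model loads like any other Hub checkpoint (your-username/my-python-tutor is the placeholder repo name from above):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/my-python-tutor", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/my-python-tutor")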
Cost Estimate
| Setup | Time | Cost |
|---|---|---|
| Google Colab T4 (free) | 2-3 hours | Free |
| Google Colab A100 | 30 min | ~$3 |
| Runpod A100 80GB | 20 min | ~$2 |
| Modal.com H100 | 15 min | ~$5 |
For scale: fine-tuning a 7B-8B model on roughly 1,000 examples typically finishes in under 30 minutes on a single A100.
Tips for Better Results
- Quality over quantity: 500 high-quality examples beat 5,000 mediocre ones
- Format matters: Use the model's native chat template exactly
- Diverse examples: Cover all your use cases in training data
- Eval set: Keep 10% of data for evaluation, not training
- Iterative: Fine-tune → evaluate → add more data → repeat