Expert guide for LLM fine-tuning and parameter-efficient training methods. Covers LoRA, QLoRA, PEFT library, adapter tuning, instruction tuning, quantization (GPTQ, AWQ, GGUF, bitsandbytes), dataset preparation for fine-tuning, Hugging Face Trainer/TRL, RLHF/DPO/ORPO alignment, and model merging. Use when user asks about 'fine-tuning', 'LoRA', 'QLoRA', 'PEFT', 'adapter', 'quantization', 'bitsandbytes', '4-bit', '8-bit', 'instruction tuning', 'RLHF', 'DPO', 'model merging', 'Unsloth', 'Axolotl', 'training custom models', 'TRL', or 'SFT'.
Install via the SkillsCat registry: npx skillscat add levy-n/claude-useful-skills/fine-tuning-peft
Fine-Tuning & PEFT - Parameter-Efficient Model Training
Fine-tuning and parameter-efficient training methods for large models.
Quick Start - LoRA Fine-Tuning with PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset
# 1. Load base model + tokenizer
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# 2. Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,  # Rank (low = efficient, high = expressive)
    lora_alpha=32,  # Scaling factor (usually 2*r)
    lora_dropout=0.05,  # Regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none"
)
# 3. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; for this config roughly 3.4M of ~1.24B parameters (under 0.3%) are trainable
# 4. Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")
def format_prompt(example):
    # Alpaca rows have an optional "input" field; include it only when non-empty
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
# 5. Train with SFTTrainer
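training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size: 4 * 4 = 16 per device
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,  # Use bf16=True instead on GPUs that support bfloat16
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    formatting_func=format_prompt,
    max_seq_length=512,  # Newer TRL releases expect this on SFTConfig instead; check your installed version
)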
trainer.train()
# 6. Save & Load
model.save_pretrained("./my-lora-adapter")
# Later: from peft import PeftModel; model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
When This Skill Activates
Use this skill when:
- Fine-tuning any pretrained model (LLM, BERT, vision)
- Using LoRA/QLoRA for parameter-efficient training
- Preparing datasets for instruction tuning
- Quantizing models (4-bit, 8-bit)
- Aligning models with RLHF/DPO
- Merging LoRA adapters back into base model
Core Patterns
Pattern 1: LoRA vs QLoRA vs Full Fine-Tuning
Method Selection:
├── Full Fine-Tuning
│   ├── When: Small model (<1B), lots of data, lots of VRAM
│   ├── VRAM: ~8x the fp16 model size with AdamW in mixed precision (e.g., 7B → ~112GB before activations)
│   └── Best accuracy, but expensive
│
├── LoRA (Low-Rank Adaptation)
│   ├── When: Large model, limited VRAM, need good quality
│   ├── VRAM: ~1.2x the fp16 model size
│   ├── Trains only low-rank matrices (0.1-1% of params)
│   └── 90-95% of full fine-tuning quality
│
├── QLoRA (Quantized LoRA)
│   ├── When: Very limited VRAM (single consumer GPU)
│   ├── VRAM: ~0.4x the fp16 model size (7B → ~6GB)
│   ├── Base model in 4-bit + LoRA in fp16
│   └── 85-95% of full fine-tuning quality
│
└── Prompt Tuning / Prefix Tuning
    ├── When: Minimal compute, many tasks
    ├── VRAM: Minimal overhead
    └── Lower quality, but very fast
Pattern 2: QLoRA - Fine-Tune a 7-8B Model on a Single GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Quantization config (4-bit with NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype=torch.float16,  # Compute in fp16
    bnb_4bit_use_double_quant=True  # Double quantization saves memory
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Prepare for k-bit training (handles gradient checkpointing)
model = prepare_model_for_kbit_training(model)
# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~42M || all params: ~8B || trainable%: ~0.5%
Pattern 3: Dataset Preparation for Fine-Tuning
from datasets import load_dataset, Dataset
# Format 1: Instruction-Response (Alpaca style)
def format_alpaca(example):
    # The template lines stay at column 0 so no indentation leaks into the prompt text
    return {
        "text": f"""### Instruction:
{example['instruction']}
### Input:
{example.get('input', '')}
### Response:
{example['output']}"""
    }
# Format 2: Chat format (ChatML)
def format_chatml(example):
    # Expects "question" and "answer" columns
    return {
        "text": f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{example['question']}<|im_end|>
<|im_start|>assistant
{example['answer']}<|im_end|>"""
    }
# Format 3: Custom data from CSV/JSON
import pandas as pd
df = pd.read_csv("my_data.csv") # columns: question, answer
dataset = Dataset.from_pandas(df)
dataset = dataset.map(format_chatml)
dataset = dataset.train_test_split(test_size=0.1)
# Data quality checks
print(f"Train size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")
print(f"Sample: {dataset['train'][0]['text'][:200]}")
Pattern 4: LoRA Hyperparameter Guide
LoRA Hyperparameters:
┌──────────────────────────────────────────────────────────┐
│ r (rank)                                                 │
│ ├── 4-8: Quick experiments, simple tasks                 │
│ ├── 16: Good default for most tasks                      │
│ ├── 32-64: Complex tasks, lots of data                   │
│ └── Rule: Higher r = more params = more expressive       │
│                                                          │
│ lora_alpha                                               │
│ ├── Usually = 2 * r                                      │
│ ├── Scales the adapter update by alpha / r               │
│ └── Higher alpha = stronger adapter effect               │
│                                                          │
│ target_modules                                           │
│ ├── Minimum: ["q_proj", "v_proj"]                        │
│ ├── Better: + ["k_proj", "o_proj"]                       │
│ ├── Best: + ["gate_proj", "up_proj", "down_proj"]        │
│ └── More modules = better quality, more VRAM             │
│                                                          │
│ lora_dropout                                             │
│ ├── 0.0: Large datasets                                  │
│ ├── 0.05: Default                                        │
│ └── 0.1: Small datasets (regularization)                 │
└──────────────────────────────────────────────────────────┘
Pattern 5: Quantization Methods Comparison
| Method | Bits | Speed | Quality | Use Case |
|---|---|---|---|---|
| bitsandbytes NF4 | 4-bit | Fast | Good | QLoRA training |
| GPTQ | 4-bit | Fastest | Good | Inference deployment |
| AWQ | 4-bit | Fast | Best 4-bit | Production inference |
| GGUF | 2-8 bit | CPU-friendly | Varies | llama.cpp / local |
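from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# bitsandbytes 8-bit (simpler, good for inference)
model = AutoModelForCausalLM.from_pretrained(
    model_name,  # Any causal LM checkpoint, e.g. the one from the Quick Start
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
# GPTQ (pre-quantized models from HuggingFace; a GPTQ backend must be installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)
# AWQ (requires the autoawq package)
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True
)
Pattern 6: RLHF / DPO Alignment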
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
# DPO - Direct Preference Optimization (simpler than RLHF)
# Dataset needs: prompt, chosen (good response), rejected (bad response)
model = AutoModelForCausalLM.from_pretrained("my-sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")
dpo_config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,  # Very low LR for alignment
    beta=0.1,  # Strength of the implicit KL penalty toward ref_model
    logging_steps=10,
)
# preference_dataset must be loaded beforehand
# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # Newer TRL releases name this argument processing_class
)
trainer.train()
RLHF Pipeline:
┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│   Stage 1   │    │   Stage 2    │    │   Stage 3    │
│  SFT Train  │───▶│ Reward Model │───▶│   RL (PPO)   │
│             │    │   Training   │    │    or DPO    │
│ Instruction │    │ Human prefs  │    │  Alignment   │
│   dataset   │    │ chosen/reject│    │              │
└─────────────┘    └──────────────┘    └──────────────┘
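The `beta` value in the DPO config above weights how far the policy may drift from the reference model. As a rough illustration, the per-pair DPO objective can be written in a few lines of plain Python. This is a minimal sketch, assuming the summed log-probabilities of each response under the policy and the reference model are already available; the function name `dpo_loss` and the example numbers are hypothetical and are not part of TRL's API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How much more (in log-prob) the policy likes each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss = log(2)
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: loss drops
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"baseline: {baseline:.4f}  improved: {improved:.4f}")
```

A larger `beta` makes the same log-probability margin count for more, so the loss reacts more sharply to any deviation from the reference model.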
DPO Simplification:
  Skips Stage 2: no separate reward model is trained.
  Goes directly from SFT → Alignment, using preference pairs (chosen vs rejected).
Pattern 7: Merge LoRA Back to Base Model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16
)
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
# Merge adapter into base model
merged_model = model.merge_and_unload()
# Save as standalone model
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged-model")
# Now you have a single model without PEFT dependency
Reference Navigation
For detailed content, see:
- LoRA & QLoRA: reference/lora_qlora.md - Architecture, math, hyperparameter tuning
- Quantization Methods: reference/quantization_methods.md - GPTQ, AWQ, GGUF, bitsandbytes
- Dataset Preparation: reference/dataset_preparation.md - Formats, cleaning, quality checks
- Training Recipes: reference/training_recipes.md - End-to-end fine-tuning workflows
Common Mistakes to Avoid
1. Wrong Target Modules
# WRONG: Generic module names (won't match)
lora_config = LoraConfig(target_modules=["attention"])
# CORRECT: Find actual module names first
print(model) # See architecture
# Or: [name for name, _ in model.named_modules() if "proj" in name]
2. Not Setting pad_token
# WRONG: Many models don't have pad_token
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Padding a batch then fails because the tokenizer has no padding token
# CORRECT: Set pad_token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
3. Training on Prompt (not just response)
# WRONG: Model learns to generate the prompt too
# This wastes training signal
# CORRECT: Compute the loss on the response only, via response_template
# (DataCollatorForCompletionOnlyLM may not be available in every TRL release; check your installed version)
from trl import DataCollatorForCompletionOnlyLM
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    formatting_func=format_prompt,
    data_collator=DataCollatorForCompletionOnlyLM(
        response_template="### Response:",
        tokenizer=tokenizer
    ),
)
4. QLoRA without prepare_model_for_kbit_training
# WRONG: Direct LoRA on quantized model
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)  # bnb_config as in Pattern 2
model = get_peft_model(model, lora_config)  # Gradients won't flow!
# CORRECT: Prepare model first
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
Teaching Mode
LoRA Intuition
Full Fine-Tuning:
  Original weights W (huge matrix, e.g., 4096×4096 = 16M params)
  Update: W_new = W + ΔW (16M params to train!)
LoRA:
  Keep W frozen (no training)
  Add: W_new = W + A × B
  Where A is 4096×16, B is 16×4096
  Total trainable: 4096×16 + 16×4096 = 131K params!
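The arithmetic above can be checked for other ranks with a few lines of Python. This is a small sketch that follows the same A × B notation; the helper names `lora_param_count` and `full_param_count` are illustrative and do not come from the PEFT library.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair on a d_out x d_in weight."""
    # A is d_out x r and B is r x d_in, so A x B has the same shape as W
    return d_out * r + r * d_in


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix W itself."""
    return d_in * d_out


d = 4096
full = full_param_count(d, d)
for r in (4, 8, 16, 32, 64):
    lora = lora_param_count(d, d, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} full ({lora / full:.2%})")
```

For r=16 this gives 131,072 trainable parameters against 16,777,216 in W, about 0.78%. The count grows linearly with r, which is why doubling the rank doubles the adapter size.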
┌──────────────┐
│  W (frozen)  │  4096 × 4096
│              │
└──────┬───────┘
       │
  ┌────┴────┐
  │ + A × B │  A: 4096×r  B: r×4096  (r=16)
  └─────────┘
Trainable: 0.8% of original!
Intuition: Most model knowledge is in W.
LoRA learns a small "correction" on top.
Quantization Intuition
Full Precision (FP32): 32 bits per number → 28GB for 7B model
Half Precision (FP16): 16 bits per number → 14GB for 7B model
8-bit (INT8): 8 bits per number → 7GB for 7B model
4-bit (NF4): 4 bits per number → 3.5GB for 7B model
Trade-off: Less precision = less VRAM, but slightly less accurate
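The figures above are bits per weight times parameter count. The sketch below reproduces them, assuming 1 GB = 1e9 bytes and counting the weights only; activations, gradients, optimizer state, and quantization constants add to this in practice. The helper name `weight_memory_gb` is illustrative.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "nf4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


sizes = {name: weight_memory_gb(7e9, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:.1f} GB for a 7B model")
```

Swapping `7e9` for another parameter count gives the same estimate for other model sizes.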
NF4 (NormalFloat4): Optimized for normally-distributed weights
  → Better than regular INT4 for neural networks
Cross-References
- PyTorch basics: ../pytorch-mastery/SKILL.md - Training loops, GPU memory
- Transformers: ../transformers-llm/SKILL.md - HuggingFace ecosystem
- RAG integration: ../rag-retrieval/SKILL.md - Fine-tuned models in RAG
- Deep learning fundamentals: ../deep-learning-core/SKILL.md - Loss, optimizers