levy-n

fine-tuning-peft

Expert guide for LLM fine-tuning and parameter-efficient training methods. Covers LoRA, QLoRA, PEFT library, adapter tuning, instruction tuning, quantization (GPTQ, AWQ, GGUF, bitsandbytes), dataset preparation for fine-tuning, Hugging Face Trainer/TRL, RLHF/DPO/ORPO alignment, and model merging. Use when user asks about 'fine-tuning', 'LoRA', 'QLoRA', 'PEFT', 'adapter', 'quantization', 'bitsandbytes', '4-bit', '8-bit', 'instruction tuning', 'RLHF', 'DPO', 'model merging', 'Unsloth', 'Axolotl', 'training custom models', 'TRL', or 'SFT'.

levy-n 10 1 Updated 4mo ago


Install

npx skillscat add levy-n/claude-useful-skills/fine-tuning-peft

Install via the SkillsCat registry.

SKILL.md

Fine-Tuning & PEFT - Parameter-Efficient Model Training

Fine-tuning and parameter-efficient training methods for large models.

Quick Start - LoRA Fine-Tuning with PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset

# 1. Load base model + tokenizer
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (low = efficient, high = expressive)
    lora_alpha=32,                 # Scaling factor (usually 2*r)
    lora_dropout=0.05,             # Regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none"
)

# 3. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate; exact figures depend on the model and PEFT version):
# trainable params: ~3.4M || all params: ~1.24B || trainable%: ~0.28

# 4. Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_prompt(example):
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

# 5. Train with SFTTrainer
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                     # Matches the bf16 checkpoint loaded with torch_dtype="auto"; needs a GPU with bf16 support
)

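# Note: argument names vary across TRL releases; newer versions take the
# sequence-length limit from SFTConfig instead of the SFTTrainer constructor.
trainer = SFTTrainer(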
    model=model,
    args=training_args,
    train_dataset=dataset,
    formatting_func=format_prompt,
    max_seq_length=512,
)

trainer.train()

# 6. Save & Load
model.save_pretrained("./my-lora-adapter")
# Later: model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
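
As a sanity check on the schedule above, the sketch below works out the effective batch size and the approximate number of optimizer steps for the Quick Start settings. The helper name `training_schedule` is hypothetical, it assumes a single GPU, and the Trainer's own step count may differ slightly depending on version and rounding.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough schedule arithmetic; hypothetical helper, not part of TRL or transformers."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# Values taken from the Quick Start: 1,000 examples, batch 4, accumulation 4, 3 epochs
effective_batch, total_steps = training_schedule(
    num_examples=1000, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {total_steps}")
print(f"warmup share with warmup_steps=100: {100 / total_steps:.0%}")
```

With the 1,000-example subset, `warmup_steps=100` covers roughly half of training, so a shorter warmup (or a larger dataset) is probably a better fit for real runs.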

When This Skill Activates

Use this skill when:

  • Fine-tuning any pretrained model (LLM, BERT, vision)
  • Using LoRA/QLoRA for parameter-efficient training
  • Preparing datasets for instruction tuning
  • Quantizing models (4-bit, 8-bit)
  • Aligning models with RLHF/DPO
  • Merging LoRA adapters back into base model

Core Patterns

Pattern 1: LoRA vs QLoRA vs Full Fine-Tuning

Method Selection:
├── Full Fine-Tuning
│   ├── When: Small model (<1B), lots of data, lots of VRAM
│   ├── VRAM: ~16 bytes/param with AdamW in mixed precision (e.g., 7B → ~112GB+)
│   └── Best accuracy, but expensive
│
├── LoRA (Low-Rank Adaptation)
│   ├── When: Large model, limited VRAM, need good quality
│   ├── VRAM: ~1.2x model size
│   ├── Trains only low-rank matrices (0.1-1% of params)
│   └── 90-95% of full fine-tuning quality
│
├── QLoRA (Quantized LoRA)
│   ├── When: Very limited VRAM (single consumer GPU)
│   ├── VRAM: ~0.3x model size (7B → ~6GB!)
│   ├── Base model in 4-bit + LoRA in fp16
│   └── 85-95% of full fine-tuning quality
│
└── Prompt Tuning / Prefix Tuning
    ├── When: Minimal compute, many tasks
    ├── VRAM: Minimal overhead
    └── Lower quality, but very fast
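
To make the VRAM comparison concrete, here is a rough estimator for weights, gradients, and optimizer state. It is a sketch under stated rules of thumb, not a measurement: the function name `estimate_training_gb`, the 16-bytes-per-trainable-parameter figure for mixed-precision AdamW, and the 0.5% trainable fraction are all assumptions, and activations are ignored entirely.

```python
def estimate_training_gb(num_params, method, trainable_fraction=0.005):
    """Rule-of-thumb memory for weights + gradients + optimizer state, in GB (1e9 bytes).

    Assumptions (not measurements): AdamW in mixed precision costs ~16 bytes per
    trainable parameter; frozen weights cost 2 bytes (fp16/bf16) or 0.5 bytes (4-bit).
    Activations, quantization constants, and framework overhead are ignored.
    """
    trainable = num_params * trainable_fraction
    if method == "full":
        total_bytes = num_params * 16
    elif method == "lora":
        total_bytes = num_params * 2 + trainable * 16
    elif method == "qlora":
        total_bytes = num_params * 0.5 + trainable * 16
    else:
        raise ValueError(f"unknown method: {method}")
    return total_bytes / 1e9

for method in ("full", "lora", "qlora"):
    print(f"{method:>5}: ~{estimate_training_gb(7e9, method):.1f} GB before activations")
```

Activations and batch size add to these numbers, which is why practical LoRA and QLoRA runs need more headroom than the weights-only figures suggest.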

Pattern 2: QLoRA - Fine-Tune 7B on Single GPU

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantization config (4-bit with NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype=torch.float16, # Compute in fp16
    bnb_4bit_use_double_quant=True        # Double quantization saves memory
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for k-bit training (handles gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~42M || all params: ~8B || trainable%: ~0.5
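
The trainable-parameter count can be checked by hand: each adapted linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters. The sketch below assumes the published Llama-3.1-8B layer shapes (hidden size 4096, MLP width 14336, grouped-query K/V width 1024, 32 layers); `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(shapes, r, num_layers):
    """Each adapted (d_in, d_out) linear layer adds r * (d_in + d_out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed Llama-3.1-8B shapes: hidden 4096, MLP 14336, grouped-query KV width 1024.
llama_8b_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

total = lora_param_count(llama_8b_shapes, r=16, num_layers=32)
print(f"{total:,} trainable LoRA parameters")  # 41,943,040
```

Dropping the three MLP projections from `llama_8b_shapes` shows how much of the adapter budget they account for.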

Pattern 3: Dataset Preparation for Fine-Tuning

from datasets import load_dataset, Dataset

# Format 1: Instruction-Response (Alpaca style)
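def format_alpaca(example):
    # Omit the "### Input:" section when the example has no input,
    # matching the Quick Start prompt format
    input_block = f"### Input:\n{example['input']}\n\n" if example.get("input") else ""
    return {
        "text": (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"{input_block}"
            f"### Response:\n{example['output']}"
        )
    }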

# Format 2: Chat format (ChatML)
def format_chatml(example):
    return {
        "text": f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{example['question']}<|im_end|>
<|im_start|>assistant
{example['answer']}<|im_end|>"""
    }

# Format 3: Custom data from CSV/JSON
import pandas as pd
df = pd.read_csv("my_data.csv")  # columns: question, answer

dataset = Dataset.from_pandas(df)
dataset = dataset.map(format_chatml)
dataset = dataset.train_test_split(test_size=0.1)

# Data quality checks
print(f"Train size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")
print(f"Sample: {dataset['train'][0]['text'][:200]}")
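
The quality checks above only print sizes and a sample. A minimal cleaning pass might look like the following sketch, which drops empty, over-long, and duplicate question/answer pairs before formatting. The helper `clean_examples` and the 2,000-character cap are illustrative assumptions; a token-based limit tied to the training sequence length would be more precise.

```python
def clean_examples(examples, max_chars=2000):
    """Drop empty, over-long, and duplicate question/answer pairs (hypothetical helper)."""
    seen = set()
    cleaned = []
    for ex in examples:
        question = (ex.get("question") or "").strip()
        answer = (ex.get("answer") or "").strip()
        if not question or not answer:
            continue  # missing or whitespace-only field
        if len(question) + len(answer) > max_chars:
            continue  # crude length cap; a token-based limit would be more precise
        key = (question.lower(), answer.lower())
        if key in seen:
            continue  # duplicate after case normalization
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned

raw = [
    {"question": "What is LoRA?", "answer": "A low-rank adapter method."},
    {"question": "what is lora?", "answer": "A low-rank adapter method."},
    {"question": "Empty answer?", "answer": "  "},
]
print(len(clean_examples(raw)))  # 1
```

The cleaned list can then be turned into a dataset with `Dataset.from_list(...)` before applying `format_chatml`.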

Pattern 4: LoRA Hyperparameter Guide

LoRA Hyperparameters:
┌──────────────────────────────────────────────────────────┐
│  r (rank)                                                │
│  ├── 4-8:   Quick experiments, simple tasks              │
│  ├── 16:    Good default for most tasks                  │
│  ├── 32-64: Complex tasks, lots of data                  │
│  └── Rule: Higher r = more params = more expressive      │
│                                                          │
│  lora_alpha                                              │
│  ├── Usually = 2 * r                                     │
│  ├── Scales the update: ΔW = (alpha/r) × A × B           │
│  └── Higher alpha = stronger adapter effect              │
│                                                          │
│  target_modules                                          │
│  ├── Minimum: ["q_proj", "v_proj"]                       │
│  ├── Better: + ["k_proj", "o_proj"]                      │
│  ├── Best: + ["gate_proj", "up_proj", "down_proj"]       │
│  └── More modules = better quality, more VRAM            │
│                                                          │
│  lora_dropout                                            │
│  ├── 0.0:   Large datasets                               │
│  ├── 0.05:  Default                                      │
│  └── 0.1:   Small datasets (regularization)              │
└──────────────────────────────────────────────────────────┘
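
The relationship between `r` and `lora_alpha` is easier to see numerically. This short sketch assumes the standard LoRA scaling factor `lora_alpha / r`; the helper name `lora_scale` is hypothetical.

```python
def lora_scale(lora_alpha, r):
    """Standard LoRA scaling factor applied to the adapter update."""
    return lora_alpha / r

for r in (8, 16, 64):
    tied = lora_scale(2 * r, r)     # alpha = 2 * r keeps the scale constant
    fixed = lora_scale(16, r)       # a fixed alpha = 16 shrinks as r grows
    print(f"r={r:>2}  alpha=2r -> {tied:.2f}   alpha=16 -> {fixed:.2f}")
```

Tying alpha to r keeps the adapter's contribution comparable across ranks, so a learning rate tuned at one rank is more likely to transfer to another.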

Pattern 5: Quantization Methods Comparison

Method            Bits     Speed         Quality     Use Case
bitsandbytes NF4  4-bit    Fast          Good        QLoRA training
GPTQ              4-bit    Fastest       Good        Inference deployment
AWQ               4-bit    Fast          Best 4-bit  Production inference
GGUF              2-8 bit  CPU-friendly  Varies      llama.cpp / local

# bitsandbytes 8-bit (simpler, good for inference)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # replaces the deprecated load_in_8bit=True kwarg
    device_map="auto"
)

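# GPTQ (pre-quantized checkpoints from the Hugging Face Hub; requires optimum
# plus a GPTQ backend such as auto-gptq or gptqmodel to be installed)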
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# AWQ
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True
)

Pattern 6: RLHF / DPO Alignment

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# DPO - Direct Preference Optimization (simpler than RLHF)
# Dataset needs: prompt, chosen (good response), rejected (bad response)

model = AutoModelForCausalLM.from_pretrained("my-sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,            # Very low LR for alignment
    beta=0.1,                       # KL penalty strength
    logging_steps=10,
)

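# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
# preference_dataset is a placeholder for your own dataset in that format.
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,            # newer TRL releases rename this argument to processing_class
)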

trainer.train()

RLHF Pipeline:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Stage 1    │    │   Stage 2    │    │   Stage 3    │
│  SFT Train   │───▶│ Reward Model │───▶│  RL (PPO)    │
│              │    │   Training   │    │  or DPO      │
│ Instruction  │    │ Human prefs  │    │  Alignment   │
│ dataset      │    │ chosen/reject│    │              │
└──────────────┘    └──────────────┘    └──────────────┘

DPO Simplification:
  Skips Stage 2! Goes directly from SFT → Alignment
  Using preference pairs (chosen vs rejected)
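
To show what DPO optimizes without a reward model, the sketch below computes the per-pair DPO loss from summed sequence log-probabilities. The function name `dpo_loss` and the example log-probabilities are made up for illustration; in practice `DPOTrainer` computes these quantities from the policy and reference models.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    chosen_margin = policy_chosen - ref_chosen        # how much the policy upweights "chosen"
    rejected_margin = policy_rejected - ref_rejected  # how much it upweights "rejected"
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: loss is log(2)
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931

# Policy prefers "chosen" more strongly than the reference does: loss drops
print(dpo_loss(-10.0, -18.0, -12.0, -15.0) < dpo_loss(-12.0, -15.0, -12.0, -15.0))  # True
```

A larger `beta` penalizes drifting from the reference model more sharply, which matches its role as the KL penalty strength in `DPOConfig`.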

Pattern 7: Merge LoRA Back to Base Model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge adapter into base model
merged_model = model.merge_and_unload()

# Save as standalone model
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged-model")

# Now you have a single model without PEFT dependency

Reference Navigation

For detailed content, see:

  • LoRA & QLoRA: reference/lora_qlora.md - Architecture, math, hyperparameter tuning
  • Quantization Methods: reference/quantization_methods.md - GPTQ, AWQ, GGUF, bitsandbytes
  • Dataset Preparation: reference/dataset_preparation.md - Formats, cleaning, quality checks
  • Training Recipes: reference/training_recipes.md - End-to-end fine-tuning workflows

Common Mistakes to Avoid

1. Wrong Target Modules

# WRONG: Generic module names (won't match)
lora_config = LoraConfig(target_modules=["attention"])

# CORRECT: Find actual module names first
print(model)  # See architecture
# Or: [name for name, _ in model.named_modules() if "proj" in name]
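
The lookup in the comment above can be turned into a small helper that reduces full module paths to the short names `target_modules` expects. This is a sketch: `candidate_target_modules` is a hypothetical function, and the paths are illustrative strings in the style of `model.named_modules()` for a Llama-like model.

```python
def candidate_target_modules(module_names, suffix="proj"):
    """Collect unique leaf names such as 'q_proj' from dotted module paths."""
    leaves = {name.rsplit(".", 1)[-1] for name in module_names}
    return sorted(leaf for leaf in leaves if leaf.endswith(suffix))

# Illustrative paths in the style of model.named_modules() for a Llama-like model.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.input_layernorm",
]
print(candidate_target_modules(names))  # ['gate_proj', 'q_proj', 'v_proj']
```

With a loaded model, pass `(name for name, _ in model.named_modules())` instead of the hard-coded list. Recent PEFT releases also accept `target_modules="all-linear"` as a shortcut.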

2. Not Setting pad_token

# WRONG: Many models don't have pad_token
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Padding a batch then raises ValueError (the tokenizer has no padding token)

# CORRECT: Set pad_token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

3. Training on Prompt (not just response)

# WRONG: Model learns to generate the prompt too
# This wastes training signal

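# CORRECT: Mask the prompt with a completion-only data collator
# (DataCollatorForCompletionOnlyLM ships with older TRL releases; newer ones
# configure completion-only loss through SFTConfig instead)
from trl import DataCollatorForCompletionOnlyLM
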
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    formatting_func=format_prompt,
    data_collator=DataCollatorForCompletionOnlyLM(
        response_template="### Response:",
        tokenizer=tokenizer
    ),
)
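
Conceptually, completion-only training sets the labels of every prompt token to -100 so the loss skips them. The sketch below illustrates that masking on toy token ids. `mask_prompt_labels` is a hypothetical helper, not the TRL implementation, and the ids are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def mask_prompt_labels(input_ids, response_template_ids):
    """Return labels where everything up to and including the template is ignored."""
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            response_start = start + n
            return [IGNORE_INDEX] * response_start + input_ids[response_start:]
    # Template not found: ignore the whole example rather than train on the prompt
    return [IGNORE_INDEX] * len(input_ids)

# Toy token ids: [prompt..., template (7, 8), response...]
labels = mask_prompt_labels([1, 2, 3, 7, 8, 40, 41, 2], response_template_ids=[7, 8])
print(labels)  # [-100, -100, -100, -100, -100, 40, 41, 2]
```

The template-not-found branch matters in practice: if the tokenizer splits "### Response:" differently in context than in isolation, every example can end up fully masked and the model learns nothing.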

4. QLoRA without prepare_model_for_kbit_training

# WRONG: Direct LoRA on quantized model
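model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)  # bnb_config as in Pattern 2
model = get_peft_model(model, lora_config)  # Skips norm upcasting and input-grad setup; training can be unstable or fail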

# CORRECT: Prepare model first
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

Teaching Mode

LoRA Intuition

Full Fine-Tuning:
  Original weights W (huge matrix, e.g., 4096×4096 = 16M params)
  Update: W_new = W + ΔW (16M params to train!)

LoRA:
  Keep W frozen (no training)
  Add: W_new = W + A × B
  Where A is 4096×16, B is 16×4096
  Total trainable: 4096×16 + 16×4096 = 131K params!

  ┌──────────────┐
  │  W (frozen)  │  4096 × 4096
  │              │
  └──────┬───────┘
         │
    ┌────┴────┐
    │ + A × B │  A: 4096×r  B: r×4096  (r=16)
    └─────────┘
    Trainable: 0.8% of original!

  Intuition: Most model knowledge is in W.
  LoRA learns a small "correction" on top.
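
The same idea can be run on tiny matrices. This sketch uses d=3 and r=1 in place of 4096 and 16, and pure-Python lists so it runs without dependencies. It checks that the adapter path (base output plus a scaled low-rank correction) gives the same result as folding A × B into W, which is what merging an adapter does. All values and helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Tiny stand-in shapes: d=3, r=1 (the text uses d=4096, r=16).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]   # frozen, d x d
A = [[0.5], [1.0], [-1.0]]                                  # trainable, d x r
B = [[2.0, 0.0, 1.0]]                                       # trainable, r x d
scale = 2.0                                                 # lora_alpha / r
x = [[1.0, 2.0, 3.0]]                                       # one input row

# Adapter path: x @ W + scale * (x @ A) @ B, never materializing the d x d update
base = matmul(x, W)
low_rank = matmul(matmul(x, A), B)
adapter_out = add_scaled(base, low_rank, scale)

# Merged path: fold the update into W once, then use a plain matmul
W_merged = add_scaled(W, matmul(A, B), scale)
merged_out = matmul(x, W_merged)

print(adapter_out)  # [[8.0, 2.0, 4.0]]
print(merged_out)   # [[8.0, 2.0, 4.0]]
```

Because both paths agree, a merged model needs no adapter code at inference time, while the unmerged form keeps W untouched so adapters can be swapped.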

Quantization Intuition

Full Precision (FP32):  32 bits per number  → 28GB for 7B model
Half Precision (FP16):  16 bits per number  → 14GB for 7B model
8-bit (INT8):           8 bits per number   →  7GB for 7B model
4-bit (NF4):            4 bits per number   → 3.5GB for 7B model

Trade-off: Less precision = less VRAM, but slightly less accurate

NF4 (NormalFloat4): Optimized for normally-distributed weights
  → Better than regular INT4 for neural networks
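
The precision trade-off can be seen with a toy quantizer. The sketch below uses plain symmetric absmax rounding on one small block of weights. It is deliberately simpler than NF4, GPTQ, or AWQ and is not how any of those libraries work internally; the helper names and sample weights are made up.

```python
def absmax_quantize(values, bits=4):
    """Symmetric round-to-nearest quantization of one block of floats."""
    levels = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    # Fall back to 1.0 for an all-zero block to avoid dividing by zero
    scale = (max(abs(v) for v in values) / levels) or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.7, -0.35, 0.1, 0.02]
errors = {}
for bits in (8, 4):
    codes, scale = absmax_quantize(weights, bits)
    restored = dequantize(codes, scale)
    errors[bits] = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max error={errors[bits]:.4f}")
```

Fewer bits mean fewer representable levels per block, so the worst-case rounding error grows. Schemes like NF4 reduce that error by placing the levels where normally distributed weights are densest.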

Cross-References

  • PyTorch basics: ../pytorch-mastery/SKILL.md - Training loops, GPU memory
  • Transformers: ../transformers-llm/SKILL.md - HuggingFace ecosystem
  • RAG integration: ../rag-retrieval/SKILL.md - Fine-tuned models in RAG
  • Deep learning fundamentals: ../deep-learning-core/SKILL.md - Loss, optimizers