Implements Transformer models and LLM workflows. Covers the attention mechanism, BERT fine-tuning, the HuggingFace Transformers library (Tokenizer, Trainer, Pipeline), and the LLM ecosystem (GPT, Claude, Gemini, Ollama). Use when fine-tuning language models, using HuggingFace, calling LLM APIs, or when the user mentions 'transformer', 'attention', 'BERT', 'HuggingFace', 'tokenizer', 'fine-tuning', 'LLM', 'GPT', 'Claude', 'Gemini', 'prompt engineering', 'zero-shot', or 'few-shot learning'.
Install via the SkillsCat registry:
npx skillscat add levy-n/claude-useful-skills/transformers-llm
Transformers & LLMs - Modern NLP
Transformers, BERT, HuggingFace, and the LLM ecosystem.
Quick Start - HuggingFace Classification
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
# Load a dataset with "text" and "label" columns
# (IMDB is used here to match Pattern 4; substitute your own data)
dataset = load_dataset("imdb")
# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)
# Tokenize data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Train
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
trainer.train()
When This Skill Activates
Use this skill when:
- Fine-tuning BERT, RoBERTa, or other transformers
- Using HuggingFace Transformers library
- Calling LLM APIs (OpenAI, Anthropic, Gemini)
- Doing zero-shot or few-shot classification
- Understanding attention mechanisms
- Choosing between LLM providers
Core Patterns
Pattern 1: Transformer Architecture Variants
┌─────────────────────────────────────────────────────────────┐
│ TRANSFORMER VARIANTS │
├─────────────────────────────────────────────────────────────┤
│ │
│ Encoder-Only (BERT, RoBERTa) │
│ ├── Bidirectional (sees full context) │
│ ├── Best for: Classification, NER, Q&A │
│ └── Pre-training: Masked Language Model (MLM) │
│ │
│ Decoder-Only (GPT, LLaMA, Claude) │
│ ├── Autoregressive (left-to-right) │
│ ├── Best for: Text generation, chat, code │
│ └── Pre-training: Next token prediction │
│ │
│ Encoder-Decoder (T5, BART) │
│ ├── Full attention in encoder, causal in decoder │
│ ├── Best for: Translation, summarization │
│ └── Pre-training: Span corruption + reconstruction │
│ │
└─────────────────────────────────────────────────────────────┘
Pattern 2: HuggingFace Pipeline (Quick Inference)
from transformers import pipeline
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Zero-shot classification
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a tutorial about PyTorch",
    candidate_labels=["education", "politics", "business"]
)
# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="France is a country in Europe. Paris is its capital."
)
# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)
Pattern 3: Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Basic tokenization
tokens = tokenizer("Hello, how are you?")
# {'input_ids': [101, 7592, ...], 'attention_mask': [1, 1, ...]}
# Batch tokenization with padding
batch = tokenizer(
    ["Hello", "How are you doing today?"],
    padding=True,  # Pad to longest in batch
    truncation=True,  # Truncate if > max_length
    max_length=128,
    return_tensors="pt"  # Return PyTorch tensors
)
# Decode back to text (special tokens such as [CLS] and [SEP] are included
# unless skip_special_tokens=True is passed)
text = tokenizer.decode(tokens["input_ids"])
Pattern 4: Fine-Tuning BERT
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
# Load data
dataset = load_dataset("imdb")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize(batch):
    # Pad to a fixed length: inside a batched map, padding=True only pads to the
    # longest item in each map batch, so lengths could differ across batches
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)
dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column("label", "labels")
dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
# Model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Training arguments
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
)
# Train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
Pattern 5: LLM API Calls
# Model IDs below are examples; check each provider's current model list
# OpenAI
from openai import OpenAI
client = OpenAI()  # Uses OPENAI_API_KEY env var
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
# Anthropic (Claude)
import anthropic
client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain attention mechanism."}
    ]
)
print(response.content[0].text)
# Google Gemini
import os
import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Explain BERT architecture.")
print(response.text)
Pattern 6: Prompt Engineering (PTCF Framework)
P - Persona: "You are an expert ML engineer..."
T - Task: "Analyze this code for bugs..."
C - Context: "The code is a PyTorch training loop..."
F - Format: "Return as JSON with fields: bug, severity, fix"
system_prompt = """You are an expert ML engineer specializing in PyTorch.
Your task is to review code for bugs and performance issues.
Provide responses in this JSON format:
{
  "issues": [
    {"bug": "description", "severity": "high/medium/low", "fix": "suggestion"}
  ]
}"""
user_prompt = """Review this code:
```python
model.train()
for epoch in range(epochs):
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
```"""
Pattern 7: Zero-Shot vs Few-Shot
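The zero-shot and few-shot prompts in this pattern share one template and differ only in whether labeled examples precede the query. As a rough sketch, a small helper can assemble either form. The name `build_prompt` and its parameters are illustrative assumptions, not part of any library.

```python
def build_prompt(query, examples=(), labels=("POSITIVE", "NEGATIVE")):
    """Build a zero-shot (no examples) or few-shot sentiment prompt.

    `examples` is a sequence of (review_text, label) pairs. This helper is
    a hypothetical illustration, not a library function.
    """
    noun = "reviews" if examples else "this review"
    lines = [f"Classify the sentiment of {noun} as {' or '.join(labels)}."]
    for text, label in examples:
        lines.append(f'Review: "{text}"')
        lines.append(f"Sentiment: {label}")
    # The query goes last, with the label left blank for the model to fill in
    lines.append(f'Review: "{query}"')
    lines.append("Sentiment:")
    return "\n".join(lines)


query = "The food was terrible and the service was slow."
zero_shot = build_prompt(query)
few_shot = build_prompt(
    query,
    examples=[
        ("Great food, amazing service!", "POSITIVE"),
        ("Worst experience ever, never going back.", "NEGATIVE"),
    ],
)
print(few_shot)
```

Keeping the examples in a list makes it easy to vary how many are included and to compare zero-shot against few-shot results on the same queries.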
# Zero-shot: No examples
prompt_zero = """Classify the sentiment of this review as POSITIVE or NEGATIVE.
Review: "The food was terrible and the service was slow."
Sentiment:"""
# Few-shot: Provide examples
prompt_few = """Classify the sentiment of reviews as POSITIVE or NEGATIVE.
Review: "Great food, amazing service!"
Sentiment: POSITIVE
Review: "Worst experience ever, never going back."
Sentiment: NEGATIVE
Review: "The food was terrible and the service was slow."
Sentiment:"""
# Chain-of-Thought: Ask for step-by-step reasoning
# (the numbered steps illustrate the reasoning pattern being elicited)
prompt_cot = """Classify the sentiment and explain your reasoning.
Review: "The food was terrible and the service was slow."
Let's think step by step:
1. "terrible" is a strongly negative word
2. "slow" is also negative in service context
3. No positive words present
Sentiment: NEGATIVE"""
Pattern 8: Model Selection Guide
| Model | Best For | Context Length (tokens) |
|---|---|---|
| GPT-4 | General tasks, reasoning | 128k |
| Claude 3 | Long documents, coding, safety | 200k |
| Gemini Pro | Multimodal, Google integration | 32k |
| Llama 3 | Open source, local deployment | 8k |
| Mistral | Fast, efficient, local | 32k |
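One practical use of the context-length column is checking whether a prompt fits before a model is chosen. The sketch below copies the table's figures into a dictionary and assumes that the prompt and the generated reply share the same window. The function name and the default reply budget are illustrative assumptions, and the limits are approximate and version-dependent.

```python
# Context windows in tokens, copied from the table above ("k" read as 1,000).
# These are approximate and version-dependent, so verify them before relying on them.
CONTEXT_WINDOW = {
    "GPT-4": 128_000,
    "Claude 3": 200_000,
    "Gemini Pro": 32_000,
    "Llama 3": 8_000,
    "Mistral": 32_000,
}


def models_that_fit(prompt_tokens, max_output_tokens=1_000):
    """Return the models whose window holds the prompt plus the reply.

    Hypothetical helper: assumes input and output tokens share one window.
    """
    needed = prompt_tokens + max_output_tokens
    return [name for name, window in CONTEXT_WINDOW.items() if window >= needed]


print(models_that_fit(50_000))  # ['GPT-4', 'Claude 3']
```

The token count itself should come from the target model's own tokenizer, since different models split the same text into different numbers of tokens.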
Decision Tree:
├── Need best quality, cost not an issue → GPT-4, Claude
├── Need to run locally/private → Llama, Mistral
├── Need long context → Claude (200k), Gemini
├── Need multimodal (images) → GPT-4V, Gemini Vision
└── Need code generation → Claude, GPT-4, DeepSeek
Reference Navigation
For detailed content, see:
- Transformer Architecture: reference/transformer_architecture.md - Self-attention, Positional encoding
- BERT Fundamentals: reference/bert_fundamentals.md - MLM, NSP, Fine-tuning theory
- HuggingFace Guide: reference/huggingface_guide.md - Tokenizers, Trainer, Datasets (17 sections)
- LLM Ecosystem: reference/llm_ecosystem.md - GPT, Claude, Gemini, Ollama, pricing
- Prompt Engineering: reference/prompt_engineering.md - PTCF, Zero/Few-shot, Chain-of-Thought
Common Mistakes to Avoid
1. Not Using Padding and Truncation
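What `padding=True` and `truncation=True` do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: it assumes a pad ID of 0, uses illustrative token IDs, and cuts sequences bluntly, whereas a real tokenizer keeps the closing special token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Mimic padding=True, truncation=True on lists of token IDs.

    Simplified sketch: pad_id=0 is an assumption, and truncation here
    simply drops trailing tokens.
    """
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_and_truncate(
    [[101, 7592, 102], [101, 2129, 2024, 2017, 102]], max_length=4
)
print(ids)   # [[101, 7592, 102, 0], [101, 2129, 2024, 2017]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

After this step every row has the same length, so the batch can be stacked into one rectangular tensor, and the mask records which positions are padding.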
# WRONG: Variable-length inputs cannot be stacked into one batch tensor
tokens = tokenizer(texts)
# CORRECT: Pad and truncate
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512)
2. Wrong Number of Labels
# WRONG: Model expects different number of labels
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased"  # Default num_labels=2
)
# But your data has 5 classes!
# CORRECT: Specify num_labels
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5
)
3. Forgetting attention_mask
# WRONG: Model treats padding as real tokens
outputs = model(input_ids)
# CORRECT: Pass attention mask
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
4. Using Wrong Tokenizer
# WRONG: Tokenizer/model mismatch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
# Different vocabularies!
# CORRECT: Same model name
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
Teaching Mode
When explaining transformers:
Attention Intuition
Query: "What am I looking for?"
Key: "What do I contain?"
Value: "What will I contribute?"
Example sentence: "The cat sat on the mat"
When processing "sat":
- Query: "What's happening?"
- Keys of all words compete for attention
- "cat" gets high attention (who is doing the action)
- Values of high-attention words contribute to output
Self-Attention Visual
Input: "The cat sat"
Attention Matrix (who attends to whom):
       The   cat   sat
The   [0.1   0.3   0.6]  ← "The" looks mostly at "sat"
cat   [0.2   0.1   0.7]  ← "cat" looks mostly at "sat"
sat   [0.3   0.5   0.2]  ← "sat" looks at "cat" (subject)
Each word creates a weighted average of all value vectors
BERT vs GPT
BERT (Encoder):
"The [MASK] sat on the mat"
←───────────────────────→
Sees entire sentence bidirectionally
GPT (Decoder):
"The cat sat"
→→→→→→→→→
Only sees left context (autoregressive)
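The query/key/value intuition and the BERT vs GPT contrast can be tied together in a minimal sketch of scaled dot-product attention, softmax(QKᵀ / √d_k)·V, written in plain Python for readability. The toy vectors are made up for illustration, and real models derive queries, keys, and values from learned projections and use batched tensor operations. The `causal` flag is this sketch's own parameter for switching between bidirectional (BERT-style) and left-to-right (GPT-style) attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights). With causal=True, position i may attend
    only to positions 0..i, as in a decoder-only model.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        row = softmax(scores)  # one row of the attention matrix
        weights.append(row)
        # Each output is a weighted average of all value vectors
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
        )
    return outputs, weights


# Made-up 2-d vectors standing in for "The", "cat", "sat"
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, full_w = attention(x, x, x)                  # bidirectional
_, causal_w = attention(x, x, x, causal=True)   # left-to-right
print(causal_w[0])  # [1.0, 0.0, 0.0]
```

In the bidirectional case every row spreads weight over all three positions, while in the causal case the first token can attend only to itself, matching the arrows in the two diagrams.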