levy-n

transformers-llm

Implements Transformer models and LLM workflows. Covers attention mechanism, BERT fine-tuning, HuggingFace Transformers library (Tokenizer, Trainer, Pipeline), and LLM ecosystem (GPT, Claude, Gemini, Ollama). Use when fine-tuning language models, using HuggingFace, calling LLM APIs, or when user mentions 'transformer', 'attention', 'BERT', 'HuggingFace', 'tokenizer', 'fine-tuning', 'LLM', 'GPT', 'Claude', 'Gemini', 'prompt engineering', 'zero-shot', or 'few-shot learning'.

Install

npx skillscat add levy-n/claude-useful-skills/transformers-llm

Install via the SkillsCat registry.

SKILL.md

Transformers & LLMs - Modern NLP

Transformers, BERT, HuggingFace, and the LLM ecosystem.

Quick Start - HuggingFace Classification

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# Load data (IMDB as an example: "text" and "label" columns, train/test splits)
dataset = load_dataset("imdb")

# Tokenize data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

When This Skill Activates

Use this skill when:

  • Fine-tuning BERT, RoBERTa, or other transformers
  • Using HuggingFace Transformers library
  • Calling LLM APIs (OpenAI, Anthropic, Gemini)
  • Doing zero-shot or few-shot classification
  • Understanding attention mechanisms
  • Choosing between LLM providers

Core Patterns

Pattern 1: Transformer Architecture Variants

┌─────────────────────────────────────────────────────────────┐
│                 TRANSFORMER VARIANTS                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Encoder-Only (BERT, RoBERTa)                               │
│  ├── Bidirectional (sees full context)                      │
│  ├── Best for: Classification, NER, Q&A                     │
│  └── Pre-training: Masked Language Model (MLM)              │
│                                                              │
│  Decoder-Only (GPT, LLaMA, Claude)                          │
│  ├── Autoregressive (left-to-right)                         │
│  ├── Best for: Text generation, chat, code                  │
│  └── Pre-training: Next token prediction                    │
│                                                              │
│  Encoder-Decoder (T5, BART)                                 │
│  ├── Full attention in encoder, causal in decoder           │
│  ├── Best for: Translation, summarization                   │
│  └── Pre-training: Span corruption + reconstruction         │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Pattern 2: HuggingFace Pipeline (Quick Inference)

from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Zero-shot classification
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a tutorial about PyTorch",
    candidate_labels=["education", "politics", "business"]
)

# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="France is a country in Europe. Paris is its capital."
)

# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)

Pattern 3: Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer("Hello, how are you?")
# {'input_ids': [101, 7592, ...], 'attention_mask': [1, 1, ...]}

# Batch tokenization with padding
batch = tokenizer(
    ["Hello", "How are you doing today?"],
    padding=True,           # Pad to longest in batch
    truncation=True,        # Truncate if > max_length
    max_length=128,
    return_tensors="pt"     # Return PyTorch tensors
)

# Decode back to text
text = tokenizer.decode(tokens["input_ids"])
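
What `padding=True` and `truncation=True` do to a batch can be sketched without the library. The helper below is hypothetical (`pad_and_truncate` is not a HuggingFace function), the token IDs are made up apart from those shown above, and the pad ID of 0 is an assumption that matches `[PAD]` in `bert-base-uncased`. A real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of padding=True, truncation=True on lists of token IDs."""
    # Truncate every sequence to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Pad to the longest sequence that remains in the batch.
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate(
    [[101, 7592, 102], [101, 2129, 2024, 2017, 2725, 102]],
    max_length=4,
)
```

Every row of `batch["input_ids"]` now has the same length, so the batch can be stacked into a single tensor, and `attention_mask` records which positions are padding.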

Pattern 4: Fine-Tuning BERT

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset

# Load data
dataset = load_dataset("imdb")

# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

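def tokenize(batch):
    # Pad to a fixed length: padding=True would pad each map() batch to its own longest example
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)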

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column("label", "labels")
dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Training arguments
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()
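
As configured above, evaluation reports the loss but no task metric. A minimal sketch of an accuracy function follows, assuming the usual convention that `Trainer` passes a `(logits, labels)` pair to the callable given as `compute_metrics`. It is written in plain Python so it accepts lists as well as arrays; the sample logits at the end are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        scores = list(row)
        predicted = scores.index(max(scores))  # index of the highest logit
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}


# Made-up logits for three examples with two classes each.
example = compute_metrics(([[0.1, 2.0], [1.5, -0.3], [0.2, 0.4]], [1, 0, 0]))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`.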

Pattern 5: LLM API Calls

# OpenAI
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Anthropic (Claude)
import anthropic

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain attention mechanism."}
    ]
)

print(response.content[0].text)

# Google Gemini
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")

response = model.generate_content("Explain BERT architecture.")
print(response.text)
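
One difference between the OpenAI and Anthropic calls above is where a system prompt goes: OpenAI takes it as a message with the `system` role, while the Anthropic Messages API takes it as a top-level `system` field. The sketch below builds the keyword arguments for either call without sending anything. `build_chat_request` is a hypothetical helper, not part of either SDK.

```python
def build_chat_request(provider, model, user_text, system_text=None, max_tokens=500):
    """Return keyword arguments for the chat call of the given provider."""
    messages = [{"role": "user", "content": user_text}]
    if provider == "openai":
        # OpenAI chat completions take the system prompt as the first message.
        if system_text:
            messages.insert(0, {"role": "system", "content": system_text})
        return {"model": model, "messages": messages, "max_tokens": max_tokens}
    if provider == "anthropic":
        # The Anthropic Messages API takes the system prompt as a top-level field.
        request = {"model": model, "messages": messages, "max_tokens": max_tokens}
        if system_text:
            request["system"] = system_text
        return request
    raise ValueError(f"unsupported provider: {provider}")


openai_kwargs = build_chat_request(
    "openai", "gpt-4", "Explain transformers.", "You are a helpful assistant."
)
anthropic_kwargs = build_chat_request(
    "anthropic", "claude-3-opus-20240229", "Explain transformers.",
    "You are a helpful assistant.",
)
```

The results would be passed as `client.chat.completions.create(**openai_kwargs)` or `client.messages.create(**anthropic_kwargs)`.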

Pattern 6: Prompt Engineering (PTCF Framework)

P - Persona:  "You are an expert ML engineer..."
T - Task:     "Analyze this code for bugs..."
C - Context:  "The code is a PyTorch training loop..."
F - Format:   "Return as JSON with fields: bug, severity, fix"

system_prompt = """You are an expert ML engineer specializing in PyTorch.
Your task is to review code for bugs and performance issues.
Provide responses in this JSON format:
{
    "issues": [
        {"bug": "description", "severity": "high/medium/low", "fix": "suggestion"}
    ]
}"""

user_prompt = """Review this code:
```python
model.train()
for epoch in range(epochs):
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
```"""
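
The four PTCF parts can also be assembled programmatically. This is a minimal sketch: `build_ptcf_prompt` and its `Task:`, `Context:`, and `Format:` line prefixes are illustrative choices, not part of the framework itself.

```python
def build_ptcf_prompt(persona, task, context="", output_format=""):
    """Join the PTCF parts into one prompt, skipping any part left empty."""
    sections = [
        ("", persona),
        ("Task: ", task),
        ("Context: ", context),
        ("Format: ", output_format),
    ]
    lines = [label + text.strip() for label, text in sections if text.strip()]
    return "\n".join(lines)


prompt = build_ptcf_prompt(
    persona="You are an expert ML engineer.",
    task="Analyze this code for bugs.",
    context="The code is a PyTorch training loop.",
    output_format="Return as JSON with fields: bug, severity, fix",
)
```

Keeping the parts as separate arguments makes it easy to vary one of them, for example the output format, while holding the others fixed.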

Pattern 7: Zero-Shot vs Few-Shot

# Zero-shot: No examples
prompt_zero = """Classify the sentiment of this review as POSITIVE or NEGATIVE.
Review: "The food was terrible and the service was slow."
Sentiment:"""

# Few-shot: Provide examples
prompt_few = """Classify the sentiment of reviews as POSITIVE or NEGATIVE.

Review: "Great food, amazing service!"
Sentiment: POSITIVE

Review: "Worst experience ever, never going back."
Sentiment: NEGATIVE

Review: "The food was terrible and the service was slow."
Sentiment:"""

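# Chain-of-Thought: ask for step-by-step reasoning before the answer.
# The numbered steps below illustrate that reasoning; a prompt sent to a
# model would normally stop after "Let's think step by step:".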
prompt_cot = """Classify the sentiment and explain your reasoning.

Review: "The food was terrible and the service was slow."

Let's think step by step:
1. "terrible" is a strongly negative word
2. "slow" is also negative in service context
3. No positive words present

Sentiment: NEGATIVE"""
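
The zero-shot and few-shot prompts differ only in how many labeled examples precede the query, so both can come from one builder. The function name and signature below are assumptions made for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Build a classification prompt; an empty examples list gives zero-shot."""
    blocks = [instruction]
    for review, sentiment in examples:
        blocks.append(f'Review: "{review}"\nSentiment: {sentiment}')
    # Leave the final label blank for the model to complete.
    blocks.append(f'Review: "{query}"\nSentiment:')
    return "\n\n".join(blocks)


few_shot = build_few_shot_prompt(
    "Classify the sentiment of reviews as POSITIVE or NEGATIVE.",
    [
        ("Great food, amazing service!", "POSITIVE"),
        ("Worst experience ever, never going back.", "NEGATIVE"),
    ],
    "The food was terrible and the service was slow.",
)
zero_shot = build_few_shot_prompt(
    "Classify the sentiment of this review as POSITIVE or NEGATIVE.",
    [],
    "The food was terrible and the service was slow.",
)
```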

Pattern 8: Model Selection Guide

Model        Best For                          Context Length (tokens)
GPT-4        General tasks, reasoning          128k
Claude 3     Long documents, coding, safety    200k
Gemini Pro   Multimodal, Google integration    32k
Llama 3      Open source, local deployment     8k
Mistral      Fast, efficient, local            32k

Decision Tree:
├── Need best quality, cost is not an issue → GPT-4, Claude
├── Need to run locally/private → Llama, Mistral
├── Need long context → Claude (200k), Gemini
├── Need multimodal (images) → GPT-4V, Gemini Vision
└── Need code generation → Claude, GPT-4, DeepSeek
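
The decision tree can be written as a small function. The tree itself does not rank its branches, so the priority order below (local first, then long context, multimodal, code) is an assumption, and `suggest_models` is a hypothetical name.

```python
def suggest_models(local=False, long_context=False, multimodal=False, code=False):
    """Return candidate models for the first requirement that applies."""
    if local:
        return ["Llama", "Mistral"]
    if long_context:
        return ["Claude", "Gemini"]
    if multimodal:
        return ["GPT-4V", "Gemini Vision"]
    if code:
        return ["Claude", "GPT-4", "DeepSeek"]
    # Default branch: best quality, cost is not an issue.
    return ["GPT-4", "Claude"]


candidates = suggest_models(long_context=True)
```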

Reference Navigation

For detailed content, see:

  • Transformer Architecture: reference/transformer_architecture.md - Self-attention, Positional encoding
  • BERT Fundamentals: reference/bert_fundamentals.md - MLM, NSP, Fine-tuning theory
  • HuggingFace Guide: reference/huggingface_guide.md - Tokenizers, Trainer, Datasets (17 sections)
  • LLM Ecosystem: reference/llm_ecosystem.md - GPT, Claude, Gemini, Ollama, pricing
  • Prompt Engineering: reference/prompt_engineering.md - PTCF, Zero/Few-shot, Chain-of-Thought

Common Mistakes to Avoid

1. Not Using Padding and Truncation

# WRONG: Variable-length sequences cannot be stacked into one batch tensor
tokens = tokenizer(texts)

# CORRECT: Pad and truncate
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512)

2. Wrong Number of Labels

# WRONG: Model expects different number of labels
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased"  # Default num_labels=2
)
# But your data has 5 classes!

# CORRECT: Specify num_labels
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5
)
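
One way to avoid a mismatch is to derive the label count from the data instead of typing it by hand. The sketch below is an assumption-laden helper (`label_config` is a hypothetical name and the label strings are made up) that also builds the `id2label` and `label2id` mappings, which `from_pretrained` accepts as keyword arguments.

```python
def label_config(label_names):
    """Derive num_labels and the label mappings from the labels in a dataset."""
    unique = sorted(set(label_names))
    id2label = dict(enumerate(unique))
    label2id = {name: idx for idx, name in id2label.items()}
    return {"num_labels": len(unique), "id2label": id2label, "label2id": label2id}


# Made-up labels for illustration; duplicates are expected in real data.
config = label_config(["sports", "politics", "tech", "health", "travel", "tech"])
```

The result could then be unpacked into the model call, for example `AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", **config)`.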

3. Forgetting attention_mask

# WRONG: Model treats padding as real tokens
outputs = model(input_ids)

# CORRECT: Pass attention mask
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
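
Why the mask matters can be shown with a masked softmax: positions marked 0 are pushed to negative infinity before normalization, so they receive zero attention weight. This is a sketch of the idea in plain Python with made-up scores, not the library's internal code, and it assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores, giving zero weight where attention_mask is 0."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# The last two positions are padding and happen to have the highest raw scores.
weights = masked_softmax([2.0, 1.0, 3.0, 3.0], [1, 1, 0, 0])
```

Without the mask, the two padding positions would have taken most of the attention weight here.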

4. Using Wrong Tokenizer

# WRONG: Tokenizer/model mismatch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
# Different vocabularies!

# CORRECT: Same model name
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")

Teaching Mode

When explaining transformers:

Attention Intuition

Query: "What am I looking for?"
Key:   "What do I contain?"
Value: "What will I contribute?"

Example sentence: "The cat sat on the mat"

When processing "sat":
- Query: "What's happening?"
- Keys of all words compete for attention
- "cat" gets high attention (who is doing the action)
- Values of high-attention words contribute to output

Self-Attention Visual

Input: "The cat sat"

Attention Matrix (who attends to whom):
           The   cat   sat
    The   [0.1   0.3   0.6]  ← "The" looks mostly at "sat"
    cat   [0.2   0.1   0.7]  ← "cat" looks mostly at "sat"
    sat   [0.3   0.5   0.2]  ← "sat" looks at "cat" (subject)

Each word creates a weighted average of all value vectors
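
The weighted average can be made concrete with scaled dot-product attention in plain Python. This is a minimal sketch: the 2-dimensional vectors are made up, and the same vectors stand in for queries, keys, and values, whereas a real layer derives each from learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # The output is the weighted average of the value vectors.
        outputs.append([
            sum(w * v[dim] for w, v in zip(row, values))
            for dim in range(len(values[0]))
        ])
    return outputs, weights


# Made-up vectors for the tokens "The", "cat", "sat".
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(vectors, vectors, vectors)
```

Each row of `weights` corresponds to one row of the attention matrix, and each entry of `outputs` is the new representation of one token.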

BERT vs GPT

BERT (Encoder):
"The [MASK] sat on the mat"
  ←───────────────────────→
  Sees entire sentence bidirectionally

GPT (Decoder):
"The cat sat"
  →→→→→→→→→
  Only sees left context (autoregressive)
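
The two diagrams differ only in which positions each token may attend to, which can be expressed as a mask matrix. A minimal sketch follows; the function names are illustrative, with 1 meaning "may attend" and 0 meaning "blocked".

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token sees every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: token i sees positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Masks for the three tokens of "The cat sat".
bert_mask = bidirectional_mask(3)
gpt_mask = causal_mask(3)
```

In `gpt_mask`, the row for "The" allows only itself, while the row for "sat" allows all three tokens, which is what makes left-to-right generation possible.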