Explains neural network fundamentals: the Three Pillars (Model, Loss, Optimizer), backpropagation, gradient descent variants (SGD, Adam), regularization (Dropout, BatchNorm), and MLP architecture design. Use when learning how neural networks work, debugging training issues, or when user asks about 'backpropagation', 'vanishing gradients', 'learning rate', 'loss function', 'overfitting', 'underfitting', 'activation functions', 'why isn't my model learning', 'gradient descent', 'Adam', 'Dropout', 'BatchNorm', 'autoencoder', 'denoising autoencoder', or 'latent space'.
Install via the SkillsCat registry:
npx skillscat add levy-n/claude-useful-skills/deep-learning-core
Deep Learning Core - Neural Network Fundamentals
Deep Learning fundamentals: the model, the loss, and optimization.
The Three Pillars of Learning
┌────────────────────────────────────────────────────────────────┐
│ 1. ASSUMPTION (Model) - What shape does the solution take?     │
│    → The function family (Linear, Polynomial, Neural Network)  │
│                                                                │
│ 2. CRITERION (Loss) - What counts as "good"?                   │
│    → The loss function (MSE, Cross-Entropy, etc.)              │
│                                                                │
│ 3. SEARCH (Optimization) - How do we find the solution?        │
│    → The optimization algorithm (Gradient Descent, Adam)       │
└────────────────────────────────────────────────────────────────┘
Quick Start - Training Loop
import torch
import torch.nn as nn
import torch.optim as optim
# Placeholder sizes and random data so the loop runs as-is; swap in your own
input_dim, output_dim, num_epochs = 10, 1, 100
X_train = torch.randn(256, input_dim)
y_train = torch.randn(256, output_dim)
# 1. Model (Assumption)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, output_dim)
)
# 2. Loss (Criterion)
criterion = nn.MSELoss()  # Regression
# criterion = nn.CrossEntropyLoss()  # Classification
# 3. Optimizer (Search)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop (full-batch: one update per epoch)
for epoch in range(num_epochs):
    optimizer.zero_grad()               # Clear gradients
    outputs = model(X_train)            # Forward pass
    loss = criterion(outputs, y_train)  # Compute loss
    loss.backward()                     # Backward pass (compute gradients)
    optimizer.step()                    # Update weights
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
When This Skill Activates
Use this skill when:
- Learning how neural networks work
- Understanding backpropagation
- Debugging training issues (loss not decreasing, NaN)
- Choosing loss functions or optimizers
- Deciding on architecture (layers, neurons)
- Dealing with overfitting/underfitting
Core Patterns
Pattern 1: Loss Function Selection
| Problem | Loss Function | Output Activation |
|---|---|---|
| Regression | MSELoss() | None (linear) |
| Binary Classification | BCEWithLogitsLoss() | None (logits) |
| Binary (with sigmoid) | BCELoss() | Sigmoid |
| Multi-class | CrossEntropyLoss() | None (logits) |
| Multi-label | BCEWithLogitsLoss() | None (per class) |
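The "None (logits)" entries reflect that these losses apply the sigmoid or softmax themselves. The following is a minimal pure-Python sketch (not PyTorch's implementation) of what a cross-entropy-from-logits loss computes, and of what happens when already-softmaxed probabilities are passed in by mistake. The function names and the example logits are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Compute -log(softmax(logits)[target]) using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [4.0, 1.0, -2.0]  # raw scores for 3 classes; class 0 is the true class
target = 0

# Intended use: raw logits go straight into the loss
loss_correct = cross_entropy_from_logits(logits, target)

# Mistake: probabilities passed where logits are expected (softmax applied twice)
loss_double = cross_entropy_from_logits(softmax(logits), target)

print(f"logits in:        {loss_correct:.4f}")
print(f"probabilities in: {loss_double:.4f}")
```

Because probabilities lie in [0, 1], the double-softmax loss for three classes can never fall below log(1 + 2/e), roughly 0.55, even for a confident and correct prediction, which also shrinks the gradients the model receives.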
# Regression
criterion = nn.MSELoss()
# Binary classification (RECOMMENDED)
criterion = nn.BCEWithLogitsLoss()  # Applies sigmoid internally
# Multi-class classification
criterion = nn.CrossEntropyLoss()  # Applies log-softmax internally
# ⚠️ Don't add softmax to the model output!
Pattern 2: Optimizer Selection
| Optimizer | When to Use |
|---|---|
| Adam | Default choice, "just works" |
| SGD + Momentum | Fine-tuning, or when Adam generalizes poorly |
| AdamW | When using weight decay (decoupled from the gradient update) |
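To make the difference between the optimizers concrete, here is a minimal single-parameter sketch of the plain gradient descent update and the standard Adam update with bias correction. It is an illustration, not the `torch.optim` implementation; the toy loss f(w) = (w - 3)^2 and all function names are assumptions chosen for the example.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def sgd(w, lr=0.1, steps=500):
    """Plain gradient descent: step size scales with the gradient magnitude."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: running averages of the gradient (m) and squared gradient (v)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (per-parameter scale)
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_sgd = sgd(0.0)
w_adam = adam(0.0)
first_adam_step = adam(0.0, steps=1)  # roughly lr, regardless of gradient scale

print(w_sgd, w_adam, first_adam_step)
```

Both runs should end near the minimum at w = 3. The first Adam step has size close to `lr` whatever the gradient magnitude, which is one reason Adam tends to be less sensitive to the scale of the loss than plain SGD.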
# Adam (recommended default)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# AdamW (Adam with decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# Different learning rates for different layers
# (assumes the model exposes `backbone` and `head` submodules)
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Pretrained: low LR
    {'params': model.head.parameters(), 'lr': 1e-3},      # New: high LR
])
Pattern 3: Activation Functions
Hidden Layers:
├── ReLU (default) → max(0, x)
├── LeakyReLU      → max(0.01x, x)   # Avoids dead neurons
├── GELU           → x * Φ(x)        # Used in Transformers
└── Tanh           → range (-1, 1)   # Centered around 0
Output Layers:
├── None/Linear → Regression, or logits for CrossEntropyLoss
├── Sigmoid     → Binary classification (0-1)
├── Softmax     → Multi-class probabilities (at inference, not before CrossEntropyLoss)
└── Tanh        → Output in (-1, 1) range
# MLP with ReLU (standard)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),                   # Hidden activation
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, num_classes)   # No activation for CrossEntropyLoss
)
# ⚠️ CRITICAL: Don't double-apply softmax!
# CrossEntropyLoss already applies log-softmax internally
Pattern 4: Gradient Descent Intuition
Gradient Descent:
new_weight = old_weight - learning_rate × gradient
┌─────────────────────────────────────────┐
│ Loss Surface (simplified 2D) │
│ │
│ ∿∿∿ │
│ ∿ ∿ ● Start │
│ ∿ ∿ ↓ │
│ ∿ ∿ ↓ │
│ ∿ ★ ∿ ↓ │
│ ∿ Min ∿ ● Current │
│ ∿ ∿ │
│ ∿∿∿∿∿ │
└─────────────────────────────────────────┘
Gradient points "uphill" → we step in the opposite direction
Pattern 5: Backpropagation (Chain Rule)
# Manual gradient computation (for understanding)
def manual_gradient(X, y, w, b):
    """
    For: y_pred = X @ w + b, loss = MSE
    ∂Loss/∂w = (2/N) × X.T @ (y_pred - y)
    ∂Loss/∂b = (2/N) × sum(y_pred - y)
    """
    N = len(y)
    y_pred = X @ w + b
    error = y_pred - y
    grad_w = (2/N) * (X.T @ error)
    grad_b = (2/N) * error.sum()
    return grad_w, grad_b
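The function above covers a single linear layer. As a sketch of how the chain rule extends through more than one layer, the example below uses a hypothetical two-layer scalar network (h = tanh(w1 * x), y_pred = w2 * h, squared-error loss) and checks the hand-derived gradients against central finite differences. The network, the names, and the input values are assumptions made for illustration.

```python
import math

def forward(x, y, w1, w2):
    """Two-layer scalar network: h = tanh(w1 * x), y_pred = w2 * h, loss = (y_pred - y)^2."""
    h = math.tanh(w1 * x)
    y_pred = w2 * h
    loss = (y_pred - y) ** 2
    return h, y_pred, loss

def backward(x, y, w1, w2):
    """Apply the chain rule from the loss back to each weight, last layer first."""
    h, y_pred, _ = forward(x, y, w1, w2)
    d_pred = 2.0 * (y_pred - y)        # dLoss/dy_pred
    d_w2 = d_pred * h                  # dLoss/dw2
    d_h = d_pred * w2                  # gradient flowing back into layer 1's output
    d_w1 = d_h * (1.0 - h * h) * x     # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

def numeric_grad(x, y, w1, w2, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    g1 = (forward(x, y, w1 + eps, w2)[2] - forward(x, y, w1 - eps, w2)[2]) / (2 * eps)
    g2 = (forward(x, y, w1, w2 + eps)[2] - forward(x, y, w1, w2 - eps)[2]) / (2 * eps)
    return g1, g2

x, y, w1, w2 = 0.5, 1.0, 0.8, -1.2
analytic = backward(x, y, w1, w2)
numeric = numeric_grad(x, y, w1, w2)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two results should agree to several decimal places. The same finite-difference check is a common way to debug a hand-written backward pass.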
# PyTorch does this automatically!
loss.backward()              # Computes gradients for every parameter
print(model[0].weight.grad)  # Gradient of the first layer (index into nn.Sequential)
Pattern 6: Regularization
# Dropout (randomly zero neurons during training)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Dropout(0.5),   # 50% dropout
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.3),   # 30% dropout
    nn.Linear(32, output_dim)
)
# ⚠️ CRITICAL: model.train() vs model.eval()
model.train()  # Dropout active
model.eval()   # Dropout disabled
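To show why the train/eval switch matters, here is a pure-Python sketch of inverted dropout, the scheme `nn.Dropout` follows: survivors are scaled by 1/(1-p) during training so the expected activation stays the same, and the layer becomes the identity in eval mode. The `dropout` helper and its `training` flag are illustrative assumptions, not PyTorch API.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)  # eval mode: pass values through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)   # seeded so the run is repeatable
x = [1.0] * 10000

train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)
print("train-mode values:", sorted(set(train_out)))
print("train-mode mean:  ", train_mean)
print("eval unchanged:   ", eval_out == x)
```

In train mode each output is either zeroed or doubled (for p = 0.5) while the mean stays close to the input mean, so predictions made without calling `model.eval()` are noisy rather than systematically scaled.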
# BatchNorm (normalizes activations)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.BatchNorm1d(64),   # After linear, before activation
    nn.ReLU(),
    nn.Linear(64, output_dim)
)
Pattern 7: Diagnosing Training Issues
Training Loss:
├── Not decreasing at all
│   ├── Learning rate too high? → Try 10x smaller
│   ├── Bug in loss/data? → Check shapes, labels
│   └── Model too simple? → Add layers/neurons
│
├── Decreasing then stuck
│   ├── Learning rate too low? → Try larger
│   └── Local minimum → Try Adam instead of SGD
│
├── Oscillating wildly
│   └── Learning rate too high → Decrease by 10x
│
└── NaN loss
    ├── Learning rate too high
    ├── Numerical instability → Use BatchNorm
    └── Bad data (NaN in inputs)
Validation Loss:
├── Train low, Val high = OVERFITTING
│   ├── Add Dropout
│   ├── Add weight decay
│   ├── More data / augmentation
│   └── Simpler model
│
└── Both high = UNDERFITTING
    ├── More complex model
    ├── Train longer
    └── Better features
Reference Navigation
For detailed content, see:
- Training Fundamentals: reference/training_fundamentals.md - Three Pillars, Gradient Descent from scratch
- Loss Functions: reference/loss_functions.md - MSE vs MAE, BCE vs CrossEntropy
- Optimization: reference/optimization.md - Adam vs SGD, Learning rate schedules
- Regularization: reference/regularization.md - Dropout, BatchNorm, Weight decay
- Classification NN: reference/classification_nn.md - MLP design, Architecture rules
- Autoencoders: reference/autoencoders.md - Standard AE, Denoising AE, Latent space, Similarity search
Common Mistakes to Avoid
1. Forgetting optimizer.zero_grad()
# WRONG: Gradients accumulate!
for epoch in range(epochs):
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()        # Gradients ADD to previous!
    optimizer.step()
# CORRECT: Clear gradients each step
for epoch in range(epochs):
    optimizer.zero_grad()  # Clear first!
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
2. Double Softmax
# WRONG: Softmax + CrossEntropyLoss
model = nn.Sequential(
    nn.Linear(64, 10),
    nn.Softmax(dim=1)  # DON'T DO THIS!
)
criterion = nn.CrossEntropyLoss()  # Applies softmax again!
# CORRECT: No softmax in model
model = nn.Sequential(
    nn.Linear(64, 10)
    # No activation - CrossEntropyLoss handles it
)
3. Wrong model.eval() / model.train()
# WRONG: Evaluating in train mode
outputs = model(X_test)  # Dropout still active, BatchNorm uses batch statistics!
# CORRECT: Switch modes
model.eval()  # Disable dropout; BatchNorm uses its running statistics
with torch.no_grad():  # No gradient tracking needed for evaluation
    outputs = model(X_test)
model.train()  # Re-enable for training
4. Missing Activation Functions
# WRONG: No activations = just a linear transformation!
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.Linear(64, 32),  # This is just a bigger linear layer!
    nn.Linear(32, 10)
)
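The claim that stacked linear layers collapse into one can be checked directly. The sketch below uses small hand-picked weights (hypothetical values, pure Python lists instead of tensors) to show that two linear layers equal a single layer with W = W2 @ W1 and b = W2 @ b1 + b2, and that inserting a ReLU breaks the equivalence.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def linear(W, b, x):
    """Affine layer: W @ x + b."""
    return [y + b_i for y, b_i in zip(matvec(W, x), b)]

def relu(v):
    return [max(0.0, a) for a in v]

# Two stacked linear layers, 3 -> 2 -> 2 (weights are arbitrary example values)
W1, b1 = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]], [0.5, -0.5]
W2, b2 = [[2.0, 1.0], [-1.0, 4.0]], [1.0, 0.0]

# The equivalent single layer
W = matmul(W2, W1)
b = [y + b_i for y, b_i in zip(matvec(W2, b1), b2)]

x = [1.0, -2.0, 0.5]
stacked = linear(W2, b2, linear(W1, b1, x))          # no activation in between
collapsed = linear(W, b, x)                          # one layer, same result
with_relu = linear(W2, b2, relu(linear(W1, b1, x)))  # non-linearity in between

print(stacked, collapsed, with_relu)
```

With the ReLU in place the output no longer matches any single linear layer applied to the same input, which is what gives depth its extra expressive power.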
# CORRECT: Activations add non-linearity
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),  # Non-linearity!
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10)
)
5. Vanishing Gradients with Sigmoid/Tanh
# PROBLEM: Sigmoid/Tanh saturate, gradients → 0
# In deep networks, gradients "vanish"
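A rough numerical illustration of the problem: backpropagation multiplies one local derivative per layer, and the sigmoid's derivative is at most 0.25. The sketch below makes the simplifying assumptions that every weight is 1 and every layer sees the same pre-activation, so the gradient reaching the first layer is just the local derivative raised to the depth. Function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # largest at z = 0, where it equals 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_grad(local_grad, z, depth):
    """Product of `depth` identical local derivatives (all weights assumed to be 1)."""
    return local_grad(z) ** depth

for depth in (1, 5, 10, 20):
    sig = chained_grad(sigmoid_grad, 0.0, depth)  # best case for sigmoid
    rel = chained_grad(relu_grad, 1.0, depth)     # active ReLU unit
    print(f"depth {depth:2d}: sigmoid {sig:.3e}   relu {rel:.1f}")
```

Even in the sigmoid's best case the factor shrinks by 4x per layer, while an active ReLU passes the gradient through unchanged. Real networks also multiply by the weights, so this is an intuition aid rather than an exact model.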
# SOLUTION: Use ReLU for hidden layers
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),  # Gradient = 1 for positive inputs
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10)
)
Teaching Mode
When explaining deep learning concepts:
Backpropagation Analogy
"Imagine a factory assembly line. Each worker (layer) passes their work forward. If the final product is bad (high loss), we need to figure out which worker made what mistakes. Backprop sends 'blame' backward through the line - each worker learns how much they contributed to the error."
Gradient Descent Visual
You're blindfolded on a hilly landscape.
Goal: Find the lowest point.
Strategy (Gradient Descent):
1. Feel the slope under your feet
2. Take a step downhill
3. Repeat until flat (minimum)
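The three-step strategy above can be sketched as a short loop. This example assumes a one-dimensional "landscape" f(w) = w^2, whose slope is 2w and whose lowest point is at w = 0; the `descend` helper and the three step sizes are illustrative choices.

```python
def descend(lr, start=5.0, steps=50):
    """Minimize f(w) = w^2 with a fixed step size (learning rate)."""
    w = start
    for _ in range(steps):       # 3. repeat
        slope = 2.0 * w          # 1. feel the slope under your feet
        w = w - lr * slope       # 2. take a step downhill
    return w

w_small = descend(lr=0.001)  # very small steps: barely leaves the start
w_good = descend(lr=0.1)     # moderate steps: ends close to the minimum at 0
w_big = descend(lr=1.1)      # oversized steps: jumps past 0 and lands further away each time

print(w_small, w_good, w_big)
```

Only the step size changes between the three runs, yet one stalls, one converges, and one diverges, which is why the learning rate is usually the first hyperparameter to tune.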
Learning rate = step size
- Too big: Overshoot the minimum
- Too small: Takes forever
Overfitting vs Underfitting
| | Underfit | Good Fit | Overfit |
|---|---|---|---|
| Bias/variance | High bias | Balanced | High variance |
| Fitted curve | Straight line | Smooth curve | Highly wiggly curve |
| Behavior | "Too simple" | "Just right" | "Memorized noise" |
| Train | ❌ | ✓ | ✓ |
| Test | ❌ | ✓ | ❌ |