levy-n

deep-learning-core

Explains neural network fundamentals: the Three Pillars (Model, Loss, Optimizer), backpropagation, gradient descent variants (SGD, Adam), regularization (Dropout, BatchNorm), and MLP architecture design. Use when learning how neural networks work, debugging training issues, or when user asks about 'backpropagation', 'vanishing gradients', 'learning rate', 'loss function', 'overfitting', 'underfitting', 'activation functions', 'why isn't my model learning', 'gradient descent', 'Adam', 'Dropout', 'BatchNorm', 'autoencoder', 'denoising autoencoder', or 'latent space'.

levy-n 10 1 Updated 4mo ago

Install

npx skillscat add levy-n/claude-useful-skills/deep-learning-core

Install via the SkillsCat registry.

SKILL.md

Deep Learning Core - Neural Network Fundamentals

Deep Learning fundamentals: the model, the loss, and optimization.

The Three Pillars of Learning

┌─────────────────────────────────────────────────────────────┐
│  1. ASSUMPTION (Model)     - What shape is the solution?    │
│     → Function family (Linear, Polynomial, Neural Network)  │
│                                                             │
│  2. CRITERION (Loss)       - What counts as "good"?         │
│     → Loss function (MSE, Cross-Entropy, etc.)              │
│                                                             │
│  3. SEARCH (Optimization)  - How do we find the solution?   │
│     → Optimization algorithm (Gradient Descent, Adam)       │
└─────────────────────────────────────────────────────────────┘

Quick Start - Training Loop

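import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder sizes and random data so the loop runs as-is; substitute your own
input_dim, output_dim, num_epochs = 10, 1, 100
X_train = torch.randn(256, input_dim)
y_train = torch.randn(256, output_dim)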

# 1. Model (Assumption)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, output_dim)
)

# 2. Loss (Criterion)
criterion = nn.MSELoss()           # Regression
# criterion = nn.CrossEntropyLoss()  # Classification

# 3. Optimizer (Search)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()           # Clear gradients
    outputs = model(X_train)        # Forward pass
    loss = criterion(outputs, y_train)  # Compute loss
    loss.backward()                 # Backward pass (compute gradients)
    optimizer.step()                # Update weights

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

When This Skill Activates

Use this skill when:

  • Learning how neural networks work
  • Understanding backpropagation
  • Debugging training issues (loss not decreasing, NaN)
  • Choosing loss functions or optimizers
  • Deciding on architecture (layers, neurons)
  • Dealing with overfitting/underfitting

Core Patterns

Pattern 1: Loss Function Selection

Problem                  Loss Function         Output Activation
Regression               MSELoss()             None (linear)
Binary Classification    BCEWithLogitsLoss()   None (logits)
Binary (with sigmoid)    BCELoss()             Sigmoid
Multi-class              CrossEntropyLoss()    None (logits)
Multi-label              BCEWithLogitsLoss()   None (per class)

# Regression
criterion = nn.MSELoss()

# Binary classification (RECOMMENDED)
criterion = nn.BCEWithLogitsLoss()  # Applies sigmoid internally

# Multi-class classification
criterion = nn.CrossEntropyLoss()  # Applies softmax internally
# ⚠️ Don't add softmax to model output!
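
The table above can be checked against concrete tensors. The sketch below is only an illustration: the batch size, class count, and variable names are made up, and the data is random. What it shows is the target shape and dtype each loss expects, and that BCEWithLogitsLoss on logits matches BCELoss on sigmoid outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, num_classes = 4, 3  # illustrative sizes, not from the original

# Regression: float predictions and float targets of the same shape
reg_loss = nn.MSELoss()(torch.randn(batch, 1), torch.randn(batch, 1))

# Binary: raw logits plus float targets in {0., 1.}
bin_logits = torch.randn(batch, 1)
bin_targets = torch.randint(0, 2, (batch, 1)).float()
bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
# Same value as applying sigmoid first and then BCELoss
bce_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

# Multi-class: logits of shape (batch, num_classes), integer class indices
mc_logits = torch.randn(batch, num_classes)
mc_targets = torch.randint(0, num_classes, (batch,))  # dtype torch.long
ce_loss = nn.CrossEntropyLoss()(mc_logits, mc_targets)

# Multi-label: one logit per class, one float 0/1 target per class
ml_targets = torch.randint(0, 2, (batch, num_classes)).float()
ml_loss = nn.BCEWithLogitsLoss()(mc_logits, ml_targets)
```

A common source of shape or dtype errors is passing float or one-hot targets where CrossEntropyLoss expects class indices, or integer targets where the BCE losses expect floats.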

Pattern 2: Optimizer Selection

Optimizer         When to Use
Adam              Default choice, "just works"
SGD + Momentum    Fine-tuning, when Adam overfits
AdamW             With weight decay (regularization)

# Adam (recommended default)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# AdamW (Adam with weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Different learning rates for different layers
# (assumes the model exposes `backbone` and `head` submodules)
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Pretrained: low LR
    {'params': model.head.parameters(), 'lr': 1e-3}       # New: high LR
])
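
The per-layer learning rate snippet needs a model that actually has `backbone` and `head` attributes. A minimal sketch follows, using a hypothetical `TinyNet` class invented for this example (layer sizes are arbitrary), to show how each dict ends up as one parameter group.

```python
import torch.nn as nn
import torch.optim as optim

class TinyNet(nn.Module):
    """Hypothetical model with `backbone` and `head` submodules."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        return self.head(self.backbone(x))

model = TinyNet()
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # low LR
    {'params': model.head.parameters(), 'lr': 1e-3},      # high LR
])

# Each dict becomes one entry in optimizer.param_groups
group_lrs = [group['lr'] for group in optimizer.param_groups]
```

Inspecting `optimizer.param_groups` is a quick way to confirm that each part of the model received the learning rate you intended.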

Pattern 3: Activation Functions

Hidden Layers:
├── ReLU (default)      → max(0, x)
├── LeakyReLU           → max(0.01x, x)  # Avoids dead neurons
├── GELU                → x * Φ(x)       # Transformers use this
└── Tanh                → (-1, 1)        # Centered around 0

Output Layers:
├── None/Linear         → Regression, or logits for CrossEntropyLoss / BCEWithLogitsLoss
├── Sigmoid             → Binary classification (0-1), paired with BCELoss
├── Softmax             → Multi-class probabilities (inference only, not with CrossEntropyLoss)
└── Tanh                → Output in (-1, 1) range

# MLP with ReLU (standard)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),                    # Hidden activation
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, num_classes)    # No activation for CrossEntropyLoss
)

# ⚠️ CRITICAL: Don't double-apply softmax!
# CrossEntropyLoss already applies softmax internally
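
To make the formulas in the activation tree concrete, the sketch below evaluates each hidden-layer activation on a small hand-picked tensor. The input values are arbitrary, and the GELU line simply re-derives x * Φ(x) from the error function to compare against the built-in.

```python
import math
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])

relu_out = F.relu(x)                              # max(0, x)
leaky_out = F.leaky_relu(x, negative_slope=0.01)  # max(0.01x, x)
tanh_out = torch.tanh(x)                          # values in (-1, 1)

# GELU as x * Phi(x), where Phi is the standard normal CDF
phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
gelu_manual = x * phi
gelu_out = F.gelu(x)
```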

Pattern 4: Gradient Descent Intuition

Gradient Descent:
    new_weight = old_weight - learning_rate × gradient

    ┌─────────────────────────────────────────┐
    │     Loss Surface (simplified 2D)         │
    │                                          │
    │     ∿∿∿                                  │
    │    ∿   ∿     ● Start                    │
    │   ∿     ∿    ↓                          │
    │  ∿       ∿   ↓                          │
    │ ∿    ★   ∿  ↓                          │
    │  ∿  Min  ∿  ● Current                   │
    │   ∿     ∿                               │
    │    ∿∿∿∿∿                                │
    └─────────────────────────────────────────┘

    Gradient points "uphill" → we go opposite direction
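
The update rule above can be run by hand on a one-dimensional toy problem. This is a minimal sketch in plain Python; the loss (w - 3)^2, the starting point, and the step count are all chosen for illustration.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (chosen for illustration)."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss: d/dw (w - 3)^2 = 2 (w - 3)."""
    return 2.0 * (w - 3.0)

w = 0.0               # starting weight
learning_rate = 0.1
history = [loss(w)]

for _ in range(50):
    # new_weight = old_weight - learning_rate * gradient
    w = w - learning_rate * gradient(w)
    history.append(loss(w))
```

With this learning rate, each step shrinks the distance to the minimum by a factor of 0.8, so the loss falls steadily toward zero.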

Pattern 5: Backpropagation (Chain Rule)

# Manual gradient computation (for understanding)
def manual_gradient(X, y, w, b):
    """
    For: y_pred = X @ w + b, loss = MSE
    ∂Loss/∂w = (2/N) × X.T @ (y_pred - y)
    ∂Loss/∂b = (2/N) × sum(y_pred - y)
    """
    N = len(y)
    y_pred = X @ w + b
    error = y_pred - y

    grad_w = (2/N) * (X.T @ error)
    grad_b = (2/N) * error.sum()

    return grad_w, grad_b

# PyTorch does this automatically!
loss.backward()  # Computes all gradients
print(model[0].weight.grad)  # Gradient of the first Linear layer in an nn.Sequential
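
One way to build trust in the closed-form gradients is to compare them with what autograd produces. The sketch below repeats `manual_gradient` in compact form and checks it on random data; the tensor shapes and the seed are arbitrary choices for the example.

```python
import torch

def manual_gradient(X, y, w, b):
    """Closed-form MSE gradients for y_pred = X @ w + b."""
    N = len(y)
    error = X @ w + b - y
    return (2 / N) * (X.T @ error), (2 / N) * error.sum()

torch.manual_seed(0)
X = torch.randn(20, 3)
y = torch.randn(20)
w = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Autograd: build the MSE loss and backpropagate
loss = ((X @ w + b - y) ** 2).mean()
loss.backward()

# Closed-form gradients, computed without tracking operations
with torch.no_grad():
    grad_w, grad_b = manual_gradient(X, y, w, b)

match_w = torch.allclose(w.grad, grad_w, atol=1e-5)
match_b = torch.allclose(b.grad, grad_b, atol=1e-5)
```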

Pattern 6: Regularization

# Dropout (randomly zero neurons during training)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Dropout(0.5),              # 50% dropout
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.3),              # 30% dropout
    nn.Linear(32, output_dim)
)

# ⚠️ CRITICAL: model.train() vs model.eval()
model.train()  # Dropout active, BatchNorm uses batch statistics
model.eval()   # Dropout disabled, BatchNorm uses running statistics

# BatchNorm (normalizes activations)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.BatchNorm1d(64),           # After linear, before activation
    nn.ReLU(),
    nn.Linear(64, output_dim)
)
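
The effect of `model.train()` versus `model.eval()` is easy to see on a single Dropout layer. A small sketch, with an all-ones input chosen so that the zeroed and rescaled entries stand out:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)    # identity: dropout is disabled

zero_fraction = (train_out == 0).float().mean().item()
```

The rescaling in training mode keeps the expected activation unchanged, which is why no extra scaling is needed at evaluation time.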

Pattern 7: Diagnosing Training Issues

Training Loss:
├── Not decreasing at all
│   ├── Learning rate too high? → Try 10x smaller
│   ├── Bug in loss/data? → Check shapes, labels
│   └── Model too simple? → Add layers/neurons
│
├── Decreasing then stuck
│   ├── Learning rate too low? → Try larger
│   └── Local minimum → Try Adam instead of SGD
│
├── Oscillating wildly
│   └── Learning rate too high → Decrease by 10x
│
└── NaN loss
    ├── Learning rate too high
    ├── Numerical instability → Use BatchNorm
    └── Bad data (NaN in inputs)

Validation Loss:
├── Train low, Val high = OVERFITTING
│   ├── Add Dropout
│   ├── Add weight decay
│   ├── More data / augmentation
│   └── Simpler model
│
└── Both high = UNDERFITTING
    ├── More complex model
    ├── Train longer
    └── Better features
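
For the NaN branch of the tree, two quick checks are often worth adding to a training script: scanning tensors for non-finite values and capping the gradient norm. The sketch below is one possible way to do this. Gradient clipping is not part of the tree above and is included here as an assumption about a commonly used mitigation; `has_bad_values` is a hypothetical helper and the threshold of 1.0 is arbitrary.

```python
import torch
import torch.nn as nn

def has_bad_values(t):
    """True if the tensor contains NaN or infinity (hypothetical helper)."""
    return not torch.isfinite(t).all().item()

torch.manual_seed(0)
model = nn.Linear(4, 1)
X = torch.randn(16, 4)
y = torch.randn(16, 1)

# Check the inputs before blaming the optimizer
assert not has_bad_values(X), "inputs contain NaN/inf"

loss = nn.MSELoss()(model(X), y)
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping
max_norm = 1.0  # illustrative threshold
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
```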

Reference Navigation

For detailed content, see:

  • Training Fundamentals: reference/training_fundamentals.md - Three Pillars, Gradient Descent from scratch
  • Loss Functions: reference/loss_functions.md - MSE vs MAE, BCE vs CrossEntropy
  • Optimization: reference/optimization.md - Adam vs SGD, Learning rate schedules
  • Regularization: reference/regularization.md - Dropout, BatchNorm, Weight decay
  • Classification NN: reference/classification_nn.md - MLP design, Architecture rules
  • Autoencoders: reference/autoencoders.md - Standard AE, Denoising AE, Latent space, Similarity search

Common Mistakes to Avoid

1. Forgetting optimizer.zero_grad()

# WRONG: Gradients accumulate!
for epoch in range(epochs):
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()  # Gradients ADD to previous!
    optimizer.step()

# CORRECT: Clear gradients each step
for epoch in range(epochs):
    optimizer.zero_grad()  # Clear first!
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
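
The accumulation can be observed directly on a single parameter. In this small sketch the function 3 * w is chosen only because its gradient is obviously 3.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()        # gradient of 3w is 3

(3 * w).sum().backward()      # no zeroing in between
accumulated = w.grad.clone()  # 3 + 3: the new gradient was added to the old one

w.grad.zero_()                # reset manually; optimizer.zero_grad() clears every parameter it manages
(3 * w).sum().backward()
after_reset = w.grad.clone()  # back to 3
```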

2. Double Softmax

# WRONG: Softmax + CrossEntropyLoss
model = nn.Sequential(
    nn.Linear(64, 10),
    nn.Softmax(dim=1)  # DON'T DO THIS!
)
criterion = nn.CrossEntropyLoss()  # Applies softmax again!

# CORRECT: No softmax in model
model = nn.Sequential(
    nn.Linear(64, 10)
    # No activation - CrossEntropyLoss handles it
)
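
To see why the extra softmax matters, the sketch below compares three losses on the same random logits: CrossEntropyLoss on raw logits, the equivalent log-softmax plus NLL formulation, and the mistaken version that feeds probabilities in as if they were logits. Shapes and seed are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))

criterion = nn.CrossEntropyLoss()

correct = criterion(logits, targets)
# Equivalent formulation: log-softmax followed by negative log-likelihood
reference = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Mistake: probabilities fed in as if they were logits
double = criterion(F.softmax(logits, dim=1), targets)

matches_reference = torch.allclose(correct, reference, atol=1e-6)
differs_from_double = not torch.allclose(correct, double)
```

Because softmax outputs all lie in [0, 1], the second softmax sees nearly equal inputs and produces an almost uniform distribution, which flattens the loss and weakens the gradients.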

3. Forgetting model.eval() / model.train()

# WRONG: Evaluating in train mode
preds = model(X_test)  # Dropout still active!

# CORRECT: Switch modes
model.eval()  # Disable dropout, use BatchNorm running statistics
with torch.no_grad():
    preds = model(X_test)

model.train()  # Re-enable for training

4. Missing Activation Functions

# WRONG: No activations = just a linear transformation!
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.Linear(64, 32),  # This is just a bigger linear layer!
    nn.Linear(32, 10)
)

# CORRECT: Activations add non-linearity
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),          # Non-linearity!
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10)
)
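
The claim that stacked linear layers are "just a bigger linear layer" can be verified numerically. The sketch below composes the three affine maps from the WRONG example into a single weight matrix and bias, then checks that both versions give the same output on random input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
l1, l2, l3 = nn.Linear(100, 64), nn.Linear(64, 32), nn.Linear(32, 10)
stacked = nn.Sequential(l1, l2, l3)

with torch.no_grad():
    # Compose the three affine maps into one: y = W x + b
    W = l3.weight @ l2.weight @ l1.weight
    b = l3.weight @ (l2.weight @ l1.bias + l2.bias) + l3.bias

    single = nn.Linear(100, 10)
    single.weight.copy_(W)
    single.bias.copy_(b)

    x = torch.randn(8, 100)
    same_output = torch.allclose(stacked(x), single(x), atol=1e-5)
```

The three-layer version has more parameters but no more expressive power than the single layer, which is why the non-linearities are essential.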

5. Vanishing Gradients with Sigmoid/Tanh

# PROBLEM: Sigmoid/Tanh saturate, gradients → 0
# In deep networks, gradients "vanish"

# SOLUTION: Use ReLU for hidden layers
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),          # Gradient = 1 for positive inputs
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10)
)
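
A stripped-down way to see the effect is to apply an activation repeatedly to a scalar and ask autograd for the derivative of the result with respect to the input. This sketch ignores weights entirely, so it is an illustration of the saturation argument rather than a model of a real network; the depth of 10 and the `chain_gradient` helper are made up for the example.

```python
import torch

def chain_gradient(activation, depth=10):
    """d(output)/d(input) after applying `activation` `depth` times to a scalar."""
    x = torch.tensor(1.0, requires_grad=True)
    y = x
    for _ in range(depth):
        y = activation(y)
    y.backward()
    return x.grad.item()

sigmoid_grad = chain_gradient(torch.sigmoid)  # product of 10 factors, each at most 0.25
relu_grad = chain_gradient(torch.relu)        # product of 10 factors, each exactly 1 here
```

Since the sigmoid derivative never exceeds 0.25, ten layers shrink the gradient below 0.25^10 (about 1e-6), while ReLU passes it through unchanged for positive inputs.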

Teaching Mode

When explaining deep learning concepts:

Backpropagation Analogy

"Imagine a factory assembly line. Each worker (layer) passes their work forward. If the final product is bad (high loss), we need to figure out which worker made what mistakes. Backprop sends 'blame' backward through the line - each worker learns how much they contributed to the error."

Gradient Descent Visual

You're blindfolded on a hilly landscape.
Goal: Find the lowest point.

Strategy (Gradient Descent):
1. Feel the slope under your feet
2. Take a step downhill
3. Repeat until flat (minimum)

Learning rate = step size
- Too big: Overshoot the minimum
- Too small: Takes forever
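
The step-size trade-off can be reproduced on f(w) = w^2, whose gradient is 2w. In this sketch the three learning rates and the helper name `run_gradient_descent` are picked purely for illustration.

```python
def run_gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w^2 (gradient 2w) and return the final |w|."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return abs(w)

too_big = run_gradient_descent(1.1)      # each step multiplies w by -1.2: diverges
too_small = run_gradient_descent(0.001)  # each step multiplies w by 0.998: barely moves
reasonable = run_gradient_descent(0.1)   # each step multiplies w by 0.8: converges
```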

Overfitting vs Underfitting

     Underfit              Good Fit              Overfit
    (High Bias)           (Balanced)          (High Variance)

       ───                  ∿∿∿∿                ∿∿∿∿∿∿∿∿
      ● ● ●               ●  ●  ●              ●●●●●●●●
     ●     ●             ●      ●             ●        ●
    ●       ●           ●        ●

   "Too simple"        "Just right"        "Memorized noise"
   Train ❌ Test ❌     Train ✓ Test ✓      Train ✓ Test ❌