deep-learning-core

Explains neural network fundamentals: the Three Pillars (Model, Loss, Optimizer), backpropagation, gradient descent variants (SGD, Adam), regularization (Dropout, BatchNorm), and MLP architecture design. Use when learning how neural networks work, debugging training issues, or when user asks about 'backpropagation', 'vanishing gradients', 'learning rate', 'loss function', 'overfitting', 'underfitting', 'activation functions', 'why isn't my model learning', 'gradient descent', 'Adam', 'Dropout', 'BatchNorm', 'autoencoder', 'denoising autoencoder', or 'latent space'.

levy-n 10 1 Updated 5mo ago

Install

npx skillscat add levy-n/claude-useful-skills/deep-learning-core

Install via the SkillsCat registry.

SKILL.md

Deep Learning Core - Neural Network Fundamentals

Deep learning fundamentals: the model, the loss, and optimization.

The Three Pillars of Learning

┌──────────────────────────────────────────────────────────────────┐
│  1. ASSUMPTION (Model)     - What form does the solution take?   │
│     → The function family (Linear, Polynomial, Neural Network)   │
│                                                                  │
│  2. CRITERION (Loss)       - What counts as "good"?              │
│     → The loss function (MSE, Cross-Entropy, etc.)               │
│                                                                  │
│  3. SEARCH (Optimization)  - How do we find the solution?        │
│     → The optimization algorithm (Gradient Descent, Adam)        │
└──────────────────────────────────────────────────────────────────┘

Quick Start - Training Loop

import torch
import torch.nn as nn
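import torch.optim as optim

# Placeholder data so the snippet runs end to end (sizes are arbitrary assumptions)
input_dim, output_dim, num_epochs = 10, 1, 100
X_train = torch.randn(256, input_dim)
y_train = torch.randn(256, output_dim)
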
# 1. Model (Assumption)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, output_dim)
)

# 2. Loss (Criterion)
criterion = nn.MSELoss()           # Regression
# criterion = nn.CrossEntropyLoss()  # Classification

# 3. Optimizer (Search)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()           # Clear gradients
    outputs = model(X_train)        # Forward pass
    loss = criterion(outputs, y_train)  # Compute loss
    loss.backward()                 # Backward pass (compute gradients)
    optimizer.step()                # Update weights

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

When This Skill Activates

Use this skill when:

- Learning how neural networks work
- Understanding backpropagation
- Debugging training issues (loss not decreasing, NaN)
- Choosing loss functions or optimizers
- Deciding on architecture (layers, neurons)
- Dealing with overfitting/underfitting

Core Patterns

Pattern 1: Loss Function Selection

| Problem | Loss Function | Output Activation |
|---|---|---|
| Regression | `MSELoss()` | None (linear) |
| Binary Classification | `BCEWithLogitsLoss()` | None (logits) |
| Binary (with sigmoid) | `BCELoss()` | Sigmoid |
| Multi-class | `CrossEntropyLoss()` | None (logits) |
| Multi-label | `BCEWithLogitsLoss()` | None (one logit per class) |

# Regression
criterion = nn.MSELoss()

# Binary classification (RECOMMENDED)
criterion = nn.BCEWithLogitsLoss()  # Applies sigmoid internally

# Multi-class classification
criterion = nn.CrossEntropyLoss()  # Applies log-softmax internally; expects raw logits
# ⚠️ Don't add softmax to model output!
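
To illustrate why the logits-based loss is the recommended option for binary classification, here is a minimal pure-Python sketch (no PyTorch). The function names are made up for this example, and this is not PyTorch's actual implementation; it only shows how a sigmoid followed by a log breaks down once the sigmoid saturates, while the rearranged form stays finite.

```python
import math

def naive_bce(logit, target):
    """Sigmoid followed by log: fails when the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def stable_bce_with_logits(logit, target):
    """Same loss, rearranged so that log(0) can never occur."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits both versions agree
print(naive_bce(0.5, 1.0), stable_bce_with_logits(0.5, 1.0))

# For a confident wrong prediction the stable version stays finite
print(stable_bce_with_logits(100.0, 0.0))

try:
    naive_bce(100.0, 0.0)  # sigmoid(100) rounds to 1.0, so log(1 - p) is log(0)
except ValueError as err:
    print("naive version failed:", err)
```
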

Pattern 2: Optimizer Selection

| Optimizer | When to Use |
|---|---|
| Adam | Default choice, "just works" |
| SGD + Momentum | Fine-tuning, when Adam overfits |
| AdamW | With weight decay (regularization) |

# Adam (recommended default)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# AdamW (Adam with decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Different learning rates for different layers
# (assumes the model exposes `backbone` and `head` submodules)
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Pretrained: low LR
    {'params': model.head.parameters(), 'lr': 1e-3}       # New: high LR
])
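
For intuition about what these optimizers do on each step, the update rules can be sketched for a single scalar weight in plain Python. The helper names are hypothetical and the defaults simply mirror the values used above; this is a simplified sketch, not the library code.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns (new_w, new_velocity)."""
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (1-based); returns (new_w, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Adam's first step is about lr in size, whatever the gradient scale
adam_moves = []
for grad in (1000.0, 0.001):
    w_new, _, _ = adam_step(w=0.0, grad=grad, m=0.0, v=0.0, t=1)
    adam_moves.append(abs(w_new))
    print(f"grad={grad}: Adam moved by {abs(w_new):.6f}")

# Plain SGD's step scales directly with the gradient
sgd_w, _ = sgd_momentum_step(w=0.0, grad=1000.0, velocity=0.0)
print(f"grad=1000.0: SGD moved by {abs(sgd_w):.6f}")
```

Adam normalizes each step by the recent gradient magnitude, which is one reason it tends to need less learning-rate tuning than SGD.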

Pattern 3: Activation Functions

Hidden Layers:
├── ReLU (default)      → max(0, x)
├── LeakyReLU           → max(0.01x, x)  # Avoids dead neurons
├── GELU                → x * Φ(x)       # Transformers use this
└── Tanh                → (-1, 1)        # Centered around 0

Output Layers:
├── None/Linear         → Regression, or with CrossEntropyLoss
├── Sigmoid             → Binary classification (0-1)
├── Softmax             → Multi-class probabilities
└── Tanh                → Output in (-1, 1) range

# MLP with ReLU (standard)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),                    # Hidden activation
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, num_classes)    # No activation for CrossEntropyLoss
)

# ⚠️ CRITICAL: Don't double-apply softmax!
# CrossEntropyLoss already applies log-softmax internally
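
The hidden-layer activations listed above are simple scalar functions. A small standard-library sketch follows, with function names chosen for this example and the LeakyReLU slope of 0.01 taken from the list above:

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    return x if x > 0 else negative_slope * x

def gelu(x):
    # x * Φ(x), where Φ is the standard normal CDF, written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  leaky={leaky_relu(x):.3f}  "
          f"gelu={gelu(x):.3f}  tanh={math.tanh(x):.3f}")
```
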

Pattern 4: Gradient Descent Intuition

Gradient Descent:
    new_weight = old_weight - learning_rate × gradient

    ┌─────────────────────────────────────────┐
    │     Loss Surface (simplified 2D)         │
    │                                          │
    │     ∿∿∿                                  │
    │    ∿   ∿     ● Start                    │
    │   ∿     ∿    ↓                          │
    │  ∿       ∿   ↓                          │
    │ ∿    ★   ∿  ↓                          │
    │  ∿  Min  ∿  ● Current                   │
    │   ∿     ∿                               │
    │    ∿∿∿∿∿                                │
    └─────────────────────────────────────────┘

    Gradient points "uphill" → we go opposite direction
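
The update rule can be tried on a toy one-dimensional loss. This is a minimal sketch assuming a hand-picked loss `(w - 3)^2` and a learning rate of 0.1; both are illustrative choices, not values from a real model.

```python
def loss(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # dLoss/dw

w = 0.0
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * gradient(w)   # step against the gradient
    if step % 10 == 0:
        print(f"step {step:2d}: w={w:.4f}, loss={loss(w):.6f}")

print(f"final w = {w:.4f}")
```
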

Pattern 5: Backpropagation (Chain Rule)

# Manual gradient computation (for understanding)
def manual_gradient(X, y, w, b):
    """
    For: y_pred = X @ w + b, loss = MSE
    ∂Loss/∂w = (2/N) × X.T @ (y_pred - y)
    ∂Loss/∂b = (2/N) × sum(y_pred - y)
    """
    N = len(y)
    y_pred = X @ w + b
    error = y_pred - y

    grad_w = (2/N) * (X.T @ error)
    grad_b = (2/N) * error.sum()

    return grad_w, grad_b

# PyTorch does this automatically!
loss.backward()  # Computes all gradients
print(model[0].weight.grad)  # Gradient of the first Linear layer in an nn.Sequential
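
To show the chain rule passing through more than one layer, here is a hedged sketch of a tiny two-layer network with scalar weights, written in plain Python. The network, the names, and the numbers are all invented for illustration; the analytic gradients are checked against finite differences.

```python
def forward(x, w1, w2):
    z = w1 * x               # first layer (pre-activation)
    h = max(0.0, z)          # ReLU
    y_pred = w2 * h          # second layer
    return z, h, y_pred

def loss_fn(x, target, w1, w2):
    _, _, y_pred = forward(x, w1, w2)
    return (y_pred - target) ** 2

def backward(x, target, w1, w2):
    z, h, y_pred = forward(x, w1, w2)
    d_y = 2.0 * (y_pred - target)        # dLoss/dy_pred
    d_w2 = d_y * h                       # dLoss/dw2
    d_h = d_y * w2                       # gradient passed back to the hidden unit
    d_z = d_h * (1.0 if z > 0 else 0.0)  # through the ReLU
    d_w1 = d_z * x                       # dLoss/dw1
    return d_w1, d_w2

x, target, w1, w2 = 1.5, 2.0, 0.8, -0.5
grad_w1, grad_w2 = backward(x, target, w1, w2)

# Finite-difference check of both gradients
eps = 1e-6
num_w1 = (loss_fn(x, target, w1 + eps, w2) - loss_fn(x, target, w1 - eps, w2)) / (2 * eps)
num_w2 = (loss_fn(x, target, w1, w2 + eps) - loss_fn(x, target, w1, w2 - eps)) / (2 * eps)

print(f"analytic : dL/dw1={grad_w1:.4f}, dL/dw2={grad_w2:.4f}")
print(f"numerical: dL/dw1={num_w1:.4f}, dL/dw2={num_w2:.4f}")
```
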

Pattern 6: Regularization

# Dropout (randomly zero neurons during training)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Dropout(0.5),              # 50% dropout
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.3),              # 30% dropout
    nn.Linear(32, output_dim)
)

# ⚠️ CRITICAL: model.train() vs model.eval()
model.train()  # Dropout active; BatchNorm uses batch statistics
model.eval()   # Dropout disabled; BatchNorm uses running statistics

# BatchNorm (normalizes activations)
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.BatchNorm1d(64),           # After linear, before activation
    nn.ReLU(),
    nn.Linear(64, output_dim)
)
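
To make the train/eval difference concrete, here is a pure-Python sketch of inverted dropout, the variant in which surviving activations are scaled by `1 / (1 - p)` during training so the expected value is unchanged. The `dropout` helper and its signature are hypothetical stand-ins, not the `nn.Dropout` API.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors."""
    if not training or p == 0.0:
        return list(values)            # eval mode: identity
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)                 # fixed seed so the run is repeatable
activations = [1.0] * 10_000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("train mode: zeros =", train_out.count(0.0),
      " mean =", sum(train_out) / len(train_out))
print("eval mode:  zeros =", eval_out.count(0.0),
      " mean =", sum(eval_out) / len(eval_out))
```

Roughly half the values are zeroed in training mode while the mean stays close to 1, which is why evaluating without `model.eval()` gives noisy predictions.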

Pattern 7: Diagnosing Training Issues

Training Loss:
├── Not decreasing at all
│   ├── Learning rate too high? → Try 10x smaller
│   ├── Bug in loss/data? → Check shapes, labels
│   └── Model too simple? → Add layers/neurons
│
├── Decreasing then stuck
│   ├── Learning rate too low? → Try larger
│   └── Local minimum → Try Adam instead of SGD
│
├── Oscillating wildly
│   └── Learning rate too high → Decrease by 10x
│
└── NaN loss
    ├── Learning rate too high
    ├── Numerical instability → Use logits-based losses, gradient clipping, or BatchNorm
    └── Bad data (NaN in inputs)

Validation Loss:
├── Train low, Val high = OVERFITTING
│   ├── Add Dropout
│   ├── Add weight decay
│   ├── More data / augmentation
│   └── Simpler model
│
└── Both high = UNDERFITTING
    ├── More complex model
    ├── Train longer
    └── Better features
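
The validation-loss branch of the tree above can be turned into a rough helper. This is only a sketch: the `diagnose` function and both thresholds are arbitrary assumptions that would need tuning per task and per loss scale.

```python
import math

def diagnose(train_loss, val_loss, good_enough=0.1, max_gap=0.1):
    """Rough heuristic; both thresholds are arbitrary and task-dependent."""
    if math.isnan(train_loss) or math.isnan(val_loss):
        return "nan: lower the learning rate and check the data for NaNs"
    if train_loss > good_enough:
        return "underfitting: more complex model, train longer, or better features"
    if val_loss - train_loss > max_gap:
        return "overfitting: add dropout or weight decay, more data, or a simpler model"
    return "ok"

print(diagnose(0.02, 0.45))          # train low, val high
print(diagnose(0.80, 0.85))          # both high
print(diagnose(float("nan"), 0.5))   # NaN loss
print(diagnose(0.03, 0.05))          # healthy run
```
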

Reference Navigation

For detailed content, see:

- Training Fundamentals: reference/training_fundamentals.md - Three Pillars, Gradient Descent from scratch
- Loss Functions: reference/loss_functions.md - MSE vs MAE, BCE vs CrossEntropy
- Optimization: reference/optimization.md - Adam vs SGD, Learning rate schedules
- Regularization: reference/regularization.md - Dropout, BatchNorm, Weight decay
- Classification NN: reference/classification_nn.md - MLP design, Architecture rules
- Autoencoders: reference/autoencoders.md - Standard AE, Denoising AE, Latent space, Similarity search

Common Mistakes to Avoid

1. Forgetting optimizer.zero_grad()

# WRONG: Gradients accumulate!
for epoch in range(epochs):
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()  # Gradients ADD to previous!
    optimizer.step()

# CORRECT: Clear gradients each step
for epoch in range(epochs):
    optimizer.zero_grad()  # Clear first!
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

2. Double Softmax

# WRONG: Softmax + CrossEntropyLoss
model = nn.Sequential(
    nn.Linear(64, 10),
    nn.Softmax(dim=1)  # DON'T DO THIS!
)
criterion = nn.CrossEntropyLoss()  # Applies softmax again!

# CORRECT: No softmax in model
model = nn.Sequential(
    nn.Linear(64, 10)
    # No activation - CrossEntropyLoss handles it
)
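
Why the double softmax hurts can be seen numerically. The sketch below uses a hand-written `softmax` in plain Python with made-up logits for a three-class problem. Applying softmax twice squashes even a very confident prediction toward uniform, so the loss cannot approach zero.

```python
import math

def softmax(values):
    shifted = [v - max(values) for v in values]   # subtract the max for stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 0.0, 0.0]     # a very confident, correct prediction for class 0
once = softmax(logits)        # what the loss should see
twice = softmax(once)         # what it sees with Softmax + CrossEntropyLoss

print("softmax once :", [round(p, 4) for p in once])
print("softmax twice:", [round(p, 4) for p in twice])
print("loss once :", -math.log(once[0]))
print("loss twice:", -math.log(twice[0]))
```

Because the second softmax only sees inputs between 0 and 1, the loss for three classes is stuck above roughly 0.55 here, and the gradients reaching the model are correspondingly weak.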

3. Wrong model.eval() / model.train()

# WRONG: Evaluating in train mode
predictions = model(X_test)  # Dropout still active!

# CORRECT: Switch modes
model.eval()  # Disable dropout; BatchNorm uses running statistics
with torch.no_grad():
    predictions = model(X_test)

model.train()  # Re-enable for training

4. Missing Activation Functions

# WRONG: No activations = just a linear transformation!
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.Linear(64, 32),  # This is just a bigger linear layer!
    nn.Linear(32, 10)
)

# CORRECT: Activations add non-linearity
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),          # Non-linearity!
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10)
)
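
The claim that stacked linear layers collapse into one can be checked directly. The sketch below uses small hand-picked matrices and plain-Python helpers (`matvec`, `matmul`, and `add` are written for this example) to show that `W2 (W1 x + b1) + b2` equals a single layer with `W = W2 W1` and `b = W2 b1 + b2`.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Two stacked linear layers with no activation in between
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, 0.0, -1.0]   # 2 -> 3
W2, b2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]], [0.0, 1.0]           # 3 -> 2

# The equivalent single linear layer
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

x = [2.0, -3.0]
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)
single = add(matvec(W, x), b)

print("stacked layers:", stacked)
print("single layer  :", single)
```
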

5. Vanishing Gradients with Sigmoid/Tanh

# PROBLEM: Sigmoid/Tanh saturate, gradients → 0
# In deep networks, gradients "vanish"

# SOLUTION: Use ReLU for hidden layers
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),          # Gradient = 1 for positive inputs
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10)
)
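
A back-of-the-envelope sketch of the effect: during backpropagation the gradient is multiplied by each layer's activation derivative. The code below ignores the weights entirely (a simplifying assumption) and multiplies only those derivatives across 10 layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 10
pre_activation = 0.0              # best case for sigmoid: derivative peaks at x = 0

sigmoid_chain = sigmoid_grad(pre_activation) ** depth
relu_chain = relu_grad(1.0) ** depth   # any positive pre-activation

print(f"sigmoid, {depth} layers: {sigmoid_chain:.2e}")
print(f"relu,    {depth} layers: {relu_chain:.2e}")
```

Even in the sigmoid's best case the gradient shrinks by a factor of four per layer, while ReLU passes it through unchanged for active units.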

Teaching Mode

When explaining deep learning concepts:

Backpropagation Analogy

"Imagine a factory assembly line. Each worker (layer) passes their work forward. If the final product is bad (high loss), we need to figure out which worker made what mistakes. Backprop sends 'blame' backward through the line - each worker learns how much they contributed to the error."

Gradient Descent Visual

You're blindfolded on a hilly landscape.
Goal: Find the lowest point.

Strategy (Gradient Descent):
1. Feel the slope under your feet
2. Take a step downhill
3. Repeat until flat (minimum)

Learning rate = step size
- Too big: Overshoot the minimum
- Too small: Takes forever
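
The effect of step size can be demonstrated on a toy problem. The sketch below minimizes `(w - 3)^2` with three learning rates chosen purely for illustration and reports how far each run ends from the minimum.

```python
def run_gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize (w - 3)^2 and return the final distance from the minimum."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)   # gradient of (w - 3)^2 is 2(w - 3)
    return abs(w - 3.0)

too_small = run_gradient_descent(0.001)
reasonable = run_gradient_descent(0.1)
too_big = run_gradient_descent(1.1)

print(f"lr=0.001: distance from minimum = {too_small:.4f}  (barely moved)")
print(f"lr=0.1  : distance from minimum = {reasonable:.4f}  (converging)")
print(f"lr=1.1  : distance from minimum = {too_big:.4f}  (overshooting, diverging)")
```
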

Overfitting vs Underfitting

     Underfit              Good Fit              Overfit
    (High Bias)           (Balanced)          (High Variance)

       ───                  ∿∿∿∿                ∿∿∿∿∿∿∿∿
      ● ● ●               ●  ●  ●              ●●●●●●●●
     ●     ●             ●      ●             ●        ●
    ●       ●           ●        ●

   "Too simple"        "Just right"        "Memorized noise"
   Train ❌ Test ❌     Train ✓ Test ✓      Train ✓ Test ❌
