levy-n

ml-teaching-assistant

Explains ML/DL concepts with analogies, visual diagrams, and progressive complexity. Covers backpropagation, gradient descent, attention mechanisms, neural networks, ML project methodology, and 50+ other concepts. Also provides the 5-step ML workflow, anti-patterns checklist, and model selection decision trees. Use when user says "explain", "I don't understand", "how does X work", "teach me", "why does", "what is the intuition", "how should I approach", "best practice", "common mistakes", "workflow", "methodology", or asks conceptual "why" questions about any ML topic. Provides intuitive explanations before math, ASCII visualizations, everyday analogies, and corrects common misconceptions.

Install

npx skillscat add levy-n/claude-useful-skills/ml-teaching-assistant

Install via the SkillsCat registry.

SKILL.md

ML Teaching Assistant - Learn ML/DL Concepts

ML/DL teacher: intuitive explanations, analogies, and corrections of common mistakes.

Teaching Philosophy

When explaining ANY ML/DL concept:

1. WHY    - What problem does this solve?
2. ANALOGY - Connect to everyday experience
3. VISUAL  - ASCII diagram or simple drawing
4. SIMPLE  - Minimal example (numbers, not variables)
5. MATH    - General formulation (only if needed)
6. CODE    - Runnable example
7. PITFALLS - Common misconceptions

Concept Explanation Framework

For Mathematical Concepts

1. Intuitive explanation (no math)
2. Visual representation
3. Simple numerical example (by hand)
4. General mathematical formulation
5. Code implementation

For Architecture Concepts

1. What problem it solves
2. High-level diagram (ASCII)
3. Component breakdown
4. Information flow walkthrough
5. Comparison to alternatives

For Debugging/Training Concepts

1. What "healthy" looks like
2. Symptoms of the problem
3. Common causes (ranked by likelihood)
4. Diagnostic steps
5. Solutions with code

Core Concept Explanations

Gradient Descent

Analogy: You're blindfolded on a hilly landscape, trying to find the lowest point.

Strategy:
1. Feel the slope under your feet (compute gradient)
2. Take a step downhill (opposite of gradient)
3. Repeat until the ground feels flat (a minimum, possibly only a local one)

Learning rate = step size
- Too big: Overshoot the valley, bounce around
- Too small: Takes forever to get there
- Just right: Steady descent to minimum

Visual:

    Loss
    ▲
    │  ●  Start (high loss)
    │   ╲
    │    ╲ Step 1
    │     ●
    │      ╲
    │       ╲ Step 2
    │        ●
    │         ╲
    │     ★    ● Step 3 (approaching minimum)
    │   Minimum
    └────────────────► Parameters

Simple Example:

Current position: w = 5
Gradient: dL/dw = 2 (slope is positive, so the loss rises as w increases)
Learning rate: lr = 0.1

New position: w = 5 - 0.1 × 2 = 4.8 (moved downhill!)
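
The single update above can be repeated in a loop. This is a minimal sketch, assuming a toy loss L(w) = (w - 4)² that was picked only because its gradient at w = 5 is 2, matching the numbers in the example; the loss and the function names are illustrative.

```python
def grad(w):
    """Gradient of the assumed toy loss L(w) = (w - 4)**2."""
    return 2 * (w - 4)


def gradient_descent(w, lr=0.1, steps=50):
    """Repeatedly step opposite to the gradient; returns every visited w."""
    path = [w]
    for _ in range(steps):
        w = w - lr * grad(w)  # same rule as the hand example: w - lr * dL/dw
        path.append(w)
    return path


path = gradient_descent(5.0)
print(path[1])   # first step: 5 - 0.1 * 2, i.e. about 4.8
print(path[-1])  # after 50 steps: very close to the minimum at w = 4
```

Each step shrinks the distance to the minimum by the same factor here, which is a property of this quadratic toy loss and not of losses in general.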

Backpropagation

Analogy: Factory assembly line blame assignment.

Factory: Input → Worker A → Worker B → Worker C → Product

Product is defective (high loss). Who's at fault?

Backprop traces blame backward:
1. Worker C clearly contributed (directly made final product)
2. Worker B affected what C received
3. Worker A affected what B received

Each worker learns: "How much did MY work affect the final defect?"
That's their gradient!

Visual:

Forward Pass:
Input ──► Layer 1 ──► Layer 2 ──► Layer 3 ──► Loss
          w₁          w₂          w₃

Backward Pass (Chain Rule):
∂L/∂w₁ ◄── ∂L/∂w₂ ◄── ∂L/∂w₃ ◄── ∂L/∂output

Key Insight: Backprop computes ALL gradients efficiently in one backward pass using the chain rule.
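
The backward pass in the diagram can be written out by hand for a tiny network. This is a sketch under simplifying assumptions: three scalar "layers" with no activation functions, a squared-error loss, and made-up weights. The finite-difference check at the end is only there to confirm the chain-rule results.

```python
def forward(w1, w2, w3, x, target):
    """Input -> Layer 1 -> Layer 2 -> Layer 3 -> Loss (all scalars)."""
    h1 = w1 * x
    h2 = w2 * h1
    y = w3 * h2
    loss = (y - target) ** 2
    return h1, h2, y, loss


def backward(w1, w2, w3, x, target):
    """One backward pass: reuse each upstream gradient via the chain rule."""
    h1, h2, y, _ = forward(w1, w2, w3, x, target)
    d_y = 2 * (y - target)   # dL/d(output)
    d_w3 = d_y * h2          # dL/dw3
    d_h2 = d_y * w3          # blame passed back to layer 2's output
    d_w2 = d_h2 * h1         # dL/dw2
    d_h1 = d_h2 * w2         # blame passed back to layer 1's output
    d_w1 = d_h1 * x          # dL/dw1
    return d_w1, d_w2, d_w3


def numerical_grads(w1, w2, w3, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    weights = [w1, w2, w3]
    grads = []
    for i in range(3):
        up, down = list(weights), list(weights)
        up[i] += eps
        down[i] -= eps
        loss_up = forward(*up, x, target)[3]
        loss_down = forward(*down, x, target)[3]
        grads.append((loss_up - loss_down) / (2 * eps))
    return tuple(grads)


args = (0.5, -1.5, 2.0, 1.0, 1.0)  # w1, w2, w3, x, target (made-up values)
analytic = backward(*args)
numeric = numerical_grads(*args)
print(analytic)  # gradients for w1, w2, w3 from one backward pass
print(numeric)   # should agree up to finite-difference error
```

The numerical check needs two forward passes per weight, while the backward pass gets every gradient at once. That difference is the efficiency the key insight refers to.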


Attention Mechanism

Analogy: Spotlight at a concert.

You're in the audience (Query).
Each performer on stage has something to offer (Values).
They each wave at you differently (Keys).

You compare your preferences (Query) with their waves (Keys)
to decide how much spotlight (attention) each gets.

The more attention a performer gets, the more their music
(Value) contributes to what you hear.

Visual:

Query: "What's happening to the subject?"
                           ┃
                           ▼
Sentence:  The   cat   sat   on    the   mat
            │     │     │     │     │     │
            ▼     ▼     ▼     ▼     ▼     ▼
Keys:       k₁    k₂    k₃    k₄    k₅    k₆
            │     │     │     │     │     │
Attention: 0.05  0.50  0.30  0.05  0.05  0.05  (sums to 1)
            │     │     │     │     │     │
Values:     v₁    v₂    v₃    v₄    v₅    v₆
            │     │     │     │     │     │
            └─────┴─────┴─────┴─────┴─────┘
                           │
                           ▼
                Weighted sum of values
            (mostly about "cat" and "sat")
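
The query/key/value flow in the diagram corresponds to scaled dot-product attention for a single query. This is a minimal sketch with made-up 2-dimensional vectors for three tokens. The one-hot values are an illustrative choice, so the output simply equals the attention weights.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Made-up vectors for three tokens; the second key matches the query best.
query = [1.0, 0.0]
keys = [[0.1, 0.9], [2.0, 0.1], [1.5, 0.3]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

weights, output = attend(query, keys, values)
print(weights)  # one weight per token, summing to 1
print(output)   # weighted sum of the value vectors
```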

Overfitting vs Underfitting

Analogy: Studying for an exam.

Underfitting (High Bias):
- Student who barely studied
- Can't answer any questions well
- Needs to study MORE

Overfitting (High Variance):
- Student who memorized answers word-for-word
- Aces practice exams perfectly
- Fails real exam (different wording)
- Needs to understand CONCEPTS, not memorize

Good Fit:
- Student who understood concepts
- Can handle new questions
- Generalizes well

Visual:

     Underfit              Good Fit              Overfit

        ───                  ∿∿∿∿                ∿∿∿∿∿∿∿∿
       ● ● ●               ●  ●  ●              ●●●●●●●●
      ●     ●             ●      ●             ●        ●
     ●       ●           ●        ●

   "Too simple"        "Just right"        "Memorized noise"
   Train ❌ Test ❌     Train ✓ Test ✓      Train ✓ Test ❌
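
The three regimes can be reproduced on toy data. This is a sketch under stated assumptions: the data is a noisy line (y = 2x + 1 plus Gaussian noise), the underfit model is a constant, the good fit is a least-squares line, and the overfit model is a 1-nearest-neighbour "memorizer". All of these are illustrative stand-ins.

```python
import random

rng = random.Random(0)


def make_data(xs):
    """Noisy samples from the assumed ground truth y = 2x + 1."""
    return [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in xs]


train = make_data([i / 10 for i in range(20)])
test = make_data([i / 10 + 0.05 for i in range(20)])


def fit_mean(data):
    """Underfit: always predict the average training target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Good fit for this data: ordinary least-squares line."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = sum((x - mx) * (y - my) for x, y in data) / sum(
        (x - mx) ** 2 for x, _ in data
    )
    return lambda x: my + slope * (x - mx)


def fit_memorizer(data):
    """Overfit: return the target of the nearest training point, noise included."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


models = {
    "mean": fit_mean(train),
    "line": fit_line(train),
    "memorizer": fit_memorizer(train),
}
train_mse = {name: mse(model, train) for name, model in models.items()}
test_mse = {name: mse(model, test) for name, model in models.items()}
for name in models:
    print(f"{name:10s} train MSE={train_mse[name]:.3f}  test MSE={test_mse[name]:.3f}")
```

The memorizer always scores a perfect 0 on the training set because it has stored the noise. How much worse it does on the test set depends on the noise level and the random seed.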

Neural Network Layers

Analogy: Assembly line workers.

Each layer = A team of workers
Each neuron = One worker with a specialty

Layer 1: Raw material processors (detect edges, colors)
Layer 2: Part assemblers (combine edges into textures)
Layer 3: Component builders (textures into parts like eyes)
Layer 4: Final assemblers (parts into objects like faces)

Deeper = More abstract understanding

Visual:

Input [x₁, x₂, x₃]
        │
        ▼
    ┌───────────┐
    │  Weights  │  W × x + b
    │     +     │
    │   Bias    │
    └───────────┘
        │
        ▼
   Activation f()    ← Non-linearity (ReLU, sigmoid)
        │
        ▼
Output [y₁, y₂]

Without activation: All layers collapse to one linear transformation!
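
The collapse can be checked numerically. This is a small sketch with made-up weight matrices and no biases (with biases the stack collapses to a single affine map in the same way): two linear layers give the same output as the single matrix W2·W1, while inserting ReLU between them changes the result.

```python
def matvec(W, x):
    """Matrix-vector product for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for plain Python lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def relu(v):
    return [max(0.0, a) for a in v]


W1 = [[1.0, -2.0, 0.5], [0.0, 1.0, -1.0]]  # 3 inputs -> 2 hidden units
W2 = [[2.0, 1.0], [-1.0, 3.0]]             # 2 hidden units -> 2 outputs
x = [1.0, 2.0, 3.0]

stacked = matvec(W2, matvec(W1, x))          # two layers, no activation
collapsed = matvec(matmul(W2, W1), x)        # one equivalent layer
with_relu = matvec(W2, relu(matvec(W1, x)))  # non-linearity in between

print(stacked, collapsed)  # identical outputs
print(with_relu)           # differs once the activation is added
```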

Common Misconceptions

Backpropagation

❌ Wrong | ✅ Correct
"Backprop sends errors backward" | "Backprop computes how much each weight contributed to the error using the chain rule"
"It's a type of neural network" | "It's an algorithm for computing gradients"

Attention Mechanism

❌ Wrong | ✅ Correct
"Attention looks at important words" | "Attention computes relevance scores between ALL position pairs to create context-aware representations"
"Self-attention is the same as attention" | "Self-attention: Q, K, V come from same sequence. Cross-attention: Q from one, K/V from another"

Dropout

❌ Wrong | ✅ Correct
"Dropout removes neurons permanently" | "Dropout randomly zeros neurons DURING TRAINING only"
"Higher dropout is always better" | "Too much dropout (>0.5) can prevent learning"

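The train-versus-inference behaviour of dropout can be sketched in a few lines. This assumes the "inverted dropout" convention, where surviving units are scaled by 1/(1-p) during training so that nothing needs rescaling at inference. The function name and signature are illustrative.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p while training and
    scale survivors by 1/(1-p); return the input unchanged at inference."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 1000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(sorted(set(train_out)))           # only zeros and rescaled survivors
print(sum(train_out) / len(train_out))  # average stays close to the original 1.0
print(eval_out == activations)          # inference leaves activations untouched
```
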
Batch Normalization

❌ Wrong | ✅ Correct
"BatchNorm is like StandardScaler" | "BatchNorm normalizes activations per batch AND has learnable scale/shift parameters"
"Put BatchNorm after activation" | "Debate exists, but often placed BEFORE activation: Linear → BatchNorm → ReLU"

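The difference from plain standardization is the learnable scale and shift. This is a minimal training-mode sketch for a single feature, with `gamma` and `beta` standing in for the learnable parameters. Running statistics for inference are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize one feature across the batch, then apply scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


batch = [1.0, 2.0, 3.0, 4.0]

standardized = batch_norm(batch, gamma=1.0, beta=0.0)  # behaves like plain scaling
rescaled = batch_norm(batch, gamma=2.0, beta=3.0)      # the network can undo or reshape it

print(standardized)
print(rescaled)
```

With gamma = 1 and beta = 0 the result matches ordinary standardization of the batch. Any other values let the layer learn a different mean and spread.
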
Loss Functions

❌ Wrong | ✅ Correct
"Use CrossEntropyLoss with softmax output" | "PyTorch's CrossEntropyLoss applies log-softmax internally. Pass raw logits; don't add softmax to your model!"
"MSE is always good for regression" | "MSE penalizes outliers heavily. Use MAE or Huber for outlier-robust regression"

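The effect of a redundant softmax can be seen without any framework. This is a sketch, assuming cross-entropy is defined as log-softmax followed by negative log-likelihood; the logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]


def cross_entropy(logits, target):
    """Log-softmax followed by negative log-likelihood, from raw logits."""
    return -log_softmax(logits)[target]


logits = [4.0, 1.0, 0.0]  # confident and correct prediction for class 0

correct = cross_entropy(logits, 0)          # raw logits, as intended
double = cross_entropy(softmax(logits), 0)  # softmax applied twice by mistake

print(correct)  # small loss for a confident, correct prediction
print(double)   # noticeably larger: the extra softmax flattens the scores
```

Because probabilities lie between 0 and 1, the doubled version can never express high confidence, so the loss stays well above zero however good the model is.
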
Learning Rate

❌ Wrong | ✅ Correct
"Smaller learning rate is safer" | "Too small = painfully slow or stalled training. Too large = divergence. It needs tuning!"
"Same learning rate for all layers" | "Fine-tuning often uses lower LR for pretrained layers, higher for new layers"

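Both failure modes show up even on the simplest loss. This is a sketch assuming the toy loss L(w) = w² with its minimum at 0; the three learning rates are arbitrary illustrative values.

```python
def distance_after(lr, steps=50, w=1.0):
    """Gradient descent on the toy loss L(w) = w**2 (gradient 2w);
    returns how far w still is from the minimum at 0."""
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)


too_small = distance_after(0.001)  # barely moves in 50 steps
reasonable = distance_after(0.3)   # converges quickly
too_large = distance_after(1.1)    # every step overshoots further: divergence

print(too_small, reasonable, too_large)
```
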
Visual Explanation Templates

CNN Layer Progression

Layer 1: Edges        Layer 2: Textures     Layer 3: Parts        Layer 4: Objects
┌─────┐              ┌─────┐               ┌─────┐               ┌─────┐
│ / \ │              │░░░░░│               │ 👁️  │               │ 😺  │
│ ─ │ │              │▓▓▓▓▓│               │ 👃  │               │ 🐕  │
└─────┘              └─────┘               └─────┘               └─────┘

RNN Unrolled

Time:    t-1        t          t+1
          │         │           │
          ▼         ▼           ▼
Input:   x_{t-1}    x_t        x_{t+1}
          │         │           │
          ▼         ▼           ▼
        ┌───┐     ┌───┐       ┌───┐
h_{t-2}→│ h │────→│ h │──────→│ h │→ h_{t+1}
        └───┘     └───┘       └───┘
          │         │           │
          ▼         ▼           ▼
Output:  y_{t-1}    y_t        y_{t+1}

Hidden state h carries information through time
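
The unrolled diagram corresponds to calling the same cell once per time step. This is a minimal scalar sketch, assuming a vanilla tanh RNN with made-up weights. The output y is omitted so that only the hidden state is tracked.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar vanilla RNN: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0, w_x=1.0, w_h=0.5, b=0.0):
    """Apply the same cell (same weights) at every time step."""
    h, states = h0, []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


with_signal = run_rnn([1.0, 0.0, 0.0])  # input only at the first step
no_signal = run_rnn([0.0, 0.0, 0.0])

print(with_signal)  # later states are still non-zero: the first input is remembered
print(no_signal)    # all zeros
```

The remembered value shrinks at every step here, which hints at why plain RNNs struggle with long-range dependencies.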

Transformer Self-Attention

Input: "The cat sat"

        Q    K    V
        │    │    │
        ▼    ▼    ▼
     ┌──────────────┐
     │   Attention  │
     │    Scores    │
     │  ┌─────────┐ │
     │  │.1 .3 .6 │ │  ← "The" attends most to "sat"
     │  │.2 .1 .7 │ │  ← "cat" attends most to "sat"
     │  │.3 .5 .2 │ │  ← "sat" attends most to "cat"
     │  └─────────┘ │
     └──────────────┘
            │
            ▼
    Weighted sum of Values

ML Project Methodology (5-Step Workflow)

Every ML project should follow this structured approach:

Step 1: Understand the Problem

BEFORE ANY CODE, answer:
- Prediction type: regression / classification / clustering?
- Data type: tabular / images / text / mixed?
- Data size: <1K / 1K-100K / >100K?
- Constraints: interpretability / latency / compute?

Step 2: EDA (Never Skip!)

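# Assumes `df` is a pandas DataFrame with a 'target' column
print(f"Shape: {df.shape}")
df.info()  # prints its own summary and returns None, so don't wrap it in print()
print(f"Missing:\n{df.isnull().sum()}")
print(f"Target distribution:\n{df['target'].value_counts()}")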

Step 3: Preprocessing

  • Split FIRST, fit scaler on train ONLY, transform all
  • Encoding: one-hot for low-cardinality features (<10 categories); integer codes (e.g. pandas factorize) otherwise, mainly for tree-based models
  • Check for data leakage!
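
The "split first, fit on train only" rule can be sketched without any library. This assumes a single numeric feature with made-up values and a plain 80/20 split. In practice scikit-learn's `StandardScaler` follows the same fit-then-transform pattern.

```python
import math
import random


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def transform(values, mean, std):
    """Apply previously learned statistics; never re-fit here."""
    return [(v - mean) / std for v in values]


rng = random.Random(42)
feature = [rng.gauss(50, 10) for _ in range(100)]  # made-up feature column

# 1. Split FIRST
split = int(0.8 * len(feature))
train, test = feature[:split], feature[split:]

# 2. Fit the scaler on the training part ONLY
mean, std = fit_scaler(train)

# 3. Transform every split with the train statistics
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(sum(train_scaled) / len(train_scaled))  # about 0 by construction
print(sum(test_scaled) / len(test_scaled))    # not exactly 0, and that is fine
```

Fitting on the full column before splitting would let test-set statistics influence training, which is the leakage the checklist warns about.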

Step 4: Model Selection

Tabular:
├─ Regression → Linear → Ridge → XGBoost
└─ Classification → Logistic → XGBoost → NN (if >10K samples)

Images → Start with Transfer Learning + Augmentation

Text:
├─ No labeled data → Zero-Shot
├─ Small labeled data → Embeddings + sklearn
└─ Large labeled data → Fine-tuning

Step 5: Training & Evaluation

  1. BASELINE FIRST (DummyClassifier, mean predictor)
  2. Cross-validation (k=5)
  3. Compare to a shuffled-labels baseline (a model trained on permuted labels should score near chance)
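
The first two items can be sketched in plain Python. The majority-class predictor below is a hand-rolled stand-in for what `DummyClassifier` is used for, and the fold helper is a simplified, unshuffled k-fold. Both function names are illustrative, and the labels are made up.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


labels = [0] * 90 + [1] * 10  # imbalanced toy labels

baseline = majority_baseline_accuracy(labels, labels)
folds = list(k_fold_indices(len(labels), k=5))

print(baseline)    # accuracy that any real model has to beat
print(len(folds))  # number of train/validation splits
```

With labels this imbalanced, a model that reports 90% accuracy has learned nothing beyond the majority class. Real data should be shuffled (and usually stratified) before folding, which this sketch skips.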

Critical Anti-Patterns

Do:

  • BCEWithLogitsLoss (NOT BCELoss!) - sigmoid only at inference
  • model.eval() + with torch.no_grad(): for evaluation
  • Fit scaler on train ONLY, transform all sets
  • Always check class balance before training
  • Report metrics WITH baseline comparison

Don't:

  • Skip EDA and jump to modeling
  • Fit scaler on full data before split (= DATA LEAKAGE!)
  • Use BCELoss without manual sigmoid
  • Ignore class imbalance
  • Report accuracy without baseline comparison
  • Train without validation set
  • Assume the model learned real signal without running a shuffled-labels baseline test
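
The reason for preferring a loss that works on logits is numerical. This is a sketch in plain Python comparing a naive sigmoid-then-log computation with the standard stable rewrite. Both functions are illustrative re-implementations and not PyTorch's own code.

```python
import math


def bce_naive(logit, target):
    """Sigmoid first, then log: breaks down for large-magnitude logits."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Both versions agree for moderate logits.
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# Confidently wrong prediction: the sigmoid rounds to exactly 1.0,
# so the naive version ends up taking log(0).
try:
    naive_result = bce_naive(40.0, 0.0)
except ValueError:
    naive_result = None
stable_result = bce_with_logits(40.0, 0.0)

print(naive_result)   # None: the naive computation failed
print(stable_result)  # finite loss, roughly equal to the logit
```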

Self-Check Before Completing Any ML Task

[ ] Did I explore the data first?
[ ] Does my code include baseline comparison?
[ ] Did I warn about common pitfalls?
[ ] Is the code complete and runnable?
[ ] Did I explain WHY I chose this approach?

Reference Navigation

For detailed explanations, see:

  • Concept Explanations: reference/concept_explanations.md - 50+ concepts explained
  • Visual Analogies: reference/visual_analogies.md - Analogy library by topic
  • Common Misconceptions: reference/common_misconceptions.md - 30+ myths corrected
  • Practice Exercises: reference/practice_exercises.md - Self-test questions

How to Use This Skill

User says: "I don't understand backpropagation"
Response should:

  1. Start with factory analogy
  2. Show visual of forward/backward pass
  3. Walk through simple 2-layer example
  4. Connect to PyTorch: loss.backward() does all this automatically!
  5. Mention common misconception: "It's not sending errors backward, it's computing gradients"

User says: "Why use ReLU instead of sigmoid?"
Response should:

  1. Explain vanishing gradient problem with sigmoid
  2. Show how ReLU gradient is 1 for positive inputs
  3. Mention dead ReLU problem and LeakyReLU alternative
  4. Recommend: "Use ReLU by default, LeakyReLU if dead neurons are a problem"
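
The gradient argument in this answer can be backed with numbers. This is a rough sketch that looks only at the activation-derivative factors in a 10-layer chain and ignores the weights. The depth and the LeakyReLU slope of 0.01 are illustrative choices.

```python
import math


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # zero for negative inputs: the "dead ReLU" case


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope  # small but non-zero for negative inputs


depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case for sigmoid at every layer
relu_chain = relu_grad(2.0) ** depth        # positive inputs at every layer

print(sigmoid_chain)  # shrinks toward zero as depth grows
print(relu_chain)     # stays at 1.0
print(relu_grad(-1.0), leaky_relu_grad(-1.0))
```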