Explains ML/DL concepts with analogies, visual diagrams, and progressive complexity. Covers backpropagation, gradient descent, attention mechanisms, neural networks, ML project methodology, and 50+ other concepts. Also provides the 5-step ML workflow, anti-patterns checklist, and model selection decision trees. Use when user says 'explain', 'I don't understand', 'how does X work', 'teach me', 'why does', 'what is the intuition', 'how should I approach', 'best practice', 'common mistakes', 'workflow', 'methodology', or asks conceptual 'why' questions about any ML topic. Provides intuitive explanations before math, ASCII visualizations, everyday analogies, and corrects common misconceptions.
Install via the SkillsCat registry: npx skillscat add levy-n/claude-useful-skills/ml-teaching-assistant
ML Teaching Assistant - Learn ML/DL Concepts
ML/DL teacher: intuitive explanations, analogies, and correction of common mistakes.
Teaching Philosophy
When explaining ANY ML/DL concept:
1. WHY - What problem does this solve?
2. ANALOGY - Connect to everyday experience
3. VISUAL - ASCII diagram or simple drawing
4. SIMPLE - Minimal example (numbers, not variables)
5. MATH - General formulation (only if needed)
6. CODE - Runnable example
7. PITFALLS - Common misconceptions
Concept Explanation Framework
For Mathematical Concepts
1. Intuitive explanation (no math)
2. Visual representation
3. Simple numerical example (by hand)
4. General mathematical formulation
5. Code implementation
For Architecture Concepts
1. What problem it solves
2. High-level diagram (ASCII)
3. Component breakdown
4. Information flow walkthrough
5. Comparison to alternatives
For Debugging/Training Concepts
1. What "healthy" looks like
2. Symptoms of the problem
3. Common causes (ranked by likelihood)
4. Diagnostic steps
5. Solutions with code
Core Concept Explanations
Gradient Descent
Analogy: You're blindfolded on a hilly landscape, trying to find the lowest point.
Strategy:
1. Feel the slope under your feet (compute gradient)
2. Take a step downhill (opposite of gradient)
3. Repeat until flat (minimum found)
Learning rate = step size
- Too big: Overshoot the valley, bounce around
- Too small: Takes forever to get there
- Just right: Steady descent to minimum
Visual:
Loss
 ▲
 │ ● Start (high loss)
 │  ╲
 │   ╲ Step 1
 │    ●
 │     ╲
 │      ╲ Step 2
 │       ●
 │        ╲
 │         ★ ● Step 3 (approaching minimum)
 │      Minimum
 └────────────────► Parameters
Simple Example:
Current position: w = 5
Gradient: dL/dw = 2 (slope is positive, going uphill)
Learning rate: lr = 0.1
New position: w = 5 - 0.1 × 2 = 4.8 (moved downhill!)
Backpropagation
Analogy: Factory assembly line blame assignment.
Factory: Input → Worker A → Worker B → Worker C → Product
Product is defective (high loss). Who's at fault?
Backprop traces blame backward:
1. Worker C clearly contributed (directly made final product)
2. Worker B affected what C received
3. Worker A affected what B received
Each worker learns: "How much did MY work affect the final defect?"
That's their gradient!
Visual:
Forward Pass:
Input ──► Layer 1 ──► Layer 2 ──► Layer 3 ──► Loss
            w₁          w₂          w₃
Backward Pass (Chain Rule):
∂L/∂w₁ ◄── ∂L/∂w₂ ◄── ∂L/∂w₃ ◄── ∂L/∂output
Key Insight: Backprop computes ALL gradients efficiently in one backward pass using the chain rule.
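The chain-rule bookkeeping described above can be sketched by hand on a tiny network. This is a minimal illustration under stated assumptions: the two-weight scalar network, the squared-error loss, and the names forward and backward are invented for the example and are not taken from the skill's reference files. A finite-difference estimate is included as a sanity check.

```python
# Minimal sketch: manual backprop through a tiny scalar network.
# Network shape, values, and function names are illustrative assumptions.

def forward(x, w1, w2, target):
    h = w1 * x                # layer 1 (scalar linear unit)
    a = max(0.0, h)           # ReLU activation
    y = w2 * a                # layer 2
    loss = (y - target) ** 2  # squared error
    return h, a, y, loss

def backward(x, w2, target, h, a, y):
    # Apply the chain rule from the loss back toward the input.
    dL_dy = 2.0 * (y - target)
    dL_dw2 = dL_dy * a                        # because y = w2 * a
    dL_da = dL_dy * w2
    dL_dh = dL_da * (1.0 if h > 0 else 0.0)   # ReLU derivative
    dL_dw1 = dL_dh * x                        # because h = w1 * x
    return dL_dw1, dL_dw2

x, w1, w2, target = 2.0, 0.5, -1.5, 1.0
h, a, y, loss = forward(x, w1, w2, target)
grad_w1, grad_w2 = backward(x, w2, target, h, a, y)

# Finite-difference check: nudge each weight and watch the loss change.
eps = 1e-6
num_w1 = (forward(x, w1 + eps, w2, target)[3]
          - forward(x, w1 - eps, w2, target)[3]) / (2 * eps)
num_w2 = (forward(x, w1, w2 + eps, target)[3]
          - forward(x, w1, w2 - eps, target)[3]) / (2 * eps)

print(f"analytic: dL/dw1={grad_w1:.4f}, dL/dw2={grad_w2:.4f}")
print(f"numeric:  dL/dw1={num_w1:.4f}, dL/dw2={num_w2:.4f}")
```

Note that one backward sweep reuses dL_dy for both weights, which is the efficiency the key insight refers to.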
Attention Mechanism
Analogy: Spotlight at a concert.
You're in the audience (Query).
Each performer on stage has something to offer (Values).
They each wave at you differently (Keys).
You compare your preferences (Query) with their waves (Keys)
to decide how much spotlight (attention) each gets.
The more attention a performer gets, the more their music
(Value) contributes to what you hear.
Visual:
Query: "What's happening to the subject?"
                   ┃
                   ▼
Sentence:   "The   cat   sat   on    the   mat"
              │     │     │     │     │     │
              ▼     ▼     ▼     ▼     ▼     ▼
Keys:         k₁    k₂    k₃    k₄    k₅    k₆
              │     │     │     │     │     │
Attention:   0.05  0.50  0.30  0.05  0.05  0.05   (sums to 1)
              │     │     │     │     │     │
Values:       v₁    v₂    v₃    v₄    v₅    v₆
              │     │     │     │     │     │
              └─────┴─────┴──┬──┴─────┴─────┘
                             │
                             ▼
                  Weighted sum of values
               (mostly about "cat" and "sat")
Overfitting vs Underfitting
Analogy: Studying for an exam.
Underfitting (High Bias):
- Student who barely studied
- Can't answer any questions well
- Needs to study MORE
Overfitting (High Variance):
- Student who memorized answers word-for-word
- Aces practice exams perfectly
- Fails real exam (different wording)
- Needs to understand CONCEPTS, not memorize
Good Fit:
- Student who understood concepts
- Can handle new questions
- Generalizes well
Visual:
Underfit Good Fit Overfit
─── ∿∿∿∿ ∿∿∿∿∿∿∿∿
● ● ● ● ● ● ●●●●●●●●
● ● ● ● ● ●
● ● ● ●
"Too simple" "Just right" "Memorized noise"
Train ❌ Test ❌    Train ✓ Test ✓    Train ✓ Test ❌
Neural Network Layers
Analogy: Assembly line workers.
Each layer = A team of workers
Each neuron = One worker with a specialty
Layer 1: Raw material processors (detect edges, colors)
Layer 2: Part assemblers (combine edges into textures)
Layer 3: Component builders (textures into parts like eyes)
Layer 4: Final assemblers (parts into objects like faces)
Deeper = More abstract understanding
Visual:
Input [x₁, x₂, x₃]
        │
        ▼
  ┌───────────┐
  │  Weights  │  W × x + b
  │     +     │
  │   Bias    │
  └───────────┘
        │
        ▼
  Activation f() ← Non-linearity (ReLU, sigmoid)
        │
        ▼
Output [y₁, y₂]
Without activation: All layers collapse to one linear transformation!
Common Misconceptions
Backpropagation
| ❌ Wrong | ✅ Correct |
|---|---|
| "Backprop sends errors backward" | "Backprop computes how much each weight contributed to the error using chain rule" |
| "It's a type of neural network" | "It's an algorithm for computing gradients" |
Attention Mechanism
| ❌ Wrong | ✅ Correct |
|---|---|
| "Attention looks at important words" | "Attention computes relevance scores between ALL position pairs to create context-aware representations" |
| "Self-attention is the same as attention" | "Self-attention: Q, K, V come from same sequence. Cross-attention: Q from one, K/V from another" |
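To make "relevance scores between ALL position pairs" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: Q, K, and V are all the raw token vectors (real models apply learned projections first), and the 2-dimensional embeddings are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Self-attention where Q = K = V = tokens (no learned projections)."""
    d = len(tokens[0])
    weights, outputs = [], []
    for q in tokens:  # every position acts as a query...
        # ...and is scored against every position's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                        for c in range(d)])
    return weights, outputs

# Made-up 2-d embeddings for three tokens (illustrative only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])  # one row of weights per query position
```

The result is a full n × n weight matrix, one row per query, which is why attention is better described as pairwise relevance scoring than as "looking at important words".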
Dropout
| ❌ Wrong | ✅ Correct |
|---|---|
| "Dropout removes neurons permanently" | "Dropout randomly zeros neurons DURING TRAINING only" |
| "Higher dropout is always better" | "Too much dropout (>0.5) can prevent learning" |
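The training-only behavior in the table can be sketched with "inverted dropout", a common formulation in which kept activations are rescaled during training so that nothing needs to change at inference. The function name and signature below are illustrative assumptions, not a library API.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout sketch: zero each value with probability p in training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)   # evaluation: pass through unchanged
    if rng is None:
        rng = random.Random()
    keep = 1.0 - p
    # Kept values are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, p=0.5, training=True, rng=random.Random(0))
eval_out = dropout(acts, p=0.5, training=False)

print("train:", train_out)  # some entries zeroed, the rest doubled
print("eval: ", eval_out)   # identical to the input
```

In a framework, switching between these two modes is what calling the model's evaluation mode does for dropout layers.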
Batch Normalization
| ❌ Wrong | ✅ Correct |
|---|---|
| "BatchNorm is like StandardScaler" | "BatchNorm normalizes activations per batch AND has learnable scale/shift parameters" |
| "Put BatchNorm after activation" | "Debate exists, but often placed BEFORE activation: Linear → BatchNorm → ReLU" |
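A rough sketch of what sets BatchNorm apart from a one-time scaler: statistics come from the current batch, and the result is then rescaled by learnable parameters. The code below handles a single feature in training mode only; running statistics for inference are omitted, and the names gamma and beta follow common convention rather than anything stated in this document.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then apply scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n   # biased batch variance
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # gamma and beta would be learned by gradient descent in a real layer.
    return [gamma * z + beta for z in normalized]

batch = [1.0, 2.0, 3.0, 4.0]
plain = batch_norm(batch)                          # roughly mean 0, variance 1
shifted = batch_norm(batch, gamma=2.0, beta=5.0)   # the layer can undo the normalization

print([round(v, 3) for v in plain])
print([round(v, 3) for v in shifted])
```

Because gamma and beta are trainable, the network can choose whatever output scale works best, which a fixed preprocessing scaler cannot do.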
Loss Functions
| ❌ Wrong | ✅ Correct |
|---|---|
| "Use CrossEntropyLoss with softmax output" | "PyTorch's CrossEntropyLoss applies log-softmax internally. Pass raw logits; don't add softmax to your model!" |
| "MSE is always good for regression" | "MSE penalizes outliers heavily. Use MAE or Huber for outlier-robust regression" |
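The softmax pitfall in the first row can be illustrated without any framework. This is a plain-Python sketch of cross-entropy computed from logits; the helper names and the example logits are assumptions for illustration. Feeding already-softmaxed values into a loss that applies softmax again yields a different, misleading number.

```python
import math

def log_softmax(logits):
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]

def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy that applies log-softmax itself, so it expects raw logits."""
    return -log_softmax(logits)[target_index]

logits = [2.0, 0.5, -1.0]   # made-up model outputs; class 0 is the true label
correct = cross_entropy_from_logits(logits, 0)
double = cross_entropy_from_logits(softmax(logits), 0)   # softmax applied twice

print(f"raw logits:     {correct:.4f}")
print(f"double softmax: {double:.4f}")
```

With softmax applied twice, the inputs are squashed into [0, 1], so the loss stays high even for a confident correct prediction.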
Learning Rate
| ❌ Wrong | ✅ Correct |
|---|---|
| "Smaller learning rate is safer" | "Too small = painfully slow or stalled training. Too large = divergence. Need to tune!" |
| "Same learning rate for all layers" | "Fine-tuning often uses lower LR for pretrained layers, higher for new layers" |
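The first row of the table can be checked numerically. The sketch below reuses the w = 5 starting point from the Gradient Descent section and assumes a loss of L(w) = (w - 4)², chosen only because it gives dL/dw = 2 at w = 5; the three learning rates are illustrative.

```python
def gradient_descent(lr, w=5.0, steps=50):
    """Minimize the assumed loss L(w) = (w - 4)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (w - 4.0)   # dL/dw
        w = w - lr * grad        # step against the gradient
    return w

first_step = gradient_descent(lr=0.1, steps=1)   # 5 - 0.1 * 2
too_small = gradient_descent(lr=0.001)
just_right = gradient_descent(lr=0.1)
too_large = gradient_descent(lr=1.1)

print(f"one step at lr=0.1:   w = {first_step:.4f}")
print(f"lr=0.001 (too small): w = {too_small:.4f}")
print(f"lr=0.1 (reasonable):  w = {just_right:.4f}")
print(f"lr=1.1 (too large):   w = {too_large:.1f}")
```

After the same 50 steps, the small rate has barely moved toward w = 4, the moderate rate has essentially arrived, and the large rate has overshot further on every step.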
Visual Explanation Templates
CNN Layer Progression
Layer 1: Edges Layer 2: Textures Layer 3: Parts Layer 4: Objects
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ / \ │ │░░░░░│ │ 👁️ │ │ 😺 │
│ ─ │ │ │▓▓▓▓▓│ │ 👃 │ │ 🐕 │
└─────┘ └─────┘ └─────┘ └─────┘
RNN Unrolled
Time:        t-1        t        t+1
              │         │         │
              ▼         ▼         ▼
Input:     x_{t-1}     x_t     x_{t+1}
              │         │         │
              ▼         ▼         ▼
            ┌───┐     ┌───┐     ┌───┐
h_{t-2} ──→ │ h │ ──→ │ h │ ──→ │ h │ ──→ h_{t+1}
            └───┘     └───┘     └───┘
              │         │         │
              ▼         ▼         ▼
Output:    y_{t-1}     y_t     y_{t+1}
Hidden state h carries information through time
Transformer Self-Attention
Input: "The cat sat"
  Q     K     V
  │     │     │
  ▼     ▼     ▼
┌──────────────┐
│  Attention   │
│    Scores    │
│ ┌──────────┐ │
│ │.1  .3  .6│ │ ← "The" attends most to "sat"
│ │.2  .1  .7│ │ ← "cat" attends most to "sat"
│ │.3  .5  .2│ │ ← "sat" attends most to "cat"
│ └──────────┘ │
└──────────────┘
       │
       ▼
Weighted sum of Values
ML Project Methodology (5-Step Workflow)
Every ML project should follow this structured approach:
Step 1: Understand the Problem
BEFORE ANY CODE, answer:
- Prediction type: regression / classification / clustering?
- Data type: tabular / images / text / mixed?
- Data size: <1K / 1K-100K / >100K?
- Constraints: interpretability / latency / compute?
Step 2: EDA (Never Skip!)
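# Assumes df is an existing pandas DataFrame with a 'target' column
print(f"Shape: {df.shape}")
df.info()  # info() prints its summary itself and returns None, so don't wrap it in print()
print(f"Missing:\n{df.isnull().sum()}")
print(f"Target distribution:\n{df['target'].value_counts()}")
Step 3: Preprocessing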
- Split FIRST, fit scaler on train ONLY, transform all
- Encoding: OneHot (<10 categories) or Factorize
- Check for data leakage!
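The "split first, fit on train only" rule can be sketched with the standard library. The helpers below (split_rows, fit_scaler, transform) are hypothetical stand-ins for whatever splitting and scaling utilities a project actually uses, and the data is synthetic.

```python
import random
import statistics

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle and split BEFORE any statistics are computed."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]   # train, test

def fit_scaler(values):
    """Learn mean and standard deviation from the training values only."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0   # guard against zero variance
    return mean, std

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

data = [float(v) for v in range(1, 21)]   # synthetic single feature
train, test = split_rows(data)

mean, std = fit_scaler(train)                 # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)      # test reuses the train statistics

print(f"train mean after scaling: {statistics.mean(train_scaled):.3f}")
print(f"test mean after scaling:  {statistics.mean(test_scaled):.3f}")
```

The test set's scaled mean is generally not zero, and that is expected: it was never allowed to influence the scaler.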
Step 4: Model Selection
Tabular:
├─ Regression → Linear → Ridge → XGBoost
└─ Classification → Logistic → XGBoost → NN (if >10K samples)
Images → Always Transfer Learning + Augmentation
Text:
├─ No labeled data → Zero-Shot
├─ Small labeled data → Embeddings + sklearn
└─ Large labeled data → Fine-tuning
Step 5: Training & Evaluation
- BASELINE FIRST (DummyClassifier, mean predictor)
- Cross-validation (k=5)
- Compare to shuffled labels baseline
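A short sketch of why the baseline comes first. It uses made-up, imbalanced labels and a hypothetical set of model predictions; the majority-class predictor plays the role a dummy classifier would play in a real pipeline.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return the most frequent training label, used as a constant prediction."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic, imbalanced labels (90% class 0) for illustration.
train_labels = [0] * 90 + [1] * 10
test_labels = [0] * 18 + [1] * 2

# Hypothetical model output: 18 of 20 predictions are correct.
model_preds = [0] * 17 + [1] + [0, 1]

baseline_label = majority_baseline(train_labels)
baseline_acc = accuracy(test_labels, [baseline_label] * len(test_labels))
model_acc = accuracy(test_labels, model_preds)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
print(f"lift over baseline: {model_acc - baseline_acc:+.2f}")
```

Here the model's accuracy sounds respectable on its own, yet it matches the constant predictor exactly, which is the situation a baseline comparison is meant to expose.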
Critical Anti-Patterns
Do:
- BCEWithLogitsLoss (NOT BCELoss!) - sigmoid only at inference
- model.eval() + with torch.no_grad(): for evaluation
- Fit scaler on train ONLY, transform all sets
- Always check class balance before training
- Report metrics WITH baseline comparison
Don't:
- Skip EDA and jump to modeling
- Fit scaler on full data before split (= DATA LEAKAGE!)
- Use BCELoss on raw logits (it expects probabilities, so it needs a manual sigmoid first)
- Ignore class imbalance
- Report accuracy without baseline comparison
- Train without validation set
- Assume model learned without shuffled baseline test
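One reason a logits-based binary loss is preferred can be shown in plain Python. The sketch compares a naive "sigmoid, then log" computation with a standard numerically stable formulation, max(x, 0) - x·y + log(1 + e^(-|x|)). Both functions are illustrative and are not claimed to be PyTorch's exact implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Sigmoid followed by log; breaks down when the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Moderate logit: both versions agree.
print(bce_with_logits(2.0, 1.0), naive_bce(2.0, 1.0))

# Confidently wrong prediction: sigmoid(40) rounds to exactly 1.0 in float64,
# so the naive version ends up taking log(0).
stable = bce_with_logits(40.0, 0.0)
try:
    naive = naive_bce(40.0, 0.0)
except ValueError:
    naive = None   # math.log(0.0) raises a domain error

print("stable:", stable)
print("naive: ", naive)
```

The stable form still returns a large, finite loss for the confidently wrong case, so training can recover from it.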
Self-Check Before Completing Any ML Task
[ ] Did I explore the data first?
[ ] Does my code include baseline comparison?
[ ] Did I warn about common pitfalls?
[ ] Is the code complete and runnable?
[ ] Did I explain WHY I chose this approach?
Reference Navigation
For detailed explanations, see:
- Concept Explanations: reference/concept_explanations.md - 50+ concepts explained
- Visual Analogies: reference/visual_analogies.md - Analogy library by topic
- Common Misconceptions: reference/common_misconceptions.md - 30+ myths corrected
- Practice Exercises: reference/practice_exercises.md - Self-test questions
How to Use This Skill
User says: "I don't understand backpropagation"
Response should:
- Start with factory analogy
- Show visual of forward/backward pass
- Walk through simple 2-layer example
- Connect to PyTorch: loss.backward() does all this automatically!
- Mention common misconception: "It's not sending errors backward, it's computing gradients"
User says: "Why use ReLU instead of sigmoid?"
Response should:
- Explain vanishing gradient problem with sigmoid
- Show how ReLU gradient is 1 for positive inputs
- Mention dead ReLU problem and LeakyReLU alternative
- Recommend: "Use ReLU by default, LeakyReLU if dead neurons are a problem"
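A response to the ReLU question could include a small numeric illustration like the one below. It is a simplified sketch: it multiplies activation derivatives across layers and ignores the weight matrices, which also affect gradient size in a real network. The depth of 10 and the LeakyReLU slope of 0.01 are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x = 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

depth = 10
# Best case for sigmoid: every unit sits at x = 0, where its gradient is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_chain = relu_grad(1.0) ** depth

print(f"sigmoid, {depth} layers: {sigmoid_chain:.2e}")
print(f"ReLU, {depth} layers:    {relu_chain:.2e}")
print(f"ReLU at x=-1: {relu_grad(-1.0)}, LeakyReLU at x=-1: {leaky_relu_grad(-1.0)}")
```

The last line shows the dead ReLU issue: a negative pre-activation gives ReLU a gradient of exactly zero, while LeakyReLU keeps a small nonzero slope.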