levy-n

ml-fundamentals

Implements classical ML algorithms for regression and classification. Covers Linear/Polynomial/Logistic Regression, Decision Trees, Ridge/Lasso regularization, train/test splits, cross-validation, and evaluation metrics (R², RMSE, Precision, Recall, F1, ROC-AUC, Confusion Matrix). Use when building predictive models on tabular data, comparing baseline algorithms, handling imbalanced data, or when user mentions 'regression', 'classification', 'overfitting', 'cross-validation', 'confusion matrix', 'feature importance', 'precision/recall', or 'regularization'.

levy-n 10 1 Updated 4mo ago

Install

npx skillscat add levy-n/claude-useful-skills/ml-fundamentals

Install via the SkillsCat registry.

SKILL.md

ML Fundamentals - Classical Machine Learning

Classical Machine Learning fundamentals: regression, classification, and model evaluation.

Quick Start

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# 1. Split data (ALWAYS stratify for classification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Scale features (fit on train ONLY!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = scaler.transform(X_test)        # transform ONLY

# 3. Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 4. Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

When This Skill Activates

Use this skill when:

  • Building regression models (predicting continuous values)
  • Building classification models (predicting categories)
  • Comparing baseline algorithms before deep learning
  • Evaluating model performance with metrics
  • Engineering features from tabular data
  • Handling class imbalance problems

Core Patterns

Pattern 1: Model Selection by Problem Type

Problem Type → Recommended Models
├── Regression (continuous target)
│   ├── Linear Regression (baseline)
│   ├── Ridge/Lasso (with regularization)
│   └── Polynomial (non-linear relationships)
│
├── Binary Classification
│   ├── Logistic Regression (baseline, interpretable)
│   └── Decision Tree (non-linear, interpretable)
│
└── Multi-class Classification
    ├── Logistic Regression (multinomial by default in recent scikit-learn; one-vs-rest also possible)
    └── Decision Tree (native multi-class)
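
The regression branch of the tree above can be sketched as a small comparison loop. This is a minimal illustration, not a recommended benchmark: the synthetic quadratic data, the `candidates` dictionary, and the `alpha` values are all assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): a quadratic relationship plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each candidate is a pipeline, so scaling is fit on the training split only.
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "poly2": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    ),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    pred = estimator.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = {"r2": float(r2_score(y_test, pred)), "rmse": rmse}
    print(f"{name:>6}: R2={results[name]['r2']:.3f}  RMSE={rmse:.3f}")
```

On data like this, the polynomial pipeline should score noticeably better than the straight-line baselines, which is the signal to look for when deciding whether a non-linear model is worth the extra complexity.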

Pattern 2: Train/Test Split Rules

# Classification: ALWAYS use stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With validation set (recommended)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, stratify=y_temp, random_state=42
)
# Results in roughly a 70/15/15 split (0.176 ≈ 0.15 / 0.85,
# because the second split is taken from the remaining 85%)
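
The 0.176 in the second split is simply the validation fraction divided by what remains after the test split. A minimal sketch of that arithmetic follows; the toy arrays and the names `test_frac`, `val_frac`, and `val_of_temp` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 1000 rows, 30% positive class.
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 700 + [1] * 300)

test_frac, val_frac = 0.15, 0.15

# First split: hold out the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_frac, stratify=y, random_state=42
)

# Second split: the validation share must be expressed relative to the
# remaining data, i.e. 0.15 / 0.85, which is about 0.176.
val_of_temp = val_frac / (1 - test_frac)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_of_temp, stratify=y_temp, random_state=42
)

print(f"val_of_temp = {val_of_temp:.3f}")
print("sizes:", len(X_train), len(X_val), len(X_test))
print("positive rate:", y_train.mean(), y_val.mean(), y_test.mean())
```

Computing the second fraction instead of hard-coding it keeps the split correct if the target proportions change later.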

Pattern 3: StandardScaler (CRITICAL)

# ⚠️ CRITICAL: Prevent data leakage!
scaler = StandardScaler()

# CORRECT:
X_train_scaled = scaler.fit_transform(X_train)  # Learn from train
X_test_scaled = scaler.transform(X_test)        # Apply to test

# WRONG (data leakage!):
# X_scaled = scaler.fit_transform(X)  # Don't fit on all data!
# X_train, X_test = train_test_split(X_scaled, ...)

Pattern 4: Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold

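# Simple CV (for binary targets; an integer cv is stratified for classifiers)
# If the model needs scaled features, pass a Pipeline as `model` so the
# scaler is re-fit inside each fold instead of on all of X.
scores = cross_val_score(model, X, y, cv=5, scoring='f1')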
print(f"F1: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Stratified K-Fold (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
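
Cross-validation interacts with the scaling rule from Pattern 3: if the features are scaled before calling `cross_val_score`, every fold has already seen statistics from its own validation rows. One common way to avoid this, sketched here under the assumption of a synthetic dataset from `make_classification`, is to wrap the scaler and the model in a pipeline so the scaler is re-fit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary data (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# The pipeline is cloned and fit from scratch inside every fold,
# so StandardScaler only ever sees that fold's training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1")
print(f"F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```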

Pattern 5: Evaluation Metrics Selection

Metric      Use When                            Formula
---------   ---------------------------------   ---------------------
Accuracy    Balanced classes                    (TP+TN) / Total
Precision   False positives costly (spam)       TP / (TP+FP)
Recall      False negatives costly (cancer)     TP / (TP+FN)
F1          Need balance, imbalanced data       2 * (P*R) / (P+R)
ROC-AUC     Compare models, threshold tuning    Area under ROC curve

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)

# All metrics at once
print(classification_report(y_test, y_pred))

# ROC-AUC (needs probabilities; use the same scaled features the model was trained on)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
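
The formulas in the metrics table can be checked by hand against scikit-learn. The sketch below uses a tiny made-up label vector (an assumption, not data from this skill) and computes each metric directly from the TP/FP/FN/TN counts.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score
)

# Made-up labels for illustration: 4 positives, 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Count the four confusion-matrix cells.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

# Apply the table's formulas directly.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# The hand-computed values should agree with scikit-learn.
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```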

Reference Navigation

For detailed content, see:

  • Regression Models: reference/regression_models.md - Linear, Polynomial, Ridge/Lasso, Gradient Descent
  • Classification Basics: reference/classification_basics.md - Decision Trees, Logistic Regression, Impurity measures
  • Model Evaluation: reference/model_evaluation.md - Metrics, Confusion Matrix, ROC curves, CV strategies
  • Feature Engineering: reference/feature_engineering.md - Encoding, Scaling, Selection, Imbalance handling

Common Mistakes to Avoid

1. Data Leakage with Scaling

# WRONG: Fitting scaler on all data
scaler.fit(X)  # Sees test data statistics!

# CORRECT: Fit only on training data
scaler.fit(X_train)

2. Using Accuracy on Imbalanced Data

# If 95% of data is class 0, predicting all 0 gives 95% accuracy!
# Use F1, Precision, Recall, or ROC-AUC instead

from sklearn.metrics import f1_score
print(f"F1: {f1_score(y_test, y_pred)}")
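
The 95% example can be reproduced with a majority-class baseline. This is a minimal sketch using scikit-learn's `DummyClassifier`; the all-zero feature matrix is a placeholder assumption, since the baseline ignores the features anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 95% of labels are class 0, 5% are class 1.
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # placeholder features; the baseline ignores them

# Always predicts the most frequent class (here: 0).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, zero_division=0)  # no positive predictions at all

print(f"Accuracy: {acc:.2f}")  # high, despite the model being useless
print(f"F1:       {f1:.2f}")   # exposes that no positives were found
```

Comparing any real model against a baseline like this is a quick way to tell whether a high accuracy figure reflects learning or just the class distribution.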

3. Not Stratifying Classification Splits

# WRONG: May get unbalanced splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# CORRECT: Preserves class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

4. Forgetting to Set random_state

# WRONG: Results not reproducible
train_test_split(X, y, test_size=0.2)

# CORRECT: Reproducible results
train_test_split(X, y, test_size=0.2, random_state=42)

5. Using LASSO for P-values

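# WRONG: scikit-learn's Lasso does not expose p-values
# lasso.pvalue  # Does not exist!

# CORRECT: Two-step pattern
# 1. Use LASSO to select features
# 2. Use OLS on selected features for p-values
# Assumes `lasso` is an already-fitted sklearn.linear_model.Lasso
# and X is a NumPy array.
# Caveat: p-values computed after selecting features on the same data
# tend to be optimistic, so treat them as exploratory.
import statsmodels.api as sm
selected_features = X[:, lasso.coef_ != 0]
ols_model = sm.OLS(y, sm.add_constant(selected_features)).fit()
print(ols_model.pvalues)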
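
Step 1 of the two-step pattern, fitting the LASSO and building the column mask, might look like the following sketch. The synthetic data (two informative features, three pure-noise features) and the `alpha=0.1` value are assumptions for illustration; the data is treated as the training split, and the OLS step would then run on `selected_features` as in the snippet above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption): only the first two columns matter.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (
    3.0 * X_train[:, 0]
    - 2.0 * X_train[:, 1]
    + rng.normal(scale=0.1, size=200)
)

# LASSO penalizes coefficient size, so features should share a common scale.
X_scaled = StandardScaler().fit_transform(X_train)
lasso = Lasso(alpha=0.1).fit(X_scaled, y_train)

# Features whose coefficient was shrunk exactly to zero are dropped.
selected_mask = lasso.coef_ != 0
selected_features = X_train[:, selected_mask]

print("coefficients:", np.round(lasso.coef_, 3))
print("selected columns:", np.flatnonzero(selected_mask))
```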

Teaching Mode

When explaining ML fundamentals:

  1. Start with intuition: "Regression finds the best line through points"
  2. Use visual analogies: Draw scatter plots, decision boundaries
  3. Show the math gradually: Start with formula, then code
  4. Connect to real problems: "Spam detection is classification"
  5. Explain evaluation metrics with examples: "Precision = of all emails marked spam, how many were actually spam?"

Confusion Matrix Visual

                 Predicted
              Neg    |   Pos
         ┌──────────┼──────────┐
    Neg  │    TN    │    FP    │  ← FP: Type I Error (False Alarm)
Actual   ├──────────┼──────────┤
    Pos  │    FN    │    TP    │
         └──────────┴──────────┘
              ↑
         FN: Type II Error (Miss)
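
The layout above matches what scikit-learn's `confusion_matrix` returns for binary labels ordered `[0, 1]`: rows are actual classes and columns are predicted classes. A minimal sketch with made-up labels (an assumption for illustration) shows how to unpack the four cells by name.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]],
# so ravel() yields the cells in the order tn, fp, fn, tp.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}  FP={fp} (Type I)  FN={fn} (Type II)  TP={tp}")
```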