Implements classical ML algorithms for regression and classification. Covers Linear/Polynomial/Logistic Regression, Decision Trees, Ridge/Lasso regularization, train/test splits, cross-validation, and evaluation metrics (R², RMSE, Precision, Recall, F1, ROC-AUC, Confusion Matrix). Use when building predictive models on tabular data, comparing baseline algorithms, handling imbalanced data, or when the user mentions 'regression', 'classification', 'overfitting', 'cross-validation', 'confusion matrix', 'feature importance', 'precision/recall', or 'regularization'.
Install
npx skillscat add levy-n/claude-useful-skills/ml-fundamentals
Install via the SkillsCat registry.
ML Fundamentals - Classical Machine Learning
Classical machine learning fundamentals: regression, classification, and model evaluation.
Quick Start
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# 1. Split data (ALWAYS stratify for classification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Scale features (fit on train ONLY!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform ONLY
# 3. Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# 4. Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
When This Skill Activates
Use this skill when:
- Building regression models (predicting continuous values)
- Building classification models (predicting categories)
- Comparing baseline algorithms before deep learning
- Evaluating model performance with metrics
- Engineering features from tabular data
- Handling class imbalance problems
Core Patterns
Pattern 1: Model Selection by Problem Type
Problem Type → Recommended Models
├── Regression (continuous target)
│   ├── Linear Regression (baseline)
│   ├── Ridge/Lasso (with regularization)
│   └── Polynomial (non-linear relationships)
│
├── Binary Classification
│   ├── Logistic Regression (baseline, interpretable)
│   └── Decision Tree (non-linear, interpretable)
│
└── Multi-class Classification
    ├── Logistic Regression (multinomial or one-vs-rest)
    └── Decision Tree (native multi-class)
Pattern 2: Train/Test Split Rules
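The three-way split in this pattern passes `test_size=0.176` to its second `train_test_split` call. A minimal sketch of where that number comes from, assuming the 70/15/15 target the pattern states: once the test set is removed, the validation share has to be expressed relative to what remains. The helper name `second_split_fraction` is hypothetical and used only for illustration.

```python
def second_split_fraction(val_share: float, test_share: float) -> float:
    """Fraction to pass as test_size in the second split.

    After the test set is removed, (1 - test_share) of the data remains,
    so the validation share is rescaled relative to that remainder.
    """
    remaining = 1.0 - test_share
    return val_share / remaining


frac = second_split_fraction(val_share=0.15, test_share=0.15)
print(f"second test_size = {frac:.3f}")

# Shares of the full dataset implied by the rounded value 0.176
val_actual = 0.85 * 0.176
train_actual = 0.85 - val_actual
print(f"train={train_actual:.4f}, val={val_actual:.4f}, test=0.1500")
```

Because 0.176 is a rounded value, the resulting proportions are approximately, not exactly, 70/15/15.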
# Classification: ALWAYS use stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# With validation set (recommended)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, stratify=y_temp, random_state=42
)
# Results in 70/15/15 split
Pattern 3: StandardScaler (CRITICAL)
# ⚠️ CRITICAL: Prevent data leakage!
scaler = StandardScaler()
# CORRECT:
X_train_scaled = scaler.fit_transform(X_train) # Learn from train
X_test_scaled = scaler.transform(X_test) # Apply to test
# WRONG (data leakage!):
# X_scaled = scaler.fit_transform(X) # Don't fit on all data!
# X_train, X_test = train_test_split(X_scaled, ...)
Pattern 4: Cross-Validation
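The fit-on-train-only rule for scaling also applies inside cross-validation, where every fold has its own training portion. One way to honor it, sketched here under assumptions rather than taken from the skill itself, is to wrap the scaler and the model in a scikit-learn pipeline so that `cross_val_score` refits the scaler on each fold. The synthetic data from `make_classification` and the `max_iter=1000` setting are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; substitute your own X, y)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# The pipeline refits the scaler on the training portion of every fold,
# so the held-out fold never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=skf, scoring="f1")
print(f"F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

Passing the pipeline instead of a bare model keeps the scaling step and the estimator together, which avoids scaling `X` once up front before cross-validating.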
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Simple CV
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
# Stratified K-Fold (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
Pattern 5: Evaluation Metrics Selection
| Metric | Use When | Formula |
|---|---|---|
| Accuracy | Balanced classes | (TP+TN) / Total |
| Precision | False positives costly (spam) | TP / (TP+FP) |
| Recall | False negatives costly (cancer) | TP / (TP+FN) |
| F1 | Need balance, imbalanced data | 2 * (P*R) / (P+R) |
| ROC-AUC | Compare models, threshold tuning | Area under ROC curve |
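The formulas in the table can be checked by hand. A minimal sketch with hypothetical confusion-matrix counts (the numbers are invented for illustration and describe an imbalanced problem with 60 positives out of 1000 samples):

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 20, 930

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1:        {f1:.3f}")         # 0.727
```

With these counts accuracy looks strong at 0.970 even though a third of the positives are missed, which is why the table reserves accuracy for balanced classes.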
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)
# All metrics at once
print(classification_report(y_test, y_pred))
# ROC-AUC (needs probabilities; use the same scaled features the model was trained on)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
Reference Navigation
For detailed content, see:
- Regression Models: reference/regression_models.md - Linear, Polynomial, Ridge/Lasso, Gradient Descent
- Classification Basics: reference/classification_basics.md - Decision Trees, Logistic Regression, Impurity measures
- Model Evaluation: reference/model_evaluation.md - Metrics, Confusion Matrix, ROC curves, CV strategies
- Feature Engineering: reference/feature_engineering.md - Encoding, Scaling, Selection, Imbalance handling
Common Mistakes to Avoid
1. Data Leakage with Scaling
# WRONG: Fitting scaler on all data
scaler.fit(X) # Sees test data statistics!
# CORRECT: Fit only on training data
scaler.fit(X_train)
2. Using Accuracy on Imbalanced Data
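The 95% figure this mistake refers to can be reproduced with a baseline that ignores the features entirely. A minimal sketch, assuming hypothetical labels with 95 negatives and 5 positives and using scikit-learn's `DummyClassifier` as the "predict all 0" model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# Always predicts the most frequent class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
# zero_division=0 silences the warning raised when no positives are predicted
f1 = f1_score(y, y_pred, zero_division=0)
print(f"Accuracy: {acc:.2f}, F1: {f1:.2f}")  # Accuracy: 0.95, F1: 0.00
```

The baseline never finds a single positive, yet its accuracy matches the share of the majority class.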
# If 95% of data is class 0, predicting all 0 gives 95% accuracy!
# Use F1, Precision, Recall, or ROC-AUC instead
from sklearn.metrics import f1_score
print(f"F1: {f1_score(y_test, y_pred)}")
3. Not Stratifying Classification Splits
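What stratification preserves can be checked directly by comparing class rates on both sides of the split. A small sketch, assuming hypothetical labels with a 10% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y both parts keep the 10% positive rate
print(f"train positive rate: {y_train.mean():.2f}")  # 0.10
print(f"test positive rate:  {y_test.mean():.2f}")   # 0.10
```

Without `stratify=y`, the 20-sample test set could by chance contain very few positives or none at all, which makes metrics such as recall unstable or undefined.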
# WRONG: May get unbalanced splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# CORRECT: Preserves class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y
)
4. Forgetting to Set random_state
# WRONG: Results not reproducible
train_test_split(X, y, test_size=0.2)
# CORRECT: Reproducible results
train_test_split(X, y, test_size=0.2, random_state=42)
5. Using LASSO for P-values
# WRONG: LASSO cannot compute p-values!
# lasso.pvalue # Does not exist!
# CORRECT: Two-step pattern
# 1. Use LASSO to select features
# 2. Use OLS on selected features for p-values
import statsmodels.api as sm
# Assumes `lasso` is an already-fitted sklearn Lasso and X is a NumPy array
selected_features = X[:, lasso.coef_ != 0]
ols_model = sm.OLS(y, sm.add_constant(selected_features)).fit()
print(ols_model.pvalues)
Teaching Mode
When explaining ML fundamentals:
- Start with intuition: "Regression finds the best line through points"
- Use visual analogies: Draw scatter plots, decision boundaries
- Show the math gradually: Start with formula, then code
- Connect to real problems: "Spam detection is classification"
- Explain evaluation metrics with examples: "Precision = of all emails marked spam, how many were actually spam?"
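The first intuition in the list, "Regression finds the best line through points", can be made concrete with a handful of points. A minimal sketch, assuming hypothetical noise-free data that lies exactly on y = 2x + 1, so the fitted line should recover that slope and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
```

Adding noise to `y` is a natural next step when teaching: the recovered slope and intercept then only approximate the true values.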
Confusion Matrix Visual
                     Predicted
                  Neg    |    Pos
             ┌──────────┼──────────┐
         Neg │    TN    │    FP    │ ← FP: Type I Error (False Alarm)
Actual       ├──────────┼──────────┤
         Pos │    FN    │    TP    │ ← FN: Type II Error (Miss)
             └──────────┴──────────┘
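scikit-learn's `confusion_matrix` uses the same layout for binary labels 0 and 1: rows are actual classes, columns are predicted classes. A short sketch with hypothetical labels shows how to read the four cells off the returned array:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration (1 = positive, 0 = negative)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# Row-major flattening yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=5, FP=1, FN=1, TP=3
```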