levy-n

ml-advanced

Implements ensemble learning (Random Forest, XGBoost, CatBoost, Stacking), unsupervised methods (K-Means, DBSCAN, Hierarchical clustering, PCA, t-SNE, UMAP), and recommender systems (Matrix Factorization, NeuMF). Use when comparing gradient boosting algorithms, doing customer segmentation, detecting anomalies, reducing dimensionality, or building recommender systems, or when the user mentions 'ensemble', 'boosting', 'bagging', 'random forest', 'XGBoost', 'clustering', 'K-Means', 'DBSCAN', 'elbow method', 'silhouette score', 'PCA', 't-SNE', 'dimensionality reduction', 'feature importance', 'matrix factorization', 'NeuMF', 'recommender system', or 'collaborative filtering'.

levy-n 10 1 Updated 4mo ago


Install

npx skillscat add levy-n/claude-useful-skills/ml-advanced

Install via the SkillsCat registry.

SKILL.md

ML Advanced - Ensemble & Unsupervised Learning

Advanced methods: Ensemble Learning and Unsupervised Learning.

Quick Start - Ensemble Classification

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Compare ensemble methods (assumes X, y are already loaded; scoring='f1' expects a binary target)
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

Quick Start - Clustering

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print(f"Silhouette Score: {silhouette_score(X, labels):.3f}")

# DBSCAN (no need to specify k!)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Found {n_clusters} clusters")

When This Skill Activates

Use this skill when:

  • Comparing Random Forest vs XGBoost vs CatBoost
  • Building ensemble models (bagging, boosting, stacking)
  • Doing customer/user segmentation
  • Finding anomalies or outliers
  • Reducing dimensionality for visualization
  • Analyzing geospatial data with clustering

Core Patterns

Pattern 1: Bagging vs Boosting

Ensemble Methods
├── Bagging (parallel, reduce variance)
│   ├── Random Forest
│   └── Bagging Classifier
│
└── Boosting (sequential, reduce bias)
    ├── AdaBoost
    ├── Gradient Boosting
    ├── XGBoost
    └── CatBoost

| Aspect      | Bagging         | Boosting    |
|-------------|-----------------|-------------|
| Training    | Parallel        | Sequential  |
| Goal        | Reduce variance | Reduce bias |
| Overfitting | Less prone      | More prone  |
| Speed       | Faster          | Slower      |
| Example     | Random Forest   | XGBoost     |
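To make the comparison above concrete, here is a minimal sketch that cross-validates one bagging model, one boosting model, and a stacked combination of the two. The synthetic dataset, the variable names, and the choice of `AdaBoostClassifier` as the boosting representative are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for a real X, y)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Bagging: independent deep trees on bootstrap samples, predictions are voted
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Boosting: shallow trees fitted sequentially, each reweighting earlier mistakes
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

# Stacking: a meta-learner combines out-of-fold predictions of the base models
stacking = StackingClassifier(
    estimators=[("bagging", bagging), ("boosting", boosting)],
    final_estimator=LogisticRegression(),
    cv=5,
)

demo_scores = {}
for name, model in [("Bagging", bagging), ("Boosting", boosting), ("Stacking", stacking)]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    demo_scores[name] = scores.mean()
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

Which family wins depends on the data, so treat the printed scores as a starting point for comparison rather than a ranking.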

Pattern 2: Random Forest Feature Importance

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Feature importance
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importance.head(10))

# Plot
import matplotlib.pyplot as plt
importance.head(15).plot(kind='barh', x='feature', y='importance')
plt.title('Feature Importance')
plt.show()
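Impurity-based `feature_importances_` are computed on the training data and tend to favor high-cardinality features. As a cross-check, the sketch below uses scikit-learn's `permutation_importance` on a held-out split. The synthetic dataset and the `feature_{i}` labels are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2;
# the remaining 5 columns are pure noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```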

Pattern 3: XGBoost with Early Stopping

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=6,
    early_stopping_rounds=50,  # Stop after 50 rounds without improvement on eval_set (constructor argument in xgboost >= 1.6)
    eval_metric='logloss',
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {model.best_iteration}")

Pattern 4: CatBoost for Categorical Features

from catboost import CatBoostClassifier

# CatBoost handles categorical features natively!
cat_features = ['color', 'brand', 'category']  # Column names or indices

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=cat_features,  # No encoding needed!
    verbose=100
)

model.fit(X_train, y_train, eval_set=(X_val, y_val))

Pattern 5: K-Means with Elbow Method

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Find optimal k
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot elbow
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
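The elbow can be ambiguous to read by eye. One complementary check, sketched here, is to compute the silhouette score for each candidate k and pick the maximum. The synthetic blobs with four hand-placed centers are an assumption so that the example has a known answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated blobs (assumption: stands in for real, scaled features)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X_demo, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)

silhouettes = {}
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    silhouettes[k] = silhouette_score(X_demo, labels_k)
    print(f"k={k}: silhouette = {silhouettes[k]:.3f}")

# Higher is better; the score ranges from -1 to 1
best_k = max(silhouettes, key=silhouettes.get)
print(f"Best k by silhouette: {best_k}")
```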

Pattern 6: DBSCAN for Density-Based Clustering

from sklearn.cluster import DBSCAN
import numpy as np

# DBSCAN parameters:
# eps: maximum distance for two points to count as neighbors
# min_samples: points required within eps (including the point itself) for a core point

dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X)

# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()

print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

Pattern 7: Geospatial Clustering with Haversine

from sklearn.cluster import DBSCAN
import numpy as np

# Convert lat/lon to radians (required for Haversine)
coords_rad = np.radians(df[['latitude', 'longitude']].values)

# Earth radius in km
kms_per_radian = 6371.0088
eps_km = 0.5  # Cluster radius in km
eps = eps_km / kms_per_radian

# DBSCAN with Haversine distance
dbscan = DBSCAN(
    eps=eps,
    min_samples=10,
    metric='haversine',
    algorithm='ball_tree'
)

df['cluster'] = dbscan.fit_predict(coords_rad)
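As a sanity check on the unit conversion above, the sketch below computes a great-circle distance by hand and compares it with scikit-learn's `haversine_distances`, which returns radians on a unit sphere. The two city coordinates and the helper name `haversine_km` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, lam1, phi2, lam2 = np.radians([lat1, lon1, lat2, lon2])
    a = (
        np.sin((phi2 - phi1) / 2) ** 2
        + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Approximate city-center coordinates as (latitude, longitude) in degrees
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)

manual_km = haversine_km(*paris, *london)

# scikit-learn expects [latitude, longitude] in radians, in that order
coords_demo = np.radians([paris, london])
sklearn_km = haversine_distances(coords_demo)[0, 1] * EARTH_RADIUS_KM

# The same scaling, inverted, turns a radius in km into an eps in radians
eps_km = 0.5
eps_rad = eps_km / EARTH_RADIUS_KM

print(f"manual: {manual_km:.1f} km, sklearn: {sklearn_km:.1f} km, eps: {eps_rad:.2e} rad")
```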

Pattern 8: PCA for Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale before PCA!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_pca.shape[1]}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
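When PCA feeds a supervised model, fitting the scaler and PCA on the full dataset leaks information from the validation folds. A minimal sketch of one way to avoid that is to wrap the steps in a pipeline so they are refit inside each fold. The synthetic data and the `LogisticRegression` downstream model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 30 features, many of them redundant (assumption: stands in for real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=8, n_redundant=10, random_state=42
)

# Scaler and PCA are refit on the training part of every CV fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # Keep 95% variance
    LogisticRegression(max_iter=1000),
)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Fit once on all data only to inspect how many components were kept
pipe.fit(X_demo, y_demo)
n_kept = pipe.named_steps["pca"].n_components_
print(f"Components kept: {n_kept} of {X_demo.shape[1]}")
```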

Pattern 9: t-SNE for Visualization

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# t-SNE (good for visualization, NOT for preprocessing)
tsne = TSNE(
    n_components=2,
    perplexity=30,  # Try 5-50
    random_state=42
)
X_tsne = tsne.fit_transform(X)

# Plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10')
plt.colorbar(scatter)
plt.title('t-SNE Visualization')
plt.show()

Reference Navigation

For detailed content, see:

  • Ensemble Methods: reference/ensemble_methods.md - Random Forest, XGBoost, CatBoost, Stacking, TreeInterpreter
  • Clustering Algorithms: reference/clustering_algorithms.md - K-Means, DBSCAN, Hierarchical, HDBSCAN
  • Dimensionality Reduction: reference/dimensionality_reduction.md - PCA, t-SNE, UMAP, MDS
  • Geospatial Analysis: reference/geospatial_analysis.md - Haversine, ConvexHull, KDE density plots
  • Recommender Systems: reference/recommender_systems.md - Matrix Factorization, NeuMF, MovieLens patterns

Common Mistakes to Avoid

1. Not Scaling Before PCA

# WRONG: PCA on unscaled data
pca.fit_transform(X)  # Features with larger scales dominate!

# CORRECT: Scale first
X_scaled = StandardScaler().fit_transform(X)
pca.fit_transform(X_scaled)

2. Using t-SNE for Preprocessing

# WRONG: t-SNE output as features
X_tsne = TSNE().fit_transform(X)
model.fit(X_tsne, y)  # Bad! t-SNE is for visualization only

# CORRECT: Use PCA for preprocessing (scale first, see mistake 1)
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=50).fit_transform(X_scaled)
model.fit(X_pca, y)

3. Ignoring DBSCAN Noise Points

# Noise points have label -1
noise_mask = labels == -1
print(f"Noise: {noise_mask.sum()} points ({noise_mask.mean():.1%})")

# Consider: too much noise? Adjust eps or min_samples
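One common heuristic for adjusting `eps` is the k-distance curve: sort every point's distance to its `min_samples`-th neighbor and look for the knee. The sketch below skips the plot and uses the 90th percentile as a rough stand-in for the knee. That percentile and the synthetic blobs are assumptions, not a recommended default.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Three compact blobs (assumption: stands in for real, scaled features)
centers = np.array([[0, 0], [8, 0], [4, 7]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

min_samples = 5

# Querying the fitted data returns each point as its own first neighbor,
# which matches DBSCAN's min_samples (it also counts the point itself)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Assumption: the 90th percentile replaces reading the knee off a plot
eps_guess = float(np.percentile(k_distances, 90))

labels_demo = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X_demo)
noise_fraction = float((labels_demo == -1).mean())
n_clusters_demo = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)

print(f"eps = {eps_guess:.3f}, clusters = {n_clusters_demo}, noise = {noise_fraction:.1%}")
```

Plotting `k_distances` and picking the knee by eye is usually more reliable than a fixed percentile on real data.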

4. Wrong n_init for K-Means

# K-Means is sensitive to initialization
# WRONG: Single initialization
kmeans = KMeans(n_clusters=3, n_init=1)  # May converge to a poor local minimum

# CORRECT: Multiple initializations, set explicitly
# (the default was 10 before scikit-learn 1.4; 'auto' now runs once with k-means++ init)
kmeans = KMeans(n_clusters=3, n_init=10)

5. Using Euclidean Distance on Lat/Lon

# WRONG: Euclidean on geographic coordinates
DBSCAN(eps=0.01, metric='euclidean').fit(lat_lon_data)

# CORRECT: Use Haversine for geographic data
coords_rad = np.radians(lat_lon_data)
DBSCAN(eps=eps_in_radians, metric='haversine').fit(coords_rad)

Teaching Mode

When explaining ensemble/clustering:

  1. Bagging intuition: "Ask many experts, vote on answer"
  2. Boosting intuition: "Learn from mistakes, focus on hard cases"
  3. K-Means intuition: "Find k centroids that minimize squared point-to-centroid distances"
  4. DBSCAN intuition: "Find dense regions separated by sparse areas"
  5. PCA intuition: "Find directions of maximum variance, project data onto them"
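To back the K-Means intuition above with something executable, here is a minimal NumPy sketch of Lloyd's algorithm: assign each point to its nearest centroid, move each centroid to the mean of its points, repeat. The function name `lloyd_kmeans`, the random initialization, and the toy data are assumptions; scikit-learn's `KMeans` adds k-means++ initialization and convergence checks on top of this loop.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns centroids, labels, and inertia per iteration."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    inertia_history = []
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        inertia_history.append(float(sq_dist[np.arange(len(X)), labels].sum()))
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment so labels match the returned centroids
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    return centroids, labels, inertia_history

# Toy data: three Gaussian blobs
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

centroids, labels_demo, inertia_history = lloyd_kmeans(X_demo, k=3)
print(f"Inertia: first = {inertia_history[0]:.1f}, last = {inertia_history[-1]:.1f}")
```

The inertia never increases from one iteration to the next, which is why the loop converges, though possibly to a local minimum that depends on the initial centroids.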

Clustering Decision Tree

Need to cluster data?
├── Know number of clusters (k)?
│   ├── Yes → K-Means
│   └── No → DBSCAN or Hierarchical
│
├── Data has varying densities?
│   ├── Yes → HDBSCAN (a single DBSCAN eps assumes similar densities)
│   └── No → K-Means
│
├── Need hierarchical structure?
│   └── Yes → Hierarchical Clustering
│
└── Geographic data?
    └── Yes → DBSCAN with Haversine
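The hierarchical branch of the tree above has no code pattern of its own, so here is a minimal sketch using scikit-learn's `AgglomerativeClustering`. It shows two ways to cut the merge tree: a fixed number of clusters when k is known, and a distance threshold when it is not. The synthetic blobs and the threshold value of 20.0 are assumptions chosen for this toy data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated blobs with known labels (assumption: toy data)
centers = np.array([[0, 0], [10, 0], [5, 9]])
X_demo, y_true = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=42)

# Known k: cut the merge tree at a fixed number of clusters
agg_k = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels_k = agg_k.fit_predict(X_demo)

# Unknown k: stop merging once the linkage distance exceeds a threshold
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=20.0, linkage="ward")
labels_t = agg_t.fit_predict(X_demo)

ari_k = adjusted_rand_score(y_true, labels_k)
print(f"Fixed k: ARI = {ari_k:.3f}")
print(f"Threshold cut found {agg_t.n_clusters_} clusters")
```

On real data the threshold is usually read off a dendrogram rather than guessed.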