Implements ensemble learning (Random Forest, XGBoost, CatBoost, Stacking), unsupervised methods (K-Means, DBSCAN, Hierarchical clustering, PCA, t-SNE, UMAP), and recommender systems (Matrix Factorization, NeuMF). Use when comparing gradient boosting algorithms, segmenting customers, detecting anomalies, reducing dimensionality, or building recommender systems, or when the user mentions 'ensemble', 'boosting', 'bagging', 'random forest', 'XGBoost', 'clustering', 'K-Means', 'DBSCAN', 'elbow method', 'silhouette score', 'PCA', 't-SNE', 'dimensionality reduction', 'feature importance', 'matrix factorization', 'NeuMF', 'recommender system', or 'collaborative filtering'.
Install via the SkillsCat registry:
npx skillscat add levy-n/claude-useful-skills/ml-advanced
ML Advanced - Ensemble & Unsupervised Learning
Advanced methods: Ensemble Learning and Unsupervised Learning.
Quick Start - Ensemble Classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
# Compare ensemble methods
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}
# X, y: feature matrix and labels (scoring='f1' assumes a binary target)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
Quick Start - Clustering
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print(f"Silhouette Score: {silhouette_score(X, labels):.3f}")
# DBSCAN (no need to specify k!)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Found {n_clusters} clusters")
When This Skill Activates
Use this skill when:
- Comparing Random Forest vs XGBoost vs CatBoost
- Building ensemble models (bagging, boosting, stacking)
- Doing customer/user segmentation
- Finding anomalies or outliers
- Reducing dimensionality for visualization
- Analyzing geospatial data with clustering
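Of the ensemble styles named above, stacking is the least self-explanatory: base models make predictions, and a final model learns how to combine them. A minimal sketch, assuming synthetic data from `make_classification` and an arbitrary choice of base and final estimators (the `_demo` variable names and all hyperparameters are illustrative, not part of the skill):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Base models produce out-of-fold predictions (cv=5);
# the final estimator is trained on those predictions
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print(f"Stacking accuracy: {stack_accuracy:.3f}")
```

Using out-of-fold predictions for the final estimator keeps it from simply trusting whichever base model overfits the training set most.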
Core Patterns
Pattern 1: Bagging vs Boosting
Ensemble Methods
├── Bagging (parallel, reduce variance)
│   ├── Random Forest
│   └── Bagging Classifier
│
└── Boosting (sequential, reduce bias)
    ├── AdaBoost
    ├── Gradient Boosting
    ├── XGBoost
    └── CatBoost
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Goal | Reduce variance | Reduce bias |
| Overfitting | Less prone | More prone |
| Speed | Faster | Slower |
| Example | Random Forest | XGBoost |
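The contrast in the table can be tried directly. The sketch below is one possible illustration, assuming synthetic data and scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` as stand-ins for the two families; the data, tree depths, and estimator counts are arbitrary choices, and the scores it prints say nothing general about which family is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Bagging: fully grown (high-variance) trees, each fit independently
# on a bootstrap sample, then averaged by voting
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=42
)

# Boosting: depth-1 stumps (high-bias) fit one after another,
# each reweighting the samples the previous ones misclassified
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42
)

results = {}
for name, model in {'Bagging': bagging, 'Boosting': boosting}.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print(f"{name}: accuracy = {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
```

The base learners are chosen to match the table: bagging starts from low-bias, high-variance trees and averages the variance away, while boosting starts from weak, high-bias stumps and reduces bias step by step.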
Pattern 2: Random Forest Feature Importance
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Feature importance
importance = pd.DataFrame({
    'feature': feature_names,  # Column names, in the same order as X_train's columns
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance.head(10))
# Plot
import matplotlib.pyplot as plt
importance.head(15).plot(kind='barh', x='feature', y='importance')
plt.title('Feature Importance')
plt.show()
Pattern 3: XGBoost with Early Stopping
from xgboost import XGBClassifier
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=6,
    early_stopping_rounds=50,  # Stop after 50 rounds without improvement (older XGBoost releases took this in fit())
    eval_metric='logloss',
    random_state=42
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],  # Validation set monitored for early stopping
    verbose=False
)
print(f"Best iteration: {model.best_iteration}")
Pattern 4: CatBoost for Categorical Features
from catboost import CatBoostClassifier
# CatBoost handles categorical features natively!
cat_features = ['color', 'brand', 'category'] # Column names or indices
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=cat_features,  # No encoding needed!
    verbose=100
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
Pattern 5: K-Means with Elbow Method
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Find optimal k
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)  # Within-cluster sum of squared distances
# Plot elbow: look for the k where the curve stops dropping sharply
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
Pattern 6: DBSCAN for Density-Based Clustering
from sklearn.cluster import DBSCAN
import numpy as np
# DBSCAN parameters:
# eps: maximum distance for two points to count as neighbors
# min_samples: neighbors (including the point itself) needed for a point to be a core point
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X)
# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
Pattern 7: Geospatial Clustering with Haversine
from sklearn.cluster import DBSCAN
import numpy as np
# Convert lat/lon to radians (required for Haversine)
coords_rad = np.radians(df[['latitude', 'longitude']].values)
# Earth radius in km
kms_per_radian = 6371.0088
eps_km = 0.5 # Cluster radius in km
eps = eps_km / kms_per_radian
# DBSCAN with Haversine distance
dbscan = DBSCAN(
    eps=eps,
    min_samples=10,
    metric='haversine',
    algorithm='ball_tree'
)
df['cluster'] = dbscan.fit_predict(coords_rad)
Pattern 8: PCA for Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Always scale before PCA!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_pca.shape[1]}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
Pattern 9: t-SNE for Visualization
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# t-SNE (good for visualization, NOT for preprocessing)
tsne = TSNE(
    n_components=2,
    perplexity=30,  # Try 5-50; must be smaller than the number of samples
    random_state=42
)
X_tsne = tsne.fit_transform(X)
# Plot (labels: class or cluster labels used only for coloring, assumed to exist)
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10')
plt.colorbar(scatter)
plt.title('t-SNE Visualization')
plt.show()
Reference Navigation
For detailed content, see:
- Ensemble Methods: reference/ensemble_methods.md - Random Forest, XGBoost, CatBoost, Stacking, TreeInterpreter
- Clustering Algorithms: reference/clustering_algorithms.md - K-Means, DBSCAN, Hierarchical, HDBSCAN
- Dimensionality Reduction: reference/dimensionality_reduction.md - PCA, t-SNE, UMAP, MDS
- Geospatial Analysis: reference/geospatial_analysis.md - Haversine, ConvexHull, KDE density plots
- Recommender Systems: reference/recommender_systems.md - Matrix Factorization, NeuMF, MovieLens patterns
Common Mistakes to Avoid
1. Not Scaling Before PCA
# WRONG: PCA on unscaled data
pca.fit_transform(X) # Features with larger scales dominate!
# CORRECT: Scale first
X_scaled = StandardScaler().fit_transform(X)
pca.fit_transform(X_scaled)
2. Using t-SNE for Preprocessing
# WRONG: t-SNE output as features
X_tsne = TSNE().fit_transform(X)
model.fit(X_tsne, y) # Bad! t-SNE is for visualization only
# CORRECT: Use PCA for preprocessing
X_pca = PCA(n_components=50).fit_transform(X)
model.fit(X_pca, y)
3. Ignoring DBSCAN Noise Points
# Noise points have label -1
noise_mask = labels == -1
print(f"Noise: {noise_mask.sum()} points ({noise_mask.mean():.1%})")
# Too much noise? Try a larger eps or a smaller min_samples
4. Wrong n_init for K-Means
# K-Means is sensitive to initialization
# WRONG: Single initialization
kmeans = KMeans(n_clusters=3, n_init=1) # May find local minimum
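# CORRECT: Request multiple initializations explicitly
# (older scikit-learn defaulted to 10; recent versions default to 'auto', a single run with k-means++ init)
kmeans = KMeans(n_clusters=3, n_init=10)
5. Using Euclidean Distance on Lat/Lon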
# WRONG: Euclidean on geographic coordinates
DBSCAN(eps=0.01, metric='euclidean').fit(lat_lon_data)
# CORRECT: Use Haversine for geographic data
coords_rad = np.radians(lat_lon_data)
DBSCAN(eps=eps_in_radians, metric='haversine').fit(coords_rad)
Teaching Mode
When explaining ensemble/clustering:
- Bagging intuition: "Ask many experts, vote on answer"
- Boosting intuition: "Learn from mistakes, focus on hard cases"
- K-Means intuition: "Find k centroids that minimize squared point-to-centroid distances"
- DBSCAN intuition: "Find dense regions separated by sparse areas"
- PCA intuition: "Find directions of maximum variance, project data onto them"
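The PCA intuition can be made concrete in a few lines of NumPy. This is a sketch for teaching purposes, assuming a small synthetic 2-D dataset (the variable names and data are illustrative): it finds the directions of maximum variance with an SVD of the centered data, projects onto the first one, and cross-checks the variance ratios against scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Correlated 2-D data: most of the variance lies along one diagonal direction
x = rng.normal(size=300)
X_demo = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=300)])

# Step 1: center the data (PCA directions are defined around the mean)
X_centered = X_demo - X_demo.mean(axis=0)

# Step 2: SVD; the rows of Vt are the principal directions,
# ordered from largest to smallest variance
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = singular_values**2 / (len(X_demo) - 1)
variance_ratio = explained_variance / explained_variance.sum()

# Step 3: project the data onto the first principal direction
X_projected = X_centered @ Vt[:1].T

# Cross-check against scikit-learn
pca = PCA(n_components=2).fit(X_demo)
print("manual ratios: ", variance_ratio)
print("sklearn ratios:", pca.explained_variance_ratio_)
```

Because the second feature is almost a scaled copy of the first, nearly all of the variance lands on the first direction, which is why a single component is enough here.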
Clustering Decision Tree
Need to cluster data?
├── Know number of clusters (k)?
│   ├── Yes → K-Means
│   └── No → DBSCAN or Hierarchical
│
├── Data has varying densities?
│   ├── Yes → HDBSCAN (plain DBSCAN uses one global eps and struggles here)
│   └── No → K-Means
│
├── Need hierarchical structure?
│   └── Yes → Hierarchical Clustering
│
└── Geographic data?
    └── Yes → DBSCAN with Haversine
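The hierarchical branch of the tree has no pattern of its own in this document. A minimal sketch, assuming synthetic blobs from `make_blobs` and SciPy's `linkage`/`fcluster` with Ward linkage (the data, the linkage method, and the choice of 3 clusters are all illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with three groups (assumption for illustration)
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X_demo)

# Build the full merge tree bottom-up; Ward linkage merges the pair of
# clusters that increases within-cluster variance the least
Z = linkage(X_scaled, method='ward')

# Cut the tree into at most 3 flat clusters (labels start at 1)
hier_labels = fcluster(Z, t=3, criterion='maxclust')

cluster_ids, counts = np.unique(hier_labels, return_counts=True)
print(dict(zip(cluster_ids.tolist(), counts.tolist())))
```

The linkage matrix `Z` holds the whole hierarchy, so the same tree can be cut at a different level, or passed to `scipy.cluster.hierarchy.dendrogram` for plotting, without refitting.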