MLOps and experiment tracking for reproducible ML workflows. Covers MLflow, Weights & Biases (W&B), TensorBoard, hyperparameter tuning (Optuna, Ray Tune), model registry, experiment versioning, and production deployment patterns. Use when user asks about 'MLflow', 'W&B', 'Weights and Biases', 'experiment tracking', 'hyperparameter tuning', 'Optuna', 'model registry', 'TensorBoard', 'reproducibility', 'model versioning', 'ML pipeline', 'model deployment', 'logging', 'wandb', or 'Ray Tune'.
MLOps & Experiment Tracking
MLOps, experiment tracking, and model management in production.
Quick Start - MLflow Experiment Tracking
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

# model, train_loader, val_loader, criterion, optimizer, train_epoch and
# validate are assumed to be defined elsewhere.

# Start MLflow tracking
mlflow.set_experiment("my-classification-experiment")

with mlflow.start_run(run_name="baseline-mlp"):
    # Log hyperparameters
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 50,
        "hidden_dim": 64,
        "optimizer": "Adam",
        "dropout": 0.3
    })

    # Training loop
    for epoch in range(50):
        train_loss = train_epoch(model, train_loader, criterion, optimizer)
        val_loss, val_acc = validate(model, val_loader, criterion)

        # Log metrics each epoch
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log final model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts (plots, configs)
    mlflow.log_artifact("training_curve.png")

# View results: run `mlflow ui`, then open http://localhost:5000

When This Skill Activates
Use this skill when:
- Tracking experiments across multiple runs
- Comparing model performance systematically
- Tuning hyperparameters (grid/random/Bayesian)
- Saving and versioning trained models
- Building reproducible ML workflows
- Deploying models to production
Core Patterns
Pattern 1: Weights & Biases (W&B)
import wandb

# Initialize
wandb.init(
    project="image-classifier",
    name="resnet50-augmented",
    config={
        "learning_rate": 0.001,
        "architecture": "ResNet50",
        "dataset": "CIFAR-10",
        "epochs": 100,
        "batch_size": 32,
    }
)

# Training loop with logging
for epoch in range(wandb.config.epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss, val_acc = validate(model, val_loader, criterion)

    # Log metrics (W&B charts them in the run dashboard)
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": optimizer.param_groups[0]["lr"]
    })

    # Log sample predictions as images
    if epoch % 10 == 0:
        images, labels, preds = get_predictions(model, val_loader)
        # wandb.Image takes one caption string per image, so log a list of images
        wandb.log({
            "predictions": [
                wandb.Image(img, caption=f"True:{l} Pred:{p}")
                for img, l, p in zip(images[:16], labels[:16], preds[:16])
            ]
        })

# Upload the best checkpoint file with the run
wandb.save("best_model.pt")
wandb.finish()

Pattern 2: TensorBoard
import torch
from torch.utils.tensorboard import SummaryWriter

# Create writer
writer = SummaryWriter("runs/experiment_1")

# Log scalars
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    # validate returns (loss, accuracy), as in the Quick Start
    val_loss, _ = validate(model, val_loader, criterion)
    writer.add_scalars("Loss", {
        "train": train_loss,
        "validation": val_loss
    }, epoch)
    writer.add_scalar("Learning Rate",
                      optimizer.param_groups[0]["lr"], epoch)

# Log model graph
dummy_input = torch.randn(1, input_dim)
writer.add_graph(model, dummy_input)

# Log embeddings
features = get_features(model, dataset)
writer.add_embedding(features, metadata=labels, tag="learned_features")

# Log histograms of weights (epoch is the last epoch of the loop above)
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)

writer.close()

# View: tensorboard --logdir=runs

Pattern 3: Hyperparameter Tuning with Optuna
import optuna
import torch
import torch.nn as nn

# input_dim, num_classes, device, criterion, train_loader, val_loader,
# train_epoch and validate are assumed to be defined elsewhere.

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    hidden_dim = trial.suggest_categorical("hidden_dim", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "SGD", "AdamW"])
    # NOTE: batch_size only takes effect if the data loaders are rebuilt with it
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])

    # Build model with suggested params
    model = nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_dim, hidden_dim // 2),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_dim // 2, num_classes)
    ).to(device)

    # Select optimizer
    if optimizer_name == "Adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == "SGD":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Train and evaluate
    for epoch in range(20):
        train_epoch(model, train_loader, criterion, optimizer)

        # Pruning: stop bad trials early
        # validate returns (loss, accuracy), as in the Quick Start
        val_loss, _ = validate(model, val_loader, criterion)
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss

# Run optimization (stops after 100 trials or one hour, whichever comes first)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100, timeout=3600)

# Results
print(f"Best trial: {study.best_trial.value:.4f}")
print(f"Best params: {study.best_trial.params}")
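When no pruner is passed to `create_study`, Optuna falls back to its median pruner. The sketch below is a simplified, framework-free illustration of that rule for a minimized metric, not Optuna's actual implementation. The function name `should_prune`, the `history` data, and the handling of startup trials are assumptions made for the example.

```python
from statistics import median


def should_prune(current_values, finished_trials, n_startup_trials=5):
    """Simplified median rule for a metric that is being minimized.

    current_values: intermediate values reported so far by the running trial.
    finished_trials: per-step value lists from earlier, completed trials.
    """
    step = len(current_values) - 1
    # Values that earlier trials reported at the same step.
    peers = [t[step] for t in finished_trials if len(t) > step]
    if len(finished_trials) < n_startup_trials or not peers:
        return False  # not enough history to judge this trial
    # Prune when the best value seen so far is worse than the peers' median.
    return min(current_values) > median(peers)


# Hypothetical validation losses from five completed trials.
history = [
    [0.90, 0.70, 0.55],
    [0.95, 0.80, 0.60],
    [0.85, 0.65, 0.50],
    [1.00, 0.75, 0.58],
    [0.92, 0.72, 0.57],
]

print(should_prune([1.10, 1.05], history))  # True: 1.05 is above the step-1 median 0.72
print(should_prune([0.88, 0.60], history))  # False: 0.60 is below the step-1 median
```

This is why `trial.report` must be called every epoch in the objective: without intermediate values the pruner has nothing to compare.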
# Visualization (requires plotly)
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()

Pattern 4: Experiment Comparison Framework
# Structured experiment tracking (works with any logger)
from dataclasses import dataclass, asdict
from datetime import datetime
import json

@dataclass
class ExperimentConfig:
    """All hyperparameters in one place for reproducibility."""
    # Data
    dataset: str = "CIFAR-10"
    train_size: int = 50000
    val_split: float = 0.1
    # Model
    architecture: str = "ResNet18"
    hidden_dim: int = 64
    dropout: float = 0.3
    # Training
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    optimizer: str = "Adam"
    weight_decay: float = 0.01
    scheduler: str = "CosineAnnealing"
    # Reproducibility
    seed: int = 42
    timestamp: str = ""

    def __post_init__(self):
        # Keep an explicit timestamp (e.g. when reloading a saved config)
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()

# Usage
config = ExperimentConfig(learning_rate=0.0005, epochs=50)

# Save config
with open("experiment_config.json", "w") as f:
    json.dump(asdict(config), f, indent=2)
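Once each run saves its config as JSON, comparing two runs reduces to diffing two dictionaries. The following is a minimal sketch under that assumption. `diff_configs`, the `ignore` parameter, and the two sample configs are hypothetical and not part of any tracking library.

```python
import json


def diff_configs(a: dict, b: dict, ignore=("timestamp",)) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted((set(a) | set(b)) - set(ignore))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Two hypothetical saved configs, shaped like the JSON written above.
run_a = json.loads('{"learning_rate": 0.0005, "epochs": 50, "seed": 42, "timestamp": "t1"}')
run_b = json.loads('{"learning_rate": 0.001, "epochs": 50, "seed": 42, "timestamp": "t2"}')

print(diff_configs(run_a, run_b))  # {'learning_rate': (0.0005, 0.001)}
```

The timestamp is ignored by default because it differs between any two runs and would otherwise drown out the hyperparameter changes.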
# Set seed for reproducibility
import torch
import numpy as np
import random

def set_seed(seed: int):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    # Disable cuDNN autotuning, which can select different kernels per run
    torch.backends.cudnn.benchmark = False

set_seed(config.seed)

Pattern 5: Model Registry Pattern
import mlflow
import mlflow.pytorch

# Register model after training
with mlflow.start_run():
    # ... training code ...

    # Log and register model
    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="text-classifier"
    )

# Transition model stages
# NOTE: model stages are deprecated since MLflow 2.9 in favor of
# model version aliases; this pattern targets the stage-based workflow.
client = mlflow.tracking.MlflowClient()

# Promote to staging
client.transition_model_version_stage(
    name="text-classifier",
    version=1,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name="text-classifier",
    version=1,
    stage="Production"
)

# Load production model
model = mlflow.pytorch.load_model(
    "models:/text-classifier/Production"
)

Model Lifecycle:
┌───────────┐    ┌───────────┐    ┌─────────────┐    ┌──────────┐
│   None    │───▶│  Staging  │───▶│ Production  │───▶│ Archived │
│ (trained) │    │ (testing) │    │  (serving)  │    │  (old)   │
└───────────┘    └───────────┘    └─────────────┘    └──────────┘
 Train &          A/B              Serve              Keep
 Log model        Test             users              history

Pattern 6: Learning Rate Schedulers
from torch.optim.lr_scheduler import (
    StepLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR
)

# Pick ONE of the following; each assignment replaces the previous one.

# 1. Step decay (reduce every N epochs)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# 2. Cosine annealing (smooth decay)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# 3. Reduce on plateau (adaptive)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=10, factor=0.5)

# 4. One-cycle (warmup + decay, best for training from scratch)
# Unlike the others, call scheduler.step() after every batch, not every epoch.
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=100,
    steps_per_epoch=len(train_loader)
)
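For intuition, the step and cosine schedules above follow simple closed-form rules. The sketch below reproduces those formulas in plain Python, using the same `step_size`, `gamma`, `T_max`, and `eta_min` values. The function names are made up for the example, and it assumes nothing else modifies the learning rate during training.

```python
import math


def step_decay(base_lr, epoch, step_size=30, gamma=0.1):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing(base_lr, epoch, t_max=100, eta_min=1e-6):
    """Decay from base_lr to eta_min along half a cosine wave over t_max epochs."""
    cosine = 1 + math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine


# Print both schedules at a few epochs for a base learning rate of 0.001.
for epoch in (0, 29, 30, 50, 100):
    print(epoch, step_decay(0.001, epoch), cosine_annealing(0.001, epoch))
```

Printing a schedule like this before a long run is a cheap way to confirm that the learning rate ends where you expect it to.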
# In training loop
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    # validate returns (loss, accuracy), as in the Quick Start
    val_loss, _ = validate(model, val_loader, criterion)

    # For ReduceLROnPlateau (needs metric)
    scheduler.step(val_loss)

    # For StepLR and CosineAnnealingLR
    # scheduler.step()

    print(f"LR: {optimizer.param_groups[0]['lr']:.6f}")

Learning Rate Schedules:
Step Decay:          Cosine:            OneCycle:          ReduceOnPlateau:
lr│──┐               lr│∿               lr│  /\            lr│────┐
  │  └──┐              │ ∿                │ /  \             │    └──┐
  │     └──┐           │  ∿               │/    \            │       └──
  │        └──         │   ∿∿             │      \           │
  └──────── epoch      └──── epoch        └────── epoch      └──────── epoch

Pattern 7: Early Stopping
import torch

class EarlyStopping:
    """Stop training when validation loss stops improving."""

    def __init__(self, patience=10, min_delta=0.001, path="best_model.pt"):
        self.patience = patience
        self.min_delta = min_delta
        self.path = path
        self.counter = 0
        self.best_loss = float('inf')
        self.should_stop = False

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: save a checkpoint and reset the counter
            self.best_loss = val_loss
            self.counter = 0
            torch.save(model.state_dict(), self.path)
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
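The `patience` and `min_delta` bookkeeping in the class above can be checked without a model or checkpoint file. The sketch below replays the same comparison over a list of validation losses. `stopping_epoch` and the sample losses are made up for the example.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.001):
    """Return the epoch index at which training would stop, or None if it never does."""
    best, counter = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement larger than min_delta: reset the counter.
            best, counter = loss, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch
    return None


# 0.5995 is lower than the best loss (0.60) but not by min_delta,
# so it still counts as an epoch without improvement.
losses = [0.90, 0.70, 0.60, 0.61, 0.5995, 0.62, 0.59]
print(stopping_epoch(losses))  # 5
```

Note that the final loss of 0.59 is never reached: with `patience=3`, training stops at epoch 5. A larger patience trades extra epochs for a lower risk of stopping just before a late improvement.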
# Usage
early_stopping = EarlyStopping(patience=15)

for epoch in range(1000):  # Set high, early stopping will cut it
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    # validate returns (loss, accuracy), as in the Quick Start
    val_loss, _ = validate(model, val_loader, criterion)

    early_stopping(val_loss, model)
    if early_stopping.should_stop:
        print(f"Early stopping at epoch {epoch}")
        break

# Load best model
model.load_state_dict(torch.load("best_model.pt"))

Reference Navigation
For detailed content, see:
- Experiment Tracking: reference/experiment_tracking.md - MLflow, W&B, comparison guide
- Hyperparameter Tuning: reference/hyperparameter_tuning.md - Optuna, Grid/Random/Bayesian search
- Model Registry: reference/model_registry.md - Versioning, staging, deployment
- TensorBoard Guide: reference/tensorboard_guide.md - Visualization, embedding projector
Common Mistakes to Avoid
1. Not Setting Seeds
# WRONG: Results change every run
model = train_model(data)  # Non-reproducible!

# CORRECT: Set all seeds
set_seed(42)
model = train_model(data)  # Reproducible, given the same code, data, and environment

2. Not Logging Everything
# WRONG: Only logging final accuracy
mlflow.log_metric("accuracy", final_acc)

# CORRECT: Log everything for debugging later
mlflow.log_params(asdict(config))            # All hyperparameters
mlflow.log_metrics(metrics, step=epoch)      # Metrics per epoch
mlflow.log_artifact("confusion_matrix.png")  # Visualizations

3. Manual Hyperparameter Search
# WRONG: Testing params by hand
# lr=0.001 → bad, lr=0.01 → bad, lr=0.005 → maybe?
# CORRECT: Systematic search with Optuna
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
# The default TPE sampler focuses on promising regions; trials that report
# intermediate values can be pruned early

Teaching Mode
MLOps Visual
ML Development WITHOUT MLOps:
"Which run had the best accuracy?"
"What hyperparameters did I use last week?"
"Which model is in production?"
→ Chaos, no reproducibility
ML Development WITH MLOps:
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ Version  │   │  Track   │   │ Compare  │   │  Deploy  │
│ Data &   │──▶│ Experi-  │──▶│ Results  │──▶│  Best    │
│ Code     │   │ ments    │   │ Auto     │   │  Model   │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
  Git +          MLflow/        Dashboard      Registry
  DVC            W&B            + Charts       + Staging

Cross-References
- PyTorch training: ../pytorch-mastery/SKILL.md - Training loops, checkpoints
- Deep learning: ../deep-learning-core/SKILL.md - Loss functions, optimizers
- Fine-tuning: ../fine-tuning-peft/SKILL.md - LoRA/QLoRA experiments
- Transformers: ../transformers-llm/SKILL.md - HuggingFace Trainer integration