mlops-experiment

MLOps and experiment tracking for reproducible ML workflows. Covers MLflow, Weights & Biases (W&B), TensorBoard, hyperparameter tuning (Optuna, Ray Tune), model registry, experiment versioning, and production deployment patterns. Use when the user asks about 'MLflow', 'W&B', 'Weights and Biases', 'experiment tracking', 'hyperparameter tuning', 'Optuna', 'model registry', 'TensorBoard', 'reproducibility', 'model versioning', 'ML pipeline', 'model deployment', 'logging', 'wandb', or 'Ray Tune'.

levy-n 10 1 Updated 5mo ago

Install

npx skillscat add levy-n/claude-useful-skills/mlops-experiment

Install via the SkillsCat registry.

SKILL.md

MLOps & Experiment Tracking

MLOps, experiment tracking, and model management in production.

Quick Start - MLflow Experiment Tracking

import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

# Start MLflow tracking
mlflow.set_experiment("my-classification-experiment")

with mlflow.start_run(run_name="baseline-mlp"):
    # Log hyperparameters
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 50,
        "hidden_dim": 64,
        "optimizer": "Adam",
        "dropout": 0.3
    })

    # Training loop
    for epoch in range(50):
        train_loss = train_epoch(model, train_loader, criterion, optimizer)
        val_loss, val_acc = validate(model, val_loader, criterion)

        # Log metrics each epoch
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log final model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts (plots, configs)
    mlflow.log_artifact("training_curve.png")

# View results: run `mlflow ui`, then open http://localhost:5000 in a browser

When This Skill Activates

Use this skill when:

Tracking experiments across multiple runs
Comparing model performance systematically
Tuning hyperparameters (grid/random/Bayesian)
Saving and versioning trained models
Building reproducible ML workflows
Deploying models to production

Core Patterns

Pattern 1: Weights & Biases (W&B)

import wandb

# Initialize
wandb.init(
    project="image-classifier",
    name="resnet50-augmented",
    config={
        "learning_rate": 0.001,
        "architecture": "ResNet50",
        "dataset": "CIFAR-10",
        "epochs": 100,
        "batch_size": 32,
    }
)

# Training loop with logging
for epoch in range(wandb.config.epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss, val_acc = validate(model, val_loader, criterion)

    # Log metrics (W&B charts each key in the run dashboard)
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": optimizer.param_groups[0]["lr"]
    })

    # Log sample predictions as images (one wandb.Image per sample,
    # because caption must be a single string, not a list)
    if epoch % 10 == 0:
        images, labels, preds = get_predictions(model, val_loader)
        wandb.log({
            "predictions": [
                wandb.Image(img, caption=f"True:{l} Pred:{p}")
                for img, l, p in zip(images[:16], labels[:16], preds[:16])
            ]
        })

# Upload the best checkpoint file to the run
wandb.save("best_model.pt")
wandb.finish()

Pattern 2: TensorBoard

import torch
from torch.utils.tensorboard import SummaryWriter

# Create writer
writer = SummaryWriter("runs/experiment_1")

# Log scalars and weight histograms once per epoch
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    # validate returns (loss, accuracy), as in the Quick Start
    val_loss, _ = validate(model, val_loader, criterion)

    writer.add_scalars("Loss", {
        "train": train_loss,
        "validation": val_loss
    }, epoch)

    writer.add_scalar("Learning Rate",
                       optimizer.param_groups[0]["lr"], epoch)

    # Log histograms of weights (inside the loop so each epoch gets its own step)
    for name, param in model.named_parameters():
        writer.add_histogram(name, param, epoch)

# Log model graph
dummy_input = torch.randn(1, input_dim)
writer.add_graph(model, dummy_input)

# Log embeddings
features = get_features(model, dataset)
writer.add_embedding(features, metadata=labels, tag="learned_features")

writer.close()

# View: tensorboard --logdir=runs

Pattern 3: Hyperparameter Tuning with Optuna

import optuna
import torch
import torch.nn as nn

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    hidden_dim = trial.suggest_categorical("hidden_dim", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "SGD", "AdamW"])
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])

    # Build model with suggested params
    model = nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_dim, hidden_dim // 2),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_dim // 2, num_classes)
    ).to(device)

    # Select optimizer
    if optimizer_name == "Adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == "SGD":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

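    # Rebuild the loader so the suggested batch size is actually used
    # (train_dataset is assumed to be defined alongside val_loader)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True
    )

    # Train and evaluate
    for epoch in range(20):
        train_epoch(model, train_loader, criterion, optimizer)

        # Pruning: stop bad trials early
        # validate returns (loss, accuracy), as in the Quick Start
        val_loss, _ = validate(model, val_loader, criterion)
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss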

# Run optimization
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100, timeout=3600)

# Results
print(f"Best value: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Visualization (requires plotly)
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()

Pattern 4: Experiment Comparison Framework

# Structured experiment tracking (works with any logger)
from dataclasses import dataclass, asdict
from datetime import datetime
import json

@dataclass
class ExperimentConfig:
    """All hyperparameters in one place for reproducibility."""
    # Data
    dataset: str = "CIFAR-10"
    train_size: int = 50000
    val_split: float = 0.1

    # Model
    architecture: str = "ResNet18"
    hidden_dim: int = 64
    dropout: float = 0.3

    # Training
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    optimizer: str = "Adam"
    weight_decay: float = 0.01
    scheduler: str = "CosineAnnealing"

    # Reproducibility
    seed: int = 42
    timestamp: str = ""

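    def __post_init__(self):
        # Only stamp new configs, so a config rebuilt from saved JSON keeps its original timestamp
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()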

# Usage
config = ExperimentConfig(learning_rate=0.0005, epochs=50)

# Save config
with open("experiment_config.json", "w") as f:
    json.dump(asdict(config), f, indent=2)

# Set seed for reproducibility
import torch
import numpy as np
import random

def set_seed(seed: int):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    # Trade some speed for deterministic cuDNN kernel selection
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(config.seed)
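
Saving each run's config only pays off once two runs are compared field by field. A minimal sketch of that comparison, using only the standard library: `RunConfig` is a hypothetical, trimmed-down stand-in for `ExperimentConfig`, and `diff_configs` is an illustrative helper, not part of any tracking library.

```python
from dataclasses import dataclass, asdict


@dataclass
class RunConfig:
    """Hypothetical, trimmed-down stand-in for ExperimentConfig."""
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    optimizer: str = "Adam"


def diff_configs(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


baseline = RunConfig()
candidate = RunConfig(learning_rate=0.0005, epochs=50)

# Only the fields that changed between the two runs are reported
changes = diff_configs(baseline, candidate)
print(changes)
```

The same function works on configs reloaded from the saved JSON files, as long as they are rebuilt into the dataclass first.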

Pattern 5: Model Registry Pattern

import mlflow
import mlflow.pytorch

# Register model after training
with mlflow.start_run():
    # ... training code ...

    # Log and register model
    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="text-classifier"
    )

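# Transition model stages
# Note: stage transitions are deprecated as of MLflow 2.9 in favor of
# model version aliases (client.set_registered_model_alias)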
client = mlflow.tracking.MlflowClient()

# Promote to staging
client.transition_model_version_stage(
    name="text-classifier",
    version=1,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name="text-classifier",
    version=1,
    stage="Production"
)

# Load production model
model = mlflow.pytorch.load_model(
    "models:/text-classifier/Production"
)

Model Lifecycle:
┌───────────┐    ┌───────────┐    ┌─────────────┐    ┌──────────┐
│   None    │───▶│  Staging  │───▶│ Production  │───▶│ Archived │
│ (trained) │    │ (testing) │    │  (serving)  │    │  (old)   │
└───────────┘    └───────────┘    └─────────────┘    └──────────┘
  Train &            A/B              Serve              Keep
  log model          test             users              history
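
The lifecycle above can be guarded in project code so that a version never skips a stage. This is a sketch under assumptions: the allowed transitions are read directly off the diagram, `ALLOWED` and `check_transition` are hypothetical names, and the ordering is a team convention rather than something the registry is claimed to enforce.

```python
# Allowed stage transitions, taken from the lifecycle diagram.
# Assumption: this strict ordering is a project convention for illustration.
ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production"},
    "Production": {"Archived"},
    "Archived": set(),
}


def check_transition(current, target):
    """Return target if current -> target follows the lifecycle, else raise."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"Illegal stage transition: {current} -> {target}")
    return target


stage = "None"
for next_stage in ("Staging", "Production", "Archived"):
    stage = check_transition(stage, next_stage)
print(stage)
```

Calling `check_transition` before each registry update turns an accidental jump from "None" straight to "Production" into an immediate error.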

Pattern 6: Learning Rate Schedulers

from torch.optim.lr_scheduler import (
    StepLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR
)

# Pick ONE of the schedulers below; each assignment is an alternative.

# 1. Step decay (multiply LR by gamma every step_size epochs)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# 2. Cosine annealing (smooth decay)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# 3. Reduce on plateau (adaptive)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=10, factor=0.5)

# 4. One-cycle (warmup + decay; stepped once per batch, not per epoch)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=100,
    steps_per_epoch=len(train_loader)
)

# In training loop
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    # validate returns (loss, accuracy), as in the Quick Start
    val_loss, _ = validate(model, val_loader, criterion)

    if isinstance(scheduler, ReduceLROnPlateau):
        # ReduceLROnPlateau needs the monitored metric
        scheduler.step(val_loss)
    elif not isinstance(scheduler, OneCycleLR):
        # StepLR / CosineAnnealingLR: step once per epoch
        scheduler.step()
    # OneCycleLR: call scheduler.step() after every batch inside train_epoch instead

    print(f"LR: {optimizer.param_groups[0]['lr']:.6f}")

Learning Rate Schedules:

Step Decay:        Cosine:         OneCycle:        ReduceOnPlateau:
lr│ ──┐            lr│∿             lr│  /\          lr│ ────┐
  │   └──┐          │  ∿            │ /  \           │     └──┐
  │      └──┐       │   ∿           │/    \          │        └──
  │         └──     │    ∿∿         │      \         │
  └──────── epoch   └──── epoch     └────── epoch    └──────── epoch
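
The first two curves have simple closed forms, which makes it easy to check what learning rate to expect at a given epoch. The sketch below uses plain Python, with defaults chosen to match the `StepLR` and `CosineAnnealingLR` arguments used earlier. The function names are illustrative, and the formulas are the standard closed-form expressions for these schedules (cosine annealing without restarts), not code taken from PyTorch.

```python
import math


def step_decay(base_lr, epoch, step_size=30, gamma=0.1):
    """LR is multiplied by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing(base_lr, epoch, t_max=100, eta_min=1e-6):
    """LR follows half a cosine wave from base_lr down to eta_min over t_max epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


for epoch in (0, 29, 30, 60, 100):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(0.1, epoch))
```

Printing a few epochs like this is a quick sanity check before committing to a long run.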

Pattern 7: Early Stopping

import torch

class EarlyStopping:
    """Stop training when validation loss stops improving."""

    def __init__(self, patience=10, min_delta=0.001, path="best_model.pt"):
        self.patience = patience
        self.min_delta = min_delta
        self.path = path
        self.counter = 0
        self.best_loss = float('inf')
        self.should_stop = False

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            torch.save(model.state_dict(), self.path)
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True

# Usage
early_stopping = EarlyStopping(patience=15)

for epoch in range(1000):  # Set high, early stopping will cut it
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    # validate returns (loss, accuracy), as in the Quick Start
    val_loss, _ = validate(model, val_loader, criterion)

    early_stopping(val_loss, model)
    if early_stopping.should_stop:
        print(f"Early stopping at epoch {epoch}")
        break

# Load best model
model.load_state_dict(torch.load("best_model.pt"))
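
To see how `patience` and `min_delta` interact, the same rule can be traced over a fixed list of validation losses without training anything. This is a sketch: `stopping_epoch` is a hypothetical helper that mirrors the comparison in `EarlyStopping.__call__`, and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.001):
    """Apply the EarlyStopping rule to a list of losses.

    Returns (epoch_stopped, best_epoch); epoch_stopped is None
    if patience never ran out.
    """
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Real improvement: remember it and reset the counter
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Illustrative losses: epoch 4 improves on epoch 2 by less than min_delta,
# so it does not reset the counter.
losses = [0.90, 0.70, 0.65, 0.66, 0.6495, 0.67, 0.64]
result = stopping_epoch(losses, patience=3)
print(result)
```

With these values training stops at epoch 5 and the checkpoint from epoch 2 is the one kept, even though epoch 6 would have been better. That is the cost of a small `patience`.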

Reference Navigation

For detailed content, see:

Experiment Tracking: reference/experiment_tracking.md - MLflow, W&B, comparison guide
Hyperparameter Tuning: reference/hyperparameter_tuning.md - Optuna, Grid/Random/Bayesian search
Model Registry: reference/model_registry.md - Versioning, staging, deployment
TensorBoard Guide: reference/tensorboard_guide.md - Visualization, embedding projector

Common Mistakes to Avoid

1. Not Setting Seeds

# WRONG: Results change every run
model = train_model(data)  # Non-reproducible!

# CORRECT: Set all seeds
set_seed(42)
model = train_model(data)  # Reproducible!
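
The effect of seeding can be shown with the standard library alone. In this sketch `noisy_run` is a hypothetical stand-in for a training run whose result depends on random numbers. It uses a local `random.Random` instance, so the seed is explicit rather than hidden in global state.

```python
import random


def noisy_run(seed=None):
    """Stand-in for a training run: returns five pseudo-random 'losses'."""
    rng = random.Random(seed)  # seed=None falls back to OS entropy
    return [round(rng.random(), 6) for _ in range(5)]


# Same seed, same sequence; a different seed gives a different sequence
same = noisy_run(42) == noisy_run(42)
different = noisy_run(1) != noisy_run(2)
print(same, different)
```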

2. Not Logging Everything

# WRONG: Only logging final accuracy
mlflow.log_metric("accuracy", final_acc)

# CORRECT: Log everything for debugging later
mlflow.log_params(asdict(config))       # All hyperparameters
mlflow.log_metrics(metrics, step=epoch)  # Metrics per epoch
mlflow.log_artifact("confusion_matrix.png")  # Visualizations
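
Per-epoch logging matters because the final number can hide where the run actually peaked. A minimal sketch with a JSON-lines file standing in for the tracking backend: `log_step` and `load_history` are hypothetical helpers, and the metric values are invented for illustration.

```python
import json
import os
import tempfile


def log_step(path, step, **metrics):
    """Append one row of metrics to a JSON-lines file."""
    with open(path, "a") as f:
        f.write(json.dumps({"step": step, **metrics}) + "\n")


def load_history(path):
    """Read every logged row back as a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "metrics.jsonl")
for step, (train_loss, val_loss) in enumerate([(0.9, 1.0), (0.6, 0.8), (0.4, 0.85)]):
    log_step(log_path, step, train_loss=train_loss, val_loss=val_loss)

history = load_history(log_path)
# The final epoch has the lowest train loss, but validation loss bottomed out earlier
best = min(history, key=lambda row: row["val_loss"])
print(best)
```

Only the full history shows that the last epoch was already overfitting. A single final metric would not.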

3. Manual Hyperparameter Search

# WRONG: Testing params by hand
# lr=0.001 → bad, lr=0.01 → bad, lr=0.005 → maybe?

# CORRECT: Systematic search with Optuna
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
# Explores intelligently, prunes bad trials early
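
For intuition about what a systematic search does, here is a plain random search in the standard library. It is a simpler stand-in for illustration, not Optuna's sampler, and it does no pruning. `toy_objective` is a made-up loss surface, and the log-uniform draw for `lr` mirrors what `suggest_float(..., log=True)` is used for above.

```python
import math
import random


def toy_objective(lr, dropout):
    """Hypothetical validation loss with its minimum at lr=1e-3, dropout=0.3."""
    return (math.log10(lr) + 3) ** 2 + (dropout - 0.3) ** 2


def random_search(n_trials=200, seed=0):
    """Sample n_trials configurations and return (best_loss, best_params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-5, -2),    # log-uniform over [1e-5, 1e-2]
            "dropout": rng.uniform(0.1, 0.5),   # uniform over [0.1, 0.5]
        }
        loss = toy_objective(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


best_loss, best_params = random_search()
print(best_loss, best_params)
```

Every trial is recorded by construction and the run is repeatable from its seed, which hand-tuning does not give you.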

Teaching Mode

MLOps Visual

ML Development WITHOUT MLOps:
  "Which run had the best accuracy?"
  "What hyperparameters did I use last week?"
  "Which model is in production?"
  → Chaos, no reproducibility

ML Development WITH MLOps:
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ Version  │   │ Track    │   │ Compare  │   │ Deploy   │
│ Data &   │──▶│ Experi-  │──▶│ Results  │──▶│ Best     │
│ Code     │   │ ments    │   │ Auto     │   │ Model    │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
    Git +          MLflow/        Dashboard       Registry
    DVC            W&B            + Charts        + Staging

Cross-References

PyTorch training: ../pytorch-mastery/SKILL.md - Training loops, checkpoints
Deep learning: ../deep-learning-core/SKILL.md - Loss functions, optimizers
Fine-tuning: ../fine-tuning-peft/SKILL.md - LoRA/QLoRA experiments
Transformers: ../transformers-llm/SKILL.md - HuggingFace Trainer integration
