ML Training Debugger - Diagnose and Fix Training Issues

Context Cascade - Nested Plugin Architecture for Claude Code
Official Claude Code Plugin | Version 3.1.0 | Last updated: 2026-01-09 (see docs/COMPONENT-COUNTS.json for source counts)
Context-saving nested architecture: Playbooks -> Skills -> Agents -> Commands. Load only what you need, saving 90%+ context space.

DNYoussef 32 6 Updated 6mo ago

Install

npx skillscat add dnyoussef/context-cascade/skills-delivery-when-debugging-ml-training-use-ml-training-debugger

Install via the SkillsCat registry.

SKILL.md

/*============================================================================*/
/* WHEN-DEBUGGING-ML-TRAINING-USE-ML-TRAINING-DEBUGGER SKILL :: VERILINGUA x VERIX EDITION */
/*============================================================================*/

name: when-debugging-ml-training-use-ml-training-debugger
version: 1.0.0
description: |
  [assert|neutral] Debug ML training issues and optimize performance including loss divergence, overfitting, and slow convergence [ground:given] [conf:0.95] [state:confirmed]
category: machine-learning
tags:
- debugging
- ml
- training
- optimization
- troubleshooting
author: ruv
cognitive_frame:
  primary: evidential
goal_analysis:
  first_order: "Execute when-debugging-ml-training-use-ml-training-debugger workflow"
  second_order: "Ensure quality and consistency"
  third_order: "Enable systematic machine-learning processes"

/*----------------------------------------------------------------------------*/
/* S0 META-IDENTITY                                                            */
/*----------------------------------------------------------------------------*/

[define|neutral] SKILL := {
  name: "when-debugging-ml-training-use-ml-training-debugger",
  category: "machine-learning",
  version: "1.0.0",
  layer: L1
} [ground:given] [conf:1.0] [state:confirmed]

/*----------------------------------------------------------------------------*/
/* S1 COGNITIVE FRAME                                                          */
/*----------------------------------------------------------------------------*/

[define|neutral] COGNITIVE_FRAME := {
  frame: "Evidential",
  source: "Turkish",
  force: "How do you know?"
} [ground:cognitive-science] [conf:0.92] [state:confirmed]

Kanitsal Cerceve (Evidential Frame Activation)

Kaynak dogrulama modu etkin. (Source verification mode is active.)

/*----------------------------------------------------------------------------*/
/* S2 TRIGGER CONDITIONS                                                       */
/*----------------------------------------------------------------------------*/

[define|neutral] TRIGGER_POSITIVE := {
  keywords: ["when-debugging-ml-training-use-ml-training-debugger", "machine-learning", "workflow"],
  context: "user needs when-debugging-ml-training-use-ml-training-debugger capability"
} [ground:given] [conf:1.0] [state:confirmed]

/*----------------------------------------------------------------------------*/
/* S3 CORE CONTENT                                                             */
/*----------------------------------------------------------------------------*/

When NOT to Use This Skill

Simple data preprocessing without model training
Statistical analysis that does not require ML models
Rule-based systems without learning components
Operations that do not involve model training or inference

Success Criteria

[assert|neutral] Model training convergence: Loss decreasing consistently [ground:acceptance-criteria] [conf:0.90] [state:provisional]
[assert|neutral] Validation accuracy: Meeting or exceeding baseline targets [ground:acceptance-criteria] [conf:0.90] [state:provisional]
[assert|neutral] Training time: Within expected bounds for dataset size [ground:acceptance-criteria] [conf:0.90] [state:provisional]
[assert|neutral] GPU utilization: >80% during training [ground:acceptance-criteria] [conf:0.90] [state:provisional]
[assert|neutral] Model export success: 100% successful saves [ground:acceptance-criteria] [conf:0.90] [state:provisional]
[assert|neutral] Inference latency: <100ms for real-time applications [ground:acceptance-criteria] [conf:0.90] [state:provisional]

Edge Cases & Error Handling

GPU Memory Overflow: Reduce batch size, use gradient accumulation, or mixed precision
Divergent Training: Implement learning rate scheduling, gradient clipping
Data Pipeline Failures: Validate data integrity, handle missing/corrupted files
Version Mismatches: Lock dependency versions, use containerization
Checkpoint Corruption: Save multiple checkpoints, validate before loading
Distributed Training Failures: Handle node failures, implement fault tolerance
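
Gradient clipping, listed above as a remedy for divergent training, can be illustrated without any framework. The sketch below rescales a set of gradient vectors so that their combined L2 norm does not exceed a limit. The function name `clip_by_global_norm` and the sample values are assumptions made for illustration and are not part of this skill; deep learning frameworks provide their own equivalents (for example `torch.nn.utils.clip_grad_norm_` in PyTorch).

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their combined L2 norm is at most max_norm.

    grads is a list of lists of floats (one list per parameter tensor).
    Returns (clipped_grads, original_norm). Hypothetical helper for illustration.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if not math.isfinite(total):
        # NaN/inf gradients cannot be repaired by scaling; surface them instead
        raise ValueError("non-finite gradient norm; inspect the loss and inputs")
    if total <= max_norm:
        # Already within the limit: return an unmodified copy
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


# Example: the global norm is sqrt(3^2 + 4^2 + 12^2) = 13
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, original_norm = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Clipping by the global norm keeps the direction of the update and only shrinks its length, which is why it is usually preferred over clipping each value independently.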

Guardrails & Safety

[assert|emphatic] NEVER: train on unvalidated or uncleaned data [ground:policy] [conf:0.98] [state:confirmed]
[assert|neutral] ALWAYS: validate model outputs before deployment [ground:policy] [conf:0.98] [state:confirmed]
[assert|neutral] ALWAYS: implement reproducibility (random seeds, version pinning) [ground:policy] [conf:0.98] [state:confirmed]
[assert|emphatic] NEVER: expose training data in model artifacts or logs [ground:policy] [conf:0.98] [state:confirmed]
[assert|neutral] ALWAYS: monitor for bias and fairness issues [ground:policy] [conf:0.98] [state:confirmed]
[assert|neutral] ALWAYS: implement model versioning and rollback capabilities [ground:policy] [conf:0.98] [state:confirmed]
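
The reproducibility guardrail above (random seeds, version pinning) might be approached as in the following sketch. The helper names `set_seed` and `describe_environment` are assumptions for illustration. Only Python's own RNG and, if installed, NumPy's global RNG are seeded here; a training framework needs its own seeding call as well (for example `torch.manual_seed` in PyTorch), and seeding alone does not guarantee bit-identical results on GPU.

```python
import platform
import random
import sys


def set_seed(seed):
    """Seed Python's RNG and, if it is installed, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass


def describe_environment():
    """Collect interpreter details worth logging next to each training run."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Re-seeding with the same value reproduces the same random sequence
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
```

Logging the output of `describe_environment()` together with the seed makes it easier to tell whether two diverging runs differ in code, data, or environment.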

Evidence-Based Validation

Verify hardware availability: Check GPU/TPU status before training
Validate data quality: Run data integrity checks and statistics
Monitor training: Track loss curves, gradients, and metrics
Test model performance: Evaluate on held-out test set
Benchmark inference: Measure latency and throughput under load
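
As a rough illustration of the inference benchmark step, the sketch below times repeated calls to a prediction function and reports mean, median, and 95th-percentile latency plus throughput. `benchmark_latency`, its parameters, and the stand-in `predict` function are assumptions, not part of this skill. It measures single-threaded, sequential calls only, so it does not reproduce concurrent load.

```python
import math
import statistics
import time


def benchmark_latency(predict, sample, warmup=5, runs=50):
    """Time `runs` sequential calls of predict(sample) after `warmup` untimed calls."""
    for _ in range(warmup):
        predict(sample)  # warm caches before measuring
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Nearest-rank 95th percentile on the sorted timings
    p95_index = max(0, math.ceil(0.95 * runs) - 1)
    total_s = sum(timings_ms) / 1000.0
    return {
        "mean_ms": statistics.fmean(timings_ms),
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "throughput_per_s": runs / total_s if total_s > 0 else float("inf"),
    }


def predict(features):
    """Stand-in for a real model call (hypothetical)."""
    return sum(features) / len(features)


stats = benchmark_latency(predict, list(range(1000)), warmup=2, runs=20)
meets_realtime_target = stats["p95_ms"] < 100.0  # the <100ms criterion stated earlier
```

Comparing the 95th percentile rather than the mean against the latency target avoids hiding occasional slow calls.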

ML Training Debugger - Diagnose and Fix Training Issues

Overview

Systematic debugging workflow for ML training issues including loss divergence, overfitting, slow convergence, gradient problems, and performance optimization.

When to Use

Training loss becomes NaN or infinite
Severe overfitting (train >> val performance)
Training not converging
Gradient vanishing/exploding
Poor validation accuracy
Training too slow

Phase 1: Diagnose Issue (8 min)

Objective

Identify the specific training problem

Agent: ML-Developer

Step 1.1: Analyze Training Curves

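import json
import numpy as np

def check_loss_divergence(losses):
    """Return True when the last 5 losses average higher than the 5 before them."""
    # Loss increasing over time (needs more than 10 recorded values)
    if len(losses) > 10:
        recent_trend = np.mean(losses[-5:]) > np.mean(losses[-10:-5])
        # The source is truncated at this point; returning the flag is an assumed completion
        return bool(recent_trend)
    return False

# Load training history
with open('training_history.json', 'r') as f:
    history = json.load(f)

# Diagnose issues.
# check_overfitting, check_convergence_rate and check_gradient_health are not
# shown in this excerpt and must be defined before this dictionary is built.
diagnosis = {
    'loss_divergence': check_loss_divergence(history['loss']),
    'overfitting': check_overfitting(history['loss'], history['val_loss']),
    'slow_convergence': check_convergence_rate(history['loss']),
    'gradient_issues': check_gradient_health(history),
    # Convert to a float array so NaN detection works on plain Python lists
    'nan_values': bool(np.isnan(np.asarray(history['loss'], dtype=float)).any())
}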

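The diagnosis dictionary in Step 1.1 calls three helpers that this excerpt does not define. The following is a minimal sketch of what they might look like, not the skill's own implementation. The thresholds (`gap_ratio`, `min_improvement`, `low`, `high`), the window sizes, and the `grad_norm` history key are all assumptions chosen for illustration, and the sketch uses only the standard library rather than NumPy.

```python
import math
from statistics import fmean


def check_overfitting(train_losses, val_losses, window=5, gap_ratio=0.2):
    """Flag overfitting when recent validation loss exceeds recent training
    loss by more than gap_ratio, relative to the training loss (assumed rule)."""
    if len(train_losses) < window or len(val_losses) < window:
        return False
    train_recent = fmean(train_losses[-window:])
    val_recent = fmean(val_losses[-window:])
    denom = max(abs(train_recent), 1e-12)  # avoid division by zero
    return (val_recent - train_recent) / denom > gap_ratio


def check_convergence_rate(losses, window=10, min_improvement=0.01):
    """Flag slow convergence when the relative loss drop over the last
    `window` epochs is below min_improvement (a rising loss is flagged too)."""
    if len(losses) <= window:
        return False
    start, end = losses[-window - 1], losses[-1]
    denom = max(abs(start), 1e-12)
    return (start - end) / denom < min_improvement


def check_gradient_health(history, key="grad_norm", low=1e-7, high=1e3):
    """Return a list of detected gradient issues; an empty list means none found.
    Assumes the history stores one gradient norm per epoch under `key`."""
    norms = history.get(key, [])
    issues = []
    if any(not math.isfinite(n) for n in norms):
        issues.append("non_finite")
    finite = [n for n in norms if math.isfinite(n)]
    if finite and max(finite) > high:
        issues.append("exploding")
    if finite and fmean(finite[-5:]) < low:
        issues.append("vanishing")
    return issues


# Hypothetical history: training loss keeps falling while validation loss rises
example = {
    "loss": [1.0 - 0.05 * i for i in range(12)],
    "val_loss": [1.0 + 0.02 * i for i in range(12)],
    "grad_norm": [0.9, 1.1, 1.0, 0.8, 1.2],
}
overfit = check_overfitting(example["loss"], example["val_loss"])
slow = check_convergence_rate(example["loss"])
grad_issues = check_gradient_health(example)
```

Returning a list from `check_gradient_health` keeps the result falsy when gradients look healthy while still naming the specific problem when they do not.
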
/*----------------------------------------------------------------------------*/
/* S4 SUCCESS CRITERIA                                                         */
/*----------------------------------------------------------------------------*/

[define|neutral] SUCCESS_CRITERIA := {
  primary: "Skill execution completes successfully",
  quality: "Output meets quality thresholds",
  verification: "Results validated against requirements"
} [ground:given] [conf:1.0] [state:confirmed]

/*----------------------------------------------------------------------------*/
/* S5 MCP INTEGRATION                                                          */
/*----------------------------------------------------------------------------*/

[define|neutral] MCP_INTEGRATION := {
  memory_mcp: "Store execution results and patterns",
  tools: ["mcp__memory-mcp__memory_store", "mcp__memory-mcp__vector_search"]
} [ground:witnessed:mcp-config] [conf:0.95] [state:confirmed]

/*----------------------------------------------------------------------------*/
/* S6 MEMORY NAMESPACE                                                         */
/*----------------------------------------------------------------------------*/

[define|neutral] MEMORY_NAMESPACE := {
  pattern: "skills/machine-learning/when-debugging-ml-training-use-ml-training-debugger/{project}/{timestamp}",
  store: ["executions", "decisions", "patterns"],
  retrieve: ["similar_tasks", "proven_patterns"]
} [ground:system-policy] [conf:1.0] [state:confirmed]

[define|neutral] MEMORY_TAGGING := {
  WHO: "when-debugging-ml-training-use-ml-training-debugger-{session_id}",
  WHEN: "ISO8601_timestamp",
  PROJECT: "{project_name}",
  WHY: "skill-execution"
} [ground:system-policy] [conf:1.0] [state:confirmed]
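
To make the namespace pattern and tagging fields above concrete, the sketch below fills them in for one execution. The function `build_memory_record` and the shape of the returned dictionary are assumptions for illustration. The actual payload expected by the memory MCP tools is not specified in this document.

```python
from datetime import datetime, timezone

SKILL_NAME = "when-debugging-ml-training-use-ml-training-debugger"
NAMESPACE_PATTERN = "skills/machine-learning/" + SKILL_NAME + "/{project}/{timestamp}"


def build_memory_record(project, session_id, now=None):
    """Fill the namespace pattern and WHO/WHEN/PROJECT/WHY tags for one run.

    Hypothetical helper; `now` can be injected to make the output deterministic.
    """
    now = now or datetime.now(timezone.utc)
    timestamp = now.isoformat()  # ISO 8601, e.g. 2026-01-09T12:00:00+00:00
    return {
        "key": NAMESPACE_PATTERN.format(project=project, timestamp=timestamp),
        "tags": {
            "WHO": f"{SKILL_NAME}-{session_id}",
            "WHEN": timestamp,
            "PROJECT": project,
            "WHY": "skill-execution",
        },
    }


record = build_memory_record(
    project="demo",
    session_id="abc123",
    now=datetime(2026, 1, 9, 12, 0, tzinfo=timezone.utc),
)
```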

/*----------------------------------------------------------------------------*/
/* S7 SKILL COMPLETION VERIFICATION                                            */
/*----------------------------------------------------------------------------*/

[direct|emphatic] COMPLETION_CHECKLIST := {
  agent_spawning: "Spawn agents via Task()",
  registry_validation: "Use registry agents only",
  todowrite_called: "Track progress with TodoWrite",
  work_delegation: "Delegate to specialized agents"
} [ground:system-policy] [conf:1.0] [state:confirmed]

/*----------------------------------------------------------------------------*/
/* S8 ABSOLUTE RULES                                                           */
/*----------------------------------------------------------------------------*/

[direct|emphatic] RULE_NO_UNICODE := forall(output): NOT(unicode_outside_ascii) [ground:windows-compatibility] [conf:1.0] [state:confirmed]

[direct|emphatic] RULE_EVIDENCE := forall(claim): has(ground) AND has(confidence) [ground:verix-spec] [conf:1.0] [state:confirmed]

[direct|emphatic] RULE_REGISTRY := forall(agent): agent IN AGENT_REGISTRY [ground:system-policy] [conf:1.0] [state:confirmed]

/*----------------------------------------------------------------------------*/
/* PROMISE                                                                     */
/*----------------------------------------------------------------------------*/

[commit|confident] <promise>WHEN_DEBUGGING_ML_TRAINING_USE_ML_TRAINING_DEBUGGER_VERILINGUA_VERIX_COMPLIANT</promise> [ground:self-validation] [conf:0.99] [state:confirmed]
