Compute/Throughput Baseline & Regression Gate

This skill should be used when the user asks to "add performance benchmarks", "create a regression gate", "measure training throughput", "compute MFU", "benchmark step time", "profile training loop", "set up CI perf gate", "compare against baseline", "collect environment info", "machine profile", "tokens per second measurement", "CUDA sync timing", "PyTorch profiler traces", "TensorBoard trace handler", "eval harness", "perplexity gate", "update performance baseline", "scaling efficiency test", or needs guidance on repeatable performance measurement, baseline storage, regression detection, profiling integration, or CI-gated throughput checks.

sovr610 0 Updated 4mo ago

Install

npx skillscat add sovr610/refffiy/compute-throughput-baseline-regression-gate

Install via the SkillsCat registry.

SKILL.md

Overview

Guide implementation of a repeatable performance+quality harness that runs fast enough for every PR, produces machine-readable artifacts, and blocks merges when throughput or quality regresses beyond defined tolerances. Cover environment capture, training micro-benchmarks with proper CUDA synchronization, deterministic eval probes, baseline storage keyed by machine profile, comparison logic with configurable tolerances, PyTorch Profiler integration, and CI workflow generation.

Public Contract

CollectEnv

Capture a deterministic environment snapshot and a machine profile string.

class CollectEnv:
    def collect(self) -> Dict[str, Any]: ...  # Full env dict
    def machine_profile(self) -> str: ...     # e.g. "H100x8_driver550_cuda12.4_torch2.4_sm90"
    def save(self, path: str) -> None: ...    # Write env.json
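
As a rough illustration of how the profile string in the comment above could be assembled, here is a minimal sketch. `GpuInfo`, `short_model`, and the field layout are hypothetical names introduced for this example; a real CollectEnv would presumably fill them from PyTorch and driver queries, and the skill's own template may build the string differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuInfo:
    """Hypothetical container for the fields that feed the profile string."""
    model: str    # e.g. "NVIDIA H100 80GB HBM3"
    count: int    # number of visible GPUs
    driver: str   # e.g. "550.54.15"
    cuda: str     # e.g. "12.4"
    torch: str    # e.g. "2.4.0+cu124"
    sm: int       # compute capability as major * 10 + minor, e.g. 90


def short_model(model: str) -> str:
    """Return the first token shaped like a GPU SKU (letters then digits), e.g. 'H100'."""
    for token in re.split(r"[\s\-]+", model):
        if re.fullmatch(r"[A-Za-z]+\d+[A-Za-z]*", token):
            return token
    # Fallback: keep the whole name with non-word characters stripped.
    return re.sub(r"\W+", "", model)


def machine_profile(info: GpuInfo) -> str:
    """Build a profile key from the GPU model/count, driver major, CUDA, torch major.minor, and SM."""
    driver_major = info.driver.split(".")[0]
    torch_mm = ".".join(info.torch.split("+")[0].split(".")[:2])
    return (
        f"{short_model(info.model)}x{info.count}"
        f"_driver{driver_major}_cuda{info.cuda}_torch{torch_mm}_sm{info.sm}"
    )
```

Keeping only the driver major and the torch major.minor is an assumption of this sketch: it trades some strictness for fewer baseline files.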

BenchTrain

Training micro-run benchmark with proper timing.

class BenchTrain:
    def __init__(self, config: BenchConfig): ...
    def run(self) -> BenchResult: ...
    def save(self, path: str) -> None: ...    # Write metrics.json

EvalSmall

Small, fixed, deterministic quality harness.

class EvalSmall:
    def __init__(self, config: EvalConfig): ...
    def run(self) -> EvalResult: ...
    def save(self, path: str) -> None: ...    # Write eval.json

CompareBaseline

Baseline comparison with configurable tolerances and pass/fail.

class CompareBaseline:
    def __init__(self, tolerances: ToleranceConfig): ...
    def compare(self, current: str, baseline: str) -> CompareResult: ...
    def update_baseline(self, current: str, dest: str) -> None: ...

BenchRunner

Unified entrypoint orchestrating all sub-modes.

class BenchRunner:
    def run(self, modes: List[str], **kwargs) -> Dict[str, Any]: ...

Key Concepts

Two Benchmark Modes

| Mode | What It Measures | When to Gate |
|---|---|---|
| `e2e` | Real dataloader + full step | Data path is stable |
| `synthetic` | No I/O, pure training loop | Data path noisy; alert on stalls separately |

Timing Mechanics (Critical)

Measure "end-to-end step time": forward + backward + optimizer step + distributed comm + gradient accumulation boundaries.

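import time
import torch

# train_step and warmup_steps are placeholders for your own step function and config value.
torch.cuda.reset_peak_memory_stats()        # Before warmup
for _ in range(warmup_steps):                # Warm caches + JIT
    train_step()
torch.cuda.synchronize()                     # Drain queued kernels BEFORE starting the timer
t0 = time.perf_counter()
train_step()                                 # One full training step
torch.cuda.synchronize()                     # Wait for the step's kernels AFTER the step
step_time = time.perf_counter() - t0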

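CUDA kernels launch asynchronously: the Python call returns as soon as the kernel is queued. Without synchronization on both sides of the step, the benchmark measures kernel queueing, not execution.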

Throughput Metrics

tokens_per_step = global_batch_size * seq_len (includes data-parallel aggregation)
tokens_per_sec = tokens_per_step / step_time_s
MFU (Model FLOPs Utilization) = achieved_flops_per_s / (num_gpus * peak_flops_per_gpu), where achieved_flops_per_s = flops_per_step / step_time_s

Two MFU estimators:

Simple 6ND: flops_per_step = 6 * N_non_embedding_params * tokens_per_step
Transformer-aware (optional): incorporates attention FLOPs
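
To make the arithmetic concrete, the following sketch applies the tokens/sec formula and the simple 6ND estimator. The function names are made up for this example, and the peak FLOPS figure in the usage note is a round illustrative number, not a vendor specification.

```python
def tokens_per_sec(global_batch_size: int, seq_len: int, step_time_s: float) -> float:
    """tokens_per_step / step_time_s, with tokens_per_step = global_batch_size * seq_len."""
    return global_batch_size * seq_len / step_time_s


def mfu_6nd(
    n_non_embedding_params: float,
    tok_per_sec: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Simple 6ND estimate: achieved FLOP/s = 6 * N * tokens/sec, divided by aggregate peak."""
    achieved_flops_per_s = 6.0 * n_non_embedding_params * tok_per_sec
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)
```

For example, with a global batch of 512 sequences of 2048 tokens and a 2.0 s step, throughput is 524,288 tokens/sec. For a 1e9-parameter model on 8 GPUs with an assumed peak of 1e15 FLOP/s each, the 6ND estimate gives an MFU of about 0.393.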

Machine Profile Rule

Compare only within the same machine profile (GPU model/count, driver, CUDA, PyTorch build) and store baselines per profile. Comparing across profiles produces false failures.

Aggregation for Gating

Gate on medians (p50) and optionally tails (p90) for jitter. Report: step_time mean/p50/p90, tokens/sec mean/p50/p10, memory peak, loss slope, MFU p50.
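
One way to produce the reported statistics from raw per-step timings is sketched below. The linear-interpolation percentile and the output key names are assumptions of this example; the metrics.json schema in the reference files may use different names or a different percentile definition.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile for q in [0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(step_times_s: Sequence[float], tokens_per_step: int) -> Dict[str, float]:
    """Aggregate step times and derived tokens/sec into mean and percentile metrics."""
    tps = [tokens_per_step / t for t in step_times_s]
    return {
        "step_time_mean": sum(step_times_s) / len(step_times_s),
        "step_time_p50": percentile(step_times_s, 50),
        "step_time_p90": percentile(step_times_s, 90),   # slow tail of step time
        "tokens_per_sec_mean": sum(tps) / len(tps),
        "tokens_per_sec_p50": percentile(tps, 50),
        "tokens_per_sec_p10": percentile(tps, 10),       # slow tail of throughput
    }
```

The tails point in opposite directions: a slow step shows up in the p90 of step time but in the p10 of tokens/sec, which is why the report pairs them that way.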

Baseline Storage

bench/baselines/<machine_profile>.metrics.json
bench/baselines/<machine_profile>.eval.json

Baselines are updated only on the main branch or by explicit maintainer action.
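
A small sketch of resolving and loading a per-profile baseline under this layout follows. `baseline_paths` and `load_baseline` are hypothetical helpers; returning `None` for a missing baseline is a design choice of this example, so that a new machine profile can be reported as "no baseline yet" rather than as a regression.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def baseline_paths(profile: str, root: str = "bench/baselines") -> Dict[str, Path]:
    """Map a machine profile to its metrics and eval baseline files."""
    base = Path(root)
    return {
        "metrics": base / f"{profile}.metrics.json",
        "eval": base / f"{profile}.eval.json",
    }


def load_baseline(profile: str, kind: str, root: str = "bench/baselines") -> Optional[dict]:
    """Load one baseline file, or return None when this profile has no baseline yet."""
    path = baseline_paths(profile, root)[kind]
    if not path.exists():
        return None
    return json.loads(path.read_text())
```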

Comparison Rules (Defaults)

| Gate | Metric | Threshold | Action |
|---|---|---|---|
| Perf | tokens_per_sec_p50 drop | > 5% | FAIL |
| Perf | step_time_p50 increase | > 5% | FAIL |
| Perf | memory increase | > 10% | WARN |
| Quality | perplexity worsens | > 1.5% relative | FAIL |
| Quality | probe accuracy drops | > 2% absolute | FAIL |
| Stability | loss slope positive | > threshold | FAIL |
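
The perf rows of these rules could be applied roughly as follows. This is a sketch under stated assumptions: `PerfTolerances`, `compare_perf`, and `gate_passes` are hypothetical names, `peak_memory_bytes` is an assumed metric key, and the quality and stability gates are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PerfTolerances:
    """Trimmed-down stand-in for the perf thresholds listed above."""
    throughput_drop_pct: float = 5.0
    step_time_increase_pct: float = 5.0
    memory_increase_pct: float = 10.0


def pct_change(current: float, baseline: float) -> float:
    """Signed relative change in percent; positive means current is larger."""
    return (current - baseline) / baseline * 100.0


def compare_perf(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tol: Optional[PerfTolerances] = None,
) -> List[Tuple[str, str, float]]:
    """Return (level, metric, delta_pct) findings for metrics outside tolerance."""
    tol = tol or PerfTolerances()
    findings: List[Tuple[str, str, float]] = []
    delta = pct_change(current["tokens_per_sec_p50"], baseline["tokens_per_sec_p50"])
    if delta < -tol.throughput_drop_pct:       # throughput regresses when it drops
        findings.append(("FAIL", "tokens_per_sec_p50", delta))
    delta = pct_change(current["step_time_p50"], baseline["step_time_p50"])
    if delta > tol.step_time_increase_pct:     # step time regresses when it grows
        findings.append(("FAIL", "step_time_p50", delta))
    delta = pct_change(current["peak_memory_bytes"], baseline["peak_memory_bytes"])
    if delta > tol.memory_increase_pct:        # memory growth only warns
        findings.append(("WARN", "peak_memory_bytes", delta))
    return findings


def gate_passes(findings: List[Tuple[str, str, float]]) -> bool:
    """The gate blocks only on FAIL findings; WARN is reported but does not block."""
    return all(level != "FAIL" for level, _, _ in findings)
```

Returning the signed delta with each finding lets the CI report say how far outside tolerance a metric landed, not just that it failed.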

PyTorch Profiler Integration

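import torch

# W, WU, A, R are placeholders: wait, warmup, and active step counts, and the number of cycles.
# total should cover at least (W + WU + A) * R steps so that every cycle completes.
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=W, warmup=WU, active=A, repeat=R),
    on_trace_ready=torch.profiler.tensorboard_trace_handler(out_dir),
) as prof:
    for step in range(total):
        with torch.profiler.record_function("forward"):
            ...
        prof.step()  # Advance the profiler schedule once per training step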

Wrap regions in record_function("name"): data, forward, loss, backward, optimizer_step, grad_sync.
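
When sizing a profiled run, it can help to work out how many steps the schedule needs and which phase a given step falls in. The sketch below is a pure-Python model of the wait/warmup/active cycle as I understand it from the PyTorch documentation. It does not call the profiler, ignores the optional `skip_first` argument, and uses hypothetical helper names.

```python
def total_profiled_steps(wait: int, warmup: int, active: int, repeat: int) -> int:
    """Steps needed for `repeat` full cycles of wait + warmup + active."""
    if repeat <= 0:
        raise ValueError("this sketch assumes a finite, positive repeat count")
    return (wait + warmup + active) * repeat


def phase_of(step: int, wait: int, warmup: int, active: int) -> str:
    """Phase of a zero-based step index within its cycle."""
    pos = step % (wait + warmup + active)
    if pos < wait:
        return "wait"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # tracing on, results discarded
    return "active"        # results recorded
```

For instance, `wait=1, warmup=1, active=3, repeat=2` needs 10 steps in total, and only steps 2 to 4 and 7 to 9 contribute to the recorded traces.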

Scaling Efficiency

efficiency = tokens_sec_p50(world_size=N) / (N * tokens_sec_p50(world_size=1))
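
A direct transcription of this formula, with made-up throughput numbers in the usage note, might look like the following. The function name is hypothetical.

```python
def scaling_efficiency(tps_p50_n: float, tps_p50_single: float, world_size: int) -> float:
    """Measured N-GPU throughput divided by ideal linear scaling of the 1-GPU throughput."""
    if world_size < 1 or tps_p50_single <= 0:
        raise ValueError("world_size must be >= 1 and single-GPU throughput positive")
    return tps_p50_n / (world_size * tps_p50_single)
```

If one GPU sustains 10,000 tokens/sec and eight GPUs sustain 72,000 tokens/sec, efficiency is 72,000 / 80,000 = 0.9. Both measurements should come from the same benchmark mode for the ratio to be meaningful.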

Eval Harness Design

Fixed shard bundled in-repo (no network drift)
Fixed tokenizer + truncation rules
Greedy decode (temperature=0), fixed seeds
2-3 task probes scored with exact-match or regex (no LLM-judge in CI)
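
The scoring side of such a harness can stay very small. The sketch below shows perplexity from per-token natural-log probabilities plus exact-match and regex probe scoring. The helper names are hypothetical, and obtaining the log-probabilities and generations from the model is outside the scope of this example.

```python
import math
import re
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability over the fixed shard's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def score_exact(prediction: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return prediction.strip() == expected.strip()


def score_regex(prediction: str, pattern: str) -> bool:
    """True when the pattern occurs anywhere in the prediction."""
    return re.search(pattern, prediction) is not None


def accuracy_pct(outcomes: Sequence[bool]) -> float:
    """Probe accuracy in percent, so drops can be compared in absolute points."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

Because every scorer is a pure function of strings or numbers, rerunning the eval with the same model outputs yields the same score, which is the property a CI gate needs.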

Configuration Surface

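from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BenchConfig:
    warmup_steps: int = 200
    measure_steps: int = 100
    mode: str = "e2e"                    # "e2e" | "synthetic"
    world_size: int = 1                  # 1 | N | -1 for auto
    profile: str = "off"                 # "off" | "trace" | "tb"
    repeat: int = 1
    peak_tflops: Optional[float] = None  # Override for MFU
    mfu_estimator: str = "6ND"           # "6ND" | "transformer_aware"

@dataclass
class EvalConfig:
    fixed_shard_path: str = "bench/data/fixed_shard.txt"
    # default_factory gives each instance its own list; the annotation is List[str], so the default is a list too
    probes: List[str] = field(
        default_factory=lambda: ["basic_reasoning_25", "format_following_30", "code_sanity_20"]
    )
    temperature: float = 0.0
    seed: int = 42

@dataclass
class ToleranceConfig:
    throughput_drop_pct: float = 5.0     # Relative drop in tokens_per_sec_p50, percent
    step_time_increase_pct: float = 5.0  # Relative increase in step_time_p50, percent
    memory_increase_pct: float = 10.0    # Relative increase in peak memory, percent (WARN only)
    ppl_increase_pct: float = 1.5        # Relative perplexity increase, percent
    probe_drop_abs: float = 2.0          # Absolute drop in probe accuracy, percentage points
    loss_slope_threshold: float = 0.001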

Done-When Gates

Bench Run — python -m tools.bench.run --bench --eval produces metrics.json and eval.json with stable schema; repeated runs on same hardware yield p50 within 3% variance.
Profile Traces — --profile trace produces profile/ directory with traces viewable in TensorBoard via PyTorch's documented tensorboard_trace_handler flow.
CI Gate — compare_baseline correctly blocks when throughput drops > X% or eval worsens beyond tolerance on the same machine profile; passes when within tolerance.
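
The repeatability condition in the Bench Run gate can be checked mechanically. The sketch below takes one possible reading of "within 3%": the spread between the largest and smallest p50 across repeated runs, relative to the smallest. Both that reading and the helper name are assumptions of this example.

```python
from typing import Sequence


def p50_is_repeatable(p50_values: Sequence[float], max_spread_pct: float = 3.0) -> bool:
    """True when (max - min) / min across repeated runs stays within max_spread_pct."""
    if len(p50_values) < 2:
        raise ValueError("need at least two runs to judge repeatability")
    lo, hi = min(p50_values), max(p50_values)
    return (hi - lo) / lo * 100.0 <= max_spread_pct
```

A validation script could run the benchmark a few times on the same machine profile and fail the done-when check if this returns `False`.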

Resources

Reference Files

references/benchmark-harness.md — CUDA sync timing, tokens/sec, MFU estimators, aggregation, metrics.json schema
references/eval-harness.md — Deterministic eval, perplexity, task probes, eval.json schema, seed management
references/baseline-comparison.md — Storage format, comparison rules, tolerance tuning, update workflow, failure reporting
references/profiling-mode.md — PyTorch profiler schedule, record_function ranges, TensorBoard export, Chrome traces
references/testing-matrix.md — Test scenarios for all phases

Asset Files

assets/collect_env_template.py — CollectEnv with GPU detection, machine profile, env.json output
assets/bench_train_template.py — BenchTrain with CUDA sync timing, MFU, aggregation, metrics.json
assets/eval_small_template.py — EvalSmall with fixed-shard perplexity, task probes, eval.json
assets/compare_baseline_template.py — CompareBaseline with tolerance rules, delta reporting, update mode
assets/run_template.py — BenchRunner unified entrypoint for all sub-modes
assets/perf_gate_config_template.py — All config dataclasses, GPU peak TFLOPS registry, defaults

Scripts

scripts/validate_perf_gate.py — Validates done-when gates
scripts/gen_perf_tests.py — Generates 100+ pytest test cases
scripts/ci_workflow_template.py — Generates .github/workflows/perf_gate.yml
