Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
Install
Install via the SkillsCat registry:
npx skillscat add openclaw/skills/ml-model-eval-benchmark
SKILL.md
ML Model Eval Benchmark
Overview
Produce consistent model ranking outputs from metric-weighted evaluation inputs.
Workflow
- Define metric weights and accepted metric ranges.
- Ingest model metrics for each candidate.
- Compute weighted score and ranking.
- Export leaderboard and promotion recommendation.
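The workflow steps above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/benchmark_models.py: the metric names, weights, ranges, and the helpers `weighted_score` and `rank_models` are all hypothetical. It assumes higher is better for every metric, scales each value into its accepted range, and breaks ties by model name so the ordering stays deterministic.

```python
from typing import Dict, List, Tuple

# Hypothetical configuration: one weight and one accepted (low, high) range per metric.
WEIGHTS: Dict[str, float] = {"accuracy": 0.6, "f1": 0.4}
RANGES: Dict[str, Tuple[float, float]] = {"accuracy": (0.0, 1.0), "f1": (0.0, 1.0)}


def weighted_score(metrics: Dict[str, float]) -> float:
    """Return the weighted mean of range-scaled metric values (assumes higher is better)."""
    total_weight = sum(WEIGHTS.values())
    score = 0.0
    for name, weight in WEIGHTS.items():
        low, high = RANGES[name]
        value = metrics[name]  # a missing metric raises KeyError on purpose
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside accepted range [{low}, {high}]")
        score += weight * (value - low) / (high - low)
    return score / total_weight


def rank_models(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sort by score descending, then by model name so ties resolve the same way every run."""
    scored = [(name, round(weighted_score(m), 6)) for name, m in candidates.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Illustrative inputs only; model_a and model_b tie on purpose.
candidates = {
    "model_b": {"accuracy": 0.90, "f1": 0.80},
    "model_a": {"accuracy": 0.90, "f1": 0.80},
    "model_c": {"accuracy": 0.70, "f1": 0.95},
}

leaderboard = rank_models(candidates)
recommended = leaderboard[0][0]  # simplest possible promotion rule: take the top-ranked model

for rank, (name, score) in enumerate(leaderboard, start=1):
    print(rank, name, score)
print("promote:", recommended)
```

Rounding the score before sorting is one way to keep floating-point noise from reordering near-identical candidates. A real promotion rule would likely also compare the top score against the currently deployed model or a minimum threshold.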
Use Bundled Resources
- Run scripts/benchmark_models.py to generate benchmark outputs.
- Read references/benchmarking-guide.md for weighting and tie-break guidance.
Guardrails
- Keep metric names and scales consistent across candidates.
- Record weighting assumptions in the output.
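Both guardrails can be enforced mechanically. The sketch below is one possible approach and is not taken from the bundled script: `check_metric_names` and `build_report` are hypothetical helpers, and the JSON layout is an assumption. The first reports candidates whose metric names differ from the expected set; the second writes the weights into the same document as the leaderboard.

```python
import json
from typing import Dict, List, Tuple


def check_metric_names(
    candidates: Dict[str, Dict[str, float]], expected: List[str]
) -> Dict[str, Dict[str, List[str]]]:
    """Return, per model, any missing or unexpected metric names (empty dict if all match)."""
    expected_set = set(expected)
    problems: Dict[str, Dict[str, List[str]]] = {}
    for model in sorted(candidates):
        names = set(candidates[model])
        missing = sorted(expected_set - names)
        unexpected = sorted(names - expected_set)
        if missing or unexpected:
            problems[model] = {"missing": missing, "unexpected": unexpected}
    return problems


def build_report(leaderboard: List[Tuple[str, float]], weights: Dict[str, float]) -> str:
    """Serialize the leaderboard together with the weights that produced it."""
    payload = {
        "weights": weights,
        "leaderboard": [
            {"rank": i + 1, "model": name, "score": score}
            for i, (name, score) in enumerate(leaderboard)
        ],
    }
    # sort_keys keeps the serialized output stable across runs
    return json.dumps(payload, indent=2, sort_keys=True)


# Illustrative data: model_b reports "F1" instead of "f1", which the check flags.
problems = check_metric_names(
    {
        "model_a": {"accuracy": 0.90, "f1": 0.80},
        "model_b": {"accuracy": 0.90, "F1": 0.80},
    },
    expected=["accuracy", "f1"],
)
report = build_report([("model_a", 0.86)], {"accuracy": 0.6, "f1": 0.4})
print(problems)
print(report)
```

The name check catches spelling and casing mismatches only. Scale mismatches, such as one candidate reporting accuracy as 0.9 and another as 90, need a separate range check.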