Analyze W&B, Weave, and local fine-tuning evaluation artifacts, then produce a concrete next-run improvement plan with data, prompt, and training actions. Use after each SFT or eval cycle.
Install
Install via the SkillsCat registry:
npx skillscat add haruk1y/mistral-hackathon/wandb-weave-ft-retrospective
SKILL.md
Wandb Weave Ft Retrospective
Use this skill to turn experiment traces into an actionable fine-tuning plan.
When To Use
- User asks for a retrospective, run diagnosis, or "next fine-tuning plan".
- You have W&B runs but no clear priority of what to fix first.
- You need to convert hard-case outputs into concrete dataset or prompt changes.
Required Inputs
- Submission records:
  - artifacts/hf_jobs/submissions.jsonl
  - artifacts/hf_jobs/eval_submissions.jsonl
- Per-run outputs:
  - outputs/*/final_metrics.json
  - outputs/*/iter_eval_metrics.jsonl
  - outputs/*/hard_cases.valid.jsonl
- Optional trace export:
  - outputs/*/weave_eval_traces.json
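A minimal sketch of loading these inputs for one run directory, assuming each JSONL file holds one JSON object per line. The helper names `read_jsonl` and `load_run` are hypothetical and not part of the repository; only the file names come from the list above.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one JSON object per non-empty line; [] if the file is missing."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def load_run(run_dir):
    """Collect the per-run artifacts into one dict (hypothetical helper)."""
    run_dir = Path(run_dir)
    metrics_path = run_dir / "final_metrics.json"
    traces_path = run_dir / "weave_eval_traces.json"  # optional export
    return {
        "run_dir": str(run_dir),
        "final_metrics": json.loads(metrics_path.read_text(encoding="utf-8")),
        "iter_eval_metrics": read_jsonl(run_dir / "iter_eval_metrics.jsonl"),
        "hard_cases": read_jsonl(run_dir / "hard_cases.valid.jsonl"),
        "weave_traces": (
            json.loads(traces_path.read_text(encoding="utf-8"))
            if traces_path.exists()
            else None
        ),
    }
```

Calling `load_run` on the baseline and the candidate directory gives two dicts with the same keys, which keeps the later comparison symmetric. A missing `final_metrics.json` raises on purpose, since a run without final metrics cannot be diagnosed.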
Workflow
- Pin the comparison set.
  - Pick one baseline run and one candidate run.
  - Record base model id, adapter/full model id, dataset version, and run ids.
- Check gate metrics first.
  - Use json_valid_rate and parse_error_rate as hard gates.
  - Then inspect quality metrics (mae_raw, mse_raw, per-dimension mae_raw_*).
- Cluster dominant failures.
  - Group hard cases by parse_error, source_type, and request pattern.
  - Separate format failures from value-quality failures.
- Convert clusters into fixes.
  - Format failures first: output contract prompt text, decode config, EOS handling.
  - Value-quality failures next: add targeted training rows for high-error dimensions.
  - Hyperparameter tuning last: only after format validity is stable.
- Produce a run-ready plan.
  - Include exact env var overrides for the next run.
  - Include promotion criteria and rollback criteria.
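The gate check and clustering steps above can be sketched roughly as follows. This is an illustration under stated assumptions: the threshold values are placeholders rather than project settings, the hard-case records are assumed to carry `parse_error` and `source_type` keys, and the function names are hypothetical.

```python
from collections import Counter


def passes_gates(metrics, min_json_valid_rate=0.98, max_parse_error_rate=0.02):
    """Hard gates: both format metrics must be present and within thresholds.

    The default thresholds are illustrative assumptions, not project values.
    """
    valid = metrics.get("json_valid_rate")
    parse_err = metrics.get("parse_error_rate")
    if valid is None or parse_err is None:
        return False  # a missing gate metric counts as a failed gate
    return valid >= min_json_valid_rate and parse_err <= max_parse_error_rate


def cluster_hard_cases(hard_cases):
    """Count hard cases by (failure kind, source_type), largest cluster first."""
    counts = Counter()
    for case in hard_cases:
        # Any truthy parse_error (flag or message) marks a format failure.
        kind = "format" if case.get("parse_error") else "value_quality"
        counts[(kind, case.get("source_type", "unknown"))] += 1
    return counts.most_common()


def compare_runs(baseline, candidate):
    """Return 'reject' if gates fail, else 'promote' or 'hold' based on mae_raw."""
    if not passes_gates(candidate):
        return "reject"  # fix format validity before reading quality metrics
    if candidate["mae_raw"] < baseline["mae_raw"]:
        return "promote"
    return "hold"
```

Checking the gates before reading `mae_raw` mirrors the ordering in the workflow: a candidate with a low error but unstable JSON output is rejected rather than promoted. Grouping by request pattern is left out here because it depends on how requests are represented in the hard-case files.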
Repository Helpers
python scripts/hf/collect_campaign_validation.py
python scripts/hf/debug_json_prompt_variants.py
python scripts/hf/debug_local_eval_outputs.py
Output Contract
Always return all of the following:
- Short diagnosis summary.
- Top 3 failure clusters with evidence references.
- Concrete next-run changes (dataset, prompt, train config).
- Clear success gates for promotion.
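One way to keep a retrospective honest against this contract is a small completeness check before returning it. The sketch below is hypothetical: the class names, field names, and the rule that all three change categories must be filled are assumptions for illustration, not something defined by the skill or its template.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCluster:
    name: str
    evidence: list  # e.g. file paths or run ids pointing at supporting cases


@dataclass
class Retrospective:
    diagnosis: str
    clusters: list = field(default_factory=list)
    # Assumed keys: "dataset", "prompt", "train_config"
    next_run_changes: dict = field(default_factory=dict)
    success_gates: list = field(default_factory=list)

    def missing_parts(self):
        """Return the names of contract parts that are empty or incomplete."""
        missing = []
        if not self.diagnosis.strip():
            missing.append("diagnosis")
        top = self.clusters[:3]
        if len(top) < 3 or any(not c.evidence for c in top):
            missing.append("clusters")
        for key in ("dataset", "prompt", "train_config"):
            if not self.next_run_changes.get(key):
                missing.append(f"next_run_changes.{key}")
        if not self.success_gates:
            missing.append("success_gates")
        return missing
```

An empty list from `missing_parts()` means all four parts are present; anything else names the part that still needs work.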
References
references/retrospective-template.md