Haruk1y

wandb-weave-ft-retrospective

Analyze W&B, Weave, and local fine-tuning evaluation artifacts, then produce a concrete next-run improvement plan with data, prompt, and training actions. Use after each SFT or eval cycle.

Haruk1y 0 Updated 3mo ago


Install

npx skillscat add haruk1y/mistral-hackathon/wandb-weave-ft-retrospective

Install via the SkillsCat registry.

SKILL.md

W&B Weave Fine-Tuning Retrospective

Use this skill to turn experiment traces into an actionable fine-tuning plan.

When To Use

  • The user asks for a retrospective, a run diagnosis, or a "next fine-tuning plan".
  • You have W&B runs but no clear priority for what to fix first.
  • You need to convert hard-case outputs into concrete dataset or prompt changes.

Required Inputs

  • Submission records:
      • artifacts/hf_jobs/submissions.jsonl
      • artifacts/hf_jobs/eval_submissions.jsonl
  • Per-run outputs:
      • outputs/*/final_metrics.json
      • outputs/*/iter_eval_metrics.jsonl
      • outputs/*/hard_cases.valid.jsonl
  • Optional trace export:
      • outputs/*/weave_eval_traces.json
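
As a rough illustration of how these inputs might be gathered, the sketch below walks an outputs/ directory and loads the per-run files named above. The function names (read_jsonl, load_run, load_all_runs) are hypothetical, and the schema of each file is not specified by this skill, so every record is kept as a plain dict.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list of dicts, one per non-empty line; a missing file yields []."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def load_run(run_dir):
    """Collect the per-run files listed above for one outputs/<run> directory."""
    run_dir = Path(run_dir)
    metrics_path = run_dir / "final_metrics.json"
    final_metrics = (
        json.loads(metrics_path.read_text(encoding="utf-8"))
        if metrics_path.exists()
        else {}
    )
    return {
        "run": run_dir.name,
        "final_metrics": final_metrics,
        "iter_eval_metrics": read_jsonl(run_dir / "iter_eval_metrics.jsonl"),
        "hard_cases": read_jsonl(run_dir / "hard_cases.valid.jsonl"),
    }


def load_all_runs(outputs_root="outputs"):
    """Load every immediate subdirectory of the outputs root as one run."""
    root = Path(outputs_root)
    if not root.is_dir():
        return []
    return [load_run(p) for p in sorted(root.iterdir()) if p.is_dir()]
```

Missing files are treated as empty rather than as errors, so a partially finished run can still be inspected.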

Workflow

  1. Pin the comparison set.
      • Pick one baseline run and one candidate run.
      • Record base model id, adapter/full model id, dataset version, and run ids.
  2. Check gate metrics first.
      • Use json_valid_rate and parse_error_rate as hard gates.
      • Then inspect quality metrics (mae_raw, mse_raw, per-dimension mae_raw_*).
  3. Cluster dominant failures.
      • Group hard cases by parse_error, source_type, and request pattern.
      • Separate format failures from value-quality failures.
  4. Convert clusters into fixes.
      • Format failures first: output contract prompt text, decode config, EOS handling.
      • Value-quality failures next: add targeted training rows for high-error dimensions.
      • Hyperparameter tuning last: only after format validity is stable.
  5. Produce a run-ready plan.
      • Include exact env var overrides for the next run.
      • Include promotion criteria and rollback criteria.
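
The gate check and the clustering step could be sketched roughly as follows. The metric names json_valid_rate and parse_error_rate and the hard-case fields parse_error and source_type come from the workflow above. The numeric thresholds, the function names, and the reading of parse_error as a truthy flag on each hard case are assumptions made for illustration only.

```python
from collections import Counter

# Assumed thresholds; this skill does not specify numeric gates.
MIN_JSON_VALID_RATE = 0.99
MAX_PARSE_ERROR_RATE = 0.01


def passes_hard_gates(metrics):
    """Return (ok, reasons). A missing gate metric counts as a failure."""
    reasons = []
    valid = metrics.get("json_valid_rate")
    errors = metrics.get("parse_error_rate")
    if valid is None or valid < MIN_JSON_VALID_RATE:
        reasons.append(f"json_valid_rate={valid} below {MIN_JSON_VALID_RATE}")
    if errors is None or errors > MAX_PARSE_ERROR_RATE:
        reasons.append(f"parse_error_rate={errors} above {MAX_PARSE_ERROR_RATE}")
    return (not reasons, reasons)


def cluster_hard_cases(hard_cases, top_n=3):
    """Count hard cases by (failure kind, source_type); return the largest groups."""
    counts = Counter()
    for case in hard_cases:
        # Assumption: a truthy parse_error marks a format failure;
        # everything else is treated as a value-quality failure.
        kind = "format" if case.get("parse_error") else "value_quality"
        counts[(kind, case.get("source_type", "unknown"))] += 1
    return counts.most_common(top_n)
```

Keeping format and value-quality failures in separate buckets mirrors the fix order in step 4: the format clusters are addressed before any value-quality or hyperparameter work.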

Repository Helpers

  • python scripts/hf/collect_campaign_validation.py
  • python scripts/hf/debug_json_prompt_variants.py
  • python scripts/hf/debug_local_eval_outputs.py

Output Contract

Always return all of the following:

  • Short diagnosis summary.
  • Top 3 failure clusters with evidence references.
  • Concrete next-run changes (dataset, prompt, train config).
  • Clear success gates for promotion.
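
One possible way to make the success gates explicit is a small decision function that compares the baseline and candidate metrics. This is a sketch under stated assumptions: promotion_decision, its thresholds, and the three outcome labels are hypothetical, and only json_valid_rate and mae_raw are taken from the workflow above.

```python
def promotion_decision(baseline, candidate, min_valid_rate=0.99, min_mae_gain=0.0):
    """Return 'promote', 'rollback', or 'hold' for a candidate run.

    min_valid_rate and min_mae_gain are assumed example thresholds.
    """
    if candidate.get("json_valid_rate", 0.0) < min_valid_rate:
        return "rollback"  # format validity gates everything else
    base_mae = baseline.get("mae_raw")
    cand_mae = candidate.get("mae_raw")
    if base_mae is None or cand_mae is None:
        return "hold"  # not enough evidence to compare quality
    gain = base_mae - cand_mae  # positive means the candidate has lower error
    if gain > min_mae_gain:
        return "promote"
    if gain < 0:
        return "rollback"
    return "hold"
```

For example, a candidate with a valid JSON rate of 1.0 and a lower mae_raw than the baseline would be promoted, while a candidate that fails the validity threshold is rolled back regardless of its error metrics.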

References

  • references/retrospective-template.md