Operate Hugging Face Jobs reliably for training and evaluation in this repository, with reproducible submission records, monitoring, and retry flow. Use when submitting, monitoring, or triaging HF Jobs.
Install
Install via the SkillsCat registry:
npx skillscat add haruk1y/mistral-hackathon/hf-job-ops-playbook
SKILL.md
HF Job Ops Playbook
Use this skill for end-to-end Hugging Face Jobs operations in Mistral_Hackathon.
When To Use
- User asks to submit training or eval jobs on Hugging Face.
- User asks for job status, logs, failure triage, or rerun.
- You need reproducible submission ledgers and job ids.
Preflight
- Authenticate CLI: `hf auth whoami`
- Confirm required secrets are set:
  - `HF_TOKEN`
  - `WANDB_API_KEY` (if W&B logging is enabled)
  - `MISTRAL_API_KEY` (if Mistral API fallback is used by eval)
- Run dry-run first:
  - SFT: `node scripts/hf/submit_sft_job.mjs`
  - Eval: `node scripts/hf/submit_eval_job.mjs`
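The secrets check in the preflight list could be automated along these lines. This is a minimal sketch, not part of the repository: the function name `missing_secrets` and the `wandb_enabled` / `mistral_fallback` flags are hypothetical, and only the three variable names come from the list above.

```python
import os
from typing import List, Mapping, Optional


def missing_secrets(
    env: Optional[Mapping[str, str]] = None,
    wandb_enabled: bool = False,      # hypothetical flag: W&B logging is on
    mistral_fallback: bool = False,   # hypothetical flag: eval uses the Mistral API fallback
) -> List[str]:
    """Return the names of required secrets that are unset or blank."""
    env = os.environ if env is None else env
    required = ["HF_TOKEN"]
    if wandb_enabled:
        required.append("WANDB_API_KEY")
    if mistral_fallback:
        required.append("MISTRAL_API_KEY")
    # A variable counts as missing if it is absent or only whitespace.
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_secrets(wandb_enabled=True)` before a submission reads from the current environment; passing an explicit mapping makes the check easy to exercise without touching real secrets.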
Submit Workflow
- Submit SFT job.
  - `HF_JOB_SUBMIT=true node scripts/hf/submit_sft_job.mjs`
  - Record is appended to `artifacts/hf_jobs/submissions.jsonl`.
- Submit eval job.
  - `HF_EVAL_JOB_SUBMIT=true node scripts/hf/submit_eval_job.mjs`
  - Record is appended to `artifacts/hf_jobs/eval_submissions.jsonl`.
- Capture job id from CLI and ledger.
  - If id parse fails, keep `stdout_tail` and `stderr_tail` for manual lookup.
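Reading the most recent ledger entry might look like the sketch below. It assumes each ledger line is one JSON object and that a parsed job id is stored under a `job_id` key; that key name is a guess, since the ledger schema is not documented here. `stdout_tail` and `stderr_tail` are the fields named in the workflow above.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def latest_record(ledger_path: str) -> Optional[Dict[str, Any]]:
    """Return the last valid JSON object in a JSONL ledger, or None."""
    path = Path(ledger_path)
    if not path.exists():
        return None
    last = None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed or partially written lines
        if isinstance(parsed, dict):
            last = parsed
    return last


def job_id_or_tails(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return the job id if present, else the output tails for manual lookup."""
    job_id = record.get("job_id")  # assumed key name, not confirmed by the repo
    if job_id:
        return {"job_id": job_id}
    return {
        "stdout_tail": record.get("stdout_tail", ""),
        "stderr_tail": record.get("stderr_tail", ""),
    }
```

For example, `job_id_or_tails(latest_record("artifacts/hf_jobs/submissions.jsonl"))` would surface either the id or the tails to search by hand, provided the ledger exists and the assumed key matches.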
Monitoring Workflow
Use HF CLI directly:
- Running jobs: `hf jobs ps --namespace Haruk1y`
- All jobs: `hf jobs ps -a --namespace Haruk1y`
- Inspect: `hf jobs inspect <job_id>`
- Logs: `hf jobs logs -f <job_id>`
- Usage stats: `hf jobs stats <job_id>`
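If these commands are driven from a script rather than typed by hand, a thin wrapper can keep the argument lists in one place. The sketch below only assembles the commands listed above; the helper names `hf_jobs_cmd` and `run_streaming` are hypothetical, and the wrapper adds no flags beyond those already shown.

```python
import subprocess
from typing import List, Optional

NAMESPACE = "Haruk1y"  # namespace used in the commands above


def hf_jobs_cmd(action: str, job_id: Optional[str] = None, all_jobs: bool = False) -> List[str]:
    """Build the argv list for one of the monitoring commands."""
    if action == "ps":
        cmd = ["hf", "jobs", "ps"]
        if all_jobs:
            cmd.append("-a")
        return cmd + ["--namespace", NAMESPACE]
    if action in ("inspect", "logs", "stats"):
        if not job_id:
            raise ValueError(f"'{action}' requires a job id")
        cmd = ["hf", "jobs", action]
        if action == "logs":
            cmd.append("-f")  # follow the log stream, as in the command above
        return cmd + [job_id]
    raise ValueError(f"unsupported action: {action}")


def run_streaming(cmd: List[str]) -> int:
    """Run a command with output going straight to the terminal; return its exit code."""
    return subprocess.run(cmd, check=False).returncode
```

Output is left uncaptured so that a followed log stream prints as it arrives. Passing the arguments as a list also avoids shell quoting issues with job ids.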
Failure Triage
- Submission failure:
  - Check `hf auth whoami`.
  - Verify required secrets and env vars.
  - Re-run dry-run command and inspect built command string.
- Runtime failure:
  - Inspect logs and classify as data, dependency, OOM, or auth issue.
  - Apply smallest config change needed and re-submit.
- Post-run artifact missing:
  - Confirm `push_to_hub` and output model id settings.
  - Confirm local metrics files exist in output dir.
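The runtime classification step can be roughed out as a keyword match over the log text. The patterns below are illustrative assumptions, not an exhaustive or repository-defined list, and a real log may match none of them, in which case manual inspection is still needed.

```python
import re
from typing import List, Tuple

# Checked in order; the first category with a matching pattern wins.
# All patterns are illustrative guesses at common error strings.
PATTERNS: List[Tuple[str, List[str]]] = [
    ("oom", [r"out of memory", r"\bOOM\b"]),
    ("auth", [r"401 Client Error", r"403 Client Error", r"unauthorized"]),
    ("dependency", [r"ModuleNotFoundError", r"ImportError", r"No matching distribution"]),
    ("data", [r"FileNotFoundError", r"DatasetNotFoundError", r"JSONDecodeError"]),
]


def classify_log(log_text: str) -> str:
    """Return 'oom', 'auth', 'dependency', 'data', or 'unknown' for a log excerpt."""
    for category, patterns in PATTERNS:
        for pattern in patterns:
            if re.search(pattern, log_text, flags=re.IGNORECASE):
                return category
    return "unknown"
```

Ordering the categories puts OOM first because an out-of-memory crash often produces secondary errors further down the log that would otherwise be misread as the root cause.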
Repository Mapping
- SFT submission: `scripts/hf/submit_sft_job.mjs`
- Eval submission: `scripts/hf/submit_eval_job.mjs`
- Training script: `scripts/hf/train_sft_request_to_hidden_lm.py`
- Eval runner: `scripts/wandb/weave_eval_runner.py`
- Submission ledgers: `artifacts/hf_jobs/submissions.jsonl`, `artifacts/hf_jobs/eval_submissions.jsonl`
References
references/job-ops-checklist.md