Improve strict JSON generation reliability for fine-tuned language models using parser-based evaluation, prompt alignment, and targeted retraining loops. Use when outputs are malformed, have extra text, or violate schema/range constraints.
Resources
2Install
npx skillscat add haruk1y/mistral-hackathon/strict-json-output-hardening Install via the SkillsCat registry.
SKILL.md
Strict Json Output Hardening
Use this skill when JSON format correctness is a release gate.
Target Contract
Expected output is exactly one JSON object with keys:
energywarmthbrightnessacousticnesscomplexitynostalgia
Constraints:
- All values are integers.
- All values are in range
0..10. - No extra keys, no wrapper object, no markdown.
- No trailing tokens after the final
}.
Workflow
- Freeze prompt contract.
- Keep train prompt and inference prompt text aligned.
- Do not mix multiple output schemas across runs.
- Measure failures with parser-first eval.
- Parse raw completions as strict JSON object.
- Track failure categories, not only aggregate validity rate.
- Reproduce with focused debug runs.
- Compare prompt variants before retraining.
- Keep decoding deterministic (
do_sample=false) during diagnosis.
- Apply fixes in order.
- Inference-side controls: EOS, max tokens, stop behavior.
- Prompt wording hardening: explicit "single JSON object only".
- Data-side fixes: add hard cases and prompt-completion rows.
- Training-side fixes: keep completion-only loss and split hygiene.
- Re-evaluate with hard gates.
- Gate on
json_valid_rate. - Only compare quality metrics after format validity is stable.
Repository Mapping
- Prompt variant comparison:
scripts/hf/debug_json_prompt_variants.py - Full/adpater debug inference:
scripts/hf/debug_full_model_json_inference.py - Local output audit:
scripts/hf/debug_local_eval_outputs.py - Training pipeline:
scripts/hf/train_sft_request_to_hidden_lm.py - Dataset conversion to prompt-completion:
scripts/ft/convert_to_prompt_completion_dataset.py
Required Reporting
Always report:
- Current failure taxonomy with counts.
- Top 3 most frequent categories and representative outputs.
- Applied fixes and expected metric movement.
- Pass/fail gate recommendation for promotion.
References
references/json-failure-taxonomy.md