A/B benchmarking skill for code reviewer agent prompts. Runs two versions of a reviewer agent against fixture ground truth using an LLM judge for semantic evaluation, then compares precision/recall/F1 metrics side-by-side. Defaults to comparing current agent vs git HEAD~1. Reports IMPROVED / REGRESSED / NEUTRAL verdict on F1.
Install
Install via the SkillsCat registry:
npx skillscat add whichguy/claude-craft/review-fix-bench
Step 0 — Parse Arguments
Parse the user's invocation:
/review-fix-bench # current vs git HEAD~1, code-reviewer agent
/review-fix-bench --candidate path/to/new.md # current vs explicit candidate file
/review-fix-bench --agent review-fix # benchmark review-fix instead of code-reviewer
/review-fix-bench --judge path/to/judge.md # custom judge agent
/review-fix-bench --fixtures path/to/dir/ # custom fixtures directory
/review-fix-bench --runs N # N runs per fixture (max 3)
/review-fix-bench --label-a NAME --label-b NAME # custom labels for reports
Set defaults:
- agent = code-reviewer
- fixtures_dir = test/fixtures/review-fix/
- runs = 1
- judge_file = agents/review-fix-judge.md (can be overridden with --judge)
- label_a = current
- label_b = candidate (or prev if using git HEAD~1)
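The flag handling and defaults above could be sketched in shell roughly as follows. This is a minimal illustration, not the skill's actual parser: the function name parse_bench_args and the candidate variable are hypothetical, and clamping --runs to the documented maximum of 3 (rather than rejecting larger values) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: sets the documented defaults, then applies overrides.
parse_bench_args() {
  agent="code-reviewer"
  fixtures_dir="test/fixtures/review-fix/"
  runs=1
  judge_file="agents/review-fix-judge.md"
  label_a="current"
  label_b=""      # resolved below: "candidate" or "prev"
  candidate=""    # empty means "compare against git HEAD~1"

  while [ "$#" -gt 0 ]; do
    flag="$1"
    if [ "$#" -lt 2 ]; then
      echo "Missing value for $flag" >&2
      return 1
    fi
    value="$2"
    shift 2
    case "$flag" in
      --candidate) candidate="$value" ;;
      --agent)     agent="$value" ;;
      --judge)     judge_file="$value" ;;
      --fixtures)  fixtures_dir="$value" ;;
      --runs)      runs="$value" ;;
      --label-a)   label_a="$value" ;;
      --label-b)   label_b="$value" ;;
      *) echo "Unknown argument: $flag" >&2; return 1 ;;
    esac
  done

  # Reject non-numeric run counts before comparing them.
  case "$runs" in
    ''|*[!0-9]*) echo "--runs expects a positive integer" >&2; return 1 ;;
  esac
  # Assumption: values above the documented maximum are clamped to 3.
  if [ "$runs" -gt 3 ]; then runs=3; fi

  # label_b falls back to "candidate" or "prev" depending on the mode.
  if [ -z "$label_b" ]; then
    if [ -n "$candidate" ]; then label_b="candidate"; else label_b="prev"; fi
  fi
}

parse_bench_args --agent review-fix --runs 5
echo "agent=$agent runs=$runs label_b=$label_b"
# prints: agent=review-fix runs=3 label_b=prev
```

Resolving label_b at the end keeps an explicit --label-b from being overwritten by the mode-dependent default.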
Step 1 — Resolve File Paths and Pre-flight Checks
Resolve repo root:
REPO_DIR=$(git -C "$(pwd)" rev-parse --show-toplevel)
Pre-flight checks (verify these exist before proceeding):
Judge agent: $REPO_DIR/agents/review-fix-judge.md (or --judge override)
- Must exist and contain "tp", "fn", "fp_count" (grep to verify)
- If missing: error with "Judge agent not found — run: ln -sfn $(pwd)/agents ~/.claude/agents or check CLAUDE.md"
Bench harness: $REPO_DIR/tools/review-fix-bench.sh
- Must exist and contain JUDGE_FILE (grep to verify)
- If missing JUDGE_FILE: error with "Harness missing --judge-file support — ensure Phase 2 was applied"
Fixtures directory: $REPO_DIR/$fixtures_dir
- Must contain at least one *.ground-truth.json file
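The three pre-flight checks could be scripted along these lines. This is a sketch under assumptions: the function name preflight is hypothetical, the error messages are shortened paraphrases of the ones listed above, and a judge file that exists but lacks one of the keys is treated as a separate error.

```shell
#!/bin/sh
# Hypothetical helper mirroring the three pre-flight checks.
# $1 = repo root, $2 = judge file path, $3 = fixtures directory path.
preflight() {
  repo_dir="$1"; judge_path="$2"; fixtures_path="$3"
  harness="$repo_dir/tools/review-fix-bench.sh"

  # Judge agent: must exist and mention all three result keys.
  if [ ! -f "$judge_path" ]; then
    echo "Judge agent not found: $judge_path" >&2
    return 1
  fi
  for key in '"tp"' '"fn"' '"fp_count"'; do
    if ! grep -qF -- "$key" "$judge_path"; then
      echo "Judge agent does not mention $key" >&2
      return 1
    fi
  done

  # Bench harness: must exist and reference JUDGE_FILE.
  if [ ! -f "$harness" ] || ! grep -qF -- "JUDGE_FILE" "$harness"; then
    echo "Harness missing --judge-file support" >&2
    return 1
  fi

  # Fixtures: at least one *.ground-truth.json file.
  # An unmatched glob stays literal, so the -e test fails as intended.
  set -- "$fixtures_path"/*.ground-truth.json
  if [ ! -e "$1" ]; then
    echo "No *.ground-truth.json files in $fixtures_path" >&2
    return 1
  fi
}

# Usage (paths follow the defaults described above):
#   preflight "$REPO_DIR" "$REPO_DIR/$judge_file" "$REPO_DIR/$fixtures_dir" || exit 1
```

Failing fast here avoids spending two full bench runs before discovering that the judge or harness is out of date.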
Resolve Version A (always the current agent file):
version_a_path = "$REPO_DIR/agents/${agent}.md"
- Validate it exists; error if not
Resolve Version B (candidate or git HEAD~1):
If --candidate <path> was provided:
- version_b_path = <path> (resolve relative to cwd if not absolute)
- Validate the file exists; if not, error: "Candidate file not found: "
- version_b_source = "candidate: <path>"
- Set label_b = "candidate" if not overridden
Otherwise, extract from git:
tmp_b=$(mktemp /tmp/bench-agent-b.XXXXXX)
git -C "$REPO_DIR" show HEAD~1:agents/${agent}.md > "$tmp_b" 2>/dev/null
- If this fails (exit non-zero or empty file): error with a clear message:
"Cannot extract HEAD~1 version of agents/${agent}.md — file may be new or only one commit exists.
Use --candidate to specify version B explicitly."
- version_b_path = "$tmp_b" (will be cleaned up after bench completes)
- version_b_source = "git HEAD~1: agents/${agent}.md"
- Set label_b = "prev" if not overridden
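Both branches of the Version B resolution can be folded into one function, sketched below. The name resolve_version_b is hypothetical, appending the resolved path to the "Candidate file not found" message is an assumption, and the failure message for the git branch is abbreviated. The temp file is removed immediately when extraction fails so nothing is left behind.

```shell
#!/bin/sh
# Hypothetical helper: resolves version B from --candidate or from git HEAD~1.
# $1 = repo root, $2 = agent name, $3 = candidate path (may be empty).
# Sets version_b_path, version_b_source, and tmp_b (empty unless a temp file exists).
resolve_version_b() {
  repo_dir="$1"; agent="$2"; candidate="$3"
  tmp_b=""

  if [ -n "$candidate" ]; then
    # Relative candidate paths are resolved against the current directory.
    case "$candidate" in
      /*) version_b_path="$candidate" ;;
      *)  version_b_path="$(pwd)/$candidate" ;;
    esac
    if [ ! -f "$version_b_path" ]; then
      # Assumption: the message ends with the resolved path.
      echo "Candidate file not found: $version_b_path" >&2
      return 1
    fi
    version_b_source="candidate: $candidate"
    return 0
  fi

  tmp_b=$(mktemp /tmp/bench-agent-b.XXXXXX) || return 1
  # Fail on a non-zero exit from git or on an empty extracted file.
  if ! git -C "$repo_dir" show "HEAD~1:agents/${agent}.md" > "$tmp_b" 2>/dev/null \
     || [ ! -s "$tmp_b" ]; then
    rm -f "$tmp_b"
    tmp_b=""
    echo "Cannot extract HEAD~1 version of agents/${agent}.md." >&2
    echo "Use --candidate to specify version B explicitly." >&2
    return 1
  fi
  version_b_path="$tmp_b"
  version_b_source="git HEAD~1: agents/${agent}.md"
}

# Usage:
#   resolve_version_b "$REPO_DIR" "$agent" "$candidate" || exit 1
```

Checking for an empty file as well as the exit status covers the case where the redirect succeeds but git writes nothing.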
Git hash for version A:
git_hash_a=$(git -C "$REPO_DIR" rev-parse --short HEAD 2>/dev/null || echo "unknown")
Step 2 — Run Bench A and Bench B in Parallel
Spawn two Task agents in a single parallel message (run_in_background: true) to execute:
Task A — benchmark current agent:
cd "$REPO_DIR" && tools/review-fix-bench.sh \
--run \
--label "${label_a}" \
--fixtures "${fixtures_dir}" \
--runs "${runs}" \
--agent-file "${version_a_path}" \
--judge-file "${judge_file}"
Task B — benchmark candidate agent:
cd "$REPO_DIR" && tools/review-fix-bench.sh \
--run \
--label "${label_b}" \
--fixtures "${fixtures_dir}" \
--runs "${runs}" \
--agent-file "${version_b_path}" \
--judge-file "${judge_file}"
Both tasks capture stdout. The harness prints "Results written to: <path>"; parse this line to get the result JSON path for each run.
On failure: If either task exits non-zero, print the captured stderr and abort. Clean up $tmp_b if set.
After both complete: Extract result paths:
result_a = line matching "Results written to:" from Task A stdout
result_b = line matching "Results written to:" from Task B stdout
If either path is missing or the file doesn't exist, error: "Bench run failed to produce results file — check stderr above."
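Extracting the result path from captured stdout might look like the sketch below. The helper names extract_result_path and require_result are hypothetical, the sample harness output and path are made up for illustration, and taking the last matching line is an assumption about how to handle repeated markers.

```shell
#!/bin/sh
# Hypothetical helper: reads captured harness stdout on stdin and prints the
# path from the last "Results written to: <path>" line (empty if none).
extract_result_path() {
  sed -n 's/^.*Results written to:[[:space:]]*//p' | tail -n 1
}

# Hypothetical helper: fails when the path is empty or the file is missing.
require_result() {
  if [ -z "$1" ] || [ ! -f "$1" ]; then
    echo "Bench run failed to produce results file; check stderr above." >&2
    return 1
  fi
}

# Example with made-up harness output (the path is illustrative only).
stdout_a='Running fixtures...
Results written to: /tmp/bench-results/current.json'
result_a=$(printf '%s\n' "$stdout_a" | extract_result_path)
echo "$result_a"
# prints: /tmp/bench-results/current.json
```

In the real flow, require_result "$result_a" and require_result "$result_b" would run before the compare step.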
Step 3 — Compare in a Task Agent
Spawn a third Task agent to run:
cd "$REPO_DIR" && tools/review-fix-bench.sh --compare "${result_a}" "${result_b}"
Capture the full output (delta table + per-fixture breakdown + verdict line).
On failure: If the compare exits non-zero or produces no output, print error and exit cleanly.
Step 4 — Summary Output
After the compare Task completes, print inline:
## Review-Fix Bench Results
**Version A** (baseline): agents/${agent}.md @ ${git_hash_a}
**Version B** (candidate): ${version_b_source}
**Judge**: ${judge_file}
**Fixtures**: ${fixtures_dir} | Runs per fixture: ${runs}
--- Delta Table ---
[compare output here]
--- Verdict ---
Overall: IMPROVED / REGRESSED / NEUTRAL (on F1)
[If IMPROVED]: Recommendation: adopt candidate — F1 improved without precision regression
[If REGRESSED]: Recommendation: revert or revise candidate — F1 declined
[If NEUTRAL]: No significant difference detected (|ΔF1| < 0.01)
Parse the "Verdict: ..." line from the compare output to determine IMPROVED / REGRESSED / NEUTRAL.
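A sketch of the verdict handling follows. The harness computes the verdict itself, so classify_f1 only illustrates the 0.01 threshold stated above. Both helper names are hypothetical, the exact layout of the "Verdict:" line is assumed, and the delta is assumed to be version B's F1 minus version A's.

```shell
#!/bin/sh
# Hypothetical helper: reads compare output on stdin and prints the verdict
# keyword from the last line containing "Verdict:" (empty if none).
parse_verdict() {
  awk '/Verdict:/ {
         if (match($0, /IMPROVED|REGRESSED|NEUTRAL/)) v = substr($0, RSTART, RLENGTH)
       }
       END { if (v != "") print v }'
}

# Hypothetical helper: classifies two F1 scores with the 0.01 threshold.
# $1 = F1 of version A, $2 = F1 of version B (assumed delta = B - A).
classify_f1() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = b - a
    ad = (d < 0) ? -d : d
    if (ad < 0.01)  print "NEUTRAL"
    else if (d > 0) print "IMPROVED"
    else            print "REGRESSED"
  }'
}

# Example with a made-up compare output line.
printf 'delta table...\nVerdict: REGRESSED\n' | parse_verdict
# prints: REGRESSED
classify_f1 0.80 0.85
# prints: IMPROVED
```

Parsing the keyword rather than the whole line keeps the summary robust if the harness appends extra detail after the verdict.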
Step 5 — Cleanup and Symlink Verification
Clean up temp file if created: rm -f "$tmp_b"
Verify the skill symlink exists:
ls -la ~/.claude/skills/review-fix-bench
If it doesn't exist, create it:
ln -sfn "${REPO_DIR}/skills/review-fix-bench" ~/.claude/skills/review-fix-bench
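The check-then-create logic can be made idempotent with a small helper, sketched here. The name ensure_skill_link is hypothetical, and treating a dangling symlink as missing (so that it gets recreated) is an assumption beyond what the step describes.

```shell
#!/bin/sh
# Hypothetical helper: creates the symlink only when it is absent or dangling.
# $1 = link target (skill directory in the repo), $2 = link location.
ensure_skill_link() {
  target="$1"; link="$2"
  # -L: path is a symlink; -e: the symlink resolves to something that exists.
  if [ -L "$link" ] && [ -e "$link" ]; then
    return 0
  fi
  mkdir -p "$(dirname "$link")" || return 1
  ln -sfn "$target" "$link"
}

# Usage:
#   ensure_skill_link "${REPO_DIR}/skills/review-fix-bench" \
#     "$HOME/.claude/skills/review-fix-bench"
```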