"QA auditor that examines completed pipeline executions, detects regressions, scores lesson effectiveness with a graduated model, and generates risk-tiered meta-improvement proposals through a forge loop."
Install
npx skillscat add anthonyjlee/auditor
Install via the SkillsCat registry.
Announce at start: "I'll run a QA audit on [scope]. Let me gather the execution data and build an audit corpus."
Auditor
Examine completed pipeline executions, detect regressions, score lesson effectiveness with a graduated model, and generate risk-tiered meta-improvement proposals through a forge loop.
Goal
Produce three artifact classes:
- A comprehensive, evidence-grounded audit report covering all slices in scope
- A lesson effectiveness assessment with graduated scoring for every promoted lesson
- Zero or more meta-slice contracts proposing concrete pipeline improvements, risk-tiered and ecosystem-validated
All artifacts are stress-tested through an autonomous scored loop before operator review.
Non-Negotiable Rules
- Use AskUserQuestion for ALL operator interaction -- never ask in plain text
- The operator confirms the audit scope interactively (Phase 1) before any autonomous forging begins (Phase 2)
- Complete each phase in order: Audit Intake --> Audit Forge --> Regression Analysis --> Meta-Improvement --> Ecosystem Fit Review --> Audit Review
- Every experiment is scored and recorded -- never skip logging
- Use subagents for evaluation, brainstorming, and criticism -- never rely on main agent alone
- Never touch the baseline -- always work on the optimized copy
- One mutation per experiment -- never change multiple things at once
- No fixed loop limits -- run until patience is exhausted
- Surface uncertainty honestly -- missing data becomes explicit assumptions, never hidden behind confident prose
- Self-audit recursion depth is capped at 2
- All meta-slice risk tiers auto-route with notification -- no human gates
Entry Point
Determine the audit scope from the operator's input:
- loop-{id} -- Single loop audit (AuditScopeType.LOOP): Full six-phase pipeline scoped to one loop run.
- {repo_id} last {N} -- Repo audit, max 20 slices (AuditScopeType.REPO): Full six-phase pipeline scoped to the repo's recent slices.
- self or pipeline -- Recursive pipeline audit (AuditScopeType.PIPELINE): Full six-phase pipeline scoped to all components. At Phase 4: if recursion_depth < 2, flag meta-slices for self-audit. At depth 2: skip Phase 4 and Phase 5, report findings only.
- effectiveness {repo_id} -- Quick lesson effectiveness check (no forge loop): Phase 1 (intake) then straight to Phase 6 (review). No forge, no regression, no meta-slices.
- standalone -- Standalone repo audit (no DB): Full six-phase pipeline reading from git log, test-results/, and filesystem artifacts instead of the control plane DB. Auto-detected when packages/contracts/audit.py is not importable OR the audit_runs DB table does not exist.
- standalone last {N} -- Standalone audit scoped to last N commits (max 50).
- standalone since {YYYY-MM-DD} -- Standalone audit scoped to commits after the given date.
- standalone plan {path} -- Standalone audit scoped to commits linked to a specific plan file.
Always follow the phases in order for each scope type -- never reorder or skip phases within a path (Rule 3).
Phase 1: Audit Intake
Input: Operator's scope specification
Output: Audit corpus, confirmed scope, working directory with templates
Parse Scope
| Input Pattern | Scope Type | Scope Ref | Max Slices |
|---|---|---|---|
| loop-{id} | LOOP | The loop_id | 1 |
| {repo_id} last {N} | REPO | The repo_id | min(N, 20) |
| self or pipeline | PIPELINE | "pipeline" | All |
| effectiveness {repo_id} | EFFECTIVENESS | The repo_id | All |
Gather Execution Data
Detect mode: Check for ailee control plane by testing whether packages/contracts/audit.py is importable AND the audit_runs DB table exists. If both true: DB mode. Otherwise: standalone mode. Mode is selected ONCE at Phase 1 intake; all subsequent phases operate on the common AuditCorpus model regardless of backend.
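The mode check above can be sketched as follows. This is a minimal illustration, not the auditor's actual implementation: the dotted module name passed in (for example `packages.contracts.audit`) is an assumed mapping of the file path, and the table check is supplied as a callable because the source does not say which database engine backs `audit_runs`.

```python
import importlib.util
from typing import Callable


def module_importable(module_name: str) -> bool:
    """Return True if the module can be located on the import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


def detect_mode(module_name: str, table_exists: Callable[[], bool]) -> str:
    """Pick 'db' only when BOTH checks pass; otherwise fall back to 'standalone'."""
    if module_importable(module_name) and table_exists():
        return "db"
    return "standalone"
```

Calling `detect_mode` once during Phase 1 and storing the result matches the rule that the mode is selected a single time per audit.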
DB mode: Query loop_runs, landing_results, lessons, repair_attempts, facts, and thread_events.
Standalone mode data sources:
- git log --since=<scope_start> --format -> execution timeline (one commit = one event)
- test-results/*.json or test-results/*.xml -> verifier outcomes (pytest JSON / JUnit XML)
- If no test-results/ directory exists -> run verifier_command from plan, capture output
- docs/plans/ -> ideot plan artifact (for plan-vs-landed comparison)
- .auditor/lessons.json -> lesson registry (created/appended by auditor after each run)
- docs/audits/YYYY-MM-DD-*.md -> previous audit reports for regression baseline
Commit-to-slice linking: Commits tagged with [slice:X] in the commit message subject link to plan slices by ID.
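A small sketch of the linking rule, assuming commit subjects have already been read from git log. The function names and the choice to allow any non-space, non-bracket characters in the slice ID are illustrative assumptions; the source only specifies the `[slice:X]` tag shape.

```python
import re
from typing import Dict, List, Optional

# Assumption: a slice ID is any run of characters other than ']' or whitespace.
SLICE_TAG = re.compile(r"\[slice:([^\]\s]+)\]")


def slice_id_from_subject(subject: str) -> Optional[str]:
    """Extract the slice ID from a commit subject, or None if untagged."""
    match = SLICE_TAG.search(subject)
    return match.group(1) if match else None


def link_commits(subjects_by_sha: Dict[str, str]) -> Dict[str, List[str]]:
    """Group commit SHAs by the slice ID tagged in their subject line."""
    linked: Dict[str, List[str]] = {}
    for sha, subject in subjects_by_sha.items():
        slice_id = slice_id_from_subject(subject)
        if slice_id is not None:
            linked.setdefault(slice_id, []).append(sha)
    return linked
```

Untagged commits are simply left out of the mapping, so they can be listed in the missing data inventory rather than silently attributed to a slice.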
Standalone scope parsing:
| Input Pattern | Regex | Scope Ref | Time Range |
|---|---|---|---|
| standalone | ^standalone$ | HEAD | Last 20 commits |
| standalone last N | ^standalone\s+last\s+(\d+)$ | HEAD | Last min(N, 50) commits |
| standalone since <date> | ^standalone\s+since\s+(\d{4}-\d{2}-\d{2})$ | HEAD | Commits after date |
| standalone plan <path> | ^standalone\s+plan\s+(.+)$ | Plan file | Commits with Plan: <path> in message |
If input does not match any pattern, abort with: "Unrecognized standalone scope format. Expected: standalone [last N | since YYYY-MM-DD | plan <path>]"
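The standalone patterns can be applied in order as in the sketch below. The `StandaloneScope` type and its field names are hypothetical; the regexes and the caps (20 default, 50 maximum) come from the table above. The `since` regex only checks the shape of the date, so a real implementation would probably also validate it as a calendar date.

```python
import re
from typing import NamedTuple, Optional


class StandaloneScope(NamedTuple):
    """Hypothetical container for a parsed standalone scope."""
    kind: str
    scope_ref: str
    limit: Optional[int] = None
    since: Optional[str] = None


def parse_standalone_scope(text: str) -> StandaloneScope:
    """Match the input against each standalone pattern; abort if none fits."""
    text = text.strip()
    if re.match(r"^standalone$", text):
        return StandaloneScope("default", "HEAD", limit=20)
    m = re.match(r"^standalone\s+last\s+(\d+)$", text)
    if m:
        # The requested count is capped at 50 commits.
        return StandaloneScope("last", "HEAD", limit=min(int(m.group(1)), 50))
    m = re.match(r"^standalone\s+since\s+(\d{4}-\d{2}-\d{2})$", text)
    if m:
        return StandaloneScope("since", "HEAD", since=m.group(1))
    m = re.match(r"^standalone\s+plan\s+(.+)$", text)
    if m:
        # For plan scopes the scope ref is the plan file itself.
        return StandaloneScope("plan", m.group(1))
    raise ValueError("Unrecognized standalone scope format.")
```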
Build Audit Corpus
Assemble queried data into a structured markdown document:
- Scope metadata (type, ref, time range, query timestamp)
- Slice inventory table (loop_id, status, verifier_status, cleanup_status)
- Failure analysis (per failed slice: blocker_reason, error_summary, classification, repair outcome)
- Lesson inventory table (lesson_id, problem_key, type, confidence, status)
- Recurrence data (per lesson: count of post-promotion slices where same problem_key appeared)
- Timeline reconstruction (chronological event chain)
- Missing data inventory (explicit list of empty/incomplete query results)
Confirm Scope
Present corpus summary via AskUserQuestion:
"Audit corpus built for [scope_type]: [scope_ref]. Time range: [earliest] to [latest]. [N] slices ([M] failed, [K] succeeded). [L] lessons ([P] promoted). [R] repair attempts. Missing data: [list or 'none']. Should I proceed?"
Options: Proceed / Narrow scope / Expand scope
Do not proceed until confirmed.
Setup
- Derive scope_short: LOOP = loop_id first 8 chars, REPO = repo_id, PIPELINE = "pipeline", EFFECTIVENESS = repo_id + "-eff"
- Create working directory: auditor-{scope_short}-YYYY-MM-DD-HH-MM/
- Copy DASHBOARD verbatim to dashboard.html
- Copy RESULTS verbatim to results.tsv
- Create empty CHANGELOG.md
- Write audit-report-baseline.md from corpus using AUDIT REPORT TEMPLATE
- Copy to audit-report-optimized.md (working copy)
- Start HTTP server: cd auditor-{scope_short}-YYYY-MM-DD-HH-MM && python3 -m http.server 7824 &
- Tell operator: "Dashboard live at http://localhost:7824/dashboard.html"
Phase 2: Audit Forge Loop
Input: audit-report-optimized.md (from Phase 1)
Output: Improved audit-report-optimized.md (scored against 7 evaluations)
Skipped for effectiveness scope.
Baseline
- Dispatch evaluator subagent to score initial report against all 7 evaluations
- Record in results.tsv with phase: audit, experiment: 0, status: baseline
- Announce: "Audit baseline: [score]%"
The Loop
Run autonomously. No interaction. No stopping to ask.
Each experiment:
Analyze -- Which evaluations fail most? Read failing sections. Identify patterns across recent failures.
Brainstorm + Critique -- Dispatch two subagents in parallel. Information isolation is critical:
Brainstormer receives: current artifact + structured failure output (what/where/why) + CHANGELOG.md. Does NOT receive critic output.
"Here is an audit report and its failing evaluations with structured diagnostics: [list]. Here is the history of prior mutations: [changelog]. Propose 3 specific mutations. Each must change exactly one thing."
Critic receives: current artifact ONLY. No criteria, no scores, no changelog, no mutation context.
"Review this audit report as a hostile reviewer. No context about history or scoring. Find every flaw, gap, unstated assumption, missing evidence, broken causal chain. Be ruthless."
Synthesize -- Pick ONE best mutation from combined proposals. Prefer mutations addressing both critic findings AND failing evaluations.
Mutate -- Apply single change to audit-report-optimized.md. Never touch baseline.
Evaluate -- Dispatch evaluator subagent with strict isolation. Receives ONLY artifact + criteria. No mutation info, changelog, scores, or history.
"Score the following audit report against these evaluations. For each: status (true/false). For every 'false': what, where, why. Do NOT suggest fixes.
DOCUMENT: [full audit-report-optimized.md]
EVALUATIONS: [7 stable audit evaluations]"
Evaluator returns per evaluation:
- status: boolean
- what: one sentence (required if false)
- where: section or element (required if false)
- why: reason (required if false)
Decide + Log -- ONE atomic step.
Decide:
- Score improved --> KEEP. Reset patience.
- Score unchanged or worse --> DISCARD. Restore from last kept state. Decrement patience.
IMMEDIATELY log:
- results.tsv: one row per evaluation for THIS experiment
- CHANGELOG.md: one entry for THIS experiment (see CHANGELOG format below)
Never batch multiple experiments into one log entry.
Repeat.
Patience
- Default: 10 consecutive no-improvement experiments
- Resets to 10 on every kept experiment
- At 0: stop, proceed to Phase 3
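The keep/discard decision and the patience counter can be sketched as a plain loop. Here `run_experiment` is a hypothetical callable standing in for one full analyze, mutate, and evaluate cycle that returns the new score; the 100% confirmation run is deliberately left out to keep the sketch short.

```python
from typing import Callable, List, Tuple


def run_forge_loop(
    baseline_score: float,
    run_experiment: Callable[[int], float],
    patience_default: int = 10,
) -> Tuple[float, List[Tuple[int, float, str]]]:
    """Run experiments until patience reaches 0; return best score and a log."""
    best = baseline_score
    patience = patience_default
    log: List[Tuple[int, float, str]] = []
    experiment = 0
    while patience > 0:
        experiment += 1
        score = run_experiment(experiment)
        if score > best:
            # KEEP: strictly better score resets patience.
            best = score
            patience = patience_default
            status = "keep"
        else:
            # DISCARD: unchanged or worse score costs one unit of patience.
            patience -= 1
            status = "discard"
        # Decide and log in the same step, one entry per experiment.
        log.append((experiment, score, status))
    return best, log
```

Because the log entry is appended inside the same iteration as the decision, experiments cannot be batched into a single entry.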
Confirmation at 100%
Do NOT immediately stop at 100%. Run a confirmation evaluation (fresh subagent, same criteria, no mutation). Log as separate experiment with "confirmation run". If confirmation also 100%: done. If below: continue loop, do NOT reset patience.
When patience < 5 (stuck):
- Re-read all failures and full CHANGELOG
- Try opposites of previous failed mutations
- Try combining near-miss mutations
- Try simplification
- Dispatch fresh brainstormer with complete CHANGELOG
Phase 3: Regression Analysis
Input: audit-report-optimized.md (from Phase 2) + historical audit data
Output: Annotated audit-report-optimized.md with regression section
NOT a forge loop -- deterministic comparison. No mutations, no patience.
Query and Compare
Standalone mode: Read docs/audits/ directory for prior reports matching the scope. Parse finding counts (critical/warning/info) from markdown headers. Parse effectiveness scores from the Lesson Effectiveness table. No .auditor/audit-cache/ -- the report IS the record.
DB mode: Query audit_runs for previous records matching same scope_type + scope_ref (last 10 runs).
If previous reports exist:
- Finding count trends -- Compare critical/warning/info counts. Calculate direction: increasing (degrading), decreasing (improving), stable (within +/- 1).
- Lesson effectiveness trends -- Compare graduated scores per lesson across audits. Flag declining (dropped > 0.1), improving (rose > 0.1), consistently low (< 0.3 across 3+ audits).
- Recurring failure patterns -- Same problem_key across multiple audits. Count recurrences. Flag problem_keys recurring despite promoted lessons.
- Trend summary -- Improving (criticals down AND effectiveness up), Degrading (criticals up OR effectiveness down), Stable, or Insufficient data (< 2 prior audits).
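The comparison rules above are simple enough to express directly. The sketch below assumes scores are ordered oldest to newest and that the declining/improving flags compare the two most recent audits; the source does not say which pair of audits is compared, nor how the overall deltas are aggregated, so both are assumptions.

```python
from typing import List


def count_direction(previous: int, current: int) -> str:
    """Classify a finding-count change; +/- 1 counts as stable."""
    delta = current - previous
    if abs(delta) <= 1:
        return "stable"
    return "increasing" if delta > 0 else "decreasing"


def lesson_flags(scores: List[float]) -> List[str]:
    """Flag a lesson from its graduated scores across audits (oldest first)."""
    flags: List[str] = []
    if len(scores) >= 2:
        change = scores[-1] - scores[-2]
        if change < -0.1:
            flags.append("declining")
        elif change > 0.1:
            flags.append("improving")
    if len(scores) >= 3 and all(score < 0.3 for score in scores):
        flags.append("consistently low")
    return flags


def trend_summary(prior_audits: int, criticals_delta: int,
                  effectiveness_delta: float) -> str:
    """Combine critical-count and effectiveness changes into one trend label."""
    if prior_audits < 2:
        return "insufficient data"
    if criticals_delta > 0 or effectiveness_delta < 0:
        return "degrading"
    if criticals_delta < 0 and effectiveness_delta > 0:
        return "improving"
    return "stable"
```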
Annotate Report
Add "Regression Analysis" section to audit-report-optimized.md:
- Finding count comparison table
- Lesson effectiveness trend table
- Recurring failure patterns
- Overall trend with justification
- If no history: "First audit for this scope -- baseline established."
Phase 4: Meta-Improvement
Input: audit-report-optimized.md (with regression annotations)
Output: Zero or more meta-slice-{N}.json files
Skipped for effectiveness scope or recursion_depth = 2.
Generate Meta-Slices
Review all critical/warning findings. For each systemic issue (affects multiple slices or caused by repeatable pipeline behavior), generate a meta-slice contract:
{
"meta_slice_id": "uuid-v4",
"audit_run_id": "parent audit uuid",
"target": "ideot|ralph|auditor|hailee",
"target_component": "specific file or evaluation",
"hypothesis": "why this change will improve the pipeline",
"proposed_change": "concrete description",
"evidence": ["finding_id_1", "finding_id_2"],
"verifier_command": "how to verify",
"risk_level": "low|medium|high",
"ecosystem_fit_notes": ""
}
Requirements: target_component must be a real path (validate with grep/glob). hypothesis must be falsifiable. proposed_change must be implementable without clarifying questions. ecosystem_fit_notes left empty for Phase 5.
Save as meta-slice-{N}.json, numbered from 1. If scope is "self"/"pipeline" and recursion_depth < 2: flag HIGH risk slices for Phase 5 self-audit.
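Before saving, the structural parts of a contract can be checked mechanically. This is a hedged sketch: the function name is hypothetical, requiring at least one evidence entry is an assumption, and the path check for target_component and the falsifiability of the hypothesis are not covered here.

```python
from typing import Dict, List

REQUIRED_FIELDS = {
    "meta_slice_id", "audit_run_id", "target", "target_component",
    "hypothesis", "proposed_change", "evidence", "verifier_command",
    "risk_level", "ecosystem_fit_notes",
}
TARGETS = {"ideot", "ralph", "auditor", "hailee"}
RISK_LEVELS = {"low", "medium", "high"}


def validate_meta_slice(contract: Dict[str, object]) -> List[str]:
    """Return a list of structural problems; an empty list means it passed."""
    missing = sorted(REQUIRED_FIELDS - set(contract))
    if missing:
        return [f"missing field: {name}" for name in missing]
    problems: List[str] = []
    if contract["target"] not in TARGETS:
        problems.append("target must be one of ideot, ralph, auditor, hailee")
    if contract["risk_level"] not in RISK_LEVELS:
        problems.append("risk_level must be low, medium, or high")
    evidence = contract["evidence"]
    # Assumption: a proposal with no supporting finding is not evidence-grounded.
    if not isinstance(evidence, list) or not evidence:
        problems.append("evidence must list at least one finding id")
    if contract["ecosystem_fit_notes"] != "":
        problems.append("ecosystem_fit_notes must stay empty until Phase 5")
    return problems
```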
Phase 5: Ecosystem Fit Review
Input: meta-slice-{N}.json files
Output: Updated meta-slices with ecosystem_fit_notes and validated risk_level
Skipped when no meta-slices generated, scope is effectiveness, or recursion_depth = 2.
Validate Each Meta-Slice
Three checks per meta-slice:
- Target existence -- grep/glob for target_component. If not found: mark invalid.
- Operator override conflicts -- Query lessons where lesson_type='operator_override'. Flag conflicts, elevate risk to at least MEDIUM.
- Risk assessment:
| Risk | Criteria | Examples |
|---|---|---|
| LOW | Thresholds, documentation, info-severity | Timeout values, comments, log levels |
| MEDIUM | New evaluations, contract fields, warning-severity | Adding evaluations, schema changes |
| HIGH | Lesson promotion logic, verifier behavior, state transitions, critical-severity | Promotion rules, pass/fail semantics, state machines |
QA Gates
| Risk | Gate |
|---|---|
| LOW | Ecosystem fitness pass |
| MEDIUM | Fitness pass + snapshot regression baseline |
| HIGH | Fitness pass + regression baseline + pre-route self-audit (if recursion_depth < 2) |
The HIGH self-audit examines: component dependencies, unintended side effects, verifier_command sufficiency.
Update each meta-slice-{N}.json with results.
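The risk elevation rule and the gate table can be sketched together. Function names and the gate strings are illustrative; the logic follows the checks above: an operator override conflict raises risk to at least MEDIUM, and the HIGH self-audit only applies while recursion_depth < 2.

```python
from typing import List

RISK_ORDER = ["LOW", "MEDIUM", "HIGH"]


def validated_risk(proposed: str, has_override_conflict: bool) -> str:
    """Elevate risk to at least MEDIUM when an operator override conflicts."""
    level = proposed.upper()
    if level not in RISK_ORDER:
        raise ValueError(f"unknown risk level: {proposed}")
    if has_override_conflict and RISK_ORDER.index(level) < RISK_ORDER.index("MEDIUM"):
        return "MEDIUM"
    return level


def gates_for(risk: str, recursion_depth: int) -> List[str]:
    """List the QA gates a meta-slice must clear for its risk tier."""
    gates = ["ecosystem fitness pass"]
    if risk in ("MEDIUM", "HIGH"):
        gates.append("regression baseline snapshot")
    if risk == "HIGH" and recursion_depth < 2:
        gates.append("pre-route self-audit")
    return gates
```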
Phase 6: Audit Review
Input: All artifacts from Phases 1-5
Output: Operator decision: accept / deeper audit / dismiss
Present via AskUserQuestion:
- Executive summary -- one paragraph
- Score improvement -- "Baseline [X]% --> Final [Y]% over [N] experiments" (skip for effectiveness scope)
- Top 3 mutations that helped most (skip for effectiveness scope)
- Remaining failures with what/where/why (skip for effectiveness scope)
- Finding count by severity with trend direction
- Lesson effectiveness summary -- table with best/worst highlighted
- Meta-slices generated -- count, targets, risk levels (skip for effectiveness scope)
- Regression trend -- one line
Options:
Accept:
- Save report to docs/audits/YYYY-MM-DD-{scope_short}.md
- Write DB records: audit_runs (audit_run_id, scope_type, scope_ref, created_at, detail_json with finding_counts, effectiveness_summary, recursion_depth, scores, meta_slice_count, regression_trend), audit_findings (finding_id, audit_run_id, severity, description, evidence, slice_id), lesson_effectiveness (lesson_id, audit_run_id, effectiveness_score, base_score, time_decay, opportunity_factor, confidence_weight)
- Auto-route valid meta-slices: LOW = standard notification, MEDIUM = elevated + baseline, HIGH = urgent + baseline + self-audit results
- Announce: "Audit accepted. Report saved. [N] meta-slices routed ([L] low, [M] medium, [H] high)."
Standalone mode persist (no DB writes in standalone mode):
- Save report to docs/audits/YYYY-MM-DD-{scope_short}.md (same as DB mode)
- Append new lessons to .auditor/lessons.json (create file with {"schema_version":1,"lessons":[]} if it doesn't exist)
- Meta-slices saved as meta-slice-{N}.json in working directory (same as DB mode)
- No DB writes -- all persistence is filesystem-based
Deeper audit: Re-enter Phase 2 with patience 5 focusing on remaining failures. Skip Phases 3-5 on second pass.
Dismiss: Discard all. No DB records, no routing.
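For the standalone persist path above, appending to the lesson registry might look like the sketch below. The function name is hypothetical, and the shape of an individual lesson entry is left to the caller because the source only specifies the empty registry skeleton.

```python
import json
from pathlib import Path
from typing import Dict, List


def append_lessons(registry_path: Path, new_lessons: List[Dict]) -> Dict:
    """Append lessons to the registry file, creating it if it is missing."""
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    else:
        # Skeleton taken from the standalone persist rule.
        registry = {"schema_version": 1, "lessons": []}
    registry["lessons"].extend(new_lessons)
    registry_path.parent.mkdir(parents=True, exist_ok=True)
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return registry
```

A call such as `append_lessons(Path(".auditor/lessons.json"), lessons)` would run only after the operator chooses Accept, so a dismissed audit leaves the registry untouched.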
7 Stable Audit Evaluations
EVALUATION 1: Coverage
Question: Does the audit address every completed slice in scope with zero gaps?
Pass: Every loop_id in scope has at least one finding or explicit "no issues found" notation
Fail: Any slice is not mentioned or analyzed
EVALUATION 2: Evidence-Grounded
Question: Does every finding cite specific verifier outputs, lesson IDs, or git refs?
Pass: Every finding contains at least one concrete reference (ID, SHA, or output excerpt)
Fail: Any finding relies on general statements without specific evidence
EVALUATION 3: Causal Accuracy
Question: Does the failure-to-lesson-to-action chain accurately reflect what happened?
Pass: Every causal chain can be verified against the corpus data
Fail: Any chain misattributes causation, reverses temporal ordering, or cites nonexistent records
EVALUATION 4: Regression Sensitivity
Question: Are degradation patterns identified that span multiple slices?
Pass: Multi-slice trends are analyzed and patterns (positive or negative) are called out
Fail: Each slice is analyzed in isolation without cross-slice trend comparison
EVALUATION 5: Lesson Effectiveness
Question: For each promoted lesson, is there evidence it prevented or failed to prevent recurrence?
Pass: Every lesson in scope has a measured effectiveness assessment with the graduated formula
Fail: Any lesson lacks a recurrence check or effectiveness judgment
EVALUATION 6: Actionability
Question: Does every recommendation have a concrete next step?
Pass: Every recommendation specifies either a meta-slice, operator override, or investigation path with enough detail to act on
Fail: Any recommendation is vague ("should improve") without a concrete action
EVALUATION 7: Self-Consistency
Question: Do the audit's own recommendations not contradict each other or existing operator overrides?
Pass: No recommendation conflicts with another recommendation or a known operator decision
Fail: Any pair of recommendations is contradictory, or any recommendation overrides an explicit operator decision without flagging this
Graduated Lesson Effectiveness Formula
effectiveness_score = base_score * time_decay * opportunity_factor * confidence_weight
base_score (0.0 or 1.0): 1.0 if problem_key has NOT recurred after lesson promotion, 0.0 if it has.
time_decay (0.1 - 1.0): max(0.1, 1.0 - (days_since_lesson / 180)). Today = 1.0, 90 days = 0.5, 180+ days = 0.1 floor.
opportunity_factor (0.0 - 1.0): min(1.0, opportunities_for_recurrence / 3). Opportunities = post-promotion slices where the problem_key could have occurred (same repo, same verifier type). 0 opportunities = 0.0, 3+ = 1.0.
confidence_weight (0.0 - 1.0): The lesson's confidence field directly.
| Score | Label | Interpretation |
|---|---|---|
| 0.8 - 1.0 | Highly effective | Working as intended |
| 0.5 - 0.79 | Moderately effective | May need refinement |
| 0.2 - 0.49 | Weakly effective | Not reliably preventing recurrence |
| 0.0 - 0.19 | Ineffective | Review for retirement or revision |
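The formula and label table translate directly into code. In this sketch the function names are illustrative, and scores that fall between table rows (for example 0.795) are assigned by lower bound, which is an assumption since the table leaves those gaps undefined.

```python
def time_decay(days_since_lesson: float) -> float:
    """Linear decay over 180 days with a floor of 0.1."""
    return max(0.1, 1.0 - (days_since_lesson / 180))


def opportunity_factor(opportunities_for_recurrence: int) -> float:
    """Scale up to 1.0 once there have been 3 or more opportunities."""
    return min(1.0, opportunities_for_recurrence / 3)


def effectiveness_score(recurred: bool, days_since_lesson: float,
                        opportunities: int, confidence: float) -> float:
    """Multiply the four factors of the graduated model."""
    base_score = 0.0 if recurred else 1.0
    return (base_score * time_decay(days_since_lesson)
            * opportunity_factor(opportunities) * confidence)


def effectiveness_label(score: float) -> str:
    """Map a score to its label, using each row's lower bound as threshold."""
    if score >= 0.8:
        return "Highly effective"
    if score >= 0.5:
        return "Moderately effective"
    if score >= 0.2:
        return "Weakly effective"
    return "Ineffective"
```

As a worked case, a lesson promoted 90 days ago with no recurrence, 3 opportunities, and confidence 0.8 scores 1.0 * 0.5 * 1.0 * 0.8 = 0.4, which lands in the "Weakly effective" band even though the problem never recurred.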
Meta-Slice Risk Tiers and Routing
All tiers auto-route with notification. The auditor + self-heal loop ARE the safety net -- no human gates.
| Risk | Notification | QA Gate |
|---|---|---|
| LOW | Standard | Ecosystem fitness pass |
| MEDIUM | Elevated | Ecosystem fitness pass + regression baseline snapshot |
| HIGH | Urgent | Ecosystem fitness pass + regression baseline + pre-route self-audit |
Recursion Rules
- Max depth: 2 (audit --> audit-of-audit --> stop)
- Scope "self" or "pipeline" triggers recursive self-audit at depth+1
- Depth tracked in audit_runs.detail_json as "recursion_depth": N
- At depth 2: skip Phase 4 and Phase 5 -- report findings only
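Combining the recursion cap with the per-scope paths described earlier, phase selection can be sketched as one small function. The name `phases_to_run` and the use of plain strings for scope types are assumptions for illustration.

```python
from typing import List

MAX_RECURSION_DEPTH = 2


def phases_to_run(scope_type: str, recursion_depth: int = 0) -> List[int]:
    """Return the ordered phase numbers for a scope at a given depth."""
    if recursion_depth > MAX_RECURSION_DEPTH:
        raise ValueError("self-audit recursion depth is capped at 2")
    if scope_type == "EFFECTIVENESS":
        # Quick check: intake, then straight to review.
        return [1, 6]
    phases = [1, 2, 3, 4, 5, 6]
    if recursion_depth == MAX_RECURSION_DEPTH:
        # At the cap, report findings only: no meta-slices, no fit review.
        phases = [p for p in phases if p not in (4, 5)]
    return phases
```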
Subagent Information Isolation
| Role | Receives | Does NOT receive |
|---|---|---|
| Evaluator | Audit report + evaluation criteria only | Mutation info, changelog, scores, reasoning, experiment history |
| Brainstormer | Audit report + failing evaluations (what/where/why) + CHANGELOG | Critic output, evaluator reasoning |
| Critic | Audit report ONLY | Evaluation criteria, scores, failures, changelog, mutation context |
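One way to enforce the table above is an allow-list applied before any subagent is dispatched. The key names (`report`, `criteria`, `failures`, `changelog`) are hypothetical labels for the inputs in the table, not identifiers from the auditor itself.

```python
from typing import Dict

# Allow-list per role; anything not named here is withheld.
ALLOWED_INPUTS = {
    "evaluator": {"report", "criteria"},
    "brainstormer": {"report", "failures", "changelog"},
    "critic": {"report"},
}


def build_payload(role: str, context: Dict[str, str]) -> Dict[str, str]:
    """Keep only the context entries this role is allowed to see."""
    allowed = ALLOWED_INPUTS[role]
    return {key: value for key, value in context.items() if key in allowed}
```

Filtering by allow-list rather than deny-list means a newly added context field stays hidden from every role until it is explicitly granted.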
Evaluator Response Format
Per evaluation:
evaluation_name: [name]
status: true|false
what: [one sentence] (required if false, omit if true)
where: [section or element] (required if false, omit if true)
why: [reason] (required if false, omit if true)
CHANGELOG Entry Format
## Experiment [N] -- [keep/discard] (Phase: Forge)
**Score:** [X]/7 ([percent]%)
**Mutation:** [What was changed]
**Source:** [Brainstormer proposal #N / Critic finding / Both]
**Reasoning:** [Why this was expected to help]
**Result:** [Which evaluations improved/declined/unchanged]
**Failures:**
- [evaluation_name]: what=[what] where=[where] why=[why]
Working Directory Naming
auditor-{scope_short}-YYYY-MM-DD-HH-MM/
| Scope Type | scope_short | Example |
|---|---|---|
| LOOP | loop_id first 8 chars | auditor-a1b2c3d4-2026-03-30-14-22/ |
| REPO | repo_id | auditor-my-repo-2026-03-30-14-22/ |
| PIPELINE | "pipeline" | auditor-pipeline-2026-03-30-14-22/ |
| EFFECTIVENESS | repo_id + "-eff" | auditor-my-repo-eff-2026-03-30-14-22/ |
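The naming table can be reproduced with a short helper. Function names are illustrative; the timestamp is passed in explicitly so the result is deterministic.

```python
from datetime import datetime


def scope_short(scope_type: str, scope_ref: str) -> str:
    """Derive the short scope label used in the working directory name."""
    if scope_type == "LOOP":
        return scope_ref[:8]
    if scope_type == "REPO":
        return scope_ref
    if scope_type == "PIPELINE":
        return "pipeline"
    if scope_type == "EFFECTIVENESS":
        return f"{scope_ref}-eff"
    raise ValueError(f"unknown scope type: {scope_type}")


def working_dir_name(scope_type: str, scope_ref: str, now: datetime) -> str:
    """Build auditor-{scope_short}-YYYY-MM-DD-HH-MM/ for the given time."""
    return f"auditor-{scope_short(scope_type, scope_ref)}-{now:%Y-%m-%d-%H-%M}/"
```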
Output Files
auditor-{scope_short}-YYYY-MM-DD-HH-MM/
+-- dashboard.html # Live dashboard (auto-refreshes)
+-- results.tsv # Score log (one row per evaluation per experiment)
+-- CHANGELOG.md # Detailed mutation log
+-- audit-report-baseline.md # Initial report (never modified)
+-- audit-report-optimized.md # Working copy during Phase 2
+-- meta-slice-1.json # Meta-slice contracts (if generated)
+-- meta-slice-2.json
+-- ...
Final accepted report is saved to docs/audits/YYYY-MM-DD-{scope_short}.md.