auditor

"QA auditor that examines completed pipeline executions, detects regressions, scores lesson effectiveness with a graduated model, and generates risk-tiered meta-improvement proposals through a forge loop."

anthonyjlee 0 Updated 2mo ago

Install

npx skillscat add anthonyjlee/auditor

Install via the SkillsCat registry.

SKILL.md

Announce at start: "I'll run a QA audit on [scope]. Let me gather the execution data and build an audit corpus."

Auditor

Examine completed pipeline executions, detect regressions, score lesson effectiveness with a graduated model, and generate risk-tiered meta-improvement proposals through a forge loop.

Goal

Produce three artifact classes:

A comprehensive, evidence-grounded audit report covering all slices in scope
A lesson effectiveness assessment with graduated scoring for every promoted lesson
Zero or more meta-slice contracts proposing concrete pipeline improvements, risk-tiered and ecosystem-validated

All artifacts are stress-tested through an autonomous scored loop before operator review.

Non-Negotiable Rules

1. Use AskUserQuestion for ALL operator interaction -- never ask in plain text
2. The operator confirms the audit scope interactively (Phase 1) before any autonomous forging begins (Phase 2)
3. Complete each phase in order: Audit Intake --> Audit Forge --> Regression Analysis --> Meta-Improvement --> Ecosystem Fit Review --> Audit Review
4. Every experiment is scored and recorded -- never skip logging
5. Use subagents for evaluation, brainstorming, and criticism -- never rely on main agent alone
6. Never touch the baseline -- always work on the optimized copy
7. One mutation per experiment -- never change multiple things at once
8. No fixed loop limits -- run until patience is exhausted
9. Surface uncertainty honestly -- missing data becomes explicit assumptions, never hidden behind confident prose
10. Self-audit recursion depth is capped at 2
11. All meta-slice risk tiers auto-route with notification -- no human gates

Entry Point

Determine the audit scope from the operator's input:

loop-{id} -- Single loop audit (AuditScopeType.LOOP): Full six-phase pipeline scoped to one loop run.
{repo_id} last {N} -- Repo audit, max 20 slices (AuditScopeType.REPO): Full six-phase pipeline scoped to the repo's recent slices.
self or pipeline -- Recursive pipeline audit (AuditScopeType.PIPELINE): Full six-phase pipeline scoped to all components. At Phase 4: if recursion_depth < 2, flag meta-slices for self-audit. At depth 2: skip Phase 4 and Phase 5, report findings only.
effectiveness {repo_id} -- Quick lesson effectiveness check (no forge loop): Phase 1 (intake) then straight to Phase 6 (review). No forge, no regression, no meta-slices.
standalone -- Standalone repo audit (no DB): Full six-phase pipeline reading from git log, test-results/, and filesystem artifacts instead of the control plane DB. Auto-detected when packages/contracts/audit.py is not importable OR the audit_runs DB table does not exist.
standalone last {N} -- Standalone audit scoped to last N commits (max 50).
standalone since {YYYY-MM-DD} -- Standalone audit scoped to commits after the given date.
standalone plan {path} -- Standalone audit scoped to commits linked to a specific plan file.

Always follow the phases in order for each scope type -- never reorder or skip phases within a path (Rule 3).

Phase 1: Audit Intake

Input: Operator's scope specification
Output: Audit corpus, confirmed scope, working directory with templates

Parse Scope

Input Pattern	Scope Type	Scope Ref	Max Slices
`loop-{id}`	LOOP	The loop_id	1
`{repo_id} last {N}`	REPO	The repo_id	min(N, 20)
`self` or `pipeline`	PIPELINE	"pipeline"	All
`effectiveness {repo_id}`	EFFECTIVENESS	The repo_id	All
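The scope table above can be sketched as a small parser. This is a minimal illustration, not the skill's own implementation: the `Scope` tuple and `parse_scope` function are hypothetical names, the loop scope ref is assumed to be the part after `loop-`, and standalone inputs are assumed to be handled by a separate parser.

```python
import re
from typing import NamedTuple, Optional


class Scope(NamedTuple):
    scope_type: str            # LOOP, REPO, PIPELINE, or EFFECTIVENESS
    scope_ref: str
    max_slices: Optional[int]  # None stands for "All"


def parse_scope(text: str) -> Scope:
    """Map operator input to a scope, following the Parse Scope table."""
    text = text.strip()
    if text.startswith("standalone"):
        # Assumption: standalone inputs are parsed elsewhere.
        raise ValueError("standalone scopes are parsed separately")
    if text in ("self", "pipeline"):
        return Scope("PIPELINE", "pipeline", None)
    m = re.fullmatch(r"effectiveness\s+(\S+)", text)
    if m:
        return Scope("EFFECTIVENESS", m.group(1), None)
    m = re.fullmatch(r"loop-(\S+)", text)
    if m:
        return Scope("LOOP", m.group(1), 1)
    m = re.fullmatch(r"(\S+)\s+last\s+(\d+)", text)
    if m:
        # Repo audits are capped at 20 slices.
        return Scope("REPO", m.group(1), min(int(m.group(2)), 20))
    raise ValueError(f"unrecognized scope: {text!r}")
```

For example, `parse_scope("my-repo last 50")` yields a REPO scope capped at 20 slices.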

Gather Execution Data

Detect mode: Check for ailee control plane by testing whether packages/contracts/audit.py is importable AND the audit_runs DB table exists. If both true: DB mode. Otherwise: standalone mode. Mode is selected ONCE at Phase 1 intake; all subsequent phases operate on the common AuditCorpus model regardless of backend.
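A minimal sketch of the mode check, under loud assumptions: the control plane database is taken to be SQLite (the source does not say which engine is used), the module path `packages.contracts.audit` is derived from the file path above, and the function names are hypothetical. `importlib.util.find_spec` only checks that the module can be located, which is used here as a stand-in for "importable".

```python
import importlib.util
import sqlite3
from typing import Optional


def contracts_importable(module: str = "packages.contracts.audit") -> bool:
    """True if the contracts module can be located on the import path."""
    try:
        return importlib.util.find_spec(module) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


def table_exists(conn: sqlite3.Connection, table: str = "audit_runs") -> bool:
    """True if the table exists. Assumption: the DB is SQLite."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def detect_mode(conn: Optional[sqlite3.Connection]) -> str:
    """Return 'db' only when BOTH checks pass; otherwise 'standalone'."""
    if conn is not None and contracts_importable() and table_exists(conn):
        return "db"
    return "standalone"
```

The result would be computed once at intake and passed along, so later phases never re-detect the backend.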

DB mode: Query loop_runs, landing_results, lessons, repair_attempts, facts, and thread_events.

Standalone mode data sources:

git log --since=<scope_start> --format -> execution timeline (one commit = one event)
test-results/*.json or test-results/*.xml -> verifier outcomes (pytest JSON / JUnit XML)
If no test-results/ directory exists -> run verifier_command from plan, capture output
docs/plans/ -> ideot plan artifact (for plan-vs-landed comparison)
.auditor/lessons.json -> lesson registry (created/appended by auditor after each run)
docs/audits/YYYY-MM-DD-*.md -> previous audit reports for regression baseline

Commit-to-slice linking: Commits tagged with [slice:X] in the commit message subject link to plan slices by ID.
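As an illustration, the tag lookup could be done with a regular expression over commit subjects. The helper names and the `(sha, subject)` input shape are assumptions made for this sketch; slice IDs are assumed to contain no whitespace or closing bracket.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches tags such as [slice:3] or [slice:auth-fix] in a commit subject.
SLICE_TAG = re.compile(r"\[slice:([^\]\s]+)\]")


def slice_ids(subject: str) -> List[str]:
    """Return every slice ID tagged in a commit subject, in order."""
    return SLICE_TAG.findall(subject)


def link_commits(commits: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group commit SHAs by slice ID. `commits` holds (sha, subject) pairs."""
    links: Dict[str, List[str]] = {}
    for sha, subject in commits:
        for slice_id in slice_ids(subject):
            links.setdefault(slice_id, []).append(sha)
    return links
```

Commits without a tag simply do not appear in the result, which makes them easy to list in the missing data inventory.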

Standalone scope parsing:

Input Pattern	Regex	Scope Ref	Time Range
`standalone`	`^standalone$`	HEAD	Last 20 commits
`standalone last N`	`^standalone\s+last\s+(\d+)$`	HEAD	Last min(N, 50) commits
`standalone since <date>`	`^standalone\s+since\s+(\d{4}-\d{2}-\d{2})$`	HEAD	Commits after date
`standalone plan <path>`	`^standalone\s+plan\s+(.+)$`	Plan file	Commits with `Plan: <path>` in message

If input does not match any pattern, abort with: "Unrecognized standalone scope format. Expected: standalone [last N | since YYYY-MM-DD | plan <path>]"
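The standalone patterns can be applied directly with Python's `re` module. The sketch below reuses the regexes from the table; the returned dictionary shape and the function name are hypothetical.

```python
import re
from typing import Dict, Union

LAST_RE = re.compile(r"^standalone\s+last\s+(\d+)$")
SINCE_RE = re.compile(r"^standalone\s+since\s+(\d{4}-\d{2}-\d{2})$")
PLAN_RE = re.compile(r"^standalone\s+plan\s+(.+)$")


def parse_standalone(text: str) -> Dict[str, Union[str, int]]:
    """Parse a standalone scope string; raise ValueError if nothing matches."""
    text = text.strip()
    if re.match(r"^standalone$", text):
        return {"kind": "last", "commits": 20}          # default window
    m = LAST_RE.match(text)
    if m:
        return {"kind": "last", "commits": min(int(m.group(1)), 50)}
    m = SINCE_RE.match(text)
    if m:
        return {"kind": "since", "date": m.group(1)}
    m = PLAN_RE.match(text)
    if m:
        return {"kind": "plan", "path": m.group(1)}
    raise ValueError("Unrecognized standalone scope format")
```

Note that the `since` regex checks only the shape of the date; validating that it is a real calendar date would be a separate step.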

Build Audit Corpus

Assemble queried data into a structured markdown document:

Scope metadata (type, ref, time range, query timestamp)
Slice inventory table (loop_id, status, verifier_status, cleanup_status)
Failure analysis (per failed slice: blocker_reason, error_summary, classification, repair outcome)
Lesson inventory table (lesson_id, problem_key, type, confidence, status)
Recurrence data (per lesson: count of post-promotion slices where same problem_key appeared)
Timeline reconstruction (chronological event chain)
Missing data inventory (explicit list of empty/incomplete query results)

Confirm Scope

Present corpus summary via AskUserQuestion:

"Audit corpus built for [scope_type]: [scope_ref]. Time range: [earliest] to [latest]. [N] slices ([M] failed, [K] succeeded). [L] lessons ([P] promoted). [R] repair attempts. Missing data: [list or 'none']. Should I proceed?"

Options: Proceed / Narrow scope / Expand scope

Do not proceed until confirmed.

Setup

Derive scope_short: LOOP = loop_id first 8 chars, REPO = repo_id, PIPELINE = "pipeline", EFFECTIVENESS = repo_id + "-eff"
Create working directory: auditor-{scope_short}-YYYY-MM-DD-HH-MM/
Copy DASHBOARD verbatim to dashboard.html
Copy RESULTS verbatim to results.tsv
Create empty CHANGELOG.md
Write audit-report-baseline.md from corpus using AUDIT REPORT TEMPLATE
Copy to audit-report-optimized.md (working copy)
Start HTTP server:
```
cd auditor-{scope_short}-YYYY-MM-DD-HH-MM && python3 -m http.server 7824 &
```
Tell operator: "Dashboard live at http://localhost:7824/dashboard.html"

Phase 2: Audit Forge Loop

Input: audit-report-optimized.md (from Phase 1)
Output: Improved audit-report-optimized.md (scored against 7 evaluations)

Skipped for effectiveness scope.

Baseline

Dispatch evaluator subagent to score initial report against all 7 evaluations
Record in results.tsv with phase: audit, experiment: 0, status: baseline
Announce: "Audit baseline: [score]%"

The Loop

Run autonomously. No interaction. No stopping to ask.

Each experiment:

Analyze -- Which evaluations fail most? Read failing sections. Identify patterns across recent failures.
Brainstorm + Critique -- Dispatch two subagents in parallel. Information isolation is critical:
- Brainstormer receives: current artifact + structured failure output (what/where/why) + CHANGELOG.md. Does NOT receive critic output.
  
  "Here is an audit report and its failing evaluations with structured diagnostics: [list]. Here is the history of prior mutations: [changelog]. Propose 3 specific mutations. Each must change exactly one thing."
- Critic receives: current artifact ONLY. No criteria, no scores, no changelog, no mutation context.
  
  "Review this audit report as a hostile reviewer. No context about history or scoring. Find every flaw, gap, unstated assumption, missing evidence, broken causal chain. Be ruthless."
Synthesize -- Pick ONE best mutation from combined proposals. Prefer mutations addressing both critic findings AND failing evaluations.
Mutate -- Apply single change to audit-report-optimized.md. Never touch baseline.
Evaluate -- Dispatch evaluator subagent with strict isolation. Receives ONLY artifact + criteria. No mutation info, changelog, scores, or history.

"Score the following audit report against these evaluations. For each: status (true/false). For every 'false': what, where, why. Do NOT suggest fixes.

DOCUMENT: [full audit-report-optimized.md]
EVALUATIONS: [7 stable audit evaluations]"

Evaluator returns per evaluation:
- status: boolean
- what: one sentence (required if false)
- where: section or element (required if false)
- why: reason (required if false)
Decide + Log -- ONE atomic step.

Decide:
- Score improved --> KEEP. Reset patience.
- Score unchanged or worse --> DISCARD. Restore from last kept state. Decrement patience.
IMMEDIATELY log:
- results.tsv: one row per evaluation for THIS experiment
- CHANGELOG.md: one entry for THIS experiment (see CHANGELOG format below)
Never batch multiple experiments into one log entry.
Repeat.

Patience

Default: 10 consecutive no-improvement experiments
Resets to 10 on every kept experiment
At 0: stop, proceed to Phase 3
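The keep/discard and patience bookkeeping can be sketched by replaying a list of experiment scores. This is only an illustration: `replay_patience` is a hypothetical helper, real experiments would dispatch subagents rather than read scores from a list, and the confirmation run at 100% is left out.

```python
from typing import List, Tuple


def replay_patience(
    scores: List[float], baseline: float, default_patience: int = 10
) -> Tuple[float, List[str], int]:
    """Return (best score, per-experiment decisions, patience remaining)."""
    best = baseline
    patience = default_patience
    decisions: List[str] = []
    for score in scores:
        if patience == 0:
            break                        # patience exhausted: stop the loop
        if score > best:
            best = score
            patience = default_patience  # reset on every kept experiment
            decisions.append("keep")
        else:
            patience -= 1                # unchanged or worse: discard
            decisions.append("discard")
    return best, decisions, patience
```

With a baseline of 4 and scores `[3, 5, 5, 4]`, only the second experiment is kept and patience ends at 8.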

Confirmation at 100%

Do NOT immediately stop at 100%. Run a confirmation evaluation (fresh subagent, same criteria, no mutation). Log as separate experiment with "confirmation run". If confirmation also 100%: done. If below: continue loop, do NOT reset patience.

When patience < 5 (stuck):

Re-read all failures and full CHANGELOG
Try opposites of previous failed mutations
Try combining near-miss mutations
Try simplification
Dispatch fresh brainstormer with complete CHANGELOG

Phase 3: Regression Analysis

Input: audit-report-optimized.md (from Phase 2) + historical audit data
Output: Annotated audit-report-optimized.md with regression section

NOT a forge loop -- deterministic comparison. No mutations, no patience.

Query and Compare

Standalone mode: Read docs/audits/ directory for prior reports matching the scope. Parse finding counts (critical/warning/info) from markdown headers. Parse effectiveness scores from the Lesson Effectiveness table. No .auditor/audit-cache/ -- the report IS the record.

DB mode: Query audit_runs for previous records matching same scope_type + scope_ref (last 10 runs).

If previous reports exist:

Finding count trends -- Compare critical/warning/info counts. Calculate direction: increasing (degrading), decreasing (improving), stable (within +/- 1).
Lesson effectiveness trends -- Compare graduated scores per lesson across audits. Flag declining (dropped > 0.1), improving (rose > 0.1), consistently low (< 0.3 across 3+ audits).
Recurring failure patterns -- Same problem_key across multiple audits. Count recurrences. Flag problem_keys recurring despite promoted lessons.
Trend summary -- Improving (criticals down AND effectiveness up), Degrading (criticals up OR effectiveness down), Stable, or Insufficient data (< 2 prior audits).
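The comparison rules above are deterministic and can be written as plain functions. The sketch below is one possible reading: function names are hypothetical, "consistently low" is assumed to mean that each of the last three audits scored below 0.3, and lesson trends compare only the two most recent scores.

```python
from typing import List


def count_direction(previous: int, current: int) -> str:
    """Direction of a finding count; changes within +/- 1 count as stable."""
    delta = current - previous
    if abs(delta) <= 1:
        return "stable"
    return "increasing" if delta > 0 else "decreasing"


def lesson_flags(scores: List[float]) -> List[str]:
    """Flags for one lesson's graduated scores, ordered oldest to newest."""
    flags: List[str] = []
    if len(scores) >= 2:
        change = scores[-1] - scores[-2]
        if change < -0.1:
            flags.append("declining")
        elif change > 0.1:
            flags.append("improving")
    # Assumption: "across 3+ audits" is read as the three most recent audits.
    if len(scores) >= 3 and all(s < 0.3 for s in scores[-3:]):
        flags.append("consistently low")
    return flags


def trend_summary(prior_audits: int, criticals: str, effectiveness: str) -> str:
    """Overall trend from the critical-count direction and effectiveness trend."""
    if prior_audits < 2:
        return "insufficient data"
    if criticals == "increasing" or effectiveness == "declining":
        return "degrading"
    if criticals == "decreasing" and effectiveness == "improving":
        return "improving"
    return "stable"
```

Because "degrading" uses OR while "improving" uses AND, a mixed result (criticals down but effectiveness down) is reported as degrading.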

Annotate Report

Add "Regression Analysis" section to audit-report-optimized.md:

Finding count comparison table
Lesson effectiveness trend table
Recurring failure patterns
Overall trend with justification
If no history: "First audit for this scope -- baseline established."

Phase 4: Meta-Improvement

Input: audit-report-optimized.md (with regression annotations)
Output: Zero or more meta-slice-{N}.json files

Skipped for effectiveness scope or recursion_depth = 2.

Generate Meta-Slices

Review all critical/warning findings. For each systemic issue (affects multiple slices or caused by repeatable pipeline behavior), generate a meta-slice contract:

{
  "meta_slice_id": "uuid-v4",
  "audit_run_id": "parent audit uuid",
  "target": "ideot|ralph|auditor|hailee",
  "target_component": "specific file or evaluation",
  "hypothesis": "why this change will improve the pipeline",
  "proposed_change": "concrete description",
  "evidence": ["finding_id_1", "finding_id_2"],
  "verifier_command": "how to verify",
  "risk_level": "low|medium|high",
  "ecosystem_fit_notes": ""
}

Requirements: target_component must be a real path (validate with grep/glob). hypothesis must be falsifiable. proposed_change must be implementable without clarifying questions. ecosystem_fit_notes left empty for Phase 5.
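A structural check of the contract might look like the sketch below. It covers only what can be verified mechanically from the JSON itself; `validate_meta_slice` is a hypothetical helper, the non-empty evidence rule is an assumption, and path existence, falsifiability, and implementability still need the grep/glob check and reviewer judgment described above.

```python
import uuid
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "meta_slice_id", "audit_run_id", "target", "target_component",
    "hypothesis", "proposed_change", "evidence", "verifier_command",
    "risk_level", "ecosystem_fit_notes",
)
TARGETS = {"ideot", "ralph", "auditor", "hailee"}
RISK_LEVELS = {"low", "medium", "high"}


def validate_meta_slice(contract: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in contract]
    if problems:
        return problems
    try:
        if uuid.UUID(contract["meta_slice_id"]).version != 4:
            problems.append("meta_slice_id is not a version-4 UUID")
    except ValueError:
        problems.append("meta_slice_id is not a UUID")
    if contract["target"] not in TARGETS:
        problems.append("target must be one of ideot, ralph, auditor, hailee")
    if contract["risk_level"] not in RISK_LEVELS:
        problems.append("risk_level must be one of low, medium, high")
    if not contract["evidence"]:
        # Assumption: a contract should cite at least one finding.
        problems.append("evidence must cite at least one finding")
    if contract["ecosystem_fit_notes"] != "":
        problems.append("ecosystem_fit_notes must stay empty until Phase 5")
    return problems
```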

Save as meta-slice-{N}.json, numbered from 1. If scope is "self"/"pipeline" and recursion_depth < 2: flag HIGH risk slices for Phase 5 self-audit.

Phase 5: Ecosystem Fit Review

Input: meta-slice-{N}.json files
Output: Updated meta-slices with ecosystem_fit_notes and validated risk_level

Skipped when no meta-slices generated, scope is effectiveness, or recursion_depth = 2.

Validate Each Meta-Slice

Three checks per meta-slice:

Target existence -- grep/glob for target_component. If not found: mark invalid.
Operator override conflicts -- Query lessons where lesson_type='operator_override'. Flag conflicts, elevate risk to at least MEDIUM.
Risk assessment:

Risk	Criteria	Examples
LOW	Thresholds, documentation, info-severity	Timeout values, comments, log levels
MEDIUM	New evaluations, contract fields, warning-severity	Adding evaluations, schema changes
HIGH	Lesson promotion logic, verifier behavior, state transitions, critical-severity	Promotion rules, pass/fail semantics, state machines

QA Gates

Risk	Gate
LOW	Ecosystem fitness pass
MEDIUM	Fitness pass + snapshot regression baseline
HIGH	Fitness pass + regression baseline + pre-route self-audit (if recursion_depth < 2)

The HIGH self-audit examines: component dependencies, unintended side effects, verifier_command sufficiency.
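The risk elevation and gate selection can be expressed as two small functions. This is a sketch with hypothetical names; the gate labels are shortened versions of the table entries.

```python
from typing import List

RISK_ORDER = ["low", "medium", "high"]


def elevate(risk: str, floor: str = "medium") -> str:
    """Raise a risk level to at least `floor` (used for override conflicts)."""
    return max(risk, floor, key=RISK_ORDER.index)


def qa_gates(risk: str, recursion_depth: int) -> List[str]:
    """Gates a meta-slice must pass before routing, per the QA Gates table."""
    gates = ["ecosystem fitness pass"]
    if risk in ("medium", "high"):
        gates.append("regression baseline snapshot")
    if risk == "high" and recursion_depth < 2:
        # The pre-route self-audit is dropped once the recursion cap is hit.
        gates.append("pre-route self-audit")
    return gates
```

A LOW meta-slice that conflicts with an operator override would first be elevated with `elevate("low")` and then pick up the MEDIUM gates.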

Update each meta-slice-{N}.json with results.

Phase 6: Audit Review

Input: All artifacts from Phases 1-5
Output: Operator decision: accept / deeper audit / dismiss

Present via AskUserQuestion:

Executive summary -- one paragraph
Score improvement -- "Baseline [X]% --> Final [Y]% over [N] experiments" (skip for effectiveness scope)
Top 3 mutations that helped most (skip for effectiveness scope)
Remaining failures with what/where/why (skip for effectiveness scope)
Finding count by severity with trend direction
Lesson effectiveness summary -- table with best/worst highlighted
Meta-slices generated -- count, targets, risk levels (skip for effectiveness scope)
Regression trend -- one line

Options:

Accept:
1. Save report to docs/audits/YYYY-MM-DD-{scope_short}.md
2. Write DB records:
- audit_runs (audit_run_id, scope_type, scope_ref, created_at, detail_json with finding_counts, effectiveness_summary, recursion_depth, scores, meta_slice_count, regression_trend)
- audit_findings (finding_id, audit_run_id, severity, description, evidence, slice_id)
- lesson_effectiveness (lesson_id, audit_run_id, effectiveness_score, base_score, time_decay, opportunity_factor, confidence_weight)
3. Auto-route valid meta-slices: LOW = standard notification, MEDIUM = elevated + baseline, HIGH = urgent + baseline + self-audit results
4. Announce: "Audit accepted. Report saved. [N] meta-slices routed ([L] low, [M] medium, [H] high)."
Accept (standalone mode, no DB writes):
1. Save report to docs/audits/YYYY-MM-DD-{scope_short}.md (same as DB mode)
2. Append new lessons to .auditor/lessons.json (if the file does not exist, create it as {"schema_version":1,"lessons":[]})
3. Keep meta-slices as meta-slice-{N}.json in the working directory (same as DB mode)
4. Skip all DB records -- persistence is filesystem-only
Deeper audit: Re-enter Phase 2 with patience 5 focusing on remaining failures. Skip Phases 3-5 on second pass.
Dismiss: Discard all. No DB records, no routing.

7 Stable Audit Evaluations

EVALUATION 1: Coverage
Question: Does the audit address every completed slice in scope with zero gaps?
Pass: Every loop_id in scope has at least one finding or explicit "no issues found" notation
Fail: Any slice is not mentioned or analyzed

EVALUATION 2: Evidence-Grounded
Question: Does every finding cite specific verifier outputs, lesson IDs, or git refs?
Pass: Every finding contains at least one concrete reference (ID, SHA, or output excerpt)
Fail: Any finding relies on general statements without specific evidence

EVALUATION 3: Causal Accuracy
Question: Does the failure-to-lesson-to-action chain accurately reflect what happened?
Pass: Every causal chain can be verified against the corpus data
Fail: Any chain misattributes causation, reverses temporal ordering, or cites nonexistent records

EVALUATION 4: Regression Sensitivity
Question: Are degradation patterns identified that span multiple slices?
Pass: Multi-slice trends are analyzed and patterns (positive or negative) are called out
Fail: Each slice is analyzed in isolation without cross-slice trend comparison

EVALUATION 5: Lesson Effectiveness
Question: For each promoted lesson, is there evidence it prevented or failed to prevent recurrence?
Pass: Every lesson in scope has a measured effectiveness assessment with the graduated formula
Fail: Any lesson lacks a recurrence check or effectiveness judgment

EVALUATION 6: Actionability
Question: Does every recommendation have a concrete next step?
Pass: Every recommendation specifies either a meta-slice, operator override, or investigation path with enough detail to act on
Fail: Any recommendation is vague ("should improve") without a concrete action

EVALUATION 7: Self-Consistency
Question: Are the audit's recommendations free of contradictions with each other and with existing operator overrides?
Pass: No recommendation conflicts with another recommendation or a known operator decision
Fail: Any pair of recommendations is contradictory, or any recommendation overrides an explicit operator decision without flagging this

Graduated Lesson Effectiveness Formula

effectiveness_score = base_score * time_decay * opportunity_factor * confidence_weight

base_score (0.0 or 1.0): 1.0 if problem_key has NOT recurred after lesson promotion, 0.0 if it has.

time_decay (0.1 - 1.0): max(0.1, 1.0 - (days_since_lesson / 180)). Today = 1.0, 90 days = 0.5, 162+ days = 0.1 (the floor, since 1.0 - 162/180 = 0.1).

opportunity_factor (0.0 - 1.0): min(1.0, opportunities_for_recurrence / 3). Opportunities = post-promotion slices where the problem_key could have occurred (same repo, same verifier type). 0 opportunities = 0.0, 3+ = 1.0.

confidence_weight (0.0 - 1.0): The lesson's confidence field directly.

Score	Label	Interpretation
0.8 - 1.0	Highly effective	Working as intended
0.5 - 0.79	Moderately effective	May need refinement
0.2 - 0.49	Weakly effective	Not reliably preventing recurrence
0.0 - 0.19	Ineffective	Review for retirement or revision
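The formula and label table translate directly into code. The function names below are hypothetical, and the label thresholds are treated as half-open intervals (for example, anything at or above 0.5 and below 0.8 is "Moderately effective"), which is an assumption made to cover scores that fall between the listed ranges.

```python
def time_decay(days_since_lesson: float) -> float:
    """Linear decay over 180 days with a 0.1 floor."""
    return max(0.1, 1.0 - days_since_lesson / 180)


def opportunity_factor(opportunities: int) -> float:
    """Scales up to 1.0 once the lesson has had 3 or more chances to fail."""
    return min(1.0, opportunities / 3)


def effectiveness_score(
    recurred: bool, days_since_lesson: float, opportunities: int, confidence: float
) -> float:
    """base_score * time_decay * opportunity_factor * confidence_weight."""
    base_score = 0.0 if recurred else 1.0
    return (
        base_score
        * time_decay(days_since_lesson)
        * opportunity_factor(opportunities)
        * confidence
    )


def effectiveness_label(score: float) -> str:
    """Map a score to its label (thresholds assumed half-open)."""
    if score >= 0.8:
        return "Highly effective"
    if score >= 0.5:
        return "Moderately effective"
    if score >= 0.2:
        return "Weakly effective"
    return "Ineffective"
```

As a worked case: a lesson promoted 90 days ago with confidence 0.8, no recurrence, and 3 opportunities scores 1.0 * 0.5 * 1.0 * 0.8 = 0.4, which falls in the "Weakly effective" band. Any recurrence sets base_score to 0.0 and therefore the whole product to 0.0.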

Meta-Slice Risk Tiers and Routing

All tiers auto-route with notification. The auditor + self-heal loop ARE the safety net -- no human gates.

Risk	Notification	QA Gate
LOW	Standard	Ecosystem fitness pass
MEDIUM	Elevated	Ecosystem fitness pass + regression baseline snapshot
HIGH	Urgent	Ecosystem fitness pass + regression baseline + pre-route self-audit

Recursion Rules

Max depth: 2 (audit --> audit-of-audit --> stop)
Scope "self" or "pipeline" triggers recursive self-audit at depth+1
Depth tracked in audit_runs.detail_json as "recursion_depth": N
At depth 2: skip Phase 4 and Phase 5 -- report findings only

Subagent Information Isolation

Role	Receives	Does NOT receive
Evaluator	Audit report + evaluation criteria only	Mutation info, changelog, scores, reasoning, experiment history
Brainstormer	Audit report + failing evaluations (what/where/why) + CHANGELOG	Critic output, evaluator reasoning
Critic	Audit report ONLY	Evaluation criteria, scores, failures, changelog, mutation context

Evaluator Response Format

Per evaluation:

evaluation_name: [name]
status: true|false
what: [one sentence] (required if false, omit if true)
where: [section or element] (required if false, omit if true)
why: [reason] (required if false, omit if true)
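A sketch of how evaluator output in this format might be checked and turned into a score. It assumes each response has already been parsed into a dictionary with a boolean `status`; both helper names are hypothetical.

```python
from typing import Any, Dict, List


def response_problems(result: Dict[str, Any]) -> List[str]:
    """Check one evaluation result against the required-if-false rules."""
    if not isinstance(result.get("status"), bool):
        return ["status must be true or false"]
    problems: List[str] = []
    if result["status"] is False:
        for field in ("what", "where", "why"):
            if not result.get(field):
                problems.append(f"{field} is required when status is false")
    return problems


def score_percent(results: List[Dict[str, Any]]) -> float:
    """Percentage of evaluations whose status is true."""
    passed = sum(1 for r in results if r["status"])
    return 100.0 * passed / len(results)
```

With 5 of 7 evaluations passing, `score_percent` returns roughly 71.4, which is the figure that would appear in the CHANGELOG entry.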

CHANGELOG Entry Format

## Experiment [N] -- [keep/discard] (Phase: Forge)
**Score:** [X]/7 ([percent]%)
**Mutation:** [What was changed]
**Source:** [Brainstormer proposal #N / Critic finding / Both]
**Reasoning:** [Why this was expected to help]
**Result:** [Which evaluations improved/declined/unchanged]
**Failures:**
- [evaluation_name]: what=[what] where=[where] why=[why]

Working Directory Naming

auditor-{scope_short}-YYYY-MM-DD-HH-MM/

Scope Type	scope_short	Example
LOOP	loop_id first 8 chars	`auditor-a1b2c3d4-2026-03-30-14-22/`
REPO	repo_id	`auditor-my-repo-2026-03-30-14-22/`
PIPELINE	"pipeline"	`auditor-pipeline-2026-03-30-14-22/`
EFFECTIVENESS	repo_id + "-eff"	`auditor-my-repo-eff-2026-03-30-14-22/`
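The naming rule in the table can be sketched as follows. The function names are hypothetical, and only the four scope types listed in the table are covered.

```python
from datetime import datetime


def scope_short(scope_type: str, scope_ref: str) -> str:
    """Derive the short scope name used in the working directory."""
    if scope_type == "LOOP":
        return scope_ref[:8]            # first 8 chars of the loop_id
    if scope_type == "REPO":
        return scope_ref
    if scope_type == "PIPELINE":
        return "pipeline"
    if scope_type == "EFFECTIVENESS":
        return scope_ref + "-eff"
    raise ValueError(f"unknown scope type: {scope_type}")


def working_dir(scope_type: str, scope_ref: str, now: datetime) -> str:
    """Build auditor-{scope_short}-YYYY-MM-DD-HH-MM/ for the given time."""
    return f"auditor-{scope_short(scope_type, scope_ref)}-{now:%Y-%m-%d-%H-%M}/"
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside, keeps the directory name reproducible in tests.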

Output Files

auditor-{scope_short}-YYYY-MM-DD-HH-MM/
+-- dashboard.html              # Live dashboard (auto-refreshes)
+-- results.tsv                 # Score log (one row per evaluation per experiment)
+-- CHANGELOG.md                # Detailed mutation log
+-- audit-report-baseline.md    # Initial report (never modified)
+-- audit-report-optimized.md   # Working copy during Phase 2
+-- meta-slice-1.json           # Meta-slice contracts (if generated)
+-- meta-slice-2.json
+-- ...

Final accepted report is saved to docs/audits/YYYY-MM-DD-{scope_short}.md.
