anthonyjlee

auditor

"QA auditor that examines completed pipeline executions, detects regressions, scores lesson effectiveness with a graduated model, and generates risk-tiered meta-improvement proposals through a forge loop."

anthonyjlee 0 Updated 2mo ago


Install

npx skillscat add anthonyjlee/auditor

Install via the SkillsCat registry.

SKILL.md

Announce at start: "I'll run a QA audit on [scope]. Let me gather the execution data and build an audit corpus."

Auditor

Examine completed pipeline executions, detect regressions, score lesson effectiveness with a graduated model, and generate risk-tiered meta-improvement proposals through a forge loop.

Goal

Produce three artifact classes:

  1. A comprehensive, evidence-grounded audit report covering all slices in scope
  2. A lesson effectiveness assessment with graduated scoring for every promoted lesson
  3. Zero or more meta-slice contracts proposing concrete pipeline improvements, risk-tiered and ecosystem-validated

All artifacts are stress-tested through an autonomous scored loop before operator review.

Non-Negotiable Rules

  1. Use AskUserQuestion for ALL operator interaction -- never ask in plain text
  2. The operator confirms the audit scope interactively (Phase 1) before any autonomous forging begins (Phase 2)
  3. Complete each phase in order: Audit Intake --> Audit Forge --> Regression Analysis --> Meta-Improvement --> Ecosystem Fit Review --> Audit Review
  4. Every experiment is scored and recorded -- never skip logging
  5. Use subagents for evaluation, brainstorming, and criticism -- never rely on main agent alone
  6. Never touch the baseline -- always work on the optimized copy
  7. One mutation per experiment -- never change multiple things at once
  8. No fixed loop limits -- run until patience is exhausted
  9. Surface uncertainty honestly -- missing data becomes explicit assumptions, never hidden behind confident prose
  10. Self-audit recursion depth is capped at 2
  11. All meta-slice risk tiers auto-route with notification -- no human gates

Entry Point

Determine the audit scope from the operator's input:

  • loop-{id} -- Single loop audit (AuditScopeType.LOOP): Full six-phase pipeline scoped to one loop run.

  • {repo_id} last {N} -- Repo audit, max 20 slices (AuditScopeType.REPO): Full six-phase pipeline scoped to the repo's recent slices.

  • self or pipeline -- Recursive pipeline audit (AuditScopeType.PIPELINE): Full six-phase pipeline scoped to all components. At Phase 4: if recursion_depth < 2, flag meta-slices for self-audit. At depth 2: skip Phase 4 and Phase 5, report findings only.

  • effectiveness {repo_id} -- Quick lesson effectiveness check (no forge loop): Phase 1 (intake) then straight to Phase 6 (review). No forge, no regression, no meta-slices.

  • standalone -- Standalone repo audit (no DB): Full six-phase pipeline reading from git log, test-results/, and filesystem artifacts instead of the control plane DB. Auto-detected when packages/contracts/audit.py is not importable OR the audit_runs DB table does not exist.

  • standalone last {N} -- Standalone audit scoped to last N commits (max 50).

  • standalone since {YYYY-MM-DD} -- Standalone audit scoped to commits after the given date.

  • standalone plan {path} -- Standalone audit scoped to commits linked to a specific plan file.

Always follow the phases in order for each scope type -- never reorder or skip phases within a path (Rule 3).


Phase 1: Audit Intake

Input: Operator's scope specification
Output: Audit corpus, confirmed scope, working directory with templates

Parse Scope

| Input Pattern | Scope Type | Scope Ref | Max Slices |
| --- | --- | --- | --- |
| loop-{id} | LOOP | The loop_id | 1 |
| {repo_id} last {N} | REPO | The repo_id | min(N, 20) |
| self or pipeline | PIPELINE | "pipeline" | All |
| effectiveness {repo_id} | EFFECTIVENESS | The repo_id | All |

Gather Execution Data

Detect mode: Check for ailee control plane by testing whether packages/contracts/audit.py is importable AND the audit_runs DB table exists. If both true: DB mode. Otherwise: standalone mode. Mode is selected ONCE at Phase 1 intake; all subsequent phases operate on the common AuditCorpus model regardless of backend.
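The mode check above could be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes the control plane DB is SQLite and that packages/contracts/audit.py maps to the module name packages.contracts.audit. The helper name detect_mode is hypothetical.

```python
import importlib.util
import sqlite3


def detect_mode(conn: sqlite3.Connection,
                module_name: str = "packages.contracts.audit") -> str:
    """Return "db" only if the module is importable AND audit_runs exists.

    Assumptions (not from the skill): a SQLite backend and this module name.
    """
    try:
        importable = importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package cannot be imported.
        importable = False

    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        ("audit_runs",),
    ).fetchone()
    return "db" if importable and row is not None else "standalone"
```

Calling this once during Phase 1 and storing the result matches the rule that the mode is selected a single time.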

DB mode: Query loop_runs, landing_results, lessons, repair_attempts, facts, and thread_events from the control plane DB.

Standalone mode data sources:

  • git log --since=<scope_start> --format -> execution timeline (one commit = one event)
  • test-results/*.json or test-results/*.xml -> verifier outcomes (pytest JSON / JUnit XML)
  • If no test-results/ directory exists -> run verifier_command from plan, capture output
  • docs/plans/ -> ideot plan artifact (for plan-vs-landed comparison)
  • .auditor/lessons.json -> lesson registry (created/appended by auditor after each run)
  • docs/audits/YYYY-MM-DD-*.md -> previous audit reports for regression baseline

Commit-to-slice linking: Commits tagged with [slice:X] in the commit message subject link to plan slices by ID.
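A minimal sketch of extracting the slice tag from a commit subject. It assumes slice IDs contain no whitespace or closing bracket. The names SLICE_TAG and slice_id_from_subject are illustrative, not part of the skill.

```python
import re
from typing import Optional

# Matches "[slice:X]" anywhere in the subject; X is captured as the slice ID.
# Assumption: IDs contain neither whitespace nor "]".
SLICE_TAG = re.compile(r"\[slice:([^\]\s]+)\]")


def slice_id_from_subject(subject: str) -> Optional[str]:
    """Return the slice ID from a commit subject, or None if untagged."""
    match = SLICE_TAG.search(subject)
    return match.group(1) if match else None
```

Untagged commits return None, so they can still appear in the timeline without being linked to a plan slice.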

Standalone scope parsing:

| Input Pattern | Regex | Scope Ref | Time Range |
| --- | --- | --- | --- |
| standalone | ^standalone$ | HEAD | Last 20 commits |
| standalone last N | ^standalone\s+last\s+(\d+)$ | HEAD | Last min(N, 50) commits |
| standalone since <date> | ^standalone\s+since\s+(\d{4}-\d{2}-\d{2})$ | HEAD | Commits after date |
| standalone plan <path> | ^standalone\s+plan\s+(.+)$ | Plan file | Commits with Plan: <path> in message |

If input does not match any pattern, abort with: "Unrecognized standalone scope format. Expected: standalone [last N | since YYYY-MM-DD | plan <path>]"
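The standalone patterns can be applied roughly as below. The regexes are the ones listed above. The returned dictionary shape, the function name parse_standalone_scope, and stripping surrounding whitespace before matching are assumptions made for illustration.

```python
import re
from typing import Dict

# Regexes copied from the standalone scope table.
PATTERNS = [
    ("default", re.compile(r"^standalone$")),
    ("last", re.compile(r"^standalone\s+last\s+(\d+)$")),
    ("since", re.compile(r"^standalone\s+since\s+(\d{4}-\d{2}-\d{2})$")),
    ("plan", re.compile(r"^standalone\s+plan\s+(.+)$")),
]


def parse_standalone_scope(text: str) -> Dict[str, object]:
    """Map operator input to a scope description (hypothetical shape)."""
    text = text.strip()  # assumption: tolerate surrounding whitespace
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if not match:
            continue
        if kind == "default":
            return {"kind": "last", "commits": 20}
        if kind == "last":
            # Cap at 50 commits, per the table.
            return {"kind": "last", "commits": min(int(match.group(1)), 50)}
        if kind == "since":
            return {"kind": "since", "date": match.group(1)}
        return {"kind": "plan", "path": match.group(1)}
    raise ValueError("Unrecognized standalone scope format.")
```

For example, `parse_standalone_scope("standalone last 80")` is capped to 50 commits, and an unmatched input raises rather than silently falling back to the default.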

Build Audit Corpus

Assemble queried data into a structured markdown document:

  1. Scope metadata (type, ref, time range, query timestamp)
  2. Slice inventory table (loop_id, status, verifier_status, cleanup_status)
  3. Failure analysis (per failed slice: blocker_reason, error_summary, classification, repair outcome)
  4. Lesson inventory table (lesson_id, problem_key, type, confidence, status)
  5. Recurrence data (per lesson: count of post-promotion slices where same problem_key appeared)
  6. Timeline reconstruction (chronological event chain)
  7. Missing data inventory (explicit list of empty/incomplete query results)

Confirm Scope

Present corpus summary via AskUserQuestion:

"Audit corpus built for [scope_type]: [scope_ref]. Time range: [earliest] to [latest]. [N] slices ([M] failed, [K] succeeded). [L] lessons ([P] promoted). [R] repair attempts. Missing data: [list or 'none']. Should I proceed?"

Options: Proceed / Narrow scope / Expand scope

Do not proceed until confirmed.

Setup

  1. Derive scope_short: LOOP = loop_id first 8 chars, REPO = repo_id, PIPELINE = "pipeline", EFFECTIVENESS = repo_id + "-eff"
  2. Create working directory: auditor-{scope_short}-YYYY-MM-DD-HH-MM/
  3. Copy DASHBOARD verbatim to dashboard.html
  4. Copy RESULTS verbatim to results.tsv
  5. Create empty CHANGELOG.md
  6. Write audit-report-baseline.md from corpus using AUDIT REPORT TEMPLATE
  7. Copy to audit-report-optimized.md (working copy)
  8. Start HTTP server:
    cd auditor-{scope_short}-YYYY-MM-DD-HH-MM && python3 -m http.server 7824 &
    Tell operator: "Dashboard live at http://localhost:7824/dashboard.html"

Phase 2: Audit Forge Loop

Input: audit-report-optimized.md (from Phase 1)
Output: Improved audit-report-optimized.md (scored against 7 evaluations)

Skipped for effectiveness scope.

Baseline

  1. Dispatch evaluator subagent to score initial report against all 7 evaluations
  2. Record in results.tsv with phase: audit, experiment: 0, status: baseline
  3. Announce: "Audit baseline: [score]%"

The Loop

Run autonomously. No interaction. No stopping to ask.

Each experiment:

  1. Analyze -- Which evaluations fail most? Read failing sections. Identify patterns across recent failures.

  2. Brainstorm + Critique -- Dispatch two subagents in parallel. Information isolation is critical:

    • Brainstormer receives: current artifact + structured failure output (what/where/why) + CHANGELOG.md. Does NOT receive critic output.

      "Here is an audit report and its failing evaluations with structured diagnostics: [list]. Here is the history of prior mutations: [changelog]. Propose 3 specific mutations. Each must change exactly one thing."

    • Critic receives: current artifact ONLY. No criteria, no scores, no changelog, no mutation context.

      "Review this audit report as a hostile reviewer. No context about history or scoring. Find every flaw, gap, unstated assumption, missing evidence, broken causal chain. Be ruthless."

  3. Synthesize -- Pick ONE best mutation from combined proposals. Prefer mutations addressing both critic findings AND failing evaluations.

  4. Mutate -- Apply single change to audit-report-optimized.md. Never touch baseline.

  5. Evaluate -- Dispatch evaluator subagent with strict isolation. Receives ONLY artifact + criteria. No mutation info, changelog, scores, or history.

    "Score the following audit report against these evaluations. For each: status (true/false). For every 'false': what, where, why. Do NOT suggest fixes.

    DOCUMENT: [full audit-report-optimized.md]
    EVALUATIONS: [7 stable audit evaluations]"

    Evaluator returns per evaluation:

    • status: boolean
    • what: one sentence (required if false)
    • where: section or element (required if false)
    • why: reason (required if false)
  6. Decide + Log -- ONE atomic step.

    Decide:

    • Score improved --> KEEP. Reset patience.
    • Score unchanged or worse --> DISCARD. Restore from last kept state. Decrement patience.

    IMMEDIATELY log:

    • results.tsv: one row per evaluation for THIS experiment
    • CHANGELOG.md: one entry for THIS experiment (see CHANGELOG format below)

    Never batch multiple experiments into one log entry.

  7. Repeat.

Patience

  • Default: 10 consecutive no-improvement experiments
  • Resets to 10 on every kept experiment
  • At 0: stop, proceed to Phase 3
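The keep/discard rule and the patience counter can be sketched by replaying a list of experiment scores. This is a simplified model under stated assumptions: real experiments dispatch subagents, and the 100% confirmation run is left out. The function name run_forge is hypothetical.

```python
from typing import List, Tuple


def run_forge(scores: List[float], baseline: float,
              patience: int = 10) -> Tuple[float, List[str]]:
    """Replay experiment scores and apply the keep/discard rule."""
    best = baseline
    remaining = patience
    decisions: List[str] = []
    for score in scores:
        if remaining == 0:
            break  # patience exhausted: proceed to Phase 3
        if score > best:
            best = score
            remaining = patience  # reset on every kept experiment
            decisions.append("keep")
        else:
            remaining -= 1  # unchanged or worse: discard and restore
            decisions.append("discard")
    return best, decisions
```

With patience 2, baseline 3 and scores `[4, 4, 5, 5, 5, 6]`, the loop stops before the final 6 because two consecutive experiments failed to improve on 5.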

Confirmation at 100%

Do NOT immediately stop at 100%. Run a confirmation evaluation (fresh subagent, same criteria, no mutation). Log as separate experiment with "confirmation run". If confirmation also 100%: done. If below: continue loop, do NOT reset patience.

When patience < 5 (stuck):

  • Re-read all failures and full CHANGELOG
  • Try opposites of previous failed mutations
  • Try combining near-miss mutations
  • Try simplification
  • Dispatch fresh brainstormer with complete CHANGELOG

Phase 3: Regression Analysis

Input: audit-report-optimized.md (from Phase 2) + historical audit data
Output: Annotated audit-report-optimized.md with regression section

NOT a forge loop -- deterministic comparison. No mutations, no patience.

Query and Compare

Standalone mode: Read docs/audits/ directory for prior reports matching the scope. Parse finding counts (critical/warning/info) from markdown headers. Parse effectiveness scores from the Lesson Effectiveness table. No .auditor/audit-cache/ -- the report IS the record.

DB mode: Query audit_runs for previous records matching same scope_type + scope_ref (last 10 runs).

If previous reports exist:

  1. Finding count trends -- Compare critical/warning/info counts. Calculate direction: increasing (degrading), decreasing (improving), stable (within +/- 1).
  2. Lesson effectiveness trends -- Compare graduated scores per lesson across audits. Flag declining (dropped > 0.1), improving (rose > 0.1), consistently low (< 0.3 across 3+ audits).
  3. Recurring failure patterns -- Same problem_key across multiple audits. Count recurrences. Flag problem_keys recurring despite promoted lessons.
  4. Trend summary -- Improving (criticals down AND effectiveness up), Degrading (criticals up OR effectiveness down), Stable, or Insufficient data (< 2 prior audits).
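A rough sketch of the comparison rules above. The per-lesson 0.1 threshold is reused here for the overall effectiveness change, which is an assumption: the text does not define how "effectiveness up" is measured at the summary level. Function names are illustrative.

```python
def count_direction(previous: int, current: int) -> str:
    """Classify a finding-count change; within +/- 1 counts as stable."""
    delta = current - previous
    if abs(delta) <= 1:
        return "stable"
    return "increasing" if delta > 0 else "decreasing"


def trend_summary(prior_audits: int, criticals_direction: str,
                  effectiveness_delta: float) -> str:
    """Combine critical-count direction and effectiveness change."""
    if prior_audits < 2:
        return "Insufficient data"
    # Assumption: reuse the 0.1 per-lesson threshold for the overall delta.
    if criticals_direction == "increasing" or effectiveness_delta < -0.1:
        return "Degrading"
    if criticals_direction == "decreasing" and effectiveness_delta > 0.1:
        return "Improving"
    return "Stable"
```

Note the asymmetry taken from the rules: either signal alone is enough for "Degrading", while "Improving" needs both.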

Annotate Report

Add "Regression Analysis" section to audit-report-optimized.md:

  • Finding count comparison table
  • Lesson effectiveness trend table
  • Recurring failure patterns
  • Overall trend with justification
  • If no history: "First audit for this scope -- baseline established."

Phase 4: Meta-Improvement

Input: audit-report-optimized.md (with regression annotations)
Output: Zero or more meta-slice-{N}.json files

Skipped for effectiveness scope or recursion_depth = 2.

Generate Meta-Slices

Review all critical/warning findings. For each systemic issue (affects multiple slices or caused by repeatable pipeline behavior), generate a meta-slice contract:

{
  "meta_slice_id": "uuid-v4",
  "audit_run_id": "parent audit uuid",
  "target": "ideot|ralph|auditor|hailee",
  "target_component": "specific file or evaluation",
  "hypothesis": "why this change will improve the pipeline",
  "proposed_change": "concrete description",
  "evidence": ["finding_id_1", "finding_id_2"],
  "verifier_command": "how to verify",
  "risk_level": "low|medium|high",
  "ecosystem_fit_notes": ""
}

Requirements: target_component must be a real path (validate with grep/glob). hypothesis must be falsifiable. proposed_change must be implementable without clarifying questions. ecosystem_fit_notes left empty for Phase 5.
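A structural check of a meta-slice contract might look like the sketch below. It only covers what can be verified mechanically from the JSON shape. Path existence, falsifiability, and implementability still need the grep/glob step and reviewer judgment. The function name contract_problems is hypothetical.

```python
from typing import Dict, List

REQUIRED = {
    "meta_slice_id", "audit_run_id", "target", "target_component",
    "hypothesis", "proposed_change", "evidence", "verifier_command",
    "risk_level", "ecosystem_fit_notes",
}
TARGETS = {"ideot", "ralph", "auditor", "hailee"}
RISK_LEVELS = {"low", "medium", "high"}


def contract_problems(contract: Dict[str, object]) -> List[str]:
    """Return a list of structural problems; empty means the shape is fine."""
    missing = sorted(REQUIRED - set(contract))
    problems = [f"missing field: {name}" for name in missing]
    if "target" in contract and contract["target"] not in TARGETS:
        problems.append("target must be one of ideot, ralph, auditor, hailee")
    if "risk_level" in contract and contract["risk_level"] not in RISK_LEVELS:
        problems.append("risk_level must be low, medium, or high")
    if contract.get("ecosystem_fit_notes"):
        # Phase 4 leaves this empty; Phase 5 fills it in.
        problems.append("ecosystem_fit_notes must be empty before Phase 5")
    return problems
```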

Save as meta-slice-{N}.json, numbered from 1. If scope is "self"/"pipeline" and recursion_depth < 2: flag HIGH risk slices for Phase 5 self-audit.


Phase 5: Ecosystem Fit Review

Input: meta-slice-{N}.json files
Output: Updated meta-slices with ecosystem_fit_notes and validated risk_level

Skipped when no meta-slices generated, scope is effectiveness, or recursion_depth = 2.

Validate Each Meta-Slice

Three checks per meta-slice:

  1. Target existence -- grep/glob for target_component. If not found: mark invalid.
  2. Operator override conflicts -- Query lessons where lesson_type='operator_override'. Flag conflicts, elevate risk to at least MEDIUM.
  3. Risk assessment:

| Risk | Criteria | Examples |
| --- | --- | --- |
| LOW | Thresholds, documentation, info-severity | Timeout values, comments, log levels |
| MEDIUM | New evaluations, contract fields, warning-severity | Adding evaluations, schema changes |
| HIGH | Lesson promotion logic, verifier behavior, state transitions, critical-severity | Promotion rules, pass/fail semantics, state machines |

QA Gates

| Risk | Gate |
| --- | --- |
| LOW | Ecosystem fitness pass |
| MEDIUM | Fitness pass + snapshot regression baseline |
| HIGH | Fitness pass + regression baseline + pre-route self-audit (if recursion_depth < 2) |

The HIGH self-audit examines: component dependencies, unintended side effects, verifier_command sufficiency.
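The risk elevation and gate selection described in this phase can be sketched as a lookup. The gate labels are paraphrased from the table, and the function names are illustrative rather than part of the skill.

```python
from typing import List

RISK_ORDER = ["low", "medium", "high"]

GATES = {
    "low": ["ecosystem fitness pass"],
    "medium": ["ecosystem fitness pass", "regression baseline snapshot"],
    "high": ["ecosystem fitness pass", "regression baseline snapshot",
             "pre-route self-audit"],
}


def validated_risk(risk_level: str, conflicts_with_override: bool) -> str:
    """Elevate to at least medium when an operator override conflicts."""
    level = risk_level.lower()
    if conflicts_with_override and RISK_ORDER.index(level) < RISK_ORDER.index("medium"):
        return "medium"
    return level


def gates_for(risk_level: str, recursion_depth: int) -> List[str]:
    """List the QA gates for a tier; self-audit only runs below depth 2."""
    gates = list(GATES[risk_level.lower()])
    if recursion_depth >= 2 and "pre-route self-audit" in gates:
        gates.remove("pre-route self-audit")
    return gates
```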

Update each meta-slice-{N}.json with results.


Phase 6: Audit Review

Input: All artifacts from Phases 1-5
Output: Operator decision: accept / deeper audit / dismiss

Present via AskUserQuestion:

  1. Executive summary -- one paragraph
  2. Score improvement -- "Baseline [X]% --> Final [Y]% over [N] experiments" (skip for effectiveness scope)
  3. Top 3 mutations that helped most (skip for effectiveness scope)
  4. Remaining failures with what/where/why (skip for effectiveness scope)
  5. Finding count by severity with trend direction
  6. Lesson effectiveness summary -- table with best/worst highlighted
  7. Meta-slices generated -- count, targets, risk levels (skip for effectiveness scope)
  8. Regression trend -- one line

Options:

  • Accept:

    1. Save report to docs/audits/YYYY-MM-DD-{scope_short}.md
    2. Write DB records:
       • audit_runs (audit_run_id, scope_type, scope_ref, created_at, detail_json with finding_counts, effectiveness_summary, recursion_depth, scores, meta_slice_count, regression_trend)
       • audit_findings (finding_id, audit_run_id, severity, description, evidence, slice_id)
       • lesson_effectiveness (lesson_id, audit_run_id, effectiveness_score, base_score, time_decay, opportunity_factor, confidence_weight)
    3. Auto-route valid meta-slices: LOW = standard notification, MEDIUM = elevated + baseline, HIGH = urgent + baseline + self-audit results
    4. Announce: "Audit accepted. Report saved. [N] meta-slices routed ([L] low, [M] medium, [H] high)."

    Standalone mode persistence (filesystem only, no DB writes):

    1. Save report to docs/audits/YYYY-MM-DD-{scope_short}.md (same as DB mode)
    2. Append new lessons to .auditor/lessons.json (create file with {"schema_version":1,"lessons":[]} if it doesn't exist)
    3. Meta-slices saved as meta-slice-{N}.json in working directory (same as DB mode)
    4. No DB writes -- all persistence is filesystem-based
  • Deeper audit: Re-enter Phase 2 with patience 5 focusing on remaining failures. Skip Phases 3-5 on second pass.

  • Dismiss: Discard all. No DB records, no routing.


7 Stable Audit Evaluations

EVALUATION 1: Coverage
Question: Does the audit address every completed slice in scope with zero gaps?
Pass: Every loop_id in scope has at least one finding or explicit "no issues found" notation
Fail: Any slice is not mentioned or analyzed

EVALUATION 2: Evidence-Grounded
Question: Does every finding cite specific verifier outputs, lesson IDs, or git refs?
Pass: Every finding contains at least one concrete reference (ID, SHA, or output excerpt)
Fail: Any finding relies on general statements without specific evidence

EVALUATION 3: Causal Accuracy
Question: Does the failure-to-lesson-to-action chain accurately reflect what happened?
Pass: Every causal chain can be verified against the corpus data
Fail: Any chain misattributes causation, reverses temporal ordering, or cites nonexistent records

EVALUATION 4: Regression Sensitivity
Question: Are degradation patterns identified that span multiple slices?
Pass: Multi-slice trends are analyzed and patterns (positive or negative) are called out
Fail: Each slice is analyzed in isolation without cross-slice trend comparison

EVALUATION 5: Lesson Effectiveness
Question: For each promoted lesson, is there evidence it prevented or failed to prevent recurrence?
Pass: Every lesson in scope has a measured effectiveness assessment with the graduated formula
Fail: Any lesson lacks a recurrence check or effectiveness judgment

EVALUATION 6: Actionability
Question: Does every recommendation have a concrete next step?
Pass: Every recommendation specifies either a meta-slice, operator override, or investigation path with enough detail to act on
Fail: Any recommendation is vague ("should improve") without a concrete action

EVALUATION 7: Self-Consistency
Question: Are the audit's recommendations free of contradictions with each other and with existing operator overrides?
Pass: No recommendation conflicts with another recommendation or a known operator decision
Fail: Any pair of recommendations is contradictory, or any recommendation overrides an explicit operator decision without flagging this

Graduated Lesson Effectiveness Formula

effectiveness_score = base_score * time_decay * opportunity_factor * confidence_weight

base_score (0.0 or 1.0): 1.0 if problem_key has NOT recurred after lesson promotion, 0.0 if it has.

time_decay (0.1 - 1.0): max(0.1, 1.0 - (days_since_lesson / 180)). Today = 1.0, 90 days = 0.5; the 0.1 floor applies from 162 days onward.

opportunity_factor (0.0 - 1.0): min(1.0, opportunities_for_recurrence / 3). Opportunities = post-promotion slices where the problem_key could have occurred (same repo, same verifier type). 0 opportunities = 0.0, 3+ = 1.0.

confidence_weight (0.0 - 1.0): The lesson's confidence field directly.

| Score | Label | Interpretation |
| --- | --- | --- |
| 0.8 - 1.0 | Highly effective | Working as intended |
| 0.5 - 0.79 | Moderately effective | May need refinement |
| 0.2 - 0.49 | Weakly effective | Not reliably preventing recurrence |
| 0.0 - 0.19 | Ineffective | Review for retirement or revision |
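A worked sketch of the graduated formula, written directly from the four factor definitions. The label thresholds use each band's lower bound, which is an assumption about how scores between bands (for example 0.795) should be handled. Function names are illustrative.

```python
def effectiveness_score(recurred: bool, days_since_lesson: float,
                        opportunities: int, confidence: float) -> float:
    """base_score * time_decay * opportunity_factor * confidence_weight."""
    base_score = 0.0 if recurred else 1.0
    time_decay = max(0.1, 1.0 - days_since_lesson / 180)
    opportunity_factor = min(1.0, opportunities / 3)
    return base_score * time_decay * opportunity_factor * confidence


def effectiveness_label(score: float) -> str:
    """Map a score to its label using each band's lower bound (assumption)."""
    if score >= 0.8:
        return "Highly effective"
    if score >= 0.5:
        return "Moderately effective"
    if score >= 0.2:
        return "Weakly effective"
    return "Ineffective"
```

For a lesson promoted 45 days ago with no recurrence, 3 opportunities, and confidence 0.9, the factors are 1.0, 0.75, 1.0, and 0.9, giving roughly 0.675 ("Moderately effective"). Because the factors multiply, a single recurrence or zero opportunities drives the score to 0.0 regardless of the other terms.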

Meta-Slice Risk Tiers and Routing

All tiers auto-route with notification. The auditor + self-heal loop ARE the safety net -- no human gates.

| Risk | Notification | QA Gate |
| --- | --- | --- |
| LOW | Standard | Ecosystem fitness pass |
| MEDIUM | Elevated | Ecosystem fitness pass + regression baseline snapshot |
| HIGH | Urgent | Ecosystem fitness pass + regression baseline + pre-route self-audit |

Recursion Rules

  • Max depth: 2 (audit --> audit-of-audit --> stop)
  • Scope "self" or "pipeline" triggers recursive self-audit at depth+1
  • Depth tracked in audit_runs.detail_json as "recursion_depth": N
  • At depth 2: skip Phase 4 and Phase 5 -- report findings only
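The phase-skipping rules can be summarized in a small planner. This is a sketch only: it plans phases up front, so it cannot model Phase 5 being skipped when Phase 4 produces no meta-slices. The function name planned_phases is hypothetical.

```python
from typing import List

MAX_RECURSION_DEPTH = 2


def planned_phases(scope_type: str, recursion_depth: int = 0) -> List[int]:
    """Return the phase numbers to run for a scope at a given depth."""
    if recursion_depth > MAX_RECURSION_DEPTH:
        raise ValueError("self-audit recursion depth is capped at 2")
    if scope_type == "EFFECTIVENESS":
        return [1, 6]  # intake, then straight to review
    if recursion_depth == MAX_RECURSION_DEPTH:
        return [1, 2, 3, 6]  # findings only: no Phase 4 or Phase 5
    return [1, 2, 3, 4, 5, 6]
```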

Subagent Information Isolation

| Role | Receives | Does NOT receive |
| --- | --- | --- |
| Evaluator | Audit report + evaluation criteria only | Mutation info, changelog, scores, reasoning, experiment history |
| Brainstormer | Audit report + failing evaluations (what/where/why) + CHANGELOG | Critic output, evaluator reasoning |
| Critic | Audit report ONLY | Evaluation criteria, scores, failures, changelog, mutation context |

Evaluator Response Format

Per evaluation:

evaluation_name: [name]
status: true|false
what: [one sentence] (required if false, omit if true)
where: [section or element] (required if false, omit if true)
why: [reason] (required if false, omit if true)
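One way to hold evaluator output and derive the score shown in the changelog (passes out of total, plus a percentage) is sketched below. The dataclass and helper names are assumptions, and rounding to a whole percent is an illustrative choice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EvaluationResult:
    evaluation_name: str
    status: bool
    what: str = ""
    where: str = ""
    why: str = ""


def missing_diagnostics(result: EvaluationResult) -> List[str]:
    """what/where/why are required only when status is false."""
    if result.status:
        return []
    return [name for name in ("what", "where", "why")
            if not getattr(result, name)]


def score(results: List[EvaluationResult]) -> Tuple[int, int, int]:
    """Return (passed, total, percent rounded to a whole number)."""
    passed = sum(1 for r in results if r.status)
    total = len(results)
    return passed, total, round(100 * passed / total)
```

Rejecting a failing result with empty diagnostics keeps the brainstormer from receiving a bare "false" with nothing to act on.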

CHANGELOG Entry Format

## Experiment [N] -- [keep/discard] (Phase: Forge)
**Score:** [X]/7 ([percent]%)
**Mutation:** [What was changed]
**Source:** [Brainstormer proposal #N / Critic finding / Both]
**Reasoning:** [Why this was expected to help]
**Result:** [Which evaluations improved/declined/unchanged]
**Failures:**
- [evaluation_name]: what=[what] where=[where] why=[why]

Working Directory Naming

auditor-{scope_short}-YYYY-MM-DD-HH-MM/

| Scope Type | scope_short | Example |
| --- | --- | --- |
| LOOP | loop_id first 8 chars | auditor-a1b2c3d4-2026-03-30-14-22/ |
| REPO | repo_id | auditor-my-repo-2026-03-30-14-22/ |
| PIPELINE | "pipeline" | auditor-pipeline-2026-03-30-14-22/ |
| EFFECTIVENESS | repo_id + "-eff" | auditor-my-repo-eff-2026-03-30-14-22/ |
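The naming rule can be sketched as below, assuming scope_ref is the bare loop_id or repo_id (without a "loop-" prefix). The function names are illustrative.

```python
from datetime import datetime


def scope_short(scope_type: str, scope_ref: str) -> str:
    """Derive the short scope name used in the working directory."""
    if scope_type == "LOOP":
        return scope_ref[:8]  # first 8 chars of the loop_id
    if scope_type == "REPO":
        return scope_ref
    if scope_type == "PIPELINE":
        return "pipeline"
    if scope_type == "EFFECTIVENESS":
        return scope_ref + "-eff"
    raise ValueError(f"unknown scope type: {scope_type}")


def working_dir(scope_type: str, scope_ref: str, now: datetime) -> str:
    """Build auditor-{scope_short}-YYYY-MM-DD-HH-MM/."""
    return f"auditor-{scope_short(scope_type, scope_ref)}-{now:%Y-%m-%d-%H-%M}/"
```

For example, `working_dir("LOOP", "a1b2c3d4e5f6", datetime(2026, 3, 30, 14, 22))` yields `auditor-a1b2c3d4-2026-03-30-14-22/`, matching the first row of the table.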

Output Files

auditor-{scope_short}-YYYY-MM-DD-HH-MM/
+-- dashboard.html              # Live dashboard (auto-refreshes)
+-- results.tsv                 # Score log (one row per evaluation per experiment)
+-- CHANGELOG.md                # Detailed mutation log
+-- audit-report-baseline.md    # Initial report (never modified)
+-- audit-report-optimized.md   # Working copy during Phase 2
+-- meta-slice-1.json           # Meta-slice contracts (if generated)
+-- meta-slice-2.json
+-- ...

Final accepted report is saved to docs/audits/YYYY-MM-DD-{scope_short}.md.