Use when documentation has been created or updated and the user has asked to review - MD files, beads issues, specs, or task lists
Install
npx skillscat add xexr/marketplace/review-documentation Install via the SkillsCat registry.
Review Documentation
Multi-LLM documentation review for catching inconsistencies, codebase mismatches, and gaps.
When to Use
- After an LLM generates or updates documentation
- Before implementing a spec or task list
- When reviewing PRDs, plans, or beads issue epics
- When documentation references code that may have changed
Workflow
flowchart TD
start([User invokes skill])
ask_llms[Step 1a: LLMs + Scope description]
show_cats[Step 1b: Show categories + Review All vs Customize]
fast_track_cats{Review All?}
ask_cats[Step 1c: 3 multiselect questions - 12 category options]
preflight_gpt{GPT selected?}
check_gpt[Check Codex CLI available + logged in]
fail_gpt([Abort: tell user to fix Codex setup])
preflight_gemini{Gemini selected?}
check_gemini[Check Gemini CLI available + logged in]
fail_gemini([Abort: tell user to fix Gemini setup])
haiku[Step 2: Haiku gathers paths]
build_brief[Build dynamic DO CHECK / DO NOT CHECK brief]
dispatch[Step 3: Dispatch reviewers IN PARALLEL]
opus[Opus 4.6 sub-agent]
gpt[Codex CLI GPT sub-agent]
gemini[Gemini CLI sub-agent]
collect[Collect results via TaskOutput]
synthesize[Step 4: Compare, deduplicate, synthesize]
findings([Present findings])
gate[Step 5.1: Gate - wait for 'go']
show_concerns[Step 5.2: Summarize + Address All vs Customize]
fast_track_concerns{Address All?}
select_concerns[Step 5.2b: Select specific findings]
resolve_ambig[Step 5.3: Resolve ambiguities one-by-one]
select_actions[Step 5.4: Select actions]
execute[Step 5.5: Execute actions]
summary([Final summary of actions taken])
start --> ask_llms
ask_llms --> show_cats
show_cats --> fast_track_cats
fast_track_cats -->|yes| preflight_checks
fast_track_cats -->|no| ask_cats
ask_cats --> preflight_checks
preflight_checks[Pre-flight checks IN PARALLEL]
preflight_checks -->|if GPT selected| check_gpt
preflight_checks -->|if Gemini selected| check_gemini
preflight_checks -->|neither selected| haiku
check_gpt -->|ok| haiku
check_gpt -->|fail| fail_gpt
check_gemini -->|ok| haiku
check_gemini -->|fail| fail_gemini
haiku --> build_brief
build_brief --> dispatch
dispatch -->|if selected| opus
dispatch -->|if selected| gpt
dispatch -->|if selected| gemini
opus --> collect
gpt --> collect
gemini --> collect
collect --> synthesize
synthesize --> findings
findings --> gate
gate --> show_concerns
show_concerns --> fast_track_concerns
fast_track_concerns -->|yes| resolve_ambig
fast_track_concerns -->|no| select_concerns
select_concerns --> resolve_ambig
resolve_ambig --> select_actions
select_actions --> execute
execute --> summaryStep 0: Pre-flight Checks (if GPT or Gemini selected)
Run pre-flight checks IN PARALLEL for all selected external CLI agents.
GPT Pre-flight (if GPT selected)
# Check Codex CLI is installed
command -v codex >/dev/null 2>&1 || { echo "Codex CLI not installed"; exit 1; }
# Test the actual model - this validates auth AND model availability
codex exec -m "gpt-5.3-codex" -c reasoning_effort="high" --sandbox workspace-write "Respond with only: READY" 2>&1Check the output for:
- ✓ "READY" in response = model works, proceed
- ✗ "not supported" or auth errors = model unavailable
If check fails, STOP and tell the user:
"Codex CLI check failed. Either the CLI is not installed, you're not logged in (
codex login), or the gpt-5.3-codex model is not available with your account. Select Opus 4.6 only for this review, or fix your Codex setup first."
Gemini Pre-flight (if Gemini selected)
# Check Gemini CLI is installed
command -v gemini >/dev/null 2>&1 || { echo "Gemini CLI not installed"; exit 1; }
# Test the actual model
gemini "Respond with only: READY" --model gemini-3-pro-preview -y 2>&1Check the output for:
- ✓ "READY" in response = model works, proceed
- ✗ Errors about authentication or model unavailable
If check fails, STOP and tell the user:
"Gemini CLI check failed. Select Opus 4.6 only, or fix your Gemini setup first."
Do NOT dispatch any sub-agents until all pre-flight checks pass.
Step 1: Get Scope from User (FAST)
CRITICAL: Do NOT explore the codebase yourself. Just collect scope from user.
Step 1a: Get Scope Description
Ask the user for scope (free text):
"What documentation should I review? (e.g., "Phase 5 beads documentation", "specs/auth.md", "all tasks under epic cgt-22")"
Step 1b: Configuration (Single Batched AskUserQuestion)
Batch all configuration into ONE AskUserQuestion call with 3 questions. Users can tab through efficiently:
questions: [
{
question: "Which models should perform this review?",
header: "Models",
multiSelect: true,
options: [
{ label: "Opus 4.6 (Recommended)", description: "Claude Opus 4.6 - strong reasoning, nuanced analysis" },
{ label: "GPT 5.3 Codex", description: "OpenAI's latest via Codex CLI - different perspective" },
{ label: "Gemini 3 Pro", description: "Google's latest via Gemini CLI - third perspective" }
]
},
{
question: "Review mode?",
header: "Categories",
multiSelect: false,
options: [
{ label: "Review All (Recommended)", description: "Check all categories: accuracy, design, robustness" },
{ label: "Customize", description: "Select specific categories to review" }
]
},
{
question: "Run CLI pre-flight checks for GPT/Gemini?",
header: "Pre-flight",
multiSelect: false,
options: [
{ label: "Skip (Recommended)", description: "Assume CLIs work - faster startup" },
{ label: "Run checks", description: "Verify Codex/Gemini CLIs are working before dispatch" }
]
}
]This reduces LLM round-trips from multiple separate questions to 1 batched interaction.
Handling Configuration Results
Single Model Warning:
If user selects only 1 model, show a follow-up warning:
question: "Only 1 model selected. Multi-model comparison provides better coverage. Proceed?"
header: "Single Model"
multiSelect: false
options:
- label: "Add another model (Recommended)"
description: "Go back and select additional models for comparison"
- label: "Continue with 1 model"
description: "Get review from single model (no comparison)"Review Mode Handling:
"Review All": Skip to Step 2 with defaults:
- Codebase Match ✓
- Cross-Document Consistency ✓
- API & Interface Assumptions ✓
- Security Concerns ✓
- Design Quality ✓
- TDD Alignment ✓
- Project Standards ✓
"Customize": Show the 3 multiselect questions below.
Pre-flight Handling:
- "Skip": Proceed directly to dispatch - trust that CLIs work
- "Run checks": Execute pre-flight checks (Step 0) before dispatch
Step 1c: Review Categories (if customizing - 3 multiselect questions)
Ask the user which review categories to include using THREE questions for granular control (12 toggleable options):
Question 1: Accuracy & Correctness
question: "Which accuracy checks should agents perform?"
header: "Accuracy"
multiSelect: true
options:
- label: "Codebase Match (Recommended)"
description: "Does documentation match actual code? File paths correct?"
- label: "Cross-Document Consistency (Recommended)"
description: "Are different documents consistent with each other?"
- label: "API & Interface Assumptions (Recommended)"
description: "Are assumptions about APIs, tools, external services correct?"
- label: "Security Concerns (Recommended)"
description: "Missing auth, exposed secrets, injection risks, OWASP issues"Question 2: Design & Standards
question: "Which design/standards checks should agents perform?"
header: "Design"
multiSelect: true
options:
- label: "Design Quality (Recommended)"
description: "Suitability, YAGNI, DRY - is the design appropriate?"
- label: "TDD Alignment (Recommended)"
description: "Does the plan account for testing? Test-first or afterthought?"
- label: "Project Standards (Recommended)"
description: "Alignment with CLAUDE.md, agents.md, documented conventions"
- label: "Architectural Consistency"
description: "Does the approach fit existing architecture and patterns?"Question 3: Robustness & Validation
question: "Which robustness/validation checks should agents perform?"
header: "Robustness"
multiSelect: true
options:
- label: "Error Handling & Edge Cases"
description: "Failure scenarios, API failures, missing data, invalid inputs"
- label: "Performance"
description: "N+1 queries, unbounded loops, missing pagination, algorithms"
- label: "Data & Schema Validity"
description: "Schema match, column types, foreign keys, constraints"
- label: "Task Dependencies & Completeness"
description: "Beads dependencies correct? Parallelizable? Critical path? Task status accuracy?"Default behavior: Items marked "(Recommended)" are typical defaults - first 7 options. User can toggle any combination.
Time budget for Step 1: Under 45 seconds total. You're collecting descriptions, not files.
Step 2: Haiku Gathers Paths (FAST)
Dispatch a Haiku subagent to discover relevant paths from the scope description.
This is the ONLY exploration step - and it's delegated to a fast, cheap agent.
Epic Disambiguation
CRITICAL: If scope description matches multiple epics (e.g., multiple "Phase 6" epics), Haiku MUST:
- List ALL matching epics with their titles and IDs
- Return structured output with a
DISAMBIGUATION_NEEDEDflag if multiple matches - Include epic titles to help distinguish (e.g., "Phase 6 Artifacts & Downloads" vs "Phase 6 Asset Equivalences")
If disambiguation is needed, present candidates to user:
question: "Multiple matching epics found. Which one should I review?"
header: "Epic"
multiSelect: false
options:
[Dynamically populate with epic ID + title for each match]Then re-run Haiku with the specific epic ID.
Task(
subagent_type="Explore",
model="haiku",
prompt="Find all relevant documentation paths for: [USER'S SCOPE DESCRIPTION]
Target codebase: [PATH]
Return ONLY a structured list of paths. Do NOT read file contents.
Look in:
- .beads/ directory for matching issue IDs (use bd list to find relevant IDs)
- specs/ or docs/ directories for matching markdown files
- CLAUDE.md, README.md, or similar project docs
- Any paths explicitly mentioned in the scope description
Output format:
BEADS_IDS: cgt-22, cgt-11, cgt-16, ...
FILE_PATHS: /path/to/spec.md, /path/to/design.md, ...
Be thorough but fast. Paths only, no content."
)Wait for Haiku to return (typically 15-30 seconds), then proceed to Step 2b.
Step 2b: Pre-Read Beads Issues
Always pre-read beads issues to include in both Opus and GPT prompts.
This speeds up the review by providing spec context upfront, reducing tool calls during review.
# For each issue ID from Haiku:
bd show cgt-XX
bd show cgt-YY
# ... etcStore the output and include it verbatim in both sub-agent prompts under ## Beads Issue Contents (Pre-Read).
Note: GPT can still explore additional issues (dependencies, linked issues) using bd commands if needed during review.
Step 3: Dispatch Reviewers (PARALLEL)
CRITICAL: All agents MUST use run_in_background: true for true parallelism.
Special Note on Gemini: Gemini requires a TWO-STEP approach to avoid JSON serialization issues:
- First, write the prompt to a temp file (quick sync Bash call)
- Then, call gemini reading from the temp file (background Bash call)
See "Dispatching Gemini Sub-Agent" section below for details.
# Dispatch sequence:
# Step 1: Write Gemini prompt to temp file (sync, fast)
Bash(command="cat > /tmp/gemini_prompt.txt <<'EOF' ... EOF")
# Step 2: In ONE message, dispatch ALL reviewers with run_in_background: true:
Task(
subagent_type="general-purpose",
model="opus",
run_in_background=true, # <-- REQUIRED
prompt="..."
)
Bash(
command="codex exec ...",
run_in_background=true, # <-- REQUIRED
timeout=900000
)
Bash(
command="gemini \"$(cat /tmp/gemini_prompt.txt)\" ...",
run_in_background=true, # <-- REQUIRED
timeout=900000
)If you forget run_in_background: true on any call, they will NOT run in parallel.
Sub-Agent Brief (DYNAMIC - build based on user selections)
Build the brief dynamically based on which categories the user selected from the THREE questions (12 possible categories). Include ONLY the selected categories in the "DO CHECK" section, and explicitly list skipped categories in the "DO NOT CHECK" section.
Template:
You are reviewing documentation for accuracy, completeness, and quality.
## IMPORTANT: Review Only - No Changes
**DO NOT modify any files during this review phase.**
This is a READ-ONLY review for context gathering and gap identification. Your role is to:
- Read and analyze documentation
- Compare documentation against codebase
- Identify gaps and issues
- Report findings
Any actual fixes will be made in Step 5 after user reviews and approves findings.
Do NOT use Edit, Write, or any file modification tools.
## Verification Requirement
**CRITICAL: Do NOT make assumptions about how the codebase works. VERIFY by reading code.**
Before stating how something works (e.g., "the CLI is workspace-agnostic"):
1. Read the actual code that implements it
2. Quote specific lines/functions that prove your assertion
3. If you cannot verify by reading code, mark as UNVERIFIED and flag for human review
Common assumption mistakes to avoid:
- Assuming a component hasn't changed since last review
- Inferring behavior from naming conventions without reading implementation
- Stating capabilities without checking the actual code
## DO CHECK - Review these categories:
[Include ONLY the categories the user selected from Step 1b questions]
### Codebase Match (if selected)
- Does the documentation match what's in the codebase?
- Are there gaps or ambiguities between what's documented and what exists?
- Are all file paths mentioned correct and existing?
### Cross-Document Consistency (if selected)
- Are different documents consistent with each other?
- Do we describe something one way in the spec but differently elsewhere?
### API & Interface Assumptions (if selected)
- Are assumptions about APIs, tool interfaces, or external services correct?
- If unsure about library/API usage, use Context7 to verify. Not required for well-known patterns.
### Security Concerns (if selected)
- Are there security issues in the planned approach?
- Missing auth, exposed secrets, injection risks, OWASP top 10 issues?
### Design Quality (if selected)
- Is the design appropriate? Over-engineered or under-engineered? Simpler alternatives?
- YAGNI - unnecessary features, abstractions, or complexity?
- DRY - duplication or missed reuse opportunities?
### TDD Alignment (if selected)
- Does the plan account for testing?
- Is there a test-first approach or are tests an afterthought?
- Are test scenarios defined for edge cases?
### Project Standards (if selected)
- Check CLAUDE.md, agents.md, constitution.md, or similar project config files
- Does the plan align with documented principles and conventions?
### Architectural Consistency (if selected)
- Does the approach fit existing architecture and patterns?
- Does it follow established patterns or introduce inconsistent new patterns?
### Error Handling & Edge Cases (if selected)
- Does the plan account for failure scenarios?
- What happens when APIs fail, data is missing, inputs are invalid?
- Are error states defined?
### Performance (if selected)
- N+1 queries, unbounded loops, missing pagination?
- Large payloads, inefficient algorithms, missing indexes?
- Backwards compatibility - breaking changes, migration paths needed?
### Data & Schema Validity (if selected)
- Do proposed structures match existing schemas?
- Column types, foreign keys, constraints correct?
- Do migrations preserve data integrity?
### Task Dependencies & Completeness (if selected)
- Are beads/task dependencies correct and accurate?
- Can tasks marked as parallelizable actually run in parallel?
- Is the critical path correctly identified?
- Are task statuses accurate (closed tasks actually complete)?
- Are blockers correctly marked?
## DO NOT CHECK - Skip these categories:
[List any categories the user did NOT select]
- [Category name]: User explicitly excluded this from review scope. Do not report issues in this area.
## Output Format
Return a RISK-WEIGHTED bullet list. Order by severity (CRITICAL > HIGH > MEDIUM > LOW).
For each issue:
- **[SEVERITY] Issue Title**
- What: Description of the problem
- Where: File/document/task affected
- Evidence: What you found that indicates this issue
- Recommendation: Specific fix
End with a brief summary of total issues by severity.Example: User selected only "Codebase Match", "Security Concerns", "Design Quality", and "Task Dependencies & Completeness":
## DO CHECK - Review these categories:
### Codebase Match
- Does the documentation match what's in the codebase?
- Are there gaps or ambiguities?
- Are all file paths correct?
### Security Concerns
- Missing auth, exposed secrets, injection risks?
### Design Quality
- Is the design appropriate? YAGNI, DRY?
### Task Dependencies & Completeness
- Are beads dependencies correct?
- Can parallel tasks actually parallelize?
- Are task statuses accurate?
## DO NOT CHECK - Skip these categories:
- **Cross-Document Consistency**: User excluded. Do not report.
- **API & Interface Assumptions**: User excluded. Do not report.
- **TDD Alignment**: User excluded. Do not report.
- **Project Standards**: User excluded. Do not report.
- **Architectural Consistency**: User excluded. Do not report.
- **Error Handling & Edge Cases**: User excluded. Do not report.
- **Performance**: User excluded. Do not report.
- **Data & Schema Validity**: User excluded. Do not report.Dispatching Opus Sub-Agent
Task(
subagent_type="general-purpose",
model="opus",
run_in_background=true,
prompt="[Sub-agent brief above]
Documentation to review:
- Beads issues: [BEADS_IDS from Haiku]
- Files: [FILE_PATHS from Haiku]
## Beads Issue Contents (Pre-Read)
[PASTE VERBATIM OUTPUT FROM `bd show` FOR EACH ISSUE HERE]
Target codebase: /path/to/project
## How to explore additional beads issues
The main beads issues are provided above. If you need to explore dependencies or linked issues:
- bd show <id> - View issue details
- bd dep show <id> - View issue dependencies
Use Read tool for markdown files.
Explore the codebase as needed to validate documentation accuracy.
REMINDER: This is a READ-ONLY review phase. Do NOT modify any files. Changes are made in Step 5 after user approval."
)Dispatching GPT Sub-Agent via Codex CLI
CRITICAL: The Bash call MUST have run_in_background: true
# Bash tool call parameters:
# command: "codex exec ..."
# run_in_background: true <-- REQUIRED FOR PARALLEL EXECUTION
# timeout: 900000 <-- 15 minutes (GPT reviews can take time)
codex exec -m "gpt-5.3-codex" -c reasoning_effort="high" --sandbox workspace-write "$(cat <<'PROMPT'
[Sub-agent brief above]
Documentation to review:
- Beads issues: [BEADS_IDS from Haiku]
- Files: [FILE_PATHS from Haiku]
## Beads Issue Contents (Pre-Read)
[PASTE VERBATIM OUTPUT FROM `bd show` FOR EACH ISSUE HERE]
Target codebase: [current working directory]
## How to explore additional beads issues
The main beads issues are provided above. If you need to explore dependencies or linked issues:
- bd show <id> - View issue details
- bd dep show <id> - View issue dependencies
- bd list --status=open - List open issues
## Shell Command Best Practices
**CRITICAL: Path quoting**
Always wrap file paths containing special characters in single quotes:
- Parentheses: `(dashboard)`
- Brackets: `[slug]`
- Spaces
Example:
```bash
# WRONG - will fail
sed -n '1,100p' apps/web/app/(dashboard)/workspace/[slug]/page.tsx
# CORRECT
sed -n '1,100p' 'apps/web/app/(dashboard)/workspace/[slug]/page.tsx'Avoid login shell flag
When running shell commands, prefer non-login shell:
/bin/zsh -c 'command' # Preferred - non-login shellAvoid login shell (-l flag) as it may trigger profile errors in sandbox environments.
Time Management
This review may take 10-15 minutes. Take your time to:
- Read all relevant files thoroughly
- Cross-reference documentation against codebase
- Build a complete findings list
Don't rush - thoroughness is more important than speed.
Review process
- Read the beads issue contents provided above
- If needed, explore dependencies or linked issues using bd commands
- Read any markdown files directly
- Explore the codebase to validate documentation accuracy
- Return findings in the specified output format
REMINDER: This is a READ-ONLY review phase. Do NOT modify any files. Changes are made in Step 5 after user approval.
PROMPT
)"
**Notes on Codex CLI:**
- Model: `gpt-5.3-codex` with high reasoning effort
- `--sandbox workspace-write` - can read/write files in workspace, run shell commands (including `bd` which needs SQLite WAL write access)
- **Timeout:** Set Bash timeout to 15 minutes (`timeout: 900000` ms) - GPT reviews can take time
- **Background:** Use `run_in_background: true` on the Bash call - THIS IS CRITICAL
- Beads issues are pre-read in Step 2b and passed in prompt, but GPT can explore additional issues with `bd` if needed
### Dispatching Gemini Sub-Agent via Gemini CLI
**CRITICAL: Use the TWO-STEP temp file approach to avoid JSON serialization issues with complex prompts.**
**WHY:** When the executing LLM constructs a Bash tool call containing a long HEREDOC with markdown, code blocks, and special characters, the JSON serialization can break (resulting in "Invalid tool parameters"). Using a temp file separates the complex prompt content from the shell command.
**IMPORTANT: Gemini runs without sandbox (docker/podman not required). The prompt explicitly instructs Gemini NOT to modify files during review phase.**
**Step 1: Write prompt to temp file (first Bash call)**
```bash
# Bash tool call parameters:
# command: "cat > /tmp/gemini_review_prompt.txt <<'GEMINI_PROMPT' ... "
# run_in_background: false <-- This is fast, no need for background
# timeout: 30000 <-- 30 seconds is plenty
cat > /tmp/gemini_review_prompt.txt <<'GEMINI_PROMPT'
[Sub-agent brief above]
Documentation to review:
- Beads issues: [BEADS_IDS from Haiku]
- Files: [FILE_PATHS from Haiku]
## Beads Issue Contents (Pre-Read)
[PASTE VERBATIM OUTPUT FROM `bd show` FOR EACH ISSUE HERE]
Target codebase: [current working directory]
## How to explore additional beads issues
The main beads issues are provided above. If you need to explore dependencies or linked issues:
- bd show <id> - View issue details
- bd dep show <id> - View issue dependencies
- bd list --status=open - List open issues
## Shell Command Best Practices
**CRITICAL: Path quoting**
Always wrap file paths containing special characters in single quotes:
- Parentheses: `(dashboard)`
- Brackets: `[slug]`
- Spaces
Example:
```bash
# WRONG - will fail
cat apps/web/app/(dashboard)/workspace/[slug]/page.tsx
# CORRECT
cat 'apps/web/app/(dashboard)/workspace/[slug]/page.tsx'Time Management
This review may take 10-15 minutes. Take your time to:
- Read all relevant files thoroughly
- Cross-reference documentation against codebase
- Build a complete findings list
Don't rush - thoroughness is more important than speed.
Review process
- Read the beads issue contents provided above
- If needed, explore dependencies or linked issues using bd commands
- Read any markdown files directly
- Explore the codebase to validate documentation accuracy
- Return findings in the specified output format
REMINDER: This is a READ-ONLY review phase. Do NOT modify any files. Changes are made in Step 5 after user approval.
GEMINI_PROMPT
**Step 2: Call Gemini reading from temp file (second Bash call, IN PARALLEL with Opus/GPT)**
```bash
# Bash tool call parameters:
# command: "gemini \"$(cat /tmp/gemini_review_prompt.txt)\" ..."
# run_in_background: true <-- REQUIRED FOR PARALLEL EXECUTION
# timeout: 900000 <-- 15 minutes
gemini "$(cat /tmp/gemini_review_prompt.txt)" --model gemini-3-pro-preview -y && rm /tmp/gemini_review_prompt.txtNotes on Gemini CLI:
- Model:
gemini-3-pro-preview - Runs without sandbox (prompt prohibits file modifications)
- Timeout: Set Bash timeout to 15 minutes (
timeout: 900000ms) - Background: Use
run_in_background: trueon the Bash call - THIS IS CRITICAL - Beads issues are pre-read in Step 2b and passed in prompt
- The
&& rmcleanup runs after Gemini completes
Collecting Results
After dispatching all selected agents in background, use TaskOutput to collect results.
CRITICAL: Call ALL TaskOutput calls IN PARALLEL in ONE message.
If you call them sequentially, the main thread blocks on each one, wasting time. Call them all in a single message:
# In ONE message, call all TaskOutput tools in parallel:
TaskOutput(task_id="<opus_task_id>", block=true, timeout=900000)
TaskOutput(task_id="<codex_task_id>", block=true, timeout=900000)
TaskOutput(task_id="<gemini_task_id>", block=true, timeout=900000)Timeouts: Use 900000ms (15 minutes) for all agents. Reviews take time - don't give up early.
This keeps the main conversation context clean while agents work.
Do NOT timeout prematurely. When waiting for results:
- Use
timeout: 900000(15 minutes) for all agents - If one agent times out but others succeed, note this in the output and proceed
- Do NOT poll repeatedly - call TaskOutput ONCE with blocking mode and let it wait
If you get a timeout and suspect the agent is still running:
- Wait an additional 2-3 minutes
- Try TaskOutput again with
block: falseto check status - Only proceed to synthesis when you have actual results (or confirmed failure)
Handling Failures
Known Sandbox Errors (Ignorable):
These stderr messages can be safely ignored - they don't affect command execution:
/opt/homebrew/Library/Homebrew/help.sh: cannot create temp file- Homebrew shell integration failing in sandbox- Similar brew/nvm/pyenv/rbenv profile errors
- Any "Operation not permitted" errors from shell profile scripts
Only treat as actual failures:
- Command exit codes ≠ 0
- Missing expected output
- Actual tool/permission errors (not profile initialization errors)
If GPT fails (rate limits, API errors, etc.):
- NOTIFY the user explicitly - don't silently proceed with fewer agents
- Present options:
question: "GPT review failed due to [error]. How should we proceed?" header: "Fallback" multiSelect: false options: - label: "Continue with remaining agents" description: "Proceed with available agents (reduced cross-validation)" - label: "Retry GPT" description: "Attempt the GPT review again" - label: "Abort review" description: "Stop and investigate the issue" - Record the failure in the final output
If Gemini fails (rate limits, API errors, etc.):
- NOTIFY the user explicitly - don't silently proceed with fewer agents
- Present options:
question: "Gemini review failed due to [error]. How should we proceed?" header: "Fallback" multiSelect: false options: - label: "Continue with remaining agents" description: "Proceed with available agents (reduced cross-validation)" - label: "Retry Gemini" description: "Attempt the Gemini review again" - label: "Abort review" description: "Stop and investigate the issue" - Record the failure in the final output
Never silently fall back to fewer agents - the user should know if they're missing cross-validation.
Transition to Synthesis
Once all TaskOutput calls return (or timeout/fail):
- Do NOT do additional exploration - use only what agents returned
- Do NOT supplement with your own reads - synthesis uses agent outputs only
- Proceed IMMEDIATELY to Step 4 - don't add extra investigation
- If results are incomplete, note gaps in the synthesis rather than trying to fill them
Step 4: Synthesize Results
After collecting results from sub-agents, follow the appropriate path:
flowchart TD
check{How many agents ran?}
single[Single-Agent Mode]
multi[Multi-Agent Mode]
check -->|one| single
check -->|two or more| multiSingle-Agent Mode (one LLM only)
If only one agent ran (user choice or other agents unavailable):
- Present findings directly - no comparison needed
- Skip the comparison table - not applicable
- Add your own analysis - review the agent's findings and note any you disagree with or want to highlight
- Note the limitation - mention that cross-validation wasn't performed
Multi-Agent Mode (multiple LLMs)
4.1 Deduplicate Issues
Create a merged list, removing duplicates where multiple agents identified the same issue.
4.2 Create Comparison Table
| # | Issue | Opus | GPT | Gemini | Agree? | Recommendation |
|---|---|---|---|---|---|---|
| 1 | [Issue title] | [Opus finding] | [GPT finding] | [Gemini finding] | Yes/No/Partial | [Your synthesis] |
(Include only columns for agents that were selected/succeeded)
Table Formatting Guidelines:
- Keep cell content concise - under 30 characters per cell where possible
- Use severity as shorthand - e.g., "HIGH - missing validation" not a full paragraph
- If content is too long, summarize in table and reference "See issue #X above" for full details
- Test table renders correctly before presenting to user
- Use consistent column widths - don't let one cell dominate
For each row:
- Agree? - Do all agents agree on the issue and severity?
- Recommendation - Your synthesized recommendation considering all perspectives
4.3 Reasoning Section
For each issue in the table, provide:
- Why you agree or disagree with each agent's assessment
- If agents disagree, which assessment is more accurate and why
- The final recommended action
4.4 Remaining Ambiguities (both modes)
List any:
- Unresolved questions that need human clarification
- Areas where neither agent had sufficient context
- Items requiring further investigation
- Dependencies on external information not available
Output Format
Multi-Agent Output
# Documentation Review: [Document/Epic Name]
## Review Configuration
- **LLMs Used:** [List agents used, e.g., "Opus + GPT + Gemini" / "Opus only" / "Opus + Gemini"]
- **Documents Reviewed:** [List]
- **Codebase:** [Path or N/A]
- **Categories Checked:** [List selected categories]
- **Categories Skipped:** [List unselected categories, or "None"]
## Synthesized Issues (Risk-Weighted)
### CRITICAL
- [Deduplicated critical issues with recommendations]
### HIGH
- [Deduplicated high issues]
### MEDIUM
- [Deduplicated medium issues]
### LOW
- [Deduplicated low issues]
## Agent Comparison
| # | Issue | Opus | GPT | Gemini | Agree? | Recommendation |
| --- | ----- | ---- | --- | ------ | ------ | -------------- |
| ... |
(Include only columns for agents that were selected/succeeded)
## Reasoning
[For each issue, explain agreement/disagreement and final position]
## Remaining Ambiguities
- [List unresolved questions]
- [Areas needing human input]
- [Items for further investigation]
## Summary
- **Total Issues:** X (Y critical, Z high, ...)
- **Agent Agreement Rate:** X%
- **Next Steps:** [Prioritized action items]Single-Agent Output
# Documentation Review: [Document/Epic Name]
## Review Configuration
- **LLMs Used:** [Opus 4.6 / GPT 5.3 Codex / Gemini 3 Pro] (single agent)
- **Documents Reviewed:** [List]
- **Codebase:** [Path or N/A]
- **Note:** Cross-validation not performed (single agent review)
## Issues (Risk-Weighted)
### CRITICAL
- [Critical issues with recommendations]
### HIGH
- [High issues]
### MEDIUM
- [Medium issues]
### LOW
- [Low issues]
## Additional Analysis
[Your own observations on the agent's findings - agreements, disagreements, highlights]
## Remaining Ambiguities
- [List unresolved questions]
- [Areas needing human input]
- [Items for further investigation]
## Summary
- **Total Issues:** X (Y critical, Z high, ...)Step 5: Next Steps (Interactive)
After presenting findings, guide the user through actioning them.
5.1 Gate: Let User Digest Findings
After presenting the synthesized findings, pause:
"Review complete. Say 'go' when you're ready to discuss next steps."
Wait for user to respond. Do not proceed until they acknowledge.
5.2 Select Concerns to Address (fast-track or customize)
First, summarize the findings:
"Found X issues: Y critical, Z high, W medium, V low."
Then ask (single choice):
question: "How would you like to proceed?"
header: "Findings"
multiSelect: false
options:
- label: "Address All Issues (Recommended)"
description: "Work through all identified issues"
- label: "Customize"
description: "Select specific issues to address"If "Address All": Proceed to Step 5.3 with all findings selected.
If "Customize": Show the multiselect of individual findings:
question: "Which findings do you want to address?"
header: "Select Issues"
multiSelect: true
options:
[Dynamically populate from the identified issues - each issue becomes an option]
- label: "[SEVERITY] Issue title"
description: "Brief summary of the issue"5.3 Resolve Ambiguities (if any in selected concerns)
For each ambiguity in the selected concerns, present one at a time:
[Present the ambiguity context - what's unclear and why it matters]
question: "How should this be resolved?"
header: "Ambiguity"
multiSelect: false
options:
- label: "[Recommended approach] (Recommended)"
description: "Why this is recommended"
- label: "[Alternative 1]"
description: "Trade-offs of this approach"
- label: "[Alternative 2]"
description: "Trade-offs of this approach"
- label: "Skip - decide later"
description: "Leave unresolved for now"Record each resolution for use in subsequent actions.
5.4 Select Actions
Ask user what actions to take on their selected concerns:
question: "What actions do you want to take?"
header: "Actions"
multiSelect: true
options:
- label: "Create beads issues"
description: "Track findings as beads issues for later resolution"
- label: "Save review to markdown"
description: "Save full review to docs/reviews/[name].md"
- label: "Fix documentation"
description: "Update docs/beads to match reality or resolve inconsistencies"
- label: "Fix code issues" (only show if code issues were identified)
description: "Update code to match documented intent"5.5 Execute Selected Actions
If "Create beads issues" selected:
Ask how to structure the issues:
question: "How should the beads issues be structured?"
header: "Structure"
multiSelect: false
options:
- label: "Epic with task breakdown (Recommended)"
description: "One epic for the review, individual tasks for each finding"
- label: "Individual issues"
description: "Separate standalone issue for each finding"
- label: "Single consolidated issue"
description: "One issue listing all findings"Then create the beads issues accordingly using bd create.
If "Save review to markdown" selected:
Ask for the review name:
"What should this review be called? (will be saved to
docs/reviews/[name].md)"
Save the full review output to the specified path.
If "Fix documentation" selected:
Apply fixes to all selected concerns that have documentation issues:
- Update markdown files to resolve inconsistencies
- Update beads issues via
bd updateto correct descriptions/status - Fix file path references
- Resolve cross-document inconsistencies
No per-issue confirmation needed - user already selected the findings they care about.
If "Fix code issues" selected:
Apply fixes to all selected concerns that have code issues:
- Update code to match documented intent
- Fix API mismatches
- Resolve schema/data issues
No per-issue confirmation needed - user already selected the findings they care about.
5.5b Verify Applied Fixes
After applying fixes (documentation or code), verify they were applied correctly:
- Re-read each modified file to confirm changes applied as intended
- Check for syntax issues - ensure markdown renders, code compiles
- Summarize diffs - briefly note what changed in each file for user awareness
Example verification:
Verifying fixes...
- specs/auth.md: ✓ Updated multi-account section (lines 45-52)
- packages/core/adapter.ts: ✓ Extended TransactionType enum
- .beads/cgt-123.md: ✓ Updated status to reflect current stateIf any fix failed to apply, note it in the summary and suggest manual intervention.
5.6 Summary
After executing all selected actions, summarize what was done:
## Actions Taken
- **Ambiguities resolved:** X of Y
- **Beads issues created:** [list issue IDs if created]
- **Review saved to:** [path if saved]
- **Documentation fixes applied:** X
- **Code fixes applied:** X
- **Deferred for manual action:** [list any skipped items]Execution Checklist
Use TodoWrite to track each step:
- Asked user which LLMs to use + scope description (Step 1a)
- Showed categories and asked Review All vs Customize (Step 1b)
- If Customize: asked user which review categories (Step 1c - 3 multiselect questions, 12 options)
- Pre-flight checks passed IN PARALLEL (if GPT or Gemini selected)
- Dispatched Haiku to gather paths
- Received path list from Haiku (if disambiguation needed, asked user)
- Pre-read beads issues with
bd show(Step 2b) - Built dynamic sub-agent brief with DO CHECK / DO NOT CHECK sections
- Dispatched ALL selected reviewers with
run_in_background: truein ONE message (included pre-read beads in prompts) - Collected results using TaskOutput (block: true)
- If any agent failed: notified user and asked how to proceed (don't silently fallback)
- Created deduplicated issue list (multi-agent) OR presented findings (single-agent)
- Created comparison table (multi-agent only)
- Wrote reasoning (multi-agent) OR additional analysis (single-agent)
- Listed remaining ambiguities
- Produced final summary
- Gated: waited for user to say "go" before next steps
- Summarized findings and asked Address All vs Customize (Step 5.2)
- If Customize: asked user which specific findings to address
- Resolved ambiguities one-by-one (if any in selected findings)
- Asked user which actions to take (multiselect)
- Executed selected actions (beads, markdown, fixes)
- Verified applied fixes (re-read files, check syntax, summarize diffs)
- Presented final summary of actions taken
CRITICAL: Do not skip the synthesis phase. Raw agent output is NOT the deliverable.
Common Mistakes
| Mistake | Correction |
|---|---|
| Exploring codebase yourself | Haiku gathers paths - you just pass the scope description |
Missing run_in_background |
BOTH Task and Bash calls MUST have run_in_background: true |
| Sequential dispatch | Use ONE message with multiple tool calls (Task + Bash) for true parallel |
| Asking user for file list | Ask for scope DESCRIPTION - Haiku discovers the actual paths |
| Codex blocking main thread | Codex Bash call needs run_in_background: true just like Task |
| Not waiting for Haiku | Wait for Haiku's path list before dispatching reviewers |
| Not checking codebase | Always verify documentation against actual code if a codebase exists |
| Overusing Context7 | Only use when unsure about library/API usage - not for well-known patterns |
| Surface-level review | Review must be comprehensive - check every file path, every API assumption |
| Missing synthesis | Raw agent output isn't enough - you must compare, contrast, and reason through |
| Dumping raw output | Agent results must be processed into comparison table with reasoning |
| Skipping Review All option | Always offer fast-track first - only show detailed questions if user picks Customize |
| Not disambiguating epics | If multiple epics match (e.g., multiple "Phase 6"), ask user which one |
| Not pre-reading beads issues | Always pre-read with bd show and include in all agent prompts |
| Making technical assumptions | Don't assume how code works - READ it and quote specific lines |
| Silent rate limit fallback | If any agent fails, NOTIFY user explicitly before proceeding with remaining agents |
| Modifying files during review | Review phase is READ-ONLY - no Edit/Write calls, fixes come in Step 5 |
| Premature timeout | Use 900000ms (15 min) timeout for all agents - don't give up after 2-5 minutes |
| Unquoted special paths | Always single-quote paths with (), [], or spaces |
| Treating sandbox stderr as errors | Homebrew/profile errors in stderr are ignorable - check exit codes instead |
| Sequential pre-flight checks | Run GPT and Gemini pre-flight checks IN PARALLEL, not sequentially |
| Gemini file modifications | Gemini runs unsandboxed - prompt explicitly prohibits file modifications during review |
| Inline Gemini HEREDOC | Use TWO-STEP temp file approach - write prompt to file first, then call gemini |
| Sequential TaskOutput calls | Call ALL TaskOutput in ONE message for parallel collection |
| Extra exploration after collection | Proceed IMMEDIATELY to synthesis - don't supplement with your own reads |
| Skipping fix verification | After applying fixes, re-read files and summarize diffs |
| Table cell overflow | Keep comparison table cells under 30 chars - summarize and reference full details above |