Use when the user says "/goal" or wants to autonomously pursue a durable objective — equivalent to Codex /goal. Decomposes goals into milestones, dispatches agents, and enforces independent verification before marking complete.
Install
npx skillscat add gugug168/goal-skill
Installs the skill via the SkillsCat registry.
/goal — Autonomous Goal Pursuit
Overview
The master entry point for fully autonomous goal-driven development. When the user says /goal or describes a durable objective, this skill orchestrates the complete lifecycle: understanding → planning → implementation → verification → delivery.
This skill is the equivalent of Codex /goal — a persistent, multi-hour autonomous development loop that works toward a clear stopping condition.
Core Principle
A cannot verify A. Every task has an implementer and a separate verifier using a different model/provider.
The 7-Phase Workflow
Phase 0: UNDERSTAND GOAL
│ User says "/goal [objective]"
↓
Phase 1: DECOMPOSE — task-decomp skill
│ Break into milestones with acceptance criteria
↓
Phase 1.5: AI REVIEW — Claude Code reviews task decomposition
│ Review task granularity, dependencies, difficulty
│ Revise based on feedback
↓
Phase 2: SETUP
│ Create Kanban board
│ Create working directory / git worktree
│ Create TASK-过程记录.md process document
↓
Phase 3: AUTONOMOUS EXECUTION — autonomous-dev-loop cronjob
│ Cronjob (every 5 min) wakes Hermes
│ Hermes delegates to: Claude Code / Codex / Gemini CLI / OpenCode
│ Writing → verification → kanban update → notify user
↓
Phase 4: PER-TASK VERIFICATION
│ Non-implementer Agent verifies each completed task
│ 3 failures → switch agent → still failing → "needs human"
↓
Phase 5: FINAL REVIEW
│ Least-participating Agent does final review
│ finishing-a-development-branch → merge/PR
↓
Phase 6: DELIVERY
│ Complete process document
│ Report to user
Phase 0: Understand the Goal
Trigger: User says /goal [objective]
If Goal is Clear
Confirm with user:
Goal: [stated objective]
Stopping condition: [what "done" looks like]
Estimated scope: [big/small/medium]
Agent assignment: Claude Code (code) / Gemini CLI (visual) / Hermes (general)
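Token budget: [unlimited / 50k / 100k / custom] ← capped if set, unlimited otherwise
Ready to decompose? (Y/N)
Token budget notes:
- Not set → unlimited tokens, until the goal is complete or the user interrupts
- Set → warn at 80%, pause at 100% and ask the user to decide (see "Token Budget System")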
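The 80% and 100% thresholds can be sketched as a small pure function. This is only an illustration: `budget_status` is a hypothetical helper, and the returned names are borrowed from the `status` values listed for .goal-budget.json.

```python
from typing import Optional


def budget_status(used_tokens: int, budget_tokens: Optional[int]) -> str:
    """Classify usage as 'active', 'warning' (>= 80%) or 'paused' (>= 100%).

    Hypothetical helper: budget_tokens=None models "no budget set".
    """
    if budget_tokens is None:          # no budget set: never throttle
        return "active"
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    if used_tokens >= budget_tokens:   # 100%: pause and ask the user
        return "paused"
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    if used_tokens * 10 >= budget_tokens * 8:
        return "warning"
    return "active"
```

Keeping the check free of I/O makes it easy to call at the start of every cronjob run with whatever usage counter is available.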
If Goal is Unclear
Invoke brainstorming skill first to clarify:
- What problem does this solve?
- What are the constraints?
- What does success look like?
- What are the boundaries?
After brainstorming → invoke task-decomp to decompose.
Token Budget System
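At the start of every cronjob run, check token usage.
Budget Modes
| Mode | Behavior |
|---|---|
| Not set | Unlimited tokens, until the goal is complete, the user interrupts, or a genuine blocker is hit |
| Set | Throttled; see the threshold rules below |
Threshold Rules (apply only when a budget is set)
0% ──────────────────────────────────
Start execution, record the baseline
80% ─────────────
⚠️ Warn the user:
"Token usage has reached 80% (used X / budget Y)
Z remaining, enough for roughly N more tasks
Continue? Increase the budget?"
Keep executing autonomously
100% ─────
🛑 Pause and notify the user:
"Token budget exhausted (X / Y)
Completed N/M tasks
Progress: ██████░░░░ 60%
Decide: ① add budget and continue ② pause and deliver current results ③ abort"
Budget Tracking File
Create .goal-budget.json in the working directory: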
{
"goal": "[goal name]",
"budget_tokens": 100000,
"warning_tokens": 80000,
"used_tokens": 0,
"last_updated": "YYYY-MM-DD HH:MM",
"status": "active|warning|paused|complete",
"progress": {
"tasks_completed": 3,
"tasks_total": 8
}
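}
Token Usage Estimation
After each delegate_task call, estimate from the usage field it returns:
# Average token cost per completed task (guarding against division by zero)
avg_tokens_per_task = total_used / tasks_completed if tasks_completed else 0
remaining_tasks = total_tasks - tasks_completed
estimated_needed = avg_tokens_per_task * remaining_tasks
80% Warning Notification Format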
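⚠️ Token Budget Warning
Goal: [name]
Used: 80,000 / 100,000 tokens (80%)
Remaining budget covers roughly: ~2 tasks
Current progress: 3/8 tasks (37%)
Most recently completed: [task name]
Options:
① Continue (may finish before reaching 100%)
② Increase the budget by 50%
③ Pause and deliver current results
100% Pause Notification Format
🛑 Token Budget Exhausted
Goal: [name]
Used: 100,000 / 100,000 tokens (100%)
Progress: 5/8 tasks (62%)
Completed:
✅ task 1: [name]
✅ task 2: [name]
...
Not completed:
⬜ task 6: [name]
⬜ task 7: [name]
...
→ Pause, send this notification to the user, and wait for a decision
Options:
① Add budget and continue (recommended: +50% or +100%)
② Pause and deliver (keep the current branch/PR)
③ Abort (abandon this run)
Soft Stop — Wrap-Up on the Last Turn Before 100% (inspired by Codex)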
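Even when the budget is exhausted, do not cut off abruptly. But never mark the goal complete because of it.
Rule (from Codex continuation.md):
"Do not call update_goal unless the goal is complete. Do not mark a goal complete merely because the budget is nearly exhausted or because you are stopping work."
→ Budget exhausted with the goal unmet → status = budget_limited; it can never be marked complete.
Hard rule: there is no budget_limited → complete transition.
Only one path leads to complete:
active + completion audit all green + update_goal(status="complete")
None of the following justifies marking complete:
- ❌ token/time exhausted (budget_limited)
- ❌ the user asked to stop
- ❌ giving up after 3 pivots
- ❌ "It's nearly done, it should be fine"
- ❌ "The user has no objections"
If the budget is exhausted and the goal is incomplete → status = budget_limited → wait for the user's decision (add budget / pause and deliver / abort).
On the last turn before 100% (or on the first wake-up after the budget runs out):
→ Perform the Soft Stop: inject wrap-up steering
Wrap-Up Prompt (injected into the next agent prompt):
The budget is about to run out. Wrap up:
1. Do not start new substantive work
2. Bring the current in-progress task to a clean state
(commit current changes with a clear message)
3. Record in TASK-过程记录.md:
- progress percentage
- completed vs. incomplete task list
- a clear next step (in case the user chooses to continue)
4. Notify the user of the current state and wait for a decision
→ Wait for the user's decision (add budget / pause and deliver / abort)
→ Never mark complete just because "the budget is almost gone"
Goal State Machine (from Codex core runtime, issue #18076)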
State Definitions
┌─────────────┐
│ active │ ← goal is running and can accept new work
└──────┬──────┘
│ interrupt / user pauses
▼
┌─────────────┐
│ paused │ ← goal is paused and can be resumed
└──────┬──────┘
│ token/time budget exhausted (from active)
▼
┌─────────────┐
│budget_limited│ ← soft stop; wrap-up steering is injected
└──────┬──────┘
│ user decides "add budget and continue"
▼
┌─────────────┐
│ active │ ← resume
└──────┬──────┘
│ update_goal(status="complete")
▼
┌─────────────┐
│ complete │ ← goal finished
└─────────────┘
│ update_goal(status="failed")
▼
┌─────────────┐
│ failed │ ← needs human intervention
└─────────────┘
State Transition Rules
| Transition | Trigger | Account Usage |
|---|---|---|
| → active | goal created / user chooses "add budget and continue" | reset baseline |
| → paused | user interrupt / idle interrupt | ✅ checkpoint |
| → budget_limited | token/time exhausted (from active only) | ✅ final delta |
| → complete | completion audit all green + update_goal | ✅ final delta |
| → failed | still failing after 3 pivots | — |
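A minimal sketch of the transition table as a lookup, assuming a state dict shaped like .goal-state.json. The `ALLOWED` map, `can_transition`, and `transition` are hypothetical names. The sketch follows the table literally, so paused → budget_limited is rejected; widen the `paused` entry if exhaustion should also be detected for paused goals. The paused → active (resume) and budget_limited → failed (user aborts) edges are assumptions, not rules taken from the table.

```python
# Hypothetical transition map; terminal states have no outgoing edges.
ALLOWED = {
    "active": {"paused", "budget_limited", "complete", "failed"},
    "paused": {"active"},                    # assumption: resume
    "budget_limited": {"active", "failed"},  # never "complete"
    "complete": set(),
    "failed": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the state machine permits current -> target."""
    return target in ALLOWED.get(current, set())


def transition(state: dict, target: str, tokens_used: int) -> dict:
    """Return a new state dict, recording usage as part of the transition."""
    current = state["status"]
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current} -> {target}")
    new_state = dict(state)
    # Usage is written together with the status so neither can drift.
    new_state["usage"] = {**state.get("usage", {}), "tokens_used": tokens_used}
    new_state["last_transition"] = {"from": current, "to": target}
    new_state["status"] = target
    return new_state
```

Because budget_limited has no edge to complete, an attempt to close a goal merely because the budget ran out fails loudly instead of silently succeeding.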
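Key Rules (from Codex P1 bugs)
⚠️ Usage must be accounted for before any state transition
Take a usage snapshot before every state transition; otherwise:
- switching the goal objective reuses the old objective's token count → the budget is inaccurate
- moving from paused to budget_limited drops tokens already consumed → the count is too low
P1: token/time runs out while the goal is paused rather than active
→ it cannot become budget_limited (the SQL only handles active → budget_limited)
→ it must be resumed and then exhausted again before budget_limited triggers
→ which means a paused goal can run over budget!
→ Correct behavior: budget-exhaustion detection must handle both the active and the paused state
Goal State Check in the Cronjob
# At the start of every cronjob run
goal_status = read_goal_status()  # from .goal-state.json
if goal_status == "budget_limited":
    # Inject wrap-up steering; do not start new substantive work
    inject_wrap_up_steering()
    notify_user_budget_limited()
    wait_for_user_decision()  # add budget / pause and deliver / abort
    return  # wait for the user; do not continue
elif goal_status == "paused":
    # Resume the goal
    resume_goal()
    account_usage_checkpoint()
elif goal_status == "active":
    # Normal execution
    pass
.goal-state.json Template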
{
"goal": "[goal name]",
"status": "active|paused|budget_limited|complete|failed",
"created_at": "YYYY-MM-DD HH:MM",
"updated_at": "YYYY-MM-DD HH:MM",
"usage": {
"tokens_used": 50000,
"time_seconds": 3600,
"tasks_completed": 3,
"tasks_total": 8
},
"last_transition": {
"from": "active",
"to": "budget_limited",
"trigger": "token_budget_exhausted",
"at": "YYYY-MM-DD HH:MM"
}
}
Phase 1: Decompose — task-decomp Skill
Use the task-decomp skill to break the goal into a sequence of tasks.
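Initializer + Executor Dual-Agent Pattern (SUPERPOWERS autonomous-skill)
Why separate them?
A single agent that both decomposes and executes tends to:
- get interrupted by new tasks mid-execution, leaving the decomposition incomplete
- apply the same mental model to both steps, so gaps go unnoticed
Initializer (decomposition agent):
- Analyzes the goal and creates task_list.md (the master task list)
- Breaks the goal into phases/milestones/tasks
- Gives every task explicit deliverables + acceptance criteria + a verification command
- Creates the .autonomous/<task-name>/ subdirectory
Executor (execution agent):
- Reads task_list.md + progress.md
- Completes items one by one and marks them [x]
- Updates progress.md (per-session progress notes)
- Does no decomposition of its own; it only executes the existing list
File structure (.autonomous/ layout):
project-root/
└── .autonomous/<task-name>/
├── task_list.md # Master checklist (descriptions are read-only; the Executor only marks [x])
├── progress.md # Per-session progress notes
└── sessions/ # Transcript logs per session
Task List is Sacred (SUPERPOWERS principle):
- Once a description is written to task_list.md, only marking [x] is allowed; editing the description is forbidden
- Deleting tasks or narrowing a task's scope is forbidden
- This prevents tasks from quietly shrinking as the work proceeds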
Each task must have:
- Name — what this task achieves
- Deliverables — exact files/artifacts produced
- Acceptance Criteria — checklist with verification commands
- Constraints — what NOT to change
- Agent assignment — Claude Code / Gemini CLI / Hermes
- Dependencies — which tasks must complete first
Output: TASK-过程记录.md with full task list and Kanban board.
Phase 1.5: AI Review — Claude Code Reviews Decomposition
Who: Claude Code (via claude-code skill + delegate_task)
What to review:
- Are tasks the right granularity (15-60 min each)?
- Are acceptance criteria specific and verifiable?
- Are dependencies correctly identified?
- Is the difficulty estimate reasonable?
- Any tasks that could be parallelized?
Prompt to Claude Code:
Review this task decomposition for [goal]:
[tasks here]
Verify:
1. Each task is 15-60 min of focused work
2. Each acceptance criterion has a verification command
3. Dependencies are correct and minimal
4. No over-parallelization (tasks that must be sequential)
5. Agent assignments match task type (code → Claude Code, visual → Gemini CLI, general → Hermes)
Report: List specific improvements, then give overall verdict: APPROVED or NEEDS_REVISION
If NEEDS_REVISION: Incorporate feedback, then re-review once before proceeding.
Phase 2: Setup
2.1 Create Kanban Board
hermes kanban create "[Goal Name]"
# Note the board ID
2.2 Create Working Directory + External Memory Files
mkdir -p /tmp/[goal-slug]/
cd /tmp/[goal-slug]/
git init
git commit --allow-empty -m "Initial commit"
Create the external memory files (each one supplements TASK-过程记录.md):
WORKDIR/
├── SPEC.md # goal + non-goals + hard constraints + deliverables + done-when
├── PLANS.md # milestones + acceptance criteria + verification commands
├── IMPLEMENT.md # execution runbook, references PLANS.md
├── DOCUMENTATION.md # live status + decisions + known issues
├── TASK-过程记录.md # task list + execution log (primary tracking file)
└── .goal-budget.json # token budget tracking (only when a budget is set)
SPEC.md template:
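# [Goal] — SPEC
## Goal
[the goal as described by the user]
## Non-Goals (explicitly out of scope)
- [item 1]
- [item 2]
## Hard Constraints
- [conditions that must hold]
## Deliverables
- [file 1]
- [file 2]
## Done When
- [ ] [specific acceptance criterion 1]
- [ ] [specific acceptance criterion 2]
PLANS.md template: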
# [Goal] — PLANS
## Milestones
### M1: [milestone name]
- **Tasks:** T1, T2, T3
- **Acceptance Criteria:**
- [ ] criterion 1 (verification command: xxx)
- [ ] criterion 2 (verification command: xxx)
## Verification Commands
```bash
pytest
npm test
```
### 2.3 Create Process Document
Create `TASK-过程记录.md`:
```markdown
# [Goal Name] — Process Record
**Started:** YYYY-MM-DD
**Goal:** [objective]
**Stopping condition:** [what done looks like]
**Token Budget:** [unlimited / X tokens]
---
## Agent Assignment
- Claude Code → [scope]
- Gemini CLI → [scope]
- Codex/OpenCode → [scope]
- Hermes → orchestration, verification, final review
---
## Task List
| # | Task | Agent | Status | Verification |
|---|------|-------|--------|-------------|
| 1 | ... | Claude Code | done/ready/in-progress | [cmd] |
| 2 | ... | Gemini CLI | ... | [cmd] |
---
## Execution Log
### YYYY-MM-DD HH:MM — [Event]
[What happened]
```
Hard Rules (from Codex Autoresearch, enforced)
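Before launch — Ask-Before-Act
All questions → BEFORE LAUNCH (Phase 0-1)
User says "go" / "start" / "launch" → the LOOP runs fully autonomously
After launch:
- ❌ Do not pause to ask questions
- ❌ Do not pause for confirmation
- ❌ Do not pause to request permissions
- ✅ If something is ambiguous → apply best practice and record the reasoning in TASK-过程记录.md
Core insight: "The user may be asleep." Once started, the autonomous loop should not stop to wait for the user.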
Phase Transition Guards (boundary conditions)
A transition between phases must satisfy the conditions below; otherwise entering the next phase is forbidden:
┌─────────┐ user says "go"/"start"/"launch"     ┌─────────┐
│ Phase 0 │ ──────────────────────────────────→ │ Phase 1 │
└─────────┘ only on explicit authorization      └─────────┘
┌─────────┐ Claude Code review = APPROVED       ┌─────────┐
│ Phase 1 │ ──────────────────────────────────→ │Phase 1.5│
└─────────┘ any NEEDS_REVISION → fix, re-review └─────────┘
┌─────────┐ all of the following hold:          ┌─────────┐
│Phase 1.5│ ──────────────────────────────────→ │ Phase 2 │
└─────────┘ ✅ Kanban board created              └─────────┘
✅ task statuses are non-empty
✅ external memory file skeleton is in place
┌─────────┐ every task has reached one of:      ┌─────────┐
│ Phase 3 │ ──────────────────────────────────→ │ Phase 4 │
└─────────┘ ✅ done (verified)                   └─────────┘
✅ needs human (cannot be completed automatically)
┌─────────┐ completion audit all green          ┌─────────┐
│ Phase 4 │ ──────────────────────────────────→ │ Phase 5 │
└─────────┘ any FAIL → return to Phase 3        └─────────┘
┌─────────┐ finishing-a-development-branch      ┌─────────┐
│ Phase 5 │ ──────────────────────────────────→ │ Phase 6 │
└─────────┘ completed                           └─────────┘
Consequences of violating the Transition Guards:
- Skipping the Phase 1.5 review → treated as a violation and recorded in TASK-过程记录.md
- Entering Phase 3 without a Kanban board from Phase 2 → stop and notify the user
- Entering Phase 4 before Phase 3 is finished → forbidden; wait
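During execution — one change per iteration
Every delegate_task call / every agent run makes exactly one focused change
One hypothesis → one change → one verification → record the result → next
WRONG: make 5 changes at once, then be unable to attribute the outcome
RIGHT: one change at a time, with clear evidence
Scope Fidelity (never shrink the goal)
When the user says "implement X", do not swap in a "safer version of X" or a "smaller-scope X"
just because X is hard to build, hard to test, or a large change
WRONG: substitute a narrower or easier solution because the real one is harder
RIGHT: keep the end state the user asked for; implementation difficulty does not change the goal
"Finished an easier version" ≠ "finished the requested version"
Alignment = moving toward the final state
alignment = moving toward the final state the user requested
An edit is aligned only if it makes the requested final state more true
"Useful behavior that preserves a different end state" = misalignment
WRONG: produce something useful and assume it contributes to the goal
RIGHT: every change must bring the requested final state one step closer
Verify Before Assumption (inspect first, then act)
Before deciding what to do next:
→ first inspect the current state (git log / file contents / command output)
→ then decide on the action
Do not:
→ assume the state is unchanged since your last change
→ assume the code still looks the way you last saw it
→ assume a test covers a requirement (unless you have confirmed it)
update_plan discipline (single in_progress)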
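plan_items in TASK-过程记录.md:
Rules:
1. Only one item may be in_progress at any time
2. Finish the current item → mark it complete immediately → only then mark the next one in_progress
3. Strictly forbidden: marking 3 items complete in one batch (attribution becomes impossible afterwards)
4. On a scope pivot (understanding changed / tasks split / merged / reordered)
→ update the plan in TASK-过程记录.md first, then continue; do not let the plan go stale
5. Updating the plan is not a substitute for doing the work; do not polish the plan instead of acting
WRONG: one commit that marks 5 changes complete
RIGHT: one change → verify → mark complete → next → mark complete
Dirty Worktree Detection (before every agent run)
git status --porcelain
Unexpected changes found (not made by you) → stop immediately → ask the user how to handle them
Do not decide on your own, do not revert, do not ignore them, and do not absorb them into a commit.
| State | Action |
|---|---|
| Empty (clean) | Proceed normally |
| Changes that belong to the current task | Continue; stage the changes |
| Changes that are unrelated | STOP → notify the user → wait for a decision |
Never absorb unrelated user edits into the agent's commits.
Unexpected changes found → STOP IMMEDIATELY → ask user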
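The three-row decision table can be sketched as a classifier over the text that `git status --porcelain` prints. This is a hedged illustration: `classify_worktree` and the `task_files` set are hypothetical, and paths that git quotes because of special characters are only handled approximately.

```python
from typing import Set


def classify_worktree(porcelain: str, task_files: Set[str]) -> str:
    """Classify porcelain v1 output as 'clean', 'task' or 'unrelated'."""
    changed = []
    for line in porcelain.splitlines():
        if not line.strip():
            continue
        path = line[3:]                  # skip the two status columns + space
        if " -> " in path:               # renames are printed as "old -> new"
            path = path.split(" -> ", 1)[1]
        changed.append(path.strip('"'))
    if not changed:
        return "clean"                   # proceed normally
    if all(p in task_files for p in changed):
        return "task"                    # continue and stage the changes
    return "unrelated"                   # STOP and ask the user
```

In practice the input string would come from something like `subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True).stdout`, and an "unrelated" result should end the run rather than trigger any automatic cleanup.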
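Bias to Action
Every rollout must end with either "one concrete edit" or "one explicit blocker + a targeted question"
Do not end a turn with "needs clarification" unless you are genuinely blocked
WRONG: "I think it should be done this way, but I'm not sure. Do you want X or Y?" (stopped before doing any work)
RIGHT: act on the best assumption and report the outcome: "I went with A; tell me if that's wrong"
Unless genuinely blocked (missing information / missing permissions / irreversible action), do not ask questions up front
Plan Closure (check before every stop)
Every intention/TODO/plan item must be marked as one of the following three:
✅ Done — finished, with evidence
🚫 Blocked — blocked, with a one-sentence reason + a targeted question
❌ Cancelled — cancelled, with a reason
Forbidden:
→ ending a turn with an item in_progress
→ ending a turn with an item pending (without explaining why it has not been done)
Destructive Git Commands (hard rule, always)
NEVER (unless the user explicitly asks):
- git reset --hard
- git checkout -- [file]
- git commit --amend
- git rebase -i (dangerous)
- any destructive / irreversible operation
Reason: the user's changes may be lost and cannot be recovered
Excessive-Loop Detection
If you find yourself:
- reading the same file over and over
- editing the same file over and over
- making tool calls continuously with no clear progress
→ stop immediately
→ record the current state in TASK-过程记录.md
→ include: how far you got / where you are stuck / what the targeted question is
→ end this agent turn and continue when the next cronjob wakes you
Condition-based Waiting (SUPERPOWERS systematic-debugging)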
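When you have to wait for something, do not guess with a fixed timeout.
WRONG: sleep 5 && assume it's ready
RIGHT: while ! condition_is_met; do sleep 0.5; done
For example: when waiting for a service to start, poll its health-check endpoint rather than guessing from the logs that it "should be up soon".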
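The same polling idea in Python, as a minimal sketch. `wait_until` is a hypothetical helper; unlike the shell loop it adds an overall deadline, which is an assumption made here so that a condition that never becomes true cannot hang the run.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True                  # the condition was actually observed
        if time.monotonic() >= deadline:
            return False                 # report the failure; do not assume readiness
        time.sleep(interval)
```

A caller would pass a check such as "the health endpoint answered" or "the status file exists" and treat a `False` result as a blocker to record, not as permission to continue.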
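Escalation (raise risks proactively)
When a decision has non-obvious consequences or hidden risks:
→ do not continue silently
→ do not decide on your own that "it should be fine"
→ escalate to the user in this format:
⚠️ Decision needed: [describe the risk/trade-off]
Option A: [pros]
Option B: [cons]
My preference: [reasoning]
Waiting for the user's reply...
Approval Mode Awareness
The agent's approval mode affects testing behavior:
never / on-failure (non-interactive modes):
→ run tests/lint/verification proactively to make sure the task is complete
→ no need to wait for user confirmation
untrusted / on-request (interactive modes):
→ propose what you want to do, and run tests only after the user confirms
→ do not run them on your own (it slows down iteration)
test-related tasks:
→ tests may be run proactively in any mode
Ambition vs. Precision (context-aware)
Different task types call for different strategies:
New task / no existing codebase constraints:
→ free to create boldly, experiment, and propose new approaches
Existing codebase / task with a clearly defined scope:
→ surgical precision: do what the user asked for and nothing more
→ do not change parts the user did not ask about because you think "this is better"
→ do not add features that are "useful but out of scope"
Test: does this change bring the "end state the user asked for" closer?
Yes → do it; No → don't
Action Safety (call out risks before acting)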
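Before any risky or irreversible action:
→ first record in TASK-过程记录.md: what I am about to do / what the risk is / why it must be done now
→ then execute
Never do any of the following without the user's knowledge:
→ delete large amounts of code
→ modify production configuration
→ change shared base modules
→ run database operations with side effects
Tool Persistence
Keep using tools until there is enough evidence to be confident the task is complete
Giving up after a partial read → don't
Stopping when one more targeted check could change the answer → don't
WRONG: start writing a fix after reading 3 lines, without reading the whole relevant file
RIGHT: read all relevant files, confirm the understanding is complete, then act
Dig Deeper (after finding an issue)
After finding the first plausible issue, keep checking:
1. Second-order failures: what other bugs could this bug trigger?
2. Empty-state behavior: is the behavior correct when the data is empty?
3. Retry logic: what happens when a failed operation is retried?
4. Stale state: is a cache or old data creating a false picture?
5. Rollback path: if this change is wrong, how is it undone?
Only then finalize the result
No-tool Turns Can Continue Too (Codex #20523 fix)
Do not conclude that the agent is stuck just because "a turn had no tool calls".
Codex previously used "no registry tool calls" as a heuristic for "should stop", which stopped agents while they were understanding, planning, or waiting.
The correct judgment:
- ✅ the agent is understanding, planning, or waiting on a condition → continue
- ✅ the agent is thinking about what to do next → continue
- ❌ the agent is repeating the same action with no progress → trigger excessive-loop detection
If an agent turn has no tool calls:
- check git log for meaningful commits
- check TASK-过程记录.md for recorded progress
- stop only when there is no meaningful progress at all
After 3 failures → pivot, do not brute-force retry
3 failures (same task)
→ switch agent for 2 more attempts
→ still failing → PIVOT
→ change the approach rather than repeating the same attempt
Gain < 1% with significantly more complexity → discard
If an improvement is < 1% and code complexity rises significantly:
→ drop the improvement and record "discard: gain < 1%"
→ move on to the next direction
Phase 2→3 Launch Gate (mandatory confirmation checklist)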
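Before entering Phase 3, every item below must be confirmed. Continue only when all are ✓; if any is ×, fix it first.
Launcher Checklist (send to the user, then wait for "confirm" or change requests):
□ SPEC.md exists and is complete (Goal + Acceptance Criteria are explicit)
□ PLANS.md exists and is executable (at least 1 milestone, with specific acceptance criteria)
□ Task List is complete (every task has an agent assignment + a verification command)
□ No dangling tasks (a Status other than done/ready/in-progress/blocked)
□ Dirty worktree handled (git status is clean, or staged changes belong to the current task)
□ Dependencies confirmed (no cycles; no ready task depends on a blocked task)
[Optional, where applicable]
□ Token budget set (what is the estimate? is there a buffer?)
□ Verifier assigned (has Verifier ≠ Implementer been confirmed?)
Reply 'confirm' to continue to Phase 3, or point out what needs changing.
Note: Phases 0-1 are the last stop with a human in the loop. The confirmation here is not a formality; it is the final quality gate.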
Phase 3: Autonomous Execution — autonomous-dev-loop
Create the Cronjob
hermes cronjob create \
--name="[Goal] — Autonomous Dev Loop" \
--prompt="[See autonomous-dev-loop skill for prompt template]" \
--schedule="*/5 * * * *" \
--repeat=100 \
--skills="kanban-orchestrator,task-decomp,claude-code,codex,gemini-cli,autonomous-dev-loop" \
--deliver="origin" \
--workdir="[workdir]"
Cronjob Behavior (per run)
1. Read .goal-state.json → check the goal status
├─ budget_limited → inject wrap-up steering → notify user → wait for a decision
├─ paused → resume_goal() + account_usage_checkpoint()
└─ active → continue
2. Read .goal-budget.json (if a token budget is set)
├─ read used_tokens
├─ 80% ≤ used < 100% → send the warning notification → continue
└─ used ≥ 100% → trigger the budget_limited transition → inject wrap-up → notify
3. Verify Before Assumption: inspect the current state first, then decide on the action
- git log --oneline -5 (see what the last few commits were)
- git status --porcelain (see whether the worktree is clean)
- TASK-过程记录.md (confirm the current task status)
4. Dirty worktree detection: git status --porcelain
└─ unrelated changes present → stop and notify the user
5. Read TASK-过程记录.md (task_list) → find tasks in the "ready" state
└─ task_list.md is the single source of truth; the Executor does no analysis of its own
6. For each ready task (single in_progress discipline):
- Re-check for a dirty worktree
- delegate_task(goal=..., toolsets=['terminal','file','web'], role='leaf')
- Agent implements → commits → reports
- Update TASK-过程记录.md (mark complete immediately, do not batch)
- Update used_tokens in .goal-budget.json (if a budget is set)
- Log to TASK-过程记录.md
- Send a Feishu DM to notify the user
7. If a task's dependencies are incomplete → skip it and notify the owner of the dependency
8. All tasks complete → trigger Phase 5
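Steps 5 to 7 amount to "pick one ready task whose dependencies are done, and never run two at once". A minimal sketch under that reading follows; the `Task` dataclass and `next_task` are hypothetical and are not part of the Hermes or Kanban APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    id: str
    status: str                          # "ready", "in-progress", "done", ...
    depends_on: List[str] = field(default_factory=list)


def next_task(tasks: List[Task]) -> Optional[Task]:
    """Return the first ready task whose dependencies are all done, else None."""
    if any(t.status == "in-progress" for t in tasks):
        return None                      # single in_progress discipline
    done = {t.id for t in tasks if t.status == "done"}
    for t in tasks:
        if t.status == "ready" and all(d in done for d in t.depends_on):
            return t
    return None                          # nothing runnable in this run
```

Returning a single task per call keeps each cronjob run to one focused change; tasks with unfinished dependencies are skipped and picked up in a later run.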
### Agent Selection Matrix (central decision tables)
All decision rules are collected here; other sections refer to these tables.
#### A. Implementer Selection
| Task Type | Primary | Fallback 1 | Fallback 2 |
|-----------|---------|------------|------------|
| Code implementation | Claude Code | Codex | OpenCode |
| Visual/UI/aesthetics | Gemini CLI | Claude Code | — |
| General/process/coordination | Hermes | — | — |
| Script/automation | Codex | Claude Code | — |
#### B. Verifier Selection (≠ Implementer)
| Implementer | Preferred Verifier | Backup Verifier |
|-------------|--------------|---------------|
| Claude Code | Hermes | Gemini CLI |
| Gemini CLI | Claude Code | Hermes |
| Hermes | Claude Code | — |
| Codex | Claude Code | Hermes |
**Hard rule: Verifier ≠ Implementer. A different model/provider is mandatory.**
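Table B can be encoded directly as a lookup. The sketch below is illustrative only: `VERIFIERS` mirrors the table, while `pick_verifier` and the `available` set are hypothetical.

```python
from typing import Dict, List, Set

# Preferred verifier first, backup second, mirroring table B.
VERIFIERS: Dict[str, List[str]] = {
    "Claude Code": ["Hermes", "Gemini CLI"],
    "Gemini CLI": ["Claude Code", "Hermes"],
    "Hermes": ["Claude Code"],
    "Codex": ["Claude Code", "Hermes"],
}


def pick_verifier(implementer: str, available: Set[str]) -> str:
    """Return the first available verifier that is not the implementer."""
    for candidate in VERIFIERS.get(implementer, []):
        if candidate != implementer and candidate in available:
            return candidate
    raise LookupError(f"no independent verifier available for {implementer}")
```

Raising instead of falling back to the implementer keeps the hard rule intact: with no independent verifier available, the task is escalated rather than self-verified.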
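#### C. Approval Mode Behavior
| Mode | Test/Lint behavior | Confirmation required |
|------|---------------|---------|
| `never` / `on-failure` (non-interactive) | Run proactively to make sure the task is complete | No |
| `untrusted` / `on-request` (interactive) | Propose what you want to run | Run only after the user confirms |
| Test-related tasks (any mode) | May run tests proactively | No |
#### D. Dirty Worktree Response
| State | Action |
|------|------|
| Empty (clean) | Proceed normally |
| Changes that belong to the current task | Continue; stage the changes |
| Changes that are unrelated | **STOP → notify the user → wait for a decision** |
#### E. Retry Strategy
| Failures | Action |
|---------|------|
| 1-2 (same task, same agent) | Keep retrying |
| 3 (same task, same agent) | Switch to the fallback agent |
| 2 more failures | Mark "needs human" + notify the user immediately |
| 3 fixes all failed | STOP → question the architecture → escalate to the user |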
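The first three rows of table E reduce to two counters. The following is a rough sketch with a hypothetical `next_step` function and hypothetical action names; it deliberately leaves out the fourth row, since questioning the architecture is a judgment call rather than a counter.

```python
def next_step(primary_failures: int, fallback_failures: int = 0) -> str:
    """Map failure counts for one task to the next action."""
    if primary_failures < 3:
        return "retry_primary"           # fewer than 3 failures: same agent again
    if fallback_failures < 2:
        return "use_fallback"            # fallback agent gets 2 attempts
    return "needs_human"                 # mark "needs human" and notify the user
```

Tracking the two counters per task, not per run, matters here: a cronjob run that wakes up later still needs to know how many attempts have already been spent.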
#### F. Condition-based Waiting
WRONG: sleep 5 && assume it's ready
RIGHT: while ! condition_is_met; do sleep 0.5; done
When waiting on an external condition, poll a health check or a status file instead of guessing with a fixed timeout.
#### G. Goal State → Cronjob Behavior
| Goal State | Cronjob action |
|------------|-------------|
| `active` | Normal execution |
| `paused` | resume_goal() + checkpoint, then continue |
| `budget_limited` | inject wrap-up steering → notify the user → wait for a decision |
| `complete` | Stop the cronjob, trigger Phase 5 |
| `failed` | Stop and notify the user |
---
### Agent-Reported Success ≠ Success (always check the VCS diff)
> *"Agent reports success → Check VCS diff → Verify changes → Report actual state"*
After every delegate_task completes, a git diff check is mandatory:
- Agent reports: "Task N complete"
- Hermes runs: git diff --stat
- Hermes verifies: diff matches expected deliverables
- Hermes states: "Confirmed: [files] modified, [lines] changed"
- If diff is empty or wrong → FAIL, re-dispatch
**Forbidden:** trusting the agent's "success" report and treating the task as complete without verification.
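The comparison step can be sketched as a pure function over a list of changed paths. `check_diff` is a hypothetical helper. It assumes the caller has already collected the changed file names, for example from `git diff --name-only` against a suitable base revision, and that choice of base is left to the caller.

```python
from typing import Iterable, Tuple


def check_diff(changed_files: Iterable[str],
               expected: Iterable[str]) -> Tuple[bool, str]:
    """Compare changed paths with a task's expected deliverables."""
    changed = sorted(set(changed_files))
    if not changed:
        return False, "FAIL: diff is empty"
    missing = sorted(set(expected) - set(changed))
    if missing:
        return False, "FAIL: expected files not modified: " + ", ".join(missing)
    return True, "Confirmed: " + ", ".join(changed) + " modified"
```

A failing result maps to "re-dispatch"; a passing one only confirms that the right files changed, so the acceptance criteria still need their own verification commands.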
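### Task List is Sacred (SUPERPOWERS principle)
Once a Kanban task description is written, only the following is allowed:
- ✅ Marking `[x]` (complete)
- ✅ Updating status (ready → in-progress → done)
- ❌ **Editing the task description is forbidden**
- ❌ **Deleting a task is forbidden**
- ❌ **Narrowing a task's scope is forbidden** (turning a hard, large task into a small one)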
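For a Markdown checklist, the rule can be checked mechanically by comparing two versions of the file with the checkbox state masked out. This is a minimal sketch under that assumption: `only_checkboxes_changed` is hypothetical, it covers `- [ ]` style lists only (not Kanban status fields), and it also tolerates un-checking a box.

```python
import re

# Matches the checkbox at the start of a list item, e.g. "- [ ]" or "* [x]".
_CHECKBOX = re.compile(r"^(\s*[-*]\s*)\[[ xX]\]")


def _normalize(line: str) -> str:
    """Replace any checkbox state with an empty box so states compare equal."""
    return _CHECKBOX.sub(r"\1[ ]", line.rstrip())


def only_checkboxes_changed(before: str, after: str) -> bool:
    """True if the two versions differ at most in checkbox state."""
    old, new = before.splitlines(), after.splitlines()
    if len(old) != len(new):             # a task was added or deleted
        return False
    return all(_normalize(a) == _normalize(b) for a, b in zip(old, new))
```

Running this against the previous committed version of the task list before each commit would catch a description that was quietly edited or a task that disappeared.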
### Retry Strategy
3 failures (same task, same agent)
→ Switch to fallback agent
→ 2 more attempts
→ Still failing
→ Mark task "needs human"
→ Notify user immediately
→ Continue with independent tasks
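**After 3 fixes, question the architecture (from SUPERPOWERS systematic-debugging):**
If the same problem is still not fixed after 3 fix attempts:
- Stop piling on patches
- Record in TASK-过程记录.md:
- symptoms / root-cause hypothesis / the 3 fixes already tried
- **STOP → question the architecture**: is the root of this problem a system design issue?
- Escalate to the user: "This may not be a bug but an architecture problem. Refactor, or keep patching?"
- Do not treat "3 patches applied" as normal iteration; it is an architecture warning sign.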
---
## Phase 4: Per-Task Verification
**Rule:** Verifier ≠ Implementer (different model/provider)
### Verification Flow
Implementer completes task →
Verifier (different agent) checks:
1. All acceptance criteria met?
2. Verification commands pass?
3. No side effects on other tasks?
→ PASS → Mark done, next task
→ FAIL → Return to implementer with specific gaps
### Verification Agent Assignment
- **Claude Code tasks** → Hermes or Gemini CLI verifies
- **Gemini CLI tasks** → Claude Code or Hermes verifies
- **Hermes tasks** → Claude Code verifies
- **Codex tasks** → Claude Code or Hermes verifies
### Logging Verification
Verification — Task N
Verifier: [Agent]
Result: PASS / FAIL / NEEDS_REVISION
Evidence: [verification command output]
Date: YYYY-MM-DD HH:MM
---
## Phase 5: Final Review
**Who:** The Agent that participated LEAST in this goal (fresh perspective)
### Completion Audit (verify item by item, never by feel)
**Core principle (from Codex continuation.md): Treat completion as unproven.**
> *"Before deciding that the goal is achieved, **treat completion as unproven** and verify it against the actual current state."*
**Beliefs that are not allowed:**
- "Almost there, it should be done soon"
- "Tests are all green, it should be fine"
- "So much has changed, it must be done"
- "No objection from the user means it passed"
**The right attitude: completion is never a belief; it is a claim that must be proven step by step with evidence.**
### Completion as an "Unproven Hypothesis"
**Inequalities:**
belief ≠ evidence
sense of progress ≠ evidence
all tests green ≠ goal complete
implementation effort ≠ completion
agent-reported success ≠ actual completion
**The only completion criterion: evidence covers every explicit deliverable in the goal.**
---
**Before marking goal complete — perform a completion audit:**
For EVERY explicit requirement from the original goal:
- Derive the requirement (what did the user explicitly ask for?)
- Identify authoritative evidence: files / cmd output / test results / runtime behavior
- Determine:
✅ PROVES completion — evidence shows requirement is satisfied
❌ CONTRADICTS completion — evidence shows requirement is NOT satisfied
⏳ INCOMPLETE — partial work, not fully done
❓ MISSING — no evidence found
⚠️ TOO WEAK — evidence is indirect/weak for the scope of the claim
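Proxy signals are not accepted (quoted from Codex continuation.md):
"Passing tests, a complete manifest, a successful verifier, or substantial implementation effort are useful evidence only if they cover every requirement in the objective."
| Proxy signal | Why it is not enough | What to do instead |
|---|---|---|
| All tests green | May not cover every requirement | Confirm, requirement by requirement, that the tests cover each one |
| Complete manifest | A manifest alone does not mean delivery is complete | Open the files and confirm the actual content matches the spec |
| Verifier passed | The verifier's scope may be too narrow | Check the evidence independently |
| Large implementation effort | Effort ≠ result | Look only at the final state |
| Agent says "done" | The agent may have misjudged | Verify against the VCS diff |
Every requirement must have direct, specific evidence: not an indirect signal, not a proxy signal.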
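The audit's pass condition ("every requirement is proven") is easy to state in code. The sketch below assumes each requirement has already been judged by hand or by a verifier; the `Verdict` enum and `audit` function are hypothetical names for the five outcomes listed above.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Verdict(Enum):
    PROVES = "proves"
    CONTRADICTS = "contradicts"
    INCOMPLETE = "incomplete"
    MISSING = "missing"
    TOO_WEAK = "too_weak"


def audit(results: Dict[str, Verdict]) -> Tuple[bool, List[str]]:
    """Pass only if every requirement has a PROVES verdict; also list the gaps."""
    gaps = [req for req, verdict in results.items() if verdict is not Verdict.PROVES]
    # An empty audit proves nothing, so it does not pass either.
    return (bool(results) and not gaps, gaps)
```

Anything other than PROVES, including weak or indirect evidence, blocks completion and is returned as a gap to work on.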
### Gate Function — Five-Step Verification (SUPERPOWERS verification-before-completion)
Before claiming any status (including "done", "passing", or "no problems"), run the five-step gate:
BEFORE claiming any status:
- IDENTIFY — What command/file/proof proves this claim?
- RUN — Execute the FULL command (fresh, complete run)
- READ — Full output, check exit code, count failures
- VERIFY — Does output actually confirm the claim?
- STATE — If YES: claim WITH evidence / If NO: state actual status with evidence
Skipping any step is cheating, not verification.
**Side-by-side example:**
✅ [Run pytest] [See: 34/34 pass] "All tests pass"
❌ "Should pass now" / "Looks correct" / "Tests were green before"
**Red Flags — stop immediately:**
- Using "should", "probably", "seems to"
- Saying "Great!" / "Perfect!" / "Done!" before running verification
- No verification before commit/push/PR
- Trusting an agent's "success" report
- Treating partial verification as full verification
- "Just this once"
**Examples:**
Requirement: "API must support pagination"
Evidence: pytest passes → ❌ DOES NOT prove — tests don't verify pagination exists
Evidence: GET /api/users?page=2 returns {"items": [...], "has_next": true} → ✅ PROVES completion
Requirement: "All existing tests pass"
Evidence: pytest → 100% passed → ✅ PROVES completion
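The RUN and READ steps of the gate can be sketched with the standard library. `gate` is a hypothetical helper: it runs a command fresh and returns the exit status together with the captured output, leaving the VERIFY and STATE steps (does the output really prove the claim?) to the caller.

```python
import subprocess
import sys
from typing import List, Tuple


def gate(command: List[str]) -> Tuple[bool, str]:
    """Run `command` fresh and return (exit code was 0, combined output)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    evidence = (proc.stdout + proc.stderr).strip()
    return proc.returncode == 0, evidence


if __name__ == "__main__":
    # Stand-in command; a real gate would run the task's verification command.
    ok, evidence = gate([sys.executable, "-c", "print('34 passed')"])
    print("PASS" if ok else "FAIL", "-", evidence)
```

A zero exit code alone is still a proxy signal: the pagination example above shows why the output has to be read against the specific requirement before any claim is made.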
### Final Review Checklist
After completion audit passes:
- ✅ Completion audit: ALL requirements proven met
- All verification commands pass?
- Process document complete and accurate?
- Any technical debt introduced?
- Documentation updated?
- Tests comprehensive?
- Git history clean (relevant commits only)?
**Then invoke `finishing-a-development-branch`:**
- Merge to main? → local merge + test
- Push and create PR? → `gh pr create`
- Keep branch? → report location
### Report to User
✅ Goal complete: [name]
Tasks: N completed | N failed
Duration: X hours
Agents used: Claude Code / Gemini CLI / Codex / Hermes
Deliverables:
- [file 1]
- [file 2]
Process document: [path]
---
## Phase 6.1: Automatic Evaluation — Retrospective + Darwin Self-Assessment
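**When it runs:** after Phase 6 Delivery completes (once per /goal)
**Purpose:** automatically produce an execution profile + a Darwin self-assessment + actionable improvement suggestions
---
### 6.1.1 Collect Execution Data
Read the following files to build the execution profile:
```bash
# 1. From TASK-过程记录.md
#    - task count, status distribution, agent assignments
#    - execution log timeline
#    - exceptions that were triggered
# 2. From .goal-budget.json
#    - token usage rate, whether the budget was exceeded
# 3. From the Kanban board
#    - tasks_done / tasks_total
#    - tasks_blocked / tasks_needs_human
# 4. From git log
#    - commit count, frequency, author distribution
```
### 6.1.2 Generate the Retrospective JSON
Write to ~/.hermes/goal-runs/run_{timestamp}__{goal-slug}.json: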
{
"run_id": "uuid-v4",
"goal": "[original goal]",
"goal_slug": "[slug]",
"started_at": "YYYY-MM-DD HH:MM",
"ended_at": "YYYY-MM-DD HH:MM",
"duration_minutes": 47,
"token_budget": {
"set": 100000,
"used": 73000,
"pct": 73,
"mode": "soft_limit",
"overrun": false
},
"task_stats": {
"total": 9,
"done": 8,
"blocked": 0,
"needs_human": 1,
"completion_rate": 0.89
},
"agent_stats": {
"claude_code": { "assigned": 5, "done": 4, "failed": 1 },
"codex": { "assigned": 2, "done": 2, "failed": 0 },
"gemini_cli": { "assigned": 1, "done": 1, "failed": 0 },
"hermes": { "assigned": 3, "done": 3, "failed": 0 }
},
"phase_transitions": [
{ "from": "phase_0", "to": "phase_1", "trigger": "user_confirm", "at": "HH:MM" },
{ "from": "phase_1", "to": "phase_2", "at": "HH:MM" },
{ "from": "phase_2", "to": "phase_3", "trigger": "launch_gate_confirmed", "at": "HH:MM" },
{ "from": "phase_3", "to": "phase_4", "at": "HH:MM" },
{ "from": "phase_4", "to": "phase_5", "trigger": "all_tasks_verified", "at": "HH:MM" },
{ "from": "phase_5", "to": "phase_6", "at": "HH:MM" }
],
"decomposition_quality": {
"spec_fulfilled": true,
"plan_fulfilled": false,
"task_gaps": ["T3 scope grew midway", "T5 depends on T7, forcing serial execution"],
"decomposition_failures": [
{ "task": "T5", "reason": "underestimated API complexity", "rework": "split into T5a+T5b" }
]
},
"completion_audit": {
"conducted": true,
"caught_gaps_before_delivery": 2,
"gaps_found": ["missing input validation", "inconsistent error messages"],
"all_requirements_met": false,
"requirement_match_rate": 0.85
},
"rule_violations": [
{ "rule": "Dirty Worktree Guard", "detected": true, "action": "stopped_notified" }
],
"rule_effectiveness": [
{ "rule": "Verify Before Assumption", "triggered": 6, "prevented_mistake": 4, "missed": 1 }
],
"agent_decisions": [
{
"task": "T4",
"assigned": "Claude Code",
"outcome": "failed_after_3_retries",
"should_have_been": "Codex",
"reason": "T4 is a deep refactor; Codex is stronger at deep search"
}
],
"exceptions": [
{
"type": "retry_exhausted",
"task": "T4",
"agent": "Claude Code",
"attempts": 3,
"errors": ["type mismatch", "boundary condition error"],
"resolution": "switched_to_codex"
},
{
"type": "pivot",
"task": "T2→T2b",
"reason": "third-party library license issue",
"new_approach": "rewrite using the standard library"
},
{
"type": "escalation",
"task": "T8",
"reason": "database choice requires a business decision",
"user_decision": "chose PostgreSQL"
}
],
"darwin_self_assessment": {
"d1_frontmatter": 7,
"d2_workflow_clarity": 13,
"d3_boundary_coverage": 9,
"d4_checkpoint_design": 7,
"d5_instruction_specificity": 13,
"d6_resource_integration": 5,
"d7_architecture": 14,
"d8_measurable_effects": 0,
"total": 68,
"assessor": "hermes",
"note": "d8 is computed automatically from accumulated samples; it is 0 on the first run"
},
"lessons_learned": [
"T5 was not decomposed finely enough; next time, double the effort estimate for API integration tasks"
],
"improvement_suggestions": [
{
"priority": "high",
"dimension": "D5",
"issue": "Agent Selection Matrix assigns deep refactor tasks inaccurately",
"evidence": "T4: Claude Code failed 3 times, Codex passed on the first try",
"fix": "Add a 'deep refactor' row to the Implementer table with primary=Codex"
}
],
"git_commits": ["abc1234", "def5678"],
"workdir": "/tmp/goal-slug"
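}
### 6.1.3 Append to the Cumulative Sample Store
# Append to the cumulative file
cat >> ~/.hermes/goal-runs/aggregate.jsonl << 'EOF'
{"run_id": "...", "goal": "...", "darwin_total": 68, ...}
EOF
# git commit (if this is a git working tree)
cd ~/.hermes/goal-runs && git add . && git commit -m "goal-run: {goal-slug} {date}"
aggregate.jsonl is an append-only streaming format (JSON Lines), which makes later analysis easy:
# Analyze the accumulated data
cat ~/.hermes/goal-runs/aggregate.jsonl | jq '.darwin_self_assessment.total, .task_stats.completion_rate'
### 6.1.4 Generate a Human-Readable Evaluation Report
Send to the user:
📊 /goal Execution Evaluation Report
━━━━━━━━━━━━━━━━━━━━━
Goal: [goal]
Duration: 47 minutes | Tokens: 73% used
Completion: 8/9 tasks (89%) | Agent success rate: 87%
🔴 Exceptions (need attention)
- T4: switched to Codex after 3 Claude Code failures
- T2: pivoted because of a license issue
- 1 task needs human intervention
🟡 Decomposition quality
- SPEC delivery: ✅ met
- PLAN milestones: ⚠️ 1 incomplete
- Decomposition miss: T5 complexity underestimated; T5 split into T5a+T5b
🟢 Hard Rule performance
- Dirty Worktree Guard: triggered once (normal)
- Verify Before Assumption: triggered 6 times, prevented 4 mistakes
- Plan Closure: 2 items correctly marked Blocked
🟡 Agent assignment review
- T4 should have gone to Codex (Claude Code failed)
- Matrix needs a "deep refactor" row
📈 Darwin self-assessment: 68/100 (D8=0, not enough samples)
💡 Improvement suggestions (by priority)
[HIGH] D5: add a deep refactor type to the Agent Selection Matrix
[MED] D3: boundary coverage: add a boundary condition for "T5-style API integration tasks"
[LOW] D7: three-lane persistence model: compaction was not triggered on the first run
━━━━━━━━━━━━━━━━━━━━━
Full report: ~/.hermes/goal-runs/run_{timestamp}__{slug}.json
### 6.1.5 Automatic Calculation Rule for the Cumulative D8 Score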
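Once aggregate.jsonl holds ≥ 3 samples, D8 is computed automatically:
# D8: measurable effect (actual outcome vs. expectation)
d8_score = (
    avg_completion_rate * 4 +      # completion rate (0-1) × 4
    avg_requirement_match * 3 +    # requirement match rate × 3
    (1 - avg_needs_human) * 2 +    # less human intervention is better × 2
    avg_rule_effectiveness * 1     # rule effectiveness rate × 1
)
# Full score is 10, capped at 10
Storage layout:
~/.hermes/goal-runs/
├── run_2026-05-12_141522__calc-cli_abc123.json # detailed record of each run
├── aggregate.jsonl # append-only summary of all runs
└── .git/ # optional, git-tracked history
### 6.1.6 When Phase 6.1 Runs
Phase 6 Delivery
↓
Phase 6.1 Retrospective (automatic, no user trigger needed)
↓
Send the evaluation report to the user
↓
Wait for user feedback (confirmation / change suggestions)
↓
If there are change requests → update SKILL.md (enter the next round of Darwin optimization)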
Codex /goal Best Practices (Integrated)
References
references/codex-superpowers-research.md — Full authoritative source analysis: Codex continuation.md (gold standard prompt + rules), Codex #19910 (three-lane persistence), Codex #20523 (no-tool suppression), SUPERPOWERS verification-before-completion (gate function), systematic-debugging (3-fix architecture threshold), autonomous-skill (Initializer+Executor), subagent-driven-development, writing-plans. Contains original quoted text and gap comparison table.
External Memory Files
For long-running goals, maintain these files as external memory.
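Three-lane persistence model (from Codex #19910):
During compaction (context compression), Codex stores three lanes separately so that global goal information is not lost:
- Objective — the original goal (not a summary of a summary)
- Completion Contract — the checklist that must be satisfied before completion
- Evidence Ledger — files modified / unresolved TODOs / decisions made
How the lanes map to our external memory files:
| File | Corresponding Codex lane |
|---|---|
| SPEC.md | Objective + Non-Goals + Done-When |
| PLANS.md | Completion Contract (milestone checklist) |
| TASK-过程记录.md | Evidence Ledger (execution log + decisions) |
These three files must stay consistent and be updated in sync at all times.
File list: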
- SPEC.md — goal, non-goals, hard constraints, deliverables, "done when"
- PLANS.md — milestones with acceptance criteria + verification commands
- IMPLEMENT.md — execution runbook referencing PLANS.md
- DOCUMENTATION.md — real-time status + decisions + known issues
- TASK-过程记录.md — task list + execution log (primary tracking file, Evidence Ledger)
- .goal-budget.json — token budget tracking (only when a budget is set)
Milestone Verification Rule
After each milestone: run verification commands. Fail → fix before continuing.
WRONG: Milestone done, move on, hope it works
RIGHT: Milestone done, run verification, fix if fails, then continue
Treat Worktree as "Another Agent"
The workspace doesn't remember. Write everything to files:
- Current milestone status
- What was verified
- What remains
- Blockers
Agent CLI Commands Reference
Claude Code (via delegate_task)
delegate_task(
goal="[task description]",
context="...[full context]...",
toolsets=['terminal', 'file', 'web'],
role='leaf'
)
Gemini CLI (via terminal)
# Headless (returns text, doesn't write files)
gemini -p "[task]" --approval-mode=yolo
# ACP mode (skills enabled)
gemini --acp -p "[task]" --approval-mode=yolo
# With worktree isolation
gemini -w "feature-name" -p "[task]" --approval-mode=yolo
Codex CLI (via delegate_task)
delegate_task(
goal="[task description]",
context="...[full context]...",
toolsets=['terminal', 'file', 'web'],
acp_command='codex',
role='leaf'
)
Skill Loading Order
When this skill is invoked, load these skills in order:
1. brainstorming — if goal is unclear
2. task-decomp — decompose into milestones
3. claude-code — coding implementation
4. codex — coding implementation
5. gemini-cli — visual/aesthetic implementation
6. autonomous-dev-loop — cron-driven execution
7. kanban-orchestrator — Kanban operations
8. finishing-a-development-branch — completion workflow
Common Failure Modes
| Failure | Response |
|---|---|
| Task blocked on dependency | Notify, skip, continue independent tasks |
| Implementer returns empty output | Re-dispatch with explicit file paths |
| Verification fails | Return to implementer with gap list |
| 3 failures on same task | Switch agent, 2 more attempts, then mark "needs human" |
| User interrupts cronjob | Resume from last checkpoint in TASK-过程记录.md |
| Agent produces wrong thing | Verify against spec, not assumption |
When to Use This Skill
Use /goal when:
- User says "/goal [objective]"
- Task is bigger than one prompt (multi-file, multi-step)
- Code quality matters (not a prototype script)
- Independent verification is required before advancing
- User wants autonomous progress without constant steering
- Equivalent to Codex /goal use cases: migrations, large refactors, prototype creation
Don't use for:
- Simple one-off questions ("what is X?")
- Tasks that are purely research
- Operational tasks (restart server, check logs)
Related Skills
| Skill | Role in /goal |
|---|---|
| brainstorming | Phase 0 — clarify unclear goals |
| task-decomp | Phase 1 — decompose into tasks |
| claude-code | Phase 3 — code implementation |
| codex | Phase 3 — code implementation |
| gemini-cli | Phase 3 — visual/aesthetic implementation |
| autonomous-dev-loop | Phase 3 — cron-driven execution |
| kanban-orchestrator | Phase 2/3 — Kanban operations |
| finishing-a-development-branch | Phase 5 — completion workflow |
| subagent-driven-development | Per-task execution pattern |
| requesting-code-review | Phase 4 — verification |
| receiving-code-review | Handling review feedback |
| writing-plans | Per-task implementation plans |
| darwin-evaluation | Systematic evaluation and optimization of /goal (8-dimension rubric + measured comparison) |
| test-prompts.json | Test prompts for 3 typical /goal scenarios, used for Darwin measured validation |