Self-healing CI pipeline -- 3-attempt retry budget with strategy rotation, inbox-wait for results, autonomous fix loop with escalation
Resources
3Install
npx skillscat add omninode-ai/omniclaude/ci-fix-pipeline Install via the SkillsCat registry.
CI Fix Pipeline
Overview
Autonomous pipeline that fetches GitHub Actions CI failures and fixes them -- ALL failures by
default. No selective mode. Failures beyond max_fix_files trigger sub-ticket creation and
continue with the remaining fixable failures.
v2.0 Self-Healing Mode (OMN-2829): When --self-heal is enabled, the pipeline enters a
multi-attempt repair loop with strategy rotation. Each attempt uses a different fix strategy
(targeted -> broad lint -> regenerate). Between attempts, the pipeline uses inbox-wait (not
polling) to detect CI re-run results. The node_ci_repair_effect ONEX node orchestrates this
loop.
Workflow (standard): Fetch CI failures -> Slack start -> Classify + Sub-ticket large-scope -> Fix ALL fixable -> Commit -> Slack complete -> ModelSkillResult
Workflow (self-healing): _bin/ci-status.sh -> detect failures -> attempt 1 (targeted fix) -> push -> inbox-wait -> if still failing -> attempt 2 (broad lint fix) -> push -> inbox-wait -> if still failing -> attempt 3 (regenerate) -> push -> inbox-wait -> inbox notification on success/exhaustion
Announce at start: "I'm using the ci-fix-pipeline skill to fix CI failures."
Policy Defaults
policy:
fix_all: true # always fix all -- no selective mode
max_fix_files: 10 # files in scope trigger sub-ticket (not skip)
fix_preexisting_in_touched: true # fix pre-existing issues in touched files
slack_on_start: true # notify Slack before fixing
slack_on_complete: true # notify Slack with fix summary
max_attempts: 3 # self-heal: retry budget (1-3)
self_heal: false # self-heal: enable multi-attempt loopQuick Start
/ci-fix-pipeline # Fix all CI failures on current branch
/ci-fix-pipeline --pr 42 # Fix failures for PR #42
/ci-fix-pipeline --ticket-id OMN-1234 # Include ticket context in Slack messages
/ci-fix-pipeline --no-slack # Suppress Slack notifications
/ci-fix-pipeline --skip-patterns "test_*" # Skip jobs/steps matching pattern
/ci-fix-pipeline --max-fix-files 20 # Raise the sub-ticket threshold
/ci-fix-pipeline --self-heal --pr 42 # Self-healing mode with retry loop
/ci-fix-pipeline --self-heal --max-attempts 2 # Limit to 2 attempts
/ci-fix-pipeline --analyze-only --pr 42 # Diagnose CI failures without fixing (formerly ci-failures)Arguments
| Argument | Default | Description |
|---|---|---|
--pr <number> |
none | PR number for CI failure fetch |
--branch <ref> |
current | Branch name for CI failure fetch |
--skip-patterns <patterns> |
none | Comma-separated job/step name patterns to skip |
--max-fix-files <n> |
10 | Files-in-scope threshold; above this -> sub-ticket |
--no-slack |
false | Disable Slack notifications |
--ticket-id <id> |
none | Context ticket ID for Slack messages |
--self-heal |
false | Enable self-healing retry loop with strategy rotation |
--max-attempts <n> |
3 | Maximum repair attempts (only with --self-heal) |
--analyze-only |
false | Fetch and display CI failure analysis without attempting fixes |
Analyze-Only Mode (--analyze-only)
When --analyze-only is passed, the pipeline stops after the analysis phase without attempting fixes.
This reproduces the former ci-failures skill behavior.
What happens:
- Fetch CI failure logs (same as normal mode)
- Parse and categorize failures
- Display analysis results with categorization
- Stop -- do not enter the fix loop
Use case: Quick CI failure diagnosis without committing to the full fix pipeline.
Tools available in analyze-only mode
The underlying CI scripts remain in ${CLAUDE_PLUGIN_ROOT}/skills/ci-fix-pipeline/:
ci-quick-review <PR#>-- Quick summary of all failures (Tier 1)get-ci-job-details <job_id>-- Deep dive into specific job (Tier 2)fetch-ci-data <PR#>-- Raw CI data in JSON format
Two-Tier Investigation Workflow:
- Tier 1: Run
ci-quick-reviewto see summary of all failures - Tier 2: For each critical/major failure, run
get-ci-job-details <job_id>for deep investigation
Execution Phases
Phase 1: Fetch CI Failures
Dispatch to polymorphic agent:
Task(
subagent_type="onex:polymorphic-agent",
description="Fetch CI failures for ci-fix-pipeline",
prompt="Fetch CI failures for analysis.
Run: ${CLAUDE_PLUGIN_ROOT}/skills/ci-fix-pipeline/ci-quick-review {N | branch_name}
Return the raw JSON from ci-quick-review (pass through unchanged).
The response has structure:
{\"repository\": str, \"pr_number\": int,
\"summary\": {\"total\": N, \"critical\": N, \"major\": N, \"minor\": N},
\"failures\": [{\"workflow\": str, \"job\": str, \"job_id\": str, \"step\": str,
\"severity\": str, \"workflow_id\": str, \"job_url\": str}],
\"fetched_at\": str}"
)Branch resolution: After Phase 1, the orchestrator must have a branch name for use in later
phases. Resolve as follows:
- If
--branchwas provided: use that value directly. - If
--prwas provided (no--branch): rungh pr view {N} --json headRefName --jq '.headRefName'
to get the branch name. - If neither was provided: use the current branch (already known from
ci-quick-reviewinvocation).
Phase 2: Slack Start Notification
If slack_on_start: true and Slack is available, notify:
ci-fix-pipeline starting
PR/Branch: {context}
Failures found: {N} ({critical} critical, {major} major, {minor} minor)
Ticket: {ticket_id if provided}Skip silently if Slack unavailable (non-blocking).
Phase 3: Classify and Route Failures
For each failure:
Skip check: Does the failure
joborstepname match any--skip-patternspattern?- Yes → mark as
skipped, record reason - No → continue
- Yes → mark as
Scope check: Does the failure
jobtouch more thanmax_fix_filesfiles?
Determine scope by inspecting the job logs (viagh api repos/{repo}/actions/jobs/{job_id}/logs)
to count affected files. If log inspection is unavailable, treat scope as within threshold.- Scope > max_fix_files → mark as
capped, create Linear sub-ticket inline (see Sub-Ticket Creation below), continue to next failure - Scope ≤ max_fix_files → add to fix queue
- Scope > max_fix_files → mark as
Result: failures split into skipped, capped, and to_fix buckets.
Phase 4: Fix Failures
Dispatch one polymorphic agent per severity group (critical first, then major, then minor) for all to_fix failures:
Task(
subagent_type="onex:polymorphic-agent",
description="Fix {severity} CI failures",
prompt="**AGENT REQUIREMENT**: You MUST be a polymorphic-agent.
Fix the following {severity} CI failures:
{failures_list}
Instructions:
1. Read each affected file
2. Apply the fix
3. If fix_preexisting_in_touched is true: also fix any pre-existing lint/mypy
issues in those files (only files already in scope — not a full repo scan)
4. Do NOT commit
Return classification for each failure:
{\"fixed\": [failure_ids], \"architectural\": [failure_ids], \"unfixable\": [failure_ids]}"
)Post-fix architectural check: For each failure returned as architectural by the fix agent:
- Send a Slack message (via
HandlerSlackWebhookin omnibase_infra) describing the architectural
change and asking for human approval. Include the failure description, the proposed fix, and
the files affected. Wait for a reply (poll or webhook callback). - Approved (human replies "approve" or "yes") → apply fix; Declined (any other reply or timeout after 10 min) → mark
escalated
Phase 5: Commit Fixes
Skip this phase if fixed count is 0 (no code was changed — all failures were skipped, capped, or unfixable). Proceed directly to Phase 6 with commit: null in ModelSkillResult.
Otherwise, orchestrator stages and commits inline (no dispatch needed):
git add <changed_files>
git commit -m "fix(ci): resolve {N} {severity} failures [{ticket_id}]"Commit message format: fix(ci): resolve {N} {severity} failures [{ticket_id}]
where {severity} is the highest severity fixed (e.g., critical, major, minor) and{ticket_id} is the value from --ticket-id (or omitted if not provided).
Sub-Ticket Creation
For each capped failure (scope > max_fix_files), created inline during Phase 3:
# current_team: resolved from --ticket-id parent team (via mcp__linear-server__get_issue),
# or from the first team returned by mcp__linear-server__list_teams if no ticket is provided.
mcp__linear-server__create_issue(
title=f"CI: {failure.job} — {failure.step} (large scope)",
team=current_team,
description=f"""
## CI Failure Requiring Human Review
**Job**: {failure.job}
**Step**: {failure.step}
**Severity**: {failure.severity}
**Scope**: Exceeds max_fix_files={max_fix_files} threshold
**Triggered by**: ci-fix-pipeline run for {ticket_id or branch}
**Job URL**: {failure.job_url}
## Definition of Done
- [ ] All affected files reviewed and fixed
- [ ] CI passing on {branch}
""",
parentId=ticket_id if ticket_id else None,
labels=["ci-failure", "needs-human"]
)Phase 6: Slack Complete Notification
If slack_on_complete: true, notify with diff summary:
ci-fix-pipeline complete
Fixed: {N} failures
Skipped: {M} failures (patterns: {patterns})
Sub-tickets created: {K} (large-scope failures)
Escalated: {L} (architectural — declined or timed out)
Ticket: {ticket_id if provided}
Branch: {branch}ModelSkillResult Output
Emits to ~/.claude/skill-results/{context_id}/ci-fix-pipeline.json
where {context_id} is the Claude session ID (from $CLAUDE_SESSION_ID env var) or default
if the session ID is unavailable:
{
"status": "completed|capped|escalated|failed",
"fixed_count": 5,
"skipped_count": 1,
"capped_count": 2,
"escalated_count": 0,
"unfixable_count": 0,
"sub_tickets": ["OMN-XXXX", "OMN-XYYY"],
"commit": "abc1234",
"branch": "feature/my-branch",
"ticket_id": "OMN-1234"
}Status values:
completed— All fixable failures resolvedcapped— Some failures deferred to sub-tickets; fixed what was in scopeescalated— One or more architectural failures declined or timed out; human review requiredfailed— CI fetch failed or commit failed; pipeline halted
Failure Handling
| Error | Behavior |
|---|---|
| CI fetch fails | Hard exit with status: failed, reason in output |
| Fix agent fails | Log failure, mark as unfixable, continue with others |
| Sub-ticket creation fails | Log warning, continue (non-blocking) |
| Slack unavailable | Skip notification, continue (non-blocking) |
| Commit fails | Exit with status: failed, leave changes staged |
Sub-Ticket Threshold Policy
The max_fix_files threshold is a routing decision, not a skip:
- Failures within threshold: fixed autonomously
- Failures above threshold: sub-ticket created, pipeline continues with remaining
This ensures large-scope failures are never silently dropped — they are tracked in Linear.
Self-Healing Mode (OMN-2829)
When --self-heal is enabled, the pipeline wraps the standard fix flow in a multi-attempt
retry loop orchestrated by node_ci_repair_effect.
Architecture
CI fails
-> _bin/ci-status.sh --pr N --repo ORG/REPO
-> parse failure JSON
-> node_ci_repair_effect.execute_effect(event)
-> for attempt in 1..max_attempts:
strategy = RepairStrategy.for_attempt(attempt)
-> dispatch fix agent with strategy-specific prompt
-> git add + commit + push
-> inbox-wait for CI re-run result (not polling)
-> if CI passing: inbox notification "repaired" -> exit
-> if CI failing: rotate strategy, continue loop
-> if all attempts exhausted: inbox notification "exhausted" -> exitStrategy Rotation
Each attempt uses a progressively broader fix strategy:
| Attempt | Strategy | Description |
|---|---|---|
| 1 | targeted_fix |
Parse error logs, fix only the specific failing lines/files |
| 2 | broad_lint_fix |
Run ruff/mypy auto-fix on all files touched by the PR |
| 3 | regenerate_and_fix |
Rewrite failing sections, apply all auto-fixers |
CI Status Extraction
Self-healing mode uses _bin/ci-status.sh instead of the heavier ci-quick-review
for fast, structured CI status checks between attempts:
# Fetch CI status as structured JSON
${CLAUDE_PLUGIN_ROOT}/_bin/ci-status.sh --pr 42 --repo OmniNode-ai/omniclaudeOutput:
{
"status": "failing",
"pr_number": 42,
"repo": "OmniNode-ai/omniclaude",
"branch": "jonah/omn-2829-self-healing-ci",
"run_id": "12345678",
"failed_jobs": [
{
"job_id": "56174634733",
"job_name": "lint / ruff",
"step": "Run ruff check",
"conclusion": "failure",
"log_excerpt": "..."
}
],
"failure_summary": "1 job(s) failed: lint / ruff",
"fetched_at": "2026-02-26T13:00:00Z"
}Inbox-Wait Pattern
After each fix attempt pushes code, the pipeline waits for CI results using inbox-wait
rather than fixed-interval polling:
- Push fix commit to branch
- Call
node_ci_repair_effect.wait_for_ci_rerun()which:- Polls
_bin/ci-status.shat 30s intervals - Detects when a new
run_idappears (different from pre-push run) - Waits for terminal state (
passingorfailing) - Times out after 5 minutes (configurable)
- Polls
- If
passing: record success, send inbox notification, exit - If
failing: rotate strategy, begin next attempt
Self-Healing Fix Dispatch
Task(
subagent_type="onex:polymorphic-agent",
description="ci-fix-pipeline: self-heal attempt {N}/{max} for PR #{pr_number}",
prompt="**SELF-HEALING CI REPAIR -- Attempt {N}/{max}**
Strategy: {strategy_name}
{strategy_prompt}
CI Failure Details:
{failure_json}
Branch: {branch}
Repo: {repo}
Instructions:
1. Read the failure logs carefully
2. Apply the fix strategy described above
3. Stage and commit with message:
fix(ci): self-heal attempt {N} -- {strategy_name} [{ticket_id}]
4. Push to the branch
Report: files changed, fix description, confidence level."
)ONEX Node: node_ci_repair_effect
Tier: EVENT_BUS+
Type: Effect node (external I/O)
Location: plugins/onex/hooks/lib/node_ci_repair_effect.py
Provides:
execute_effect(event)-- Initialize repair run staterecord_attempt_result(run_state, attempt, ...)-- Record attempt outcomefinalize_with_error(run_state, error)-- Record error and notifyfetch_ci_status(pr, repo, branch)-- Wrapper around_bin/ci-status.shwait_for_ci_rerun(pr, repo, branch, prev_run_id)-- Inbox-wait for new CI runsend_inbox_notification(run_state, message)-- Write to~/.claude/inbox/save_repair_state(run_state)/load_repair_state(run_id)-- State persistence
Self-Healing ModelSkillResult
When --self-heal is active, the ModelSkillResult includes additional fields:
{
"status": "repaired|exhausted|failed",
"repair_run_id": "ci-repair-42-1740000000",
"attempts_used": 2,
"max_attempts": 3,
"strategy_used": "broad_lint_fix",
"fixed_count": 5,
"skipped_count": 0,
"commit": "abc1234",
"branch": "jonah/omn-2829-self-healing-ci",
"ticket_id": "OMN-2829",
"inbox_notification_sent": true
}Status values (self-healing):
repaired-- CI fixed within the retry budgetexhausted-- All attempts used, CI still failingfailed-- Error during repair (fetch failed, commit failed, etc.)
Verification
To verify self-healing works end-to-end:
- Push a deliberately failing commit (e.g., syntax error, failing test)
- Run:
/ci-fix-pipeline --self-heal --pr <N> --ticket-id OMN-2829 - Confirm:
- Attempt 1 uses
targeted_fixstrategy - If still failing, attempt 2 uses
broad_lint_fix - If still failing, attempt 3 uses
regenerate_and_fix - Inbox notification sent on success or exhaustion
- State persisted to
~/.claude/state/ci-repair/
- Attempt 1 uses
See Also
ci-watchskill -- poll CI status and auto-fix (OMN-2523)local-reviewskill -- review and fix local code changesticket-pipelineskill -- end-to-end ticket pipeline (Phase 4 invokes ci-watch)_bin/ci-status.sh-- lightweight CI status extraction scriptnode_ci_repair_effect-- ONEX effect node for self-healing orchestrationHandlerSlackWebhookin omnibase_infra -- Slack delivery infrastructure