gateway-diagnose

"Diagnose gateway failures by reading daemon logs, session transcripts, Redis state, and OTEL telemetry. Full Telegram path triage: daemon process → Redis channel → command queue → pi session → model API → Telegram delivery. Use when: 'gateway broken', 'telegram not working', 'why is gateway down', 'gateway not responding', 'check gateway logs', 'what happened to gateway', 'gateway diagnose', 'gateway errors', 'review gateway logs', 'fallback activated', 'gateway stuck', or any request to understand why the gateway failed. Distinct from the gateway skill (operations) — this skill is diagnostic."

joelhooks 61 3 Updated 4mo ago

Install

npx skillscat add joelhooks/joelclaw/gateway-diagnose

Install via the SkillsCat registry.

SKILL.md

Gateway Diagnosis

Structured diagnostic workflow for the joelclaw gateway daemon. Runs top-down from process health to message delivery, stopping at the first failing layer.

Default time range: 1 hour. Override by asking "check gateway logs for the last 4 hours" or similar.

CLI Commands (use these first)

# Automated health check — runs all layers, returns structured findings
joelclaw gateway diagnose [--hours 1] [--lines 100]

# Session context — what happened recently? Exchanges, tools, errors.
joelclaw gateway review [--hours 1] [--max 20]

Start with diagnose to find the failure layer. Use review to understand what the gateway was doing when it broke. Only drop to manual log reading (below) when the CLI output isn't enough.

Artifact Locations

Artifact	Path	What's in it
Daemon stdout	`/tmp/joelclaw/gateway.log`	Startup info, event flow, responses, fallback messages
Daemon stderr	`/tmp/joelclaw/gateway.err`	Errors, stack traces, retries, fallback activations — check this first
PID file	`/tmp/joelclaw/gateway.pid`	Current daemon process ID
Session ID	`~/.joelclaw/gateway.session`	Current pi session ID
Session transcripts	`~/.joelclaw/sessions/gateway/*.jsonl`	Full pi session history (most recent by mtime)
Gateway working dir	`~/.joelclaw/gateway/`	Has `.pi/settings.json` for compaction config
Launchd plist	`~/Library/LaunchAgents/com.joel.gateway.plist`	Service config, env vars, log paths
Start script	`~/.joelclaw/scripts/gateway-start.sh`	Secret leasing, env setup, bun invocation
Tripwire	`/tmp/joelclaw/last-heartbeat.ts`	Last heartbeat timestamp (updated every 15 min)
WS port	`/tmp/joelclaw/gateway.ws.port`	WebSocket port for TUI attach (default 3018)
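The tripwire row lends itself to a quick freshness check. The following is a minimal sketch, assuming the daemon rewrites the file on every heartbeat so that its modification time tracks the last beat. The function name `heartbeat_stale` and the 30-minute default (two missed 15-minute beats) are illustrative choices, not part of the joelclaw CLI.

```shell
# Hypothetical helper: succeeds (exit 0) when the tripwire file is missing
# or was last modified more than MAX_MIN minutes ago.
# Assumption: the file's mtime reflects the last heartbeat write.
heartbeat_stale() {
  file=$1
  max_min=${2:-30}
  # A missing tripwire counts as stale.
  [ -e "$file" ] || return 0
  # find prints the path only when the mtime is older than max_min minutes.
  [ -n "$(find "$file" -mmin +"$max_min" 2>/dev/null)" ]
}

# Usage:
#   heartbeat_stale /tmp/joelclaw/last-heartbeat.ts 30 && echo "heartbeat stale"
```

A stale tripwire alone does not identify the failing layer; it only suggests the daemon stopped beating, so continue with Layer 0 below.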

Diagnostic Procedure

Run these steps in order. Stop and report at the first failure.

Layer 0: Process Health

# Is the daemon running?
launchctl list | grep gateway
ps aux | grep gateway | grep -v grep

# What's the PID and uptime?
cat /tmp/joelclaw/gateway.pid
# Compare PID to launchctl list output — mismatch = stale PID file

Failure patterns:

PID mismatch between launchctl and PID file → daemon restarted, PID file stale
Exit code non-zero in launchctl → crash loop, check gateway.err
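Process not running but launchctl still lists the job → force a restart with launchctl kickstart -k gui/$(id -u)/com.joel.gateway (service label assumed from the plist filename)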
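The PID comparison above can be partly automated. This is a sketch only: `pid_file_state` is a hypothetical helper, and it checks whether the PID recorded in the file belongs to a live process, which catches a stale file but does not by itself compare against the PID launchctl reports.

```shell
# Hypothetical helper: classify a PID file as "missing", "stale", or "alive".
pid_file_state() {
  pid_file=$1
  # Missing or empty file: nothing to check.
  if [ ! -s "$pid_file" ]; then
    echo "missing"
    return 0
  fi
  pid=$(tr -d '[:space:]' < "$pid_file")
  # kill -0 sends no signal; it only tests whether the process exists
  # and can be signalled by the current user.
  if kill -0 "$pid" 2>/dev/null; then
    echo "alive"
  else
    echo "stale"
  fi
}

# Usage:
#   pid_file_state /tmp/joelclaw/gateway.pid
```

If the result is "alive", still confirm that the PID matches the one shown by launchctl list, since an unrelated process could have reused the number.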

Layer 1: CLI Status

joelclaw gateway status

Check:

redis: "connected" (if not, the Redis pod is down or the port-forward has dropped)
activeSessions — should have gateway with alive: true
pending: 0 — if >0, messages are backing up (session busy or stuck)

Layer 2: Error Log (the money log)

# Default: last 100 lines. Adjust for time range.
tail -100 /tmp/joelclaw/gateway.err

Known error patterns:

Pattern	Meaning	Root Cause
`Agent is already processing`	Command queue tried to prompt while session streaming	Session busy — long turn, compaction, or initialization race
`fallback activated`	Model timeout or consecutive failures triggered model swap	Primary model API down or slow
`no streaming tokens after Ns`	Timeout — prompt dispatched but no response	Model API issue, auth failure, or session not ready
`session still streaming, retrying`	Drain loop retry (3 attempts, 2s each)	Turn taking longer than expected
`watchdog: session appears stuck`	No turn_end for 10+ minutes after prompt	Hung tool call or model hang
`watchdog: session appears dead`	3+ consecutive prompt failures	Triggers self-restart via graceful shutdown
`OTEL emit request failed: TimeoutError`	Typesense unreachable	k8s port-forward or Typesense pod issue (secondary)
`prompt failed` with `consecutiveFailures: N`	Nth failure in a row	Check model API, session state
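To see at a glance which of the patterns in the table dominate a log, a small counting helper can help. This is a sketch under assumptions: `count_error_patterns` is a hypothetical name, and the patterns are matched as fixed strings copied from the table, so a log line that words the message differently will not be counted.

```shell
# Hypothetical helper: count lines matching each known pattern in an error log.
# Output is one "<count><TAB><pattern>" line per pattern.
count_error_patterns() {
  log=$1
  for pattern in \
    "Agent is already processing" \
    "fallback activated" \
    "no streaming tokens after" \
    "session still streaming, retrying" \
    "watchdog: session appears stuck" \
    "watchdog: session appears dead" \
    "OTEL emit request failed: TimeoutError" \
    "prompt failed"
  do
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # helper usable under "set -e".
    count=$(grep -cF -- "$pattern" "$log" 2>/dev/null || true)
    printf '%s\t%s\n' "${count:-0}" "$pattern"
  done
}

# Usage:
#   count_error_patterns /tmp/joelclaw/gateway.err | sort -rn
```

The counts cover the whole file, not a time window. To restrict them to recent activity, write the output of tail to a temporary file first and pass that file instead.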

Layer 3: Stdout Log (event flow)

tail -100 /tmp/joelclaw/gateway.log

Look for:

[gateway] daemon started — last startup time, model, session ID
[gateway:telegram] message received — did the message arrive?
[gateway:store] persisted inbound message — was it persisted?
[gateway:fallback] prompt dispatched — was a prompt sent to the model?
[gateway] response ready — did the model respond?
[gateway:fallback] activated — is fallback model in use?
[redis] suppressed N noise event(s) — which events are being filtered
[gateway:store] replayed unacked messages — startup replay (can cause races)
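The per-message markers in the list above form a rough pipeline: received, persisted, dispatched, answered. Counting each one shows where messages drop out. The sketch below assumes the markers appear in the log exactly as written above; `stage_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: count the per-message markers in pipeline order.
# Output is one "<count><TAB><marker>" line per stage.
stage_counts() {
  log=$1
  for marker in \
    "[gateway:telegram] message received" \
    "[gateway:store] persisted inbound message" \
    "[gateway:fallback] prompt dispatched" \
    "[gateway] response ready"
  do
    # -F treats the brackets literally instead of as a character class.
    count=$(grep -cF -- "$marker" "$log" 2>/dev/null || true)
    printf '%s\t%s\n' "${count:-0}" "$marker"
  done
}

# Usage:
#   stage_counts /tmp/joelclaw/gateway.log
```

A stage whose count is clearly lower than the stage before it is a reasonable place to start looking. The counts are only indicative, because prompts can also originate from the Redis channel rather than Telegram.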

Layer 4: E2E Delivery Test

joelclaw gateway test
sleep 5  # give the drain loop time to process the test event
joelclaw gateway events

Expected: Test event pushed and drained (totalCount: 0 after drain).
Failure: Event stuck in queue → session not draining → check Layer 2 errors.

Layer 5: Session Transcript

# Find the most recent gateway session (newest mtime first)
latest=$(ls -t ~/.joelclaw/sessions/gateway/*.jsonl | head -1)

# Read the last 50 lines of that session's JSONL
tail -50 "$latest"

Each line is a JSON object. Look for:

"type": "turn_end" — confirms turns are completing
"type": "error" — model or tool errors
Long gaps between turn_start and turn_end — slow turns
Tool call entries — what was the session doing when it got stuck?
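A quick way to spot an unfinished turn is to compare the number of turn_start and turn_end entries. The sketch below is text matching, not JSON parsing: it assumes each entry serializes its type as `"type":"..."` with optional spaces after the colon, and `transcript_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: count turn and error entries in a session JSONL file.
# Assumption: the "type" field is serialized as "type":"<name>", with
# optional whitespace after the colon.
transcript_summary() {
  file=$1
  for type in turn_start turn_end error; do
    count=$(grep -cE "\"type\": *\"$type\"" "$file" 2>/dev/null || true)
    printf '%s=%s\n' "$type" "${count:-0}"
  done
}

# Usage:
#   transcript_summary "$(ls -t ~/.joelclaw/sessions/gateway/*.jsonl | head -1)"
```

More turn_start than turn_end entries suggests a turn is still in flight or hung. Because the match is textual, a nested object that happens to contain the same field can inflate the counts; a JSON-aware tool such as jq is stricter.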

Layer 6: OTEL Telemetry

# Gateway-specific events
joelclaw otel search "gateway" --hours 1

# Fallback events
joelclaw otel search "fallback" --hours 1

# Queue events
joelclaw otel search "command-queue" --hours 1

Layer 7: Model API Health

# Quick API reachability test (auth error = API reachable)
curl -s -m 10 https://api.anthropic.com/v1/messages \
  -H "x-api-key: test" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{}' | jq .error.type
# Expected: "authentication_error" (means API is reachable)

Layer 8: Redis State

# Check gateway queue directly
kubectl exec -n joelclaw redis-0 -- redis-cli LLEN joelclaw:notify:gateway

# Check message store
kubectl exec -n joelclaw redis-0 -- redis-cli XLEN gateway:messages

# Sample the 5 oldest entries in the message store (unacked messages replay on restart)
kubectl exec -n joelclaw redis-0 -- redis-cli XRANGE gateway:messages - + COUNT 5

Known Failure Scenarios

1. Session Initialization Race (ADR-0103 era)

Symptoms: "already processing" errors immediately after restart, unacked message replay fails.
Cause: Drain loop processes replayed messages before the pi session finishes initializing.
Fix: Restart clears it. If persistent, check if replayUnacked() runs before session is ready.

2. Model API Timeout

Symptoms: "no streaming tokens after 90s", fallback activated.
Cause: Primary model (claude-opus-4-6) API slow or down.
Fix: Fallback auto-activates. Recovery probe runs every 10 min. If persistent, check Anthropic status.

3. Stuck Tool Call

Symptoms: Watchdog fires after 10 min, session stuck.
Cause: A tool call (bash, read, etc.) hanging indefinitely.
Fix: Watchdog auto-aborts. If the session stays stuck, run joelclaw gateway restart.

4. Redis Disconnection

Symptoms: Status shows redis disconnected, no events flowing.
Cause: Redis pod restart or port-forward dropped.
Fix: Run kubectl get pods -n joelclaw to verify the Redis pod is up. ioredis reconnects automatically once Redis is reachable again.

5. Compaction During Message Delivery

Symptoms: "already processing" after a successful turn_end.
Cause: Auto-compaction triggers after turn_end, session enters streaming state again before drain loop processes next message.
Fix: The idle waiter should block until compaction finishes. If not, this is a pi SDK gap.

Fallback Controller State

The gateway has a model fallback controller (ADR-0091) that swaps models when the primary fails:

Threshold: 90s timeout for first token, or 3 consecutive prompt failures (configurable)
Fallback model: gpt-5.3-codex-spark (via openai-codex provider)
Recovery: Probes primary model every 10 minutes
OTEL events: model_fallback.swapped, model_fallback.primary_restored, model_fallback.probe_failed

Check fallback state in gateway.log: [gateway:fallback] activated / recovered.
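The most recent transition in the log indicates which model is currently active. The sketch below assumes transitions are logged exactly as `[gateway:fallback] activated` and `[gateway:fallback] recovered`, as quoted above; `fallback_state` is a hypothetical helper name.

```shell
# Hypothetical helper: report the most recent fallback transition in a log.
# Prints "fallback", "primary", or "unknown" (no transition logged).
fallback_state() {
  log=$1
  last=$(grep -E '\[gateway:fallback\] (activated|recovered)' "$log" 2>/dev/null | tail -n 1 || true)
  case $last in
    *"[gateway:fallback] activated"*) echo "fallback" ;;
    *"[gateway:fallback] recovered"*) echo "primary" ;;
    *) echo "unknown" ;;
  esac
}

# Usage:
#   fallback_state /tmp/joelclaw/gateway.log
```

A result of "unknown" means the log contains no transition at all, which usually means the primary model has been in use since the log began.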

Architecture Reference

Telegram → channels/telegram.ts → enqueueToGateway()
Redis    → channels/redis.ts    → enqueueToGateway()
                                        ↓
                                 command-queue.ts
                                   (serial FIFO)
                                        ↓
                              session.prompt(text)
                                        ↓
                              pi SDK (isStreaming gate)
                                        ↓
                              Model API (claude-opus-4-6)
                                        ↓
                              turn_end → idleWaiter resolves
                                        ↓
                              Response routed to origin channel

The command queue processes ONE prompt at a time. idleWaiter blocks until turn_end fires. If a prompt is in flight, new messages queue behind it.

Key Code

File	What to look for
`packages/gateway/src/daemon.ts`	Session creation, event handler, idle waiter, watchdog
`packages/gateway/src/command-queue.ts`	`drain()` loop, retry logic, idle gate
`packages/gateway/src/model-fallback.ts`	Timeout tracking, fallback swap, recovery probes
`packages/gateway/src/channels/redis.ts`	Event batching, prompt building, sleep mode
`packages/gateway/src/channels/telegram.ts`	Bot polling, message routing
`packages/gateway/src/heartbeat.ts`	Tripwire writer only (ADR-0103: no prompt injection)

Related Skills

gateway — operational commands (restart, push, drain)
joelclaw-system-check — full system health (broader scope)
k8s — if Redis/Inngest pods are the problem
