"Diagnose gateway failures by reading daemon logs, session transcripts, Redis state, and OTEL telemetry. Full Telegram path triage: daemon process → Redis channel → command queue → pi session → model API → Telegram delivery. Use when: 'gateway broken', 'telegram not working', 'why is gateway down', 'gateway not responding', 'check gateway logs', 'what happened to gateway', 'gateway diagnose', 'gateway errors', 'review gateway logs', 'fallback activated', 'gateway stuck', or any request to understand why the gateway failed. Distinct from the gateway skill (operations) — this skill is diagnostic."
Install
npx skillscat add joelhooks/joelclaw/gateway-diagnose
Install via the SkillsCat registry.
Gateway Diagnosis
Structured diagnostic workflow for the joelclaw gateway daemon. Runs top-down from process health to message delivery, stopping at the first failure layer.
Default time range: 1 hour. Override by asking "check gateway logs for the last 4 hours" or similar.
CLI Commands (use these first)
# Automated health check — runs all layers, returns structured findings
joelclaw gateway diagnose [--hours 1] [--lines 100]
# Session context — what happened recently? Exchanges, tools, errors.
joelclaw gateway review [--hours 1] [--max 20]
Start with diagnose to find the failure layer. Use review to understand what the gateway was doing when it broke. Drop to manual log reading (below) only when the CLI output isn't enough.
Artifact Locations
| Artifact | Path | What's in it |
|---|---|---|
| Daemon stdout | /tmp/joelclaw/gateway.log | Startup info, event flow, responses, fallback messages |
| Daemon stderr | /tmp/joelclaw/gateway.err | Errors, stack traces, retries, fallback activations (check this first) |
| PID file | /tmp/joelclaw/gateway.pid | Current daemon process ID |
| Session ID | ~/.joelclaw/gateway.session | Current pi session ID |
| Session transcripts | ~/.joelclaw/sessions/gateway/*.jsonl | Full pi session history (most recent by mtime) |
| Gateway working dir | ~/.joelclaw/gateway/ | Has .pi/settings.json for compaction config |
| Launchd plist | ~/Library/LaunchAgents/com.joel.gateway.plist | Service config, env vars, log paths |
| Start script | ~/.joelclaw/scripts/gateway-start.sh | Secret leasing, env setup, bun invocation |
| Tripwire | /tmp/joelclaw/last-heartbeat.ts | Last heartbeat timestamp (updated every 15 min) |
| WS port | /tmp/joelclaw/gateway.ws.port | WebSocket port for TUI attach (default 3018) |
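The tripwire row above implies a simple staleness check: if the last heartbeat is much older than the 15-minute update interval, the daemon is probably not ticking. The following is a minimal sketch of that check. The function name and the grace factor of 2 are assumptions for illustration, and it takes timestamps as plain numbers because the tripwire file's format is not documented here.

```typescript
// Interval documented in the table above: the tripwire updates every 15 minutes.
const HEARTBEAT_INTERVAL_MS = 15 * 60 * 1000;

// Hypothetical helper: true when the last heartbeat is older than
// `graceFactor` intervals. The default grace factor of 2 is an assumption.
function isHeartbeatStale(
  lastHeartbeatMs: number,
  nowMs: number,
  graceFactor: number = 2,
): boolean {
  return nowMs - lastHeartbeatMs > HEARTBEAT_INTERVAL_MS * graceFactor;
}
```

With the default grace factor, a heartbeat that is 20 minutes old counts as fresh and one that is 31 minutes old counts as stale.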
Diagnostic Procedure
Run these steps in order. Stop and report at the first failure.
Layer 0: Process Health
# Is the daemon running?
launchctl list | grep gateway
ps aux | grep gateway | grep -v grep
# What's the PID and uptime?
cat /tmp/joelclaw/gateway.pid
# Compare PID to launchctl list output: mismatch = stale PID file
Failure patterns:
- PID mismatch between launchctl and PID file → daemon restarted, PID file stale
- Exit code non-zero in launchctl → crash loop, check gateway.err
- Process not running but launchctl shows it → zombie; restart it with launchctl kickstart -k
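The three failure patterns above can be read as a small decision function. This is a sketch only: the observation fields are hypothetical names for values you read off launchctl list, ps, and the PID file by hand, and none of them are part of the joelclaw CLI.

```typescript
// Values gathered manually from `launchctl list`, `ps`, and the PID file.
// All field names are hypothetical.
interface ProcessObservation {
  launchctlPid: number | null; // PID reported by launchctl, null if none
  lastExitCode: number;        // exit status reported by launchctl
  pidFilePid: number | null;   // contents of /tmp/joelclaw/gateway.pid
  processRunning: boolean;     // did ps find the daemon?
}

type ProcessVerdict = "healthy" | "stale-pid-file" | "crash-loop" | "zombie";

function classifyProcess(o: ProcessObservation): ProcessVerdict {
  // launchctl lists a PID but ps finds nothing: zombie entry.
  if (o.launchctlPid !== null && !o.processRunning) return "zombie";
  // Non-zero exit status: likely crash loop, read gateway.err next.
  if (o.lastExitCode !== 0) return "crash-loop";
  // PIDs disagree: the daemon restarted and the PID file is stale.
  if (o.launchctlPid !== o.pidFilePid) return "stale-pid-file";
  return "healthy";
}
```

The order of the checks is a design choice in this sketch: a zombie entry is reported before a crash loop because it calls for the most direct action.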
Layer 1: CLI Status
joelclaw gateway status
Check:
- redis: "connected". If not, the Redis pod is down.
- activeSessions should include gateway with alive: true.
- pending: 0. If >0, messages are backing up (session busy or stuck).
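The three status checks above can be expressed as one function that returns a list of problems. This sketch assumes a JSON shape for the status output (a redis string, an activeSessions array with id and alive, and a pending count); the real payload may be structured differently, so treat the interface as hypothetical.

```typescript
// Assumed shape of the status output; the real payload may differ.
interface GatewayStatus {
  redis: string;
  activeSessions: { id: string; alive: boolean }[];
  pending: number;
}

// Returns one human-readable finding per failed check, empty when healthy.
function checkStatus(s: GatewayStatus): string[] {
  const problems: string[] = [];
  if (s.redis !== "connected") {
    problems.push("redis not connected: Redis pod may be down");
  }
  const gatewayAlive = s.activeSessions.some(
    (session) => session.id === "gateway" && session.alive,
  );
  if (!gatewayAlive) {
    problems.push("no live gateway session");
  }
  if (s.pending > 0) {
    problems.push(`${s.pending} pending message(s): session busy or stuck`);
  }
  return problems;
}
```

An empty result means Layer 1 passes and the next layer is worth checking.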
Layer 2: Error Log (the money log)
# Default: last 100 lines. Adjust for time range.
tail -100 /tmp/joelclaw/gateway.err
Known error patterns:
| Pattern | Meaning | Root Cause |
|---|---|---|
| Agent is already processing | Command queue tried to prompt while session streaming | Session busy: long turn, compaction, or initialization race |
| fallback activated | Model timeout or consecutive failures triggered model swap | Primary model API down or slow |
| no streaming tokens after Ns | Timeout: prompt dispatched but no response | Model API issue, auth failure, or session not ready |
| session still streaming, retrying | Drain loop retry (3 attempts, 2s each) | Turn taking longer than expected |
| watchdog: session appears stuck | No turn_end for 10+ minutes after prompt | Hung tool call or model hang |
| watchdog: session appears dead | 3+ consecutive prompt failures | Triggers self-restart via graceful shutdown |
| OTEL emit request failed: TimeoutError | Typesense unreachable | k8s port-forward or Typesense pod issue (secondary) |
| prompt failed with consecutiveFailures: N | Nth failure in a row | Check model API, session state |
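When gateway.err is long, tallying the known patterns gives a quick picture of which failure dominates. The sketch below matches the pattern strings from the table above. The short labels are invented for this example, and treating the N placeholders as plain digits is an assumption about the log format.

```typescript
interface KnownPattern {
  match: RegExp;
  label: string; // short label invented for this sketch
}

// Pattern text taken from the table; N placeholders are assumed to be digits.
const KNOWN_PATTERNS: KnownPattern[] = [
  { match: /Agent is already processing/, label: "session busy" },
  { match: /fallback activated/, label: "model swap" },
  { match: /no streaming tokens after \d+s/, label: "prompt timeout" },
  { match: /session still streaming, retrying/, label: "drain retry" },
  { match: /watchdog: session appears stuck/, label: "stuck session" },
  { match: /watchdog: session appears dead/, label: "dead session" },
  { match: /OTEL emit request failed: TimeoutError/, label: "otel timeout" },
  { match: /prompt failed with consecutiveFailures: \d+/, label: "repeated failure" },
];

// Counts how many log lines match each known pattern (first match wins).
function tallyErrors(log: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of log.split("\n")) {
    for (const pattern of KNOWN_PATTERNS) {
      if (pattern.match.test(line)) {
        counts[pattern.label] = (counts[pattern.label] ?? 0) + 1;
        break;
      }
    }
  }
  return counts;
}
```

Feed it the text of the tail output; lines that match nothing are ignored, so unknown errors still need a manual read.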
Layer 3: Stdout Log (event flow)
tail -100 /tmp/joelclaw/gateway.log
Look for:
- [gateway] daemon started: last startup time, model, session ID
- [gateway:telegram] message received: did the message arrive?
- [gateway:store] persisted inbound message: was it persisted?
- [gateway:fallback] prompt dispatched: was a prompt sent to the model?
- [gateway] response ready: did the model respond?
- [gateway:fallback] activated: is the fallback model in use?
- [redis] suppressed N noise event(s): which events are being filtered
- [gateway:store] replayed unacked messages: startup replay (can cause races)
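Four of the markers above trace a single inbound message from arrival to response, so the last one present tells you where delivery stopped. This is a rough sketch: the stage order is inferred from the order of the list, and the check is only meaningful on a slice of the log that covers one message.

```typescript
// Markers from the list above, in the order a message is assumed to pass them.
const STAGES: string[] = [
  "[gateway:telegram] message received",
  "[gateway:store] persisted inbound message",
  "[gateway:fallback] prompt dispatched",
  "[gateway] response ready",
];

// Returns the last consecutive stage marker found, or null if none appear.
function lastStageReached(logSlice: string): string | null {
  let reached: string | null = null;
  for (const stage of STAGES) {
    if (logSlice.indexOf(stage) === -1) break;
    reached = stage;
  }
  return reached;
}
```

If the result is the persisted marker, for example, the message arrived and was stored but no prompt was dispatched, which points at the command queue or session state.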
Layer 4: E2E Delivery Test
joelclaw gateway test
# Wait 5 seconds
joelclaw gateway events
Expected: Test event pushed and drained (totalCount: 0 after drain).
Failure: Event stuck in queue → session not draining → check Layer 2 errors.
Layer 5: Session Transcript
# Find most recent gateway session
ls -lt ~/.joelclaw/sessions/gateway/*.jsonl | head -1
# Read last N lines of the session JSONL
tail -50 ~/.joelclaw/sessions/gateway/<session-file>.jsonl
Each line is a JSON object. Look for:
- "type": "turn_end" (confirms turns are completing)
- "type": "error" (model or tool errors)
- Long gaps between turn_start and turn_end (slow turns)
- Tool call entries (what was the session doing when it got stuck?)
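The transcript checks above can be automated with a short pass over the JSONL text. The sketch below assumes each entry carries a type string and a numeric timestamp in milliseconds. The timestamp field is a guess at the transcript schema, so adjust the field name to whatever the session files actually contain.

```typescript
// Assumed entry shape; the `timestamp` field (ms) is hypothetical.
interface TranscriptEntry {
  type: string;
  timestamp: number;
}

interface TurnReport {
  completed: number; // turn_end entries seen
  errors: number;    // error entries seen
  slowTurns: number; // turns longer than slowMs
  openTurn: boolean; // a turn_start with no matching turn_end at the end
}

function analyzeTranscript(jsonl: string, slowMs: number): TurnReport {
  const report: TurnReport = { completed: 0, errors: 0, slowTurns: 0, openTurn: false };
  let turnStart: number | null = null;
  for (const raw of jsonl.split("\n")) {
    if (raw.trim() === "") continue; // skip blank lines
    const entry = JSON.parse(raw) as TranscriptEntry;
    if (entry.type === "turn_start") {
      turnStart = entry.timestamp;
    } else if (entry.type === "turn_end") {
      report.completed += 1;
      if (turnStart !== null && entry.timestamp - turnStart > slowMs) {
        report.slowTurns += 1;
      }
      turnStart = null;
    } else if (entry.type === "error") {
      report.errors += 1;
    }
  }
  report.openTurn = turnStart !== null;
  return report;
}
```

An openTurn of true at the end of the file is the interesting signal: the session started a turn and never finished it, which matches the stuck-session symptoms described in Layer 2.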
Layer 6: OTEL Telemetry
# Gateway-specific events
joelclaw otel search "gateway" --hours 1
# Fallback events
joelclaw otel search "fallback" --hours 1
# Queue events
joelclaw otel search "command-queue" --hours 1
Layer 7: Model API Health
# Quick API reachability test (auth error = API reachable)
curl -s -m 10 https://api.anthropic.com/v1/messages \
-H "x-api-key: test" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{}' | jq .error.type
# Expected: "authentication_error" (means API is reachable)
Layer 8: Redis State
# Check gateway queue directly
kubectl exec -n joelclaw redis-0 -- redis-cli LLEN joelclaw:notify:gateway
# Check message store
kubectl exec -n joelclaw redis-0 -- redis-cli XLEN gateway:messages
# Peek at the first 5 stored messages (unacked ones replay on restart)
kubectl exec -n joelclaw redis-0 -- redis-cli XRANGE gateway:messages - + COUNT 5
Known Failure Scenarios
1. Session Initialization Race (ADR-0103 era)
Symptoms: "already processing" errors immediately after restart, unacked message replay fails.
Cause: Drain loop processes replayed messages before the pi session finishes initializing.
Fix: Restart clears it. If persistent, check if replayUnacked() runs before session is ready.
2. Model API Timeout
Symptoms: "no streaming tokens after 90s", fallback activated.
Cause: Primary model (claude-opus-4-6) API slow or down.
Fix: Fallback auto-activates. Recovery probe runs every 10 min. If persistent, check Anthropic status.
3. Stuck Tool Call
Symptoms: Watchdog fires after 10 min, session stuck.
Cause: A tool call (bash, read, etc.) hanging indefinitely.
Fix: Watchdog auto-aborts. If the session stays stuck, run joelclaw gateway restart.
4. Redis Disconnection
Symptoms: Status shows redis disconnected, no events flowing.
Cause: Redis pod restart or port-forward dropped.
Fix: Run kubectl get pods -n joelclaw to verify the Redis pod is up. ioredis auto-reconnects.
5. Compaction During Message Delivery
Symptoms: "already processing" after a successful turn_end.
Cause: Auto-compaction triggers after turn_end, session enters streaming state again before drain loop processes next message.
Fix: The idle waiter should block until compaction finishes. If not, this is a pi SDK gap.
Fallback Controller State
The gateway has a model fallback controller (ADR-0091) that swaps models when the primary fails:
- Threshold: 90s timeout for first token, or 3 consecutive prompt failures (configurable)
- Fallback model: gpt-5.3-codex-spark (via openai-codex provider)
- Recovery: Probes primary model every 10 minutes
- OTEL events: model_fallback.swapped, model_fallback.primary_restored, model_fallback.probe_failed
Check fallback state in gateway.log: [gateway:fallback] activated / recovered.
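The thresholds above describe a small state machine. The sketch below illustrates that logic and is not the code in model-fallback.ts: the class and method names are hypothetical, and it assumes a first-token timeout swaps models immediately while ordinary failures must reach three in a row.

```typescript
// Thresholds from the list above.
const FIRST_TOKEN_TIMEOUT_MS = 90 * 1000;
const MAX_CONSECUTIVE_FAILURES = 3;
const PROBE_INTERVAL_MS = 10 * 60 * 1000;

// Hypothetical sketch of the fallback controller's state.
class FallbackSketch {
  private onFallback = false;
  private failures = 0;
  private lastProbeMs = 0;

  usingFallback(): boolean {
    return this.onFallback;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= MAX_CONSECUTIVE_FAILURES) this.swap(nowMs);
  }

  // Call with how long the prompt waited for its first token.
  recordFirstTokenWait(waitedMs: number, nowMs: number): void {
    if (waitedMs >= FIRST_TOKEN_TIMEOUT_MS) this.swap(nowMs);
  }

  // True when on fallback and a full probe interval has passed.
  shouldProbe(nowMs: number): boolean {
    return this.onFallback && nowMs - this.lastProbeMs >= PROBE_INTERVAL_MS;
  }

  recordProbe(primaryHealthy: boolean, nowMs: number): void {
    this.lastProbeMs = nowMs;
    if (primaryHealthy) {
      this.onFallback = false;
      this.failures = 0;
    }
  }

  private swap(nowMs: number): void {
    this.onFallback = true;
    this.lastProbeMs = nowMs; // first probe is due one interval after the swap
  }
}
```

Read against the logs, a swap corresponds to the model_fallback.swapped event, a failed probe to model_fallback.probe_failed, and a successful probe to model_fallback.primary_restored.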
Architecture Reference
Telegram → channels/telegram.ts → enqueueToGateway()
Redis → channels/redis.ts → enqueueToGateway()
↓
command-queue.ts
(serial FIFO)
↓
session.prompt(text)
↓
pi SDK (isStreaming gate)
↓
Model API (claude-opus-4-6)
↓
turn_end → idleWaiter resolves
↓
Response routed to origin channel
The command queue processes ONE prompt at a time. idleWaiter blocks until turn_end fires. If a prompt is in flight, new messages queue behind it.
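The one-prompt-at-a-time rule can be illustrated with a small serial queue. This is a sketch of the gating idea only, not the code in command-queue.ts: the class is hypothetical, and it uses an explicit onTurnEnd callback in place of the promise-based idleWaiter so the behavior is easy to follow step by step.

```typescript
// Hypothetical serial FIFO: at most one prompt in flight at any time.
class SerialQueueSketch {
  private pending: string[] = [];
  private inFlight: string | null = null;
  private send: (text: string) => void;

  // `send` stands in for session.prompt(text).
  constructor(send: (text: string) => void) {
    this.send = send;
  }

  enqueue(text: string): void {
    this.pending.push(text);
    this.pump();
  }

  // Call when the session reports turn_end.
  onTurnEnd(): void {
    this.inFlight = null;
    this.pump();
  }

  private pump(): void {
    if (this.inFlight !== null) return; // a turn is running: leave messages queued
    const next = this.pending.shift();
    if (next === undefined) return;     // nothing waiting
    this.inFlight = next;
    this.send(next);
  }
}
```

Enqueueing three messages sends only the first; each onTurnEnd call releases the next. An "already processing" error in the real gateway means something prompted the session while this kind of gate should have been closed.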
Key Code
| File | What to look for |
|---|---|
| packages/gateway/src/daemon.ts | Session creation, event handler, idle waiter, watchdog |
| packages/gateway/src/command-queue.ts | drain() loop, retry logic, idle gate |
| packages/gateway/src/model-fallback.ts | Timeout tracking, fallback swap, recovery probes |
| packages/gateway/src/channels/redis.ts | Event batching, prompt building, sleep mode |
| packages/gateway/src/channels/telegram.ts | Bot polling, message routing |
| packages/gateway/src/heartbeat.ts | Tripwire writer only (ADR-0103: no prompt injection) |
Related Skills
- gateway — operational commands (restart, push, drain)
- joelclaw-system-check — full system health (broader scope)
- k8s — if Redis/Inngest pods are the problem