pueue-job-orchestration

Pueue universal CLI telemetry and job orchestration. TRIGGERS - run on bigblack, run on littleblack, queue job, long-running task, cache population, batch processing, GPU workstation, pueue callback, pueue delay, pueue priority.

terrylica 59 8 Updated 5mo ago

Resources

GitHub

Install

npx skillscat add terrylica/cc-skills/pueue-job-orchestration

Install via the SkillsCat registry.

SKILL.md

Pueue Job Orchestration

Universal CLI telemetry layer and job management — every command routed through pueue gets precise timing, exit code capture, full stdout/stderr logs, environment snapshots, and callback-on-completion.

Overview

Pueue is a Rust CLI tool for managing shell command queues. It provides:

Daemon persistence - Survives SSH disconnects, crashes, reboots
Disk-backed queue - Auto-resumes after any failure
Group-based parallelism - Control concurrent jobs per group
Easy failure recovery - Restart failed jobs with one command
Full telemetry - Timing, exit codes, stdout/stderr logs, env snapshots per task

When to Route Through Pueue

Operation	Route Through Pueue?	Why
Any command >30 seconds	Always	Telemetry, persistence, log capture
Batch operations (>3 items)	Always	Parallelism control, failure isolation
Build/test pipelines	Recommended	`--after` DAGs, group monitoring
Data processing	Always	Checkpoint resume, state management
Quick one-off commands (<5s)	Optional	Overhead is ~100ms, but you get logs
Interactive commands (editors, REPLs)	Never	Pueue can't handle stdin interaction

When to Use This Skill

Use this skill when the user mentions:

Trigger	Example
Running tasks on BigBlack/LittleBlack	"Run this on bigblack"
Long-running data processing	"Populate the cache for all symbols"
Batch/parallel operations	"Process these 70 jobs"
SSH remote execution	"Execute this overnight on the GPU server"
Cache population	"Fill the ClickHouse cache"
Pueue features	"Set up a callback", "delay this job"

Quick Reference

Check Status

# Local
pueue status

# Remote (BigBlack)
ssh bigblack "~/.local/bin/pueue status"

Queue a Job

# Local (with working directory)
pueue add -w ~/project -- python long_running_script.py

# Local (simple)
pueue add -- python long_running_script.py

# Remote (BigBlack)
ssh bigblack "~/.local/bin/pueue add -w ~/project -- uv run python script.py"

# With group (for parallelism control)
pueue add --group p1 --label "BTCUSDT@1000" -w ~/project -- python populate.py --symbol BTCUSDT

Monitor Jobs

pueue follow <id>         # Watch job output in real-time
pueue log <id>            # View completed job output
pueue log <id> --full     # Full output (not truncated)

Manage Jobs

pueue restart <id>        # Restart failed job
pueue restart --all-failed # Restart ALL failed jobs
pueue kill <id>           # Kill running job
pueue clean               # Remove completed jobs from list
pueue reset               # Clear all jobs (use with caution)

Host Configuration

Host	Location	Parallelism Groups
BigBlack	`~/.local/bin/pueue`	p1 (16), p2 (2), p3 (3), p4 (1)
LittleBlack	`~/.local/bin/pueue`	default (2)
Local (macOS)	`/opt/homebrew/bin/pueue`	default

Workflows

1. Queue Single Remote Job

# Step 1: Verify daemon is running
ssh bigblack "~/.local/bin/pueue status"

# Step 2: Queue the job
ssh bigblack "~/.local/bin/pueue add --label 'my-job' -- cd ~/project && uv run python script.py"

# Step 3: Monitor progress
ssh bigblack "~/.local/bin/pueue follow <id>"

2. Batch Job Submission (Multiple Symbols)

For rangebar cache population or similar batch operations:

# Use the pueue-populate.sh script
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh setup"   # One-time
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh phase1"  # Queue Phase 1
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh status"  # Check progress

3. Configure Parallelism Groups

# Create groups with different parallelism limits
pueue group add fast      # Create 'fast' group
pueue parallel 4 --group fast  # Allow 4 parallel jobs

pueue group add slow
pueue parallel 1 --group slow  # Sequential execution

# Queue jobs to specific groups
pueue add --group fast -- echo "fast job"
pueue add --group slow -- echo "slow job"

4. Handle Failed Jobs

# Check what failed
pueue status | grep Failed

# View error output
pueue log <id>

# Restart specific job
pueue restart <id>

# Restart all failed jobs
pueue restart --all-failed

Installation

macOS (Local)

brew install pueue
pueued -d  # Start daemon

Linux (BigBlack/LittleBlack)

# Download from GitHub releases (see https://github.com/Nukesor/pueue/releases for latest)
curl -sSL https://raw.githubusercontent.com/terrylica/rangebar-py/main/scripts/setup-pueue-linux.sh | bash

# Or manually:
# SSoT-OK: Version from GitHub releases page
PUEUE_VERSION="v4.0.2"
curl -sSL "https://github.com/Nukesor/pueue/releases/download/${PUEUE_VERSION}/pueue-x86_64-unknown-linux-musl" -o ~/.local/bin/pueue
curl -sSL "https://github.com/Nukesor/pueue/releases/download/${PUEUE_VERSION}/pueued-x86_64-unknown-linux-musl" -o ~/.local/bin/pueued
chmod +x ~/.local/bin/pueue ~/.local/bin/pueued

# Start daemon
~/.local/bin/pueued -d

Systemd Auto-Start (Linux)

mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/pueued.service << 'EOF'
[Unit]
Description=Pueue Daemon
After=network.target

[Service]
ExecStart=%h/.local/bin/pueued -v
Restart=on-failure

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now pueued

Integration with rangebar-py

The rangebar-py project has Pueue integration scripts:

Script	Purpose
`scripts/pueue-populate.sh`	Queue cache population jobs with group-based parallelism
`scripts/setup-pueue-linux.sh`	Install Pueue on Linux servers
`scripts/populate_full_cache.py`	Python script for individual symbol/threshold jobs

Phase-Based Execution

# Phase 1: 1000 dbps (fast, 4 parallel)
./scripts/pueue-populate.sh phase1

# Phase 2: 250 dbps (moderate, 2 parallel)
./scripts/pueue-populate.sh phase2

# Phase 3: 500, 750 dbps (3 parallel)
./scripts/pueue-populate.sh phase3

# Phase 4: 100 dbps (resource intensive, 1 at a time)
./scripts/pueue-populate.sh phase4

Troubleshooting

Issue	Cause	Solution
`pueue: command not found`	Not in PATH	Use full path: `~/.local/bin/pueue`
`Connection refused`	Daemon not running	Start with `pueued -d`
Jobs stuck in Queued	Group paused or at limit	Check `pueue status`, `pueue start`
SSH disconnect kills jobs	Not using Pueue	Queue via Pueue instead of direct SSH
Job fails immediately	Wrong working directory	Use `pueue add -w /path` or `cd /path && pueue add`

Production Lessons (Issue #88)

Battle-tested patterns from real production deployments.

Dependency Chaining with `--after`

Pueue supports automatic job dependency resolution via --after. This is critical for post-processing pipelines where steps must run sequentially after batch jobs complete.

Key flags:

--after <id>... -- Start job only after ALL specified jobs succeed. If any dependency fails, this job fails too.
--print-task-id (or -p) -- Return only the numeric job ID (for scripting).

Pattern: Capturing job IDs for dependency wiring

# Capture job IDs during batch submission
JOB_IDS=()
for symbol in BTCUSDT ETHUSDT; do
    job_id=$(cd /path/to/project && pueue add --print-task-id --group mygroup \
        --label "${symbol}@250" \
        -- uv run python scripts/process.py --symbol "$symbol")
    JOB_IDS+=("$job_id")
done

# Chain post-processing after ALL batch jobs
optimize_id=$(pueue add --print-task-id --group mygroup \
    --label "optimize-table" \
    --after "${JOB_IDS[@]}" \
    -- clickhouse-client --query "OPTIMIZE TABLE mydb.mytable FINAL")

# Chain validation after optimize
pueue add --group mygroup \
    --label "validate" \
    --after "$optimize_id" \
    -- uv run python scripts/validate.py

Result in pueue status:

Job 0  BTCUSDT@250    Running
Job 1  ETHUSDT@250    Running
Job 2  optimize-table Queued  Deps: 0, 1
Job 3  validate       Queued  Deps: 2

When to use --after:

Post-processing steps (OPTIMIZE TABLE, validation scripts, cleanup)
Multi-stage pipelines where Stage N depends on Stage N-1
Verification jobs that should only run after data is fully written

Anti-pattern: Manual waiting

# BAD: Manual polling or instructions to "run this after that finishes"
postprocess_all() {
    queue_repopulation_jobs
    echo "Run 'pueue wait --group postfix' then run optimize manually"  # NO!
}

# GOOD: Automatic dependency chain
postprocess_all() {
    queue_repopulation_jobs  # captures JOB_IDS
    pueue add --after "${JOB_IDS[@]}" -- optimize_command
    pueue add --after "$optimize_id" -- validate_command
}

Mise Task to Pueue Pipeline Integration

Pattern for mise run commands that build pueue DAGs:

# .mise/tasks/cache.toml
["cache:postprocess-all"]
description = "Full post-fix pipeline via pueue: repopulate -> optimize -> detect (auto-chained)"
run = "./scripts/pueue-populate.sh postprocess-all"

The shell script captures pueue job IDs and chains them with --after. Mise provides the entry point; pueue provides the execution engine with dependency resolution.

Forensic Audit Before Deployment

ALWAYS audit the remote host before mutating anything:

# 1. Pueue job state
ssh host 'pueue status'
ssh host 'pueue status --json | python3 -c "import json,sys; d=json.load(sys.stdin); print(sum(1 for t in d[\"tasks\"].values() if \"Running\" in str(t[\"status\"])))"'

# 2. Database state (ClickHouse example)
ssh host 'clickhouse-client --query "SELECT symbol, threshold, count(), countIf(volume < 0) FROM mytable GROUP BY ALL"'

# 3. Checkpoint state
ssh host 'ls -la ~/.cache/myapp/checkpoints/'
ssh host 'cat ~/.cache/myapp/checkpoints/latest.json'

# 4. System resources
ssh host 'uptime && free -h && df -h /home'

# 5. Installed version
ssh host 'cd ~/project && git log --oneline -1'

Force-Refresh vs Checkpoint Resume

Decision matrix for restarting killed/failed jobs:

Scenario	Action	Flag
Job killed mid-run, data is clean	Resume from checkpoint	(no --force-refresh)
Data is corrupt (overflow, schema bug)	Wipe and restart	--force-refresh
Code fix changes output format	Wipe and restart	--force-refresh
Code fix is internal-only (no output change)	Resume from checkpoint	(no --force-refresh)

PATH Gotcha: Rust Not in PATH via `uv run`

On remote hosts, uv run maturin develop may fail because ~/.cargo/bin is not in uv run's PATH:

# FAILS: rustc not found
ssh host 'cd ~/project && uv run maturin develop --uv'

# WORKS: Prepend cargo bin to PATH
ssh host 'cd ~/project && PATH="$HOME/.cargo/bin:$PATH" uv run maturin develop --uv'

For pueue jobs that need Rust compilation:

pueue add -- env PATH="/home/user/.cargo/bin:$PATH" uv run maturin develop

Per-Year (Epoch) Parallelization — DEFAULT STRATEGY

This is the default approach for all multi-year cache population. Never queue a monolithic multi-year job when epoch boundaries exist. A single DOGEUSDT@500 job estimated 22 days; per-year splits brought it to ~3-4 days with 4 parallel cores.

When a processing pipeline has natural reset boundaries (yearly, monthly, etc.) where processor state resets, each epoch becomes an independent processing unit. This enables massive speedup by splitting a multi-year sequential job into concurrent per-year pueue jobs.

Why it's safe (three isolation layers):

Layer	Why No Conflicts
Checkpoint files	Filename includes `{start}_{end}` — each year gets unique file
Database writes	INSERT is append-only; `OPTIMIZE TABLE FINAL` deduplicates after
Source data	Read-only files (Parquet, CSV, etc.) — no write contention

Pattern: Per-symbol pueue groups

Give each symbol (or job family) its own pueue group for independent parallelism control:

# Create per-symbol groups
pueue group add btc-yearly --parallel 4
pueue group add eth-yearly --parallel 4
pueue group add shib-yearly --parallel 4

# Queue per-year jobs
for year in 2019 2020 2021 2022 2023 2024 2025 2026; do
    pueue add --group btc-yearly \
        --label "BTC@250:${year}" \
        -- uv run python scripts/process.py \
        --symbol BTCUSDT --threshold 250 \
        --start-date "${year}-01-01" --end-date "${year}-12-31"
done

# Chain post-processing after ALL groups complete
ALL_JOB_IDS=($(pueue status --json | jq -r \
    '.tasks | to_entries[] | select(.value.group | test("-yearly$")) | .value.id'))
pueue add --after "${ALL_JOB_IDS[@]}" \
    --label "optimize-table:final" \
    -- clickhouse-client --query "OPTIMIZE TABLE mydb.mytable FINAL"

When to use per-year vs sequential:

Scenario	Approach
High-volume symbol (many output items)	Per-year (5+ cores idle)
Low-volume symbol (fast enough already)	Sequential (simpler)
Single parameter, long backfill	Per-year
Multiple parameters, same symbol	Sequential per parameter

Critical rules:

Working directory: Use pueue add -w ~/project (preferred) or cd ~/project && pueue add — SSH cwd defaults to $HOME, not the project directory. Jobs fail instantly with No such file or directory if this is missed. Note: on macOS, -w /tmp resolves to /private/tmp (symlink).
First year uses domain-specific effective start date, not 01-01
Last year uses actual latest available date as end
Chain OPTIMIZE TABLE FINAL after ALL year-jobs via --after
Memory budget: each job peaks independently — with 61 GB total, 4-5 concurrent jobs at 5 GB each are safe
No --force-refresh on per-year jobs when other year-jobs for the same symbol are running — it deletes cached bars by date range and can conflict with concurrent writes.

Pipeline Monitoring (Group-Based Phase Detection)

For multi-group pipelines, monitor job phases by group completion, not hardcoded job IDs. Job IDs change when jobs are removed, re-queued, or split into per-year jobs.

Anti-pattern: Hardcoded job IDs in monitors

# WRONG: Breaks when jobs are removed/re-queued
job14=$(echo "$JOBS" | grep "^14|")
if [ "$(echo "$job14" | cut -d'|' -f2)" = "Done" ]; then
    echo "Phase 1 complete"
fi

Correct pattern: Dynamic group detection

get_job_status() {
    ssh host "pueue status --json 2>/dev/null" | jq -r \
        '.tasks | to_entries[] |
         "\(.value.id)|\(.value.status | if type == "object" then keys[0] else . end)|\(.value.label // "-")|\(.value.group)"'
}

group_all_done() {
    local group="$1"
    local group_jobs
    group_jobs=$(echo "$JOBS" | grep "|${group}$" || true)
    [ -z "$group_jobs" ] && return 1
    echo "$group_jobs" | grep -qE "\|(Running|Queued)\|" && return 1
    return 0
}

# Detect phase transitions by group name
SEEN_GROUPS=""
for group in $(echo "$JOBS" | cut -d'|' -f4 | sort -u); do
    if group_all_done "$group" && [[ "$SEEN_GROUPS" != *"|${group}|"* ]]; then
        echo "GROUP COMPLETE: $group"
        run_integrity_checks "$group"
        SEEN_GROUPS="${SEEN_GROUPS}|${group}|"
    fi
done

Integrity checks at phase boundaries:

Run automated validation when a group finishes, before starting the next phase:

run_integrity_checks() {
    local phase="$1"
    # Check 1: Data corruption (negative values, out-of-bounds)
    ssh host 'clickhouse-client --query "SELECT ... countIf(value < 0) ... HAVING count > 0"'
    # Check 2: Duplicate rows
    ssh host 'clickhouse-client --query "SELECT ... count(*) - uniqExact(key) as dupes HAVING dupes > 0"'
    # Check 3: Coverage gaps (NULL required fields)
    ssh host 'clickhouse-client --query "SELECT ... countIf(field IS NULL) ... HAVING missing > 0"'
    # Check 4: System resources (load, memory)
    ssh host 'uptime && free -h'
}

Monitoring as a background loop:

POLL_INTERVAL=300  # 5 minutes
while true; do
    JOBS=$(get_job_status)
    # Count statuses, detect failures, detect group completions
    # Run integrity checks at phase boundaries
    # Exit when all jobs complete
    sleep "$POLL_INTERVAL"
done

State File Management (CRITICAL)

Pueue stores ALL task metadata in a single state.json file. This file grows with every completed task and is read/written on EVERY pueue add call. Neglecting state hygiene is the #1 cause of slow job submission in large sweeps.

The State Bloat Anti-Pattern

Symptom: pueue add takes 1-2 seconds instead of <100ms.

Root cause: Pueue serializes/deserializes the entire state file on every operation. With 50K+ completed tasks, state.json grows to 80-100MB. Each pueue add becomes 80MB read + 80MB write = 160MB I/O.

Benchmarks (pueue v4, NVMe SSD, 32-core Linux):

Completed Tasks	state.json Size	`pueue add` Latency (sequential)	`pueue add` Latency (xargs -P16)
53,000	94 MB	1,300 ms/add	455 ms/add (mutex contention)
0 (after clean)	245 KB	106 ms/add	8 ms/add (effective)

Key insight: Parallelism does NOT help when state is bloated — the pueue daemon serializes all operations through a mutex. The 455ms at P16 is WORSE per-operation than 1,300ms sequential because of lock contention overhead. Clean first, then parallelize.

Pre-Submission Clean (Mandatory Pattern)

Before any bulk submission (>100 jobs), clean completed tasks:

# ALWAYS clean before bulk submission
pueue clean -g mygroup 2>/dev/null || true

# Verify state is manageable
STATE_FILE="$HOME/.local/share/pueue/state.json"
STATE_SIZE=$(stat -c%s "$STATE_FILE" 2>/dev/null || stat -f%z "$STATE_FILE" 2>/dev/null || echo 0)
if [ "$STATE_SIZE" -gt 52428800 ]; then  # 50MB
    echo "WARNING: state.json is $(( STATE_SIZE / 1048576 ))MB — running extra clean"
    pueue clean 2>/dev/null || true
fi

Periodic Clean During Long Sweeps

For sweeps with 100K+ jobs, clean periodically between submission batches:

BATCH_SIZE=5000
POS=0
while [ "$POS" -lt "$TOTAL" ]; do
    # Submit batch
    tail -n +$((POS + 1)) "$CMDFILE" | head -n "$BATCH_SIZE" | \
        xargs -P16 -I{} bash -c '{}' 2>/dev/null || true
    POS=$((POS + BATCH_SIZE))

    # Prevent state bloat between batches
    pueue clean -g mygroup 2>/dev/null || true
done

Bulk Submission with xargs -P (High-Throughput Pattern)

For large job counts (1K+), submitting one pueue add at a time via SSH is prohibitively slow. Use a batch command file fed through xargs -P for parallel submission.

Why Not GNU Parallel?

CRITICAL: Many Linux hosts (including Ubuntu/Debian) ship with moreutils parallel, NOT GNU Parallel. They share the binary name /usr/bin/parallel but are completely different tools:

Feature	GNU Parallel	moreutils parallel
Job file	`--jobs 16 --bar < commands.txt`	Not supported
Progress bar	`--bar`, `--eta`	None
Resume	`--resume --joblog log.txt`	Not supported
Syntax	`parallel ::: arg1 arg2`	`parallel -- cmd1 -- cmd2`
`--version` output	`GNU parallel YYYY`	`parallel from moreutils`

Detection:

if parallel --version 2>&1 | grep -q 'GNU'; then
    echo "GNU Parallel available"
else
    echo "moreutils parallel (or none) — use xargs -P instead"
fi

Safe default: Always use xargs -P — it's POSIX standard and available everywhere.

Batch Command File Pattern

Step 1: Generate commands file (one pueue add per line):

# gen_commands.sh — generates commands.txt
for SQL_FILE in /tmp/sweep_sql/*.sql; do
    echo "pueue add -g p1 -- /tmp/run_job.sh '${SQL_FILE}' '${LOG_FILE}'"
done > /tmp/commands.txt
echo "Generated $(wc -l < /tmp/commands.txt) commands"

Step 2: Feed via xargs -P (parallel submission):

# Submit in batches with periodic state cleanup
BATCH=5000
P=16
TOTAL=$(wc -l < /tmp/commands.txt)
POS=0

while [ "$POS" -lt "$TOTAL" ]; do
    tail -n +$((POS + 1)) /tmp/commands.txt | head -n "$BATCH" | \
        xargs -P"$P" -I{} bash -c '{}' 2>/dev/null || true
    POS=$((POS + BATCH))

    # Clean between batches to prevent state bloat
    pueue clean -g p1 2>/dev/null || true

    QUEUED=$(pueue status -g p1 --json 2>/dev/null | python3 -c \
        "import json,sys; d=json.load(sys.stdin); print(sum(1 for t in d.get('tasks',{}).values() if 'Queued' in str(t.get('status',''))))" 2>/dev/null || echo "?")
    echo "Batch: ${POS}/${TOTAL} | Queued: ${QUEUED}"
done

Crash Recovery with Skip-Done

For idempotent resubmission after SSH drops or crashes:

# Build done-set from existing JSONL output
declare -A DONE_SET
for logfile in /tmp/sweep_*.jsonl; do
    while IFS= read -r config_id; do
        DONE_SET["${config_id}"]=1
    done < <(jq -r '.feature_config // empty' "$logfile" 2>/dev/null | sort -u)
done

# Generate commands, skipping completed configs
for SQL_FILE in /tmp/sweep_sql/*.sql; do
    CONFIG_ID=$(basename "$SQL_FILE" .sql)
    if [ "${DONE_SET[${CONFIG_ID}]+_}" ]; then
        continue  # Already completed
    fi
    echo "pueue add -g p1 -- /tmp/run_job.sh '${SQL_FILE}' '${LOG_FILE}'"
done > /tmp/commands.txt

Requirements: bash 4+ for associative arrays (declare -A).

Two-Tier Architecture (300K+ Jobs)

For sweeps exceeding 10K queries, the single-tier "pueue add per query" pattern is unusable — pueue add has 148ms overhead per call even with clean state (= 8+ hours for 196K jobs). The fix is eliminating pueue add at the query level entirely.

Architecture

macOS (local)
  mise run gen:generate   → N SQL files
  mise run gen:submit-all → rsync + queue M pueue units
  mise run gen:collect    → scp + validate JSONL

BigBlack (remote)
  pueue group p1 (parallel=1)   ← sequential units (avoid log contention)
    ├── Unit 1: submit_unit.sh pattern1 BTCUSDT 750
    │     └── xargs -P16 → K queries (direct clickhouse-client, no pueue add)
    ├── Unit 2: submit_unit.sh pattern1 BTCUSDT 1000
    │     └── xargs -P16 → K queries
    └── ... (M total units)

Key Principles

Principle	Rationale
Pueue at unit level (100s of tasks)	Crash recovery per unit, `pueue status` readable
xargs -P16 at query level (1000s per unit)	Zero overhead, direct process execution
Sequential units (`parallel=1`)	Each unit appends to one JSONL file via `flock` — parallel units would contend
Skip-done dedup inside each unit	`comm -23` on sorted config lists (O(N+M))

When to Use Each Tier

Job Count	Pattern
1-10	Direct `pueue add` per job
10-1K	Batch `pueue add` via xargs -P (see "Bulk Submission" section above)
1K-10K	Batch `pueue add` with periodic `pueue clean` between batches
10K+	Two-tier: pueue per unit + xargs -P per query (this section)

Shell Script Safety (set -euo pipefail)

Trap	Symptom	Fix
SIGPIPE (exit 141)	`ls path/*.sql \| head -10` — `head` closes pipe early	Write to temp file first, or use `find -print0 \| head -z`
Pipe subshell data loss	`echo "$OUT" \| while read ...; done > file` — writes lost in subshell	Process substitution: `while read ...; done < <(echo "$OUT")`
eval injection	`eval "val=\$$var"` with untrusted input	Use `case` statement or parameter expansion instead

Skipped Config NDJSON Pattern

Configs with 0 signals after feature filtering produce 1 JSONL line (skipped entry), not N barrier lines. This is correct behavior, not data loss.

When validating line counts:

expected_lines = (N_normal × barriers_per_query) + (N_skipped × 1) + (N_error × 1)

Example: 95 normal configs × 3 barriers + 5 skipped × 1 = 290 lines (not 300).

comm -23 for Large Skip-Done Sets (100K+)

For done-sets exceeding 10K entries, comm -23 (sorted set difference) is O(N+M) vs grep-per-file O(N×M):

# Build sorted done-set from JSONL
python3 -c "
import json
seen = set()
for line in open('\${LOG_FILE}'):
    try:
        d = json.loads(line)
        fc = d.get('feature_config','')
        if fc: seen.add(fc)
    except: pass
for s in sorted(seen): print(s)
" > /tmp/done.txt

# Build sorted all-configs, compute set difference
ls \${DIR}/*.sql | xargs -n1 basename | sed 's/\.sql$//' | sort > /tmp/all.txt
comm -23 /tmp/all.txt /tmp/done.txt > /tmp/todo.txt

# Submit remaining via xargs
cat /tmp/todo.txt | while read C; do echo "\${DIR}/\${C}.sql"; done | \
    xargs -P16 -I{} bash /tmp/wrapper.sh {} \${LOG} \${SYM} \${THR} \${GIT}

ClickHouse Parallelism Tuning (pueue + ClickHouse)

When using pueue to orchestrate ClickHouse queries, the interaction between pueue parallelism and ClickHouse's thread scheduler determines actual throughput.

The Thread Soft Limit

ClickHouse has a concurrent_threads_soft_limit_ratio_to_cores setting (default: 2). On a 32-core machine, this means ClickHouse allows 64 concurrent execution threads total, regardless of how many queries are running.

Each query requests max_threads threads (default: auto = nproc = 32 on a 32-core machine). With 8 parallel queries each requesting 32 threads (= 256 requested), ClickHouse throttles to 64 actual threads. The queries get ~8 effective threads each, not 32.

Right-Size `max_threads` Per Query

Anti-pattern: Letting each query request 32 threads when it only gets 8 effective threads. This creates scheduling overhead for no benefit.

Fix: Set --max_threads to match the effective thread count:

# In the job wrapper script:
clickhouse-client --max_threads=8 --multiquery < "$SQL_FILE"

This reduces thread scheduling overhead and allows higher pueue parallelism without oversubscription.

Parallelism Sizing Formula

effective_threads_per_query = concurrent_threads_soft_limit / pueue_parallel_slots
concurrent_threads_soft_limit = nproc * concurrent_threads_soft_limit_ratio_to_cores

# Example: 32-core machine, ratio=2, soft_limit=64
# 8 pueue slots  → 8 effective threads/query  → ~55% CPU (baseline)
# 16 pueue slots → 4 effective threads/query  → ~87% CPU (1.5-1.8x throughput)
# 24 pueue slots → 2-3 effective threads/query → ~95% CPU (diminishing returns)

Decision Matrix

Dimension	Check	Safe Threshold
Memory	p99 per-query × N slots < server memory limit	< 50% of `max_server_memory_usage`
CPU	Load average < 90% of nproc	load < 0.9 × nproc
I/O	`iostat` disk utilization	< 70%
Swap	`vmstat` si/so columns	Must be 0
CH errors	`system.query_log` ExceptionWhileProcessing	Must be 0

Live Tuning (No Restart Required)

Pueue parallelism can be changed live — running jobs finish with old settings, new jobs use the new limit:

# Check current
pueue group | grep mygroup

# Bump up
pueue parallel 16 -g mygroup

# Monitor for 2-3 minutes, then check
uptime                    # Load average
free -h                   # Memory
vmstat 1 3                # Swap (si/so = 0?)
clickhouse-client --query "SELECT count() FROM system.query_log
    WHERE event_time > now() - INTERVAL 5 MINUTE
    AND type = 'ExceptionWhileProcessing'"  # Errors = 0?

Callback Hooks (Completion Notifications)

Pueue fires a callback command on every task completion. Configure in pueue.yml:

daemon:
  callback: 'curl -s -X POST https://hooks.example.com/pueue -d ''{"id":{{id}},"result":"{{result}}","exit_code":{{exit_code}},"command":"{{command}}"}'''
  callback_log_lines: 10 # Lines of stdout/stderr available in {{output}}

Template Variables (14 total, Handlebars syntax)

Variable	Type	Description
`{{id}}`	int	Task ID
`{{command}}`	string	The command that was run
`{{path}}`	string	Working directory
`{{group}}`	string	Group name
`{{result}}`	string	`Success`, `Failed`, `Killed`, `DependencyFailed`
`{{exit_code}}`	string	`0` on success, error code on failure, `None` otherwise
`{{start}}`	string	Unix timestamp of start time
`{{end}}`	string	Unix timestamp of end time
`{{output}}`	string	Last N lines of stdout/stderr (see `callback_log_lines`)
`{{output_path}}`	string	Full path to log file on disk
`{{queued_count}}`	string	Remaining queued tasks in this group
`{{stashed_count}}`	string	Remaining stashed tasks in this group

Production Examples

# File-based sentinel (for script polling)
callback: "echo '{{id}}:{{result}}:{{exit_code}}' >> /tmp/pueue-completions.log"

# Telegram notification
callback: "curl -s 'https://api.telegram.org/bot${BOT_TOKEN}/sendMessage?chat_id=${CHAT_ID}&text=Job%20{{id}}%20{{result}}%20(exit%20{{exit_code}})'"

# Conditional alert (only on failure)
callback: "/bin/bash -c 'if [ \"{{result}}\" != \"Success\" ]; then echo \"FAILED: {{command}}\" | mail -s \"Pueue Alert\" user@example.com; fi'"

Config File Location (Platform Difference)

Platform	Config Path
macOS	`~/Library/Application Support/pueue/pueue.yml`
Linux	`~/.config/pueue/pueue.yml`

See Pueue Config Reference for all settings.

Delayed Scheduling (`--delay`)

Queue a job that starts after a specified delay:

# Relative time
pueue add --delay 3h -- python heavy_computation.py

# Natural language
pueue add --delay "next wednesday 5pm" -- python weekly_report.py

# RFC 3339
pueue add --delay "2026-03-01T02:00:00" -- python overnight_batch.py

Stashed + Delay Combo

Create stashed jobs that auto-enqueue at a future time:

# Stash now, auto-enqueue in 2 hours
pueue add --stashed --delay 2h -- python populate_cache.py

Patterns

Pattern	Command
Off-peak batch scheduling	`pueue add --delay "2am" -- python heavy_etl.py`
Staggered thundering-herd prevention	`pueue add --delay "${i}s" -- curl api/endpoint`
Weekend-only processing	`pueue add --delay "next saturday" -- python batch.py`

Priority Scheduling (`--priority`)

Higher priority number = runs first when a queue slot opens:

# Urgent validation (runs before queued lower-priority jobs)
pueue add --priority 10 -- python validate_critical.py

# Normal compute (default priority is 0)
pueue add -- python train_model.py

# Low-priority background task
pueue add --priority -5 -- python cleanup_logs.py

Priority only affects queued jobs waiting for an open slot. Running jobs are not preempted.

Per-Task Environment Override (`pueue env`)

Inject or override environment variables on stashed or queued tasks:

# Create a stashed job
JOB_ID=$(pueue add --stashed --print-task-id -- python train.py)

# Set environment variables (NOTE: separate args, NOT KEY=VALUE)
pueue env set "$JOB_ID" BATCH_SIZE 64
pueue env set "$JOB_ID" LEARNING_RATE 0.001

# Enqueue when ready
pueue enqueue "$JOB_ID"

Syntax: pueue env set <id> KEY VALUE — the key and value are separate positional arguments.

Constraint: Only works on stashed/queued tasks. Cannot modify environment of running tasks.

Relationship to mise.toml [env]: mise [env] remains the SSoT for default environment. Use pueue env set only for one-off overrides (e.g., hyperparameter sweeps) without modifying config files.

Preferred Pattern: python-dotenv for Pueue Job Secrets

Pueue jobs run in clean shells without .bashrc, .zshrc, or mise activation. This means mise.toml [env] variables are invisible to pueue jobs. The most portable solution is python-dotenv:

Architecture

mise.toml           → Task definitions only (no [env] for secrets)
.env                → Secrets (gitignored, loaded by python-dotenv at runtime)
scripts/backfill.sh → Pueue orchestrator (just `cd $PROJECT_DIR` for dotenv)

Implementation

1. Project .env (gitignored):

# .env — loaded by python-dotenv at runtime
API_KEY=sk-abc123
DATABASE_URL=postgresql://localhost/mydb

2. Python entry point — call load_dotenv() early:

from dotenv import load_dotenv
load_dotenv()  # Auto-loads .env from cwd

import os
API_KEY = os.getenv("API_KEY")  # Works everywhere

3. Pueue job — just needs cd to project root:

# The only requirement: cwd must contain .env
pueue add -- bash -c 'cd ~/project && uv run python my_script.py'

Why This Beats Alternatives

Approach	Interactive Shell	Pueue Job	Cron	SSH Remote	Cross-Platform
mise `[env]`	Yes	No	No	Fragile	macOS+Linux
`pueue env set`	N/A	Yes	No	No	N/A
Export in `.bashrc`	Yes	No	No	Depends	Varies
python-dotenv + `.env`	Yes	Yes	Yes	Yes	Yes

Cross-reference: See distributed-job-safety skill — G-15, AP-16

Blocking Wait (`pueue wait`)

Block until tasks complete — simpler than polling loops for scripts:

# Wait for specific task
pueue wait 42

# Wait for all tasks in a group
pueue wait --group mygroup

# Wait for ALL tasks across all groups
pueue wait --all

# Wait quietly (no progress output)
pueue wait 42 --quiet

# Wait for tasks to reach a specific status
pueue wait --status queued

Script Integration Pattern

# Queue → wait → process results
TASK_ID=$(pueue add --print-task-id -- python etl_pipeline.py)
pueue wait "$TASK_ID" --quiet
EXIT_CODE=$(pueue status --json | jq -r ".tasks[\"$TASK_ID\"].status.Done.result" 2>/dev/null)
if [ "$EXIT_CODE" = "Success" ]; then
    echo "Pipeline succeeded"
    pueue log "$TASK_ID" --full
else
    echo "Pipeline failed"
    pueue log "$TASK_ID" --full >&2
fi

Compressed State File

Reduce I/O for state persistence with zstd compression:

# In pueue.yml
daemon:
  compress_state_file: true

Compression ratio: ~10:1 (from pueue source code).

When to enable:

I/O-constrained hosts (spinning disks, NFS mounts)
Large task histories (hundreds of completed tasks)
Defense-in-depth alongside periodic pueue clean

Note: Compression helps I/O performance. pueue clean reduces data volume. They are complementary, not alternatives.

macOS Auto-Start (launchd)

Auto-start the pueue daemon on login. Create the plist at ~/Library/LaunchAgents/com.nukesor.pueued.plist:

# Generate the launchd plist (standard Apple plist format)  # SSoT-OK
cat > ~/Library/LaunchAgents/com.nukesor.pueued.plist << 'PLIST'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.nukesor.pueued</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/pueued</string>
        <string>-v</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/pueued.stdout.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/pueued.stderr.log</string>
</dict>
</plist>
PLIST

Then load the agent:

# Load (starts immediately + on login)
launchctl load ~/Library/LaunchAgents/com.nukesor.pueued.plist

# Unload
launchctl unload ~/Library/LaunchAgents/com.nukesor.pueued.plist

# Check status
launchctl list | grep pueued

Linux equivalent: Use systemd — see pueued --systemd or create a user service in ~/.config/systemd/user/.

Hook: itp-hooks/posttooluse-reminder.ts - Reminds to use Pueue for detected long-running commands
Reference: Pueue GitHub
Issue: rangebar-py#77 - Original implementation
Issue: rangebar-py#88 - Production deployment lessons

pueue-job-orchestration

Resources

Install

Pueue Job Orchestration

Overview

When to Route Through Pueue

When to Use This Skill

Quick Reference

Check Status

Queue a Job

Monitor Jobs

Manage Jobs

Host Configuration

Workflows

1. Queue Single Remote Job

2. Batch Job Submission (Multiple Symbols)

3. Configure Parallelism Groups

4. Handle Failed Jobs

Installation

macOS (Local)

Linux (BigBlack/LittleBlack)

Systemd Auto-Start (Linux)

Integration with rangebar-py

Phase-Based Execution

Troubleshooting

Production Lessons (Issue #88)

Dependency Chaining with --after

Mise Task to Pueue Pipeline Integration

Forensic Audit Before Deployment

Force-Refresh vs Checkpoint Resume

PATH Gotcha: Rust Not in PATH via uv run

Per-Year (Epoch) Parallelization — DEFAULT STRATEGY

Pipeline Monitoring (Group-Based Phase Detection)

State File Management (CRITICAL)

The State Bloat Anti-Pattern

Pre-Submission Clean (Mandatory Pattern)

Periodic Clean During Long Sweeps

Bulk Submission with xargs -P (High-Throughput Pattern)

Why Not GNU Parallel?

Batch Command File Pattern

Crash Recovery with Skip-Done

Two-Tier Architecture (300K+ Jobs)

Architecture

Key Principles

When to Use Each Tier

Shell Script Safety (set -euo pipefail)

Skipped Config NDJSON Pattern

comm -23 for Large Skip-Done Sets (100K+)

ClickHouse Parallelism Tuning (pueue + ClickHouse)

The Thread Soft Limit

Right-Size max_threads Per Query

Parallelism Sizing Formula

Decision Matrix

Live Tuning (No Restart Required)

Callback Hooks (Completion Notifications)

Template Variables (14 total, Handlebars syntax)

Production Examples

Config File Location (Platform Difference)

Delayed Scheduling (--delay)

Stashed + Delay Combo

Patterns

Priority Scheduling (--priority)

Per-Task Environment Override (pueue env)

Preferred Pattern: python-dotenv for Pueue Job Secrets

Architecture

Implementation

Why This Beats Alternatives

Blocking Wait (pueue wait)

Script Integration Pattern

Compressed State File

macOS Auto-Start (launchd)

Related

Categories

Install

Recommended Skills

Dependency Chaining with `--after`

PATH Gotcha: Rust Not in PATH via `uv run`

Right-Size `max_threads` Per Query

Delayed Scheduling (`--delay`)

Priority Scheduling (`--priority`)

Per-Task Environment Override (`pueue env`)

Blocking Wait (`pueue wait`)