Run jobs on CU Boulder CURC HPC cluster (Alpine). Use when simulations need more compute than the local workstation, for large-scale parallel jobs, or when GPU resources are needed beyond local availability. You have full SSH access - work like a researcher.
Install
npx skillscat add fl-sean03/agentic-science-worker/hpc-cluster Install via the SkillsCat registry.
CURC HPC Cluster Access (CU Boulder Alpine)
You have full SSH access to CU Boulder's Alpine HPC cluster. You can do everything a human researcher can do: submit jobs, debug failures, load modules, transfer files, and work autonomously.
Quick Reference
| Item | Value |
|---|---|
| Login | ssh $CURC_USER@login.rc.colorado.edu |
| Filesystem | /scratch/alpine/$CURC_USER/ (10TB, fast I/O) |
| Agent Workspace | /scratch/alpine/$CURC_USER/Agent_Runs/ |
| Job Scheduler | SLURM |
| Default Partition | amilan (CPU), aa100 (GPU) |
| Authentication | SSH key (pre-configured) |
| HPC Client | .claude/skills/hpc-cluster/hpc_client.py |
Two Ways to Work
You have two approaches available:
1. Python HPC Client (Recommended for common operations)
A lightweight client that handles connection management and common patterns:
import os
import sys

# Add the skill directory to the import path (relative to project root)
skill_dir = os.path.join(os.environ.get('PROJECT_ROOT', '.'), '.claude/skills/hpc-cluster')
sys.path.insert(0, skill_dir)
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# Create workspace, upload files, submit job, wait for completion
run_dir = hpc.create_run("argon-diffusion")
hpc.upload("input.lmp", f"{run_dir}/input.lmp")
hpc.upload("job.slurm", f"{run_dir}/job.slurm")
job_id = hpc.submit(f"{run_dir}/job.slurm")
status = hpc.wait_for_job(job_id, timeout=3600)
if status.is_success:
    hpc.download(f"{run_dir}/output.dat", "./results/")
else:
    # Debug: read error output (the file name follows the job script's --error pattern)
    print(hpc.read_file(f"{run_dir}/my_job_{job_id}.err"))
hpc.disconnect()
2. Direct SSH (For full control)
When you need to do something the client doesn't support, use raw SSH:
# Run any command
ssh $CURC_USER@login.rc.colorado.edu "your command here"
# Interactive debugging
ssh $CURC_USER@login.rc.colorado.edu
Use the client for: workspace setup, file transfer, job submission, job monitoring
Use raw SSH for: debugging, exploring, unusual operations, anything not covered
Connection
SSH Access
SSH is pre-configured with key-based authentication and connection multiplexing via ~/.ssh/config. Use the cu_alpine alias for simplicity:
# Connect to CURC login node (uses ~/.ssh/config)
ssh cu_alpine
# Run a single command
ssh cu_alpine "squeue -u $CURC_USER"
# Or use full address
ssh $CURC_USER@login.rc.colorado.edu "squeue -u $CURC_USER"
# Transfer files TO HPC
scp local_file.txt $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/
# Transfer files FROM HPC
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/results.dat ./
Connection multiplexing: The SSH config uses ControlMaster to reuse connections, so the first connection is slower but subsequent ones are nearly instant.
Important: The login node is for submitting jobs and light tasks. Never run compute-intensive work directly on login nodes.
Workspace Structure
All agent work on HPC goes in the existing Agent_Runs directory:
/scratch/alpine/$CURC_USER/Agent_Runs/
├── argon-diffusion-20260118/
│   ├── inputs/
│   ├── outputs/
│   ├── job.slurm
│   └── README.md
├── water-tip4p-20260119/
├── shared/
│   ├── potentials/         # Downloaded force fields
│   ├── pseudopotentials/   # Downloaded pseudopotentials
│   └── scripts/            # Reusable analysis scripts
└── ...
Creating a New Run
# Create run directory with timestamp
RUN_NAME="project-name-$(date +%Y%m%d-%H%M%S)"
RUN_DIR="/scratch/alpine/$CURC_USER/Agent_Runs/$RUN_NAME"
ssh cu_alpine "mkdir -p $RUN_DIR/{inputs,outputs}"
SLURM Job Submission
Job Script Template
#!/bin/bash
#SBATCH --job-name=my_simulation
#SBATCH --partition=amilan # CPU partition (or aa100 for GPU)
#SBATCH --nodes=1
#SBATCH --ntasks=32 # Number of MPI tasks
#SBATCH --time=04:00:00 # Max runtime (HH:MM:SS)
#SBATCH --output=%x_%j.out # stdout file
#SBATCH --error=%x_%j.err # stderr file
#SBATCH --mail-type=END,FAIL # Email notifications
#SBATCH --mail-user=your@email.com
# Load required modules
module purge
module load gcc/13.1.0
module load openmpi/4.1.6
# Change to run directory
cd $SLURM_SUBMIT_DIR
# Run your simulation
mpirun -np $SLURM_NTASKS ./your_program input.in
Key SLURM Commands
| Command | Purpose |
|---|---|
| sbatch job.slurm | Submit batch job |
| squeue -u $USER | Check your job status |
| squeue -j <jobid> | Check specific job |
| scancel <jobid> | Cancel a job |
| sinfo -p amilan | Check partition status |
| sacct -j <jobid> | Job accounting info |
| scontrol show job <jobid> | Detailed job info |
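When scripting around these commands, the job ID has to be pulled out of what sbatch prints. A minimal sketch, assuming sbatch's usual confirmation line ("Submitted batch job <id>"); the function name parse_job_id is hypothetical and not part of hpc_client.py:

```python
import re


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        # sbatch printed something else (for example an error message)
        raise ValueError(f"No job ID found in sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))
```

The returned ID can then be passed to squeue -j, sacct -j, or scancel.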
Job Status Codes
| Code | Meaning |
|---|---|
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
| TO | Timeout |
| CA | Cancelled |
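For monitoring scripts, the table above can be encoded as a small lookup. This is a sketch only: it covers just the codes listed here (SLURM defines more), and the helper names are hypothetical.

```python
# Compact squeue state codes from the table above.
STATE_MEANINGS = {
    "PD": "Pending (waiting for resources)",
    "R": "Running",
    "CG": "Completing",
    "CD": "Completed",
    "F": "Failed",
    "TO": "Timeout",
    "CA": "Cancelled",
}

# States in which the job is still queued, running, or wrapping up.
ACTIVE_STATES = {"PD", "R", "CG"}


def describe_state(code: str) -> str:
    """Return the human-readable meaning, or flag a code not in the table."""
    return STATE_MEANINGS.get(code, f"Unknown state ({code})")


def is_finished(code: str) -> bool:
    """True once the job has reached one of the listed terminal states."""
    return code in STATE_MEANINGS and code not in ACTIVE_STATES


def is_success(code: str) -> bool:
    """Only CD (Completed) counts as success."""
    return code == "CD"
```

Unknown codes are reported as not finished, so a caller that polls on is_finished should also apply its own timeout.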
Available Partitions
Partition Selection Strategy
CRITICAL: Always validate on testing partition first before production runs!
Workflow:
1. atesting / atesting_a100 → Validate job script works (1 hour max)
2. amilan / aa100 → Production runs (24 hour max)
3. amilan + qos=long → Extended runs (7 day max, lower priority)
Testing Partitions (Use First!)
| Partition | Limits | Max Time | Purpose |
|---|---|---|---|
| atesting | 2 nodes, 16 cores max | 1h | Validate CPU jobs work before production |
| atesting_a100 | 1 GPU, 10 cores max | 1h | Validate GPU jobs work before production |
| atesting_mi100 | 1 GPU, 10 cores max | 1h | Validate AMD GPU jobs |
Always run a short test on atesting first to catch:
- Module loading issues
- Path errors
- Input file problems
- Memory requirements
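One way to make the test-first habit cheap is to derive the test script from the production script. The sketch below rewrites only the partition, time, and ntasks directives; make_test_script is a hypothetical helper, it handles only the --key=value form of #SBATCH lines, and it drops trailing comments on the lines it rewrites.

```python
import re


def make_test_script(script: str, partition: str = "atesting",
                     time_limit: str = "00:30:00", ntasks: int = 4) -> str:
    """Return a copy of a job script retargeted at a testing partition."""
    overrides = {"partition": partition, "time": time_limit, "ntasks": str(ntasks)}
    lines = []
    for line in script.splitlines():
        match = re.match(r"#SBATCH\s+--([\w-]+)=(\S+)", line)
        if match and match.group(1) in overrides:
            # Replace the whole directive; any trailing comment is dropped
            line = f"#SBATCH --{match.group(1)}={overrides[match.group(1)]}"
        lines.append(line)
    return "\n".join(lines) + "\n"
```

Other directives (such as --nodes) and the body of the script are left untouched, so check that the remaining requests still fit within the testing partition limits, and reduce timesteps in the input file separately.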
Production CPU Partitions
| Partition | Nodes | Cores/Node | RAM/Node | Max Time | Use For |
|---|---|---|---|---|---|
| amilan | 387 | 32-64 | 256 GB (3.75 GB/core) | 24h | Default for production CPU jobs |
| amilan128c | 16 | 128 | 256 GB (2 GB/core) | 24h | High core count on single node (see below) |
| amem | 24 | 48-128 | up to 2 TB | 24h | Memory-intensive (requires --qos=mem, must request 256GB+) |
When to Use amilan128c vs amilan
Use amilan128c when:
- Your job benefits from 128 cores on ONE node (vs spreading across multiple nodes)
- Running OpenMP/shared-memory parallel codes
- High inter-process communication (MPI with frequent small messages)
- Tightly-coupled simulations where network latency hurts performance
- Large LAMMPS/QE jobs that scale well but suffer from inter-node communication
Use regular amilan when:
- Your job needs 64 cores or fewer
- You need multiple nodes (amilan has 387 nodes vs only 16 for 128c)
- Memory per core matters more (3.75 GB/core vs 2 GB/core on 128c)
- Queue wait time is a concern (more nodes = shorter queue)
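These rules of thumb can be written down as a small chooser. This is a sketch under the numbers quoted in this document (64 cores per amilan node at most, 2 GB/core on amilan128c); pick_cpu_partition is a hypothetical name, and the function ignores queue depth, which should still be checked before submitting.

```python
AMILAN_MAX_CORES_PER_NODE = 64      # upper end of the 32-64 range quoted above
AMILAN128C_CORES_PER_NODE = 128
AMILAN128C_GB_PER_CORE = 2.0


def pick_cpu_partition(cores: int, mem_gb_per_core: float = 2.0,
                       tightly_coupled: bool = False) -> str:
    """Suggest amilan or amilan128c from core count, memory, and coupling."""
    if cores <= AMILAN_MAX_CORES_PER_NODE:
        # Fits on a single amilan node: more nodes, shorter queue
        return "amilan"
    fits_one_128c_node = (
        cores <= AMILAN128C_CORES_PER_NODE
        and mem_gb_per_core <= AMILAN128C_GB_PER_CORE
    )
    if tightly_coupled and fits_one_128c_node:
        # Keeps all ranks on one node and avoids inter-node latency
        return "amilan128c"
    # Otherwise spread across multiple amilan nodes
    return "amilan"
```

For example, a tightly-coupled 128-core job at 2 GB/core maps to amilan128c, while the same job at 3 GB/core falls back to multiple amilan nodes.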
Example: 128-core single-node LAMMPS job
#SBATCH --partition=amilan128c
#SBATCH --nodes=1
#SBATCH --ntasks=128 # Use all 128 cores
#SBATCH --time=12:00:00
Production GPU Partitions
| Partition | Nodes | GPUs/Node | GPU Type | Max Time | Use For |
|---|---|---|---|---|---|
| aa100 | 11 | 3 | NVIDIA A100 (40GB) | 24h | Best for CUDA, ML/DL, GPU-accelerated MD |
| ami100 | 7 | 3 | AMD MI100 | 24h | ROCm/HIP workloads |
| al40 | 3 | 3 | NVIDIA L40 | 24h | Newer architecture, visualization |
Special Partitions
| Partition | Max Time | Purpose |
|---|---|---|
| acompile | 12h | Compiling software only (use via acompile command) |
| csu | 24h | Colorado State contributed nodes |
| amc | 24h | CU Anschutz contributed nodes |
QoS (Quality of Service)
| QoS | Max Time | Priority | When to Use |
|---|---|---|---|
| normal | 24h | Normal | Default - use for most jobs |
| long | 7 days | Lower | Extended simulations (will wait longer in queue) |
| mem | 24h | Normal | Required for amem partition (high-memory jobs) |
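A requested --time that exceeds the QoS limit is an easy mistake to catch before submitting. The sketch below checks a time string against the limits in the table above; the helper names are hypothetical, and it parses only the HH:MM:SS and D-HH:MM:SS forms (SLURM accepts a few others).

```python
# Maximum wall time per QoS, in seconds, from the table above.
QOS_MAX_SECONDS = {
    "normal": 24 * 3600,
    "long": 7 * 24 * 3600,
    "mem": 24 * 3600,
}


def parse_slurm_time(value: str) -> int:
    """Convert 'HH:MM:SS' or 'D-HH:MM:SS' to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"Expected HH:MM:SS or D-HH:MM:SS, got {value!r}")
    hours, minutes, seconds = (int(p) for p in parts)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fits_qos(time_limit: str, qos: str = "normal") -> bool:
    """True if the requested wall time is within the QoS maximum."""
    return parse_slurm_time(time_limit) <= QOS_MAX_SECONDS[qos]
```

With these limits, 168:00:00 is rejected under normal but accepted under long.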
Partition Selection Examples
# 1. TESTING: Always start here to validate your job works
#SBATCH --partition=atesting
#SBATCH --time=00:30:00
#SBATCH --ntasks=4
# 2. PRODUCTION CPU: After testing passes
#SBATCH --partition=amilan
#SBATCH --time=04:00:00
#SBATCH --ntasks=32
# 3. PRODUCTION GPU: For GPU-accelerated codes
#SBATCH --partition=aa100
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
# 4. LONG RUNS: When 24h isn't enough (lower priority)
#SBATCH --partition=amilan
#SBATCH --qos=long
#SBATCH --time=168:00:00 # 7 days
# 5. HIGH MEMORY: For memory-intensive jobs (256GB+ required)
#SBATCH --partition=amem
#SBATCH --qos=mem
#SBATCH --mem=512G
#SBATCH --time=12:00:00
Module System
Software is managed through environment modules. Always load modules from a compute node or compile node, not a login node.
Essential Commands
# List available modules
module avail
# Search for specific software
module spider lammps
module spider python
# Load modules
module load gcc/13.1.0
module load openmpi/4.1.6
module load lammps/20230802
# See what's loaded
module list
# Unload all modules
module purge
# Save/restore module sets
module save my_env
module restore my_env
Finding and Loading Software
Software on CURC is installed in /curc/sw/install/. To find what's available:
# List all installed software
ls /curc/sw/install/
# Check specific software versions
ls /curc/sw/install/lammps/ # LAMMPS versions (22July25, 2Sept25, etc.)
ls /curc/sw/install/QE/ # Quantum ESPRESSO (7.0, 7.2)
ls /curc/sw/install/gromacs/ # GROMACS versions
LAMMPS example (check exact paths for current versions):
# Find the binary
ls /curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin/
# In job script
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS lmp -in input.lmp
Quantum ESPRESSO example:
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS pw.x < input.in > output.out
Note: Module dependencies matter. Load the compiler first, then MPI. Check exact version paths, as they may change.
Storage Filesystem
Paths and Quotas
| Path | Quota | Purge | Use For |
|---|---|---|---|
| /home/$USER | 2 GB | Never | Scripts, small configs |
| /projects/$USER | 250 GB | Never | Code, small datasets |
| /scratch/alpine/$USER | 10 TB | 90 days | Job I/O, large files |
| $SLURM_SCRATCH | ~300 GB | Job end | Node-local temp storage |
Performance Rules
DO:
- Run all job I/O on /scratch/alpine/
- Use $SLURM_SCRATCH for intensive temporary files
- Copy results back after job completes
DON'T:
- Run I/O-intensive jobs on /home or /projects (will be killed)
- Store important data only on /scratch (it's purged!)
- Leave large files on login nodes
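Because scratch is purged after 90 days, it helps to list files that are getting close to that age so they can be copied elsewhere. The sketch below is a local-filesystem helper with a hypothetical name, and it assumes the purge is based on modification time; the cluster's actual criterion may differ, so treat the result as a hint only.

```python
import time
from pathlib import Path


def files_older_than(root, days, now=None):
    """Return files under root whose modification time is older than `days`."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    old_files = []
    for path in Path(root).rglob("*"):
        # Only regular files matter; directories are skipped
        if path.is_file() and path.stat().st_mtime < cutoff:
            old_files.append(path)
    return sorted(old_files)
```

Run on the cluster against a run directory, for example with days=75, it gives a two-week warning before the 90-day limit.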
Example Workflows
Recommended Workflow: Test First, Then Production
Step 1: Create a testing job script (job_test.slurm)
#!/bin/bash
#SBATCH --job-name=argon_test
#SBATCH --partition=atesting # <-- TEST PARTITION FIRST
#SBATCH --nodes=1
#SBATCH --ntasks=4 # Small scale for testing
#SBATCH --time=00:30:00 # 30 min is plenty for testing
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
echo "=== Testing job script ==="
echo "Started at: $(date)"
echo "Running on: $(hostname)"
module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
cd $SLURM_SUBMIT_DIR
echo "Working directory: $(pwd)"
echo "Input files: $(ls -la)"
# Run short test (reduce timesteps in input for testing)
mpirun -np $SLURM_NTASKS lmp -in input.lmp
echo "Finished at: $(date)"
Step 2: If test passes, create production job (job_prod.slurm)
#!/bin/bash
#SBATCH --job-name=argon_prod
#SBATCH --partition=amilan # <-- PRODUCTION PARTITION
#SBATCH --nodes=1
#SBATCH --ntasks=32 # Full scale
#SBATCH --time=04:00:00 # Appropriate for full run
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp
LAMMPS MD Simulation (Full Example)
#!/bin/bash
#SBATCH --job-name=argon_md
#SBATCH --partition=amilan
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp
Quantum ESPRESSO DFT
#!/bin/bash
#SBATCH --job-name=si_scf
#SBATCH --partition=amilan
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS pw.x < si_scf.in > si_scf.out
GPU Job (Testing First)
Test on atesting_a100:
#!/bin/bash
#SBATCH --job-name=md_gpu_test
#SBATCH --partition=atesting_a100 # <-- GPU TESTING
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.out
module purge
module load gcc/12.2.0 cuda/12.1.1
# Add LAMMPS GPU path here
cd $SLURM_SUBMIT_DIR
lmp -k on g 1 -sf kk -pk kokkos gpu/aware off -in input.lmp
Then production on aa100:
#!/bin/bash
#SBATCH --job-name=md_gpu_prod
#SBATCH --partition=aa100 # <-- GPU PRODUCTION
#SBATCH --nodes=1
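#SBATCH --ntasks=3 # One MPI rank per GPU (Kokkos needs at least as many ranks as GPUs)
#SBATCH --gres=gpu:3 # Can use up to 3 GPUs per node
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out
module purge
module load gcc/12.2.0 cuda/12.1.1
# Add LAMMPS GPU path here (and load the MPI module that build was compiled against)
cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -k on g 3 -sf kk -pk kokkos gpu/aware off -in input.lmp
Debugging Failed Jobs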
When a job fails, investigate systematically:
1. Check Job Status
# See why it failed
sacct -j <jobid> --format=JobID,State,ExitCode,Reason
# Get detailed info
scontrol show job <jobid>
2. Read Output Files
# Check stdout
cat my_job_12345.out
# Check stderr (often has the real error)
cat my_job_12345.err
# Check application logs
cat log.lammps
3. Common Failure Reasons
| Issue | Symptom | Solution |
|---|---|---|
| Timeout | State=TIMEOUT | Increase --time or optimize |
| Memory | State=OUT_OF_MEMORY | Request more memory (--mem), add nodes, or use amem |
| Module not found | "command not found" | Check module load order |
| Bad path | "file not found" | Use absolute paths |
| Wrong partition | Job pending forever | Check partition resources |
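The state-based rows of this table can be applied automatically to sacct output. The sketch below assumes the output was produced with sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader (pipe-separated, no header row); diagnose and the suggestion strings are hypothetical and only paraphrase the table.

```python
SUGGESTIONS = {
    "COMPLETED": "No action needed",
    "TIMEOUT": "Increase --time or optimize the run",
    "OUT_OF_MEMORY": "Request more memory, add nodes, or use amem",
}
DEFAULT_SUGGESTION = "Read the .err file and application logs"


def diagnose(sacct_output: str) -> dict:
    """Summarize the top-level job record from pipe-separated sacct output."""
    for line in sacct_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue
        job_id, state, exit_code = fields[:3]
        if "." in job_id:
            # Skip job steps such as 12345.batch; only the job itself matters
            continue
        # States like "CANCELLED by 1001" carry extra words; keep the first
        state = (state.split() or [""])[0]
        return {
            "job_id": job_id,
            "state": state,
            "exit_code": exit_code,
            "suggestion": SUGGESTIONS.get(state, DEFAULT_SUGGESTION),
        }
    raise ValueError("No job record found in sacct output")
```

Failures that show up only in application output ("command not found", "file not found") are not visible to sacct, so the default suggestion points back to the log files.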
4. Interactive Debugging
# Get interactive session for debugging
sinteractive --partition=atesting --time=01:00:00 --ntasks=4
# Then run commands interactively to debug
module load lammps
lmp -in input.lmp # See errors in real-time
File Transfer
Between Local and HPC
# Upload input files
scp -r ./inputs/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/Agent_Runs/my-run/
# Download results
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/Agent_Runs/my-run/output.dat ./
# Sync directories (rsync is more efficient for updates)
rsync -avz ./project/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/project/
Large File Transfers
For very large files, use Globus (web-based) or DTN nodes:
# Use data transfer node for large transfers
scp large_file.tar $CURC_USER@dtn.rc.colorado.edu:/scratch/alpine/$CURC_USER/
Queue Times and Async Job Management
Understanding Queue Wait Times
CRITICAL: HPC jobs don't start immediately. Queue times vary dramatically:
| Partition | Typical Wait | Why |
|---|---|---|
| atesting | Minutes | Testing partition, low demand |
| amilan | Minutes to hours | Many nodes (387), high throughput |
| amilan128c | Hours to days | Only 16 nodes, high demand |
| aa100 | Hours to days | Only 11 nodes, GPU scarcity |
Before submitting, check the queue:
# See pending jobs and estimated start times
ssh cu_alpine "squeue -p amilan128c --start"
# Quick queue depth check
ssh cu_alpine "squeue -p amilan128c --state=PENDING | wc -l"
Async Workflow (For Long Queue Times)
DON'T block waiting for jobs with multi-day queues. Instead:
# Assumes the skill directory is already on sys.path (see the client example)
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# 1. Check queue before choosing partition
status = hpc.get_queue_status('amilan128c')
print(f"Estimated wait: {status['estimated_wait']}")
print(f"Pending jobs: {status['pending_jobs']}")

# 2. Compare partitions to choose wisely
for part in hpc.compare_partitions(['amilan', 'amilan128c', 'aa100']):
    print(f"{part['partition']}: {part['estimated_wait']}, {part['pending_jobs']} pending")

# 3. Submit async (returns immediately, saves tracking file)
# run_dir is an existing run directory that already contains job.slurm
tracking = hpc.submit_async(f"{run_dir}/job.slurm")
print(f"Job {tracking['job_id']} submitted")
print(f"Estimated start: {tracking['estimated_start']}")
# Returns immediately - don't wait!

# 4. Later: Check on all submitted jobs
jobs = hpc.check_async_jobs()
for job in jobs:
    print(f"Job {job['job_id']}: {job['current_status']}")
    if job['is_finished']:
        print(f"  Completed! Success: {job['is_success']}")
Workflow Strategy for Long-Running Studies
For multi-day queue scenarios:
Day 1: Submit jobs
├── Check queue status
├── Submit with submit_async()
├── Note estimated start times
└── Move on to other work
Day 2+: Check periodically
├── hpc.check_async_jobs()
├── If still PENDING: wait
├── If RUNNING: monitor progress
└── If COMPLETED: download results and analyze
SLURM Email Notifications (Recommended)
Add to your job scripts for automatic notifications:
#SBATCH --mail-type=BEGIN,END,FAIL # When to email
#SBATCH --mail-user=your@email.com # Your email
# Options: NONE, BEGIN, END, FAIL, REQUEUE, ALL
# BEGIN = job started (left queue)
# END = job finished
# FAIL = job failed
Smart Partition Selection
Decision tree:
Need GPU?
├── YES → Check aa100 queue
│   └── Long wait? Consider if job can run on CPU instead
└── NO → How many cores?
    ├── ≤64 cores → amilan (shorter queue, more nodes)
    └── >64 cores or tightly-coupled →
        └── Check amilan128c queue
            └── Wait >24h? Consider splitting across amilan nodes
Check Job Progress
# One-time status check with start time estimates
ssh cu_alpine "squeue -u $CURC_USER --start"
# See job details
ssh cu_alpine "scontrol show job <jobid>"
# Check why job is pending
ssh cu_alpine "squeue -j <jobid> --format='%r'" # Shows REASON
Wait for Job Completion (Short Jobs Only)
Only use blocking wait for jobs expected to complete within minutes:
# Poll until job completes (ONLY for short jobs!)
JOB_ID=12345
while ssh cu_alpine "squeue -j $JOB_ID 2>/dev/null | grep -q $JOB_ID"; do
    echo "Job $JOB_ID still pending or running..."
    sleep 60
done
echo "Job $JOB_ID has left the queue"
# Check final status (leaving the queue does not mean success)
ssh cu_alpine "sacct -j $JOB_ID --format=JobID,State,ExitCode"
Key Principles
You Are a Researcher
You have the same access a human researcher has. You can:
- Create any job script you need
- Load any available module
- Debug failures by reading logs
- Adapt to different software versions
- Figure out problems through investigation
Don't Just Execute - Verify
After running on HPC:
- Check job completed successfully (not just submitted)
- Verify output files exist and have content
- Check for error messages in stderr
- Validate results are physically reasonable
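The first three checks can be automated once results are downloaded. A minimal sketch, with a hypothetical function name and a heuristic list of error markers that will need tuning per application; it does not judge whether results are physically reasonable.

```python
from pathlib import Path

# Heuristic substrings that usually indicate trouble in stderr (an assumption)
ERROR_MARKERS = ("error", "segmentation fault", "killed")


def verify_run(output_files, stderr_file=None):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for name in output_files:
        path = Path(name)
        if not path.is_file():
            problems.append(f"missing: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty: {path}")
    if stderr_file is not None and Path(stderr_file).is_file():
        text = Path(stderr_file).read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), 1):
            if any(marker in line.lower() for marker in ERROR_MARKERS):
                problems.append(f"stderr line {number}: {line.strip()}")
    return problems
```

A typical call would pass the expected output files and the job's .err file, then stop and investigate if the returned list is not empty.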
Document Your Work
Leave breadcrumbs for yourself:
# In job script
echo "Job started at $(date)"
echo "Running on $(hostname)"
echo "Loaded modules: $(module list 2>&1)"