hpc-cluster

Run jobs on the CU Boulder CURC HPC cluster (Alpine). Use when simulations need more compute than the local workstation provides, for large-scale parallel jobs, or when GPU resources are needed beyond local availability. You have full SSH access - work like a researcher.

fl-sean03 3 Updated 4mo ago

Install

npx skillscat add fl-sean03/agentic-science-worker/hpc-cluster

Install via the SkillsCat registry.

SKILL.md

CURC HPC Cluster Access (CU Boulder Alpine)

You have full SSH access to CU Boulder's Alpine HPC cluster. You can do everything a human researcher can do: submit jobs, debug failures, load modules, transfer files, and work autonomously.

Quick Reference

Item	Value
Login	`ssh $CURC_USER@login.rc.colorado.edu`
Filesystem	`/scratch/alpine/$CURC_USER/` (10 TB, fast I/O)
Agent Workspace	`/scratch/alpine/$CURC_USER/Agent_Runs/`
Job Scheduler	SLURM
Default Partition	`amilan` (CPU), `aa100` (GPU)
Authentication	SSH key (pre-configured)
HPC Client	`.claude/skills/hpc-cluster/hpc_client.py`

Two Ways to Work

You have two approaches available:

1. Python HPC Client (Recommended for common operations)

A lightweight client that handles connection management and common patterns:

import sys
import os
# Add the skill directory to path (relative to project root)
skill_dir = os.path.join(os.environ.get('PROJECT_ROOT', '.'), '.claude/skills/hpc-cluster')
sys.path.insert(0, skill_dir)
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# Create workspace, upload files, submit job, wait for completion
run_dir = hpc.create_run("argon-diffusion")
hpc.upload("input.lmp", f"{run_dir}/input.lmp")
hpc.upload("job.slurm", f"{run_dir}/job.slurm")
job_id = hpc.submit(f"{run_dir}/job.slurm")
status = hpc.wait_for_job(job_id, timeout=3600)

if status.is_success:
    hpc.download(f"{run_dir}/output.dat", "./results/")
else:
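    # Debug: read error output. The file is named <job-name>_<job_id>.err by
    # "#SBATCH --error=%x_%j.err"; "my_job" stands in for your --job-name.
    print(hpc.read_file(f"{run_dir}/my_job_{job_id}.err"))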

hpc.disconnect()

2. Direct SSH (For full control)

When you need to do something the client doesn't support, use raw SSH:

# Run any command
ssh $CURC_USER@login.rc.colorado.edu "your command here"

# Interactive debugging
ssh $CURC_USER@login.rc.colorado.edu

Use the client for: workspace setup, file transfer, job submission, job monitoring
Use raw SSH for: debugging, exploring, unusual operations, anything not covered

Connection

SSH Access

SSH is pre-configured with key-based authentication and connection multiplexing via ~/.ssh/config. Use the cu_alpine alias for simplicity:

# Connect to CURC login node (uses ~/.ssh/config)
ssh cu_alpine

# Run a single command
ssh cu_alpine "squeue -u $CURC_USER"

# Or use full address
ssh $CURC_USER@login.rc.colorado.edu "squeue -u $CURC_USER"

# Transfer files TO HPC
scp local_file.txt $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/

# Transfer files FROM HPC
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/results.dat ./

Connection multiplexing: The SSH config uses ControlMaster to reuse connections - the first connection is slower, but subsequent ones are instant.

Important: The login node is for submitting jobs and light tasks. Never run compute-intensive work directly on login nodes.

Workspace Structure

All agent work on HPC goes in the existing Agent_Runs directory:

/scratch/alpine/$CURC_USER/Agent_Runs/
├── argon-diffusion-20260118/
│   ├── inputs/
│   ├── outputs/
│   ├── job.slurm
│   └── README.md
├── water-tip4p-20260119/
├── shared/
│   ├── potentials/          # Downloaded force fields
│   ├── pseudopotentials/    # Downloaded pseudopotentials
│   └── scripts/             # Reusable analysis scripts
└── ...

Creating a New Run

# Create run directory with timestamp
RUN_NAME="project-name-$(date +%Y%m%d-%H%M%S)"
RUN_DIR="/scratch/alpine/$CURC_USER/Agent_Runs/$RUN_NAME"
ssh cu_alpine "mkdir -p $RUN_DIR/{inputs,outputs}"
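
The same naming scheme can be produced from Python when run setup is scripted. This is a minimal sketch: `make_run_dir` is a hypothetical helper (not part of `hpc_client.py`), and it only builds the path string. The directory itself still has to be created on the cluster, for example with the `mkdir -p` call above.

```python
from datetime import datetime
from typing import Optional


def make_run_dir(user: str, project: str, now: Optional[datetime] = None) -> str:
    """Return an Agent_Runs path named <project>-<YYYYmmdd-HHMMSS>."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"/scratch/alpine/{user}/Agent_Runs/{project}-{stamp}"


# "jdoe" is a placeholder username; pass os.environ["CURC_USER"] in practice.
run_dir = make_run_dir("jdoe", "argon-diffusion", datetime(2026, 1, 18, 9, 30, 0))
print(run_dir)  # /scratch/alpine/jdoe/Agent_Runs/argon-diffusion-20260118-093000
```

Passing `now` explicitly keeps the name reproducible, which helps when the same run directory must be referenced from several scripts.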

SLURM Job Submission

Job Script Template

#!/bin/bash
#SBATCH --job-name=my_simulation
#SBATCH --partition=amilan          # CPU partition (or aa100 for GPU)
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Number of MPI tasks
#SBATCH --time=04:00:00             # Max runtime (HH:MM:SS)
#SBATCH --output=%x_%j.out          # stdout file
#SBATCH --error=%x_%j.err           # stderr file
#SBATCH --mail-type=END,FAIL        # Email notifications
#SBATCH --mail-user=your@email.com

# Load required modules
module purge
module load gcc/13.1.0
module load openmpi/4.1.6

# Change to run directory
cd $SLURM_SUBMIT_DIR

# Run your simulation
mpirun -np $SLURM_NTASKS ./your_program input.in
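
When many similar jobs are needed, the template can be filled in programmatically. The sketch below assumes a hypothetical helper, `render_job_script`, that is not part of `hpc_client.py`. It emits only the directives shown in the template above (minus the email lines) and loads modules in the order given, so list the compiler before MPI.

```python
from typing import List


def render_job_script(job_name: str, partition: str, ntasks: int, walltime: str,
                      modules: List[str], command: str, nodes: int = 1) -> str:
    """Render a SLURM batch script following the template in this document."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={walltime}",
        "#SBATCH --output=%x_%j.out",
        "#SBATCH --error=%x_%j.err",
        "",
        "module purge",
    ]
    # Modules are loaded in list order: compiler first, then MPI.
    lines += [f"module load {name}" for name in modules]
    lines += ["", "cd $SLURM_SUBMIT_DIR", command, ""]
    return "\n".join(lines)


script = render_job_script(
    job_name="argon_test",
    partition="atesting",
    ntasks=4,
    walltime="00:30:00",
    modules=["gcc/12.2.0", "openmpi/4.1.5"],
    command="mpirun -np $SLURM_NTASKS lmp -in input.lmp",
)
print(script)
```

The rendered text can be written to `job.slurm` locally and uploaded before submission.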

Key SLURM Commands

Command	Purpose
`sbatch job.slurm`	Submit batch job
`squeue -u $USER`	Check your job status
`squeue -j <jobid>`	Check specific job
`scancel <jobid>`	Cancel a job
`sinfo -p amilan`	Check partition status
`sacct -j <jobid>`	Job accounting info
`scontrol show job <jobid>`	Detailed job info

Job Status Codes

Code	Meaning
`PD`	Pending (waiting for resources)
`R`	Running
`CG`	Completing
`CD`	Completed
`F`	Failed
`TO`	Timeout
`CA`	Cancelled
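
For scripted monitoring it helps to separate states that can still change from states that are final. The sketch below encodes only the codes in the table above. The helper names are hypothetical, and any code not listed is treated as unknown rather than guessed.

```python
# Short squeue state codes from the table above.
JOB_STATES = {
    "PD": "Pending (waiting for resources)",
    "R": "Running",
    "CG": "Completing",
    "CD": "Completed",
    "F": "Failed",
    "TO": "Timeout",
    "CA": "Cancelled",
}

# States after which the job will not change again.
TERMINAL_STATES = {"CD", "F", "TO", "CA"}


def describe(code: str) -> str:
    """Return the meaning of a state code, or a marker for unlisted codes."""
    return JOB_STATES.get(code, f"Unknown state: {code}")


def is_finished(code: str) -> bool:
    """True when the job has reached a final state."""
    return code in TERMINAL_STATES


def is_success(code: str) -> bool:
    """True only for a job that completed normally."""
    return code == "CD"


for code in ("PD", "CD", "TO"):
    print(code, describe(code), is_finished(code), is_success(code))
```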

Available Partitions

Partition Selection Strategy

CRITICAL: Always validate on a testing partition first, before production runs!

Workflow:
1. atesting / atesting_a100  →  Validate job script works (1 hour max)
2. amilan / aa100            →  Production runs (24 hour max)
3. amilan + qos=long         →  Extended runs (7 day max, lower priority)

Testing Partitions (Use First!)

Partition	Limits	Max Time	Purpose
`atesting`	2 nodes, 16 cores max	1h	Validate CPU jobs work before production
`atesting_a100`	1 GPU, 10 cores max	1h	Validate GPU jobs work before production
`atesting_mi100`	1 GPU, 10 cores max	1h	Validate AMD GPU jobs

Always run a short test on atesting first to catch:

Module loading issues
Path errors
Input file problems
Memory requirements

Production CPU Partitions

Partition	Nodes	Cores/Node	RAM/Node	Max Time	Use For
`amilan`	387	32-64	256 GB (3.75 GB/core)	24h	Default for production CPU jobs
`amilan128c`	16	128	256 GB (2 GB/core)	24h	High core count on single node (see below)
`amem`	24	48-128	up to 2 TB	24h	Memory-intensive (requires `--qos=mem`, must request 256GB+)

When to Use amilan128c vs amilan

Use amilan128c when:

Your job benefits from 128 cores on ONE node (vs spreading across multiple nodes)
You run OpenMP/shared-memory parallel codes
Your code has high inter-process communication (MPI with frequent small messages)
You run tightly coupled simulations where network latency hurts performance
You run large LAMMPS/QE jobs that scale well but suffer from inter-node communication

Use regular amilan when:

Your job needs 64 cores or fewer
You need multiple nodes (amilan has 387 nodes vs only 16 for 128c)
Memory per core matters more (3.75 GB/core vs 2 GB/core on 128c)
Queue wait time is a concern (more nodes = shorter queue)
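
The memory-per-core criterion can be checked with simple arithmetic. This sketch uses only the per-core figures quoted in the partition table (3.75 GB on `amilan`, 2 GB on `amilan128c`). `partitions_with_enough_memory` is a hypothetical helper, and an empty result suggests looking at `amem` instead.

```python
from typing import List

# GB of RAM per core, as listed in the production CPU partition table.
GB_PER_CORE = {"amilan": 3.75, "amilan128c": 2.0}


def partitions_with_enough_memory(gb_per_task: float) -> List[str]:
    """Return the partitions whose per-core memory covers one task per core."""
    return [name for name, gb in GB_PER_CORE.items() if gb >= gb_per_task]


print(partitions_with_enough_memory(1.5))  # ['amilan', 'amilan128c']
print(partitions_with_enough_memory(3.0))  # ['amilan']
print(partitions_with_enough_memory(8.0))  # []
```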

Example: 128-core single-node LAMMPS job

#SBATCH --partition=amilan128c
#SBATCH --nodes=1
#SBATCH --ntasks=128           # Use all 128 cores
#SBATCH --time=12:00:00

Production GPU Partitions

Partition	Nodes	GPUs/Node	GPU Type	Max Time	Use For
`aa100`	11	3	NVIDIA A100 (40GB)	24h	Best for CUDA, ML/DL, GPU-accelerated MD
`ami100`	7	3	AMD MI100	24h	ROCm/HIP workloads
`al40`	3	3	NVIDIA L40	24h	Newer architecture, visualization

Special Partitions

Partition	Max Time	Purpose
`acompile`	12h	Compiling software only (use via `acompile` command)
`csu`	24h	Colorado State contributed nodes
`amc`	24h	CU Anschutz contributed nodes

QoS (Quality of Service)

QoS	Max Time	Priority	When to Use
`normal`	24h	Normal	Default - use for most jobs
`long`	7 days	Lower	Extended simulations (will wait longer in queue)
`mem`	24h	Normal	Required for `amem` partition (high-memory jobs)
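
The time limits in the table translate directly into a QoS choice. The sketch below is an illustration under two assumptions: it parses only the `HH:MM:SS` form used in this document (not SLURM's `D-HH:MM:SS` form), and `pick_qos` is a hypothetical helper rather than anything provided by the cluster or the client.

```python
def walltime_hours(walltime: str) -> float:
    """Convert an HH:MM:SS walltime string to hours."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours + minutes / 60 + seconds / 3600


def pick_qos(walltime: str) -> str:
    """Choose 'normal' (up to 24h) or 'long' (up to 7 days) for a walltime."""
    hours = walltime_hours(walltime)
    if hours <= 24:
        return "normal"
    if hours <= 7 * 24:
        return "long"
    raise ValueError(f"{walltime} exceeds the 7-day limit of the long QoS")


print(pick_qos("04:00:00"))   # normal
print(pick_qos("168:00:00"))  # long
```

The `mem` QoS is not selected here because it depends on the partition (`amem`), not on the requested time.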

Partition Selection Examples

# 1. TESTING: Always start here to validate your job works
#SBATCH --partition=atesting
#SBATCH --time=00:30:00
#SBATCH --ntasks=4

# 2. PRODUCTION CPU: After testing passes
#SBATCH --partition=amilan
#SBATCH --time=04:00:00
#SBATCH --ntasks=32

# 3. PRODUCTION GPU: For GPU-accelerated codes
#SBATCH --partition=aa100
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# 4. LONG RUNS: When 24h isn't enough (lower priority)
#SBATCH --partition=amilan
#SBATCH --qos=long
#SBATCH --time=168:00:00   # 7 days

# 5. HIGH MEMORY: For memory-intensive jobs (256GB+ required)
#SBATCH --partition=amem
#SBATCH --qos=mem
#SBATCH --mem=512G
#SBATCH --time=12:00:00

Module System

Software is managed through environment modules. Always load and use modules from a compute node or compile node, not a login node.

Essential Commands

# List available modules
module avail

# Search for specific software
module spider lammps
module spider python

# Load modules
module load gcc/13.1.0
module load openmpi/4.1.6
module load lammps/20230802

# See what's loaded
module list

# Unload all modules
module purge

# Save/restore module sets
module save my_env
module restore my_env

Finding and Loading Software

Software on CURC is installed in /curc/sw/install/. To find what's available:

# List all installed software
ls /curc/sw/install/

# Check specific software versions
ls /curc/sw/install/lammps/    # LAMMPS versions (22July25, 2Sept25, etc.)
ls /curc/sw/install/QE/        # Quantum ESPRESSO (7.0, 7.2)
ls /curc/sw/install/gromacs/   # GROMACS versions

LAMMPS example (check exact paths for current versions):

# Find the binary
ls /curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin/

# In job script
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS lmp -in input.lmp

Quantum ESPRESSO example:

module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS pw.x < input.in > output.out

Note: Module dependencies matter. Load compiler first, then MPI. Check exact version paths as they may change.
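
Both examples above follow the same directory pattern, so the job-script lines can be generated rather than typed. This is a sketch under a stated assumption: the layout `<software>/<version>/gcc/<gcc>/openmpi/<openmpi>/bin` is inferred from the two paths shown here and may not hold for every package, so confirm with `ls` before relying on it. Both helper names are hypothetical.

```python
from typing import List


def install_bin_path(software: str, version: str,
                     gcc: str = "12.2.0", openmpi: str = "4.1.5") -> str:
    """Build the bin directory path, assuming the layout seen in the examples."""
    return f"/curc/sw/install/{software}/{version}/gcc/{gcc}/openmpi/{openmpi}/bin"


def environment_lines(software: str, version: str,
                      gcc: str = "12.2.0", openmpi: str = "4.1.5") -> List[str]:
    """Return the module-load and PATH lines for a job script."""
    bin_dir = install_bin_path(software, version, gcc, openmpi)
    return [
        f"module load gcc/{gcc} openmpi/{openmpi}",  # compiler first, then MPI
        f'export PATH="{bin_dir}:$PATH"',
    ]


for line in environment_lines("lammps", "22July25"):
    print(line)
```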

Storage Filesystem

Paths and Quotas

Path	Quota	Purge	Use For
`/home/$USER`	2 GB	Never	Scripts, small configs
`/projects/$USER`	250 GB	Never	Code, small datasets
`/scratch/alpine/$USER`	10 TB	90 days	Job I/O, large files
`$SLURM_SCRATCH`	~300 GB	Job end	Node-local temp storage

Performance Rules

DO:

Run all job I/O on /scratch/alpine/
Use $SLURM_SCRATCH for intensive temporary files
Copy results back after job completes

DON'T:

Run I/O-intensive jobs on /home or /projects (will be killed)
Store important data only on /scratch (it's purged!)
Leave large files on login nodes
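
Because scratch is purged after 90 days, it can be useful to list files that are getting close to that age before they disappear. The sketch below is meant to run on the cluster itself. It assumes file age can be approximated by modification time, which may differ from how the purge policy actually measures age. `files_near_purge` and its 14-day warning window are illustrative choices.

```python
import os
import time
from typing import List, Optional


def files_near_purge(root: str, max_age_days: int = 90, warn_days: int = 14,
                     now: Optional[float] = None) -> List[str]:
    """Return files under root whose age is within warn_days of max_age_days."""
    now = time.time() if now is None else now
    threshold_seconds = (max_age_days - warn_days) * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Age is approximated by modification time (an assumption).
            if now - os.path.getmtime(path) >= threshold_seconds:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    user = os.environ.get("CURC_USER", "")
    for path in files_near_purge(f"/scratch/alpine/{user}/Agent_Runs"):
        print(path)
```

Anything the list reports should be copied to `/projects` or downloaded if it still matters.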

Example Workflows

Recommended Workflow: Test First, Then Production

Step 1: Create a testing job script (job_test.slurm)

#!/bin/bash
#SBATCH --job-name=argon_test
#SBATCH --partition=atesting        # <-- TEST PARTITION FIRST
#SBATCH --nodes=1
#SBATCH --ntasks=4                  # Small scale for testing
#SBATCH --time=00:30:00             # 30 min is plenty for testing
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

echo "=== Testing job script ==="
echo "Started at: $(date)"
echo "Running on: $(hostname)"

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
echo "Working directory: $(pwd)"
echo "Input files: $(ls -la)"

# Run short test (reduce timesteps in input for testing)
mpirun -np $SLURM_NTASKS lmp -in input.lmp

echo "Finished at: $(date)"

Step 2: If test passes, create production job (job_prod.slurm)

#!/bin/bash
#SBATCH --job-name=argon_prod
#SBATCH --partition=amilan          # <-- PRODUCTION PARTITION
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Full scale
#SBATCH --time=04:00:00             # Appropriate for full run
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp

LAMMPS MD Simulation (Full Example)

#!/bin/bash
#SBATCH --job-name=argon_md
#SBATCH --partition=amilan
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp

Quantum ESPRESSO DFT

#!/bin/bash
#SBATCH --job-name=si_scf
#SBATCH --partition=amilan
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS pw.x < si_scf.in > si_scf.out

GPU Job (Testing First)

Test on atesting_a100:

#!/bin/bash
#SBATCH --job-name=md_gpu_test
#SBATCH --partition=atesting_a100   # <-- GPU TESTING
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1
# Add LAMMPS GPU path here

cd $SLURM_SUBMIT_DIR
lmp -k on g 1 -sf kk -pk kokkos gpu/aware off -in input.lmp

Then production on aa100:

#!/bin/bash
#SBATCH --job-name=md_gpu_prod
#SBATCH --partition=aa100           # <-- GPU PRODUCTION
#SBATCH --nodes=1
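#SBATCH --ntasks=3                  # One MPI task per GPU
#SBATCH --gres=gpu:3                # Can use up to 3 GPUs per node
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out

module purge
# MPI module assumed to match the LAMMPS GPU build; adjust if yours differs
module load gcc/12.2.0 openmpi/4.1.5 cuda/12.1.1
# Add LAMMPS GPU path here

cd $SLURM_SUBMIT_DIR
# Kokkos drives one GPU per MPI task, so launch 3 tasks to use 3 GPUs
mpirun -np $SLURM_NTASKS lmp -k on g 3 -sf kk -pk kokkos gpu/aware off -in input.lmp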

Debugging Failed Jobs

When a job fails, investigate systematically:

1. Check Job Status

# See why it failed
sacct -j <jobid> --format=JobID,State,ExitCode,Reason

# Get detailed info
scontrol show job <jobid>
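
When the status check is scripted, `sacct` output is easier to handle in its pipe-delimited form (`--parsable2 --noheader`). The sketch below parses that form for the three fields `JobID,State,ExitCode`. The helper names are hypothetical, and the sample text is made-up illustrative data, not real cluster output.

```python
from typing import Dict, List


def parse_sacct(text: str) -> List[Dict[str, str]]:
    """Parse `sacct --parsable2 --noheader --format=JobID,State,ExitCode` output."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        job_id, state, exit_code = line.split("|")
        rows.append({"job_id": job_id, "state": state, "exit_code": exit_code})
    return rows


def job_succeeded(rows: List[Dict[str, str]]) -> bool:
    """True when every step is COMPLETED with exit code 0:0."""
    return bool(rows) and all(
        row["state"] == "COMPLETED" and row["exit_code"] == "0:0" for row in rows
    )


# Illustrative sample only: one job with its batch step.
sample = "12345|COMPLETED|0:0\n12345.batch|COMPLETED|0:0\n"
print(job_succeeded(parse_sacct(sample)))  # True
```

The input text would come from a command such as `ssh cu_alpine "sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader"`.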

2. Read Output Files

# Check stdout
cat my_job_12345.out

# Check stderr (often has the real error)
cat my_job_12345.err

# Check application logs
cat log.lammps

3. Common Failure Reasons

Issue	Symptom	Solution
Timeout	State=TIMEOUT	Increase `--time` or optimize
Memory	State=OUT_OF_MEMORY	Request more memory (`--mem`), spread across more nodes, or use `amem`
Module not found	"command not found"	Check `module load` order
Bad path	"file not found"	Use absolute paths
Wrong partition	Job pending forever	Check partition resources

4. Interactive Debugging

# Get interactive session for debugging
sinteractive --partition=atesting --time=01:00:00 --ntasks=4

# Then run commands interactively to debug
module load lammps
lmp -in input.lmp  # See errors in real-time

File Transfer

Between Local and HPC

# Upload input files
scp -r ./inputs/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/Agent_Runs/my-run/

# Download results
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/Agent_Runs/my-run/output.dat ./

# Sync directories (rsync is more efficient for updates)
rsync -avz ./project/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/project/

Large File Transfers

For very large files, use Globus (web-based) or DTN nodes:

# Use data transfer node for large transfers
scp large_file.tar $CURC_USER@dtn.rc.colorado.edu:/scratch/alpine/$CURC_USER/

Queue Times and Async Job Management

Understanding Queue Wait Times

CRITICAL: HPC jobs don't start immediately. Queue times vary dramatically:

Partition	Typical Wait	Why
`atesting`	Minutes	Testing partition, low demand
`amilan`	Minutes to hours	Many nodes (387), high throughput
`amilan128c`	Hours to DAYS	Only 16 nodes, high demand
`aa100`	Hours to days	Only 11 nodes, GPU scarcity

Before submitting, check the queue:

# See pending jobs and estimated start times
ssh cu_alpine "squeue -p amilan128c --start"

# Quick queue depth check
ssh cu_alpine "squeue -p amilan128c --state=PENDING | wc -l"

Async Workflow (For Long Queue Times)

DON'T block waiting for jobs with multi-day queues. Instead:

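import os
import sys
# Make the skill directory importable (relative to project root)
skill_dir = os.path.join(os.environ.get('PROJECT_ROOT', '.'), '.claude/skills/hpc-cluster')
sys.path.insert(0, skill_dir)
from hpc_client import HPCClient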

hpc = HPCClient()
hpc.connect()

# 1. Check queue before choosing partition
status = hpc.get_queue_status('amilan128c')
print(f"Estimated wait: {status['estimated_wait']}")
print(f"Pending jobs: {status['pending_jobs']}")

# 2. Compare partitions to choose wisely
for part in hpc.compare_partitions(['amilan', 'amilan128c', 'aa100']):
    print(f"{part['partition']}: {part['estimated_wait']}, {part['pending_jobs']} pending")

# 3. Submit async (returns immediately, saves tracking file)
# run_dir is the workspace path returned earlier by hpc.create_run(...)
tracking = hpc.submit_async(f"{run_dir}/job.slurm")
print(f"Job {tracking['job_id']} submitted")
print(f"Estimated start: {tracking['estimated_start']}")
# Returns immediately - don't wait!

# 4. Later: Check on all submitted jobs
jobs = hpc.check_async_jobs()
for job in jobs:
    print(f"Job {job['job_id']}: {job['current_status']}")
    if job['is_finished']:
        print(f"  Completed! Success: {job['is_success']}")

Workflow Strategy for Long-Running Studies

For multi-day queue scenarios:

Day 1: Submit jobs
├── Check queue status
├── Submit with submit_async()
├── Note estimated start times
└── Move on to other work

Day 2+: Check periodically
├── hpc.check_async_jobs()
├── If still PENDING: wait
├── If RUNNING: monitor progress
└── If COMPLETED: download results and analyze

SLURM Email Notifications (Recommended)

Add to your job scripts for automatic notifications:

#SBATCH --mail-type=BEGIN,END,FAIL    # When to email
#SBATCH --mail-user=your@email.com    # Your email

# Options: NONE, BEGIN, END, FAIL, REQUEUE, ALL
# BEGIN = job started (left queue)
# END = job finished
# FAIL = job failed

Smart Partition Selection

Decision tree:

Need GPU?
├── YES → Check aa100 queue
│         └── Long wait? Consider if job can run on CPU instead
└── NO → How many cores?
         ├── ≤64 cores → amilan (shorter queue, more nodes)
         └── >64 cores or tightly-coupled →
             └── Check amilan128c queue
                 └── Wait >24h? Consider splitting across amilan nodes
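
The decision tree can be written as a small function so the choice is made the same way every time. This is a sketch of the tree as drawn above, not a cluster policy. `choose_partition` is a hypothetical helper, and the estimated `amilan128c` wait is supplied by the caller, for example from a queue check.

```python
def choose_partition(needs_gpu: bool, cores: int, tightly_coupled: bool = False,
                     wait_hours_128c: float = 0.0) -> str:
    """Pick a partition following the decision tree in this document."""
    if needs_gpu:
        return "aa100"
    if cores <= 64 and not tightly_coupled:
        return "amilan"  # shorter queue, more nodes
    if wait_hours_128c > 24:
        return "amilan"  # split the job across several amilan nodes instead
    return "amilan128c"


print(choose_partition(needs_gpu=False, cores=32))                        # amilan
print(choose_partition(needs_gpu=False, cores=128))                       # amilan128c
print(choose_partition(needs_gpu=False, cores=128, wait_hours_128c=48))   # amilan
print(choose_partition(needs_gpu=True, cores=1))                          # aa100
```

The GPU branch always returns `aa100`; whether a long GPU queue justifies a CPU run instead is left to the caller, as in the tree.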

Check Job Progress

# One-time status check with start time estimates
ssh cu_alpine "squeue -u $CURC_USER --start"

# See job details
ssh cu_alpine "scontrol show job <jobid>"

# Check why job is pending
ssh cu_alpine "squeue -j <jobid> --format='%r'"  # Shows REASON

Wait for Job Completion (Short Jobs Only)

Only use blocking wait for jobs expected to complete within minutes:

# Poll until the job leaves the queue (ONLY for short jobs!)
JOB_ID=12345
while ssh cu_alpine "squeue -h -j $JOB_ID 2>/dev/null | grep -q $JOB_ID"; do
    echo "Job $JOB_ID still pending or running..."
    sleep 60
done
echo "Job $JOB_ID left the queue"

# Check final status
ssh cu_alpine "sacct -j $JOB_ID --format=JobID,State,ExitCode"
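
A blocking wait is safer with an upper bound, so a stuck job cannot hold the caller forever. The sketch below shows the same polling idea with a timeout. `wait_until_gone` is a hypothetical helper, and `in_queue` is a caller-supplied function (for example, one that runs `squeue -h -j <jobid>` over SSH and reports whether any output came back).

```python
import time
from typing import Callable


def wait_until_gone(job_id: str, in_queue: Callable[[str], bool],
                    interval: int = 60, timeout: int = 3600,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the job leaves the queue; return False if timeout is reached."""
    waited = 0
    while in_queue(job_id):
        if waited >= timeout:
            return False  # still queued or running after the allowed time
        sleep(interval)
        waited += interval
    return True


# Stand-in for a real queue check: reports the job as present twice, then gone.
answers = iter([True, True, False])
finished = wait_until_gone("12345", lambda _job: next(answers), sleep=lambda _s: None)
print(finished)  # True
```

A return value of `True` only means the job left the queue. The final state still has to be read with `sacct`, as in the shell example above.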

Key Principles

You Are a Researcher

You have the same access a human researcher has. You can:

Create any job script you need
Load any available module
Debug failures by reading logs
Adapt to different software versions
Figure out problems through investigation

Don't Just Execute - Verify

After running on HPC:

Check job completed successfully (not just submitted)
Verify output files exist and have content
Check for error messages in stderr
Validate results are physically reasonable
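
The first three checks are mechanical and can be automated once results have been downloaded. The sketch below assumes local file paths. `verify_outputs` is a hypothetical helper, and the case-insensitive search for "error" in stderr is a rough heuristic. Checking that results are physically reasonable remains a judgment call specific to each simulation.

```python
import os
from typing import List, Optional


def verify_outputs(paths: List[str], stderr_path: Optional[str] = None) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append(f"missing: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"empty: {path}")
    if stderr_path and os.path.isfile(stderr_path):
        with open(stderr_path, errors="replace") as handle:
            for line in handle:
                if "error" in line.lower():
                    # Report only the first matching line to keep output short.
                    problems.append(f"stderr: {line.strip()}")
                    break
    return problems


if __name__ == "__main__":
    # Placeholder file names; substitute the outputs of your own run.
    for problem in verify_outputs(["results/output.dat"], "results/my_job_12345.err"):
        print(problem)
```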

Document Your Work

Leave breadcrumbs for yourself:

# In job script
echo "Job started at $(date)"
echo "Running on $(hostname)"
echo "Loaded modules: $(module list 2>&1)"
