fl-sean03

hpc-cluster

Run jobs on CU Boulder CURC HPC cluster (Alpine). Use when simulations need more compute than the local workstation, for large-scale parallel jobs, or when GPU resources are needed beyond local availability. You have full SSH access - work like a researcher.

fl-sean03 3 Updated 3mo ago

Resources

2
GitHub

Install

npx skillscat add fl-sean03/agentic-science-worker/hpc-cluster

Install via the SkillsCat registry.

SKILL.md

CURC HPC Cluster Access (CU Boulder Alpine)

You have full SSH access to CU Boulder's Alpine HPC cluster. You can do everything a human researcher can do: submit jobs, debug failures, load modules, transfer files, and work autonomously.

Quick Reference

Item Value
Login ssh $CURC_USER@login.rc.colorado.edu
Filesystem /scratch/alpine/$CURC_USER/ (10TB, fast I/O)
Agent Workspace /scratch/alpine/$CURC_USER/Agent_Runs/
Job Scheduler SLURM
Default Partition amilan (CPU), aa100 (GPU)
Authentication SSH key (pre-configured)
HPC Client .claude/skills/hpc-cluster/hpc_client.py

Two Ways to Work

You have two approaches available:

1. Python HPC Client (Recommended for common operations)

A lightweight client that handles connection management and common patterns:

import sys
import os
# Add the skill directory to path (relative to project root)
skill_dir = os.path.join(os.environ.get('PROJECT_ROOT', '.'), '.claude/skills/hpc-cluster')
sys.path.insert(0, skill_dir)
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# Create workspace, upload files, submit job, wait for completion
run_dir = hpc.create_run("argon-diffusion")
hpc.upload("input.lmp", f"{run_dir}/input.lmp")
hpc.upload("job.slurm", f"{run_dir}/job.slurm")
job_id = hpc.submit(f"{run_dir}/job.slurm")
status = hpc.wait_for_job(job_id, timeout=3600)

if status.is_success:
    hpc.download(f"{run_dir}/output.dat", "./results/")
else:
    # Debug: read error output
    print(hpc.read_file(f"{run_dir}/my_job_{job_id}.err"))

hpc.disconnect()

2. Direct SSH (For full control)

When you need to do something the client doesn't support, use raw SSH:

# Run any command
ssh $CURC_USER@login.rc.colorado.edu "your command here"

# Interactive debugging
ssh $CURC_USER@login.rc.colorado.edu

Use the client for: workspace setup, file transfer, job submission, job monitoring
Use raw SSH for: debugging, exploring, unusual operations, anything not covered


Connection

SSH Access

SSH is pre-configured with key-based authentication and connection multiplexing via ~/.ssh/config. Use the cu_alpine alias for simplicity:

# Connect to CURC login node (uses ~/.ssh/config)
ssh cu_alpine

# Run a single command
ssh cu_alpine "squeue -u $CURC_USER"

# Or use full address
ssh $CURC_USER@login.rc.colorado.edu "squeue -u $CURC_USER"

# Transfer files TO HPC
scp local_file.txt $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/

# Transfer files FROM HPC
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/results.dat ./

Connection multiplexing: The SSH config uses ControlMaster to reuse connections - the first connection is slower, but subsequent ones are instant.

Important: The login node is for submitting jobs and light tasks. Never run compute-intensive work directly on login nodes.


Workspace Structure

All agent work on HPC goes in the existing Agent_Runs directory:

/scratch/alpine/$CURC_USER/Agent_Runs/
├── argon-diffusion-20260118/
│   ├── inputs/
│   ├── outputs/
│   ├── job.slurm
│   └── README.md
├── water-tip4p-20260119/
├── shared/
│   ├── potentials/          # Downloaded force fields
│   ├── pseudopotentials/    # Downloaded pseudopotentials
│   └── scripts/             # Reusable analysis scripts
└── ...

Creating a New Run

# Create run directory with timestamp
RUN_NAME="project-name-$(date +%Y%m%d-%H%M%S)"
RUN_DIR="/scratch/alpine/$CURC_USER/Agent_Runs/$RUN_NAME"
ssh cu_alpine "mkdir -p $RUN_DIR/{inputs,outputs}"

SLURM Job Submission

Job Script Template

#!/bin/bash
#SBATCH --job-name=my_simulation
#SBATCH --partition=amilan          # CPU partition (or aa100 for GPU)
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Number of MPI tasks
#SBATCH --time=04:00:00             # Max runtime (HH:MM:SS)
#SBATCH --output=%x_%j.out          # stdout file
#SBATCH --error=%x_%j.err           # stderr file
#SBATCH --mail-type=END,FAIL        # Email notifications
#SBATCH --mail-user=your@email.com

# Load required modules
module purge
module load gcc/13.1.0
module load openmpi/4.1.6

# Change to run directory
cd $SLURM_SUBMIT_DIR

# Run your simulation
mpirun -np $SLURM_NTASKS ./your_program input.in

Key SLURM Commands

Command Purpose
sbatch job.slurm Submit batch job
squeue -u $USER Check your job status
squeue -j <jobid> Check specific job
scancel <jobid> Cancel a job
sinfo -p amilan Check partition status
sacct -j <jobid> Job accounting info
scontrol show job <jobid> Detailed job info

Job Status Codes

Code Meaning
PD Pending (waiting for resources)
R Running
CG Completing
CD Completed
F Failed
TO Timeout
CA Cancelled

Available Partitions

Partition Selection Strategy

CRITICAL: Always validate on testing partition first before production runs!

Workflow:
1. atesting / atesting_a100  →  Validate job script works (1 hour max)
2. amilan / aa100            →  Production runs (24 hour max)
3. amilan + qos=long         →  Extended runs (7 day max, lower priority)

Testing Partitions (Use First!)

Partition Limits Max Time Purpose
atesting 2 nodes, 16 cores max 1h Validate CPU jobs work before production
atesting_a100 1 GPU, 10 cores max 1h Validate GPU jobs work before production
atesting_mi100 1 GPU, 10 cores max 1h Validate AMD GPU jobs

Always run a short test on atesting first to catch:

  • Module loading issues
  • Path errors
  • Input file problems
  • Memory requirements

Production CPU Partitions

Partition Nodes Cores/Node RAM/Node Max Time Use For
amilan 387 32-64 256 GB (3.75 GB/core) 24h Default for production CPU jobs
amilan128c 16 128 256 GB (2 GB/core) 24h High core count on single node (see below)
amem 24 48-128 up to 2 TB 24h Memory-intensive (requires --qos=mem, must request 256GB+)

When to Use amilan128c vs amilan

Use amilan128c when:

  • Your job benefits from 128 cores on ONE node (vs spreading across multiple nodes)
  • Running OpenMP/shared-memory parallel codes
  • High inter-process communication (MPI with frequent small messages)
  • Tightly-coupled simulations where network latency hurts performance
  • Large LAMMPS/QE jobs that scale well but suffer from inter-node communication

Use regular amilan when:

  • Your job needs fewer than 64 cores
  • You need multiple nodes (amilan has 387 nodes vs only 16 for 128c)
  • Memory per core matters more (3.75 GB/core vs 2 GB/core on 128c)
  • Queue wait time is a concern (more nodes = shorter queue)

Example: 128-core single-node LAMMPS job

#SBATCH --partition=amilan128c
#SBATCH --nodes=1
#SBATCH --ntasks=128           # Use all 128 cores
#SBATCH --time=12:00:00

Production GPU Partitions

Partition Nodes GPUs/Node GPU Type Max Time Use For
aa100 11 3 NVIDIA A100 (40GB) 24h Best for CUDA, ML/DL, GPU-accelerated MD
ami100 7 3 AMD MI100 24h ROCm/HIP workloads
al40 3 3 NVIDIA L40 24h Newer architecture, visualization

Special Partitions

Partition Max Time Purpose
acompile 12h Compiling software only (use via acompile command)
csu 24h Colorado State contributed nodes
amc 24h CU Anschutz contributed nodes

QoS (Quality of Service)

QoS Max Time Priority When to Use
normal 24h Normal Default - use for most jobs
long 7 days Lower Extended simulations (will wait longer in queue)
mem 24h Normal Required for amem partition (high-memory jobs)

Partition Selection Examples

# 1. TESTING: Always start here to validate your job works
#SBATCH --partition=atesting
#SBATCH --time=00:30:00
#SBATCH --ntasks=4

# 2. PRODUCTION CPU: After testing passes
#SBATCH --partition=amilan
#SBATCH --time=04:00:00
#SBATCH --ntasks=32

# 3. PRODUCTION GPU: For GPU-accelerated codes
#SBATCH --partition=aa100
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# 4. LONG RUNS: When 24h isn't enough (lower priority)
#SBATCH --partition=amilan
#SBATCH --qos=long
#SBATCH --time=168:00:00   # 7 days

# 5. HIGH MEMORY: For memory-intensive jobs (256GB+ required)
#SBATCH --partition=amem
#SBATCH --qos=mem
#SBATCH --mem=512G
#SBATCH --time=12:00:00

Module System

Software is managed through environment modules. Always work from a compute node or compile node, not login.

Essential Commands

# List available modules
module avail

# Search for specific software
module spider lammps
module spider python

# Load modules
module load gcc/13.1.0
module load openmpi/4.1.6
module load lammps/20230802

# See what's loaded
module list

# Unload all modules
module purge

# Save/restore module sets
module save my_env
module restore my_env

Finding and Loading Software

Software on CURC is installed in /curc/sw/install/. To find what's available:

# List all installed software
ls /curc/sw/install/

# Check specific software versions
ls /curc/sw/install/lammps/    # LAMMPS versions (22July25, 2Sept25, etc.)
ls /curc/sw/install/QE/        # Quantum ESPRESSO (7.0, 7.2)
ls /curc/sw/install/gromacs/   # GROMACS versions

LAMMPS example (check exact paths for current versions):

# Find the binary
ls /curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin/

# In job script
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS lmp -in input.lmp

Quantum ESPRESSO example:

module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS pw.x < input.in > output.out

Note: Module dependencies matter. Load compiler first, then MPI. Check exact version paths as they may change.


Storage Filesystem

Paths and Quotas

Path Quota Purge Use For
/home/$USER 2 GB Never Scripts, small configs
/projects/$USER 250 GB Never Code, small datasets
/scratch/alpine/$USER 10 TB 90 days Job I/O, large files
$SLURM_SCRATCH ~300 GB Job end Node-local temp storage

Performance Rules

DO:

  • Run all job I/O on /scratch/alpine/
  • Use $SLURM_SCRATCH for intensive temporary files
  • Copy results back after job completes

DON'T:

  • Run I/O-intensive jobs on /home or /projects (will be killed)
  • Store important data only on /scratch (it's purged!)
  • Leave large files on login nodes

Example Workflows

Recommended Workflow: Test First, Then Production

Step 1: Create a testing job script (job_test.slurm)

#!/bin/bash
#SBATCH --job-name=argon_test
#SBATCH --partition=atesting        # <-- TEST PARTITION FIRST
#SBATCH --nodes=1
#SBATCH --ntasks=4                  # Small scale for testing
#SBATCH --time=00:30:00             # 30 min is plenty for testing
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

echo "=== Testing job script ==="
echo "Started at: $(date)"
echo "Running on: $(hostname)"

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
echo "Working directory: $(pwd)"
echo "Input files: $(ls -la)"

# Run short test (reduce timesteps in input for testing)
mpirun -np $SLURM_NTASKS lmp -in input.lmp

echo "Finished at: $(date)"

Step 2: If test passes, create production job (job_prod.slurm)

#!/bin/bash
#SBATCH --job-name=argon_prod
#SBATCH --partition=amilan          # <-- PRODUCTION PARTITION
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Full scale
#SBATCH --time=04:00:00             # Appropriate for full run
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp

LAMMPS MD Simulation (Full Example)

#!/bin/bash
#SBATCH --job-name=argon_md
#SBATCH --partition=amilan
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp

Quantum ESPRESSO DFT

#!/bin/bash
#SBATCH --job-name=si_scf
#SBATCH --partition=amilan
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS pw.x < si_scf.in > si_scf.out

GPU Job (Testing First)

Test on atesting_a100:

#!/bin/bash
#SBATCH --job-name=md_gpu_test
#SBATCH --partition=atesting_a100   # <-- GPU TESTING
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1
# Add LAMMPS GPU path here

cd $SLURM_SUBMIT_DIR
lmp -k on g 1 -sf kk -pk kokkos gpu/aware off -in input.lmp

Then production on aa100:

#!/bin/bash
#SBATCH --job-name=md_gpu_prod
#SBATCH --partition=aa100           # <-- GPU PRODUCTION
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:3                # Can use up to 3 GPUs per node
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1

cd $SLURM_SUBMIT_DIR
lmp -k on g 3 -sf kk -pk kokkos gpu/aware off -in input.lmp

Debugging Failed Jobs

When a job fails, investigate systematically:

1. Check Job Status

# See why it failed
sacct -j <jobid> --format=JobID,State,ExitCode,Reason

# Get detailed info
scontrol show job <jobid>

2. Read Output Files

# Check stdout
cat my_job_12345.out

# Check stderr (often has the real error)
cat my_job_12345.err

# Check application logs
cat log.lammps

3. Common Failure Reasons

Issue Symptom Solution
Timeout State=TIMEOUT Increase --time or optimize
Memory State=OUT_OF_MEMORY Increase nodes or use amem
Module not found "command not found" Check module load order
Bad path "file not found" Use absolute paths
Wrong partition Job pending forever Check partition resources

4. Interactive Debugging

# Get interactive session for debugging
sinteractive --partition=atesting --time=01:00:00 --ntasks=4

# Then run commands interactively to debug
module load lammps
lmp -in input.lmp  # See errors in real-time

File Transfer

Between Local and HPC

# Upload input files
scp -r ./inputs/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/agent-workspace/runs/my-run/

# Download results
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/agent-workspace/runs/my-run/output.dat ./

# Sync directories (rsync is more efficient for updates)
rsync -avz ./project/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/project/

Large File Transfers

For very large files, use Globus (web-based) or DTN nodes:

# Use data transfer node for large transfers
scp large_file.tar $CURC_USER@dtn.rc.colorado.edu:/scratch/alpine/$CURC_USER/

Queue Times and Async Job Management

Understanding Queue Wait Times

CRITICAL: HPC jobs don't start immediately. Queue times vary dramatically:

Partition Typical Wait Why
atesting Minutes Testing partition, low demand
amilan Minutes to hours Many nodes (387), high throughput
amilan128c Hours to DAYS Only 16 nodes, high demand
aa100 Hours to days Only 11 nodes, GPU scarcity

Before submitting, check the queue:

# See pending jobs and estimated start times
ssh cu_alpine "squeue -p amilan128c --start"

# Quick queue depth check
ssh cu_alpine "squeue -p amilan128c --state=PENDING | wc -l"

Async Workflow (For Long Queue Times)

DON'T block waiting for jobs with multi-day queues. Instead:

from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# 1. Check queue before choosing partition
status = hpc.get_queue_status('amilan128c')
print(f"Estimated wait: {status['estimated_wait']}")
print(f"Pending jobs: {status['pending_jobs']}")

# 2. Compare partitions to choose wisely
for part in hpc.compare_partitions(['amilan', 'amilan128c', 'aa100']):
    print(f"{part['partition']}: {part['estimated_wait']}, {part['pending_jobs']} pending")

# 3. Submit async (returns immediately, saves tracking file)
tracking = hpc.submit_async(f"{run_dir}/job.slurm")
print(f"Job {tracking['job_id']} submitted")
print(f"Estimated start: {tracking['estimated_start']}")
# Returns immediately - don't wait!

# 4. Later: Check on all submitted jobs
jobs = hpc.check_async_jobs()
for job in jobs:
    print(f"Job {job['job_id']}: {job['current_status']}")
    if job['is_finished']:
        print(f"  Completed! Success: {job['is_success']}")

Workflow Strategy for Long-Running Studies

For multi-day queue scenarios:

Day 1: Submit jobs
├── Check queue status
├── Submit with submit_async()
├── Note estimated start times
└── Move on to other work

Day 2+: Check periodically
├── hpc.check_async_jobs()
├── If still PENDING: wait
├── If RUNNING: monitor progress
└── If COMPLETED: download results and analyze

SLURM Email Notifications (Recommended)

Add to your job scripts for automatic notifications:

#SBATCH --mail-type=BEGIN,END,FAIL    # When to email
#SBATCH --mail-user=your@email.com    # Your email

# Options: NONE, BEGIN, END, FAIL, REQUEUE, ALL
# BEGIN = job started (left queue)
# END = job finished
# FAIL = job failed

Smart Partition Selection

Decision tree:

Need GPU?
├── YES → Check aa100 queue
│         └── Long wait? Consider if job can run on CPU instead
└── NO → How many cores?
         ├── ≤64 cores → amilan (shorter queue, more nodes)
         └── >64 cores or tightly-coupled →
             └── Check amilan128c queue
                 └── Wait >24h? Consider splitting across amilan nodes

Check Job Progress

# One-time status check with start time estimates
ssh cu_alpine "squeue -u $CURC_USER --start"

# See job details
ssh cu_alpine "scontrol show job <jobid>"

# Check why job is pending
ssh cu_alpine "squeue -j <jobid> --format='%r'"  # Shows REASON

Wait for Job Completion (Short Jobs Only)

Only use blocking wait for jobs expected to complete within minutes:

# Poll until job completes (ONLY for short jobs!)
JOB_ID=12345
while ssh cu_alpine "squeue -j $JOB_ID 2>/dev/null | grep -q $JOB_ID"; do
    echo "Job $JOB_ID still running..."
    sleep 60
done
echo "Job $JOB_ID completed"

# Check final status
ssh cu_alpine "sacct -j $JOB_ID --format=JobID,State,ExitCode"

Key Principles

You Are a Researcher

You have the same access a human researcher has. You can:

  • Create any job script you need
  • Load any available module
  • Debug failures by reading logs
  • Adapt to different software versions
  • Figure out problems through investigation

Don't Just Execute - Verify

After running on HPC:

  1. Check job completed successfully (not just submitted)
  2. Verify output files exist and have content
  3. Check for error messages in stderr
  4. Validate results are physically reasonable

Document Your Work

Leave breadcrumbs for yourself:

# In job script
echo "Job started at $(date)"
echo "Running on $(hostname)"
echo "Loaded modules: $(module list 2>&1)"

Reference Links