Quick references
  • SLURM docs: Quick start
  • sbatch docs
  • squeue docs

Clusters feel much less mysterious once you separate where you edit, where you submit, and where the job actually runs.

Mental model

The three places that matter:

Place            What you do there                       What not to do
Login node       edit files, submit jobs, light checks   heavy computation
Compute node     actual CPU / memory-intensive work      long interactive editing
Shared storage   inputs, outputs, logs, scripts          assume it is infinitely fast

Two common modes:

  • Batch job: you prepare a script and submit it with sbatch.
  • Interactive job: you request resources and work inside an allocated session with srun.

Important job resources:

  • --time = wall-clock limit
  • --mem = RAM requested
  • --cpus-per-task = threads available to your program
  • --partition = queue / hardware pool
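
All four map directly onto flags, either as #SBATCH directives inside a batch script or on the srun command line. A minimal sketch (the partition name compute and the sizes here are placeholders, adjust for your cluster):

#SBATCH --time=02:00:00        # wall-clock limit, HH:MM:SS
#SBATCH --mem=8G               # RAM for the whole job
#SBATCH --cpus-per-task=4      # threads available to your program
#SBATCH --partition=compute    # queue / hardware pool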

Core commands

squeue -u $USER
sbatch run.sh
srun --pty bash
scancel 123456
sacct -j 123456
scontrol show job 123456
sinfo

What they are for:

Command                    Use
squeue -u $USER            see pending and running jobs
sbatch script.sh           submit a batch script
srun --pty bash            start an interactive session
scancel JOBID              cancel a job
sacct -j JOBID             inspect finished job accounting
scontrol show job JOBID    detailed job metadata and pending reasons
sinfo                      inspect partitions and node states

Login node vs compute node

Use the login node for:

  • editing scripts
  • checking files
  • lightweight commands
  • submitting jobs

Use compute nodes for:

  • alignments
  • variant calling
  • long scripts
  • multithreaded analyses
  • memory-heavy jobs

Short rule:

  • if the command might annoy other users, it probably belongs in a job
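
For example, a one-off bcftools stats run that would hammer the login node can be wrapped into a small job; sbatch --wrap takes a command string instead of a script (the partition name and resource sizes below are placeholders):

sbatch --partition=compute --cpus-per-task=2 --mem=4G --time=00:30:00 \
    --wrap="bcftools stats cohort.vcf.gz > cohort.stats.txt"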

Interactive vs batch jobs

Interactive

Use interactive jobs when:

  • debugging code
  • testing a workflow on a small input
  • checking environment/module issues

Example:

srun --partition=compute --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

Batch

Use batch jobs when:

  • the workflow is stable
  • the run may take a while
  • you want reproducible logs and explicit resource requests

sbatch template

#!/bin/bash
#SBATCH --job-name=vcf_qc
#SBATCH --output=logs/%x.%j.out
#SBATCH --error=logs/%x.%j.err
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --partition=compute

set -euo pipefail

module load bcftools
# or: source ~/miniconda3/etc/profile.d/conda.sh
# conda activate popgen

bcftools view -i 'QUAL>30' cohort.vcf.gz -Oz -o cohort.filtered.vcf.gz
tabix -p vcf cohort.filtered.vcf.gz

Good defaults:

  • always write both stdout and stderr to files
  • always request time and memory explicitly
  • always use set -euo pipefail unless you have a reason not to

Checking jobs

What is running right now

squeue -u $USER

Look for:

  • R = running
  • PD = pending
  • assigned partition
  • elapsed time

Why is a job pending

scontrol show job 123456

Useful fields:

  • JobState
  • Reason
  • Partition
  • NumCPUs
  • MinMemory

Common pending reasons:

  • resources not available yet
  • partition is busy
  • requested memory / CPU / time is too large
  • account / priority limits
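
To see the scheduler's reason code for every pending job at once, rather than running scontrol per job, a format string like this works (the field widths are arbitrary):

squeue -u $USER -t PD -o "%.10i %.20j %r"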

What happened after the job finished

sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode

This is often the fastest way to diagnose memory problems or silent failures.
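
Many clusters also install the contributed seff helper, which summarizes the same accounting data as a memory / CPU efficiency report; it is not part of core SLURM, so it may be missing on your system:

seff 123456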

Logs

The two files that matter most are usually:

  • standard output
  • standard error

If you use:

#SBATCH --output=logs/%x.%j.out
#SBATCH --error=logs/%x.%j.err

then %x becomes the job name and %j becomes the job ID.
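
One gotcha: SLURM does not create the logs/ directory for you, and if it is missing the output files typically never appear, so create it once before submitting:

mkdir -p logs
sbatch run.sh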

Very useful:

tail -f logs/vcf_qc.123456.out
tail -f logs/vcf_qc.123456.err

Failure checklist

When a job fails, check in this order:

  1. Did the script actually start?
  2. Did the expected module / conda environment load?
  3. Did the input path exist on the cluster filesystem?
  4. Did the job hit memory or walltime limits?
  5. Was the command using more threads than requested?

Fast inspection sequence:

sacct -j 123456 --format=JobID,State,Elapsed,MaxRSS,ExitCode
scontrol show job 123456
tail -n 50 logs/job.123456.err
tail -n 50 logs/job.123456.out

Practical habits

  • Test on a small dataset interactively first.
  • Submit the stable version with sbatch.
  • Do not run heavy tools on the login node.
  • Request realistic memory and time, not fantasy values.
  • Match tool threads to --cpus-per-task (see the sketch after this list).
  • Keep logs in a dedicated logs/ directory.
  • Keep scripts in version control when possible.
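
For the thread-matching habit, SLURM sets SLURM_CPUS_PER_TASK inside the job whenever --cpus-per-task is requested, so you can pass it to the tool instead of hard-coding a number. A sketch reusing the bcftools command from the template above (the :-1 default is just a guard in case the variable is unset):

bcftools view --threads "${SLURM_CPUS_PER_TASK:-1}" -i 'QUAL>30' cohort.vcf.gz -Oz -o cohort.filtered.vcf.gz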

PBS / Torque equivalents

Some clusters use PBS / Torque instead of SLURM.

Rough equivalents:

SLURM      PBS / Torque
sbatch     qsub
squeue     qstat
scancel    qdel

The exact flags differ, but the mental model is almost the same.
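
For orientation only, a Torque-style header roughly equivalent to the sbatch template above could look like this; PBS Pro uses a different resource-request syntax (-l select=...), so check your site's docs before copying it:

#!/bin/bash
#PBS -N vcf_qc
#PBS -o logs/vcf_qc.out
#PBS -e logs/vcf_qc.err
#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=8,mem=16gb
#PBS -q compute

cd "$PBS_O_WORKDIR"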