Clusters feel much less mysterious once you separate where you edit, where you submit, and where the job actually runs.
Mental model
The three places that matter:
| Place | What you do there | What not to do |
|---|---|---|
| Login node | edit files, submit jobs, light checks | heavy computation |
| Compute node | actual CPU / memory-intensive work | long interactive editing |
| Shared storage | inputs, outputs, logs, scripts | assume it is infinitely fast |
Two common modes:
- Batch job: you prepare a script and submit it with `sbatch`.
- Interactive job: you request resources and work inside an allocated session with `srun`.
Important job resources:
- `--time` = wall-clock limit
- `--mem` = RAM requested
- `--cpus-per-task` = threads available to your program
- `--partition` = queue / hardware pool
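Each of these maps directly onto a submission flag. A minimal sketch of requesting them on the command line (the partition name and `run.sh` are placeholders; the same requests can also live inside the script as `#SBATCH` lines, as in the template further down):

```bash
# one-off submission with resources requested on the command line;
# command-line flags take precedence over matching #SBATCH directives in the script
sbatch --time=02:00:00 --mem=8G --cpus-per-task=4 --partition=compute run.sh
```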
Core commands
```bash
squeue -u $USER
sbatch run.sh
srun --pty bash
scancel 123456
sacct -j 123456
scontrol show job 123456
sinfo
```
What they are for:
| Command | Use |
|---|---|
| `squeue -u $USER` | see pending and running jobs |
| `sbatch script.sh` | submit a batch script |
| `srun --pty bash` | start an interactive session |
| `scancel JOBID` | cancel a job |
| `sacct -j JOBID` | inspect finished job accounting |
| `scontrol show job JOBID` | detailed job metadata and pending reasons |
| `sinfo` | inspect partitions and node states |
login node vs compute node
Use the login node for:
- editing scripts
- checking files
- lightweight commands
- submitting jobs
Use compute nodes for:
- alignments
- variant calling
- long scripts
- multithreaded analyses
- memory-heavy jobs
Short rule:
- if the command might annoy other users, it probably belongs in a job
Interactive vs batch jobs
Interactive
Use interactive jobs when:
- debugging code
- testing a workflow on a small input
- checking environment/module issues
Example:
```bash
srun --partition=compute --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash
```
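Once the shell opens you are on a compute node, so this is where to try the real commands on a small input. A minimal sketch of such a session, assuming a bcftools module and a small test file exist (both names are placeholders):

```bash
# inside the interactive allocation, i.e. on the compute node
module load bcftools                                   # placeholder module name
bcftools view -i 'QUAL>30' test_subset.vcf.gz | head   # sanity-check the filter on a tiny input
exit                                                   # leave the shell to release the allocation
```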
Batch
Use batch jobs when:
- the workflow is stable
- the run may take a while
- you want reproducible logs and explicit resource requests
sbatch template
```bash
#!/bin/bash
#SBATCH --job-name=vcf_qc
#SBATCH --output=logs/%x.%j.out
#SBATCH --error=logs/%x.%j.err
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --partition=compute

set -euo pipefail

module load bcftools
# or: source ~/miniconda3/etc/profile.d/conda.sh
# conda activate popgen

bcftools view -i 'QUAL>30' cohort.vcf.gz -Oz -o cohort.filtered.vcf.gz
tabix -p vcf cohort.filtered.vcf.gz
```
Good defaults:
- always write both stdout and stderr to files
- always request time and memory explicitly
- always use `set -euo pipefail` unless you have a reason not to
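The reason for the last default: without `set -e` / `pipefail`, a failed step does not stop the script, and the job can still finish with exit code 0. An illustrative sketch (the missing input file is a placeholder):

```bash
# without set -euo pipefail, failures are easy to miss:
bcftools stats missing.vcf.gz | head   # bcftools fails, but the pipeline's exit code is head's (0)
echo "all done"                        # still runs, so the job ends as COMPLETED with ExitCode 0
```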
Checking jobs
What is running right now
```bash
squeue -u $USER
```
Look for:
- `R` = running
- `PD` = pending
- assigned partition
- elapsed time
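If the default columns are hard to scan, `squeue` can be told which fields to print; a small sketch using standard format codes (`%T` = state, `%M` = elapsed time, `%R` = node list or pending reason):

```bash
# one line per job: id, partition, name, state, elapsed time, node or pending reason
squeue -u $USER -o "%.10i %.9P %.15j %.8T %.10M %R"
```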
Why is a job pending
```bash
scontrol show job 123456
```
Useful fields:
- `JobState`
- `Reason`
- `Partition`
- `NumCPUs`
- `MinMemory`
Common pending reasons:
- resources not available yet
- partition is busy
- requested memory / CPU / time is too large
- account / priority limits
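You do not have to read the whole dump; a small sketch that greps out just the fields listed above (exact field names can vary slightly between SLURM versions):

```bash
# keep only the fields that explain why the job is waiting
scontrol show job 123456 | grep -Eo '(JobState|Reason|Partition|NumCPUs|MinMemory)[A-Za-z]*=[^ ]+'
```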
What happened after the job finished
```bash
sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
```
This is often the fastest way to diagnose memory problems or silent failures.
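It can help to print the requested limits next to the usage, so memory and walltime problems show up on a single line; `Timelimit` and `ReqMem` are standard sacct fields:

```bash
# compare what the job used (Elapsed, MaxRSS) with what it asked for (Timelimit, ReqMem)
sacct -j 123456 --format=JobID,JobName,State,Elapsed,Timelimit,ReqMem,MaxRSS,ExitCode
```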
Logs
The two files that matter most are usually:
- standard output
- standard error
If you use:
```bash
#SBATCH --output=logs/%x.%j.out
#SBATCH --error=logs/%x.%j.err
```
then `%x` becomes the job name and `%j` becomes the job ID.
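One caveat: SLURM does not create the `logs/` directory for you, and if it is missing the job typically fails without writing any log at all. Create it once before submitting:

```bash
mkdir -p logs   # must exist before sbatch, otherwise the log files never appear
sbatch run.sh
```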
Very useful:
```bash
tail -f logs/vcf_qc.123456.out
tail -f logs/vcf_qc.123456.err
```
Failure checklist
When a job fails, check in this order:
- Did the script actually start?
- Did the expected module / conda environment load?
- Did the input path exist on the cluster filesystem?
- Did the job hit memory or walltime limits?
- Was the command using more threads than requested?
Fast inspection sequence:
```bash
sacct -j 123456 --format=JobID,State,Elapsed,MaxRSS,ExitCode
scontrol show job 123456
tail -n 50 logs/job.123456.err
tail -n 50 logs/job.123456.out
```
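For the memory question specifically, the error log is often the quickest confirmation, since out-of-memory kills are usually reported there; a small sketch (log path as in the examples above):

```bash
# look for out-of-memory kill messages in the error log
grep -iE "oom|out of memory" logs/job.123456.err
```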
Practical habits
- Test on a small dataset interactively first.
- Submit the stable version with `sbatch`.
- Do not run heavy tools on the login node.
- Request realistic memory and time, not fantasy values.
- Match tool threads to `--cpus-per-task` (see the sketch after this list).
- Keep logs in a dedicated `logs/` directory.
- Keep scripts in version control when possible.
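A small sketch of matching threads to the allocation: inside a job, SLURM exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is set, so the thread count can be read from the environment instead of being hard-coded (the bcftools command mirrors the template above):

```bash
# use however many CPUs the job was actually granted, defaulting to 1 outside SLURM
THREADS="${SLURM_CPUS_PER_TASK:-1}"
bcftools view --threads "$THREADS" -i 'QUAL>30' cohort.vcf.gz -Oz -o cohort.filtered.vcf.gz
```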
PBS / Torque equivalents
Some clusters use PBS / Torque instead of SLURM.
Rough equivalents:
| SLURM | PBS / Torque |
|---|---|
| `sbatch` | `qsub` |
| `squeue` | `qstat` |
| `scancel` | `qdel` |
The exact flags differ, but the mental model is almost the same.
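As a rough sketch, the header of the sbatch template above might look like this under Torque-style PBS (directive syntax differs between PBS variants, so treat it as an approximation and check your site's documentation):

```bash
#!/bin/bash
#PBS -N vcf_qc              # job name
#PBS -o logs/vcf_qc.out     # stdout
#PBS -e logs/vcf_qc.err     # stderr
#PBS -l walltime=04:00:00   # wall-clock limit
#PBS -l nodes=1:ppn=8       # 1 node, 8 cores (Torque syntax; PBS Pro uses -l select=...)
#PBS -l mem=16gb            # memory request
#PBS -q compute             # queue, roughly a SLURM partition
```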