Introduction & Housekeeping
Laying the groundwork: nodes, tasks and other Slurm terminology
Understanding your jobs and the cluster
Lunch (30mins)
Basic job profiling + 5min break
Slurm scripting features + 5min break
Embarrassingly parallel workflows with Slurm job arrays
Have used sbatch with options like --ntasks and --cpus-per-task, and --mem and/or --mem-per-cpu
Awareness of "resources": CPUs, RAM/memory, Nodes, gres (GPUs)
Have used job submission commands
srun   # executes a command/script/binary across tasks
salloc # allocates resources to be used (interactively and/or via srun)
sbatch # submits a script for later execution on requested resources
Awareness of resource request options
--ntasks= # "tasks" recognised by srun--nodes= # no. of nodes--ntasks-per-node= # tasks per node--cpus-per-task= # cpus per task--mem= # memory required for entire job--mem-per-cpu= # memory required for each CPU--gres= # "general resource" (i.e. GPUs)--time= # requested wall timeSlides + live coding
Live coding will be on Milton, so make sure you're connected to WEHI's VPN or staff network, or use RAP: https://rap.wehi.edu.au
Please follow along to reinforce learning!
Questions:
Material is available here: https://github.com/WEHI-ResearchComputing/intermediate-slurm-workshop
Reviewing cluster concepts and explaining the Slurm concepts of tasks and srun
Nodes are essentially standalone computers with their own CPU cores, RAM, local storage, and maybe GPUs.
Note: Slurm calls CPU cores CPUs (e.g. --cpus-per-task).

HPC clusters (or just clusters) will consist of multiple nodes connected together through a (sometimes fast) network.

Typically, HPC is organised with login nodes, compute nodes, and some nodes that perform scheduling and storage duties.


srun vs sbatch and salloc
TL;DR:
sbatch requests resources for use with a script
salloc requests resources to be used interactively
srun runs programs/scripts using resources requested by sbatch and salloc
srun will execute ntasks instances of the same command/script/program.
Example:
Start an interactive session with salloc --ntasks=4 --cpus-per-task=2 --nodes=4
Run the hostname command
Run hostname again, but with srun i.e. srun hostname
Exit the salloc session and try running srun --ntasks=4 --cpus-per-task=2 --nodes=4 hostname
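A minimal sketch of how this exercise might look on the command line (node names and output will differ on your cluster):

```bash
# request 4 tasks (2 CPUs each) spread over 4 nodes, interactively
salloc --ntasks=4 --cpus-per-task=2 --nodes=4

# inside the allocation: runs once, on the node salloc landed you on
hostname

# inside the allocation: srun launches one instance per task -> 4 hostnames
srun hostname

# leave the allocation, then launch hostname directly via srun
exit
srun --ntasks=4 --cpus-per-task=2 --nodes=4 hostname
```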

The BatchHost executes the commands in script.sh
Without srun, a command/script/program will be executed on the BatchHost only

To use the other nodes, execute srun hostname.
BatchHost will send the hostname command to the remaining tasks to be executed as well.

Without srun, only the BatchHost executes the commands/script
Using srun still doesn't guarantee the extra nodes will be used "properly"!
Nodes cannot collaborate on problems unless they are running a program designed that way.
It's like clicking your mouse on your PC, and expecting the click to register on a colleague's PC.
It's possible, but needs a special program/protocol to do so!
Biological sciences and statistics tend not to make use of multiple nodes to cooperate on a single problem.
Hence, we recommend passing --nodes=1.
A task is a collection of resources (CPU cores, GPUs) expected to perform the same "task", or to be used by a single program e.g., via threads, Python multiprocessing, or OpenMP.
The number of tasks is not equivalent to the number of CPUs!

The Slurm task model was created with "traditional HPC" in mind
srun creates ntasks instances of a program which coordinate using MPI
Tasks are not as relevant in bioinformatics, but Slurm nevertheless uses tasks for accounting/profiling purposes.
Therefore, it's useful to have an understanding of tasks in order to interpret some of Slurm's job accounting/profiling outputs.
A task can only be given resources co-located on a node.
Multiple tasks requested by sbatch or salloc can be spread across multiple nodes (unless --nodes= is specified).
For example, if we have two nodes with 4 CPU cores each:
requesting 1 task and 8 cpus-per-task won't work.
But requesting 2 tasks and 4 cpus-per-task will!
Most data science, statistics, bioinformatics, and health-science work will use --ntasks=1 and request parallelism via --cpus-per-task (see the sketch below).
If you see/hear anything to do with "distributed" or MPI (e.g. distributed ML), you may want to change these options.
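A minimal sketch of what such a request might look like in an sbatch script; the program name and its thread flag are placeholders:

```bash
#!/bin/bash
# single node, single task, multiple CPUs for a multi-threaded (non-MPI) program
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# "mytool" and its --threads flag are hypothetical; most tools have an equivalent option
mytool --threads=${SLURM_CPUS_PER_TASK} input.dat
```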
Using Slurm and system tools to understand what your jobs are doing
Primary utilities discussed in this section:
squeue: live job queue data (queries the Slurm controller directly)
sacct: historical job data (queries the Slurm database)
scontrol: live data for a single job (queries the Slurm controller directly)
sinfo: live cluster data (queries the Slurm controller directly)
This section shows you how to get more detailed information about your jobs and the cluster.
WEHI also offers the HPC dashboard which provides visibility into the status of the cluster.
http://dashboards.hpc.wehi.edu.au/
Note: the dashboards' info is coarser than what the Slurm commands can provide, and is specific to WEHI.
squeue
squeue shows everyone's jobs in the queue; passing -u <username> shows only <username>'s jobs.
squeue | head -n 5
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8516030 gpuq interact bollands R 1:30:11 1 gpu-p100-n01
8515707 gpuq cryospar cryospar R 3:04:59 1 gpu-p100-n01
8511988 interacti sys/dash yan.a R 20:15:53 1 sml-n03
8516092 interacti work jackson. R 1:21:42 1 sml-n01
But what if we want even more information?
We have to make use of the formatting options!
$ squeue --Format field1,field2,...
OR use the environment variable SQUEUE_FORMAT2. Useful fields:
| Resources related | Time related | Scheduling |
|---|---|---|
| NumCPUs | starttime | JobId |
| NumNodes | submittime | name |
| minmemory | pendingtime | partition |
| tres-alloc | timelimit | priority |
| | timeleft | reasonlist |
| | timeused | workdir |
| | | state |
You can always use man squeue to see the entire list of options.
So you don't have to type out the fields, I recommend aliasing the command with your fields of choice in ~/.bashrc e.g.
alias sqv="squeue --Format=jobid:8,name:6' ',partition:10' ',statecompact:3,tres-alloc:60,timelimit:12,timeleft:12"
sqv | head -n 5
JOBID NAME PARTITION ST TRES_ALLOC TIME_LIMIT TIME_LEFT 8517002 R bigmem R cpu=22,mem=88G,node=1,billing=720984 1-00:00:00 23:35:18 8516030 intera gpuq R cpu=2,mem=20G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1 8:00:00 4:43:00 8515707 cryosp gpuq R cpu=8,mem=17G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1 2-00:00:00 1-19:08:12 8511988 sys/da interactiv R cpu=8,mem=16G,node=1,billing=112 1-00:00:00 1:57:18
sqv -u bedo.j | head -n 5
JOBID NAME PARTITION ST TRES_ALLOC TIME_LIMIT TIME_LEFT 8516851 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00 8516850 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00 8516849 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00 8516848 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00
scontrol show job <jobid>
Useful if you care only about a specific job.
It's very useful when debugging jobs.
A lot of information without needing lots of input.
scontrol show job 8516360
JobId=8516360 JobName=Extr16S23S UserId=woodruff.c(2317) GroupId=allstaff(10908) MCS_label=N/A Priority=324 Nice=0 Account=wehi QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:21:53 TimeLimit=2-00:00:00 TimeMin=N/A SubmitTime=2022-10-20T11:37:49 EligibleTime=2022-10-20T11:37:49 AccrueTime=2022-10-20T11:37:49 StartTime=2022-10-20T14:28:03 EndTime=2022-10-22T14:28:03 Deadline=N/A PreemptEligibleTime=2022-10-20T14:28:03 PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-20T14:28:03 Scheduler=Main Partition=regular AllocNode:Sid=vc7-shared:12938 ReqNodeList=(null) ExcNodeList=(null) NodeList=med-n24 BatchHost=med-n24 NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=32,mem=48G,node=1,billing=128 Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=* MinCPUsNode=32 MinMemoryNode=48G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell/ribosomal_16S23S_extract_singlespecies.sh Staphylococcus epidermidis 32 WorkDir=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell StdErr=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell/slurm-8516360.out StdIn=/dev/null StdOut=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell/slurm-8516360.out Power= MailUser=woodruff.c@wehi.edu.au MailType=END,FAIL
sacct
squeue and scontrol show job only show information on jobs that are in the queue i.e. jobs that are pending, running, or finishing up.
Once jobs complete, fail, or are cancelled, the job data is put into a Slurm job database.
This database can be queried by sacct to get information about your jobs.
sacct
JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 9041674 crest0.3 regular wehi 224 CANCELLED+ 0:0 9041674.bat+ batch wehi 56 CANCELLED 0:15 9041674.ext+ extern wehi 224 COMPLETED 0:0 9041674.0 orted wehi 168 FAILED 1:0 9170758 gatk-4.2.+ regular wehi 56 COMPLETED 0:0 9170758.ext+ extern wehi 56 COMPLETED 0:0 9170758.0 nix-user-+ wehi 56 COMPLETED 0:0 9221903 impute_1.+ regular wehi 2 COMPLETED 0:0 9221903.ext+ extern wehi 2 COMPLETED 0:0 9221903.0 nix-user-+ wehi 2 COMPLETED 0:0 9221905 lambda.r_+ regular wehi 2 COMPLETED 0:0 9221905.ext+ extern wehi 2 COMPLETED 0:0 9221905.0 nix-user-+ wehi 2 COMPLETED 0:0 9221907 limma_3.5+ regular wehi 2 COMPLETED 0:0 9221907.ext+ extern wehi 2 COMPLETED 0:0 9221907.0 nix-user-+ wehi 2 COMPLETED 0:0 9221909 listenv_0+ regular wehi 2 COMPLETED 0:0 9221909.ext+ extern wehi 2 COMPLETED 0:0 9221909.0 nix-user-+ wehi 2 COMPLETED 0:0 9221910 marray_1.+ regular wehi 2 COMPLETED 0:0 9221910.ext+ extern wehi 2 COMPLETED 0:0 9221910.0 nix-user-+ wehi 2 COMPLETED 0:0 9221911 matrixSta+ regular wehi 2 COMPLETED 0:0 9221911.ext+ extern wehi 2 COMPLETED 0:0 9221911.0 nix-user-+ wehi 2 COMPLETED 0:0 9221912 parallell+ regular wehi 2 COMPLETED 0:0 9221912.ext+ extern wehi 2 COMPLETED 0:0 9221912.0 nix-user-+ wehi 2 COMPLETED 0:0 9221913 r-BH-1.78+ regular wehi 56 COMPLETED 0:0 9221913.ext+ extern wehi 56 COMPLETED 0:0 9221913.0 nix-user-+ wehi 56 COMPLETED 0:0 9221930 sys/dashb+ interacti+ wehi 2 RUNNING 0:0 9221930.bat+ batch wehi 2 RUNNING 0:0 9221930.ext+ extern wehi 2 RUNNING 0:0 9221945 r-BiocGen+ regular wehi 56 COMPLETED 0:0 9221945.ext+ extern wehi 56 COMPLETED 0:0 9221945.0 nix-user-+ wehi 56 COMPLETED 0:0 9221946 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221946.ext+ extern wehi 56 COMPLETED 0:0 9221946.0 nix-user-+ wehi 56 COMPLETED 0:0 9221947 r-Biobase+ regular wehi 56 COMPLETED 0:0 9221947.ext+ extern wehi 56 COMPLETED 0:0 9221947.0 nix-user-+ wehi 56 COMPLETED 0:0 9221949 r-R.metho+ regular wehi 56 COMPLETED 0:0 9221949.ext+ extern wehi 56 COMPLETED 0:0 9221949.0 nix-user-+ wehi 56 COMPLETED 0:0 9221950 r-S4Vecto+ regular wehi 56 COMPLETED 0:0 9221950.ext+ extern wehi 56 COMPLETED 0:0 9221950.0 nix-user-+ wehi 56 COMPLETED 0:0 9221955 r-R.oo-1.+ regular wehi 56 COMPLETED 0:0 9221955.ext+ extern wehi 56 COMPLETED 0:0 9221955.0 nix-user-+ wehi 56 COMPLETED 0:0 9221956 r-BiocIO-+ regular wehi 56 COMPLETED 0:0 9221956.ext+ extern wehi 56 COMPLETED 0:0 9221956.0 nix-user-+ wehi 56 COMPLETED 0:0 9221957 r-IRanges+ regular wehi 56 COMPLETED 0:0 9221957.ext+ extern wehi 56 COMPLETED 0:0 9221957.0 nix-user-+ wehi 56 COMPLETED 0:0 9221958 r-R.utils+ regular wehi 56 COMPLETED 0:0 9221958.ext+ extern wehi 56 COMPLETED 0:0 9221958.0 nix-user-+ wehi 56 COMPLETED 0:0 9221964 r-XML-3.9+ regular wehi 56 COMPLETED 0:0 9221964.ext+ extern wehi 56 COMPLETED 0:0 9221964.0 nix-user-+ wehi 56 COMPLETED 0:0 9221970 r-bitops-+ regular wehi 56 COMPLETED 0:0 9221970.ext+ extern wehi 56 COMPLETED 0:0 9221970.0 nix-user-+ wehi 56 COMPLETED 0:0 9221972 r-formatR+ regular wehi 56 COMPLETED 0:0 9221972.ext+ extern wehi 56 COMPLETED 0:0 9221972.0 nix-user-+ wehi 56 COMPLETED 0:0 9221973 r-RCurl-1+ regular wehi 56 COMPLETED 0:0 9221973.ext+ extern wehi 56 COMPLETED 0:0 9221973.0 nix-user-+ wehi 56 COMPLETED 0:0 9221977 r-futile.+ regular wehi 56 COMPLETED 0:0 9221977.ext+ extern wehi 56 COMPLETED 0:0 9221977.0 nix-user-+ wehi 56 COMPLETED 0:0 9221978 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221978.ext+ 
extern wehi 56 COMPLETED 0:0 9221978.0 nix-user-+ wehi 56 COMPLETED 0:0 9221981 r-globals+ regular wehi 56 COMPLETED 0:0 9221981.ext+ extern wehi 56 COMPLETED 0:0 9221981.0 nix-user-+ wehi 56 COMPLETED 0:0 9221983 r-impute-+ regular wehi 56 COMPLETED 0:0 9221983.ext+ extern wehi 56 COMPLETED 0:0 9221983.0 nix-user-+ wehi 56 COMPLETED 0:0 9221985 r-lambda.+ regular wehi 56 COMPLETED 0:0 9221985.ext+ extern wehi 56 COMPLETED 0:0 9221985.0 nix-user-+ wehi 56 COMPLETED 0:0 9221986 r-limma-3+ regular wehi 56 COMPLETED 0:0 9221986.ext+ extern wehi 56 COMPLETED 0:0 9221986.0 nix-user-+ wehi 56 COMPLETED 0:0 9221989 r-futile.+ regular wehi 56 COMPLETED 0:0 9221989.ext+ extern wehi 56 COMPLETED 0:0 9221989.0 nix-user-+ wehi 56 COMPLETED 0:0 9221990 r-listenv+ regular wehi 56 COMPLETED 0:0 9221990.ext+ extern wehi 56 COMPLETED 0:0 9221990.0 nix-user-+ wehi 56 COMPLETED 0:0 9221992 r-marray-+ regular wehi 56 COMPLETED 0:0 9221992.ext+ extern wehi 56 COMPLETED 0:0 9221992.0 nix-user-+ wehi 56 COMPLETED 0:0 9221999 r-matrixS+ regular wehi 56 COMPLETED 0:0 9221999.ext+ extern wehi 56 COMPLETED 0:0 9221999.0 nix-user-+ wehi 56 COMPLETED 0:0 9222004 r-CGHbase+ regular wehi 56 COMPLETED 0:0 9222004.ext+ extern wehi 56 COMPLETED 0:0 9222004.0 nix-user-+ wehi 56 COMPLETED 0:0 9222005 r-MatrixG+ regular wehi 56 COMPLETED 0:0 9222005.ext+ extern wehi 56 COMPLETED 0:0 9222005.0 nix-user-+ wehi 56 COMPLETED 0:0 9222006 r-paralle+ regular wehi 56 COMPLETED 0:0 9222006.ext+ extern wehi 56 COMPLETED 0:0 9222006.0 nix-user-+ wehi 56 COMPLETED 0:0 9222007 r-Delayed+ regular wehi 56 COMPLETED 0:0 9222007.ext+ extern wehi 56 COMPLETED 0:0 9222007.0 nix-user-+ wehi 56 COMPLETED 0:0 9222009 r-future-+ regular wehi 56 COMPLETED 0:0 9222009.ext+ extern wehi 56 COMPLETED 0:0 9222009.0 nix-user-+ wehi 56 COMPLETED 0:0 9222010 restfulr_+ regular wehi 2 COMPLETED 0:0 9222010.ext+ extern wehi 2 COMPLETED 0:0 9222010.0 nix-user-+ wehi 2 COMPLETED 0:0 9222011 r-future.+ regular wehi 56 COMPLETED 0:0 9222011.ext+ extern wehi 56 COMPLETED 0:0 9222011.0 nix-user-+ wehi 56 COMPLETED 0:0 9222012 rjson_0.2+ regular wehi 2 COMPLETED 0:0 9222012.ext+ extern wehi 2 COMPLETED 0:0 9222012.0 nix-user-+ wehi 2 COMPLETED 0:0 9222014 rtracklay+ regular wehi 2 COMPLETED 0:0 9222014.ext+ extern wehi 2 COMPLETED 0:0 9222014.0 nix-user-+ wehi 2 COMPLETED 0:0 9222016 r-rjson-0+ regular wehi 56 COMPLETED 0:0 9222016.ext+ extern wehi 56 COMPLETED 0:0 9222016.0 nix-user-+ wehi 56 COMPLETED 0:0 9222019 snow_0.4-+ regular wehi 2 COMPLETED 0:0 9222019.ext+ extern wehi 2 COMPLETED 0:0 9222019.0 nix-user-+ wehi 2 COMPLETED 0:0 9222020 snowfall_+ regular wehi 2 COMPLETED 0:0 9222020.ext+ extern wehi 2 COMPLETED 0:0 9222020.0 nix-user-+ wehi 2 COMPLETED 0:0 9222022 r-snow-0.+ regular wehi 56 COMPLETED 0:0 9222022.ext+ extern wehi 56 COMPLETED 0:0 9222022.0 nix-user-+ wehi 56 COMPLETED 0:0 9222024 source regular wehi 2 COMPLETED 0:0 9222024.ext+ extern wehi 2 COMPLETED 0:0 9222024.0 nix-user-+ wehi 2 COMPLETED 0:0 9222183 r-BiocPar+ regular wehi 56 COMPLETED 0:0 9222183.ext+ extern wehi 56 COMPLETED 0:0 9222183.0 nix-user-+ wehi 56 COMPLETED 0:0 9222214 kent-404 regular wehi 56 COMPLETED 0:0 9222214.ext+ extern wehi 56 COMPLETED 0:0 9222214.0 nix-user-+ wehi 56 COMPLETED 0:0 9222256 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222256.ext+ extern wehi 6 COMPLETED 0:0 9222256.0 nix-user-+ wehi 6 COMPLETED 0:0 9222257 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222257.ext+ extern wehi 6 COMPLETED 0:0 9222257.0 nix-user-+ wehi 6 COMPLETED 0:0 9222258 nix-store+ regular 
wehi 6 COMPLETED 0:0 9222258.ext+ extern wehi 6 COMPLETED 0:0 9222258.0 nix-user-+ wehi 6 COMPLETED 0:0 9222259 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222259.ext+ extern wehi 6 COMPLETED 0:0 9222259.0 nix-user-+ wehi 6 COMPLETED 0:0 9222260 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222260.ext+ extern wehi 6 COMPLETED 0:0 9222260.0 nix-user-+ wehi 6 COMPLETED 0:0 9222261 libcap-st+ regular wehi 6 COMPLETED 0:0 9222261.ext+ extern wehi 6 COMPLETED 0:0 9222261.0 nix-user-+ wehi 6 COMPLETED 0:0 9222262 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222262.ext+ extern wehi 6 COMPLETED 0:0 9222262.0 nix-user-+ wehi 6 COMPLETED 0:0 9222264 bubblewra+ regular wehi 6 COMPLETED 0:0 9222264.ext+ extern wehi 6 COMPLETED 0:0 9222264.0 nix-user-+ wehi 6 COMPLETED 0:0 9222265 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222265.ext+ extern wehi 6 COMPLETED 0:0 9222265.0 nix-user-+ wehi 6 COMPLETED 0:0 9222271 r-snowfal+ regular wehi 56 COMPLETED 0:0 9222271.ext+ extern wehi 56 COMPLETED 0:0 9222271.0 nix-user-+ wehi 56 COMPLETED 0:0 9222331 splitFA regular wehi 56 COMPLETED 0:0 9222331.ext+ extern wehi 56 COMPLETED 0:0 9222331.0 nix-user-+ wehi 56 COMPLETED 0:0 9222334 r-CGHcall+ regular wehi 56 COMPLETED 0:0 9222334.ext+ extern wehi 56 COMPLETED 0:0 9222334.0 nix-user-+ wehi 56 COMPLETED 0:0 9222341 seed.txt regular wehi 56 COMPLETED 0:0 9222341.ext+ extern wehi 56 COMPLETED 0:0 9222341.0 nix-user-+ wehi 56 COMPLETED 0:0 9222374 strip-sto+ regular wehi 56 COMPLETED 0:0 9222374.ext+ extern wehi 56 COMPLETED 0:0 9222374.0 nix-user-+ wehi 56 COMPLETED 0:0 9222400 forgeBSge+ regular wehi 56 COMPLETED 0:0 9222400.ext+ extern wehi 56 COMPLETED 0:0 9222400.0 nix-user-+ wehi 56 COMPLETED 0:0 9222431 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222431.ext+ extern wehi 6 COMPLETED 0:0 9222431.0 nix-user-+ wehi 6 COMPLETED 0:0 9222486 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222486.ext+ extern wehi 6 COMPLETED 0:0 9222486.0 nix-user-+ wehi 6 COMPLETED 0:0 9222654 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222654.ext+ extern wehi 6 COMPLETED 0:0 9222654.0 nix-user-+ wehi 6 COMPLETED 0:0 9222700 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222700.ext+ extern wehi 6 COMPLETED 0:0 9222700.0 nix-user-+ wehi 6 COMPLETED 0:0 9222701 build-bun+ regular wehi 6 COMPLETED 0:0 9222701.ext+ extern wehi 6 COMPLETED 0:0 9222701.0 nix-user-+ wehi 6 COMPLETED 0:0 9222702 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222702.ext+ extern wehi 6 COMPLETED 0:0 9222702.0 nix-user-+ wehi 6 COMPLETED 0:0 9222704 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222704.ext+ extern wehi 6 COMPLETED 0:0 9222704.0 nix-user-+ wehi 6 COMPLETED 0:0 9222705 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222705.ext+ extern wehi 6 COMPLETED 0:0 9222705.0 nix-user-+ wehi 6 COMPLETED 0:0 9222706 nix-store+ regular wehi 6 COMPLETED 0:0 9222706.ext+ extern wehi 6 COMPLETED 0:0 9222706.0 nix-user-+ wehi 6 COMPLETED 0:0 9222707 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222707.ext+ extern wehi 6 COMPLETED 0:0 9222707.0 nix-user-+ wehi 6 COMPLETED 0:0 9222708 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222708.ext+ extern wehi 6 COMPLETED 0:0 9222708.0 nix-user-+ wehi 6 COMPLETED 0:0 9222709 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222709.ext+ extern wehi 6 COMPLETED 0:0 9222709.0 nix-user-+ wehi 6 COMPLETED 0:0 9222711 build-bun+ regular wehi 6 COMPLETED 0:0 9222711.ext+ extern wehi 6 COMPLETED 0:0 9222711.0 nix-user-+ wehi 6 COMPLETED 0:0 9222712 dorado-A1+ gpuq_large wehi 4 FAILED 127:0 9222712.bat+ batch wehi 4 FAILED 127:0 9222712.ext+ extern wehi 4 COMPLETED 0:0 9222712.0 time wehi 4 FAILED 
127:0 9222714 dorado-A1+ gpuq_large wehi 4 COMPLETED 0:0 9222714.bat+ batch wehi 4 COMPLETED 0:0 9222714.ext+ extern wehi 4 COMPLETED 0:0 9222714.0 time wehi 4 COMPLETED 0:0 9222719 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222719.ext+ extern wehi 6 COMPLETED 0:0 9222719.0 nix-user-+ wehi 6 COMPLETED 0:0 9222723 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222723.ext+ extern wehi 6 COMPLETED 0:0 9222723.0 nix-user-+ wehi 6 COMPLETED 0:0 9222724 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222724.ext+ extern wehi 6 COMPLETED 0:0 9222724.0 nix-user-+ wehi 6 COMPLETED 0:0 9222725 nix-store+ regular wehi 6 COMPLETED 0:0 9222725.ext+ extern wehi 6 COMPLETED 0:0 9222725.0 nix-user-+ wehi 6 COMPLETED 0:0 9222726 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222726.ext+ extern wehi 6 COMPLETED 0:0 9222726.0 nix-user-+ wehi 6 COMPLETED 0:0 9222727 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222727.ext+ extern wehi 6 COMPLETED 0:0 9222727.0 nix-user-+ wehi 6 COMPLETED 0:0 9222728 libcap-st+ regular wehi 6 COMPLETED 0:0 9222728.ext+ extern wehi 6 COMPLETED 0:0 9222728.0 nix-user-+ wehi 6 COMPLETED 0:0 9222729 nix-2.5pr+ regular wehi 6 FAILED 2:0 9222729.ext+ extern wehi 6 COMPLETED 0:0 9222729.0 nix-user-+ wehi 6 FAILED 2:0 9222730 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222730.ext+ extern wehi 6 COMPLETED 0:0 9222730.0 nix-user-+ wehi 6 COMPLETED 0:0 9222733 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222733.ext+ extern wehi 6 COMPLETED 0:0 9222733.0 nix-user-+ wehi 6 COMPLETED 0:0 9222734 bubblewra+ regular wehi 6 COMPLETED 0:0 9222734.ext+ extern wehi 6 COMPLETED 0:0 9222734.0 nix-user-+ wehi 6 COMPLETED 0:0 9222751 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222751.ext+ extern wehi 6 COMPLETED 0:0 9222751.0 nix-user-+ wehi 6 COMPLETED 0:0 9222763 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222763.ext+ extern wehi 6 COMPLETED 0:0 9222763.0 nix-user-+ wehi 6 COMPLETED 0:0 9222764 interacti+ regular wehi 2 COMPLETED 0:0 9222764.int+ interacti+ wehi 2 COMPLETED 0:0 9222764.ext+ extern wehi 2 COMPLETED 0:0 9222764.0 echo wehi 2 COMPLETED 0:0 9222764.1 echo wehi 2 COMPLETED 0:0 9222764.2 echo wehi 2 COMPLETED 0:0 9222764.3 echo wehi 2 COMPLETED 0:0 9222787 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222787.ext+ extern wehi 6 COMPLETED 0:0 9222787.0 nix-user-+ wehi 6 COMPLETED 0:0 9222790 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222790.ext+ extern wehi 6 COMPLETED 0:0 9222790.0 nix-user-+ wehi 6 COMPLETED 0:0 9222796 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222796.ext+ extern wehi 2 COMPLETED 0:0 9222796.0 nix-user-+ wehi 2 COMPLETED 0:0 9222797 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222797.ext+ extern wehi 2 COMPLETED 0:0 9222797.0 nix-user-+ wehi 2 COMPLETED 0:0 9222798 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222798.ext+ extern wehi 2 COMPLETED 0:0 9222798.0 nix-user-+ wehi 2 COMPLETED 0:0 9222799 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222799.ext+ extern wehi 6 COMPLETED 0:0 9222799.0 nix-user-+ wehi 6 COMPLETED 0:0 9222800 build-bun+ regular wehi 6 COMPLETED 0:0 9222800.ext+ extern wehi 6 COMPLETED 0:0 9222800.0 nix-user-+ wehi 6 COMPLETED 0:0 9222812 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222812.ext+ extern wehi 2 COMPLETED 0:0 9222812.0 nix-user-+ wehi 2 COMPLETED 0:0 9222813 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222813.ext+ extern wehi 6 COMPLETED 0:0 9222813.0 nix-user-+ wehi 6 COMPLETED 0:0 9222816 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222816.ext+ extern wehi 6 COMPLETED 0:0 9222816.0 nix-user-+ wehi 6 COMPLETED 0:0 9222817 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222817.ext+ extern wehi 2 COMPLETED 0:0 
9222817.0 nix-user-+ wehi 2 COMPLETED 0:0 9222820 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222820.ext+ extern wehi 2 COMPLETED 0:0 9222820.0 nix-user-+ wehi 2 COMPLETED 0:0 9222822 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222822.ext+ extern wehi 6 COMPLETED 0:0 9222822.0 nix-user-+ wehi 6 COMPLETED 0:0 9222826 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222826.ext+ extern wehi 6 COMPLETED 0:0 9222826.0 nix-user-+ wehi 6 COMPLETED 0:0 9222843 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222843.ext+ extern wehi 2 COMPLETED 0:0 9222843.0 nix-user-+ wehi 2 COMPLETED 0:0 9222845 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222845.ext+ extern wehi 2 COMPLETED 0:0 9222845.0 nix-user-+ wehi 2 COMPLETED 0:0 9222856 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222856.ext+ extern wehi 6 COMPLETED 0:0 9222856.0 nix-user-+ wehi 6 COMPLETED 0:0 9222860 interacti+ regular wehi 4 COMPLETED 0:0 9222860.int+ interacti+ wehi 4 COMPLETED 0:0 9222860.ext+ extern wehi 4 COMPLETED 0:0 9222861 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222861.ext+ extern wehi 6 COMPLETED 0:0 9222861.0 nix-user-+ wehi 6 COMPLETED 0:0 9222862 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222862.ext+ extern wehi 2 COMPLETED 0:0 9222862.0 nix-user-+ wehi 2 COMPLETED 0:0 9222864 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222864.ext+ extern wehi 2 COMPLETED 0:0 9222864.0 nix-user-+ wehi 2 COMPLETED 0:0 9222865 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222865.ext+ extern wehi 6 COMPLETED 0:0 9222865.0 nix-user-+ wehi 6 COMPLETED 0:0 9222872 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222872.ext+ extern wehi 6 COMPLETED 0:0 9222872.0 nix-user-+ wehi 6 COMPLETED 0:0 9222875 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222875.ext+ extern wehi 2 COMPLETED 0:0 9222875.0 nix-user-+ wehi 2 COMPLETED 0:0 9223289 bionix-sa+ regular wehi 2 COMPLETED 0:0 9223289.ext+ extern wehi 2 COMPLETED 0:0 9223289.0 nix-user-+ wehi 2 COMPLETED 0:0 9225143 bionix-sa+ regular wehi 6 COMPLETED 0:0 9225143.ext+ extern wehi 6 COMPLETED 0:0 9225143.0 nix-user-+ wehi 6 COMPLETED 0:0 9229643 bionix-wi+ regular wehi 2 COMPLETED 0:0 9229643.ext+ extern wehi 2 COMPLETED 0:0 9229643.0 nix-user-+ wehi 2 COMPLETED 0:0 9229733 bionix-sa+ regular wehi 2 COMPLETED 0:0 9229733.ext+ extern wehi 2 COMPLETED 0:0 9229733.0 nix-user-+ wehi 2 COMPLETED 0:0 9229738 bin.R regular wehi 56 PENDING 0:0
By default, sacct returns jobs that have run today.
Choose the time-window with -S <date-time> -E <date-time>
Date-times are specified as YYYY-MM-DD[Thh:mm:ss]; partial dates like sacct -S 2022-11 are acceptable too.
-S: start date-time
-E: end date-time
Note: big/frequent sacct queries can occupy and eventually overload the Slurm controller node.
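For example, a minimal query restricted to a time window (the dates here are illustrative):

```bash
# jobs that ran between 1 Nov and 8 Nov 2022
sacct -S 2022-11-01 -E 2022-11-08
```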
sacct behaviour can be augmented by --format. See man sacct for more details.
-X can be used to group job steps together, but this prevents some statistics like IO and memory from being reported.
sacct -X
JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 9041674 crest0.3 regular wehi 224 CANCELLED+ 0:0 9170758 gatk-4.2.+ regular wehi 56 COMPLETED 0:0 9221903 impute_1.+ regular wehi 2 COMPLETED 0:0 9221905 lambda.r_+ regular wehi 2 COMPLETED 0:0 9221907 limma_3.5+ regular wehi 2 COMPLETED 0:0 9221909 listenv_0+ regular wehi 2 COMPLETED 0:0 9221910 marray_1.+ regular wehi 2 COMPLETED 0:0 9221911 matrixSta+ regular wehi 2 COMPLETED 0:0 9221912 parallell+ regular wehi 2 COMPLETED 0:0 9221913 r-BH-1.78+ regular wehi 56 COMPLETED 0:0 9221930 sys/dashb+ interacti+ wehi 2 RUNNING 0:0 9221945 r-BiocGen+ regular wehi 56 COMPLETED 0:0 9221946 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221947 r-Biobase+ regular wehi 56 COMPLETED 0:0 9221949 r-R.metho+ regular wehi 56 COMPLETED 0:0 9221950 r-S4Vecto+ regular wehi 56 COMPLETED 0:0 9221955 r-R.oo-1.+ regular wehi 56 COMPLETED 0:0 9221956 r-BiocIO-+ regular wehi 56 COMPLETED 0:0 9221957 r-IRanges+ regular wehi 56 COMPLETED 0:0 9221958 r-R.utils+ regular wehi 56 COMPLETED 0:0 9221964 r-XML-3.9+ regular wehi 56 COMPLETED 0:0 9221970 r-bitops-+ regular wehi 56 COMPLETED 0:0 9221972 r-formatR+ regular wehi 56 COMPLETED 0:0 9221973 r-RCurl-1+ regular wehi 56 COMPLETED 0:0 9221977 r-futile.+ regular wehi 56 COMPLETED 0:0 9221978 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221981 r-globals+ regular wehi 56 COMPLETED 0:0 9221983 r-impute-+ regular wehi 56 COMPLETED 0:0 9221985 r-lambda.+ regular wehi 56 COMPLETED 0:0 9221986 r-limma-3+ regular wehi 56 COMPLETED 0:0 9221989 r-futile.+ regular wehi 56 COMPLETED 0:0 9221990 r-listenv+ regular wehi 56 COMPLETED 0:0 9221992 r-marray-+ regular wehi 56 COMPLETED 0:0 9221999 r-matrixS+ regular wehi 56 COMPLETED 0:0 9222004 r-CGHbase+ regular wehi 56 COMPLETED 0:0 9222005 r-MatrixG+ regular wehi 56 COMPLETED 0:0 9222006 r-paralle+ regular wehi 56 COMPLETED 0:0 9222007 r-Delayed+ regular wehi 56 COMPLETED 0:0 9222009 r-future-+ regular wehi 56 COMPLETED 0:0 9222010 restfulr_+ regular wehi 2 COMPLETED 0:0 9222011 r-future.+ regular wehi 56 COMPLETED 0:0 9222012 rjson_0.2+ regular wehi 2 COMPLETED 0:0 9222014 rtracklay+ regular wehi 2 COMPLETED 0:0 9222016 r-rjson-0+ regular wehi 56 COMPLETED 0:0 9222019 snow_0.4-+ regular wehi 2 COMPLETED 0:0 9222020 snowfall_+ regular wehi 2 COMPLETED 0:0 9222022 r-snow-0.+ regular wehi 56 COMPLETED 0:0 9222024 source regular wehi 2 COMPLETED 0:0 9222183 r-BiocPar+ regular wehi 56 COMPLETED 0:0 9222214 kent-404 regular wehi 56 COMPLETED 0:0 9222256 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222257 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222258 nix-store+ regular wehi 6 COMPLETED 0:0 9222259 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222260 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222261 libcap-st+ regular wehi 6 COMPLETED 0:0 9222262 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222264 bubblewra+ regular wehi 6 COMPLETED 0:0 9222265 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222271 r-snowfal+ regular wehi 56 COMPLETED 0:0 9222331 splitFA regular wehi 56 COMPLETED 0:0 9222334 r-CGHcall+ regular wehi 56 COMPLETED 0:0 9222341 seed.txt regular wehi 56 COMPLETED 0:0 9222374 strip-sto+ regular wehi 56 COMPLETED 0:0 9222400 forgeBSge+ regular wehi 56 COMPLETED 0:0 9222431 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222486 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222654 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222700 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222701 build-bun+ regular wehi 6 COMPLETED 0:0 9222702 bionix-bw+ regular 
wehi 6 COMPLETED 0:0 9222704 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222705 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222706 nix-store+ regular wehi 6 COMPLETED 0:0 9222707 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222708 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222709 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222711 build-bun+ regular wehi 6 COMPLETED 0:0 9222712 dorado-A1+ gpuq_large wehi 4 FAILED 127:0 9222714 dorado-A1+ gpuq_large wehi 4 COMPLETED 0:0 9222719 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222723 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222724 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222725 nix-store+ regular wehi 6 COMPLETED 0:0 9222726 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222727 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222728 libcap-st+ regular wehi 6 COMPLETED 0:0 9222729 nix-2.5pr+ regular wehi 6 FAILED 2:0 9222730 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222733 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222734 bubblewra+ regular wehi 6 COMPLETED 0:0 9222751 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222763 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222764 interacti+ regular wehi 2 COMPLETED 0:0 9222787 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222790 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222796 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222797 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222798 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222799 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222800 build-bun+ regular wehi 6 COMPLETED 0:0 9222812 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222813 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222816 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222817 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222820 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222822 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222826 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222843 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222845 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222856 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222860 interacti+ regular wehi 4 COMPLETED 0:0 9222861 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222862 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222864 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222865 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222872 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222875 bionix-sa+ regular wehi 2 COMPLETED 0:0 9223289 bionix-sa+ regular wehi 2 COMPLETED 0:0 9225143 bionix-sa+ regular wehi 6 COMPLETED 0:0 9229643 bionix-wi+ regular wehi 2 COMPLETED 0:0 9229733 bionix-sa+ regular wehi 2 COMPLETED 0:0 9229738 bin.R regular wehi 56 CANCELLED+ 0:0 9242414 targets.i+ regular wehi 56 COMPLETED 0:0 9242415 bin.R regular wehi 56 COMPLETED 0:0 9242418 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242419 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242420 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242424 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9242425 chroot-wr+ regular wehi 6 COMPLETED 0:0 9242426 nix-store+ regular wehi 6 COMPLETED 0:0 9242427 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9242428 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9242429 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242430 slurm-nix+ regular wehi 6 COMPLETED 0:0 9242431 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242432 build-bun+ regular wehi 6 COMPLETED 0:0 9242433 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242434 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242435 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242436 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242438 bionix-In+ regular wehi 2 COMPLETED 0:0 9242439 bionix-In+ regular wehi 2 COMPLETED 0:0 9242440 bionix-In+ regular wehi 2 COMPLETED 0:0 9242443 bionix-In+ regular wehi 2 
COMPLETED 0:0 9242445 bionix-In+ regular wehi 2 COMPLETED 0:0 9242449 bionix-In+ regular wehi 2 COMPLETED 0:0 9242473 bionix-In+ regular wehi 2 COMPLETED 0:0 9242480 bionix-In+ regular wehi 2 COMPLETED 0:0 9242482 bionix-In+ regular wehi 2 COMPLETED 0:0 9242489 yaml_2.2.+ regular wehi 2 COMPLETED 0:0 9242493 bionix-Co+ regular wehi 2 COMPLETED 0:0 9242495 r-yaml-2.+ regular wehi 56 COMPLETED 0:0 9242496 bionix-In+ regular wehi 2 COMPLETED 0:0 9242497 r-restful+ regular wehi 56 COMPLETED 0:0 9242498 bionix-Ge+ regular wehi 2 COMPLETED 0:0 9242499 zlibbioc_+ regular wehi 2 COMPLETED 0:0 9242505 r-zlibbio+ regular wehi 56 COMPLETED 0:0 9242506 r-Rhtslib+ regular wehi 56 COMPLETED 0:0 9242508 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9242509 chroot-wr+ regular wehi 6 COMPLETED 0:0 9242510 nix-store+ regular wehi 6 COMPLETED 0:0 9242511 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9242512 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9242513 libcap-st+ regular wehi 6 COMPLETED 0:0 9242514 nix-2.5pr+ regular wehi 6 RUNNING 0:0 9242517 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9242524 r-XVector+ regular wehi 56 COMPLETED 0:0 9242525 bubblewra+ regular wehi 6 COMPLETED 0:0 9242527 r-Biostri+ regular wehi 56 COMPLETED 0:0 9242535 r-Genomic+ regular wehi 56 COMPLETED 0:0 9242536 r-Rsamtoo+ regular wehi 56 COMPLETED 0:0 9242546 r-Summari+ regular wehi 56 COMPLETED 0:0 9242553 r-QDNAseq+ regular wehi 56 COMPLETED 0:0 9244975 r-Genomic+ regular wehi 56 COMPLETED 0:0 9247737 r-rtrackl+ regular wehi 56 COMPLETED 0:0 9247869 r-BSgenom+ regular wehi 56 COMPLETED 0:0 9247873 R-4.1.2-w+ regular wehi 56 PENDING 0:0
Being able to understand the state of the cluster can help you understand why your job might be waiting.
Or, you can use the information to your advantage to reduce wait times.
To view the state of the cluster, we're going to use the sinfo command.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST interactive up 1-00:00:00 4 mix med-n03,sml-n[01-03] interactive up 1-00:00:00 1 alloc med-n02 interactive up 1-00:00:00 1 idle med-n01 regular* up 2-00:00:00 42 mix lrg-n[02-03],med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24] regular* up 2-00:00:00 13 alloc lrg-n04,med-n[02,06,10-11,14-17,19,24,28],sml-n21 long up 14-00:00:0 40 mix med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24] long up 14-00:00:0 12 alloc med-n[02,06,10-11,14-17,19,24,28],sml-n21 bigmem up 2-00:00:00 3 mix lrg-n02,med-n[03-04] bigmem up 2-00:00:00 1 alloc med-n02 bigmem up 2-00:00:00 1 idle lrg-n01 gpuq up 2-00:00:00 1 mix gpu-p100-n01 gpuq up 2-00:00:00 11 idle gpu-a30-n[01-07],gpu-p100-n[02-05] gpuq_interactive up 12:00:00 1 mix gpu-a10-n01 gpuq_large up 2-00:00:00 3 idle gpu-a100-n[01-03]
-N orders information by nodes
sinfo -N | head -n 5
NODELIST NODES PARTITION STATE gpu-a10-n01 1 gpuq_interactive mix gpu-a30-n01 1 gpuq idle gpu-a30-n02 1 gpuq idle gpu-a30-n03 1 gpuq idle
We can add detail with formatting options as well.
| CPU | memory | gres (GPU) | node state | time |
|---|---|---|---|---|
| CPUsState | FreeMem | GresUsed | StateCompact | Time |
| | AllocMem | Gres | | |
| | Memory | | | |
sinfo -NO nodelist:11' ',partition:10' ',cpusstate:13' ',freemem:8' ',memory:8' ',gresused,gres:11,statecompact:8,time | head -n 5
NODELIST PARTITION CPUS(A/I/O/T) FREE_MEM MEMORY GRES_USED GRES STATE TIMELIMIT gpu-a10-n01 gpuq_inter 0/48/0/48 163914 257417 gpu:A10:0(IDX:N/A) gpu:A10:4 idle 12:00:00 gpu-a30-n01 gpuq 0/96/0/96 450325 511362 gpu:A30:0(IDX:N/A) gpu:A30:4 idle 2-00:00:00 gpu-a30-n02 gpuq 0/96/0/96 436435 511362 gpu:A30:0(IDX:N/A) gpu:A30:4 idle 2-00:00:00 gpu-a30-n03 gpuq 0/96/0/96 497816 511362 gpu:A30:0(IDX:N/A) gpu:A30:4 idle 2-00:00:00
Using command-line tools to obtain visibility into how your job is performing.
This section will look at using command-line tools to obtain visibility into how your job is performing.
| type of data | Live | Historical |
|---|---|---|
| good for | debugging, evaluating utilization | debugging, profiling |
| drawbacks | uses system tools, so requires some system understanding | only provides data once jobs are completed |
We will look at:
htop for live process activity on nodes
nvidia-smi and nvtop for live GPU activity on nodes
seff for historical job CPU and memory usage data
dcgmstats for historical job GPU usage data
sacct for historical job data
Slurm can't provide accurate "live" data about jobs' activities.
System tools must be used instead.
This requires matching jobs to processes on a node with squeue and ssh.

htop is a utility often installed on HPC clusters for monitoring processes.
It can be used to look at the CPU, memory, and IO utilization of a running process.
It's not a Slurm tool, but is nevertheless very useful in monitoring jobs' activity and diagnosing issues.
To show only your processes, execute htop -u $USER
htop shows the individual CPU core utilization on the top, followed by memory utilization and some misc. information.
The bottom panel shows the process information
Relevant Headings:
USER: user that owns the process
PID: process ID
%CPU: % of a single core that a process is using e.g. 400% means the process is using 4 cores
%MEM: % of the node's total RAM that the process is using
VSZ: "virtual" memory (bytes) - the memory a process "thinks" it's using
RSS: "resident" memory (bytes) - the actual physical memory a process is using
S: "state" of the process
  D: "uninterruptible" sleep - waiting for something else, often IO
  R: running
  S: sleeping
  T: "traced" or stopped e.g. by a debugger or manually i.e., paused
  Z: "zombie" - process has completed and is waiting to be cleaned up
Press F5 to toggle the tree view, which shows parent/child relationships between processes.
You can add IO information by:
pressing F2 (Setup) and going to the Columns screen
selecting IO_READ_RATE and pressing enter
selecting IO_WRITE_RATE and pressing enter
pressing F10 to exit.
You should now be able to see read/write rates for processes that you have permissions for.
Tips:
htop configurations are saved in ~/.config/htop. Delete this folder to reset your htop configuration.
ps and pidstat are useful alternatives which can be incorporated into scripts.
Some systems may not have htop installed, in which case top can be used instead.
To monitor activity of Milton's NVIDIA GPUs, we must rely on NVIDIA's nvidia-smi tool.
nvidia-smi shows information about the memory and compute utilization, process allocation and other details.
nvtop is a command also available on Milton GPU nodes. It works similarly to htop.
Note that nvtop is a third-party tool and is less common, whereas nvidia-smi will always be available wherever NVIDIA GPUs are used.
Like htop, nvidia-smi and nvtop only provide information on processes running on a GPU. If your job is occupying an entire node and all its GPUs, it should be straightforward to determine which GPUs you've been allocated.
But if your job is sharing a node with other jobs, you might not know straight away which GPU your job has been allocated. You can determine this by
using squeue with extra formatting options, as discussed previously.
Note:
This tool is available only on GPU nodes where the CUDA drivers are installed, so you must ssh to a gpu node to try it.
Tip: Combine nvidia-smi with watch to automatically update the output.
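For example (the refresh interval is arbitrary):

```bash
# re-run nvidia-smi every 2 seconds; press Ctrl-C to stop
watch -n 2 nvidia-smi
```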
Slurm tools and plugins are generally easier to use because they provide information on a per-job basis, meaning there's no need to match processes with jobs like previously discussed.
Tips: generally, results are more reliable when executing commands with srun.
The seff command summarizes memory and CPU utilization of a job.
It's mainly useful for job steps that have ended.
seff 8665813
Job ID: 8665813 Cluster: milton User/Group: yang.e/allstaff State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 4 CPU Utilized: 00:09:04 CPU Efficiency: 99.27% of 00:09:08 core-walltime Job Wall-clock time: 00:02:17 Memory Utilized: 1.95 GB (estimated maximum) Memory Efficiency: 48.83% of 4.00 GB (1.00 GB/core)
Note: seff results are not as useful for jobs that have failed or been cancelled.
In addition to general job information, sacct can be used to retrieve IO and memory data about past jobs.
Like squeue, the default output is limited, but it can be augmented with the --format option.
The following sacct command shows your job data for jobs since 1st Nov:
Note that the IO and memory values shown will be for the highest use task.
sacct -S 2022-11-01 -o jobid%14' ',jobname,ncpus%5' ',nodelist,elapsed,state,maxdiskread,maxdiskwrite,maxvmsize,maxrss | head -n5
JobID JobName NCPUS NodeList Elapsed State MaxDiskRead MaxDiskWrite MaxVMSize MaxRSS
-------------- ---------- ----- --------------- ---------- ---------- ------------ ------------ ---------- ----------
8664599 sys/dashb+ 2 sml-n01 1-00:00:22 TIMEOUT
8664599.batch batch 2 sml-n01 1-00:00:23 CANCELLED 102.64M 15.11M 1760920K 99812K
8664599.extern extern 2 sml-n01 1-00:00:22 COMPLETED 0.00M 0 146612K 68K
Slurm breaks jobs into steps. Jobs will have steps:
extern: work done that is not part of the job itself i.e. overhead
<index>: work executed with srun
batch: work inside an sbatch script, but not executed by srun
interactive: work done inside an interactive salloc session, but not executed by srun
By default, Slurm doesn't have the ability to produce stats on GPU usage.
WEHI's ITS have implemented the dcgmstats NVIDIA Slurm plugin which can produce these summary stats.
To use this plugin, pass the --comment=dcgmstats option to srun, salloc, or sbatch.
If your job requested at least one GPU, an extra output file will be generated in the working directory called dcgm-stats-<jobid>.out. The output file will contain a table for each GPU requested by the job.
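For example, a sketch of enabling the plugin for a GPU job (the script name is a placeholder):

```bash
# request one GPU and enable per-job GPU stats via the dcgmstats plugin
sbatch --gres=gpu:1 --comment=dcgmstats my-gpu-job.sh

# after the job finishes, the stats land in the working directory as
#   dcgm-stats-<jobid>.out
```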
htop for CPU, memory, and IO data (requires configuration)
nvidia-smi for GPU activity
seff command for simple CPU and memory utilization data for one job
sacct command for memory and IO data for multiple past jobs
dcgmstats Slurm plugin for GPU stats for a single Slurm job
Taking advantage of lesser-known options and environment features to make life easier
This section will look at:
stdout and stderr files
sbatch without a script
We're going to start with our simple R script, submitted by a wrapper sbatch script:
demo-scripts/matmul.rscript
demo-scripts/submit-matmul.sh
## matmul.rscript
# multiplies two matrices together and prints how long it takes.
print("starting the matmul R script!")
nrows = 1e3
paste0("elem: ", nrows, "*", nrows, " = ", nrows*nrows)
# generating matrices
M <- matrix(rnorm(nrows*nrows),nrow=nrows)
N <- matrix(rnorm(nrows*nrows),nrow=nrows)
# start matmul
start.time <- Sys.time()
invisible(M %*% N)
end.time <- Sys.time()
# Getting final time and writing to stdout
elapsed.time <- difftime(time1=end.time, time2=start.time, units="secs")
print(elapsed.time)
#!/bin/bash
## submit-matmul.sh
# Example sbatch script executing an R script that does a matmul
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
# loading module for R
module load R/openBLAS/4.2.1
Rscript matmul.rscript
Getting notifications about the status of your Slurm jobs removes the need to ssh onto Milton and run squeue to get the status of your jobs.
Instead, it will notify you when your job state has changed e.g. when it has started or ended.
To enable this behaviour, add the following options to your job scripts:
--mail-user=me@gmail.com
--mail-type=ALL
This sends emails to me@gmail.com when the job state changes.
If you only want to know when your job goes through certain states, e.g. if it fails or is pre-empted but not when it starts or finishes, pass a comma-separated list of states to --mail-type.
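A minimal sketch, assuming you only care about failure and preemption/requeue events (valid --mail-type values include BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, and ALL):

```bash
#SBATCH --mail-user=me@gmail.com
#SBATCH --mail-type=FAIL,REQUEUE
```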
Exercise: add the --mail-user and --mail-type options to the submit-matmul.sh script
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev1 - email notifications
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mail-user=yang.e@wehi.edu.au
#SBATCH --mail-type=ALL
# loading module for R
module load R/openBLAS/4.2.1
Rscript matmul.rscript
sbatch without a script
In some cases, one may wish to submit singular commands to the scheduler. srun and salloc can do this, but they need a terminal attached i.e., if you close your terminal with the srun or salloc session, then the job fails.
The sbatch --wrap option allows you to submit a singular command instead of an entire script.
This can be useful for testing, or implementing sbatch inside a script that manages your workflow.
Note that sbatch --wrap wraps the command string in a simple "sh" shell script, run with your submission environment.
The --wrap option could replace submit-matmul.sh by:
sbatch --ntasks=1 --cpus-per-task=2 --mem=8G --wrap="module load R/openBLAS/4.2.1; Rscript matmul.rscript"
stdout and stderr
Linux has two main "channels" to send output messages to. One is "stdout" (standard out), and the other is "stderr" (standard error).
If you have ever used the |, > or >> shell scripting features, then you've redirected stdout somewhere else e.g., to another command, a file, or the void (/dev/null).
$ ls dir-that-doesnt-exist
ls: cannot access dir-that-doesnt-exist: No such file or directory # this is a stderr output
$ ls ~
bin cache Desktop Downloads ... # this is a stdout output!
stderr and stdout
By default:
stdout is directed to slurm-<jobid>.out in the job's working directory
stderr is directed to wherever stdout is directed to
Redirect stderr and stdout with the --error and --output options. They work with both relative and absolute paths, e.g.
--error=/dev/null
--output=path/to/output.out
where paths are resolved relative to the job's working directory.
Variables can be used, like:
%j: job ID
%x: job name
%u: username
%t: task ID i.e., separate file per task
%N: node name i.e., separate file per node in the job
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev2 - added --output and --error options
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mail-user=yang.e@wehi.edu.au
#SBATCH --mail-type=ALL
#SBATCH --output=logs/matmul-%j.out
#SBATCH --error=logs-debug/matmul-%j.err
# loading module for R
module load R/openBLAS/4.2.1
Rscript matmul.rscript
Slurm allows for submitted jobs to wait for another job to start or finish before beginning. While probably not as effective as workflow managers like Nextflow, Slurm's job dependencies can still be useful for simple workflows.
Make a job dependent on another by passing the --dependency option with one of the following values:
afterok:jobid1:jobid2... waits for jobid1, jobid2, ... to complete successfully
afternotok:jobid1:... waits for the listed jobs to fail, time out, or be cancelled
afterany:jobid1:... waits for the listed jobs to finish (fail, complete, or be cancelled)
after:jobid1:... waits for the listed jobs to start or be cancelled
e.g. --dependency=afterok:12345678 will make the job wait for job 12345678 to complete successfully before starting.
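A minimal two-job sketch (stepA.sh and stepB.sh are placeholder scripts):

```bash
# submit the first job; --parsable makes sbatch print just the job ID
jobA=$(sbatch --parsable stepA.sh)

# submit the second job, which starts only if the first completes successfully
sbatch --dependency=afterok:${jobA} stepB.sh
```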
Recursive jobs are one way to work with short QOS time limits.
Multiple Slurm jobs are submitted with a sequential dependency pattern, i.e., the second job depends on the first, the third job depends on the second, and so on...
Slurm script:
cat demo-scripts/restartable-job.sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1
sleep 10
# cell to run recursive script
SCRIPT=demo-scripts/recursive-job.sh
# Initiate the loop
prereq_jobid=$(sbatch --parsable $SCRIPT)
echo $prereq_jobid
# Create 5 more dependent jobs with a loop
for i in {1..5}; do
prereq_jobid=$(sbatch --parsable --dependency=afterany:$prereq_jobid $SCRIPT)
echo $prereq_jobid
done
squeue -u $USER
8703619
8703620
8703621
8703622
8703623
8703624
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8703594 gpuq_larg interact yang.e R 21:18 1 gpu-a100-n01
8701908 interacti sys/dash yang.e R 2:07:28 1 sml-n01
8703624 regular test-rec yang.e PD 0:00 1 (Dependency)
8703623 regular test-rec yang.e PD 0:00 1 (Dependency)
8703622 regular test-rec yang.e PD 0:00 1 (Dependency)
8703621 regular test-rec yang.e PD 0:00 1 (Dependency)
8703620 regular test-rec yang.e PD 0:00 1 (Dependency)
8703619 regular test-rec yang.e R 0:00 1 sml-n05
8703616 regular test-rec yang.e R 0:22 1 sml-n02
prereq_jobid=$(sbatch --parsable $SCRIPT)
  the --parsable option makes sbatch print just the job id, which is stored in the prereq_jobid variable
for i in {1..5}; do
  a for loop that loops through 1 to 5, where i is the looping variable
prereq_jobid=$(sbatch --parsable --dependency=afterany:${prereq_jobid} $SCRIPT)
  the --dependency=afterany:${prereq_jobid} option links the jobs (afterany may be preferred instead of afterok)
  the new job id overwrites the prereq_jobid variable, so the next iteration depends on this job
Instead of submitting all the jobs ahead of time, you can have a single Slurm script that submits itself until all the work is done (or it fails).
#!/bin/bash
## recursive-job.sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4G
#SBATCH --time=2-
#SBATCH --output=output-%j.log
#SBATCH --error=output-%j.log
#SBATCH --mail-user=me.m@wehi.edu.au
#SBATCH --mail-type=END,FAIL
# Submitting a new job that depends on this one
sbatch --dependency=afternotok:${SLURM_JOBID} recursive-job.sh
# srunning the command
srun flye [flags] --resume
This job:
afternotok means the dependent job will only start if the current job doesn't complete successfully
the flye command is expected to run for as long as it can, up to the 2 day wall time
mail-type=END,FAIL sends an email when the job either ends or fails
By default, when you submit a Slurm job, Slurm copies all the environment variables in your environment and adds some extras for the job to use.
export VAR1="here is some text"
cat demo-scripts/env-vars1.sbatch
#!/bin/bash
echo $VAR1
sbatch demo-scripts/env-vars1.sbatch
Submitted batch job 8681656
cat slurm-8681656.out
here is some text
Note: For reproducibility reasons, a Slurm script that relies on environment variables can be submitted inside a wrapper script which first exports the relevant variable.
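For instance, a minimal wrapper sketch (the wrapper filename is arbitrary):

```bash
#!/bin/bash
# submit-env-vars1.sh: export the variable, then submit the job script
export VAR1="here is some text"
sbatch demo-scripts/env-vars1.sbatch
```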
sbatch also has an --export option which allows you to set specific values:
echo $VAR1
here is some text
sbatch --export=VAR1="this is some different text" demo-scripts/env-vars1.sbatch
Submitted batch job 8681761
cat slurm-8681761.out
this is some different text
This feature is especially useful when submitting jobs inside wrapper scripts.
You can also use the --export-file option to specify a file with a list of VAR=value pairs that you wish the script to use.
cat demo-scripts/env-vars2.sbatch
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
echo I am running on ${SLURM_NODELIST}
echo with ${SLURM_NTASKS} tasks
echo and ${SLURM_CPUS_PER_TASK} CPUs per task
sbatch demo-scripts/env-vars2.sbatch
Submitted batch job 8681710
cat slurm-8681710.out
I am running on sml-n03 with 1 tasks and 2 CPUs per task
These Slurm environment variables make it easy to supply parallelisation parameters to a program e.g. specifying number of threads.
Tip: scripts/programs executed by srun will have a SLURM_PROCID environment variable separating Slurm tasks (MPI-like programming).
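For example:

```bash
# each of the 4 tasks prints its own SLURM_PROCID (0..3) and the node it ran on
srun --ntasks=4 bash -c 'echo "task ${SLURM_PROCID} on $(hostname)"'
```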
Typically scripts submitted by sbatch use the bash or sh interpreter (e.g. #!/bin/bash), but it may be more convenient to use a different interpreter.
You can do this by changing the "hash bang" statement at the top of the script. To demonstrate this, we can take our original R matmul script, and add a "hash bang" statement to the top.
#!/usr/bin/env Rscript
## matmul.rscript
print("starting the matmul R script!")
nrows = 1e3
...
The statement above looks for Rscript in your current environment. This statement only works because Slurm will copy your environment when a Slurm script is submitted.
python works similarly. Replace Rscript in the hash bang statement with python.
Alternatively, you can specify the absolute path to the interpreter.
e.g. #!/stornext/System/data/apps/R/openBLAS/R-4.2.1/lib64/R/bin/Rscript
Tip: you can use --export=R_LIBS_USER=... to point Rscript to your libraries (or PYTHONPATH for python)
R example:
slurmtasks <- Sys.getenv("SLURM_NTASKS")
Python example:
import os
slurmtasks = os.getenv('SLURM_NTASKS')
Exercise:
Add a "using <ntasks> tasks and <cpus> CPUs per task" print statement to the matmul R script, using the Slurm environment variables.
The output should look something like:
[1] "starting the matmul R script!"
[1] "using 1 tasks and 2 CPUs per task"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06340098 secs
Making life easier with job arrays
Embarrassingly parallel computation is computation that can occur in parallel with minimal coordination. This type of parallel computation is very common.
Examples are parameter scans, genomic sequencing, basecalling, folding@home ...
Embarrassingly parallel problems are facilitated in Slurm by "array jobs". Array jobs allow you to use a single script to submit multiple jobs with similar functionality.
The main benefits to using an array job are:
Array jobs are created by adding the --array=start-end option. Slurm jobs, AKA "tasks", will be created with indices between start and end. e.g. --array=1-10 will create tasks with indices 1, 2, ..., 10.
start and end values can be within 0 and 1000 (inclusive). Note this is site specific.
Singular values or discrete lists can also be specified e.g. --array=1 or --array=1,3,5,7-10.
#!/usr/bin/env Rscript
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --array=1-10
## matmul.rscript
print("starting the matmul R script!")
paste("using", Sys.getenv("SLURM_NTASKS"), "tasks")
...
Slurm augments the default output behaviour of array jobs automatically.
If no --output option is provided, an array job will produce an output file slurm-<jobid>_<arrayindex>.out for each index in the array.
If you specify --output and --error, then you can use the %A and %a variables, which represent the array job ID and the array task index, respectively.
e.g.
#!/usr/bin/env Rscript
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --array=1-10
#SBATCH --output=Rmatmul-times-%A-%a.out
#SBATCH --error=Rmatmul-times-%A-%a.err
## matmul.rscript
...
Each task in the array can make use of its index to enable parallelism. This is done by making use of the SLURM_ARRAY_TASK_ID environment variable.
Other environment variables are accessible:
SLURM_ARRAY_JOB_ID: the job ID of the entire job array
SLURM_ARRAY_TASK_COUNT: the number of tasks in the array
SLURM_ARRAY_TASK_MAX: the largest ID of tasks in the array
SLURM_ARRAY_TASK_MIN: the smallest ID of tasks in the array
Exercise: Add a paste statement to the matmul R script that prints the task ID
The output of each job task should look something like:
[1] "starting the matmul R script!"
[1] "using 1 tasks and 2 CPUs per task"
[1] "I am job task 1 in an array of 10!"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06340098 secs
Exercise: set the nrows variable to equal 10*taskID
hint: you will need the strtoi function
Your output from job task 1 should look like:
[1] "starting the matmul R script!"
[1] "using 1 tasks and 2 CPUs per task"
[1] "I am job task 1 in an array of 10!"
[1] "elem: 10*10 = 100"
Time difference of 0.06340098 secs
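Putting this together in shell form, a minimal sketch of an array job that maps each task index to one input file (inputs.txt and mytool are hypothetical):

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --array=1-10

# inputs.txt is a hypothetical file with one input path per line;
# pick the line whose number matches this task's array index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)

# "mytool" and its flags are placeholders for your actual program
mytool --threads="${SLURM_CPUS_PER_TASK}" "${INPUT}"
```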
For workflows requiring input files or parameters, there are multiple ways you can use job arrays:
What you can't do:
What you can do:
--dependency=afterok:<jobid> makes a job depend on the entire job array
--dependency=afterok:<jobid>_<taskid> makes a job depend on a single array task
--mail-type=ALL will send notifications only for the entire job (not for each job task)
passing ARRAY_TASKS will send emails for each array task. e.g. --mail-type=BEGIN,ARRAY_TASKS will send an email every time a job array task starts.
Thanks for attending WEHI's first intermediate Slurm workshop!
Please fill out our feedback form:
https://forms.office.com/r/rKku8yqR57
We use these forms to help decide which workshops to run in the future and improve our current workshops!
Contact us at research.computing@wehi.edu.au for any kind of help related to computing and research!