Introduction & Housekeeping
Laying the groundwork: nodes, tasks and other Slurm terminology
Understanding your jobs and the cluster
Lunch (30mins)
Basic job profiling + 5min break
Slurm scripting features + 5min break
Embarrassingly parallel workflows with Slurm job arrays
This workshop assumes you have submitted jobs with sbatch and used options like --ntasks and --cpus-per-task, and --mem and/or --mem-per-cpu. More specifically, it assumes:
Awareness of "resources": CPUs, RAM/memory, Nodes, gres (GPUs)
Have used job submission commands
srun # executes a command/script/binary across tasks
salloc # allocates resources to be used (interactively and/or via srun)
sbatch # submits a script for later execution on requested resources
awareness of resource request options
--ntasks= # "tasks" recognised by srun
--nodes= # no. of nodes
--ntasks-per-node= # tasks per node
--cpus-per-task= # cpus per task
--mem= # memory required for entire job
--mem-per-cpu= # memory required for each CPU
--gres= # "generic resource" (e.g. GPUs)
--time= # requested wall time
Slides + live coding
Live coding will be on Milton, so make sure you're connected to WEHI's VPN or staff network, or use RAP: https://rap.wehi.edu.au
Please follow along to reinforce learning!
Questions:
Material is available here: https://github.com/WEHI-ResearchComputing/intermediate-slurm-workshop
Reviewing cluster concepts and explaining tasks, srun, and other Slurm concepts
Nodes are essentially standalone computers with their own CPU cores, RAM, local storage, and maybe GPUs.
Note: Slurm calls CPU cores "CPUs" (e.g. --cpus-per-task).
HPC clusters (or just clusters) will consist of multiple nodes connected together through a (sometimes fast) network.
Typically, HPC is organised with login nodes, compute nodes, and some nodes that perform scheduling and storage duties.
srun (vs sbatch and salloc)

TL;DR:
- sbatch requests resources for use with a script
- salloc requests resources to be used interactively
- srun runs programs/scripts using resources requested by sbatch and salloc

srun will execute ntasks instances of the same command/script/program.
Example:
1. Start an allocation with salloc --ntasks=4 --cpus-per-task=2 --nodes=4
2. Run the hostname command
3. Run the hostname command again, but with srun, i.e. srun hostname
4. Exit the salloc session and try running srun --ntasks=4 --cpus-per-task=2 --nodes=4 hostname (see the sketch below)
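A minimal, runnable sketch of that sequence, with comments on what to expect (the actual node names will differ on your cluster):

salloc --ntasks=4 --cpus-per-task=2 --nodes=4            # request an interactive allocation
hostname                                                 # prints one hostname: the node you landed on
srun hostname                                            # prints 4 hostnames, one per task
exit                                                     # release the allocation
srun --ntasks=4 --cpus-per-task=2 --nodes=4 hostname     # srun can also request resources directly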
The BatchHost executes the commands in script.sh.

How do sbatch/salloc and srun work together?

A command/script/program will be executed on the BatchHost only. To use the other nodes, execute srun hostname. The BatchHost will send the hostname command to the remaining tasks to be executed as well.
Without srun, only the BatchHost executes the commands/script.

Using srun still doesn't guarantee the extra nodes will be used "properly"!
Nodes cannot collaborate on problems unless they are running a program designed that way.
It's like clicking your mouse on your PC, and expecting the click to register on a colleague's PC.
It's possible, but needs a special program/protocol to do so!
Biological sciences and statistics tend not to make use of multiple nodes to cooperate on a single problem.
Hence, we recommend passing --nodes=1.
Tasks are a collection of resources (CPU cores, GPUs) expected to perform the same "task", or used by a single program e.g., via threads, Python multiprocessing, or OpenMP.
Tasks are not equivalent to the number of CPUs!
The Slurm task model was created with "traditional HPC" in mind: srun creates ntasks instances of a program which coordinate using MPI. Tasks are not as relevant in bioinformatics, but Slurm nevertheless uses tasks for accounting/profiling purposes.
Therefore, it's useful to have an understanding of tasks in order to interpret some of Slurm's job accounting/profiling outputs.
A task can only be given resources co-located on a node.
Multiple tasks requested by sbatch or salloc can be spread across multiple nodes (unless --nodes= is specified).
For example, if we have two nodes with 4 CPU cores each:
requesting 1 task and 8 cpus-per-task won't work.
But requesting 2 tasks and 4 cpus-per-task will!
Most data science, statistics, bioinformatics, and health-science work will use --ntasks=1 and control parallelism with --cpus-per-task.
If you see/hear anything to do with "distributed" or MPI (e.g. distributed ML), you may want to change these options.
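As a sketch, a typical single-node, multi-threaded (non-MPI) resource request looks like the following; the specific CPU and memory numbers are just example values to adjust for your program:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G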
Using Slurm and system tools to understand what your jobs are doing
Primary utilities discussed in this section:
- squeue: live job queue data (queries the Slurm controller directly)
- sacct: historical job data (queries the Slurm database)
- scontrol: live data for a single job (queries the Slurm controller directly)
- sinfo: live cluster data (queries the Slurm controller directly)

This section shows you how to get more detailed information about your jobs and the cluster.
WEHI also offers the HPC dashboard, which provides visibility into the status of the cluster.
http://dashboards.hpc.wehi.edu.au/
Note: the dashboards' info is coarser than what the Slurm commands can provide, and is specific to WEHI.
squeue

squeue shows everyone's jobs in the queue; passing -u <username> shows only <username>'s jobs.
squeue | head -n 5
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 8516030 gpuq interact bollands R 1:30:11 1 gpu-p100-n01 8515707 gpuq cryospar cryospar R 3:04:59 1 gpu-p100-n01 8511988 interacti sys/dash yan.a R 20:15:53 1 sml-n03 8516092 interacti work jackson. R 1:21:42 1 sml-n01
But what if we want even more information?
We have to make use of the formatting options!
$ squeue --Format field1,field2,...
OR use the environment variable SQUEUE_FORMAT2. Useful fields:
| Resources related | Time related | Scheduling |
|---|---|---|
| NumCPUs | starttime | JobId |
| NumNodes | submittime | name |
| minmemory | pendingtime | partition |
| tres-alloc | timelimit | priority |
| | timeleft | reasonlist |
| | timeused | workdir |
| | | state |
You can always use man squeue to see the entire list of options.
So you don't have to type out the fields every time, I recommend aliasing the command with your fields of choice in ~/.bashrc, e.g.
alias sqv="squeue --Format=jobid:8,name:6' ',partition:10' ',statecompact:3,tres-alloc:60,timelimit:12,timeleft:12"
sqv | head -n 5
JOBID NAME PARTITION ST TRES_ALLOC TIME_LIMIT TIME_LEFT 8517002 R bigmem R cpu=22,mem=88G,node=1,billing=720984 1-00:00:00 23:35:18 8516030 intera gpuq R cpu=2,mem=20G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1 8:00:00 4:43:00 8515707 cryosp gpuq R cpu=8,mem=17G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1 2-00:00:00 1-19:08:12 8511988 sys/da interactiv R cpu=8,mem=16G,node=1,billing=112 1-00:00:00 1:57:18
sqv -u bedo.j | head -n 5
JOBID NAME PARTITION ST TRES_ALLOC TIME_LIMIT TIME_LEFT 8516851 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00 8516850 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00 8516849 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00 8516848 bionix regular PD cpu=24,mem=90G,node=1,billing=204 2-00:00:00 2-00:00:00
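If you prefer not to define an alias, the same idea works with the SQUEUE_FORMAT2 environment variable mentioned above. A sketch, assuming you set it once per shell session (pick whichever fields and widths you like):

export SQUEUE_FORMAT2="jobid:8,name:10,partition:10,statecompact:4,tres-alloc:60,timelimit:12,timeleft:12"
squeue -u $USER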
scontrol show job <jobid>
Useful if you care only about a specific job.
It's very useful when debugging jobs.
A lot of information without needing lots of input.
scontrol show job 8516360
JobId=8516360 JobName=Extr16S23S UserId=woodruff.c(2317) GroupId=allstaff(10908) MCS_label=N/A Priority=324 Nice=0 Account=wehi QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:21:53 TimeLimit=2-00:00:00 TimeMin=N/A SubmitTime=2022-10-20T11:37:49 EligibleTime=2022-10-20T11:37:49 AccrueTime=2022-10-20T11:37:49 StartTime=2022-10-20T14:28:03 EndTime=2022-10-22T14:28:03 Deadline=N/A PreemptEligibleTime=2022-10-20T14:28:03 PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-20T14:28:03 Scheduler=Main Partition=regular AllocNode:Sid=vc7-shared:12938 ReqNodeList=(null) ExcNodeList=(null) NodeList=med-n24 BatchHost=med-n24 NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=32,mem=48G,node=1,billing=128 Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=* MinCPUsNode=32 MinMemoryNode=48G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell/ribosomal_16S23S_extract_singlespecies.sh Staphylococcus epidermidis 32 WorkDir=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell StdErr=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell/slurm-8516360.out StdIn=/dev/null StdOut=/stornext/Bioinf/data/lab_speed/cjw/microbiome/scripts/shell/slurm-8516360.out Power= MailUser=woodruff.c@wehi.edu.au MailType=END,FAIL
sacct

squeue and scontrol show job only show information on jobs that are in the queue, i.e. jobs that are pending, running, or finishing up. Once jobs complete, fail, or are cancelled, the job data is put into a Slurm job database. This database can be queried by sacct to get information about your jobs.
sacct
JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 9041674 crest0.3 regular wehi 224 CANCELLED+ 0:0 9041674.bat+ batch wehi 56 CANCELLED 0:15 9041674.ext+ extern wehi 224 COMPLETED 0:0 9041674.0 orted wehi 168 FAILED 1:0 9170758 gatk-4.2.+ regular wehi 56 COMPLETED 0:0 9170758.ext+ extern wehi 56 COMPLETED 0:0 9170758.0 nix-user-+ wehi 56 COMPLETED 0:0 9221903 impute_1.+ regular wehi 2 COMPLETED 0:0 9221903.ext+ extern wehi 2 COMPLETED 0:0 9221903.0 nix-user-+ wehi 2 COMPLETED 0:0 9221905 lambda.r_+ regular wehi 2 COMPLETED 0:0 9221905.ext+ extern wehi 2 COMPLETED 0:0 9221905.0 nix-user-+ wehi 2 COMPLETED 0:0 9221907 limma_3.5+ regular wehi 2 COMPLETED 0:0 9221907.ext+ extern wehi 2 COMPLETED 0:0 9221907.0 nix-user-+ wehi 2 COMPLETED 0:0 9221909 listenv_0+ regular wehi 2 COMPLETED 0:0 9221909.ext+ extern wehi 2 COMPLETED 0:0 9221909.0 nix-user-+ wehi 2 COMPLETED 0:0 9221910 marray_1.+ regular wehi 2 COMPLETED 0:0 9221910.ext+ extern wehi 2 COMPLETED 0:0 9221910.0 nix-user-+ wehi 2 COMPLETED 0:0 9221911 matrixSta+ regular wehi 2 COMPLETED 0:0 9221911.ext+ extern wehi 2 COMPLETED 0:0 9221911.0 nix-user-+ wehi 2 COMPLETED 0:0 9221912 parallell+ regular wehi 2 COMPLETED 0:0 9221912.ext+ extern wehi 2 COMPLETED 0:0 9221912.0 nix-user-+ wehi 2 COMPLETED 0:0 9221913 r-BH-1.78+ regular wehi 56 COMPLETED 0:0 9221913.ext+ extern wehi 56 COMPLETED 0:0 9221913.0 nix-user-+ wehi 56 COMPLETED 0:0 9221930 sys/dashb+ interacti+ wehi 2 RUNNING 0:0 9221930.bat+ batch wehi 2 RUNNING 0:0 9221930.ext+ extern wehi 2 RUNNING 0:0 9221945 r-BiocGen+ regular wehi 56 COMPLETED 0:0 9221945.ext+ extern wehi 56 COMPLETED 0:0 9221945.0 nix-user-+ wehi 56 COMPLETED 0:0 9221946 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221946.ext+ extern wehi 56 COMPLETED 0:0 9221946.0 nix-user-+ wehi 56 COMPLETED 0:0 9221947 r-Biobase+ regular wehi 56 COMPLETED 0:0 9221947.ext+ extern wehi 56 COMPLETED 0:0 9221947.0 nix-user-+ wehi 56 COMPLETED 0:0 9221949 r-R.metho+ regular wehi 56 COMPLETED 0:0 9221949.ext+ extern wehi 56 COMPLETED 0:0 9221949.0 nix-user-+ wehi 56 COMPLETED 0:0 9221950 r-S4Vecto+ regular wehi 56 COMPLETED 0:0 9221950.ext+ extern wehi 56 COMPLETED 0:0 9221950.0 nix-user-+ wehi 56 COMPLETED 0:0 9221955 r-R.oo-1.+ regular wehi 56 COMPLETED 0:0 9221955.ext+ extern wehi 56 COMPLETED 0:0 9221955.0 nix-user-+ wehi 56 COMPLETED 0:0 9221956 r-BiocIO-+ regular wehi 56 COMPLETED 0:0 9221956.ext+ extern wehi 56 COMPLETED 0:0 9221956.0 nix-user-+ wehi 56 COMPLETED 0:0 9221957 r-IRanges+ regular wehi 56 COMPLETED 0:0 9221957.ext+ extern wehi 56 COMPLETED 0:0 9221957.0 nix-user-+ wehi 56 COMPLETED 0:0 9221958 r-R.utils+ regular wehi 56 COMPLETED 0:0 9221958.ext+ extern wehi 56 COMPLETED 0:0 9221958.0 nix-user-+ wehi 56 COMPLETED 0:0 9221964 r-XML-3.9+ regular wehi 56 COMPLETED 0:0 9221964.ext+ extern wehi 56 COMPLETED 0:0 9221964.0 nix-user-+ wehi 56 COMPLETED 0:0 9221970 r-bitops-+ regular wehi 56 COMPLETED 0:0 9221970.ext+ extern wehi 56 COMPLETED 0:0 9221970.0 nix-user-+ wehi 56 COMPLETED 0:0 9221972 r-formatR+ regular wehi 56 COMPLETED 0:0 9221972.ext+ extern wehi 56 COMPLETED 0:0 9221972.0 nix-user-+ wehi 56 COMPLETED 0:0 9221973 r-RCurl-1+ regular wehi 56 COMPLETED 0:0 9221973.ext+ extern wehi 56 COMPLETED 0:0 9221973.0 nix-user-+ wehi 56 COMPLETED 0:0 9221977 r-futile.+ regular wehi 56 COMPLETED 0:0 9221977.ext+ extern wehi 56 COMPLETED 0:0 9221977.0 nix-user-+ wehi 56 COMPLETED 0:0 9221978 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221978.ext+ 
extern wehi 56 COMPLETED 0:0 9221978.0 nix-user-+ wehi 56 COMPLETED 0:0 9221981 r-globals+ regular wehi 56 COMPLETED 0:0 9221981.ext+ extern wehi 56 COMPLETED 0:0 9221981.0 nix-user-+ wehi 56 COMPLETED 0:0 9221983 r-impute-+ regular wehi 56 COMPLETED 0:0 9221983.ext+ extern wehi 56 COMPLETED 0:0 9221983.0 nix-user-+ wehi 56 COMPLETED 0:0 9221985 r-lambda.+ regular wehi 56 COMPLETED 0:0 9221985.ext+ extern wehi 56 COMPLETED 0:0 9221985.0 nix-user-+ wehi 56 COMPLETED 0:0 9221986 r-limma-3+ regular wehi 56 COMPLETED 0:0 9221986.ext+ extern wehi 56 COMPLETED 0:0 9221986.0 nix-user-+ wehi 56 COMPLETED 0:0 9221989 r-futile.+ regular wehi 56 COMPLETED 0:0 9221989.ext+ extern wehi 56 COMPLETED 0:0 9221989.0 nix-user-+ wehi 56 COMPLETED 0:0 9221990 r-listenv+ regular wehi 56 COMPLETED 0:0 9221990.ext+ extern wehi 56 COMPLETED 0:0 9221990.0 nix-user-+ wehi 56 COMPLETED 0:0 9221992 r-marray-+ regular wehi 56 COMPLETED 0:0 9221992.ext+ extern wehi 56 COMPLETED 0:0 9221992.0 nix-user-+ wehi 56 COMPLETED 0:0 9221999 r-matrixS+ regular wehi 56 COMPLETED 0:0 9221999.ext+ extern wehi 56 COMPLETED 0:0 9221999.0 nix-user-+ wehi 56 COMPLETED 0:0 9222004 r-CGHbase+ regular wehi 56 COMPLETED 0:0 9222004.ext+ extern wehi 56 COMPLETED 0:0 9222004.0 nix-user-+ wehi 56 COMPLETED 0:0 9222005 r-MatrixG+ regular wehi 56 COMPLETED 0:0 9222005.ext+ extern wehi 56 COMPLETED 0:0 9222005.0 nix-user-+ wehi 56 COMPLETED 0:0 9222006 r-paralle+ regular wehi 56 COMPLETED 0:0 9222006.ext+ extern wehi 56 COMPLETED 0:0 9222006.0 nix-user-+ wehi 56 COMPLETED 0:0 9222007 r-Delayed+ regular wehi 56 COMPLETED 0:0 9222007.ext+ extern wehi 56 COMPLETED 0:0 9222007.0 nix-user-+ wehi 56 COMPLETED 0:0 9222009 r-future-+ regular wehi 56 COMPLETED 0:0 9222009.ext+ extern wehi 56 COMPLETED 0:0 9222009.0 nix-user-+ wehi 56 COMPLETED 0:0 9222010 restfulr_+ regular wehi 2 COMPLETED 0:0 9222010.ext+ extern wehi 2 COMPLETED 0:0 9222010.0 nix-user-+ wehi 2 COMPLETED 0:0 9222011 r-future.+ regular wehi 56 COMPLETED 0:0 9222011.ext+ extern wehi 56 COMPLETED 0:0 9222011.0 nix-user-+ wehi 56 COMPLETED 0:0 9222012 rjson_0.2+ regular wehi 2 COMPLETED 0:0 9222012.ext+ extern wehi 2 COMPLETED 0:0 9222012.0 nix-user-+ wehi 2 COMPLETED 0:0 9222014 rtracklay+ regular wehi 2 COMPLETED 0:0 9222014.ext+ extern wehi 2 COMPLETED 0:0 9222014.0 nix-user-+ wehi 2 COMPLETED 0:0 9222016 r-rjson-0+ regular wehi 56 COMPLETED 0:0 9222016.ext+ extern wehi 56 COMPLETED 0:0 9222016.0 nix-user-+ wehi 56 COMPLETED 0:0 9222019 snow_0.4-+ regular wehi 2 COMPLETED 0:0 9222019.ext+ extern wehi 2 COMPLETED 0:0 9222019.0 nix-user-+ wehi 2 COMPLETED 0:0 9222020 snowfall_+ regular wehi 2 COMPLETED 0:0 9222020.ext+ extern wehi 2 COMPLETED 0:0 9222020.0 nix-user-+ wehi 2 COMPLETED 0:0 9222022 r-snow-0.+ regular wehi 56 COMPLETED 0:0 9222022.ext+ extern wehi 56 COMPLETED 0:0 9222022.0 nix-user-+ wehi 56 COMPLETED 0:0 9222024 source regular wehi 2 COMPLETED 0:0 9222024.ext+ extern wehi 2 COMPLETED 0:0 9222024.0 nix-user-+ wehi 2 COMPLETED 0:0 9222183 r-BiocPar+ regular wehi 56 COMPLETED 0:0 9222183.ext+ extern wehi 56 COMPLETED 0:0 9222183.0 nix-user-+ wehi 56 COMPLETED 0:0 9222214 kent-404 regular wehi 56 COMPLETED 0:0 9222214.ext+ extern wehi 56 COMPLETED 0:0 9222214.0 nix-user-+ wehi 56 COMPLETED 0:0 9222256 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222256.ext+ extern wehi 6 COMPLETED 0:0 9222256.0 nix-user-+ wehi 6 COMPLETED 0:0 9222257 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222257.ext+ extern wehi 6 COMPLETED 0:0 9222257.0 nix-user-+ wehi 6 COMPLETED 0:0 9222258 nix-store+ regular 
wehi 6 COMPLETED 0:0 9222258.ext+ extern wehi 6 COMPLETED 0:0 9222258.0 nix-user-+ wehi 6 COMPLETED 0:0 9222259 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222259.ext+ extern wehi 6 COMPLETED 0:0 9222259.0 nix-user-+ wehi 6 COMPLETED 0:0 9222260 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222260.ext+ extern wehi 6 COMPLETED 0:0 9222260.0 nix-user-+ wehi 6 COMPLETED 0:0 9222261 libcap-st+ regular wehi 6 COMPLETED 0:0 9222261.ext+ extern wehi 6 COMPLETED 0:0 9222261.0 nix-user-+ wehi 6 COMPLETED 0:0 9222262 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222262.ext+ extern wehi 6 COMPLETED 0:0 9222262.0 nix-user-+ wehi 6 COMPLETED 0:0 9222264 bubblewra+ regular wehi 6 COMPLETED 0:0 9222264.ext+ extern wehi 6 COMPLETED 0:0 9222264.0 nix-user-+ wehi 6 COMPLETED 0:0 9222265 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222265.ext+ extern wehi 6 COMPLETED 0:0 9222265.0 nix-user-+ wehi 6 COMPLETED 0:0 9222271 r-snowfal+ regular wehi 56 COMPLETED 0:0 9222271.ext+ extern wehi 56 COMPLETED 0:0 9222271.0 nix-user-+ wehi 56 COMPLETED 0:0 9222331 splitFA regular wehi 56 COMPLETED 0:0 9222331.ext+ extern wehi 56 COMPLETED 0:0 9222331.0 nix-user-+ wehi 56 COMPLETED 0:0 9222334 r-CGHcall+ regular wehi 56 COMPLETED 0:0 9222334.ext+ extern wehi 56 COMPLETED 0:0 9222334.0 nix-user-+ wehi 56 COMPLETED 0:0 9222341 seed.txt regular wehi 56 COMPLETED 0:0 9222341.ext+ extern wehi 56 COMPLETED 0:0 9222341.0 nix-user-+ wehi 56 COMPLETED 0:0 9222374 strip-sto+ regular wehi 56 COMPLETED 0:0 9222374.ext+ extern wehi 56 COMPLETED 0:0 9222374.0 nix-user-+ wehi 56 COMPLETED 0:0 9222400 forgeBSge+ regular wehi 56 COMPLETED 0:0 9222400.ext+ extern wehi 56 COMPLETED 0:0 9222400.0 nix-user-+ wehi 56 COMPLETED 0:0 9222431 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222431.ext+ extern wehi 6 COMPLETED 0:0 9222431.0 nix-user-+ wehi 6 COMPLETED 0:0 9222486 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222486.ext+ extern wehi 6 COMPLETED 0:0 9222486.0 nix-user-+ wehi 6 COMPLETED 0:0 9222654 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222654.ext+ extern wehi 6 COMPLETED 0:0 9222654.0 nix-user-+ wehi 6 COMPLETED 0:0 9222700 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222700.ext+ extern wehi 6 COMPLETED 0:0 9222700.0 nix-user-+ wehi 6 COMPLETED 0:0 9222701 build-bun+ regular wehi 6 COMPLETED 0:0 9222701.ext+ extern wehi 6 COMPLETED 0:0 9222701.0 nix-user-+ wehi 6 COMPLETED 0:0 9222702 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222702.ext+ extern wehi 6 COMPLETED 0:0 9222702.0 nix-user-+ wehi 6 COMPLETED 0:0 9222704 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222704.ext+ extern wehi 6 COMPLETED 0:0 9222704.0 nix-user-+ wehi 6 COMPLETED 0:0 9222705 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222705.ext+ extern wehi 6 COMPLETED 0:0 9222705.0 nix-user-+ wehi 6 COMPLETED 0:0 9222706 nix-store+ regular wehi 6 COMPLETED 0:0 9222706.ext+ extern wehi 6 COMPLETED 0:0 9222706.0 nix-user-+ wehi 6 COMPLETED 0:0 9222707 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222707.ext+ extern wehi 6 COMPLETED 0:0 9222707.0 nix-user-+ wehi 6 COMPLETED 0:0 9222708 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222708.ext+ extern wehi 6 COMPLETED 0:0 9222708.0 nix-user-+ wehi 6 COMPLETED 0:0 9222709 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222709.ext+ extern wehi 6 COMPLETED 0:0 9222709.0 nix-user-+ wehi 6 COMPLETED 0:0 9222711 build-bun+ regular wehi 6 COMPLETED 0:0 9222711.ext+ extern wehi 6 COMPLETED 0:0 9222711.0 nix-user-+ wehi 6 COMPLETED 0:0 9222712 dorado-A1+ gpuq_large wehi 4 FAILED 127:0 9222712.bat+ batch wehi 4 FAILED 127:0 9222712.ext+ extern wehi 4 COMPLETED 0:0 9222712.0 time wehi 4 FAILED 
127:0 9222714 dorado-A1+ gpuq_large wehi 4 COMPLETED 0:0 9222714.bat+ batch wehi 4 COMPLETED 0:0 9222714.ext+ extern wehi 4 COMPLETED 0:0 9222714.0 time wehi 4 COMPLETED 0:0 9222719 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222719.ext+ extern wehi 6 COMPLETED 0:0 9222719.0 nix-user-+ wehi 6 COMPLETED 0:0 9222723 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222723.ext+ extern wehi 6 COMPLETED 0:0 9222723.0 nix-user-+ wehi 6 COMPLETED 0:0 9222724 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222724.ext+ extern wehi 6 COMPLETED 0:0 9222724.0 nix-user-+ wehi 6 COMPLETED 0:0 9222725 nix-store+ regular wehi 6 COMPLETED 0:0 9222725.ext+ extern wehi 6 COMPLETED 0:0 9222725.0 nix-user-+ wehi 6 COMPLETED 0:0 9222726 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222726.ext+ extern wehi 6 COMPLETED 0:0 9222726.0 nix-user-+ wehi 6 COMPLETED 0:0 9222727 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222727.ext+ extern wehi 6 COMPLETED 0:0 9222727.0 nix-user-+ wehi 6 COMPLETED 0:0 9222728 libcap-st+ regular wehi 6 COMPLETED 0:0 9222728.ext+ extern wehi 6 COMPLETED 0:0 9222728.0 nix-user-+ wehi 6 COMPLETED 0:0 9222729 nix-2.5pr+ regular wehi 6 FAILED 2:0 9222729.ext+ extern wehi 6 COMPLETED 0:0 9222729.0 nix-user-+ wehi 6 FAILED 2:0 9222730 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222730.ext+ extern wehi 6 COMPLETED 0:0 9222730.0 nix-user-+ wehi 6 COMPLETED 0:0 9222733 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222733.ext+ extern wehi 6 COMPLETED 0:0 9222733.0 nix-user-+ wehi 6 COMPLETED 0:0 9222734 bubblewra+ regular wehi 6 COMPLETED 0:0 9222734.ext+ extern wehi 6 COMPLETED 0:0 9222734.0 nix-user-+ wehi 6 COMPLETED 0:0 9222751 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222751.ext+ extern wehi 6 COMPLETED 0:0 9222751.0 nix-user-+ wehi 6 COMPLETED 0:0 9222763 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222763.ext+ extern wehi 6 COMPLETED 0:0 9222763.0 nix-user-+ wehi 6 COMPLETED 0:0 9222764 interacti+ regular wehi 2 COMPLETED 0:0 9222764.int+ interacti+ wehi 2 COMPLETED 0:0 9222764.ext+ extern wehi 2 COMPLETED 0:0 9222764.0 echo wehi 2 COMPLETED 0:0 9222764.1 echo wehi 2 COMPLETED 0:0 9222764.2 echo wehi 2 COMPLETED 0:0 9222764.3 echo wehi 2 COMPLETED 0:0 9222787 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222787.ext+ extern wehi 6 COMPLETED 0:0 9222787.0 nix-user-+ wehi 6 COMPLETED 0:0 9222790 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222790.ext+ extern wehi 6 COMPLETED 0:0 9222790.0 nix-user-+ wehi 6 COMPLETED 0:0 9222796 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222796.ext+ extern wehi 2 COMPLETED 0:0 9222796.0 nix-user-+ wehi 2 COMPLETED 0:0 9222797 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222797.ext+ extern wehi 2 COMPLETED 0:0 9222797.0 nix-user-+ wehi 2 COMPLETED 0:0 9222798 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222798.ext+ extern wehi 2 COMPLETED 0:0 9222798.0 nix-user-+ wehi 2 COMPLETED 0:0 9222799 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222799.ext+ extern wehi 6 COMPLETED 0:0 9222799.0 nix-user-+ wehi 6 COMPLETED 0:0 9222800 build-bun+ regular wehi 6 COMPLETED 0:0 9222800.ext+ extern wehi 6 COMPLETED 0:0 9222800.0 nix-user-+ wehi 6 COMPLETED 0:0 9222812 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222812.ext+ extern wehi 2 COMPLETED 0:0 9222812.0 nix-user-+ wehi 2 COMPLETED 0:0 9222813 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222813.ext+ extern wehi 6 COMPLETED 0:0 9222813.0 nix-user-+ wehi 6 COMPLETED 0:0 9222816 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222816.ext+ extern wehi 6 COMPLETED 0:0 9222816.0 nix-user-+ wehi 6 COMPLETED 0:0 9222817 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222817.ext+ extern wehi 2 COMPLETED 0:0 
9222817.0 nix-user-+ wehi 2 COMPLETED 0:0 9222820 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222820.ext+ extern wehi 2 COMPLETED 0:0 9222820.0 nix-user-+ wehi 2 COMPLETED 0:0 9222822 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222822.ext+ extern wehi 6 COMPLETED 0:0 9222822.0 nix-user-+ wehi 6 COMPLETED 0:0 9222826 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222826.ext+ extern wehi 6 COMPLETED 0:0 9222826.0 nix-user-+ wehi 6 COMPLETED 0:0 9222843 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222843.ext+ extern wehi 2 COMPLETED 0:0 9222843.0 nix-user-+ wehi 2 COMPLETED 0:0 9222845 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222845.ext+ extern wehi 2 COMPLETED 0:0 9222845.0 nix-user-+ wehi 2 COMPLETED 0:0 9222856 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222856.ext+ extern wehi 6 COMPLETED 0:0 9222856.0 nix-user-+ wehi 6 COMPLETED 0:0 9222860 interacti+ regular wehi 4 COMPLETED 0:0 9222860.int+ interacti+ wehi 4 COMPLETED 0:0 9222860.ext+ extern wehi 4 COMPLETED 0:0 9222861 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222861.ext+ extern wehi 6 COMPLETED 0:0 9222861.0 nix-user-+ wehi 6 COMPLETED 0:0 9222862 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222862.ext+ extern wehi 2 COMPLETED 0:0 9222862.0 nix-user-+ wehi 2 COMPLETED 0:0 9222864 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222864.ext+ extern wehi 2 COMPLETED 0:0 9222864.0 nix-user-+ wehi 2 COMPLETED 0:0 9222865 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222865.ext+ extern wehi 6 COMPLETED 0:0 9222865.0 nix-user-+ wehi 6 COMPLETED 0:0 9222872 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222872.ext+ extern wehi 6 COMPLETED 0:0 9222872.0 nix-user-+ wehi 6 COMPLETED 0:0 9222875 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222875.ext+ extern wehi 2 COMPLETED 0:0 9222875.0 nix-user-+ wehi 2 COMPLETED 0:0 9223289 bionix-sa+ regular wehi 2 COMPLETED 0:0 9223289.ext+ extern wehi 2 COMPLETED 0:0 9223289.0 nix-user-+ wehi 2 COMPLETED 0:0 9225143 bionix-sa+ regular wehi 6 COMPLETED 0:0 9225143.ext+ extern wehi 6 COMPLETED 0:0 9225143.0 nix-user-+ wehi 6 COMPLETED 0:0 9229643 bionix-wi+ regular wehi 2 COMPLETED 0:0 9229643.ext+ extern wehi 2 COMPLETED 0:0 9229643.0 nix-user-+ wehi 2 COMPLETED 0:0 9229733 bionix-sa+ regular wehi 2 COMPLETED 0:0 9229733.ext+ extern wehi 2 COMPLETED 0:0 9229733.0 nix-user-+ wehi 2 COMPLETED 0:0 9229738 bin.R regular wehi 56 PENDING 0:0
By default, sacct returns jobs that ran today. Choose the time window with -S <date-time> and -E <date-time>. Date-times take the form YYYY-MM-DDThh:mm:ss, and shortened forms like sacct -S 2022-11 are acceptable too.
- -S: start date-time
- -E: end date-time

Note: big/frequent sacct queries can occupy and eventually overload the Slurm controller node.
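For example (a sketch; adjust the dates to your own time window):

sacct -S 2022-11-01 -E 2022-11-07      # jobs between the 1st and 7th of November
sacct -S 2022-11-01T09:00              # jobs since 9am on the 1st of November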
sacct's behaviour can be augmented by --format. See man sacct for more details.

-X can be used to group job steps together, but this prevents some statistics like IO and memory from being reported.
sacct -X
JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 9041674 crest0.3 regular wehi 224 CANCELLED+ 0:0 9170758 gatk-4.2.+ regular wehi 56 COMPLETED 0:0 9221903 impute_1.+ regular wehi 2 COMPLETED 0:0 9221905 lambda.r_+ regular wehi 2 COMPLETED 0:0 9221907 limma_3.5+ regular wehi 2 COMPLETED 0:0 9221909 listenv_0+ regular wehi 2 COMPLETED 0:0 9221910 marray_1.+ regular wehi 2 COMPLETED 0:0 9221911 matrixSta+ regular wehi 2 COMPLETED 0:0 9221912 parallell+ regular wehi 2 COMPLETED 0:0 9221913 r-BH-1.78+ regular wehi 56 COMPLETED 0:0 9221930 sys/dashb+ interacti+ wehi 2 RUNNING 0:0 9221945 r-BiocGen+ regular wehi 56 COMPLETED 0:0 9221946 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221947 r-Biobase+ regular wehi 56 COMPLETED 0:0 9221949 r-R.metho+ regular wehi 56 COMPLETED 0:0 9221950 r-S4Vecto+ regular wehi 56 COMPLETED 0:0 9221955 r-R.oo-1.+ regular wehi 56 COMPLETED 0:0 9221956 r-BiocIO-+ regular wehi 56 COMPLETED 0:0 9221957 r-IRanges+ regular wehi 56 COMPLETED 0:0 9221958 r-R.utils+ regular wehi 56 COMPLETED 0:0 9221964 r-XML-3.9+ regular wehi 56 COMPLETED 0:0 9221970 r-bitops-+ regular wehi 56 COMPLETED 0:0 9221972 r-formatR+ regular wehi 56 COMPLETED 0:0 9221973 r-RCurl-1+ regular wehi 56 COMPLETED 0:0 9221977 r-futile.+ regular wehi 56 COMPLETED 0:0 9221978 r-GenomeI+ regular wehi 56 COMPLETED 0:0 9221981 r-globals+ regular wehi 56 COMPLETED 0:0 9221983 r-impute-+ regular wehi 56 COMPLETED 0:0 9221985 r-lambda.+ regular wehi 56 COMPLETED 0:0 9221986 r-limma-3+ regular wehi 56 COMPLETED 0:0 9221989 r-futile.+ regular wehi 56 COMPLETED 0:0 9221990 r-listenv+ regular wehi 56 COMPLETED 0:0 9221992 r-marray-+ regular wehi 56 COMPLETED 0:0 9221999 r-matrixS+ regular wehi 56 COMPLETED 0:0 9222004 r-CGHbase+ regular wehi 56 COMPLETED 0:0 9222005 r-MatrixG+ regular wehi 56 COMPLETED 0:0 9222006 r-paralle+ regular wehi 56 COMPLETED 0:0 9222007 r-Delayed+ regular wehi 56 COMPLETED 0:0 9222009 r-future-+ regular wehi 56 COMPLETED 0:0 9222010 restfulr_+ regular wehi 2 COMPLETED 0:0 9222011 r-future.+ regular wehi 56 COMPLETED 0:0 9222012 rjson_0.2+ regular wehi 2 COMPLETED 0:0 9222014 rtracklay+ regular wehi 2 COMPLETED 0:0 9222016 r-rjson-0+ regular wehi 56 COMPLETED 0:0 9222019 snow_0.4-+ regular wehi 2 COMPLETED 0:0 9222020 snowfall_+ regular wehi 2 COMPLETED 0:0 9222022 r-snow-0.+ regular wehi 56 COMPLETED 0:0 9222024 source regular wehi 2 COMPLETED 0:0 9222183 r-BiocPar+ regular wehi 56 COMPLETED 0:0 9222214 kent-404 regular wehi 56 COMPLETED 0:0 9222256 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222257 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222258 nix-store+ regular wehi 6 COMPLETED 0:0 9222259 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222260 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222261 libcap-st+ regular wehi 6 COMPLETED 0:0 9222262 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222264 bubblewra+ regular wehi 6 COMPLETED 0:0 9222265 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222271 r-snowfal+ regular wehi 56 COMPLETED 0:0 9222331 splitFA regular wehi 56 COMPLETED 0:0 9222334 r-CGHcall+ regular wehi 56 COMPLETED 0:0 9222341 seed.txt regular wehi 56 COMPLETED 0:0 9222374 strip-sto+ regular wehi 56 COMPLETED 0:0 9222400 forgeBSge+ regular wehi 56 COMPLETED 0:0 9222431 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222486 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222654 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222700 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222701 build-bun+ regular wehi 6 COMPLETED 0:0 9222702 bionix-bw+ regular 
wehi 6 COMPLETED 0:0 9222704 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222705 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222706 nix-store+ regular wehi 6 COMPLETED 0:0 9222707 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222708 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222709 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222711 build-bun+ regular wehi 6 COMPLETED 0:0 9222712 dorado-A1+ gpuq_large wehi 4 FAILED 127:0 9222714 dorado-A1+ gpuq_large wehi 4 COMPLETED 0:0 9222719 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222723 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9222724 chroot-wr+ regular wehi 6 COMPLETED 0:0 9222725 nix-store+ regular wehi 6 COMPLETED 0:0 9222726 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9222727 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9222728 libcap-st+ regular wehi 6 COMPLETED 0:0 9222729 nix-2.5pr+ regular wehi 6 FAILED 2:0 9222730 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222733 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9222734 bubblewra+ regular wehi 6 COMPLETED 0:0 9222751 nix-2.5pr+ regular wehi 6 COMPLETED 0:0 9222763 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222764 interacti+ regular wehi 2 COMPLETED 0:0 9222787 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222790 bionix-bw+ regular wehi 6 COMPLETED 0:0 9222796 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222797 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222798 bionix-ge+ regular wehi 2 COMPLETED 0:0 9222799 slurm-nix+ regular wehi 6 COMPLETED 0:0 9222800 build-bun+ regular wehi 6 COMPLETED 0:0 9222812 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222813 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222816 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222817 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222820 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222822 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222826 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222843 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222845 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222856 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222860 interacti+ regular wehi 4 COMPLETED 0:0 9222861 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222862 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222864 bionix-sa+ regular wehi 2 COMPLETED 0:0 9222865 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222872 bionix-sa+ regular wehi 6 COMPLETED 0:0 9222875 bionix-sa+ regular wehi 2 COMPLETED 0:0 9223289 bionix-sa+ regular wehi 2 COMPLETED 0:0 9225143 bionix-sa+ regular wehi 6 COMPLETED 0:0 9229643 bionix-wi+ regular wehi 2 COMPLETED 0:0 9229733 bionix-sa+ regular wehi 2 COMPLETED 0:0 9229738 bin.R regular wehi 56 CANCELLED+ 0:0 9242414 targets.i+ regular wehi 56 COMPLETED 0:0 9242415 bin.R regular wehi 56 COMPLETED 0:0 9242418 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242419 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242420 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242424 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9242425 chroot-wr+ regular wehi 6 COMPLETED 0:0 9242426 nix-store+ regular wehi 6 COMPLETED 0:0 9242427 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9242428 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9242429 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242430 slurm-nix+ regular wehi 6 COMPLETED 0:0 9242431 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242432 build-bun+ regular wehi 6 COMPLETED 0:0 9242433 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242434 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242435 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242436 bionix-Ha+ regular wehi 2 COMPLETED 0:0 9242438 bionix-In+ regular wehi 2 COMPLETED 0:0 9242439 bionix-In+ regular wehi 2 COMPLETED 0:0 9242440 bionix-In+ regular wehi 2 COMPLETED 0:0 9242443 bionix-In+ regular wehi 2 
COMPLETED 0:0 9242445 bionix-In+ regular wehi 2 COMPLETED 0:0 9242449 bionix-In+ regular wehi 2 COMPLETED 0:0 9242473 bionix-In+ regular wehi 2 COMPLETED 0:0 9242480 bionix-In+ regular wehi 2 COMPLETED 0:0 9242482 bionix-In+ regular wehi 2 COMPLETED 0:0 9242489 yaml_2.2.+ regular wehi 2 COMPLETED 0:0 9242493 bionix-Co+ regular wehi 2 COMPLETED 0:0 9242495 r-yaml-2.+ regular wehi 56 COMPLETED 0:0 9242496 bionix-In+ regular wehi 2 COMPLETED 0:0 9242497 r-restful+ regular wehi 56 COMPLETED 0:0 9242498 bionix-Ge+ regular wehi 2 COMPLETED 0:0 9242499 zlibbioc_+ regular wehi 2 COMPLETED 0:0 9242505 r-zlibbio+ regular wehi 56 COMPLETED 0:0 9242506 r-Rhtslib+ regular wehi 56 COMPLETED 0:0 9242508 bwrap-wra+ regular wehi 6 COMPLETED 0:0 9242509 chroot-wr+ regular wehi 6 COMPLETED 0:0 9242510 nix-store+ regular wehi 6 COMPLETED 0:0 9242511 nix-wrapp+ regular wehi 6 COMPLETED 0:0 9242512 ssh-wrapp+ regular wehi 6 COMPLETED 0:0 9242513 libcap-st+ regular wehi 6 COMPLETED 0:0 9242514 nix-2.5pr+ regular wehi 6 RUNNING 0:0 9242517 arx-0.3.2 regular wehi 6 COMPLETED 0:0 9242524 r-XVector+ regular wehi 56 COMPLETED 0:0 9242525 bubblewra+ regular wehi 6 COMPLETED 0:0 9242527 r-Biostri+ regular wehi 56 COMPLETED 0:0 9242535 r-Genomic+ regular wehi 56 COMPLETED 0:0 9242536 r-Rsamtoo+ regular wehi 56 COMPLETED 0:0 9242546 r-Summari+ regular wehi 56 COMPLETED 0:0 9242553 r-QDNAseq+ regular wehi 56 COMPLETED 0:0 9244975 r-Genomic+ regular wehi 56 COMPLETED 0:0 9247737 r-rtrackl+ regular wehi 56 COMPLETED 0:0 9247869 r-BSgenom+ regular wehi 56 COMPLETED 0:0 9247873 R-4.1.2-w+ regular wehi 56 PENDING 0:0
Being able to understand the state of the cluster can help you understand why your job might be waiting. Or, you can use the information to your advantage to reduce wait times. To view the state of the cluster, we're going to use the sinfo command.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST interactive up 1-00:00:00 4 mix med-n03,sml-n[01-03] interactive up 1-00:00:00 1 alloc med-n02 interactive up 1-00:00:00 1 idle med-n01 regular* up 2-00:00:00 42 mix lrg-n[02-03],med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24] regular* up 2-00:00:00 13 alloc lrg-n04,med-n[02,06,10-11,14-17,19,24,28],sml-n21 long up 14-00:00:0 40 mix med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24] long up 14-00:00:0 12 alloc med-n[02,06,10-11,14-17,19,24,28],sml-n21 bigmem up 2-00:00:00 3 mix lrg-n02,med-n[03-04] bigmem up 2-00:00:00 1 alloc med-n02 bigmem up 2-00:00:00 1 idle lrg-n01 gpuq up 2-00:00:00 1 mix gpu-p100-n01 gpuq up 2-00:00:00 11 idle gpu-a30-n[01-07],gpu-p100-n[02-05] gpuq_interactive up 12:00:00 1 mix gpu-a10-n01 gpuq_large up 2-00:00:00 3 idle gpu-a100-n[01-03]
-N orders information by node.
sinfo -N | head -n 5
NODELIST NODES PARTITION STATE gpu-a10-n01 1 gpuq_interactive mix gpu-a30-n01 1 gpuq idle gpu-a30-n02 1 gpuq idle gpu-a30-n03 1 gpuq idle
We can add detail with formatting options as well.
| CPU | memory | gres (GPU) | node state | time |
|---|---|---|---|---|
| CPUsState | FreeMem | GresUsed | StateCompact | Time |
| | AllocMem | Gres | | |
| | Memory | | | |
sinfo -NO nodelist:11' ',partition:10' ',cpusstate:13' ',freemem:8' ',memory:8' ',gresused,gres:11,statecompact:8,time | head -n 5
NODELIST PARTITION CPUS(A/I/O/T) FREE_MEM MEMORY GRES_USED GRES STATE TIMELIMIT gpu-a10-n01 gpuq_inter 0/48/0/48 163914 257417 gpu:A10:0(IDX:N/A) gpu:A10:4 idle 12:00:00 gpu-a30-n01 gpuq 0/96/0/96 450325 511362 gpu:A30:0(IDX:N/A) gpu:A30:4 idle 2-00:00:00 gpu-a30-n02 gpuq 0/96/0/96 436435 511362 gpu:A30:0(IDX:N/A) gpu:A30:4 idle 2-00:00:00 gpu-a30-n03 gpuq 0/96/0/96 497816 511362 gpu:A30:0(IDX:N/A) gpu:A30:4 idle 2-00:00:00
Using command-line tools to obtain visibility into how your job is performing.
This section will look at using command-line tools to obtain visibility into how your job is performing.
| type of data | Live | Historical |
|---|---|---|
| good for | debugging, evaluating utilization | debugging, profiling |
| drawbacks | uses system tools, so requires some system understanding | only provides data when jobs are completed |
We will look at:
- htop for live process activity on nodes
- nvidia-smi and nvtop for live GPU activity on nodes
- seff for historical job CPU and memory usage data
- dcgmstats for historical job GPU usage data
- sacct for historical job data

Slurm can't provide accurate "live" data about jobs' activities.
System tools must be used instead. This requires matching jobs to processes on a node with squeue and ssh, as sketched below.
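A minimal sketch of that workflow (the node name here is only an example; use whatever squeue reports for your job):

squeue -u $USER --Format=jobid:10,name:12,nodelist     # find the node(s) your job is running on
ssh sml-n01                                            # connect to one of those nodes
htop -u $USER                                          # inspect your processes on that node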
htop
is a utility often installed on HPC clusters for monitoring processes.
It can be used to look at the CPU, memory, and IO utilization of a running process.
It's not a Slurm tool, but is nevertheless very useful in monitoring jobs' activity and diagnosing issues.
To show only your processes, execute htop -u $USER
htop shows the individual CPU core utilization at the top, followed by memory utilization and some miscellaneous information. The bottom panel shows the process information.
Relevant Headings:
- USER: user that owns the process
- PID: process ID
- %CPU: % of a single core that a process is using, e.g. 400% means the process is using 4 cores
- %MEM: % of the node's total RAM that the process is using
- VSZ: "virtual" memory (bytes) - the memory a process "thinks" it's using
- RSS: "resident" memory (bytes) - the actual physical memory a process is using
- S: "state" of the process
  - D: "uninterruptible" sleep - waiting for something else, often IO
  - R: running
  - S: sleeping
  - T: "traced" or stopped, e.g. by a debugger or manually, i.e. paused
  - Z: "zombie" - the process has completed and is waiting to be cleaned up

F5 toggles a tree view of the processes.
You can add IO information by:
1. pressing F2 (Setup)
2. selecting IO_READ_RATE and pressing enter
3. selecting IO_WRITE_RATE and pressing enter
4. pressing F10 to exit

You should now be able to see read/write rates for processes that you have permissions for.

Tips:
- htop configurations are saved in ~/.config/htop. Delete this folder to reset your htop configuration.
- ps and pidstat are useful alternatives which can be incorporated into scripts.
- htop may not be installed everywhere, in which case top can be used instead.

To monitor the activity of Milton's NVIDIA GPUs, we must rely on NVIDIA's nvidia-smi tool.
nvidia-smi shows information about memory and compute utilization, process allocation, and other details.

nvtop is a command also available on Milton GPU nodes. It works similarly to htop. Note that nvtop is a third-party tool and is less common, whereas nvidia-smi will always be available wherever NVIDIA GPUs are used.
Like htop, nvidia-smi and nvtop only provide information on processes running on a GPU. If your job is occupying an entire node and all its GPUs, it should be straightforward to determine which GPUs you've been allocated. But if your job is sharing a node with other jobs, you might not know straight away which GPU your job has been allocated. You can determine this by using squeue with extra formatting options, as discussed previously.

Note: this tool is available only on GPU nodes where the CUDA drivers are installed, so you must ssh to a gpu node to try it.
Tip: combine nvidia-smi with watch to automatically update the output.
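For example:

watch -n 5 nvidia-smi    # re-runs nvidia-smi every 5 seconds; Ctrl+C to stop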
Slurm tools and plugins are generally easier to use because they provide information on a per-job basis, meaning there's no need to match processes with jobs like previously discussed.
Tip: generally, results are more reliable when executing commands with srun.
The seff command summarizes memory and CPU utilization of a job. It's mainly useful for job steps that have ended.
seff 8665813
Job ID: 8665813 Cluster: milton User/Group: yang.e/allstaff State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 4 CPU Utilized: 00:09:04 CPU Efficiency: 99.27% of 00:09:08 core-walltime Job Wall-clock time: 00:02:17 Memory Utilized: 1.95 GB (estimated maximum) Memory Efficiency: 48.83% of 4.00 GB (1.00 GB/core)
Note: seff results are not as useful for jobs that have failed or been cancelled.
In addition to general job information, sacct can be used to retrieve IO and memory data about past jobs. Like squeue, the default output is limited, but it can be augmented with the --format option.
The following sacct command shows your job data for jobs since the 1st of November (note that the IO and memory values shown will be for the highest-use task):
sacct -S 2022-11-01 -o jobid%14' ',jobname,ncpus%5' ',nodelist,elapsed,state,maxdiskread,maxdiskwrite,maxvmsize,maxrss | head -n5
JobID JobName NCPUS NodeList Elapsed State MaxDiskRead MaxDiskWrite MaxVMSize MaxRSS -------------- ---------- ----- --------------- ---------- ---------- ------------ ------------ ---------- ---------- 8664599 sys/dashb+ 2 sml-n01 1-00:00:22 TIMEOUT 8664599.batch batch 2 sml-n01 1-00:00:23 CANCELLED 102.64M 15.11M 1760920K 99812K 8664599.extern extern 2 sml-n01 1-00:00:22 COMPLETED 0.00M 0 146612K 68K
Slurm breaks jobs into steps. Jobs will have the steps:
- .extern: work done that is not part of the job, i.e. overhead
- .<index>: work done with srun
- .batch: work inside an sbatch script, but not executed by srun
- .interactive: work done inside an interactive salloc session, but not executed by srun

By default, Slurm doesn't have the ability to produce stats on GPU usage.
WEHI's ITS have implemented the dcgmstats NVIDIA Slurm plugin, which can produce these summary stats. To use this plugin, pass the --comment=dcgmstats option to srun, salloc, or sbatch.

If your job requested at least one GPU, an extra output file will be generated in the working directory called dcgm-stats-<jobid>.out. The output file will contain a table for each GPU requested by the job.
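A sketch of how this might be used (my-gpu-job.sh is a hypothetical script name):

sbatch --gres=gpu:1 --comment=dcgmstats my-gpu-job.sh   # request one GPU and DCGM stats
cat dcgm-stats-<jobid>.out                              # after the job ends, in the working directory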
In summary, use:
- htop for CPU, memory, and IO data (requires configuration)
- nvidia-smi for GPU activity
- the seff command for simple CPU and memory utilization data for one job
- the sacct command for memory and IO data for multiple past jobs
- the dcgmstats Slurm plugin for GPU stats for a single Slurm job

Taking advantage of lesser-known options and environment features to make life easier
This section will look at:
- redirecting stdout and stderr files
- sbatch without a script

We're going to start with our simple R script, submitted by a wrapper sbatch script:
- demo-scripts/matmul.rscript
- demo-scripts/submit-matmul.sh
## matmul.rscript
# multiplies two matrices together and prints how long it takes.
print("starting the matmul R script!")
nrows = 1e3
paste0("elem: ", nrows, "*", nrows, " = ", nrows*nrows)
# generating matrices
M <- matrix(rnorm(nrows*nrows),nrow=nrows)
N <- matrix(rnorm(nrows*nrows),nrow=nrows)
# start matmul
start.time <- Sys.time()
invisible(M %*% N)
end.time <- Sys.time()
# Getting final time and writing to stdout
elapsed.time <- difftime(time1=end.time, time2=start.time, units="secs")
print(elapsed.time)
#!/bin/bash
## submit-matmul.sh
# Example sbatch script executing an R script that does a matmul
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
# loading module for R
module load R/openBLAS/4.2.1
Rscript matmul.rscript
Getting notifications about the status of your Slurm jobs removes the need to ssh onto Milton and run squeue to check on your jobs. Instead, Slurm will notify you when your job state has changed, e.g. when it has started or ended.
To enable this behaviour, add the following options to your job scripts:
--mail-user=me@gmail.com
--mail-type=ALL
This sends emails to me@gmail.com when the job state changes.
If you only want to know when your job goes through certain states, e.g. if it fails or is pre-empted, but not when it starts or finishes, pass only those states to --mail-type.
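A sketch (FAIL and REQUEUE are standard --mail-type values; others include BEGIN, END, TIME_LIMIT, and ALL):

#SBATCH --mail-user=me@gmail.com
#SBATCH --mail-type=FAIL,REQUEUE   # email only on failure or requeue (e.g. after pre-emption)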
Exercise: add the --mail-user and --mail-type options to the submit-matmul.sh script.
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev1 - email notifications
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mail-user=yang.e@wehi.edu.au
#SBATCH --mail-type=ALL
# loading module for R
module load R/openBLAS/4.2.1
Rscript matmul.rscript
sbatch without a script

In some cases, one may wish to submit singular commands to the scheduler. srun and salloc can do this, but they need a terminal attached, i.e. if you close the terminal with the srun or salloc session, then the job fails.

The sbatch --wrap option allows you to submit a singular command instead of an entire script. This can be useful for testing, or for calling sbatch inside a script that manages your workflow. Note that sbatch --wrap infers which interpreter to use from your active environment.

The --wrap option could replace submit-matmul.sh with:
sbatch --ntasks=1 --cpus-per-task=2 --mem=8G --wrap="module load R/openBLAS/4.2.1; Rscript matmul.rscript"
stdout and stderr

Linux has two main "channels" to send output messages to. One is "stdout" (standard out), and the other is "stderr" (standard error). If you have ever used the |, >, or >> shell scripting features, then you've redirected stdout somewhere else, e.g. to another command, a file, or the void (/dev/null).
$ ls dir-that-doesnt-exist
ls: cannot access dir-that-doesnt-exist: No such file or directory # this is a stderr output!
$ ls ~
bin cache Desktop Downloads ... # this is a stdout output!
stderr and stdout

By default:
- stdout is directed to slurm-<jobid>.out in the job's working directory
- stderr is directed to wherever stdout is directed to

Redirect stderr and stdout with the --error and --output options. They work with both relative and absolute paths, e.g.

--error=/dev/null
--output=path/to/output.out

where relative paths are resolved relative to the job's working directory.

Variables can be used, like:
- %j: job ID
- %x: job name
- %u: username
- %t: task ID, i.e. a separate file per task
- %N: node name, i.e. a separate file per node in the job

#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev2 - added --output and --error options
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mail-user=yang.e@wehi.edu.au
#SBATCH --mail-type=ALL
#SBATCH --output=logs/matmul-%j.out
#SBATCH --error=logs-debug/matmul-%j.err
# loading module for R
module load R/openBLAS/4.2.1
Rscript matmul.rscript
Slurm allows for submitted jobs to wait for another job to start or finish before beginning. While probably not as effective as workflow managers like Nextflow, Slurm's job dependencies can still be useful for simple workflows.
Make a job dependent on another by passing the --dependency option with one of the following values:
- afterok:jobid1:jobid2... waits for jobid1, jobid2, ... to complete successfully
- afternotok:jobid1:... waits for the listed jobs to fail, timeout, or be cancelled
- afterany:jobid1:... waits for the listed jobs to finish (fail, complete, or be cancelled)
- after:jobid1:... waits for the listed jobs to start or be cancelled

e.g. --dependency=afterok:12345678 will make the job wait for job 12345678 to complete successfully before starting.
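As a sketch, chaining two hypothetical scripts so the second only runs if the first succeeds (align.sh and postprocess.sh are placeholder names):

jobid=$(sbatch --parsable align.sh)                   # --parsable prints just the job ID
sbatch --dependency=afterok:${jobid} postprocess.sh   # starts only if the first job completes OK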
Recursive jobs are one way to work with short QOS time limits.
Multiple Slurm jobs are submitted with a sequential dependency pattern, i.e. the second job depends on the first, the third job depends on the second, and so on...
Slurm script:
cat demo-scripts/restartable-job.sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1
sleep 10
# cell to run recursive script
SCRIPT=demo-scripts/recursive-job.sh
# Initiate the loop
prereq_jobid=$(sbatch --parsable $SCRIPT)
echo $prereq_jobid
# Create 5 more dependent jobs with a loop
for i in {1..5}; do
prereq_jobid=$(sbatch --parsable --dependency=afterany:$prereq_jobid $SCRIPT)
echo $prereq_jobid
done
squeue -u $USER
8703619 8703620 8703621 8703622 8703623 8703624 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 8703594 gpuq_larg interact yang.e R 21:18 1 gpu-a100-n01 8701908 interacti sys/dash yang.e R 2:07:28 1 sml-n01 8703624 regular test-rec yang.e PD 0:00 1 (Dependency) 8703623 regular test-rec yang.e PD 0:00 1 (Dependency) 8703622 regular test-rec yang.e PD 0:00 1 (Dependency) 8703621 regular test-rec yang.e PD 0:00 1 (Dependency) 8703620 regular test-rec yang.e PD 0:00 1 (Dependency) 8703619 regular test-rec yang.e R 0:00 1 sml-n05 8703616 regular test-rec yang.e R 0:22 1 sml-n02
- prereq_jobid=$(sbatch --parsable $SCRIPT): the --parsable option makes sbatch print just the job ID, which is stored in the prereq_jobid variable
- for i in {1..5}; do: a for loop that loops through 1 to 5, where i is the looping variable
- prereq_jobid=$(sbatch --parsable --dependency=afterok:${prereq_jobid} demo-scripts/recursive-job.sh): the --dependency=afterok:${prereq_jobid} option links jobs (afterany may be preferred instead of afterok), and the new job ID overwrites the prereq_jobid variable

Instead of submitting all the jobs ahead of time, you can have a single Slurm script that submits itself until all the work is done (or it fails).
#!/bin/bash
## recursive-job.sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4G
#SBATCH --time=2-
#SBATCH --output=output-%j.log
#SBATCH --error=output-%j.log
#SBATCH --mail-user=me.m@wehi.edu.au
#SBATCH --mail-type=END,FAIL
# Submitting a new job that depends on this one
sbatch --dependency=afternotok:${SLURM_JOBID} recursive-job.sh
# srunning the command
srun flye [flags] --resume
This job:
- afternotok means the dependent job will only start if the current job doesn't complete successfully
- the flye command is expected to run for as long as it can, up to the 2 day wall time
- mail-type=END,FAIL sends an email when the job either completes or fails

By default, when you submit a Slurm job, Slurm copies all the environment variables in your environment and adds some extra for the job to use.
export VAR1="here is some text"
cat demo-scripts/env-vars1.sbatch
#!/bin/bash
echo $VAR1
sbatch demo-scripts/env-vars1.sbatch
Submitted batch job 8681656
cat slurm-8681656.out
here is some text
Note: For reproducibility reasons, a Slurm script that relies on environment variables can be submitted inside a wrapper script which first exports the relevant variable.
sbatch also provides the --export option, which allows you to set specific values.
echo $VAR1
here is some text
sbatch --export=VAR1="this is some different text" demo-scripts/env-vars1.sbatch
Submitted batch job 8681761
cat slurm-8681761.out
this is some different text
This feature is especially useful when submitting jobs inside wrapper scripts.
You can also use the --export-file option to specify a file with a list of VAR=value pairs that you wish the script to use.
cat demo-scripts/env-vars2.sbatch
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
echo I am running on ${SLURM_NODELIST}
echo with ${SLURM_NTASKS} tasks
echo and ${SLURM_CPUS_PER_TASK} CPUs per task
sbatch demo-scripts/env-vars2.sbatch
Submitted batch job 8681710
cat slurm-8681710.out
I am running on sml-n03 with 1 tasks and 2 CPUs per task
These Slurm environment variables make it easy to supply parallelisation parameters to a program, e.g. specifying the number of threads, as sketched below.

Tip: scripts/programs executed by srun will have a SLURM_PROCID environment variable distinguishing Slurm tasks (MPI-like programming).
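A sketch of using these variables inside a job script to size a threaded program (the tool name and its --threads flag are hypothetical; substitute your own program):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}       # many OpenMP-based tools read this
my_threaded_tool --threads ${SLURM_CPUS_PER_TASK}   # hypothetical program and flag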
Typically, scripts submitted by sbatch use the bash or sh interpreter (e.g. #!/bin/bash), but it may be more convenient to use a different interpreter.
You can do this by changing the "hash bang" statement at the top of the script. To demonstrate this, we can take our original R matmul script, and add a "hash bang" statement to the top.
#!/usr/bin/env Rscript
## matmul.rscript
print("starting the matmul R script!")
nrows = 1e3
...
The statement above looks for Rscript in your current environment. This only works because Slurm copies your environment when a Slurm script is submitted.
python works similarly: replace Rscript in the hash bang statement with python.
Alternatively, you can specify the absolute path to the interpreter.
e.g. #!/stornext/System/data/apps/R/openBLAS/R-4.2.1/lib64/R/bin/Rscript
Tip: you can use --export=R_LIBS_USER=... to point Rscript to your libraries (or PYTHONPATH for Python).
R example:
slurmtasks <- Sys.getenv("SLURM_NTASKS")
Python example:
import os
slurmtasks = os.getenv('SLURM_NTASKS')
Exercise: add a statement to the matmul R script that prints "using <ntasks> tasks and <cpus-per-task> CPUs per task", using the Slurm environment variables.
The output should look something like:
[1] "starting the matmul R script!"
[1] "using 1 tasks and 2 CPUs per task"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06340098 secs
Making life easier with job arrays
Embarrassingly parallel computation is computation that can occur in parallel with minimal coordination. This type of parallel computation is very common.
Examples are parameter scans, genomic sequencing, basecalling, folding@home ...
Embarrassingly parallel problems are facilitated in Slurm by "array jobs". Array jobs allow you to use a single script to submit multiple jobs with similar functionality.
The main benefits to using an array job are:
Array jobs are created by adding the --array=start-end option. Slurm jobs, AKA "tasks", will be created with indices between start and end, e.g. --array=1-10 will create tasks with indices 1, 2, ..., 10.

start and end values can be between 0 and 1000 (inclusive). Note this is site specific.

Singular values or discrete lists can also be specified, e.g. --array=1 or --array=1,3,5,7-10.
#!/usr/bin/env Rscript
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --array=1-10
## matmul.rscript
print("starting the matmul R script!")
paste("using", Sys.getenv("SLURM_NTASKS"), "tasks")
...
Slurm augments the default output behaviour of array jobs automatically. If no --output option is provided, an array job will produce an output file slurm-<jobid>_<arrayindex>.out for each index in the array.
If you specify --output and --error, then you can use the %A and %a variables, which represent the array's master job ID and the array task index, respectively.
e.g.
#!/usr/bin/env Rscript
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --array=1-10
#SBATCH --output=Rmatmul-times-%A-%a.out
#SBATCH --error=Rmatmul-times-%A-%a.err
## matmul.rscript
...
Each task in the array can make use of its index to enable parallelism, by making use of the SLURM_ARRAY_TASK_ID environment variable.

Other environment variables are accessible:
- SLURM_ARRAY_JOB_ID: the job ID of the entire job array
- SLURM_ARRAY_TASK_COUNT: the number of tasks in the array
- SLURM_ARRAY_TASK_MAX: the largest ID of tasks in the array
- SLURM_ARRAY_TASK_MIN: the smallest ID of tasks in the array

Exercise: add a paste statement to the matmul R script that prints the task ID.
The output of each job task should look something like:
[1] "starting the matmul R script!"
[1] "using 1 tasks and 2 CPUs per task"
[1] "I am job task 1 in an array of 10!"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06340098 secs
Exercise: change the nrows variable to equal 10*taskID. Hint: you will need the strtoi function.
Your output from job task 1 should look like:
[1] "starting the matmul R script!"
[1] "using 1 tasks and 2 CPUs per task"
[1] "I am job task 1 in an array of 10!"
[1] "elem: 10*10 = 100"
Time difference of 0.06340098 secs
For workflows requiring input files or parameters, there are multiple ways you can use job arrays (one common pattern is sketched below).
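A sketch of one such pattern: keep one input path per line in a text file and let each array task pick its own line (samples.txt and my_tool are hypothetical names):

#SBATCH --array=1-10
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)   # line N for array task N
my_tool --input "${INPUT}"                              # hypothetical command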
What you can't do:
What you can do:
- depend on an entire array job: --dependency=afterok:<jobid>
- depend on a single array task: --dependency=afterok:<jobid>_<taskid>
--mail-type=ALL will send notifications only for the entire job array (not for each job task). Passing ARRAY_TASKS will send emails for each array task, e.g. --mail-type=BEGIN,ARRAY_TASKS will send an email every time a job array task starts.
Thanks for attending WEHI's first intermediate Slurm workshop!
Please fill out our feedback form:
https://forms.office.com/r/rKku8yqR57
We use these forms to help decide which workshops to run in the future and improve our current workshops!
Contact us at research.computing@wehi.edu.au for any kind of help related to computing and research!