Organising dependent Slurm jobs
Last updated on 2024-05-07 | Edit this page
Overview
Questions
- How can I organise jobs that depend on each other?
Objectives
- Know how to use the
--dependency
sbatch
option
A pipeline to visualize pi-cpu
results
Inside the example-programs, there is the folder
job-depends
which contains files used in this episode.
In the previous episode, you learnt how to create job arrays to
investigate how the accuracy of pi-cpu
changes with the
number of trials used. You now want to present these results to your
supervisor graphically, but you still need to aggregate and process the
results further to create something presentable. You decide that you
want to show a bar chart where the X-axis corresponds to the number of
trials, and the Y-axis is the average absolute error calculated
determined from multiple calculations of \(\pi\).
You want to test 7 different number of trials: 1E2 (100), 1E3
(1,000), … , up to 1E8 (100,000,000). These values are in
pi-process-input.txt
. These values are used inside
pi-submit.sh
, which is a script in
example-programs
that is a Slurm job array script like in
the job array episode.
You know that you need to write a bit more code to complete the pipeline:
- For each number of trials, the results from
pi-cpu
need to be averaged. - The individual averaged results need to be aggregated into a single file.
A diagram of the process looks like:
n = 1E2 1E3 ... 1E7 1E8
| | | |
| | | |
pi-cpu pi-cpu pi-cpu pi-cpu
| | | |
| | | |
avg avg avg avg
| | | |
|______|___combine____|_______|
results
|
|
final
results
You decide that you want to do this with Slurm jobs because each step requires different amounts of resources. But to set this up, you need to make use of:
Slurm job dependencies
To setup Slurm jobs that are dependent on each other, you can make
use of the Slurm --dependency
, or -d
option.
The syntax of the option is:
This will ask Slurm to ensure that the job that you’re submitting
doesn’t start until condition
is satisfied for job
jobid
. For example,
is requesting for the job you’re submitting to start, only if job
1234
completes successfully. This allows you to chain a
series of jobs together, perhaps with different resource requirements,
without needing to monitor progress.
Other conditions include:
-
after
: start this job after the specified job starts -
afternotok
: start this job only if the specified job fails -
afterany
: start this job after the specified job fails, succeeds, or is cancelled -
singleton
: Start this job if there are no other running jobs with the same name. -
aftercorr
: Start this array task when the corresponding array task in the specified job completes successfully.
In your \(\pi\) calculation
scenario, submitting the second lot of array jobs (i.e., the
avg
jobs) can be submitted using
--dependency=afterok:<jobid>
. However, this will mean
that none of the avg
job array tasks will start
until all of the pi-cpu
jobs finish. This results
in lower job throughput as tasks wait around.
We can achieve better job throughput by making use of the
aftercorr
condition (short for “after corresponding”),
which tells Slurm that each task in the second job can start once the
same task in the first job completes successfully.
Setting up the avg
array job
Each pi-cpu
job output will have multiple attempts to
calculate \(\pi\). So, the goal of each
avg
job array task is to calculate the average error.
pi-avg.py
in the example-programs/job-depends
directory, is a script made specifically for this purpose. For example,
if you execute pi-cpu
and redirect the output to a
file:
and then run
You should get output similar to:
OUTPUT
Average Absolute Error: 0.00016350984148000002
This program can be placed into a Slurm script like so:
BASH
#!/bin/bash
# pi-avg.sh
#SBATCH --job-name=pi-avg
#SBATCH --array=0-6
#SBATCH --output=%x-%a.out
#SBATCH --cpus-per-task=1
id=$SLURM_ARRAY_TASK_ID
python3 pi-avg.py pi-submit-${id}.out
The pi-avg.sh
script line-by-line
is the hash-bang statement. Required by Slurm, but good practice to include in any script.
is a comment with the name of the script.
is a Slurm option specifying the job name. In this case, it is
pi-avg
.
is telling Slurm that this job is a job array with task IDs 0 to 6 (inclusive).
is telling Slurm where to write the output of the script to.
%x
is a placeholder for the job name, and %a
is a placeholder for the job task array. For example, the job task with
ID = 0, will have the output file: pi-avg-0.out
.
is specifying the number of CPUs each job array task needs.
sets the id
variable to the value stored in the
SLURM_ARRAY_TASK_ID
environment variable.
is running the pi-avg.py
program on the file
pi-submit-${id}.out
, which would be the output from the
pi-submit.sh
job tasks.
Notably, this script doesn’t have the --dependency
option set, but this will be discussed later.
Submitting the jobs
Before we submit pi-submit.sh
, we need to make sure that
pi-cpu
is in the correct place. pi-submit.sh
assumes that pi-cpu
is in the working directory. So, if
you’re in the job-depends
directory, you must first
cp ../pi-cpu .
, which makes a copy of pi-cpu
in the job-depends
directory.
Now that we have pi-submit.sh
and pi-avg.sh
we can try submitting them such that pi-avg.sh
job tasks
depend on pi-submit.sh
job tasks. We can start this process
by
OUTPUT
Submitted batch job 12271504
And then submit the next job using the
--dependency=aftercorr:<jobid>
flag:
OUTPUT
Submitted batch job 12271511
Reminder: -d
is the short-form of
--dependency
! Depending on how fast you were with
submitting the jobs and setting up the dependency, when you run
squeue -u $USER
, you might see output similar to:
OUTPUT
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12271511_6 regular pi-avg yang.e PD 0:00 1 (Dependency)
12271511_5 regular pi-avg yang.e PD 0:00 1 (None)
12271511_4 regular pi-avg yang.e PD 0:00 1 (None)
12271511_3 regular pi-avg yang.e PD 0:00 1 (None)
12271511_2 regular pi-avg yang.e PD 0:00 1 (None)
12271511_1 regular pi-avg yang.e PD 0:00 1 (None)
12271511_0 regular pi-avg yang.e PD 0:00 1 (None)
12271504_6 regular pi-submi yang.e R 0:12 1 sml-n02
The pi-submit
job (job ID of 12271504) only has task ID
6 running and all the pi-avg
jobs are waiting. To confirm
that the other pi-submit
tasks have completed, you can use
the command sacct -Xj <pi-submit jobid>
e.g.:
OUTPUT
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12271504_0 pi-submit regular wehi 2 COMPLETED 0:0
12271504_1 pi-submit regular wehi 2 COMPLETED 0:0
12271504_2 pi-submit regular wehi 2 COMPLETED 0:0
12271504_3 pi-submit regular wehi 2 COMPLETED 0:0
12271504_4 pi-submit regular wehi 2 COMPLETED 0:0
12271504_5 pi-submit regular wehi 2 COMPLETED 0:0
12271504_6 pi-submit regular wehi 2 RUNNING 0:0
After a bit of time, you should see the pi-avg
job tasks
begin to go through (should be quite quickly as the job is quite
short)
OUTPUT
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12271511_3 regular pi-avg yang.e CG 0:00 1 sml-n02
12271511_4 regular pi-avg yang.e CG 0:00 1 sml-n02
12271511_5 regular pi-avg yang.e CG 0:00 1 sml-n02
Depending on how fast you were submitting the jobs, you likely weren’t able to see all the jobs wait in the queue. However, later we’ll develop a “driver” script which will automate setting up the dependencies.
Setting up the finalcombine
job
Now that you’ve learnt about how to use the
--dependency/-d
option, try and setup the final job that
combines the results from pi-avg.sh
! Call it
pi-combine.sh
Reminder: this job should be a single job, that takes the results of
all the pi-avg.sh
and organises the results into a single
file.
Your output should look something like
OUTPUT
100 Average Absolute Error: 0.18431854820252003
1000 Average Absolute Error: 0.034800000000000005
10000 Average Absolute Error: 0.014562903594960003
100000 Average Absolute Error: 0.00271745179748
1000000 Average Absolute Error: 0.0009526517974799999
10000000 Average Absolute Error: 0.00037933179748
100000000 Average Absolute Error: 0.00014778020251999998
HINT1: use pi-process-input.txt
to get the data for the
first column.
HINT2: you will probably need to use: readarray
, a
for
loop, and cat
.
Script could look like:
BASH
#!/bin/bash
# pi-combine.sh
#SBATCH --job-name=pi-combine
#SBATCH --output=%x.out
# read ntrial data into niterations bash array
readarray -t niterations < pi-process-input.txt
for i in {0..6}
do
# get label from niterations array
label=${niterations[$i]}
# get average error from pi-avg-${i}.out
data=$(cat pi-avg-${i}.out)
# print labal with data
echo "$label $data"
done
You should now have pi-submit.sh
,
pi-avg.sh
, and pi-combine.sh
Slurm scripts
which you can now combine into a pipeline! We’ll do this with one more
script, which will be referred to as the “driver” script as it is
“driving” the pipeline. This driver script will submit all the jobs and
set the dependencies.
This script could look like:
BASH
#!/bin/bash
# pi-driver.sh
# submit the first pi-submit job
# saves the job ID into the jobid1 variable
jobid1=$(sbatch --parsable pi-submit.sh)
# submit the second pi-avg job.
# due to aftercorr condition, each task
# depends on the same task from pi-submit.sh.
# saves the job ID into the jobid2 variable
jobid2=$(sbatch --parsable --dependency=aftercorr:$jobid1 pi-avg.sh)
# submit the last pi-combine job
# this job waits for all tasks from jobid2
# to complete successfully (afterok).
# job ID is not saved as it is not needed.
sbatch --dependency=afterok:$jobid2 pi-combine.sh
The driver script makes use of the --parsable
option
that can be used with sbatch
. This makes
sbatch
return the job ID only, instead of the
statement Submitted batch job <job ID>
. This enables
“parsing” the job ID. In this case, the job IDs are saved into
jobidN
bash variables.
The driver script does not need to be submitted, it can be run as a normal script:
OUTPUT
Submitted batch job 12271538
When exeucting the driver script, you should only see one
Submitted batch job
statement printed to the terminal, as
the first two jobs are submitted with the --parsable
option
and their Job IDs being saved to variables, instead of being printed to
the terminal.
If you run squeue -u $USER
immediately after
the completion of the script, you should see your jobs pending in the
queue, similar to:
OUTPUT
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12272749 regular pi-combi yang.e PD 0:00 1 (Dependency)
12272748_[0-6] regular pi-avg yang.e PD 0:00 1 (None)
12272747_[0-6] regular pi-submi yang.e PD 0:00 1 (None)
and as you continue checking the queue, you should be able to see Slurm processing your jobs:
OUTPUT
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12272748_6 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_5 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_4 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_3 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_2 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_1 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272749 regular pi-combi yang.e PD 0:00 1 (Dependency)
12272748_0 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272747_[0-6] regular pi-submi yang.e PD 0:00 1 (None)
This shows Slurm “unrolling” the pi-avg
job tasks,
and
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12272748_6 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_5 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_4 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_3 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_2 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272748_1 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272749 regular pi-combi yang.e PD 0:00 1 (Dependency)
12272748_0 regular pi-avg yang.e PD 0:00 1 (Dependency)
12272747_0 regular pi-submi yang.e R 0:01 1 sml-n10
12272747_1 regular pi-submi yang.e R 0:01 1 sml-n10
12272747_2 regular pi-submi yang.e R 0:01 1 sml-n10
12272747_3 regular pi-submi yang.e R 0:01 1 sml-n10
12272747_4 regular pi-submi yang.e R 0:01 1 sml-n10
12272747_5 regular pi-submi yang.e R 0:01 1 sml-n07
12272747_6 regular pi-submi yang.e R 0:01 1 sml-n07
shows the pi-submit
jobs running and the
pi-avg
jobs waiting.
If all goes well, you should have a pi-combine.out
(or
similar) file with your outputs!
100 Average Absolute Error: 0.156
1000 Average Absolute Error: 0.05336290359496
10000 Average Absolute Error: 0.00736
100000 Average Absolute Error: 0.00457345179748
1000000 Average Absolute Error: 0.00152585179748
10000000 Average Absolute Error: 0.00048161179748
100000000 Average Absolute Error: 0.00013798799999999997
You can then plot this in Excel or whichever tool you prefer!