Introducing Slurm

Last updated on 2023-05-16

Overview

Questions

  • What is a scheduler and why does a cluster need one?
  • What is a partition?

Objectives

  • Explain what a scheduler is
  • Explain how Milton’s Slurm works
  • Identify the essential options to set for a job

Job Scheduler


An HPC system might have thousands of nodes and thousands of users. A scheduler is a special piece of software that decides which jobs run where and when. It also ensures that a task is run with the resources it requested.

The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, you will understand why your jobs sometimes do not start instantly, as they would on your laptop.

Compare a job scheduler to a waiter in a restaurant

Milton uses a scheduler (batch system) called Slurm. WEHI has 90 nodes with 3500 physical cores, 44 TB of memory, and 58 GPUs accessible through Slurm.

The user describes the work to be done and resources required in a script or at the command line, then submits the script to the batch system. The work is scheduled when resources are available and consistent with policy set by administrators.
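
In practice this workflow looks something like the sketch below, where myjob.sh stands for a hypothetical job script (writing one is covered at the end of this episode and in the next):

# Submit the job script to Slurm; it runs when resources become available
sbatch myjob.sh

# Check the state of your queued and running jobs
squeue -u $USER

# Cancel a job you no longer need, using the job ID reported by sbatch or squeue
scancel 123456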

Slurm


Simple Linux Utility for Resource Management

Slurm development has been a joint effort of many companies and organizations around the world, with over 200 individual contributors. Its development is led by SchedMD, whose developers and support staff maintain the canonical Slurm releases and carry out the majority of the development work for each new release. Slurm's design is very modular, with about 100 optional plugins. It is used at Spartan, Massive, Pawsey, Peter Mac and Milton, as well as at HPC facilities worldwide.

Slurm architecture diagram

Fair Share

A cluster is a shared environment and when there is more work than resources available, there needs to be a mechanism to resolve contention. Policies ensure that everyone has a “fair share” of the resources.

Milton uses a multifactor job priority policy, in which nine factors influence a job's priority.

It is set such that:

  • age: the longer a job waits in the queue, the higher its priority becomes.
  • job size: the more resources a job requests (CPUs, GPUs, and/or memory), the higher its priority.
  • fair-share: the difference between the share of the computing resources you have been promised and the amount of resources you have already consumed, i.e. the more resources your jobs have already used, the lower the priority of your next jobs.

In addition, no single user can have more than 8% of total CPUs or memory, which is 450 CPUs and 3TB memory.
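
You can inspect how these factors combine for your own jobs. A sketch using standard Slurm commands (the exact columns shown depend on how the site configures the priority plugin):

# Show the priority factors (age, fair-share, job size, ...) for your pending jobs
sprio -u $USER

# Show your fair-share usage relative to the share you have been allocated
sshare -u $USER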

Backfilling

Milton uses a backfilling algorithm to improve system utilisation and maximise job throughput.

When resource-intensive jobs are running, gaps can appear in the resource allocation. Backfilling makes a best effort to slot smaller jobs into these gaps, provided they do not delay the expected start of higher-priority jobs.

For example, suppose an 8-core job is running on an 8-core node, and then a 4-core job, an 8-core job, and another 4-core job are submitted in that order. With backfilling, the two 4-core jobs can run before the second 8-core job.

As a more detailed example, suppose we have an 8-core node and receive:

  • Job 1 requests 4 cores and a 5-hour time limit
  • Job 2 requests 8 cores and a 2-hour time limit
  • Job 3 requests 4 cores and a 4-hour time limit

Without backfilling, Job 2 would block the queue and Job 3 would have to wait until Job 2 completes. With backfilling, once Job 1 has been allocated and Job 2 is pending for resources, Slurm looks through the queue for jobs that are small enough to fit in the idle CPUs. In this example, Job 3 starts before Job 2, "back-filling" the 4 CPUs that will be available for 5 hours while Job 1 is running. Because Job 3's 4-hour limit fits within those 5 hours, it finishes before Job 2 could have started and so does not delay it.
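
A sketch of how these three jobs might be submitted, using hypothetical scripts job1.sh to job3.sh; squeue --start reports Slurm's estimated start times for pending jobs:

# Submit the three jobs with explicit core counts and time limits
sbatch --cpus-per-task=4 --time=05:00:00 job1.sh
sbatch --cpus-per-task=8 --time=02:00:00 job2.sh
sbatch --cpus-per-task=4 --time=04:00:00 job3.sh

# Estimated start times for pending jobs; with backfilling,
# Job 3 may be scheduled to start before Job 2
squeue -u $USER --start

Requesting a realistic time limit, rather than a generous overestimate, makes it easier for Slurm to backfill your jobs.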

Backfilling Algorithm

Slurm Partitions


Partitions in Slurm group nodes into logical (possibly overlapping) sets. A partition configuration defines job limits or access controls for a group of nodes. Slurm allocates resources to jobs within the selected partition by taking into consideration the resources you request for your job and the partition’s available resources and restrictions.

Partition    Purpose                                     Max submitted jobs/user  Max CPUs/user  Max mem (GB)/user  Max wall time/job  Max GPUs/user
interactive  interactive jobs                            1                        16             64                 24 hours           0
regular      most of the batch work                      5000                     454            3000               48 hours           0
long         long-running jobs                                                    96             500                14 days            0
gpuq         jobs that require GPUs                                               192            998                48 hours           8 GPUs on 2 nodes
gpuq_large   jobs that require A100 GPUs                                          96             1000               48 hours           1 A100
bigmem       jobs that require large amounts of memory   500                      128            1400               48 hours           0
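
You can query the partitions available to you directly from the command line (a sketch; the partition names follow the table above):

# Summarise partitions and the state of their nodes
sinfo -s

# Show one partition, e.g. gpuq, in more detail
sinfo --partition=gpuq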

The main parameters to set for any job script


  • time: the maximum wall time for the job
  • cpus: the number of CPUs required
  • partition: the partition in which your job is placed
  • memory: the amount of physical memory required
  • special resources, such as GPUs

We will discuss this more in the next episode.
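
As a preview, here is a minimal sketch of a job script that sets these options; the partition and resource values are examples only, and my_analysis.sh stands in for your own program:

#!/bin/bash
#SBATCH --job-name=example        # a name for the job
#SBATCH --partition=regular       # the partition the job is placed in
#SBATCH --time=01:00:00           # maximum wall time (1 hour)
#SBATCH --cpus-per-task=4         # number of CPUs
#SBATCH --mem=8G                  # amount of physical memory
# For GPU jobs you would also request special resources, for example:
# #SBATCH --gres=gpu:1

# The commands to run go here
./my_analysis.sh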

Key Points

  • The scheduler handles how compute resources are shared between users.
  • A job is just a shell script.
  • Request slightly more resources than you will need.
  • Backfilling improves system utilisation and maximises job throughput. You can take advantage of backfilling by requesting only what you need.
  • Milton's Slurm has multiple partitions with different specifications that fit different types of jobs.