Evaluating Jobs
Last updated on 2023-05-16
Overview
Questions
- How do you evaluate a completed job?
- How do you set up event notifications for your jobs?
Objectives
- Explain Slurm environment variables.
- Demonstrate how to evaluate jobs and make use of multithreading options.
Evaluating your Job
After a job has completed, you should check whether it ran successfully, evaluate how efficiently it used its resources, and, if it failed, investigate why.
The seff command provides a summary of any job.
The job completes quickly but not successfully
OUTPUT
Job ID: 11793501
Cluster: milton
User/Group: iskander.j/allstaff
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:01
CPU Efficiency: 50.00% of 00:00:02 core-walltime
Job Wall-clock time: 00:00:01
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 20.00 MB (10.00 MB/core)
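A summary like the one above can also be checked from a script. The following is a minimal sketch, assuming the seff output keeps the "Label: value" layout shown here. The helper name seff_field is hypothetical and not part of Slurm, and the sample text is a subset of the output above so that the sketch runs without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: pull single fields out of a seff summary.
# seff_field is a hypothetical helper, not a Slurm command.
seff_field() {
    # $1 = field label (the text before ": "); the summary is read from stdin
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# Sample lines copied from the output above. On a cluster you would pipe
# the real command instead, e.g.:  seff 11793501 | seff_field "State"
summary='Job ID: 11793501
State: OUT_OF_MEMORY (exit code 0)
CPU Efficiency: 50.00% of 00:00:02 core-walltime
Memory Efficiency: 0.00% of 20.00 MB (10.00 MB/core)'

state=$(printf '%s\n' "$summary" | seff_field "State")
cpu_eff=$(printf '%s\n' "$summary" | seff_field "CPU Efficiency")

echo "State: $state"
echo "CPU efficiency: $cpu_eff"
```

Reading the State field first is usually the quickest way to tell a successful run from a failed one before looking at the efficiency figures.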
Checking the job's output file also shows:
OUTPUT
.........................<other output>
slurmstepd: error: Detected 1 oom_kill event in StepId=11793501.batch. Some of the step tasks have been OOM Killed.
This shows that the job was “OOM Killed”. OOM stands for Out Of Memory, meaning the memory requested was not enough. Increase the memory request and try again until the job finishes successfully.
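The check for an OOM kill can be scripted. This is a minimal sketch, assuming the job's output file contains the slurmstepd message shown above. The helper name was_oom_killed is hypothetical, and a temporary file stands in for the real output file so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch: detect an OOM kill in a job's output file.
# was_oom_killed is a hypothetical helper, not a Slurm command.
was_oom_killed() {
    # $1 = path to the job's output file
    grep -q 'oom_kill' "$1"
}

# Demo: a temporary file holding the error line shown above.
log=$(mktemp)
echo 'slurmstepd: error: Detected 1 oom_kill event in StepId=11793501.batch. Some of the step tasks have been OOM Killed.' > "$log"

if was_oom_killed "$log"; then
    result="out of memory"
    # On a cluster, resubmit with a larger memory request, for example:
    #   sbatch --mem=<larger value> <your script>
else
    result="no OOM event found"
fi
echo "$result"
rm -f "$log"
```

The value passed to --mem is left as a placeholder because the right amount depends on the job. A common approach is to raise it in steps and rerun seff after each attempt.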
Slurm Environment Variables
Slurm passes information about the running job, e.g. its working directory or the nodes allocated to it, to the job via environment variables. In addition to being available to your job, these variables are also used by programs to set options, such as the number of threads to run, based on the CPUs available.
The following is a list of commonly used variables that are set by Slurm for each job:
- $SLURM_JOBID: Job ID
- $SLURM_SUBMIT_DIR: Submission directory
- $SLURM_SUBMIT_HOST: Host submitted from
- $SLURM_JOB_NODELIST: List of nodes where cores are allocated
- $SLURM_CPUS_PER_TASK: Number of cores allocated per task
- $SLURM_NTASKS: Number of tasks assigned to the job
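To see what these variables contain, a job script can print them at startup. The sketch below assumes it runs as part of a batch script. The function name describe_job is hypothetical, and the "unset" fallback is only there so the sketch also runs outside a Slurm job.

```shell
#!/usr/bin/env bash
# Sketch: report the Slurm variables listed above.
# Outside a Slurm job these variables are not set, so each falls back to "unset".
describe_job() {
    echo "Job ID:           ${SLURM_JOBID:-unset}"
    echo "Submit directory: ${SLURM_SUBMIT_DIR:-unset}"
    echo "Submit host:      ${SLURM_SUBMIT_HOST:-unset}"
    echo "Node list:        ${SLURM_JOB_NODELIST:-unset}"
    echo "CPUs per task:    ${SLURM_CPUS_PER_TASK:-unset}"
    echo "Number of tasks:  ${SLURM_NTASKS:-unset}"
}

describe_job
```

Printing these values at the top of a job's output file makes it easier to reconstruct later where and how the job ran.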
Use $SLURM_CPUS_PER_TASK with the -p option instead of setting a number.
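As a minimal sketch of this pattern: the program name mytool below is hypothetical and stands in for any tool that takes its thread count through a -p option. The fallback to 1 is an assumption added here, since Slurm only sets $SLURM_CPUS_PER_TASK when the job requests --cpus-per-task.

```shell
#!/usr/bin/env bash
# Sketch: take the thread count from Slurm instead of hard-coding it.
# thread_count is a hypothetical helper. It falls back to 1 when the
# variable is unset, e.g. outside a job or without --cpus-per-task.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads=$(thread_count)

# "mytool" is a hypothetical program with a -p threads option.
# The command is only printed here, not executed.
echo "Would run: mytool -p $threads"
```

Because the thread count follows the allocation, changing the CPUs requested for the job does not require editing the command line as well.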