Tool Parameteriser

A tool that makes benchmarking and parameterising HPC tools easier and more streamlined.

When running a new tool on an HPC system, researchers find it challenging to choose the resources required for the job, and to understand how those requirements scale with the size of the input dataset. This tool helps researchers answer these kinds of questions, and also helps Research Computing teams recommend the best combination of resources for a specific tool.

Install

Requires Python 3.11

pip install git+ssh://git@github.com/WEHI-ResearchComputing/ToolParametriser.git

Example configuration files are located in the examples directory of this repo.

To run a test

toolparameteriser -c <configfile> -R run

Test config file examples are found in examples
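
For instance, to run the BWA benchmark described later in this README (assuming you run from the repo root):

toolparameteriser -c examples/configBWA.toml -R run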

Extra run options

$ toolparameteriser --help
usage: toolparameteriser [-h] -c path [-D] -R str [-d]

Run/analyse a tool test

options:
  -h, --help            show this help message and exit
  -c path, --config_path path
                        the path to configuration file
  -D, --dryrun          if present jobs will not run
  -R str, --runtype str
                        can be either [run, analyse]
  -d, --debug           Sets logging level to Debug
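
For example, to check what a test would submit without actually running any jobs, the dry-run and debug flags can be combined:

toolparameteriser -c examples/configBWA.toml -R run -D -d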

To collect test results

Another config file needs to be created with the [output] table and the jobs_details_path and results_file keys. For example, configAnalysis.toml in the examples directory:

[output]
jobs_details_path = "/vast/scratch/users/iskander.j/test2/jobs_dev.csv"
results_file = "/vast/scratch/users/iskander.j/test2/devresults.csv"

Collect the results with

toolparameteriser -c <configfile> -R analyse
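
Using the example above (assuming the job details CSV exists at the configured path):

toolparameteriser -c examples/configAnalysis.toml -R analyse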

The results file will be a CSV file where each row corresponds to a job found in the jobs_details_path CSV file. It will add CPU and memory efficiency data and elapsed wall time retrieved from the seff command, along with other peripheral information about the job. For example:

JobId,JobType,NumFiles,Threads,Extra,Nodes,CPUs Requested,CPUs Used,CPUs Efficiency,Memory Requested,Memory Used,Memory Efficiency,GPUs Used,Time,WorkingDir,Cluster,Constraints
10829380,diamondblast_32t,1,32,type=,1,32,24.32,76.63,200.0,151.04,75.52,0,2000,nan,milton,Broadwell
10829375,diamondblast_32t,1,32,type=,1,32,22.4,70.87,200.0,77.36,38.68,0,2151,nan,milton,Broadwell
10829381,diamondblast_32t,1,32,type=,1,32,22.4,70.79,200.0,169.1,84.55,0,2171,nan,milton,Broadwell
...

How it works

The user has to prepare two files: a TOML config file and a CSV jobs profile (see Preparing Config File and Preparing Jobs Profile below). Each test will run a number of Slurm jobs, each with the parameters specified in the jobs profile.

When the test starts, the test output directory is created and for each job the following happens:

  1. A job directory is created.
  2. If input files are specified, they are copied to the job directory.
  3. A submission script is created from a template and includes the cmd specified in the config file (see the sketch after this list). The template is saved into the test output directory.
  4. The job is submitted to Slurm using the job directory as the working directory.
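
For illustration, a submission script generated for one job might look something like the following. This is a hypothetical sketch based on the BWA example later in this README; the actual template, Slurm directives, and module handling are determined by the tool and its template.

#!/bin/bash
#SBATCH --cpus-per-task=32   # e.g. threads taken from the jobs profile
#SBATCH --mem=200G           # memory request from the config or jobs profile
#SBATCH --qos=preempt        # qos from the [jobs] table in the config file

# Load the modules listed in the [[modules]] tables
module use /stornext/System/data/modulefiles/bioinf/its
module load bwa/0.7.17
module load gatk/4.2.5.0

# Run the cmd from the config file, with placeholders substituted
bwa mem -t 32 ...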

Preparing Config File

Config files are TOML files. Each config file describes parameters that are the same across all jobs. Example config files are found in examples.

For running a test, the following sections can be specified:

The [input] table sets where the pool of input files is located (path key) and whether the input is a dir or a file (type key). If [input] is provided, the path directory/file will be copied to each job’s working directory.

[input]
type="dir"
path = "<full path>"

The [[modules]] tables form an optional list of modules, each with use and name fields. use is optional if the required module is visible by default.

[[modules]]
use="/stornext/System/data/modulefiles/bioinf/its"
name="bwa/0.7.17"
[[modules]]
name="gatk/4.2.5.0"

The [output] table sets where the output files will be saved. All files related to the tests will be placed in this directory.

[output]
path = "<full output dir path>" 

The [jobs] table sets fields related to the Slurm jobs to be submitted. These include parameters supplied to Slurm, which must be given in either the config file or the jobs profile file; if given in both, the values in the jobs profile CSV file take precedence. To use a default value, supply an empty string, i.e. "".

Using the configBWA.toml example found in the examples folder:

[jobs]
cmd="bwa mem -t ${threads} -K 10000000 -R '@RG\\tID:sample_rg1\\tLB:lib1\\tPL:bar\\tSM:sample\\tPU:sample_rg1' ${reference} ${input_path} | gatk SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM 250000 -I /dev/stdin -O out.bam --SORT_ORDER coordinate --TMPDIR $TMPDIR"
num_reps = 1 
params_path = "/vast/scratch/users/yang.e/ToolParametriser/examples/IGenricbenchmarking-profiles.csv" 
tool_type="bwa"
run_type=""
email=""
qos="preempt"
environment="TMPDIR=/vast/scratch/users/yang.e"

The [[cmd_placeholder]] tables form a list of command placeholders, each with name and path fields. These correspond to the ${...} placeholders used in the cmd field of the [jobs] table.

[[cmd_placeholder]]
name="reference"
path="/vast/projects/RCP/23-02-new-nodes-testing/bwa-gatk/bwa-test-files/Homo_sapiens_assembly38.fasta"
[[cmd_placeholder]]
name="input_path"
path="samples/*"

Preparing Jobs Profile

The jobs’ profiles are stored in a CSV file, which must be linked to in the config file via the [jobs] table and its params_path key:

[jobs]
params_path="/path/to/jobsprofile.csv"

Column headers in the jobs profile CSV file correspond to Slurm job parameters or command placeholders. If a column exists in both the jobs profile CSV and the config TOML file, the former takes precedence. An example of this scenario is in examples/configBWA2.toml and examples/BWA-profile2.csv.
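
For illustration, a minimal jobs profile for the BWA example might look like the following. The headers here are hypothetical: threads matches the ${threads} placeholder in cmd, while the other columns stand in for whichever Slurm parameters are varied per job. See examples/BWA-profile2.csv for a real profile.

threads,memory,time
8,64G,04:00:00
16,128G,02:00:00
32,200G,01:00:00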