hpc-docs

Bunya user guide

General HPC information

General HPC training is available via the RCC and QCIF Training resources.

To get a basic understanding of what you need to be aware of when using HPC for your research please listen to the following videos:

Connecting to HPC via putty
Where does my data go on HPC
Directories (folders) I should be aware of on HPC
Message of the day (info on current status and problems)
Relative and absolute path (common problem in user scripts)
How to load installed software
Why no calculation should be run on the login nodes

For UQ users and QCIF users with a QRIScloud collection please also listen to

General overview of Q RDM
Q RDM on HPC

Bunya Hardware

Guide

Applying for Access

Access to Bunya HPC is not automatic. You need to apply by emailing rcc-support@uq.edu.au or using this form and noting the following instructions.

[!NOTE] Importantly, there are 8 questions and all are mandatory. Please copy the questions into the box on the form and answer each of them.

Applications under a staff account are preferred. Those applying under a student account will be asked to provide an end date for their access.

You will need to provide current SIX (6) DIGIT FoR code(s). If you do not know what a FoR code is or do not have one please contact your supervisor.

It can take up to one week for a new user's access to be processed. Incomplete applications will be rejected and applicants will have to fill in a new form.

Introduction to HPC Training

If you would like to attend Intro to HPC training, the sessions are generally scheduled on the final Tuesday morning of each month (except for December). You should apply by emailing rcc-support@uq.edu.au. You should have your access to Bunya HPC organised prior to attending (noting the point above that it may take up to a week for that to be completed).

Connecting

[!CAUTION] Do not share passwords, ssh keys or your multifactor authentication.

This is a violation of UQ’s cyber security policy and UQ’s conditions of access to RCC infrastructure policy, to which every Bunya user agrees when applying for Bunya access.

Violation of UQ’s policies and conditions of access may lead to the suspension of access to RCC infrastructure (temporarily or permanently) and in some cases a report to the Integrity Unit for misconduct.

Here are just some examples that are a violation of UQ’s policies (there are others):

How to connect

[!NOTE] For onBunya Users

Even if you want to use onBunya exclusively, you will need to login using this direct ssh method when you access Bunya for the first time. Doing this will trigger the proper setup of your account.

Set 1 of the Training resources explains how to use PuTTY to connect to an HPC, with the basics found here. To connect to Bunya please use:

Hostname: bunya.rcc.uq.edu.au
Port: 22

For those using command line ssh:
ssh username@bunya.rcc.uq.edu.au

Bunya enforces MFA (multi factor authentication)

For UQ users, this will use their DUO MFA that is used for all other access to UQ resources.

QCIF users are required to go here https://services.qriscloud.org.au/credential and set up an Authenticator App.

Users will first be asked for their password (or key). After users have entered this, they will see one or more options given. Choose the option you wish to use and type the respective number on the command line. If you have only set up one MFA option for your account, it will use this option automatically.

For UQ users, you will be asked to enter the DUO passcode (6 numbers with no spaces).

For QCIF users, you will be asked to enter the one-time-authentication code (6 numbers with no spaces).

After this you will be logged into Bunya.

Note for QCIF users

QCIF users (non UQ) will use their QSAC username and password: https://services.qriscloud.org.au/credential

Note for those using MobaXTerm Software

This SSH/X11 client has an experimental feature called “Remote monitoring”. Please disable it by modifying default behaviour for SSH connections via the main menu Settings … SSH … uncheck Remote-monitoring (Experimental) and please don’t activate it manually.

What to do when you can’t log in?

Email rcc-support@uq.edu.au and provide as much detail about the situation as possible. At a minimum, include your Bunya username, where you were trying to connect from (e.g. campus, home, VPN) and the tool you are using to connect.

File Transfer

We recommend using command line scp and sftp. They are accessible to all users, via a shell for Linux and Mac users and via WSL and cmd for Windows users.

scp file username@bunya.rcc.uq.edu.au:/path-to-place-for-file/

For example

scp test.dat username@bunya.rcc.uq.edu.au:/scratch/user/username/

will copy the file test.dat to the user’s scratch directory.

sftp username@bunya.rcc.uq.edu.au

You will see sftp> as prompt once you have logged in.

ls and cd work as usual on the Bunya end. Use lls to list files on your desktop/laptop and lcd to change directories on your desktop/laptop.

Use get to pull files and directories from Bunya to your desktop/laptop and use put to move files and directories from your desktop/laptop to Bunya.
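Whole directories can be copied with the recursive option of scp. The sketch below wraps both directions in small functions. The function names and the example paths are illustrative assumptions, not part of any Bunya tooling, and each transfer will still prompt for your password and MFA code.

```shell
# Host name taken from the connection details above.
BUNYA_HOST="bunya.rcc.uq.edu.au"

# Copy a local file or directory to a path on Bunya.
# Usage: bunya_push USERNAME LOCAL_PATH REMOTE_PATH
bunya_push() {
    scp -r "$2" "$1@${BUNYA_HOST}:$3"
}

# Copy a file or directory from Bunya to the local machine.
# Usage: bunya_pull USERNAME REMOTE_PATH LOCAL_PATH
bunya_pull() {
    scp -r "$1@${BUNYA_HOST}:$2" "$3"
}

# Examples (hypothetical paths):
#   bunya_push username ./inputs /scratch/user/username/
#   bunya_pull username /scratch/user/username/results ./
```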

Windows users can also use WinSCP if they require a graphical SFTP client. WinSCP allows MFA authentication without extra setup. WinSCP also allows multiple file and directory transfers without having to re-enter the MFA passcode.

FileZilla is no longer recommended.

Accessing Compute Nodes from Login Nodes

On Bunya, you are able to login to any compute node that is currently running your job(s).
If you do not have a running job on a compute node you will be unable to connect.

Before you can do that, you will need to set up an SSH keypair on Bunya for use within Bunya.
This keypair should only be used within Bunya HPC and NOT for connecting from outside.

On your Bunya login node run command
ssh-keygen -b 2048 -t rsa
Press enter for the default filename and location, then press enter (twice) for no password to be set on the private key (see warning above about keeping this key for use within Bunya only).
This should have generated a key pair: $HOME/.ssh/id_rsa (the private key) and $HOME/.ssh/id_rsa.pub (the public key)

Append the public key contents to your authorized keys file.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Use squeue --me to figure out the compute node that is running your job of interest.
Then you can use the ssh command to connect to that compute node.
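As a rough sketch of the last two steps, the helper below asks squeue for the node list of one job so it can be passed to ssh. The function name and the job ID are hypothetical, and the sketch assumes a single-node job (a multi-node job would return a node range rather than one host name).

```shell
# Print the node list of one of your own jobs.
# Usage: job_node JOBID
job_node() {
    squeue --me --noheader --jobs="$1" --format="%N"
}

# Example (hypothetical job ID), run on a Bunya login node:
#   ssh "$(job_node 1234567)"
```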

Software

The training resources have a short video on how to use software modules to load installed software on HPC.

The basic commands are:

module overview - shows a brief list of all available modules, and the number of distinct versions
module avail - shows all available main modules
module --show_hidden avail - shows all available modules, including those that are normally hidden
module -t avail - shows a terse single column list of the available modules
module avail [SOFTWARE-NAME or KEYWORD] - shows all modules for SOFTWARE-NAME or KEYWORD
module spider [SOFTWARE-NAME or KEYWORD] - shows all possible modules for SOFTWARE-NAME or KEYWORD
module load [SOFTWARE-NAME/VERSION] - loads a specific software version
module unload [SOFTWARE-NAME/VERSION] - unloads a specific software version
module list - lists all currently loaded software modules
module purge - unloads ALL currently loaded software modules

Bunya uses EasyBuild to build and install software and modules. Modules on Bunya are self-contained which means users do not need to load any dependencies for the module to work. This is similar to how modules worked on Tinaroo and FlashLite but different to Wiener.

Using module avail will show only the main software modules installed. It will not show all the different dependency modules that are also available. To show ALL modules including hidden modules use

module --show_hidden avail

Ordinarily, the module command will only search standard pre-configured paths, but you can modify the search path within your current session. If you have created some personal module files and would like to use them, then you need to add the directory containing those personal software modules to the search path using the following command

module use path_to_where_you_keep_your_modules

The -a option can be used to append to the search path instead of prepending to it. The command module unuse path_to_where_you_keep_your_modules will reverse this temporary change (or you could log in again). You can make sure it is always set by modifying your $HOME/.bashrc file.
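A possible addition to $HOME/.bashrc is sketched below. The directory name $HOME/modulefiles is only an assumption for illustration. The guard avoids errors in shells where the directory does not exist or the module command is not defined.

```shell
# Append a directory of personal module files to the module search
# path, but only if it exists and "module" is available in this shell.
# Usage: add_module_path DIRECTORY
add_module_path() {
    if [ -d "$1" ] && command -v module >/dev/null 2>&1; then
        module use -a "$1"
    fi
}

# Hypothetical location for personal module files.
add_module_path "$HOME/modulefiles"
```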

Please note:

Please note:

Compilers

How to build your own software

IMPORTANT: If you are building your own software, especially if you are compiling your own software, you need to be aware of the different architectures, epyc3 and epyc4. Software built on an epyc3 compute node will run on an epyc4 compute node BUT software built on an epyc4 compute node will not run on an epyc3 compute node. If you want ease of use you need to make sure to compile on an epyc3 compute node. If you want best performance you should compile for a specific architecture, but you then need to request that architecture for your jobs.

These AMD guides (in PDF) EPYC3 EPYC4 will be useful to understanding the options you may need to consider for optimising the performance of your code on Bunya.

Please note ALL software builds should be done on a compute node. Processes running on the login nodes, including software builds (conda environments, make and make install, EasyBuild), will most likely be killed if found on the login nodes.

Building additional packages for R

The installation of R on Bunya comes with many packages provided (see module help r) that can be loaded using the library() function. Additional R packages (ones that you need but that are not already provided) need to be built using the install.packages() function within R. Note: R packages that were built for Tinaroo, or for a different version of R, will NOT work properly, or at all, on Bunya. We recommend that you delete all packages built with/for/on Tinaroo and FlashLite and run install.packages() again.

Additionally, R packages built on the newer EPYC4 CPU nodes are known not to work on the older EPYC3 CPU nodes. To ensure you build R packages that can be run on both CPU types you must run install.packages() on the older EPYC3. See the section on interactive jobs for how to target the older nodes for building R packages.

Building Python and Conda environments

As with R packages, Python environments built with/for/on Tinaroo and FlashLite need to be reinstalled on Bunya.

Please see here for more information on how to build conda environments on Bunya.

Building software using EasyBuild

Users can use EasyBuild to build their own software against existing modules on Bunya.

https://docs.easybuild.io/en/latest/index.html

EasyBuild recipes can be found for a very wide range of software. Some might need tweaking for newer versions, but it often is relatively easy. You can also write your own.

Users can build into their own home directory but use all existing software and software toolchains that are already available. Users need to load the EasyBuild module first:

module load easybuild

Please note: We usually advise using the full module name/version to load software modules. The easybuild module is an exception, as you want to always have the latest version so that you have access to all available recipes.

For example, if you create a folder called EasyBuild in your home directory and have a recipe located in this directory you can build the software via this command. Make sure you are on a compute node before doing this.

eb --prefix=/home/YourUsername/EasyBuild --installpath=/home/YourUsername/EasyBuild --buildpath=/home/YourUsername/EasyBuild/build --robot=/home/YourUsername/EasyBuild ./EasyBuild-recipe-file.eb

If you add the -D option, it will do a dry run first. Please use eb -H to get the help manual.
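The long command above can be wrapped in a small function so that a dry run and the real build use identical paths. This is only a sketch: the function name is made up, and it assumes your home directory is $HOME and that the EasyBuild folder from the example exists there.

```shell
# Build an EasyBuild recipe into $HOME/EasyBuild. Extra options
# (for example -D for a dry run) are passed through to eb.
# Usage: eb_user_build RECIPE.eb [EXTRA_EB_OPTIONS...]
eb_user_build() {
    recipe="$1"
    shift
    eb --prefix="$HOME/EasyBuild" \
       --installpath="$HOME/EasyBuild" \
       --buildpath="$HOME/EasyBuild/build" \
       --robot="$HOME/EasyBuild" \
       "$@" "$recipe"
}

# Example, on a compute node:
#   module load easybuild
#   eb_user_build ./EasyBuild-recipe-file.eb -D   # dry run first
#   eb_user_build ./EasyBuild-recipe-file.eb      # real build
```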

There are currently over 16,000 sample EasyBuild (.eb) recipe scripts available after you load the easybuild module. The command eb -S searchtext will return all .eb scripts that match the search text, ignoring case. You may need to refine your search.

The names of sample EasyBuild scripts include one of the following labels, which represent the toolchain to be used when building the software. As you can see, the toolchains are built upon a specific version of a compiler. The vast majority of the software for Bunya has been built with one of the Solid GCC based toolchains.

| Toolchain Module | Compiler Base | Status on Bunya |
|:---|:---|:---|
| foss/2023a | GCC 12.3.0 | Solid |
| gfbf/2023a | GCC 12.3.0 | Solid |
| gcc/12.3 | GCC 12.3.0 | Solid |
| foss/2022a | GCC 11.3.0 | Solid |
| gfbf/2022a | GCC 11.3.0 | Solid |
| gcc/11.3 | GCC 11.3.0 | Solid |
| foss/2021a | GCC 10.3.0 | Solid |
| gcc/10.3 | GCC 10.3.0 | Solid |
| intel/2023a | Intel 2023.1.0 | Solid |
| intel/2022a | Intel 2022.1.0 | Solid |
| intel/2021a | Intel 2021.2.0 | Solid |

For more information about the toolchains, refer to the EasyBuild documentation

Many older versions and toolchains may be present amongst the sample eb scripts (recipes). If you choose to build software with a non-Solid toolchain on Bunya, you may find the task quite time consuming as you could end up building the entire toolchain and every dependency from source code. You have been warned ;-)

The specific version of the software you need to build may have an eb script available for one of our “Solid” toolchains available on Bunya. That makes building software quicker and more straightforward because the build can rely on pre-existing components.

If a “Solid” version doesn’t exist then you have two choices and both may involve extra work. You could proceed with attempting to build it using the .eb file without modification. This will build everything that it needs (i.e. all dependencies) for that software so can take many hours to complete. It may entail minor fixes along the way to get it all successfully built.
Alternatively, you could adapt the eb script to make it compatible with one of the Solid toolchains listed above. This involves tracing dependencies (exact versions matter).

Users who have a working EasyBuild recipe and have tested that the software installed as such is working on Bunya can offer their EasyBuild recipe to be uploaded to the suite of cluster wide installed software and it would then be available via modules. It is preferable that the recipe be for one of the “Solid” toolchains, unless there is a strong reason because of compatibility with other software.

Using software containers on Bunya

The software build management system on Bunya is better equipped to support a range of software on the “bare-metal”. There is still support on Bunya for software container technology, which provides greater flexibility for Bunya users.

Background

“Software container” is a generic technology term. Containers provide a mechanism for operating systems and software that are not available in the host operating system to be used safely on the HPC platform. Software containers can be built from a “recipe” or downloaded as pre-built images, often as a stack that is assembled on the fly.

The names Docker, Shifter, Singularity/Apptainer are like different “brands” that support software containers.

Bunya uses Apptainer. Apptainer is the name Singularity was given when the project joined the Linux Foundation. The version of Apptainer currently installed on Bunya is 1.3.0-1.el8, which is newer than the latest Singularity release (v3.8.7).

Apptainer is not installed on the Bunya login nodes. Apptainer is installed into the operating system on every compute node. You must use an interactive job via the batch system (or an onBunya session) to be able to reach compute nodes and use apptainer. You do not need to load a software module to use the apptainer command.

How to run a software container on Bunya

You can

To make the magic happen you need to

Of course, you can also use software containers within a regular batch job, noting the points above about cache and tmp directories and bind mounts.

Where and how to build a software container

Singularity/Apptainer software containers do not need to be built on systems where the user has system admin access. So, regular users are able to build software containers on Bunya. There is, however, one complication. The GPFS filesystem that underpins your /scratch and /home, as well as $TMPDIR, does not support “sandbox” mode. Sandbox mode allows you to add additional functionality to a container created with just a rudimentary installation. The sandbox needs to be converted to a portable image file once it has been completely built. If you need to use sandbox mode on Bunya, please contact RCC Support.

You can build your own on a suitable external system and bring it onto Bunya.

You cannot build a container on Bunya directly from a Dockerfile prescription. Instead, you will need to use the docker software to create a container image and upload it to a suitable repository. Then you will be able to “pull” a copy of it and it will be converted to apptainer format.
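The pull-and-convert step might look like the sketch below, run inside an interactive job on a compute node. The function names are made up for illustration, and the image reference in the example is a placeholder for whatever repository you uploaded your image to.

```shell
# Pull an image from a Docker registry and convert it to a .sif file.
# Usage: pull_image OUTPUT.sif REPOSITORY/IMAGE:TAG
pull_image() {
    apptainer pull "$1" "docker://$2"
}

# Run a command inside a pulled image.
# Usage: run_in_image IMAGE.sif COMMAND [ARGS...]
run_in_image() {
    image="$1"
    shift
    apptainer exec "$image" "$@"
}

# Example (placeholder image name):
#   pull_image myimage.sif myrepo/myimage:latest
#   run_in_image myimage.sif cat /etc/os-release
```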

Fair Share

Bunya employs fair share to ensure that each user is able to use Bunya resources.

Interactive jobs

Do not run on the login nodes

Users are reminded that no calculation, no matter how quick or small, should be run on the login nodes. So no, the quick python or R or bash script or similar should NOT be just quickly run from the command line as it is so much more convenient. All calculations are required to be done on the compute nodes.

This also includes software installations. Conda create, pip installs and R install.packages should be run via an interactive job, not on the login nodes. Software installations using make and then make install (or cmake) are especially unsuitable for a login node. For these, users should take care to choose the correct architecture, see here

Users can use interactive jobs which will give them that command line feel and flexibility and allow the use of graphical user interfaces. Users who need a Graphical User Interface (GUI) should consult the onBunya User Guide.

Users have access to a debug QoS for quick testing of new jobs and codes etc.

Interactive jobs

Users should use interactive jobs for quick testing and when they need a graphical user interface (GUI) to run their calculations. This could include Jupyter, Spyder, etc. salloc is used to submit an interactive job and you should specify the required resources via the command line. IMPORTANT: Interactive jobs should be limited to a single node. Multinode jobs are required to be submitted as a batch job.

Jobs need to request all resources they need or they will fail. This includes GPUs (default is zero) and walltime.

Use this full command line to create an interactive session on a compute node.
You must combine salloc and srun to ensure that your processing happens on a Bunya compute node and not on the login node.

salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=5G --job-name=TinyInteractive --time=01:00:00 --partition=general --qos=debug --account=AccountString srun --export=PATH,TERM,HOME,LANG --pty /bin/bash -l

You can use the command
hostname
to see if you are on a compute node or not. If this shows bunya1, bunya2, or bunya3 you are still on a login node. Do not start your calculation, compile or environment install on a login node. Make sure you are on a compute node.
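The hostname check can be turned into a guard at the top of a script. This is a minimal sketch: the function name is hypothetical, and matching a fully qualified name such as bunya1.rcc.uq.edu.au is an assumption about what hostname may print.

```shell
# Succeed if the given host name (default: this machine) is one of
# the Bunya login nodes bunya1, bunya2 or bunya3.
# Usage: is_login_node [HOSTNAME]
is_login_node() {
    case "${1:-$(hostname)}" in
        bunya[1-3]|bunya[1-3].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example guard: warn before any heavy work is started.
if is_login_node; then
    echo "On a login node: start an interactive job first." >&2
fi
```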

Please use --partition=general unless you need access to GPUs. The general partition has epyc3 and epyc4 architecture CPUs. The --qos=debug has a higher priority but has a walltime limit of 1 hour and limits number of jobs per user. Use --qos=normal to submit standard jobs. The normal QoS does not allow GPUs.

Replace AccountString in the --account= option with the AccountString for your research or accounting group. All AccountStrings start with a_. Use the groups command to list your group memberships and pick the one that begins with a_.
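Picking the a_ entries out of the groups output can be done with a short filter. The function name below is hypothetical. It reads the output of groups by default and accepts a space-separated list for testing.

```shell
# Print every group name that starts with a_, one per line.
# Usage: account_strings             (uses the output of "groups")
#        account_strings "g1 g2 g3"  (uses the given list instead)
account_strings() {
    for g in ${1:-$(groups)}; do
        case "$g" in
            a_*) echo "$g" ;;
        esac
    done
}

# Example: account_strings
```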

To target an epyc3 compute node add --constraint=epyc3 to the salloc part. To target an epyc4 compute node add --constraint=epyc4 to the salloc part.

If you need to run a GUI then add the option --x11 to the salloc part.

For an interactive session on the gpu_rocm or gpu_cuda partitions you will need to add --gres=gpu:[type]:[number] to the salloc request and use --qos=gpu instead of --qos=normal. It is important that you use a [type] to get the correct GPU card for your job.

To target a particular amount of GPU RAM in the gpu_cuda partition, add a constraint to the salloc part: --constraint=cuda10gb to target an A100 MIG slice, or --constraint=cuda80gb to target a card with the full GPU RAM.
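Putting the GPU options together, a one-hour interactive session on gpu_cuda could be requested as sketched below. The function name and the job name are made up, and the CPU and memory values are simply copied from the CPU example, so adjust them to what your work needs.

```shell
# Start a one-hour interactive session with GPUs on gpu_cuda.
# Usage: gpu_interactive GPU_TYPE GPU_COUNT ACCOUNT_STRING
gpu_interactive() {
    salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=5G \
        --job-name=GpuInteractive --time=01:00:00 \
        --partition=gpu_cuda --qos=gpu --gres="gpu:$1:$2" \
        --account="$3" \
        srun --export=PATH,TERM,HOME,LANG --pty /bin/bash -l
}

# Example: one L40 card
#   gpu_interactive l40 1 AccountString
```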

See here for a full list of partitions, QoS, GPU types and other features.

This will log you onto a node. To run a job just type as you would usually do on the command line. As srun was already used in the above command there is no need to use srun to run your executables, it will just mess things up.

Once you are done type exit on the command line which will stop any processes still running and will release the allocation for the job.

Alternatively, if you use just salloc on the login node, then you must use srun to run your command, otherwise it will start running on the login node and that is not fair on other users.

Interactive MPI jobs only

Instructions for interactive MPI jobs can be found here

Available partitions

The available partitions on Bunya are

general
gpu_rocm
gpu_cuda
gpu_sxm
gpu_viz

Available Quality of Service (QoS)

normal
debug
mig
sxm
gpu
viz


QoS use and limits

QoS are used to control access to resources and apply sustainable limits.

Important:
mig requires the request of at least 1 MIG slice: gres=gpu:nvidia_a100_80gb_pcie_1g.10gb:1
sxm requires the request of at least 1 H100:gres=gpu:h100:1
viz for onBunya jobs only
onBunya Accelerated Desktops with 2 or 3 GPUs will be submitted with the debug QoS.
gpu still requires that at least one GPU is requested for the job as the default for number of GPUs is zero.

| QOS | Partitions | Access | Priority | All User Group limit | User limits |
|:---|:---|:---:|:---:|:---|:---|
| normal | general | open | 10 | 20000 CPUs, 200 T CPU memory | 1536 CPUs, 16 T CPU memory, 0 GPUs, 5000 jobs submitted |
| debug | general, gpu_rocm, gpu_cuda, gpu_viz | open | 20 | none | 1 hour, 1536 CPUs, 16 T CPU memory, 4 GPUs, 2 jobs running, 20 jobs submitted |
| gpu | gpu_rocm, gpu_cuda, gpu_viz | open | 10 | none | 256 CPUs, 2 T of CPU memory, 4 GPUs, 4 jobs running, 100 jobs submitted |
| mig | gpu_cuda | open | 10 | none | 441 CPUs, 1932 GB CPU memory, 21 GPUs, 1000 jobs submitted |
| sxm | gpu_sxm | approved users | 10 | none | 192 CPUs, 1 T CPU memory, 4 GPUs, 4 jobs running, 50 jobs submitted |
| viz | general, gpu_viz | onBunya only | 20 | none | 1 day, 192 CPUs (96 CPU per job), 500G per job, 2 GPUs (1 GPU per job), 2 running jobs, 20 jobs submitted |

Available partitions and nodes

The available compute nodes on Bunya are listed in the table below. Please note while some feature the same CPU or GPU type, the available memory and/or number of CPUs or architecture can differ. Be mindful of this when requesting resources for jobs.

Maximum walltimes:
general: 2 weeks (14 days, 336 hours)
gpu_cuda, gpu_viz, gpu_rocm, gpu_sxm: 1 week (7 days, 168 hours)

Default walltime for all partitions: 30 minutes

Default number of GPUs for all partitions: zero

gpu_viz is used exclusively by onBunya. Users should not be submitting batch jobs via sbatch to the gpu_viz partition. The L40 and L40s GPUs are available through the gpu_cuda partition.

| Partition | Hostnames | Count | CPU Memory (MB) | CPUS | FEATURES | GRES | Charge Multiplier |
|:---|:---|:---:|:---:|:---:|:---|:---|:---:|
| general | bun[006-008] | 3 | 4000000 | 192 | epyc3 | (null) | 1 |
| general | bun[009-067] | 59 | 2000000 | 192 | epyc3 | (null) | 1 |
| general | bun[083-115,126-143] | 51 | 1500000 | 192 | epyc4 | (null) | 1 |
| gpu_cuda | bun003 | 1 | 2000000 | 256 | epyc3, cuda, cuda80gb | gpu:a100:3 | 50 |
| gpu_cuda | bun[004-005] | 2 | 2000000 | 256 | epyc3, cuda, cuda10gb | gpu:nvidia_a100_80gb_pcie_1g.10gb:21 | 6 |
| gpu_cuda | bun068 | 1 | 2000000 | 192 | epyc3, cuda, cuda80gb | gpu:a100:2 | 50 |
| gpu_cuda | bun[071-076,116] | 7 | 2000000 | 192 | epyc3, cuda, cuda80gb | gpu:h100:3 | 100 |
| gpu_sxm | bun[117-120] | 4 | 1000000 | 192 | xeonsp4, cuda, cuda80gb, sxm | gpu:h100:4 | 100 |
| gpu_cuda, gpu_viz | bun[077-082] | 6 | 2000000 | 192 | epyc3, cuda, cuda48gb | gpu:l40:3 | 40 |
| gpu_cuda, gpu_viz | bun[124-125] | 2 | 750000 | 192 | epyc4, cuda, cuda48gb | gpu:l40s:3 | 42 |
| gpu_viz | bun[121-123] | 3 | 750000 | 192 | epyc4, cuda | gpu:a16:12 | 6 |
| gpu_rocm | bun[001-002] | 2 | 500000 | 192 | epyc3, rocm | gpu:mi210:2 | 50 |
| gpu_rocm | bun070 | 1 | 380000 | 64 | epyc4, rocm | gpu:mi210:2 | 50 |

Slurm scripts

Users should keep in mind that Bunya has 96 cores (192 threads) per node. 96 cores (--ntasks-per-node=96) or 192 threads (--cpus-per-task=192) is therefore the maximum a multi thread job can request. Please note that not all calculations scale well with cores, so do some testing before requesting all 96/192 cores/threads.

The Pawsey Centre has an excellent guide on how to migrate from PBS to SLURM. The Pawsey Centre also provides a good general overview of job scheduling with Slurm and example workflows like array jobs.

How to submit a job

You would usually write a slurm script to submit your jobs. Once you have a script, you use sbatch to submit it. For example, if you have a script called first-job-script then you use

sbatch first-job-script

to submit the slurm script and your job.
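When a later command needs the job ID, the --parsable option of sbatch makes it easy to capture. The wrapper below is a sketch with a hypothetical function name, not a Bunya-provided tool.

```shell
# Submit a job script and print only the job ID.
# Usage: submit_job SCRIPT
submit_job() {
    jobid=$(sbatch --parsable "$1") || return 1
    # --parsable may print "jobid;cluster"; keep the job ID only.
    echo "${jobid%%;*}"
}

# Example:
#   id=$(submit_job first-job-script)
#   squeue --me --jobs="$id"
```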

Jobs need to request all resources they need or they will fail. This includes GPUs (default is zero) and walltime.

Below are examples for single thread, single node but multiple threads, MPI, and array job submission scripts. The different request flags mean the following:
#SBATCH --nodes=[number] - how many nodes the job will use
#SBATCH --ntasks-per-node=[number] - This is 1 for single thread jobs and multi thread jobs. This is 96 (or less if single node) for MPI jobs.
#SBATCH --ntasks=[number] - total number of tasks of the job. Relevant to MPI jobs (it is usually 1 for non-MPI jobs) and should be set to the total number of tasks for the job (what you would use with the -np or -n option for mpirun). This should be used instead of requesting number of nodes and tasks per node to enable faster scheduling of MPI jobs.
#SBATCH --ntasks-per-core=[number] - maximum ntasks on each core. Use with --ntasks=[number] for MPI jobs and set to --ntasks-per-core=1.
#SBATCH --cpus-per-task=[number] - This is 1 for single thread jobs, number of threads for multi thread jobs. --cpus-per-task can be understood as OMP_NUM_THREADS. Do not use for MPI jobs.
#SBATCH --hint nomultithread - This option may help in situations where your parallelisation (single node multicore or hybrid OpenMP+MPI) is confused by the numbers of cores/threads.

#SBATCH --mem=[number M|G|T] - RAM per job given in megabytes (M), gigabytes (G), or terabytes (T). The full memory of 1.5TB or 2TB or 4TB is not available to jobs, therefore jobs asking for 1.5TB or 2TB or 4TB (1500G or 2000G or 4000G) will NOT run. Ask for 1500000M to get the maximum memory on an epyc4 standard node. Ask for 2000000M to get the maximum memory on an epyc3 standard node. Ask for 4000000M to get the maximum memory on a high memory node. See the note below for why.
#SBATCH --mem-per-cpu=[number M|G|T] - alternative to the request above, only relevant to MPI jobs.

#SBATCH --gres=gpu:[type]:[number] - to request the use of GPU on a GPU node. Please see description of partitions above for the available types of GPUs
#SBATCH --time=[hours:minutes:seconds] - time the job needs to complete. Partition limits: general = 336 hours (2 weeks), gpu_rocm, gpu_cuda, gpu_sxm = 168 hours (1 week).

#SBATCH --qos=[normal,gpu,debug,mig,sxm] - to request a quality of service for the job.
#SBATCH -o filename - filename where the standard output should go to. See man sbatch for filename templating options.
#SBATCH -e filename - filename where the standard error should go to. See man sbatch for filename templating options.
#SBATCH --job-name=[Name] - Name for the job that is seen in the queue

#SBATCH --account=[Name] - AccountString for your research or accounting group. All AccountStrings start with a_. Use the groups command to list your groups

#SBATCH --constraint=[epyc3 or epyc4] - to submit to a specific CPU architecture if required, needs to be applied with --batch below.
#SBATCH --batch=[epyc3 or epyc4] - to submit to a specific CPU architecture, needs to be applied with --constraint above.

#SBATCH --partition=general/gpu_rocm/gpu_cuda/gpu_sxm

#SBATCH --array=[range] - Indicates that this is an array job with range number of tasks. Range can be 0-999. The maximum range value is 1000.

srun - runs the executable using the resources you requested for this job. It will receive info on number of threads, memory, etc from Slurm. There is no need to specify them here.

See man sbatch and man srun for more options (use arrow keys to scroll up and down and q to quit)
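As an illustration of the --array option, the sketch below shows one way an array task can select its own input file through SLURM_ARRAY_TASK_ID. The file naming scheme, the job name and the array range are assumptions for the example, and the srun line is left commented out so the script can be tried outside of a job.

```shell
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=ArrayTest
#SBATCH --time=1:00:00
#SBATCH --qos=normal
#SBATCH --partition=general
#SBATCH --account=AccountString
#SBATCH --array=0-9
#SBATCH -o slurm-%A_%a.output
#SBATCH -e slurm-%A_%a.error

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0
# so the script can also be tried outside of a job.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
INPUT="input_${TASK_ID}.dat"      # hypothetical file naming scheme
OUTPUT="output_${TASK_ID}.dat"
echo "Array task ${TASK_ID}: ${INPUT} -> ${OUTPUT}"

# module-loads-go-here
# srun executable < "$INPUT" > "$OUTPUT"
```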

Default partition is general: If you do not specify a partition when you submit, you get the default partition, general, which is CPU only. Important: the Slurm defaults are usually not sufficient for most user jobs. If you want appropriate resources, you are required to request them.

Standard output and error: Using the SBATCH options -o and -e with a filename in a script will result in the standard error and standard output files appearing as soon as the job starts to run. This behaviour is different to standard PBS behaviour on Tinaroo and FlashLite (unless you specified paths for those files there too) where the standard error, .e, and standard output, .o, files only appeared when the job was finished or had crashed.

Default working directory: In Slurm your job will start in the directory/folder you submitted from. This is different to PBS behaviour where your job started in your home directory. So on Bunya, using slurm, there is no need to change into the job directory, unless this is different to the directory you submitted from.

$TMPDIR: If your job produces temporary files during the calculation or if you need a space to write to that does not impact your quotas in /home, /scratch/user or /scratch/project then please use $TMPDIR during your calculations. $TMPDIR is automatically created at the start of a job and is then automatically deleted at the end of the job. It is therefore the best place to write temporary files (those not needed after the calculation is done) to. If you use $TMPDIR for output you wish to keep then please make sure you copy all needed files to a location in /home, /scratch/user, /scratch/project or /QRISdata. A big (all in one go) copy from $TMPDIR to /QRISdata (RDM) at the end of a job is possible.
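One way to keep results written to $TMPDIR is to copy them out as the last step of the job script. The sketch below uses a hypothetical helper name and a hypothetical destination directory.

```shell
# Copy everything from a work directory to a permanent location.
# Usage: save_results WORK_DIR DEST_DIR
save_results() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# Inside a job script (hypothetical destination path):
#   srun executable < input > "$TMPDIR/output"
#   save_results "$TMPDIR" "/scratch/user/$USER/run1"
```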

Note on maximum memory requests: The standard compute node has 2TB of physical memory available (or 4TB for the 3 high memory compute nodes). Not all of this can be given to jobs running on the compute node as the Linux operating system also needs resources. This is why the maximum requestable memory has been capped at 2000000MB for the standard compute nodes and 4000000MB for high memory compute nodes.

So why is 2000000MB not the same as 2TB? 1024 MB = 1 GB and 1024 GB = 1 TB. This means 2000 GB = 2048000 MB which is larger than 2000000M which is set as the maximum available memory on a standard compute node.
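The arithmetic can be checked with shell integer maths. The sketch below uses the 1024-based factors quoted above and also works out the largest whole number of gigabytes that stays under the 2000000M cap. The variable names are illustrative only.

```shell
# Unit factors quoted above.
MB_PER_GB=1024
GB_PER_TB=1024

mb_in_2000g=$((2000 * MB_PER_GB))          # what --mem=2000G asks for
mb_in_2t=$((2 * GB_PER_TB * MB_PER_GB))    # what --mem=2T asks for
cap_mb=2000000                             # cap on a standard node

echo "2000G = ${mb_in_2000g} MB"
echo "2T    = ${mb_in_2t} MB"
echo "cap   = ${cap_mb} MB"
# Largest whole number of gigabytes that fits under the cap.
echo "max whole G = $((cap_mb / MB_PER_GB))G"
```

Both requests exceed the cap, whereas the last line shows the largest gigabyte value that still fits if you prefer not to specify memory in megabytes.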

Accounting has now been switched on and will be enforced. Users cannot run jobs without a valid AccountString. Type groups on the command line to check if you have one. All valid AccountStrings start with a_ and are all lower case letters. If you do not have a valid AccountString then please contact your supervisor. AccountStrings and access are managed by research groups and group leaders. Groups who wish to use Bunya are required to apply to set up a group with a valid AccountString. Only group leaders can apply to set up such a group. A PhD student or postdoc without their own funding and group should not apply. Applications can be made by contacting rcc-support@uq.edu.au.

Simple script for CUDA GPUs.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --qos=gpu
#SBATCH --partition=gpu_cuda
#SBATCH --gres=gpu:nvidia_a100_80gb_pcie_1g.10gb:1
#SBATCH --account=AccountString
#SBATCH -o slurm-%j.output
#SBATCH -e slurm-%j.error

module-loads-go-here

srun executable < input > output

For full A100 change to
#SBATCH --gres=gpu:a100:1

For H100 change to
#SBATCH --gres=gpu:h100:1

For L40 change to
#SBATCH --gres=gpu:l40:1

For L40s change to
#SBATCH --gres=gpu:l40s:1

Simple script for AMD ROCM GPUs.

The AMD GPUs are in nodes bun001, bun002, and bun070. You will most likely need to compile your own code or use a container to run on these. See the AMD Infinity Hub for some available containers.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --qos=gpu
#SBATCH --partition=gpu_rocm
#SBATCH --gres=gpu:mi210:1 #you can ask for up to 2 here
#SBATCH --account=AccountString
#SBATCH -o slurm-%j.output
#SBATCH -e slurm-%j.error

module-loads-go-here

srun executable < input > output

Simple script for CPUs and single node

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --qos=normal
#SBATCH --partition=general
#SBATCH --account=AccountString
#SBATCH -o slurm-%j.output
#SBATCH -e slurm-%j.error

module-loads-go-here

srun executable < input > output

To ask for more than 1 thread, change the --cpus-per-task line. For example, to run over 12 threads:

#SBATCH --cpus-per-task=12
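Many threaded programs do not pick up the Slurm allocation by themselves. For OpenMP codes, a common approach (an assumption here, since your software may use a different setting) is to pass the allocated core count through OMP_NUM_THREADS before the srun line:

```bash
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given.
# Fall back to 1 thread when it is unset, for example outside a job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```

Taking the value from the Slurm variable keeps the thread count and the resource request in step when you change one of them.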

You can target specific architectures like epyc3 (phase 1) and epyc4 (phase 2) by adding

#SBATCH --constraint=[epyc3 or epyc4]
#SBATCH --batch=[epyc3 or epyc4]

Simple MPI script (using 192 cores, as an example). Using --ntasks will spread the job over multiple nodes where there is space.

#!/bin/bash --login
#SBATCH --ntasks=192
#SBATCH --ntasks-per-core=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=MPI-Test
#SBATCH --time=1:00:00
#SBATCH --qos=normal
#SBATCH --partition=general
#SBATCH --account=AccountString
#SBATCH -o slurm-%j.output
#SBATCH -e slurm-%j.error

module-loads-go-here

srun executable < input > output

You can target specific architectures like epyc3 (phase 1) and epyc4 (phase 2) by adding

#SBATCH --constraint=[epyc3 or epyc4]
#SBATCH --batch=[epyc3 or epyc4]

Job Arrays

Here is one example of an array job script with 5 array tasks. Important: Request the resources needed by a single array task, not the combined resources for all tasks.

#!/bin/bash --login
#SBATCH --job-name=testarray
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5G
#SBATCH --time=00:01:00
#SBATCH --qos=normal
#SBATCH --partition=general
#SBATCH --account=AccountString
#SBATCH --output=test_array_%A_%a.out
#SBATCH --array=1-5

module-loads-go-here

srun executable < input > output

Useful variables for array jobs

$SLURM_ARRAY_JOB_ID = Job array’s master job ID number.
$SLURM_ARRAY_TASK_COUNT = Total number of tasks in a job array.
$SLURM_ARRAY_TASK_ID = Job array ID (index) number.
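A typical use of these variables is to give each array task its own input and output file. The sketch below assumes a hypothetical naming scheme (input_1.dat, input_2.dat and so on) and defaults the variables to 1 so that it can be tried outside a job array.

```bash
#!/bin/bash
# Outside a job array these variables are unset, so default them to 1.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
TASK_COUNT="${SLURM_ARRAY_TASK_COUNT:-1}"

# Hypothetical naming scheme: one input and one output file per task.
INPUT="input_${TASK_ID}.dat"
OUTPUT="output_${TASK_ID}.dat"

echo "Array task ${TASK_ID} of ${TASK_COUNT}: ${INPUT} -> ${OUTPUT}"

# In a real job script the last line would then be something like:
# srun executable < "$INPUT" > "$OUTPUT"
```

Without a per-task file name, all array tasks would write to the same output file.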

You can target specific architectures like epyc3 (phase 1) and epyc4 (phase 2) by adding

#SBATCH --constraint=[epyc3 or epyc4]
#SBATCH --batch=[epyc3 or epyc4]

How to manage your jobs and cluster activity in SLURM

To list only your jobs

squeue -u YourUsername
squeue -u $USER
squeue --me

Some formatting ideas for more detailed squeue reports

Here are some other useful additions to the squeue command. For information on what all of these mean, please consult the man pages.

squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %.10a %.4c %R"
JOBID  PARTITION  NAME  USER  STATE  TIME TIME_LIMIT NODES  ACCOUNT  MIN_CPU  NODELIST(or REASON)
squeue -o "%12i %7q %.9P %.20j %.10u %.2t %.11M %.4D %.4C %.14b %8m %16R %18p %10B %.10L" 
JOBID QOS PARTITION NAME USER STATE TIME NODE CPUS TRES_PER_NODE MIN_MEMORY NODELIST(REASON) PRIORITY EXEC_HOST TIME_LEFT
#One for the MPI users, perhaps!
squeue -o "%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C"
JOBID PARTITION     NAME     USER ST       TIME  NODES CPUS

When you are wondering why your job has not started

Check the REASON in your squeue output

Checking the REASON field of the squeue output should give you some clue. The manual page for the squeue command lists the most commonly encountered reasons. They can be found in the “JOB REASON CODES” section of the man squeue output or online here. A full list of job reasons can be found on this web page.

QOSGrpCpuLimit or QOSGrpMemLimit

[20241024] A recent change was made to the batch system configuration to accommodate the Bunya Phase 3 hardware and associated workloads. It makes greater use of the QOS feature of SLURM. You may now see the REASON given as “QOSGrpCpuLimit” or perhaps “QOSGrpMemLimit” for why your job is queued and not running. This means that Bunya is currently very busy and there is no space available for your job at the moment. Previously it would have said something to the effect of Resources are unavailable.

Your jobs will eventually start if you leave them in the queue!
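If you have many queued jobs, a short summary of the reasons can be easier to read than the full list. The sketch below is one way to produce it. The count_pending_reasons helper is a hypothetical name, and the sketch assumes squeue is asked for only the state (%T) and reason (%r) columns, without a header.

```bash
#!/bin/bash
# Hypothetical helper: read "STATE REASON" lines on stdin and print how
# many pending jobs are waiting for each reason, most frequent first.
count_pending_reasons() {
    awk '$1 == "PENDING" { n[$2]++ } END { for (r in n) print n[r], r }' | sort -rn
}

# Only call squeue where it exists, so the sketch is harmless elsewhere.
if command -v squeue >/dev/null 2>&1; then
    squeue --me -h -o "%T %r" | count_pending_reasons
fi
```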

Check the sinfo output for a status report of Bunya nodes

The sinfo command is used to obtain information about the actual nodes.
You can request the report for a single or all partitions.
Some of the more commonly seen values for the STATE are:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu_cuda      up 7-00:00:00     14    mix bun[003-004,068,071-079,082,116]
gpu_cuda      up 7-00:00:00      3   idle bun[005,080-081]
gpu_viz       up 7-00:00:00      4    mix bun[077-079,082]
gpu_viz       up 7-00:00:00      2   idle bun[080-081]

Here are some useful examples.

sinfo -o "%n %e %m %a %c %C"
which yields
HOSTNAMES FREE_MEM MEMORY AVAIL CPUS CPUS(A/I/O/T)

sinfo -O Partition,NodeList,Nodes,Gres,CPUs

sinfo -o "%.P %.5a %.10l %.6D %.6t %N %.C %.E %.g %.G %.m"
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST CPUS(A/I/O/T) REASON GROUPS GRES MEMORY

How to know what resources are actually being utilised by a job?

It is also very important that your jobs do not occupy valuable HPC resources without utilising them. The fairshare system calculates your usage based on the resources you requested. If you order a lot of food, but you don’t eat it all, then you still pay for the food you ordered. HPCs don’t have a mechanism like doggy bags ;-)

The fairshare system will account for your usage based on the resources (CPU and/or MEM) you requested even if you did not consume them all. That is because no one else could use those resources while your job was running. A job that finishes earlier than the requested walltime does get charged less.

We often don’t know in advance how much resource a set of jobs may require. You may have some idea based on calculations you have performed on a workstation or other HPC cluster. Obviously, if your software cannot utilise a GPU resource you would never request it. But if you have CPU code that is expected to run faster on a single node in shared memory (OpenMP) mode, or on multiple nodes in message passing interface (MPI) mode, then you should check that you are actually getting the performance gains that you expect.

Jobstats

You can use the jobstats module to check the utilisation of running and completed jobs. It will show CPU, CPU RAM, GPU, and GPU RAM utilisation.

module load jobstats
jobstats JobID

A running job

You are able to access any node that is running a job that belongs to you.

The squeue --me command will tell you which node is running your job. You will need to set up SSH public/private keys and authorized_keys to allow you to easily ssh to any compute node. Refer to this section, above, for how to set up SSH key pairs for internal connections within Bunya.

Once logged in to the node that is running your job you can use the top -c -u $USER command and note the %CPU and RES (memory) values for your processes.
The performance of jobs running on NVIDIA GPUs can be monitored using the /usr/bin/nvidia-smi command once you are logged into the node running your job.

Cancelling a job

If you see that your job is not performing as expected, or is not properly utilising the resources you requested, then you should cancel the job.

Use the scancel command to cancel the job. You can cancel an individual job or job array element, or an entire job array.
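These cases differ only in the argument given to scancel: Slurm addresses a single array element as JobID_Index. The sketch below prints the commands rather than running them. The job ID 12345 and the scancel_target helper are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: build the argument for scancel from a job ID and
# an optional array index.
scancel_target() {
    if [ -n "${2:-}" ]; then
        echo "${1}_${2}"
    else
        echo "$1"
    fi
}

# Printed, not executed, so nothing is cancelled by trying this out.
echo "scancel $(scancel_target 12345)"     # a single job, or a whole array
echo "scancel $(scancel_target 12345 3)"   # only element 3 of the array
```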

Before resubmitting the job, review the job resource request and job inputs and code.

Your completed jobs, using sacct

The sacct command can be used to report CPU and Memory utilisation by a completed job.

sacct -p  -a -S now-48hours --format JobID,User,Group,State,Cluster,AllocCPUS,REQMEM,TotalCPU,Elapsed,MaxRSS,ExitCode,NNodes,NodeList,NTasks -u $USER

yields these metrics

|JobID|User|Group|State|Cluster|AllocCPUS|ReqMem|TotalCPU|Elapsed|MaxRSS|ExitCode|NNodes|NodeList|NTasks|
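From these fields you can estimate how well a job used the cores it was given. A common definition, assumed here, is TotalCPU divided by Elapsed multiplied by AllocCPUS. The to_seconds and cpu_efficiency helpers below are hypothetical names for this sketch, which expects times in the [D-]HH:MM:SS or MM:SS.mmm forms.

```bash
#!/bin/bash
# Hypothetical helper: convert [D-]HH:MM:SS or MM:SS[.mmm] to whole seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        s = $NF
        m = (NF >= 2) ? $(NF-1) : 0
        h = (NF >= 3) ? $(NF-2) : 0
        d = (NF >= 4) ? $(NF-3) : 0
        printf "%d\n", d*86400 + h*3600 + m*60 + s
    }'
}

# Hypothetical helper: arguments are TotalCPU, Elapsed and AllocCPUS.
# Prints the CPU efficiency as a percentage.
cpu_efficiency() {
    local used elapsed
    used=$(to_seconds "$1")
    elapsed=$(to_seconds "$2")
    awk -v u="$used" -v e="$elapsed" -v c="$3" \
        'BEGIN { if (e * c > 0) printf "%.1f\n", 100 * u / (e * c); else print "n/a" }'
}

# Made-up values: 1 hour of CPU time over 30 minutes on 4 cores.
cpu_efficiency 01:00:00 00:30:00 4   # prints 50.0
```

A low percentage suggests the job requested more cores than it could keep busy.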

Notes:

Your completed jobs, using seff

The perl utility script /usr/local/bin/seff will generate a brief, more readable report of the total resource utilisation (CPU and MEM). It does not report on GPU utilisation.

seff JobID