Introduction

To run an application (a job), computational resources must first be allocated. ARIS uses the SLURM Workload Manager (Simple Linux Utility for Resource Management) to distribute workloads across the supercomputer. For more information, see the SLURM Quick Start User Guide: https://computing.llnl.gov/linux/slurm/quickstart.html

SLURM has three key functions.

First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.

Finally, it arbitrates contention for resources by managing a queue of pending work.

Common terms:

  • nodes: the compute resources managed by SLURM
  • partitions (queues): logical groups of nodes
  • jobs: allocations of resources assigned to a user for a specified amount of time
  • job steps: sets of (possibly parallel) tasks within a job
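
The terms above can be seen together in a batch script. The following is only a minimal sketch: the file name job.sh, the job name, the partition name "compute" and the resource numbers are placeholders chosen for illustration, not values taken from this guide. The snippet writes the script to disk and checks its syntax; it does not submit anything.

```shell
#!/bin/bash
# Write a small, hypothetical batch script to job.sh (nothing is submitted here).
cat > job.sh <<'EOF'
#!/bin/bash
# job: one allocation of resources, identified by a name and a job ID
#SBATCH --job-name=demo
# partition (queue): the node group to draw from; "compute" is a placeholder
#SBATCH --partition=compute
# nodes: number of compute nodes requested
#SBATCH --nodes=2
# tasks to start on each allocated node
#SBATCH --ntasks-per-node=4
# wall-clock limit for the allocation (10 minutes)
#SBATCH --time=00:10:00

# Each srun call below is one job step inside the job's allocation.
# Job step 0: runs hostname as 8 tasks (2 nodes x 4 tasks per node).
srun hostname
# Job step 1: runs a single task.
srun --ntasks=1 date
EOF

# Check the script's shell syntax without running it.
bash -n job.sh && echo "job.sh is syntactically valid"
```

The two srun lines run one after the other, so the job contains two sequential job steps that share the same node allocation.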

SLURM commands

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive: for example, srun -n sets the number of tasks, while srun -N sets the number of nodes.

  • sacct is used to report job or job step accounting information about active or completed jobs.

  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

  • scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.

  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.

  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

  • srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job’s node allocation.
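
As a rough illustration of how the commands above fit together, the sketch below walks through a typical submit, monitor, and cancel cycle. The job ID 12345, the partition name "compute" and the script name job.sh are placeholders, and the slurm wrapper function is a hypothetical helper that only prints each command unless DRY_RUN=0 is set, so the snippet can be read or run without touching a real queue.

```shell
#!/bin/bash
# Dry-run wrapper (hypothetical helper): print the command by default,
# execute it only when DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"
slurm() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Inspect the state of a partition and its nodes ("compute" is a placeholder).
slurm sinfo --partition=compute

# 2. Submit a batch script for later execution.
slurm sbatch job.sh

# 3. List your own pending and running jobs.
slurm squeue --user="$(id -un)"

# 4. Show the full record of one job (12345 is a placeholder job ID).
slurm scontrol show job 12345

# 5. Cancel the job if it is no longer needed.
slurm scancel 12345

# 6. Report accounting information for the job after it has ended.
slurm sacct --jobs=12345 --format=JobID,JobName,State,Elapsed
```

On a real system the job ID comes from the output of sbatch, which reports it when the job is accepted.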