Welcome to the Slurm guide! This quick guide will walk you through using Slurm on the cheese cluster.
We run Slurm for a few reasons:
- Some people need to fire off thousands or millions of tasks/jobs that each consume very few resources, but we want to be fair to other users on the machine.
- Some people need very quiet machines to get accurate performance numbers.
- We need to (scalably) arbitrate allocations to scarce resources (GPUs, FPGAs, etc.)
NOTE: We are NOT using Slurm the way it is used in supercomputers! We are NOT removing your ability to interactively log-in and use the machines that we have enrolled in Slurm!
NOTE: By default, if you submit a job to Slurm, it will run on one of the x86 machines we have in the cluster. This currently includes:
- dubliner
- roquefort
- limburger
- jarlsberg
- manchego
- v-test-phi[0..3]
We have also enrolled Burrata in the cluster, but your jobs will NOT be sent to Burrata by default. We did this to give people fewer surprises when developing their software. Please ask one of the Cheese administrators (Nick, Karl, Alex, or Peter) how to access Burrata using Slurm.
There are two ways to submit jobs, corresponding to "the two ways" to use a computer:
- Batch/Async/Background
- Interactive/Sync
There is exactly one command for each of these:
sbatch <job-script>
salloc
Any job you submit will run as your user!
Mind that the only storage shared across all nodes is /tank; when your job is running (either in batch or interactive mode), you are working with that node's local storage.
This is another good reason to leverage /tank!
Since different nodes may have different software installed, we recommend you make use of Nix and Nix flakes to pull in your project dependencies when possible. If you do not want to use Nix, you can force your jobs to run on the same node you are using for development (see "Useful Flags" below).
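For example, a job script along these lines (the project path and make targets are hypothetical) lets the job fetch its own toolchain via the project's flake, no matter which node it lands on:

#!/usr/bin/env bash
# Hypothetical project checked out on the shared /tank storage
cd /tank/$USER/my-project
# Enter the dev shell defined by the project's flake, then build and test
nix develop --command make test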
sbatch lets you submit a "job-script" to Slurm for execution.
Slurm will schedule your job to execute when the resources you requested become available.
NOTE: You MUST give sbatch something that starts with a shebang line (#!/usr/bin/env bash for instance).
Below is an example of how to use sbatch:
$ cat example-job.sh
#!/usr/bin/env bash
echo "Hello to the world, from Slurm!"
exit 0
$ sbatch ./example-job.sh
You can use command-line flags to configure sbatch submissions.
But if you have a set of flags you want to reuse over and over again, consider using a comment-directive.
Below is an example job-script that shows how to use comment-directives.
#!/usr/bin/env bash
# Normal comment
# Below is an example of a comment-directive
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --job-name='my-special-job'
# You can comment-out a comment-directive easily:
##SBATCH --mem=3K
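# NOTE: Flags passed on the sbatch command line override these
# comment-directives, so you can tweak a submission without editing the script.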
# NOTE: Comment-directives will only be applied BEFORE any command in the shell script!
set -o errexit
Using salloc on any of the cheese machines will grant you an allocation on one of the other x86 cheese machines by default.
Exactly which one depends on what everybody else is doing.
Unlike Slurm instances you have used before, you do NOT need to request an allocation then open a shell.
We do this for you by default.
If you just type salloc, you will be granted an allocation on a machine and your shell will be redirected to that machine, just like an SSH connection.
To end your interactive allocation, you can use the normal exit command or press Ctrl+d.
An example of salloc's use and output is shown below:
karl@dubliner:~$ salloc
salloc: Granted job allocation 717
groups: cannot find name for group ID 125
karl@manchego:~$ <Ctrl+d>
exit
salloc: Relinquishing job allocation 717
Perhaps you only need 8 cores for your job.
You can use the -c/--cpus-per-task flag to provide that information.
You can use the nproc command to determine how many cores you have once you are given an allocation.
karl@dubliner:~$ salloc -c 8
salloc: Granted job allocation 750
karl@roquefort:~$ nproc
8
karl@roquefort:~$ exit
salloc: Relinquishing job allocation 750
Sometimes you want to request a job allocation but not immediately switch your shell to the remote machine.
Use the --no-shell flag to achieve this behavior.
When you want to run job steps on your allocation you must submit the job step using srun.
This is the same behavior as when you use job steps in sbatch.
Note that you can submit multiple job steps against this allocation at the same time.
You can also optionally allow the job steps to overlap if you want using the --overlap flag.
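For example, here is a sketch of firing two concurrent steps at an existing allocation (the step scripts are hypothetical):

karl@dubliner:~$ srun --jobid=<jobid> --overlap ./step-one.sh &
karl@dubliner:~$ srun --jobid=<jobid> --overlap ./step-two.sh &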
If you want an interactive shell on the allocation, you need to use the --pty flag to connect your terminal to the Slurm input/output.
There is NO easy way to detach from an interactive allocation. You need to use another tool, like tmux, to achieve this behavior. In summary, you CAN attach to an allocation after requesting it, but you CANNOT detach from an allocation that was immediately opened as interactive (an allocation that immediately opens a shell).
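One workaround (a sketch; the session name is arbitrary) is to start the interactive allocation inside tmux, then detach the whole terminal instead:

karl@dubliner:~$ tmux new -s slurm-work
karl@dubliner:~$ salloc
# ... work on the allocated node, then detach with Ctrl+b d ...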
Below is an example of requesting an allocation for an indefinite amount of time then delaying opening a shell until a later time.
karl@dubliner:~$ salloc --no-shell
salloc: Granted job allocation 1389
karl@dubliner:~$ srun --pty --overlap --jobid 1389 bash
karl@roquefort:~$
If you are done with your allocation before your time limit, then you must cancel your job using scancel.
# Finished with roquefort. Since this was an indefinite allocation, I must
# manually cancel it
karl@dubliner:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1389 compute no-shell karl R 12:29 1 roquefort
karl@dubliner:~$ scancel 1389
karl@dubliner:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Here are some recommendations from us, the admins, about how you could use Slurm:
- #SBATCH comment-directives in your submitted script MUST come before any shell commands! Comments before and after #SBATCH comment-directives are allowed, but Slurm stops looking for them as soon as the first non-comment shell command is found.
- Make sure your scripts record some unique component when writing outputs. Slurm does this automatically with STDOUT on your behalf. But if you create files/directories, then Slurm will not handle that for you and you must manually uniquify them. We recommend either a timestamp or the job ID, as those will be quite unique over time (see the sketch after this list).
- Try to make the script that runs inside the job fairly idempotent. For example, it can assume it is running inside of Slurm, but it should not assume that the directory it is working in is your working copy. Instead, make the working script assume a fresh copy of your work, and that one of the tasks of your Slurm job is to compile your test programs. Obviously, this is not tenable for enormous compile jobs. Use your best judgment.
- Take advantage of the --chdir flag on sbatch to change the working directory of the Slurm job. You can build a submission directory that your job should work out of, which helps keep your working files separate from your job files.
- For very long sbatch runs, make use of srun to split the job up into steps. Slurm can restart jobs that fail for a variety of reasons (your node went down, the server crashed, etc.), but only at the granularity of a step.
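Here is a sketch of a job script that ties several of these recommendations together (the paths, targets, and program names are hypothetical):

#!/usr/bin/env bash
#SBATCH --job-name='my-experiment'
# Uniquify outputs with the job ID so repeated runs never collide
OUTDIR=/tank/$USER/results/$SLURM_JOB_ID
mkdir -p "$OUTDIR"
# Assume a fresh copy of the work: build first, then run, as separate steps
srun make
srun ./my-test-program > "$OUTDIR/output.log"

You could then submit it against a dedicated submission directory:

karl@dubliner:~$ sbatch --chdir=/tank/karl/submissions/my-experiment ./job.sh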
Here is a quick run-down of useful flags you can pass to some Slurm commands.
Please check the documentation for each command (man sbatch for instance), as some flags may not be present on certain commands or may behave slightly differently.
In addition, there are MANY more flags than the ones we present below; check the documentation for them all.
- --exclusive: Take exclusive control over the machine, even if you requested fewer resources than the whole machine has.
- -w/--nodelist=<node-list>: Run on a specific host.
- -c/--cpus-per-task=<ncpu>: Give your job the specified number of hyperthreads/logical cores. NOTE: If you choose an odd number, it will be rounded up to give you a full physical CPU. Your assigned hyperthreads will NOT overlap with another person's/job's.
- --mem=<size>[units]: Give your job the specified TOTAL amount of memory. Size is just a number; units is ONE of [K, M, G, T].
- -p/--partition=<partition-name>: Choose a different partition to put your job on. Burrata is in a different partition, for instance.
A contrived example of using all of these flags is shown below:
karl@dubliner:~$ sbatch -c 8 --mem 8G \
--partition=intel --nodelist=manchego \
--exclusive \
example-batch-file.sh
Submitted batch job 751
Use the squeue command to look at the current Slurm queue.
karl@dubliner:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
719 compute wrap karl R 0:06 1 manchego
720 compute interact karl R 0:02 1 manchego
721 compute interact nick R 0:09 1 roquefort
If there are a lot of jobs in the queue, it is easy to lose track of your jobs' positions.
There is a -u/--user=<user1,user2> flag to make finding your jobs easier:
karl@dubliner:~$ echo $USER
karl
karl@dubliner:~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
719 compute wrap karl R 0:06 1 manchego
720 compute interact karl R 0:02 1 manchego
We have divided the cheese machines up a little bit based on what features they have. These include ISA (x86 vs. ARM), configuration (number of sockets), and other oddities (Xeon Phis). The sections below show you how to find this information.
Partitions are a way to group multiple nodes together.
By default all of your jobs will run on the compute partition, which comprises a majority of our machines (note the Default=YES in the compute partition).
However, we have defined some other partitions as well.
For example, all machines with Intel CPUs are grouped under the intel partition.
Below is an example of how to list all partitions in the cluster:
karl@dubliner:~$ scontrol show partition
PartitionName=cheese
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=burrata,colbyjack,dubliner,jarlsberg,limburger,manchego,pepperjack,roquefort,string,toussaint
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=556 TotalNodes=10 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
PartitionName=compute
...
AllocNodes=ALL Default=YES QoS=N/A
...
Nodes=dubliner,jarlsberg,limburger,manchego,roquefort
...
PartitionName=intel
...
Nodes=dubliner,manchego
...
Every computer in the cheese cluster that Slurm can run jobs on is a node. Each node may have different kinds of resources, e.g. Dubliner with its GPU or Toussaint with its FPGA. Below is an example of how to list all nodes in the cluster:
karl@dubliner:~$ scontrol show node
NodeName=burrata Arch=aarch64 CoresPerSocket=128
CPUAlloc=0 CPUTot=128 CPULoad=42.68
...
NodeName=dubliner Arch=x86_64 CoresPerSocket=22
CPUAlloc=0 CPUTot=176 CPULoad=7.21
...
NodeName=jarlsberg Arch=x86_64 CoresPerSocket=24
CPUAlloc=0 CPUTot=48 CPULoad=0.00
...
The phi machines (v-test-phi{0..3}) have been enrolled into the cheese cave: users synced, /tank mounted, and Nix (with flakes) installed.
All 4 phis are identical software-wise, and closely match the other machines in the cluster.
You DO NOT want to develop on those machines! They have incredibly slow CPUs, and unless you can use 4-way barrel SMT, the 256-core versions are even slower. BUT, these machines have a really interesting topology! 4-way SMT (each hyperthread has a full AVX-512 unit), 64 GiB of DRAM, and 16 GiB of on-die HBM. The HBM & DRAM layout is configurable too, so the HBM can be treated as a "fast" cache or as very slow memory.
We have varied across these two axes on our four machines to deliver (hopefully) all possible configurations:
- phi0 - No HT (64 cores), HBM Far
- phi1 - Yes HT (256 cores), HBM Far
- phi2 - No HT, HBM Near
- phi3 - Yes HT, HBM Near
"HBM Far" means the HBM is treated as very slow memory and DRAM is preferred by Linux. "HBM Near" means the HBM is treated as a very fast L3-like cache.
We ask that you do not log into the phis directly, but instead "request time" on them through Slurm (either salloc for interactive or sbatch for batch jobs, see the SLURM Getting Started file in /tank for more information), since these machines have such odd performance characteristics.
We want to make sure your results are sensible given the hardware & software you are running on.
We are doing this so that people cannot accidentally step on others' toes.
We are not going to track time-used or anything.
You have full access to /tank from the phis, so take advantage of that: do your development elsewhere and use the phis only to run workloads.
Below is a command to request 16 cores from phi1:
karl@dubliner:~$ salloc -p phis -w v-test-phi1 -c 16
salloc: Pending job allocation 974
salloc: job 974 queued and waiting for resources
salloc: job 974 has been allocated resources
salloc: Granted job allocation 974
karl@v-test-phi1:~$ nproc
16
karl@v-test-phi1:~$
exit
salloc: Relinquishing job allocation 974
salloc: Job allocation 974 has been revoked.
karl@dubliner:~$
In the future, we may prevent you from logging into SOME of the machines directly. For example, the v-test-phi machines are quite slow to develop on, making them painful to use. We might also designate jarlsberg as a Slurm-dedicated machine. This would make those machines ideal for doing performance evaluation, since no one would ever be logged in.