Job management with SLURM

You should not run your compute code directly on the terminal you find when you log in. The login server dce.metz.centralesupelec.fr is not suited for computations.

In order to submit a job on the cluster, you need to describe the resources you need (cores, memory, time) to the task manager Slurm. The task manager launches the job on a remote compute node as soon as the requested resources are available. The job is executed in a virtual resource chunk called a CGROUP; see the section on CGROUPS below for more information.

There are two ways to run a compute code on the DCE:

  • using an interactive Slurm job: this opens a terminal on a compute node where you can execute your code. This method is well suited for light tests and environment configuration (especially for GPU-accelerated codes). See the section Interactive jobs.
  • using a Slurm script: this submits your script to the scheduler, which runs it when the resources are available. This method is well suited for "production" runs.

Slurm is configured with a "fairshare" policy among users: the more resources you have requested in recent days, the lower your priority will be when the task manager has several jobs to handle at the same time.

In addition to this page, which documents Slurm commands in the context of the DCE, you can check the Slurm workload manager documentation.

Slurm script

Most of the time, you will run your code through a Slurm script. This script has the following functions:

  • specify the resources you need for your code: partition, walltime, number of nodes, etc.
  • specify other parameters for your job (project which your job belongs to, output files, mail information on your job status, job name, etc.)
  • if you use GPUs, specify the type of GPUs requested (not yet available)
  • setup the batch environment (load modules, set environment variables)
  • run the code

Running the code will depend on your executable. Parallel codes may need to use srun or to have specific environment variables set.
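
For illustration, here is a minimal sketch of such a script; the partition, walltime, module and Python script are placeholders to adapt to your own code:

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=gpu_inter
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j.out

# Set up the batch environment (the module name is hypothetical).
# module load anaconda3

# Run the code.
python my_script.py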

SLURM partitions

  • By default, the partition is set to gpu_inter.
  • You can change this setting by choosing a partition that matches the resources you need from the list of available partitions.

Slurm directives

You describe the resources you need in the submission script, using sbatch directives (script lines beginning with #SBATCH). These options can also be passed directly on the sbatch command line, or listed in a script. Using a script is the best solution if you want to submit the job several times, or several similar jobs.
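
For example, the same resource request can be given either on the sbatch command line (a sketch, with placeholder values):

$ sbatch --partition=gpu_inter --time=01:00:00 job0

or as directives at the top of the job0 script, submitted with a plain sbatch job0:

#SBATCH --partition=gpu_inter
#SBATCH --time=01:00:00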

How to describe your requested resources with SBATCH

nodes

Number of nodes:

#SBATCH --nodes=<nnodes>

ntasks

Number of tasks (MPI processes):

#SBATCH --ntasks=<ntasks>

ntasks-per-node

Number of tasks (MPI processes) per node:

#SBATCH --ntasks-per-node=<ntpn>

cpus-per-task

Number of threads per process (Ex: OpenMP threads per MPI process):

#SBATCH --cpus-per-task=<ntpt>

exclusive

Allocated nodes are reserved exclusively, in order to avoid sharing them with other running jobs. Recommended for MPI jobs.

#SBATCH --exclusive

time

Specify the walltime for your job. If your job is still running when the walltime expires, it will be killed:

#SBATCH --time=<hh:mm:ss> 

partition

Specify the Slurm partition your job will be assigned to:

#SBATCH --partition=<PartitionName>

where PartitionName is one of the names in the partition list.
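
Putting these directives together, the header of a parallel (MPI + OpenMP) job could look like the following sketch, where the partition name and the counts are placeholders:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
#SBATCH --time=02:00:00
#SBATCH --partition=<PartitionName>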

SBATCH additional directives

job-name

Define the job's name:

#SBATCH --job-name=jobName

output

Define the standard output (stdout) for your job:

#SBATCH --output=outputJob.txt

The default is --output=slurm-%j.out.

If you need to direct the stdout to a specific directory, you must first create the directory, say logs, and then set the option as --output=logs/slurm-%j.out.
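
For example, assuming you want the per-job logs in a logs directory next to your submission script:

$ mkdir -p logs

and, in the script:

#SBATCH --output=logs/slurm-%j.out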

error

Define the error output (stderr) for your job:

#SBATCH --error=errorJob.txt

By default both standard output and standard error are directed to the same file.

mail-user

Set an email address:

#SBATCH --mail-user=firstname.lastname@mywebserver.com 

mail-type

To be notified by email when the job reaches a given state:

#SBATCH --mail-type=ALL

Arguments for the --mail-type option are:

  • BEGIN: send an email when the job starts
  • END: send an email when the job ends
  • FAIL: send an email if the job fails
  • ALL: equivalent to BEGIN, END, FAIL.
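
For example, to be emailed only when the job ends or fails (the address is a placeholder; mail-user and mail-type are typically used together):

#SBATCH --mail-user=firstname.lastname@mywebserver.com
#SBATCH --mail-type=END,FAIL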

export

Export user environment variables

  • By default all user environment variables will be loaded (--export=ALL).
  • To avoid dependencies and inconsistencies between the submission environment and the batch execution environment, it is highly recommended to disable this behaviour. To avoid exporting the environment variables present at submission time into the job's environment:
#SBATCH --export=NONE
  • To explicitly select which variables from the caller's environment are exported to the job environment:
#SBATCH --export=VAR1,VAR2

You can also assign values to these exported variables, for example:

#SBATCH --export=VAR1=10,VAR2=18

propagate

  • By default, all resource limits (those reported by the ulimit command, such as stack size, number of open files, number of processes, ...) are propagated (--propagate=ALL).
  • To avoid propagating your interactive session's limits and overriding the batch resource limits, it is recommended to disable this behaviour:
#SBATCH --propagate=NONE

account

  • By default, the compute time consumption is charged to your default project account.
  • To charge another project account, you can specify it with --account.
  • To see the association between a job and a project, you can use the squeue, scontrol or sacct commands.
#SBATCH --account=<MY_PROJECT>
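
For example, to check afterwards which account a finished job was charged to (a sketch, with an illustrative jobid):

$ sacct -j 29509 --format=JobID,JobName,Account,State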

Submit and monitor jobs

submit job

You submit your script job0 with:

$ sbatch job0
Submitted batch job 29509

The command replies with the jobid assigned to the job; in this example, the jobid is 29509. The jobid is a unique identifier used by many Slurm commands.
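
For instance, you can display the details of a pending or running job from its jobid:

$ scontrol show job 29509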

monitor job

The squeue command shows the list of jobs:

$ squeue
JOBID PARTITION           NAME     USER ST       TIME  NODES NODELIST(REASON)
29509 gpu_prod_night      job0 username  R       0:02      1 tx13
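
To list only your own jobs, you can restrict squeue to your user name (replace username with your login):

$ squeue -u username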

cancel job

The scancel command cancels jobs.

To cancel job job0 with jobid 29509 (obtained through squeue), you would use:

$ scancel 29509
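
scancel also accepts filters; for example, to cancel all of your own jobs at once (replace username with your login):

$ scancel -u username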

Interactive jobs

  • Example 1: open an interactive session on one node for 30 minutes
$ srun --nodes=1 --time=00:30:00 -p gpu_inter --pty /bin/bash
[user@cam10 ~]$ hostname
cam10
  • Use the --x11 option if you need X forwarding, as shown in the sketch below.
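
For instance, the same interactive request with X forwarding enabled:

$ srun --nodes=1 --time=00:30:00 -p gpu_inter --x11 --pty /bin/bash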

job arrays

Job arrays are only supported for batch jobs. The array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, or a range with a step size, as shown in the examples below. Jobs which are part of a job array will have the environment variable SLURM_ARRAY_TASK_ID set to their array index value.

# Submit a job array with index values between 0 and 31
[user@chome ~]$ sbatch --array=0-31 job

# Submit a job array with index values of 1, 3, 5 and 7
[user@chome ~]$ sbatch --array=1,3,5,7 job

# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
[user@chome ~]$ sbatch --array=1-7:2 job

The sub-jobs should not depend on each other: Slurm may start them in any order, and possibly at the same time.
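
As an illustration, each sub-job can use SLURM_ARRAY_TASK_ID to select its own piece of work; in the sketch below the partition, the Python script and the input file naming scheme are hypothetical:

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --partition=gpu_inter
#SBATCH --time=00:30:00
#SBATCH --output=slurm-%A_%a.out

# In the output pattern, %A is the master job id and %a the array index.
# Each sub-job processes the input file matching its own array index.
python process.py --input data_${SLURM_ARRAY_TASK_ID}.txt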

chain jobs

If you want to submit a job that must be executed after another job, you can use Slurm's job chaining (dependency) mechanism.

[username@chome ~]$ sbatch slurm_script1.sh
Submitted batch job 74698
[username@chome ~]$ squeue 
JOBID PARTITION     NAME     USER      ST    TIME    NODES  NODELIST(REASON)
74698  *******      *******  username  PD    0:00    *      *******
[username@chome ~]$ sbatch --dependency=afterok:74698 slurm_script2.sh
Submitted batch job 74699
[username@chome ~]$ sbatch --dependency=afterok:74698:74699 slurm_script3.sh
Submitted batch job 74700

Note that if one of the jobs in the sequence fails, the following jobs remain pending by default with the reason "DependencyNeverSatisfied" and can never be executed. You must then delete them with the scancel command. If you want these jobs to be cancelled automatically on failure, specify the --kill-on-invalid-dep=yes option when submitting them, as in the example below.
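
For example, reusing the jobids above:

[username@chome ~]$ sbatch --dependency=afterok:74698 --kill-on-invalid-dep=yes slurm_script2.sh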

Here are the common chaining rules:

  • after:<jobid> = the job can start once the referenced job has started execution
  • afterany:<jobid> = the job can start once the referenced job has terminated
  • afterok:<jobid> = the job can start once the referenced job has terminated successfully
  • afternotok:<jobid> = the job can start once the referenced job has terminated with a failure
  • singleton = the job can start once any previous job with the same name and user has terminated

Accounting

Use the sacct command to get information about your finished jobs.

Note: on the DCE, the accounting information is restricted to your own jobs.
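
For example, a sketch showing a summary of a finished job (the jobid is illustrative and the format fields can be adapted):

$ sacct -j 29509 --format=JobID,JobName,Partition,Elapsed,State,ExitCode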

CGROUPS

Your Slurm job will be executed in a virtual resource chunk called a CGROUP, formed from the allocated amount of RAM, cores and GPUs. In some cases, you will only be able to see the resources that were allocated to your job.