
Using Slurm

Slurm Spirit Partitions

There are two partitions:

  • zen4: nodes with 64 cores and 240 GB of memory for jobs (default partition)
    • memory per CPU: 4000M, default time: 1H, max time: 168H
  • zen16: nodes with 32 cores and 496 GB of memory for jobs
    • memory per CPU: 16000M, default time: 1H, max time: 72H

Multi-node runs are not allowed on these partitions.
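
If you want to check these settings yourself, a standard sinfo query such as the sketch below should list them (the format string is only an example, not a site-specific command):

# list both partitions with CPUs per node, memory per node (MB) and maximum time
sinfo --partition=zen4,zen16 --format="%P %c %m %l"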

Slurm User limits

These limits apply to the zen4 and zen16 partitions:

  • Internal Users: cpu=96,mem=262144 MAXJOBS=64 MaxSubmitJobs=1000
  • External Users: cpu=48,mem=131076 MAXJOBS=32 MaxSubmitJobs=500

If any of the cpu, mem or MAXJOBS limits is reached, your other jobs stay in the PENDING state until you are back under all of the limits.

The local command slqueue gives you the reason in this case (a standard squeue equivalent is sketched after the list):

  • QOSMaxCpuPerUserLimit
  • QOSMaxMemoryPerUser
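
If slqueue is not at hand, a standard squeue query should show the same reason for your pending jobs; the format string below is only an example:

# show your own jobs with state, elapsed time and pending reason (%r)
squeue -u $USER --format="%.10i %.9P %.8T %.10M %r"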

If you try to submit more than the MaxSubmitJobs value, the jobs over the limit are rejected.

Submitting Interactive Jobs

  • default partition: zen4
  • default time: 1H
  • default CPUs: 1
  • max time for interactive jobs: 6H

Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.

srun --pty [--x11] [--partition zen4|zen16] [--time value] [--mem value] [--ntasks num_task] bash

Samples:

1 core for 1H on the zen4 partition with 4000M of memory (all defaults):

srun --pty --x11 bash 
JobId   State Partition    Username Account Node CPUs Memory Batch    TimeLeft   TimeLimit             Node/Reason
 4252 RUNNING      zen4       xxxxx  yyyyy    1    1  4000M    0       55:30     1:00:00             spirit64-01

1 core for 2H with 6 GB of memory:

srun --pty --x11 --mem 6G --time 2:00:00 bash 
JobId   State Partition    Username Account Node CPUs Memory Batch    TimeLeft   TimeLimit             Node/Reason
 4253 RUNNING      zen4       xxxx  yyyyy    1    2     6G    0     1:59:55     2:00:00             spirit64-01

You can see the system gave me 2 CPUs although I asked for only one. Memory per CPU is 4000M on the zen4 partition, so the scheduler allocates the extra CPU to your job rather than leave a CPU without enough memory to be usable by other jobs.

If you need more than 4000M per CPU, it is better to use the zen16 partition:

srun --pty --x11 --partition zen16 --mem 6G --time 2:00:00 bash
JobId   State Partition    Username Account Node CPUs Memory Batch    TimeLeft   TimeLimit             Node/Reason
 4254 RUNNING     zen16        xxxx   yyyyy    1    1     6G    0     1:59:58     2:00:00             spirit32-01

In this case I get only 1 CPU, matching my request.
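
As noted above, do not keep interactive sessions around when you are done: simply exit the shell, and if you forgot one running somewhere you can find and cancel it with standard Slurm commands, for example:

# list your own jobs (an srun --pty session usually shows up with the name "bash")
squeue -u $USER
# cancel a forgotten interactive session by its job ID
scancel <JobId>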

Submitting Batch jobs

  • default partition: zen4
  • default time: 1H
  • default CPUs: 1

sbatch [--partition=zen4|zen16] [--time=value] [--mem=value|--mem-per-cpu=value] [--ntasks=num_task] [--cpus-per-task=num_cpus] script [ARG]

All sbatch options can also be put in the script as shell comments (#SBATCH --option argument).
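
As a minimal sketch of the two equivalent ways to pass options (my_job.sh is a hypothetical script name):

# options given on the command line
sbatch --partition=zen16 --time=2:00:00 my_job.sh
# or the same options written as #SBATCH comments at the top of my_job.sh, then simply
sbatch my_job.sh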

batch samples

All these samples can also be found on the clusters in /net/nfs/tools/meso-u20/batch-samples.

Serial job (can only use one CPU)

If you need more than 4000M of real memory for your job, it is better to submit it to the zen16 partition.

#!/bin/bash
# only one CPU
#SBATCH --ntasks 1
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# partition (zen4 or zen16 depending on your memory requirement)
#SBATCH --partition zen4
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/10.2.0 netcdf-fortran
# execute my program
time my_program

Parallel job (using multiple CPUs)

Inside a job, several SLURM environment variables tell you the number of CPUs, depending on the options you give to Slurm (see the sketch after this list):

  • SLURM_CPUS_ON_NODE (always defined)
  • SLURM_JOB_CPUS_PER_NODE (always defined)
  • SLURM_CPUS_PER_TASK (only defined with the --cpus-per-task=x option)
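
As a quick check, a sketch like the following inside any job script prints the values Slurm actually set (the fallback to 1 when --cpus-per-task was not given is an assumption of this example):

# print the CPU-related variables set by Slurm for this job
echo "SLURM_CPUS_ON_NODE      = $SLURM_CPUS_ON_NODE"
echo "SLURM_JOB_CPUS_PER_NODE = $SLURM_JOB_CPUS_PER_NODE"
# only defined when --cpus-per-task is used; fall back to 1 otherwise
echo "SLURM_CPUS_PER_TASK     = ${SLURM_CPUS_PER_TASK:-1}"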

openmp

Use the sbatch options --ntasks=1 --cpus-per-task=x and define the environment variable OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK in your job.

#!/bin/bash
# OpenMP uses multiple CPUs on a single node; the number of CPUs must be requested
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
#SBATCH --cpus-per-task=x
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/9.4.0 openblas/0.3.17-openmp
# with the option --cpus-per-task=x the number of CPUs is given by
# Slurm in the environment variable SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time my_code

multithreads ( pthreads )

Use the sbatch options --ntasks=1 --cpus-per-task=x.

Controlling the number of threads is generally only possible in the code itself; Python multiprocessing falls into this category. Remember to match your number of threads to the requested number of cores (and keep it low, max 4, on the head node). Using cpu_count is not a good option on the cluster; the best way is:

os.getenv('SLURM_CPUS_PER_TASK')
#!/bin/bash
# multithreaded code or Python multiprocessing
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
# get the value of the environment variable in your job to set the number of processes
# in python: os.getenv('SLURM_CPUS_PER_TASK') instead of 'cpu_count'
#SBATCH --cpus-per-task=x
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
module load pangeo-meso/2024.01.22
time python my_program.py

mpi

Use our openmpi module; it is compiled to support SLURM directly. Define the number of MPI tasks with --ntasks=x or --ntasks-per-node=x. Do not use mpirun -np value, only mpirun (the number of processes is passed directly from Slurm to mpirun).

#!/bin/bash
# OpenMPI job
# --ntasks=x gives the number of MPI processes
#SBATCH --ntasks=x
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/11.2.0 openmpi/4.0.7
#
time mpirun myprog

hybrid code mpi/openmp

Use --ntasks or --ntasks-per-node for the MPI instances and --cpus-per-task for the OpenMP threads.

#!/bin/bash
# hybrid MPI/OpenMP code
# number of MPI instances
#SBATCH --ntasks-per-node=x
# number of OpenMP threads per MPI instance
#SBATCH --cpus-per-task=y
#SBATCH --partition=zen4
#SBATCH --time 120
module purge
module load intel openmpi/4.0.7 intel-mkl/2020.4.304-openmp
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time mpirun my_program

Job Array

This is a way to launch many similar jobs. All jobs in a job array have the same properties (number of CPUs, memory requirement and time limit).

This can be a good approach for data processing.

Samples:

# Submit a job array with index values between 1 and 31
# good for day of month 
$ sbatch --array=1-31  --ntasks 1 --time 1:00:00 --mem 1g myscript

# Submit a job array with index values of 1990,1991 ... 2010 
# only 10 running at the same time
$ sbatch --array=1990-2010%10 --ntasks=1 tmp

# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
$ sbatch --array=1-7:2   --ntasks 1  tmp

In your script you have access to the environment variable $SLURM_ARRAY_TASK_ID, which is the index of your array element.
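
A minimal sketch of how a job-array script can use this index (the input file naming input_<index>.nc is only an assumption for the example):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time 1:00:00
#SBATCH --array=1-31
# pick the input file for this array element (hypothetical naming scheme)
INPUT=input_${SLURM_ARRAY_TASK_ID}.nc
time my_program $INPUT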

To learn more, see https://slurm.schedmd.com/job_array.html

Job Chaining

Job chaining means resubmitting the same job from within a job with sbatch.

This can cause problems if the new job starts before the previous one has completely ended; be aware that there is a system epilog at the end of your job that introduces a delay.

To achieve this, the best way is:

  • sbatch must be the last command of your job
  • run the next job only if the current one finished successfully

The following command ensures the current job has really finished before the next one starts; the SLURM_JOBID variable contains the current job ID.

sbatch --dependency=afterok:${SLURM_JOBID} myscript 
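
Putting the two rules together, a minimal sketch of a self-chaining script could look like this (my_chained_job.sh stands for the script's own name, a hypothetical example):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time 2:00:00
module purge
# ... the real work of the job ...
time my_program
# sbatch is the last command of the job; afterok makes the next job wait until
# this one has completely finished and start only if it ended OK
# NOTE: add your own stop condition here to avoid an endless chain
sbatch --dependency=afterok:${SLURM_JOBID} my_chained_job.sh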

Job statistics

It is no longer possible to have job statistics directly in the output file (as on ciclad or climserv).

Statistics for:

  • Running jobs: command sstat, e.g. sstat -a -j <JobId>

  • Finished jobs: seff <JobId> or sacct -j <JobId> (a sacct format example follows the seff output below)

~# seff  4928
===================================================
Job ID: 4928
Cluster: spirit
User/Account/Group: xxxxx/xyyyy/zzzzzz
Job Name: Test
Running partition: zen4 on spirit64-02
Started : 2022/09/30-17:11:41
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:56:15
CPU Efficiency: 68.71% of 01:21:52 core-walltime
Asked time         : 01:00:00
Job Wall-clock time: 00:05:07
Memory Utilized: 30.68 GB
Memory Efficiency: 49.09% of 62.50 GB
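
With sacct you can also pick the fields you want for a finished job; the format list below is only an example of commonly useful fields:

# job ID, name, elapsed time, CPU time, peak memory and final state
sacct -j <JobId> --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State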

To be continued