Using Slurm
Slurm Spirit Partitions
There are two partitions:
- zen4: nodes with 64 cores and 240 GB of memory for jobs (default partition)
  - memory per CPU: 3840M, default time: 1H, max time: 168H
- zen16: nodes with 32 cores and 496 GB of memory for jobs
  - memory per CPU: 15872M, default time: 1H, max time: 72H
Multi-node runs are not allowed on these partitions.
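To check the partitions and their limits directly on the cluster, the standard sinfo command can be used; the output format string below is only one possibility:
# partition, CPUs per node, memory per node (MB), time limit
sinfo -o "%12P %5c %8m %10l"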
Slurm User limits
These limits apply to the zen4 and zen16 partitions:
- Internal users: cpu=96, mem=262144, MAXJOBS=64, MaxSubmitJobs=1000
- External users: cpu=48, mem=131076, MAXJOBS=32, MaxSubmitJobs=500
If one of the cpu, mem or MAXJOBS limits is reached, your other jobs stay in the PENDING state until you are back under all of these limits.
The local command slqueue gives you the reason in that case:
- Resources: there are not enough resources for your job, but you are first in priority
- Priority: there are not enough resources for your job and you are not first in priority
- Dependency: you asked for a dependency on a job that has not finished running
- QOSMaxCpuPerUserLimit: your running jobs hit the maximum number of CPUs permitted for your account
- QOSMaxMemoryPerUser: your running jobs hit the maximum total memory permitted for your account
- QOSMaxJobPerUser: your running jobs hit the maximum number of jobs permitted for your account
- ReqNodeNotAvail, Reserv: there is a reservation on the cluster made by the admin team for special use or maintenance, and with your requested TimeLimit your job cannot finish before the beginning of the reservation. For example, you requested 7 days and there is a maintenance with full stop in 3 days.
If you try to submit more than the MaxSubmitJobs value, the submitted jobs over the limit are rejected.
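Besides slqueue, the standard squeue command can also display the pending reason for your jobs; the format string below is just an example:
# job id, state and pending reason for your own jobs
squeue -u $USER -o "%.12i %.10T %.30r"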
How to choose the right partition
Use the zen16 partition only if you really need more than 3840M per CPU; otherwise use the zen4 default partition.
For example, asking for 120 GB of memory on zen4 gives you 32 cores even if you have only one task (sequential program); the same request on the zen16 partition gives you only 7 CPUs.
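A sketch of the two requests (my_script is a placeholder):
# zen4: 120G at 3840M per CPU -> 32 CPUs charged to the job
sbatch --partition=zen4 --ntasks=1 --mem=120G my_script
# zen16: 120G at 15872M per CPU -> only a few CPUs charged to the job
sbatch --partition=zen16 --ntasks=1 --mem=120G my_script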
Submitting Interactive Jobs
- default partition: zen4
- default time: 1H
- default CPUs: 1
Max time for interactive jobs is 10H, and only one interactive job per user is allowed on Spirit[X].
It's a limit, not a target ;-)
Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.
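A typical way to start an interactive shell, using standard Slurm syntax (a local wrapper may exist as well; the values are only examples):
# 1 core, 1 hour, default memory per CPU, interactive bash on a zen4 node
srun --partition=zen4 --ntasks=1 --time=1:00:00 --pty bash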
Samples:
1 core for 1H on the zen4 partition with 3840M of memory
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4252 RUNNING zen4 xxxxx yyyyy 1 1 3840M 0 55:30 1:00:00 spirit64-01
1 core for 2H with 6 GB of memory
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4253 RUNNING zen4 xxxx yyyyy 1 2 6G 0 1:59:55 2:00:00 spirit64-01
You can see that the system gave me 2 CPUs even though I asked for only one. Memory per CPU is 3840M on the zen4 partition, so the system allocates you the extra CPU rather than leave it without enough memory to be used by other jobs.
If you need more than 3840M per CPU, it is better to use the zen16 partition (15872M per core).
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4254 RUNNING zen16 xxxx yyyyy 1 1 6G 0 1:59:58 2:00:00 spirit32-01
In this case I get only 1 CPU, matching my request.
Submitting Batch jobs
- default partition: zen4
- default time: 1H
- default CPUs: 1
sbatch [--partition=zen4|zen16] [--time=value] [--mem=value|--mem-per-cpu=value] [--ntasks=num_tasks] [--cpus-per-task=num_cpus] script [ARGS]
All sbatch options can also be put in the script itself as shell comments (#SBATCH --option argument).
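For example, the two submissions below are equivalent (my_script is a placeholder):
# options on the command line
sbatch --partition=zen4 --time=2:00:00 --ntasks=1 my_script
# or the same options at the top of my_script, submitted with a plain "sbatch my_script":
#   #SBATCH --partition=zen4
#   #SBATCH --time=2:00:00
#   #SBATCH --ntasks=1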
batch samples
All these samples can also be found on the clusters in /net/nfs/tools/meso-u20/batch-samples
Sequential job (can only use one CPU)
If you need more than 3840M of real memory for your job, it is better to submit it to the zen16 partition.
#!/bin/bash
# only one CPU
#SBATCH --ntasks 1
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# partition (zen4 or zen16 depending on your memory requirement)
#SBATCH --partition zen4
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/10.2.0 netcdf-fortran
# execute my program
time my_program
Parallel job (using multiple CPUs)
Inside a job, several SLURM environment variables let you know the number of CPUs, depending on the options you gave to Slurm:
- SLURM_CPUS_ON_NODE (always defined)
- SLURM_JOB_CPUS_PER_NODE (always defined)
- SLURM_CPUS_PER_TASK (only defined with the --cpus-per-task=x option)
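A quick way to check them from inside a job script:
# print the CPU-related variables Slurm sets for this job
echo "CPUS_ON_NODE=$SLURM_CPUS_ON_NODE JOB_CPUS_PER_NODE=$SLURM_JOB_CPUS_PER_NODE CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"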
openmp
Use the sbatch options --ntasks=1 --cpus-per-task=x and set the environment variable OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK in your job.
#!/bin/bash
# OpenMP uses multiple CPUs on a single node; the number of CPUs must be requested
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
#SBATCH --cpus-per-task=x
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/9.4.0 openblas/0.3.17-openmp
# with the option --cpus-per-task=x, the number of CPUs is given by
# Slurm in the environment variable SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time my_code
multithreads (pthreads)
Use the sbatch options --ntasks=1 --cpus-per-task=x.
Controlling the number of threads is generally only possible inside the code; Python multiprocessing falls into this category. Remember to match your number of threads to the number of cores you asked for (or, on the head node, keep it low: at most 4). Using cpu_count is not a good option on a cluster; the best way is:
#!/bin/bash
# multithreaded code or Python multiprocessing
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
#SBATCH --cpus-per-task=x
# read the value of SLURM_CPUS_PER_TASK in your job to set the number of processes;
# in Python, use os.getenv('SLURM_CPUS_PER_TASK') instead of cpu_count
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
module load pangeo-meso/2024.01.22
time python my_program.py
mpi
Use our openmpi module: it is compiled to support Slurm directly. Define the number of MPI tasks with --ntasks=x or --ntasks-per-node=x. Do not use mpirun -np value, but only mpirun (the number of processes is passed directly from Slurm to mpirun).
#!/bin/bash
# OpenMPI job
# --ntasks=x gives the number of MPI processes
#SBATCH --ntasks=x
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/11.2.0 openmpi/4.0.7
# no -np option: the number of processes comes directly from Slurm
time mpirun myprog
hybrid code mpi/openmp
Use --ntasks or --ntasks-per-node for the number of MPI instances and --cpus-per-task for the number of OpenMP threads.
#!/bin/bash
# hybrid MPI/OpenMP code
# number of MPI instances per node
#SBATCH --ntasks-per-node=x
# number of OpenMP threads per MPI instance
#SBATCH --cpus-per-task=y
#SBATCH --partition=zen4
#SBATCH --time 120
module purge
module load intel openmpi/4.0.7 intel-mkl/2020.4.304-openmp
# number of OpenMP threads given by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time mpirun my_program
Job Array
This is a way to launch many similar jobs. All jobs in a job array have the same properties (number of CPUs, memory requirement and time limit).
This can be a good way to do data processing.
sample:
# Submit a job array with index values between 1 and 31
# good for day of month
$ sbatch --array=1-31 --ntasks 1 --time 1:00:00 --mem 1g myscript
# Submit a job array with index values of 1990,1991 ... 2010
# only 10 running at the same time
$ sbatch --array=1990-2010%10 --ntasks=1 tmp
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
$ sbatch --array=1-7:2 --ntasks 1 tmp
The environment variable $SLURM_ARRAY_TASK_ID contains the index of the current array element.
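A minimal sketch of how the index can be used inside myscript (the file naming is only an example):
#!/bin/bash
# use the array index (here a day of the month) to choose the input file
DAY=$SLURM_ARRAY_TASK_ID
time my_program input_day_${DAY}.nc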
To know more: https://slurm.schedmd.com/job_array.html
Job Chaining
Job chaining means submitting the same job again, from within the job itself, with sbatch.
This can cause problems if the new job starts before the previous one has completely ended; be aware that a system epilog runs at the end of your job and adds a delay.
To achieve this, the best way is:
- sbatch must be the last command of your job
- run the next job only if the current one finished OK
The next command ensures the current job has really finished before starting the next one; the SLURM_JOBID variable contains the current job id.
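A minimal sketch of such a self-resubmitting script (my_chained_job.sh is a placeholder for the name of this very script; afterok is one common dependency type, afterany would continue the chain even after a failure):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
time my_program
# last command of the job: resubmit this same script, to start only once
# the current job ($SLURM_JOBID) has finished successfully
sbatch --dependency=afterok:$SLURM_JOBID my_chained_job.sh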
Job monitoring
Monitoring a job during the run
You can connect (ssh) to a node only if you have running jobs on this node, then use top or htop.
The memory to compare with your Slurm request in top or htop is in the RES column (not in VIRT).
Multithreaded programs can be identified by a CPU usage above 100%:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
232285 xxxx 20 0 8093392 7.4g 18428 R 768.1 2.9 2140:47 myjob
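A possible sequence to find the node of a running job and watch only your own processes (the node name below is just an example):
# list your jobs and the nodes they run on
squeue -u $USER
# then connect to the node of a running job and monitor only your processes
ssh spirit64-01
top -u $USER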
NEVER use ssh to a node to do anything other than monitoring jobs. All commands run via ssh share your job's resources, and admins can revoke your account in case of abuse.
Job statistics
It is no longer possible to get job statistics directly in the output file (as on ciclad or climserv).
Statistics for:
- running jobs: the sstat command
  sstat -a -j <JobId>
- finished jobs: jobreports, seff <JobId> or sacct -j <JobId>
jobreports is a good option to start with (by default, all your finished jobs of the last week); more options via jobreports --help
user@host:~> jobreports -c
JobID State Elapsed TimeEff CPUEff MemEff
62_1 COMPLETED 00:02:33 4.2% 20.6% 100.0%
62_2 COMPLETED 00:01:13 2.0% 46.6% 99.9%
62_3 COMPLETED 00:02:29 4.1% 21.5% 100.0%
62_4 COMPLETED 00:01:30 2.5% 40.0% 99.9%
user@host ~> jobreports -c -l
JobID State Elapsed TimeEff CPUEff MemEff User Partition TotalCPU Timelimit MaxRSS ReqMem Nodes Tasks CPUS Start JobName
62_1 COMPLETED 00:02:33 4.2% 20.6% 100.0% xxxxx zen4 01:03.658 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:04 batch-test
62_2 COMPLETED 00:01:13 2.0% 46.6% 100.0% xxxxx zen4 01:08.551 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:05 batch-test
62_3 COMPLETED 00:02:29 4.1% 21.5% 100.0% xxxxx zen4 01:04.019 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:05 batch-test
62_4 COMPLETED 00:01:30 2.5% 40.0% 100.0% xxxxx zen4 01:12.994 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:05 batch-test
~# seff 4928
===================================================
Job ID: 4928
Cluster: spirit
User/Account/Group: xxxxx/xyyyy/zzzzzz
Job Name: Test
Running partition: zen4 on spirit64-02
Started : 2022/09/30-17:11:41
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:56:15
CPU Efficiency: 68.71% of 01:21:52 core-walltime
Asked time : 01:00:00
Job Wall-clock time: 00:05:07
Memory Utilized: 30.68 GB
Memory Efficiency: 49.09% of 62.50 GB
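For finished jobs, the raw sacct command gives similar information; the field list below is just one possibility:
# state, elapsed time, CPU time and peak memory of a finished job
sacct -j <JobId> --format=JobID,State,Elapsed,TotalCPU,MaxRSS,ReqMem,NCPUS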