Using Slurm
Slurm Partitions on Spirit
There are two partitions:
- zen4: nodes with 64 cores and 240 GB of memory available for jobs (default partition)
  - memory per CPU 3840M, default time 1H, max time 168H
- zen16: nodes with 32 cores and 496 GB of memory available for jobs
  - memory per CPU 15872M, default time 1H, max time 72H
Multi-node runs are not allowed on these partitions.
Slurm User Limits
These limits apply to the zen4 and zen16 partitions:
- Internal Users: cpu=96,mem=262144 MAXJOBS=64 MaxSubmitJobs=1000
- External Users: cpu=48,mem=131076 MAXJOBS=32 MaxSubmitJobs=500
If one of the cpu, mem, or MAXJOBS limits is reached, your other jobs stay in the PENDING state until you are back under all of the limits.
In this case, the local command slqueue gives you the reason:
- QOSMaxCpuPerUserLimit
- QOSMaxMemoryPerUser
If you try to submit more than the MaxSubmitJobs value, the jobs over the limit are rejected.
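If you prefer the standard Slurm tools, the pending reason can also be displayed with squeue; a minimal sketch (the format string is only a suggestion):
# shows job id, partition, state and the pending reason for your own jobs
squeue -u $USER -o "%.10i %.10P %.10T %.30r"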
Submitting Interactive Jobs
default partition zen4
default time 1H
default cpu 1
The maximum time for an interactive job is 10H, and only one interactive job per user is allowed on Spirit[X].
Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.
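An interactive shell can be requested with srun; a minimal sketch assuming the standard --pty mechanism (adjust partition, time and memory to your needs):
# an interactive bash shell is opened on the allocated node
srun --partition=zen4 --ntasks=1 --time=1:00:00 --pty bash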
Sample:
1 core for 1H on partition zen4 with 3840M of memory
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4252 RUNNING zen4 xxxxx yyyyy 1 1 3840M 0 55:30 1:00:00 spirit64-01
1 core for 2H with 6GB of memory
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4253 RUNNING zen4 xxxx yyyyy 1 2 6G 0 1:59:55 2:00:00 spirit64-01
You can see that the system allocated 2 CPUs even though only 1 was requested. Memory per CPU is 3840M on the zen4 partition, so a 6G request is charged as two CPUs; otherwise the extra memory would leave CPUs on the node without enough memory to be usable by other jobs.
If you need more than 3840M per CPU, it is better to use the zen16 partition (15872M per core).
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4254 RUNNING zen16 xxxx yyyyy 1 1 6G 0 1:59:58 2:00:00 spirit32-01
In this case only 1 CPU is allocated, matching the request.
Submitting Batch Jobs
default partition zen4
default time 1H
default cpu 1
sbatch [ --partition=zen4|zen16 ] [ --time=value ] [ --mem=value | --mem-per-cpu=value ] [ --ntasks=num_tasks ] [ --cpus-per-task=value ] script [ ARGS ]
All sbatch options can also be put in the script itself as shell comments (#SBATCH --option argument).
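For example, a minimal sketch of a submission with options given on the command line (my_job.sh and ARG1 are hypothetical names); command-line options take precedence over the #SBATCH lines in the script:
# submit my_job.sh to zen16 for 2 hours with 8G per CPU
sbatch --partition=zen16 --time=2:00:00 --mem-per-cpu=8G my_job.sh ARG1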
Batch samples
All these samples can also be found on the clusters in /net/nfs/tools/meso-u20/batch-samples
Serial job (can only use one CPU)
If your job needs more than 3840M of real memory, it is better to submit it to the zen16 partition (see the submission example after this script).
#!/bin/bash
# Only one CPU
#SBATCH --ntasks 1
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# Partition (zen4 or zen16 depending on your memory requirement)
#SBATCH --partition zen4
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/10.2.0 netcdf-fortran
# execute my program
time my_program
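If the serial program needs more memory than the zen4 default, a hedged example of submitting the same script to zen16 with an explicit memory request (serial_job.sh is a hypothetical file name):
# the command-line options override the #SBATCH lines in serial_job.sh
sbatch --partition=zen16 --mem=10G serial_job.sh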
Parallel job (using multiple CPUs)
Inside a job, several Slurm environment variables give the number of CPUs, depending on the options you pass to Slurm:
- SLURM_CPUS_ON_NODE (always defined)
- SLURM_JOB_CPUS_PER_NODE (always defined)
- SLURM_CPUS_PER_TASK (only defined with the --cpus-per-task=x option)
OpenMP
Use the sbatch options --ntasks=1 --cpus-per-task=x and set the environment variable OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK in your job.
#!/bin/bash
# OpenMP uses multiple CPUs on a single node; the number of CPUs must be requested
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
#SBATCH --cpus-per-task=x
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/9.4.0 openblas/0.3.17-openmp
# with the --cpus-per-task=x option the number of CPUs is given by
# Slurm in the environment variable SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time my_code
Multithreaded code (pthreads)
Use the sbatch options --ntasks=1 --cpus-per-task=x.
Controlling the number of threads is generally only possible inside the code; Python multiprocessing falls into this category. Remember to adjust the number of threads to the number of cores you requested (and keep it low, max 4, on a head node). Using cpu_count is not a good option on the cluster; the best way is:
#!/bin/bash
# MULTITHREADED CODE or Python multiprocessing
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
# read this value in your job to set the number of processes;
# in Python use os.getenv('SLURM_CPUS_PER_TASK') instead of cpu_count
#SBATCH --cpus-per-task=x
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
module load pangeo-meso/2024.01.22
time python my_program.py
MPI
Use our openmpi module; it is compiled to support Slurm directly. Define the number of MPI tasks with --ntasks=x or --ntasks-per-node=x. Do not use mpirun -np value, just mpirun (the number of processes is passed directly from Slurm to mpirun).
#!/bin/bash
# OPENMPI JOB
# --ntasks=x gives the number of MPI processes
#SBATCH --ntasks=x
# requested time can be in minutes (120), or 2:00:00, or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/11.2.0 openmpi/4.0.7
#
time mpirun myprog
Hybrid MPI/OpenMP code
Use --ntasks or --ntasks-per-node for the MPI instances and --cpus-per-task for the OpenMP threads.
#!/bin/bash
# Hybrid MPI/OpenMP code
# number of MPI instances per node
#SBATCH --ntasks-per-node=x
# number of OpenMP threads per MPI instance
#SBATCH --cpus-per-task=y
#SBATCH --partition=zen4
#SBATCH --time 120
module purge
module load intel openmpi/4.0.7 intel-mkl/2020.4.304-openmp
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time mpirun my_program
Job Array
This is a way to launch many similar jobs. All jobs in a job array have the same properties (number of CPUs, memory requirement and time limit). This can be a good way to do data processing.
Sample:
# Submit a job array with index values between 1 and 31
# good for day of month
$ sbatch --array=1-31 --ntasks 1 --time 1:00:00 --mem 1g myscript
# Submit a job array with index values of 1990,1991 ... 2010
# only 10 running at the same time
$ sbatch --array=1990-2010%10 --ntasks=1 tmp
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
$ sbatch --array=1-7:2 --ntasks 1 tmp
Inside each job of the array, the environment variable $SLURM_ARRAY_TASK_ID contains the index of that job, as shown in the sketch below.
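A minimal sketch of a job-array script using this variable (my_program and the input file naming are only placeholders):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time 1:00:00
# each array task processes its own input, selected by the array index
time my_program input_${SLURM_ARRAY_TASK_ID}.nc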
To learn more: https://slurm.schedmd.com/job_array.html
Job Chaining
Job chaining means submitting the same job again from within a job with sbatch.
This can cause problems if the new job starts before the previous one has completely ended; be aware that a system epilog runs at the end of your job and causes a delay.
To achieve this, the best way is:
- sbatch must be the last command of your job
- run the next job only if the current one finished OK
A command like the sketch below ensures the current job is really finished before the next one starts; the SLURM_JOBID variable contains the current job id.
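A minimal sketch of such a command, assuming the standard --dependency=afterok mechanism (next_job.sh is a hypothetical script name):
# the new job starts only after the current job ($SLURM_JOBID) has completed successfully
sbatch --dependency=afterok:$SLURM_JOBID next_job.sh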
Job Statistics
It is no longer possible to get job statistics directly in the output file (as it was on ciclad or climserv).
Statistics for:
- Running jobs: command sstat
  sstat -a -j <JobId>
- Finished jobs:
  seff <JobId>
  or sacct -j <JobId>
~# seff 4928
===================================================
Job ID: 4928
Cluster: spirit
User/Account/Group: xxxxx/xyyyy/zzzzzz
Job Name: Test
Running partition: zen4 on spirit64-02
Started : 2022/09/30-17:11:41
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:56:15
CPU Efficiency: 68.71% of 01:21:52 core-walltime
Asked time : 01:00:00
Job Wall-clock time: 00:05:07
Memory Utilized: 30.68 GB
Memory Efficiency: 49.09% of 62.50 GB
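With sacct you can also select the accounting fields yourself; a minimal sketch using commonly useful standard fields:
# elapsed time, consumed CPU time, peak memory and final state of a finished job
sacct -j <JobId> --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State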
To be continued