Using Slurm
Slurm Spirit Partitions
There are two partitions:
- zen4: nodes with 64 cores and 240 GB of memory for jobs (default partition)
  - memory per CPU: 3840M, default time: 1H, max time: 168H
- zen16: nodes with 32 cores and 496 GB of memory for jobs
  - memory per CPU: 15872M, default time: 1H, max time: 72H
Multi-node runs are not allowed on these partitions.
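To check the partitions and their limits directly on the cluster, the standard sinfo command can be used; the output format string below is only one possibility:
# partition, CPUs per node, memory per node (MB), time limit
sinfo -o "%12P %5c %8m %10l"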
Slurm User limits
These limits apply to the zen4 and zen16 partitions:
- Internal users: cpu=96, mem=262144, MAXJOBS=64, MaxSubmitJobs=1000
- External users: cpu=48, mem=131076, MAXJOBS=32, MaxSubmitJobs=500
If one of the cpu, mem or MAXJOBS limits is reached, your other jobs stay in the PENDING state until you are back under all of these limits.
The local command slqueue gives you the reason in that case:
- Resources: there are not enough resources for your job, but you are first in priority
- Priority: there are not enough resources for your job and you are not first in priority
- Dependency: you asked for a dependency on a job that has not finished running
- QOSMaxCpuPerUserLimit: your running jobs hit the maximum number of CPUs permitted for your account
- QOSMaxMemoryPerUser: your running jobs hit the maximum total memory permitted for your account
- QOSMaxJobPerUser: your running jobs hit the maximum number of jobs permitted for your account
- ReqNodeNotAvail, Reserv: there is a reservation on the cluster made by the admin team for special use or maintenance, and with your requested TimeLimit your job cannot finish before the beginning of the reservation. For example, you requested 7 days and there is a maintenance with full stop in 3 days.
If you try to submit more than the MaxSubmitJobs value, the submitted jobs over the limit are rejected.
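Besides slqueue, the standard squeue command can also display the pending reason for your jobs; the format string below is just an example:
# job id, state and pending reason for your own jobs
squeue -u $USER -o "%.12i %.10T %.30r"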
How to choose the right partition
Use the zen16 partition only if you really need more than 3840M per CPU; otherwise use the zen4 default partition.
For example, asking for 120 GB of memory on zen4 gives you 32 cores even if you have only one task (sequential program); the same request on the zen16 partition gives you only 7 CPUs.
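A sketch of the two requests (my_script is a placeholder):
# zen4: 120G at 3840M per CPU -> 32 CPUs charged to the job
sbatch --partition=zen4 --ntasks=1 --mem=120G my_script
# zen16: 120G at 15872M per CPU -> only a few CPUs charged to the job
sbatch --partition=zen16 --ntasks=1 --mem=120G my_script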
Submitting Interactive Jobs
- default partition: zen4
- default time: 1H
- default CPUs: 1
Max time for interactive jobs is 10H, and only one interactive job per user is allowed on Spirit[X].
It's a limit, not a target ;-)
Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.
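A typical way to start an interactive shell, using standard Slurm syntax (a local wrapper may exist as well; the values are only examples):
# 1 core, 1 hour, default memory per CPU, interactive bash on a zen4 node
srun --partition=zen4 --ntasks=1 --time=1:00:00 --pty bash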
Samples:
1 core for 1H on the zen4 partition with 3840M of memory
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4252 RUNNING zen4 xxxxx yyyyy 1 1 3840M 0 55:30 1:00:00 spirit64-01
1 core for 2H with 6 GB of memory
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4253 RUNNING zen4 xxxx yyyyy 1 2 6G 0 1:59:55 2:00:00 spirit64-01
You can see that the system gave me 2 CPUs even though I asked for only one. Memory per CPU is 3840M on the zen4 partition, so the system allocates you the extra CPU rather than leave it without enough memory to be used by other jobs.
If you need more than 3840M per CPU, it is better to use the zen16 partition (15872M per core).
JobId State Partition Username Account Node CPUs Memory Batch TimeLeft TimeLimit Node/Reason
4254 RUNNING zen16 xxxx yyyyy 1 1 6G 0 1:59:58 2:00:00 spirit32-01
In this case I get only 1 CPU, matching my request.
Submitting Batch jobs
- default partition: zen4
- default time: 1H
- default CPUs: 1
sbatch [--partition=zen4|zen16] [--time=value] [--mem=value|--mem-per-cpu=value] [--ntasks=num_tasks] [--cpus-per-task=num_cpus] script [ARGS]
All sbatch options can also be put in the script itself as shell comments (#SBATCH --option argument).
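For example, the two submissions below are equivalent (my_script is a placeholder):
# options on the command line
sbatch --partition=zen4 --time=2:00:00 --ntasks=1 my_script
# or the same options at the top of my_script, submitted with a plain "sbatch my_script":
#   #SBATCH --partition=zen4
#   #SBATCH --time=2:00:00
#   #SBATCH --ntasks=1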
batch samples
All these samples can also be found on the clusters in /net/nfs/tools/meso-u20/batch-samples
Sequential job (can only use one CPU)
If you need more than 3840M of real memory for your job, it is better to submit it to the zen16 partition.
#!/bin/bash
# only one CPU
#SBATCH --ntasks 1
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# partition (zen4 or zen16 depending on your memory requirement)
#SBATCH --partition zen4
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/10.2.0 netcdf-fortran
# execute my program
time my_program
Parallel job (using multiple CPUs)
Inside a job, several SLURM environment variables let you know the number of CPUs, depending on the options you gave to Slurm:
- SLURM_CPUS_ON_NODE (always defined)
- SLURM_JOB_CPUS_PER_NODE (always defined)
- SLURM_CPUS_PER_TASK (only defined with the --cpus-per-task=x option)
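A quick way to check them from inside a job script:
# print the CPU-related variables Slurm sets for this job
echo "CPUS_ON_NODE=$SLURM_CPUS_ON_NODE JOB_CPUS_PER_NODE=$SLURM_JOB_CPUS_PER_NODE CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"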
openmp
Use the sbatch options --ntasks=1 --cpus-per-task=x and set the environment variable OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK in your job.
#!/bin/bash
# OpenMP uses multiple CPUs on a single node; the number of CPUs must be requested
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
#SBATCH --cpus-per-task=x
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/9.4.0 openblas/0.3.17-openmp
# with the option --cpus-per-task=x, the number of CPUs is given by
# Slurm in the environment variable SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time my_code
multithreads (pthreads)
Use the sbatch options --ntasks=1 --cpus-per-task=x.
Controlling the number of threads is generally only possible inside the code; Python multiprocessing falls into this category. Remember to match your number of threads to the number of cores you asked for (or, on the head node, keep it low: at most 4). Using cpu_count is not a good option on a cluster; the best way is:
#!/bin/bash
# multithreaded code or Python multiprocessing
# only one task
#SBATCH --ntasks=1
# number of CPUs per task
#SBATCH --cpus-per-task=x
# read the value of SLURM_CPUS_PER_TASK in your job to set the number of processes;
# in Python, use os.getenv('SLURM_CPUS_PER_TASK') instead of cpu_count
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
module load pangeo-meso/2024.01.22
time python my_program.py
mpi
Use our openmpi module: it is compiled to support Slurm directly. Define the number of MPI tasks with --ntasks=x or --ntasks-per-node=x. Do not use mpirun -np value, but only mpirun (the number of processes is passed directly from Slurm to mpirun).
#!/bin/bash
# OpenMPI job
# --ntasks=x gives the number of MPI processes
#SBATCH --ntasks=x
# requested time can be in minutes (120) or 2:00:00 or 1-00:00:00 (24H)
#SBATCH --time 120
# to debug the script it can be useful to have
set -x
# purge all modules to be sure there is no interference with the current environment
module purge
# load only the modules needed for this sample
module load gcc/11.2.0 openmpi/4.0.7
# no -np option: the number of processes comes directly from Slurm
time mpirun myprog
hybrid code mpi/openmp
Use --ntasks or --ntasks-per-node for the number of MPI instances and --cpus-per-task for the number of OpenMP threads.
#!/bin/bash
# hybrid MPI/OpenMP code
# number of MPI instances per node
#SBATCH --ntasks-per-node=x
# number of OpenMP threads per MPI instance
#SBATCH --cpus-per-task=y
#SBATCH --partition=zen4
#SBATCH --time 120
module purge
module load intel openmpi/4.0.7 intel-mkl/2020.4.304-openmp
# number of OpenMP threads given by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time mpirun my_program
Job Array
This is a way to launch many similar jobs. All jobs in a job array have the same properties (number of CPUs, memory requirement and time limit).
This can be a good way to do data processing.
sample:
# Submit a job array with index values between 1 and 31
# good for day of month
$ sbatch --array=1-31 --ntasks 1 --time 1:00:00 --mem 1g myscript
# Submit a job array with index values of 1990,1991 ... 2010
# only 10 running at the same time
$ sbatch --array=1990-2010%10 --ntasks=1 tmp
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
$ sbatch --array=1-7:2 --ntasks 1 tmp
The environment variable $SLURM_ARRAY_TASK_ID contains the index of the current array element.
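A minimal sketch of how the index can be used inside myscript (the file naming is only an example):
#!/bin/bash
# use the array index (here a day of the month) to choose the input file
DAY=$SLURM_ARRAY_TASK_ID
time my_program input_day_${DAY}.nc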
To know more: https://slurm.schedmd.com/job_array.html
Job Chaining
Job chaining means submitting the same job again, from within the job itself, with sbatch.
This can cause problems if the new job starts before the previous one has completely ended; be aware that a system epilog runs at the end of your job and adds a delay.
To achieve this, the best way is:
- sbatch must be the last command of your job
- run the next job only if the current one finished OK
The next command ensures the current job has really finished before starting the next one; the SLURM_JOBID variable contains the current job id.
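A minimal sketch of such a self-resubmitting script (my_chained_job.sh is a placeholder for the name of this very script; afterok is one common dependency type, afterany would continue the chain even after a failure):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
time my_program
# last command of the job: resubmit this same script, to start only once
# the current job ($SLURM_JOBID) has finished successfully
sbatch --dependency=afterok:$SLURM_JOBID my_chained_job.sh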
Job monitoring
Monitoring a job during the run
You can connect (ssh) to a node only if you have running jobs on this node, then use top or htop.
The memory to compare with your Slurm request in top or htop is in the RES column (not in VIRT).
Multithreaded programs can be identified by a CPU usage above 100%:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
232285 xxxx 20 0 8093392 7.4g 18428 R 768.1 2.9 2140:47 myjob
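A possible sequence to find the node of a running job and watch only your own processes (the node name below is just an example):
# list your jobs and the nodes they run on
squeue -u $USER
# then connect to the node of a running job and monitor only your processes
ssh spirit64-01
top -u $USER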
NEVER use ssh to a node to do anything other than monitoring jobs. All commands run via ssh share your job's resources, and admins can revoke your account in case of abuse.
Job statistics
It is no longer possible to get job statistics directly in the output file (as on ciclad or climserv).
Statistics for:
- running jobs: the sstat command
  sstat -a -j <JobId>
- finished jobs: jobreports, seff <JobId> or sacct -j <JobId>
jobreports is a good option to start with (by default, all your finished jobs of the last week); more options via jobreports --help
user@host:~> jobreports -c
JobID State Elapsed TimeEff CPUEff MemEff
62_1 COMPLETED 00:02:33 4.2% 20.6% 100.0%
62_2 COMPLETED 00:01:13 2.0% 46.6% 99.9%
62_3 COMPLETED 00:02:29 4.1% 21.5% 100.0%
62_4 COMPLETED 00:01:30 2.5% 40.0% 99.9%
user@host ~> jobreports -c -l
JobID State Elapsed TimeEff CPUEff MemEff User Partition TotalCPU Timelimit MaxRSS ReqMem Nodes Tasks CPUS Start JobName
62_1 COMPLETED 00:02:33 4.2% 20.6% 100.0% xxxxx zen4 01:03.658 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:04 batch-test
62_2 COMPLETED 00:01:13 2.0% 46.6% 100.0% xxxxx zen4 01:08.551 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:05 batch-test
62_3 COMPLETED 00:02:29 4.1% 21.5% 100.0% xxxxx zen4 01:04.019 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:05 batch-test
62_4 COMPLETED 00:01:30 2.5% 40.0% 100.0% xxxxx zen4 01:12.994 01:00:00 4.00G 4G 1 1 2 2024-12-04T09:58:05 batch-test
~# seff 4928
===================================================
Job ID: 4928
Cluster: spirit
User/Account/Group: xxxxx/xyyyy/zzzzzz
Job Name: Test
Running partition: zen4 on spirit64-02
Started : 2022/09/30-17:11:41
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:56:15
CPU Efficiency: 68.71% of 01:21:52 core-walltime
Asked time : 01:00:00
Job Wall-clock time: 00:05:07
Memory Utilized: 30.68 GB
Memory Efficiency: 49.09% of 62.50 GB
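For finished jobs, the raw sacct command gives similar information; the field list below is just one possibility:
# state, elapsed time, CPU time and peak memory of a finished job
sacct -j <JobId> --format=JobID,State,Elapsed,TotalCPU,MaxRSS,ReqMem,NCPUS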