
Migration from ciclad/climserv to spirit/spiritx

Account

Your account is the same on all clusters (old and new).

Filesystems and shell initialisation

All filesystems are shared between the old and new clusters, so you may run into problems when logging in to the new cluster because of environment variables, loaded modules, conda initialisation, etc.

The default shell is bash; its init files are:

  • $HOME/.bash_profile
  • $HOME/.bashrc

It is a good idea to add a test on the cluster you are logged in to in your shell init files and to move each initialisation into the appropriate block (see the sketch below).
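
A minimal sketch of such a test, assuming the new front-end hostnames start with spirit (as in the prompt shown further down this page); the module names are placeholders:

# in $HOME/.bashrc: keep cluster-specific initialisations in separate blocks
case "$(hostname -s)" in
  spirit*)   # spirit/spiritx (new cluster) initialisations go here
    :        # e.g. module load <new module names>
    ;;
  *)         # ciclad/climserv (old cluster) initialisations go here
    :        # e.g. module load <old module names>
    ;;
esac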

Pay particular attention to PATH and LD_LIBRARY_PATH entries that point to directories from the old cluster: they can produce unexpected errors by picking up old binary files. The env, ldd and type commands are good ways to track down this sort of problem.
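
For example, to check where a command and its libraries actually come from (ncdump is just an example binary):

env | grep -E 'PATH|LD_LIBRARY_PATH'     # look for directories from the old cluster
type ncdump                              # shows which file would actually be executed
ldd "$(type -p ncdump)"                  # shows which shared libraries it links against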

The same applies to conda environments that were patched with fix-my-python.

SSH

The supported SSH key algorithms have changed:

  • DSA keys cannot be used anymore on the new cluster: you must change your key.
    • DSA public keys start with ssh-dss in the $HOME/.ssh/authorized_keys file on the cluster.
    • RSA keys (4096 bits) are still accepted.
      • Not always easy to use, because the ssh-rsa algorithm is deactivated by default in recent OpenSSH clients (8.8+).
      • Run ssh -V to check the client version on your OS, and see the FAQ to learn how to reactivate RSA if needed.
  • ED25519 is the new preferred algorithm (shorter keys, better security).

If you can log in to ciclad or climserv but not to spirit or spiritx, it is probably an ssh-dss key problem.
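
To generate an ED25519 key pair and install the public key on the cluster (replace <login>@<front-end> with your own values):

ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub <login>@<front-end>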

Module

  • No scientific products (matlab, idl, ...) are loaded by default if they are not native in the Linux distribution. Start with module avail to search for your product, then load the module to access the software (see the example below).
  • All product versions have changed.
  • The PGI compiler is now named NVHPC (PGI is now part of NVIDIA).
  • Compiler and library usage has changed (similar to the GENCI IDRIS Jean Zay usage).

Module collections saved on the old cluster will not work when restored on the new cluster and will produce strange behaviour.
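
Typical module usage on the new cluster (netcdf is only an example product; use the exact name/version reported by module avail):

module avail                # list all modules available on the new cluster
module avail netcdf         # search for a given product
module load netcdf          # load it, using the name/version shown by module avail
module list                 # check what is currently loaded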

Python

Python patched with the fix-my-python/0.2 module on ciclad/climserv can become a nightmare when you reuse your Python environments on spirit/spiritx.

The best approach is to rebuild all your Python environments on spirit/spiritx; $HOME/.conda should also be removed, after making a backup (tar).
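
A possible sequence, as a sketch (the anaconda module name and the environment contents are placeholders; check module avail for the real names):

tar czf $HOME/conda-backup.tar.gz -C $HOME .conda    # backup first
rm -rf $HOME/.conda

module load anaconda                        # placeholder: check module avail
conda create -n myenv python=3.10 numpy     # placeholder environment and packages
conda activate myenv                        # may require conda init, depending on how the module sets things up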

Matlab

Interactive Matlab sessions can no longer be run directly on the head nodes; they must be launched via interactive jobs.

Sessions are limited to one hour by default; add the --time= option to request up to a maximum of 6 hours:

srun --pty --x11  -p zen16 matlab 
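
For example, to request the 6-hour maximum explicitly:

srun --pty --x11 --time=6:00:00 -p zen16 matlab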

Job manager: from Torque/PBS to SLURM

The job manager is now SLURM, no longer Torque/PBS.

Table 1 lists the common tasks that you can perform in Torque/PBS and the equivalent ways to perform those tasks in SLURM.

Task                           | Torque/PBS          | SLURM
Cluster state (local command*) | check-cluster       | check-cluster
Submit job                     | qsub MyJob.sh       | sbatch MyJob.sh
Interactive job                | qsub -IV            | srun --pty bash
Interactive job with graphics  | qsub -IVX           | srun --pty --x11 bash
Delete job                     | qdel 123            | scancel 123
Show job status                | qstat               | squeue, slqueue (local command*)
Show job status                | showq               | squeue, slqueue (local command*)
Show job details               | qstat -f 123        | scontrol show job 123
Show expected job start time   | showstart           | squeue --start
Show queue info                | qstat -q            | sinfo
Show queue details             | qstat -Q -f         | scontrol show partition <partition_name>
Show node details              | pbsnodes [nodename] | scontrol show node [nodename]
In-script job directive        | #PBS                | #SBATCH

* local command: a command written by the local admin team
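
As an illustration, a typical submit-and-monitor sequence on the new cluster (MyJob.sh and the job id 123 are taken from the table above):

sbatch MyJob.sh           # submit the batch script; prints the job id
squeue -u $USER           # list your pending and running jobs
scontrol show job 123     # detailed information about one job
scancel 123               # cancel it if needed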

Queue (TORQUE/PBS) vs partition (SLURM)

On the old cluster, a queue was just a way to specify the time limit of a job (-l walltime=time could also be used):

  • short: 2H
  • std: 6H
  • h12: 12H
  • day: 24H
  • days3 or threedays: 72H
  • week: 168H
  • weeks2: 360H
  • infini: 860H

On the new cluster, partitions refer to different hardware nodes:

  • zen4: nodes with 4000MB of memory per core and 64 cores (default partition)
  • zen16: nodes with 16000MB of memory per core and 32 cores
Time request | Torque/PBS                | SLURM
2H           | -q short                  | --time=2:00:00
6H           | -q std                    | --time=6:00:00
12H          | -q h12                    | --time=12:00:00
1 day        | -q days                   | --time=24:00:00 or --time=1-00:00:00
3 days       | -q days3 or -q threedays  | --time=3-00:00:00
7 days       | -q week                   | --time=7-00:00:00
             | -l walltime=              | --time=
2 weeks      | -q weeks2                 | doesn't exist anymore
1 month      | -q infini                 | doesn't exist anymore
  • The default time if not specified is 1H in both partitions; the maximum time is 7 days (168H).

If your computation needs more than one week, you must have a check-pointing solution in your code.

  • Check-pointing is the ability to restart your job from the point where it previously stopped.
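
For example, a job that used to be submitted to the days3 queue would now be submitted like this (my_job.sh stands for your own batch script; -p zen4 can be omitted since zen4 is the default partition):

sbatch -p zen4 --time=3-00:00:00 my_job.sh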

Table 2 lists the commonly used options in the batch job script for both Torque/PBS (qsub) and SLURM (sbatch/srun/salloc).

Option | Torque/PBS | SLURM
In-script job directive | #PBS | #SBATCH
Declares a name for the job | -N name | -J, --job-name=<jobname>
Declares if the standard error stream of the job will be merged with the standard output stream of the job | -j oe / -j eo | default behavior
 | -k oe / -k eo | default behavior
Defines the path to be used for the standard output stream of the batch job | -o path | -o, --output=<filename pattern>
Defines the path to be used for the standard error stream of the batch job | -e path | -e, --error=<filename pattern>
Defines the working directory path to be used for the job | -w path | -D, --workdir=<directory>
Defines the set of conditions under which the execution server will send a mail message about the job | -m mail_options (a, b, e) | --mail-type=<type> (type = NONE, BEGIN, END, FAIL, REQUEUE, ALL); default is NONE
Declares the list of users to whom mail is sent by the execution server when it sends mail about the job | -M user_list | --mail-user=<user>
Specifies the number of processors per node requested | -l nodes=number:ppn=number | --ntasks-per-node=<ntasks> / --tasks-per-node=<n>
Specifies the real memory required per node in Megabytes or Gigabytes | -l mem=mem | --mem=<M|G>
Specifies the minimum memory required per allocated CPU in Megabytes or Gigabytes | no equivalent | --mem-per-cpu=<M|G>
 | -l vmem= | not used anymore
This job may be scheduled for execution at any point after jobs jobid have started execution | -W depend=after:jobid[:jobid...] | -d, --dependency=after:job_id[:jobid...]
This job may be scheduled for execution only after jobs jobid have terminated with no errors | -W depend=afterok:jobid[:jobid...] | -d, --dependency=afterok:job_id[:jobid...]
This job may be scheduled for execution only after jobs jobid have terminated with errors | -W depend=afternotok:jobid[:jobid...] | -d, --dependency=afternotok:job_id[:jobid...]
This job may be scheduled for execution after jobs jobid have terminated, with or without errors | -W depend=afterany:jobid[:jobid...] | -d, --dependency=afterany:job_id[:jobid...]
Expands the list of environment variables that are exported to the job | -v variable_list | --export=<environment variables|ALL|NONE>
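
Putting several of these options together, a minimal SLURM batch script could look like the sketch below (the job name, file names, module and resource values are placeholders to adapt):

#!/bin/bash
#SBATCH --job-name=my_analysis        # placeholder job name
#SBATCH --partition=zen4              # default partition (4000MB per core)
#SBATCH --time=12:00:00               # wall-clock limit, max 7-00:00:00
#SBATCH --ntasks-per-node=4           # placeholder: processes per node
#SBATCH --mem=8G                      # real memory per node
#SBATCH --output=my_analysis_%j.out   # %j is replaced by the job id
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=me@example.org    # placeholder address

module load netcdf                    # placeholder module
./my_program                          # placeholder executable

Submit it with sbatch, as shown in Table 1.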

Table 3 lists the commonly used environment variables in Torque/PBS and the equivalents in SLURM.

Environment variable  | Torque/PBS      | SLURM
Job ID                | PBS_JOBID       | SLURM_JOB_ID / SLURM_JOBID
Job name              | PBS_JOBNAME     | SLURM_JOB_NAME
Node list             | PBS_NODELIST    | SLURM_JOB_NODELIST / SLURM_NODELIST
Job submit directory  | PBS_O_WORKDIR   | SLURM_SUBMIT_DIR
Job array ID (index)  | PBS_ARRAY_INDEX | SLURM_ARRAY_TASK_ID
Number of tasks       | -               | SLURM_NTASKS
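
Inside a script, the PBS variables are simply replaced by their SLURM equivalents, for example (a small illustrative fragment):

cd "$SLURM_SUBMIT_DIR"        # was: cd $PBS_O_WORKDIR
echo "job $SLURM_JOB_ID ($SLURM_JOB_NAME) runs on $SLURM_JOB_NODELIST"
echo "array index: ${SLURM_ARRAY_TASK_ID:-not an array job}"   # was: $PBS_ARRAY_INDEX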

Container for compatibility with the ciclad/climserv clusters

It is also possible to run old scripts and binaries directly on the new cluster.

For MPI code it is not possible to use this container; the only way is to recompile your code on the new cluster with the new compiler and library modules.

We have made a Singularity container based on Scientific Linux 6 with all the old modules inside.

To use it, just launch the mesosl6 command:

user@spirit1:~>  mesosl6
next message is harmless
ERROR: ld.so: object '/usr/bin/tclsh' from LD_PRELOAD cannot be preloaded: ignored.
----------------------------------------------------------
Welcome on First Singularity Container Meso-SL6 compatible
5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022
Scientific Linux release 6.10 (Carbon)  running on spirit1
WARNING: MPI Program cannot be run this way 
----------------------------------------------------------

[SINGULARITY-MESOSL6]~ \>
Under this prompt, the environment is the same as on the old cluster (but without any working old PBS batch commands such as qsub).

To launch an old script in batch mode on the new cluster, launch the container via SLURM (sbatch/srun), as in this example:

sbatch -p zen4  mesosl6 my_old_batch_script.sh
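
Since srun is mentioned as an alternative, an interactive run inside the container should presumably work the same way; this form is an assumption based on the sbatch example above, and my_old_tool is a placeholder:

srun -p zen4 --time=2:00:00 mesosl6 my_old_tool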

CRONTABS

crontab on spirit/spiritx does not load your shell initialisation files, so if you need to load modules in your cron scripts, start them with:

#!/bin/bash -l
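
A sketch of what this can look like (the schedule, paths, module name and command are placeholders):

# crontab entry (edit with crontab -e): run every night at 02:00
0 2 * * * $HOME/bin/nightly_job.sh >> $HOME/nightly_job.log 2>&1

and the script itself:

#!/bin/bash -l
# nightly_job.sh: "-l" makes bash read your login init files, so module is available
module load netcdf          # placeholder module
my_processing_command       # placeholder command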