Migration from ciclad/climserv to spirit/spiritx
Account
Your account is the same on all clusters (old and new).
Filesystems and shell initialisation
All filesystems are shared between the old and new clusters, so you may run into problems when logging in to the new cluster because of environment variables, loaded modules, conda initialisation, etc. inherited from your shell setup.
The default shell is bash; its init files are:
- $HOME/.bash_profile
- $HOME/.bashrc
It's a good idea to add a cluster test such as the sketch below to your shell init files and move each initialisation into the appropriate block.
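A minimal sketch for $HOME/.bashrc, assuming the host name is enough to tell the clusters apart (the exact host name patterns below are an assumption):

case "$(hostname)" in
  ciclad*|climserv*)
    # old cluster: keep your existing initialisations here
    ;;
  spirit*)
    # new cluster: put spirit/spiritx specific initialisations here
    ;;
esac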
Pay particular attention to the PATH and LD_LIBRARY_PATH variables when they contain directories from the old cluster: they can produce unexpected errors by picking up old binary files. The env, ldd and type commands are good tools to find this kind of problem.
The same goes for conda environments patched with fix-my-python.
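For example, the following commands (python is used here only as an illustration) help spot settings inherited from the old cluster:

env | grep -E '^(PATH|LD_LIBRARY_PATH)='    # look for directories that only exist on the old cluster
type python                                 # which python would actually run?
ldd "$(type -p python)"                     # which shared libraries would it load?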
SSH
The supported SSH key algorithms have changed:
- DSA keys cannot be used anymore on the new cluster: you must change your key.
- DSA public keys start with ssh-dss in the $HOME/.ssh/authorized_keys file on the cluster.
- RSA keys (4096 bits) are still accepted.
- They are not always easy to use because RSA is deactivated by default in recent OpenSSH clients (8.8+). Run
ssh -V
to check your client version, and see the FAQ for how to reactivate RSA if needed.
- ED25519 is the new preferred algorithm (shorter keys, better security).
If you can log in to ciclad or climserv but not to spirit or spiritx, it is probably an ssh-dss key problem.
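For example, a new ED25519 key can be generated and installed like this (the key file name, user name and login node are placeholders, adapt them to your account):

ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519            # generate a new ED25519 key pair
ssh-copy-id -i ~/.ssh/id_ed25519.pub login@spirit1    # install the public key on the new cluster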
Module
- No scientific products (Matlab, IDL, ...) are loaded by default if they are not native to the Linux distribution.
Start with
module avail
to find your product, then load the corresponding module to access the software (see the example after this list).
- All product versions have changed.
- The PGI compiler is now named NVHPC (PGI is now part of NVIDIA).
- Compiler and library usage has changed (similar to the usage on the GENCI IDRIS Jean Zay machine).
Module collections saved on the old cluster will not work when restored on the new cluster and will produce strange behaviour.
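For example (the product name is just an illustration; available names and versions are shown by module avail):

module avail               # list all installed products
module avail matlab        # search for a specific product
module load matlab         # load it (append /version to select a specific version)
module list                # check what is currently loaded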
Python
Python environments patched with the fix-my-python/0.2 module on ciclad/climserv can become a nightmare when you use them on spirit/spiritx.
The best way is to rebuild all your Python environments on spirit/spiritx; $HOME/.conda should be removed after you have made a backup (tar).
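A minimal sketch of the backup-and-rebuild steps (the archive name, module name and environment contents are assumptions; check module avail for the actual conda/python module):

cd $HOME
tar czf conda-ciclad-backup.tar.gz .conda    # backup of the old environments
rm -rf .conda                                # remove the patched conda tree
module load <python/conda module>            # exact module name is an assumption, see module avail
conda create -n myenv python numpy           # rebuild your environments from scratch on spirit/spiritx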
Matlab
Interactive Matlab sessions cannot be run directly on the head nodes anymore; they must be launched via interactive jobs.
Sessions are limited to one hour by default; add the --time= option to request up to a maximum of 6 hours.
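For example (the matlab module name is an assumption, check module avail):

srun --time=6:00:00 --pty --x11 bash    # interactive job with X11 forwarding, 6 hours maximum
# then, inside the interactive job:
module load matlab
matlab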
Job manager: from Torque/PBS to SLURM
The job manager is now SLURM instead of Torque/PBS.
Table 1 lists the common tasks that you can perform in Torque/PBS and the equivalent ways to perform those tasks in SLURM.
Task | Torque/PBS | SLURM |
---|---|---|
Cluster state (local command*) | check-cluster | check-cluster |
Submit job | qsub MyJob.sh | sbatch MyJob.sh |
Interactive job | qsub -IV | srun --pty bash |
Interactive job with graphics | qsub -IVX | srun --pty --x11 bash |
Delete job | qdel 123 | scancel 123 |
Show job status | qstat | squeue, slqueue (local command*) |
Show job status | showq | squeue, slqueue (local command*) |
Show job details | qstat -f 123 | scontrol show job 123 |
Show expected job start time | showstart | squeue --start |
Show queue info | qstat -q | sinfo |
Show queue details | qstat -Q -f | scontrol show partition <partition_name> |
Show node details | pbsnodes [nodename] | scontrol show node [nodename] |
In shell script job directive | #PBS | #SBATCH |
* A local command is a command written by the local admin team.
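For example, a typical submit-and-monitor sequence on the new cluster (the script name and job id are placeholders):

sbatch MyJob.sh           # submit the job, prints its job id
squeue -u $USER           # show your pending and running jobs
scontrol show job 123     # detailed information about job 123
scancel 123               # delete job 123 if needed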
Queue (TORQUE/PBS) vs partition (SLURM)
On the old cluster, a queue was just a way to specify the time limit for a job (-l walltime=time could also be used):
- short: 2H
- std: 6H
- h12: 12H
- day: 24H
- days3 or threedays: 72H
- week: 168H
- weeks2: 360H
- infini: 860H
On the new cluster, partitions refer to different hardware nodes:
- zen4: nodes with 4000MB of memory per core and 64 cores (default partition)
- zen16: nodes with 16000MB of memory per core and 32 cores
Time request | Torque/PBS | SLURM |
---|---|---|
2H | -q short | --time=2:00:00 |
6H | -q std | --time=6:00:00 |
12H | -q h12 | --time=12:00:00 |
1 day | -q days | --time=24:00:00 or --time=1-00:00:00 |
3 days | -q days3 or -q threedays | --time=3-00:00:00 |
7 days | -q week | --time=7-00:00:00 |
any duration | -l walltime= | --time= |
2 weeks | -q weeks2 | doesn't exist anymore |
1 month | -q infini | doesn't exist anymore |
- The default time, if not specified, is 1H in both partitions; the maximum time is 7 days (168H); see the submission example after this list.
If you need more than one week of compute time, you must have a check-pointing solution in your code.
- Check-pointing is the ability to restart your job from the point where it stopped.
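For example, to choose a partition and a time limit at submission (values are illustrative):

sbatch --partition=zen4 --time=2:00:00 MyJob.sh       # equivalent of the old -q short
sbatch --partition=zen16 --time=3-00:00:00 MyJob.sh   # 3 days on the large-memory nodes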
Table 2 lists the commonly used options in the batch job script for both Torque/PBS (qsub) and SLURM (sbatch/srun/salloc).
Option | Torque/PBS | SLURM |
---|---|---|
in shell script job directive | #PBS | #SBATCH |
Declares a name for the job. | -N name | -J, --job-name=<jobname> |
Declares whether the standard error stream of the job is merged with the standard output stream. | -j oe / -j eo | default behavior |
Declares whether the output and error streams are kept on the execution host. | -k oe / -k eo | default behavior |
Defines the path to be used for the standard output stream of the batch job | -o path | -o, --output=<filename pattern> |
Defines the path to be used for the standard error stream of the batch job | -e path | -e, --error=<filename pattern> |
Defines the working directory path to be used for the job. | -w path | -D, --workdir=<directory> |
Defines the set of conditions under which the execution server will send a mail message about the job. | -m mail_options (a, b, e) | --mail-type=<type> (type = NONE, BEGIN, END, FAIL, REQUEUE, ALL) default is NONE |
Declares the list of users to whom mail is sent by the execution server when it sends mail about the job. | -M user_list | --mail-user=<user> |
Specifies the number of processors per node requested. | -l nodes=number:ppn=number | --ntasks-per-node=<ntasks> / --tasks-per-node=<n> |
Specifies the real memory required per node, in megabytes or gigabytes. | -l mem=<size> | --mem=<size>M or <size>G |
Specifies the minimum memory required per allocated CPU, in megabytes or gigabytes. | no equivalent | --mem-per-cpu=<size>M or <size>G |
Specifies the virtual memory required. | -l vmem=<size> | not used anymore |
This job may be scheduled for execution at any point after jobs jobid have started execution. | -W depend=after:jobid[:jobid...] | -d, --dependency=after:job_id[:jobid...] |
This job may be scheduled for execution only after jobs jobid have terminated with no errors. | -W depend=afterok:jobid[:jobid...] | -d, --dependency=afterok:job_id[:jobid...] |
This job may be scheduled for execution only after jobs jobid have terminated with errors. | -W depend=afternotok:jobid[:jobid...] | -d, --dependency=afternotok:job_id[:jobid...] |
This job may be scheduled for execution after jobs jobid have terminated, with or without errors. | -W depend=afterany:jobid[:jobid...] | -d, --dependency=afterany:job_id[:jobid...] |
Expands the list of environment variables that are exported to the job. | -v variable_list | --export=<environment variables|ALL|NONE> |
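For example, a minimal SLURM job script using the options above (job name, partition, resources, file names and mail address are placeholders):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=zen4
#SBATCH --time=6:00:00
#SBATCH --ntasks-per-node=4
#SBATCH --mem=8G
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.org

module load <your modules>    # load what the job needs, see module avail
./my_program                  # placeholder for your actual command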
Table 3 lists the commonly used environment variables in Torque/PBS and the equivalents in SLURM.
Environment Variable For | Torque/PBS | SLURM |
---|---|---|
Job ID | PBS_JOBID | SLURM_JOB_ID / SLURM_JOBID |
Job name | PBS_JOBNAME | SLURM_JOB_NAME |
Node list | PBS_NODEFILE | SLURM_JOB_NODELIST / SLURM_NODELIST |
Job submit directory | PBS_O_WORKDIR | SLURM_SUBMIT_DIR |
Job array ID (index) | PBS_ARRAY_INDEX | SLURM_ARRAY_TASK_ID |
Number of tasks | - | SLURM_NTASKS |
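For example, inside a job script:

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) runs on: $SLURM_JOB_NODELIST"
cd "$SLURM_SUBMIT_DIR"                      # equivalent of cd $PBS_O_WORKDIR
echo "Array index: $SLURM_ARRAY_TASK_ID"    # only set for job arrays
echo "Number of tasks: $SLURM_NTASKS"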
Container for compatibility with the ciclad/climserv clusters
It's also possible to run old scripts and binaries directly on the new cluster.
For MPI code it is not possible to use this container; the only way is to recompile your code on the new cluster with the new compiler and library modules.
We have built a Singularity container based on Scientific Linux 6 with all the old modules inside.
To use it interactively, just launch the mesosl6 command. The ld.so LD_PRELOAD error in the output below is harmless:
user@spirit1:~> mesosl6
ERROR: ld.so: object '/usr/bin/tclsh' from LD_PRELOAD cannot be preloaded: ignored.
----------------------------------------------------------
Welcome on First Singularity Container Meso-SL6 compatible
5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022
Scientific Linux release 6.10 (Carbon) running on spirit1
WARNING: MPI Program cannot be run this way
----------------------------------------------------------
[SINGULARITY-MESOSL6]~ \>
To launch an old script in batch mode on the new cluster, launch the container via SLURM (sbatch/srun), as in the sample below.
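A minimal sketch, assuming mesosl6 accepts the script to run as an argument (the exact invocation, script name and resources are assumptions; check the cluster documentation):

#!/bin/bash
#SBATCH --job-name=old-script
#SBATCH --partition=zen4
#SBATCH --time=2:00:00

# run the old ciclad/climserv script inside the SL6 compatibility container
mesosl6 ./my_old_script.sh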
CRONTABS
crontab on spirit/spiritx does not load your shell initialisation files; if you have to load some modules in your cron scripts,
start with