Access
HAL head nodes
Connection to the IPSL ESPRI Mesocenter is made by SSH to the cluster head nodes. The head nodes are open worldwide via SSH, but only with ED25519 or RSA (4096-bit) key types. IPSL's cluster head nodes do not use a password for the connection (although a passphrase may be asked to decipher your private key), except for IPSL's JupyterHub.
Head node addresses
The fully qualified address of the HAL head node:
- hal.ipsl.fr
The head nodes of HAL are only for access and file transfers; no scientific software is installed on them.
Read this page about the computing nodes of HAL.
Warning
The head nodes are meant only for short, low-memory administrative commands. Read the interactive jobs page to learn how to run long interactive jobs, with or without GPU support.
HAL access
The HAL cluster is composed of the machine hal.ipsl.fr and the computing nodes hal[1-6].obs.uvsq.fr. hal is the head node of the cluster; connections to it are possible from the internet with SSH key authentication. The hal[1-6] machines can only be accessed by bouncing through hal0 (the head node), provided you have a running job on the target machine.
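For instance, here is a hedged sketch of such a bounce using the SSH -J (jump host) option, assuming your job runs on hal4 and that direct SSH to a node hosting one of your jobs is allowed:
ssh -J <cluster_login>@hal.ipsl.fr <cluster_login>@hal4.obs.uvsq.fr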
Connection without JupyterLab / Jupyter Notebook
After requesting the creation of an account on the computing and data centre of the IPSL, and once it is accepted by the IPSL, the connection to the cluster is established with the SSH protocol using the SSH key declared to the centre (more details at this page). If you want graphical output (application windows displayed on your own computer), add the -X option (X forwarding) to the ssh command. Feel free to create connection aliases using a config file (more details at this page).
Example from a terminal on your machine:
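ssh <cluster_login>@hal.ipsl.fr
ssh -X <cluster_login>@hal.ipsl.fr # or, with X forwarding if you need graphical windows
As a hedged sketch, a connection alias could look like this in your ~/.ssh/config file (the alias name hal and the key path are only examples, adapt them to your setup):
Host hal
    HostName hal.ipsl.fr
    User <cluster_login>
    IdentityFile ~/.ssh/id_ed25519
    ForwardX11 yes
You can then connect with ssh hal.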
Connection with JupyterLab / Jupyter Notebook
The idea is to create an SSH tunnel that forwards the JupyterLab or Jupyter Notebook traffic to your machine. We propose two methods; both create an interactive job (more information here) on the cluster and an SSH tunnel.
Warning
At the end of your work session, close all your terminals connected to the cluster (press CTRL+C several times) in order to free the allocated resources (simply leaving the notebook does not stop your interactive job). If any sessions remain blocked, you can cancel all your jobs with this command executed on hal0: scancel -u <cluster_login>.
Method 1 (semi-automatic)
Example from a terminal on your machine:
ssh <cluster_login>@hal.ipsl.fr
srun --time='1:00:00' --gpus=1 /net/nfs/tools/bin/jupytercluster.sh 'pytorch/2.1.2' # Or any other AI module containing JupyterLab!
Info
The jupytercluster.sh script displays a command line in the terminal, to be executed in another terminal on your machine (not on the head node hal). Example: ssh -N -L 30920:hal4.obs.uvsq.fr:30920 delcambre@hal.ipsl.fr. This command produces no output, which is normal. Then, in a web browser on your machine, copy/paste the URL displayed at the very end of the first terminal, which starts with http://127.0.0.1:....... Example: http://127.0.0.1:30920/?token=c3a46af69f1eb10fc0e3b8aa270050d7d3046a30508d9376. Keep both terminals open until the end of your Jupyter session.
By default, JupyterLab is started. If you prefer Jupyter Notebook, add the -n option to jupytercluster.sh. Example:
ssh <cluster_login>@hal.ipsl.fr
srun --time='1:00:00' --gpus=1 /net/nfs/tools/bin/jupytercluster.sh -n 'pytorch/2.1.2' # Or any other AI module containing JupyterLab!
If you want to start TensorBoard, add the -t option followed by the path to the directory that will contain the training log files. The script will give you an additional URL to copy into your web browser to display TensorBoard. Example:
ssh <cluster_login>@hal.ipsl.fr
srun --time='1:00:00' --gpus=1 /net/nfs/tools/bin/jupytercluster.sh -t "${HOME}/my_logs_dir" 'pytorch/2.1.2' # Or any other AI module containing JupyterLab!
Info
Of course, the script options are cumulative. To see all the options, run /net/nfs/tools/bin/jupytercluster.sh -h from hal0.
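For example, a hedged combination of the -n and -t options (the log directory path is only an illustration):
ssh <cluster_login>@hal.ipsl.fr
srun --time='1:00:00' --gpus=1 /net/nfs/tools/bin/jupytercluster.sh -n -t "${HOME}/my_logs_dir" 'pytorch/2.1.2' # Or any other AI module containing JupyterLab!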
Info
The options --time='1:00:00' and --gpus=1 are explained at this page. The interactive job in this example will last 1 hour and use 1 GPU. Modify the option values as you need.
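For instance, a hedged variant requesting 2 hours and 2 GPUs (check the limits page mentioned below before increasing these values):
srun --time='2:00:00' --gpus=2 /net/nfs/tools/bin/jupytercluster.sh 'pytorch/2.1.2'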
Info
The limits and default values for the job specification are described at this page.
Method 2 (manual)
First read the general instructions at this page. The following instructions give a practical example for the HAL GPU cluster.
From a local terminal on your machine:
ssh <cluster_login>@hal.ipsl.fr
srun --time='1:00:00' --gpus=1 --pty bash
# Slurm connects you to one of the hal[1-6] machines.
# Note the number of the allocated HAL machine for later.
module load pytorch/2.1.2 # Or any other AI module containing JupyterLab!
jupyter lab --no-browser --ip=0.0.0.0 --port=<number between 10000 and 15000>
From another local terminal on your machine:
ssh -N -L <the chosen port number>:hal<allocated machine number>.obs.uvsq.fr:<the chosen port number> <cluster_login>@hal.ipsl.fr
The allocated machine number can be found in the output of the squeue --me command. Then copy and paste the JupyterLab connection URL that was displayed in the first terminal.
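As a hedged illustration, you can also print only the job id and the allocated node using Slurm format specifiers:
squeue --me -o "%i %N" # prints the job id followed by the allocated node, e.g. hal4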
Info
The options --time='1:00:00' and --gpus=1 are explained at this page. The interactive job in this example will last 1 hour and use 1 GPU. Modify the option values as you need.
Info
The limits and default values for the job specification are described at this page.