HAL job manager

Slurm

Once connected to hal0-ng.obsq.uvsq.fr, the head node, you can submit interactive or batch jobs to access the computational resources. General information about Slurm is given here.

Warning

If you use a development IDE such as Spyder, you must close the application to actually free the memory of the GPU card.

Info

During the development phase of your code, you do not need your entire dataset or a GPU card. You can therefore omit the GPU reservation in your job submissions (remove the --gpus option), which avoids tying up scarce resources that others would like to use. You can also develop your code on the other CPU clusters of the IPSL computing centre, as the AI modules are available on these clusters too. Finally, you can sub-sample your data beforehand.
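As a rough illustration of the advice above, the sketch below prints the srun command line for an interactive session, with the GPU reservation added only when a GPU count is given. hal_srun_cmd is a hypothetical helper, not part of the cluster tooling, and "--pty bash" is only an assumed way of opening an interactive shell; check the local Slurm configuration for any other options you need (partition, time limit, etc.).

```shell
#!/bin/sh
# Hypothetical helper: print the srun command for an interactive session.
# Argument 1 is the number of GPU cards to reserve (default 0 = no GPU).
hal_srun_cmd() {
    gpus="${1:-0}"
    if [ "$gpus" -gt 0 ]; then
        # Production run: reserve the requested number of GPU cards.
        printf 'srun --gpus=%s --pty bash\n' "$gpus"
    else
        # Development run: no --gpus option, so no GPU card is tied up.
        printf 'srun --pty bash\n'
    fi
}

hal_srun_cmd      # prints: srun --pty bash
hal_srun_cmd 1    # prints: srun --gpus=1 --pty bash
```

Printing the command rather than running it makes it easy to inspect before pasting it into a terminal on the head node.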

GPU Monitoring

GPU monitoring commands. They can be run only on the hal[1-6] machines (not on hal0):

nvtop: the GPU counterpart of the top and htop commands. A well-made text-mode (ASCII art) interface for monitoring GPU cards in real time (computation activity, memory, data transfer, etc.).

module load nvtop
nvtop

nvidia-smi: a text-mode command for monitoring GPU cards that can be used in an automated process.

watch nvidia-smi # Refreshes the display every two seconds (the watch default).
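To show how nvidia-smi might fit into an automated process, here is a minimal sketch that appends a fixed number of GPU usage samples to a CSV file. The function name log_gpu_usage, its arguments, and the NVIDIA_SMI override variable are hypothetical; the --query-gpu and --format options are standard nvidia-smi options, and the list of queried fields can be adjusted to your needs.

```shell
#!/bin/sh
# Hypothetical helper: append GPU usage samples to a CSV file.
#   $1 = output file, $2 = number of samples (default 5),
#   $3 = delay in seconds between samples (default 10)
log_gpu_usage() {
    out="$1"
    samples="${2:-5}"
    delay="${3:-10}"
    # NVIDIA_SMI is an assumed override, handy when the binary is not in PATH.
    smi="${NVIDIA_SMI:-nvidia-smi}"

    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "nvidia-smi not found: run this on a GPU node" >&2
        return 1
    fi

    i=0
    while [ "$i" -lt "$samples" ]; do
        # One CSV line per GPU card: time, card index, load and memory.
        "$smi" \
            --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
            --format=csv,noheader >> "$out"
        i=$((i + 1))
        # Wait between samples, but not after the last one.
        if [ "$i" -lt "$samples" ]; then
            sleep "$delay"
        fi
    done
}
```

For instance, calling "log_gpu_usage gpu_usage.csv 60 10 &" near the top of a batch script would record about ten minutes of samples in the background while the computation runs.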

Tips

Monitoring the activity of the GPU cards helps considerably when optimizing your computations.