Batch job
Batch jobs let you submit processing in a detached way for long-running computations: without interaction and without maintaining a connection between your machine and the cluster (the jobs keep running even if you are disconnected). Several batch jobs can be submitted and run at the same time. However, jobs that cannot fit in the available resources are queued to run later.
Tips
Even if your job is running in batch mode, you can establish an SSH tunnel to monitor your computations (e.g., with TensorBoard).
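A sketch of such a tunnel, assuming TensorBoard listens on port 6006 on the compute node (replace <compute-node> with the node reported by squeue, and me with your own login):
ssh -L 6006:<compute-node>:6006 me@hal0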
There are two methods to submit a job in batch mode, depending on the command used for submission: sbatch and srun.
Sbatch method
The first method is to create a Bash script containing configuration options starting with #SBATCH, placed at the beginning of the file.
Example of a Bash script named job.slurm, located in the directory ${HOME}/mydir. The purpose of this script is to execute the Python script ${HOME}/mydir/mycode.py.
To submit the job to Slurm, run sbatch on the script from hal0. The script job.slurm is designed to pass any arguments given after the script path to the Python script:
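A minimal sketch of the submission command (the arguments are placeholders for whatever mycode.py expects, and can be omitted):
sbatch "${HOME}/mydir/job.slurm" <arg1> <arg2>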
Contents of the job.slurm file:
#!/bin/bash
# SBATCH instructions always go at the beginning of the script!
# Change the working directory before the execution of the job.
# Warning: the environment variables, e.g. $HOME,
# are not interpreted for the SBATCH instructions.
# Writing absolute paths is recommended.
#SBATCH --chdir=/home/me/mydir
# The job partition (maximum elapsed time of the job).
#SBATCH --partition=batch
# The name of the job.
#SBATCH --job-name=myjobname
# The number of GPU cards requested.
# If the GPU architecture is not specified,
# Slurm chooses first Ampere then Turing.
#SBATCH --gpus=1
# The maximum elapsed time of the computation (format HH:MM:SS).
# Default is 1 hour.
#SBATCH --time=1:00:00
# Email notifications (e.g., the beginning and the end of the job).
#SBATCH --mail-user=me@myprovider.com
#SBATCH --mail-type=all
# The path of the job log files.
# The error and the output logs can be merged into the same file.
# %j is replaced by the job ID.
#SBATCH --error=slurm-%j.err
#SBATCH --output=slurm-%j.out
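# To merge them into a single file, point both instructions to the same
# file name, e.g. --output=slurm-%j.log and --error=slurm-%j.log (illustrative).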
# Remove the system limit on the stack size.
ulimit -s unlimited
# Load the system-wide profile configuration.
source /etc/profile
# Unload all modules previously loaded.
module purge
# Changes to the given working directory.
# It supersedes the #SBATCH --chdir instruction,
# and environment variables can be used (e.g., ${HOME}).
#cd ${HOME}/mydir
# Load the required module.
module load 'pytorch/2.1.2' # Or any other AI module.
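# Run 'module avail' to list the modules available on the cluster.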
################ OPTIONAL ENVIRONMENT ##################
##### PYTHON VIRTUAL ENVIRONMENT ACTIVATION
## The Python module extended by the virtual environment must be loaded beforehand.
# source "path/to/myenv/bin/activate"
########################################################
# Make the Python standard and error outputs unbuffered, so they appear in the logs in real time.
export PYTHONUNBUFFERED=1
####################### DEBUG #####################
# Optional instructions.
cat << EOF
===== my job information =====
> module list:
`module list 2>&1`
> node list: ${SLURM_NODELIST}
> my job id: ${SLURM_JOB_ID}
> job name: ${SLURM_JOB_NAME}
> partition: ${SLURM_JOB_PARTITION}
> submit host: ${SLURM_SUBMIT_HOST}
> submit directory: ${SLURM_SUBMIT_DIR}
> current directory: `pwd`
> executed as user: `whoami`
> executed as slurm user: ${SLURM_JOB_USER}
> user PATH: ${PATH}
EOF
####################################################
# Run your process. The optional command-line arguments of the job.slurm script
# are passed to your Python script ("$@").
python mycode.py "$@"
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}
Info
The limits and default values for the job specification are described on this page.
Warning
The default maximum elapsed time of computation is 1h (when #SBATCH --time is not specified). It can be set up to 72h for a batch job. Slurm kills jobs that exceed that limit. Read this page for more information.
Info
The instructions in the "debug" block are optional, but they can be useful to understand why a job does not run.
Info
If your Python script requires the activation of a Python virtual environment, uncomment the instructions in the "OPTIONAL ENVIRONMENT" block and set the path of your environment. Read this page for the use case.
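A minimal sketch of the uncommented activation line, assuming a virtual environment located at ${HOME}/myenv (a hypothetical path, to be replaced by the path of your own environment):
source "${HOME}/myenv/bin/activate"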
Tips
The command sbatch has options that can supersede the #SBATCH instructions. For example, sbatch --time='1:00:00' supersedes the instruction #SBATCH --time=1:00:00. Read more information about sbatch at this page.
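For instance, a sketch of overriding the time limit of job.slurm from the command line (the value 2:00:00 is only illustrative):
sbatch --time='2:00:00' "${HOME}/mydir/job.slurm" <arg1> <arg2>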
Tips
Slurm, the HAL cluster's job manager, lets you choose the GPU architecture on which your code will be executed, with the #SBATCH --gpus=<ampere or turing>:<1 or 2> instruction in the batch script used by the sbatch command. For example, #SBATCH --gpus=turing:1 allocates one Nvidia® GeForce® RTX 2080 Ti GPU card. Run squeue and sinfo to check the availability of the cluster nodes. Note that if the GPU architecture is not specified, Slurm chooses randomly between Turing and Ampere.
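A short sketch of both ways of requesting a specific architecture (the values are illustrative; check the actual availability with sinfo first):
# In the batch script:
#SBATCH --gpus=ampere:2
# Or directly on the command line, which supersedes the script instruction:
sbatch --gpus=turing:1 "${HOME}/mydir/job.slurm"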
Info
The RAM and CPU resources of a node are shared between jobs executed on the node.
Srun method
The second method uses the same two scripts (Bash and Python), but the #SBATCH options are given as arguments to the srun command. This method is preferable when these options are generated by a meta-script. The execution permission of the Slurm script must be set (chmod +x job.slurm).
List of additional srun options for batch job submission. The options described for interactive job submission still apply, except for --pty bash, which specifies an interactive job:
- --output # Specifies the file path where the standard output will be logged.
- --error # Specifies the path of the file where the standard error will be logged.
Example of submitting a batch job using srun, from hal0:
srun --gpus=1 --mail-user='me@myprovider.com' --output='job.log' --error='job.log' "${HOME}/mydir/job.slurm" <arg1> <arg2> &
Info
Note the & symbol at the end of the line, which runs srun in the background and effectively submits the job to Slurm.
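Once submitted, the job can be monitored from hal0, for instance by filtering the queue on your user name:
squeue --user="${USER}"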