Batch job
Batch jobs let you submit processing in a detached way for long-running computations: without interaction and without maintaining a connection between your machine and the cluster (the jobs keep running even if you are disconnected). Several batch jobs can be submitted and run at the same time. However, jobs that cannot fit in the available resources are queued to run later.
Tips
Even if your job is running in batch mode, you can establish an SSH tunnel to monitor your computations (e.g., with TensorBoard).
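A sketch of such a tunnel, assuming TensorBoard listens on port 6006 on the compute node (replace <compute-node> with the node reported by squeue, and me with your own login):
ssh -L 6006:<compute-node>:6006 me@hal0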
There are two methods to submit a job in batch mode, depending on the command used for submission: sbatch and srun.
Sbatch method
The first method is to create a Bash script containing configuration options starting with #SBATCH, placed at the beginning of the file.
Example of a Bash script named job.slurm, located in the directory ${HOME}/mydir. The purpose of this script is to execute the Python script ${HOME}/mydir/mycode.py.
To submit the job to Slurm, run sbatch on the script from hal0. The script job.slurm is designed to pass any arguments given after the script path to the Python script:
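A minimal sketch of the submission command (the arguments are placeholders for whatever mycode.py expects, and can be omitted):
sbatch "${HOME}/mydir/job.slurm" <arg1> <arg2>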
Contents of the job.slurm file:
#!/bin/bash
# SBATCH instructions always go at the beginning of the script!
# Change the working directory before the execution of the job.
# Warning: the environment variables, e.g. $HOME,
# are not interpreted for the SBATCH instructions.
# Writing absolute paths is recommended.
#SBATCH --chdir=/home/me/mydir
# The job partition (maximum elapsed time of the job).
#SBATCH --partition=batch
# The name of the job.
#SBATCH --job-name=myjobname
# The number of GPU cards requested.
# If the GPU architecture is not specified,
# Slurm chooses first Ampere then Turing.
#SBATCH --gpus=1
# The maximum elapsed time of the computation (format HH:MM:SS).
# Default is 1 hour.
#SBATCH --time=1:00:00
# Email notifications (e.g., the beginning and the end of the job).
#SBATCH --mail-user=me@myprovider.com
#SBATCH --mail-type=all
# The path of the job log files.
# The error and the output logs can be merged into the same file.
# %j is replaced by the job ID.
#SBATCH --error=slurm-%j.err
#SBATCH --output=slurm-%j.out
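# To merge them into a single file, point both instructions to the same
# file name, e.g. --output=slurm-%j.log and --error=slurm-%j.log (illustrative).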
# Remove the system limit on the stack size.
ulimit -s unlimited
# Load the system-wide profile configuration.
source /etc/profile
# Unload all modules previously loaded.
module purge
# Changes to the given working directory.
# It supersedes the #SBATCH --chdir instruction,
# and environment variables can be used (e.g., ${HOME}).
#cd ${HOME}/mydir
# Load the required module.
module load 'pytorch/2.1.2' # Or any other AI module.
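# Run 'module avail' to list the modules available on the cluster.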
################ OPTIONAL ENVIRONMENT ##################
##### PYTHON VIRTUAL ENVIRONMENT ACTIVATION
## The Python module extended by the virtual environment must be loaded beforehand.
# source "path/to/myenv/bin/activate"
########################################################
# Make the Python standard and error outputs unbuffered, so they appear in the logs in real time.
export PYTHONUNBUFFERED=1
####################### DEBUG #####################
# Optional instructions.
cat << EOF
===== my job information =====
> module list:
`module list 2>&1`
> node list: ${SLURM_NODELIST}
> my job id: ${SLURM_JOB_ID}
> job name: ${SLURM_JOB_NAME}
> partition: ${SLURM_JOB_PARTITION}
> submit host: ${SLURM_SUBMIT_HOST}
> submit directory: ${SLURM_SUBMIT_DIR}
> current directory: `pwd`
> executed as user: `whoami`
> executed as slurm user: ${SLURM_JOB_USER}
> user PATH: ${PATH}
EOF
####################################################
# Run your process. The optional command-line arguments of the job.slurm script
# are passed to your Python script ("$@").
python mycode.py "$@"
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}
Info
The limits and default values for the job specification are described on this page.
Warning
The default maximum elapsed time of computation is 1h (when #SBATCH --time is not specified). It can be set up to 72h for a batch job. Slurm kills jobs that exceed that limit. Read this page for more information.
Info
The instructions in the "debug" block are optional, but they can be useful to understand why a job does not run.
Info
If your Python script requires the activation of a Python virtual environment, uncomment the instructions in the "OPTIONAL ENVIRONMENT" block and set the path of your environment. Read this page for the use case.
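A minimal sketch of the uncommented activation line, assuming a virtual environment located at ${HOME}/myenv (a hypothetical path, to be replaced by the path of your own environment):
source "${HOME}/myenv/bin/activate"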
Tips
The command sbatch has options that can supersede the #SBATCH instructions. For example, sbatch --time='1:00:00' supersedes the instruction #SBATCH --time=1:00:00. Read more information about sbatch at this page.
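For instance, a sketch of overriding the time limit of job.slurm from the command line (the value 2:00:00 is only illustrative):
sbatch --time='2:00:00' "${HOME}/mydir/job.slurm" <arg1> <arg2>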
Tips
Slurm, the HAL cluster's job manager, lets you choose the GPU architecture on which your code will be executed, with the #SBATCH --gpus=<ampere or turing>:<1 or 2> instruction in the batch script used by the sbatch command. For example, #SBATCH --gpus=turing:1 allocates one Nvidia® GeForce® RTX 2080 Ti GPU card. Run squeue and sinfo to check the availability of the cluster nodes. Note that if the GPU architecture is not specified, Slurm chooses randomly between Turing and Ampere.
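A short sketch of both ways of requesting a specific architecture (the values are illustrative; check the actual availability with sinfo first):
# In the batch script:
#SBATCH --gpus=ampere:2
# Or directly on the command line, which supersedes the script instruction:
sbatch --gpus=turing:1 "${HOME}/mydir/job.slurm"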
Info
The RAM and CPU resources of a node are shared between jobs executed on the node.
Srun method
The second method uses the same two scripts (Bash and Python), but the #SBATCH options are given as arguments to the srun command. This method is preferable when these options are generated by a meta-script. The execution permission of the Slurm script must be set (chmod +x job.slurm).
List of additional srun options for batch job submission. The options described for interactive job submission still apply, except for --pty bash, which specifies an interactive job:
- --output # Specifies the file path where the standard output will be logged.
- --error # Specifies the path of the file where the standard error will be logged.
Example of submitting a batch job using srun, from hal0:
srun --gpus=1 --mail-user='me@myprovider.com' --output='job.log' --error='job.log' "${HOME}/mydir/job.slurm" <arg1> <arg2> &
Info
Note the & symbol at the end of the line, which runs srun in the background and effectively submits the job to Slurm.
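Once submitted, the job can be monitored from hal0, for instance by filtering the queue on your user name:
squeue --user="${USER}"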