Parallel hyperparameter optimization with Optuna
Optuna is a hyperparameter optimization (HPO) library that eases the search for optimal machine learning hyperparameter values with respect to one or more monitored training metrics. This tutorial gives a simple example of parallel HPO for a classic machine/deep learning training. The example can easily be adapted to your own design because the Optuna-specific code lives in its own Python module, separate from the training code: the training code can be developed first, and the Optuna module added later when HPO comes into play. The tutorial is therefore divided into two parts: the training loop, then the Optuna hyperparameter value generator.
Warning
This example does not support multi-node data-distributed GPU training (this is not a limitation of Optuna). Such a setup is possible, but it requires more complex code.
Find the complete train script at this address and the HPO driving script at this address.
Training loop
Let's start with the classic example from PyTorch Lightning: training an autoencoder on the MNIST dataset.
# Adapted from https://www.pytorchlightning.ai/
# All rights reserved.
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl
# SETTINGS
NB_DATA_WORKERS = 2
def train() -> float:
    class LitAutoEncoder(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(28 * 28, 64),
                nn.ReLU(),
                nn.Linear(64, 3))
            self.decoder = nn.Sequential(
                nn.Linear(3, 64),
                nn.ReLU(),
                nn.Linear(64, 28 * 28))

        def forward(self, x):
            embedding = self.encoder(x)
            return embedding

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optimizer

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = F.mse_loss(x_hat, x)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, val_batch, batch_idx):
            x, y = val_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = F.mse_loss(x_hat, x)
            self.log_dict({'val_loss': loss, 'val_metric': loss})

    # data
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=32,
                              num_workers=NB_DATA_WORKERS, persistent_workers=True)
    val_loader = DataLoader(mnist_val, batch_size=32,
                            num_workers=NB_DATA_WORKERS, persistent_workers=True)

    # model
    model = LitAutoEncoder()

    # training
    trainer = pl.Trainer(accelerator='auto',
                         max_epochs=10)
    trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)
    return float(trainer.callback_metrics['val_metric'])
Command line options
Instead of hard-coding the training settings and cluttering your version control system (e.g., git), we classically implement a command-line option mechanism (e.g., --batch-size=32).
Python name-main idiom
First, the script has to implement the if __name__ == "__main__" idiom in order to parse the command-line options, for example with the argparse library, then run the training:
import traceback
import argparse
# SETTINGS
SUCCESS_CODE = 0
FAILED_CODE = 1
# Wrapper for command line invocation (without Optuna HPO), support options (e.g., --help).
# Usage example: python train.py --epochs=10 --lr=0.00005
def main() -> None:
    training_options = parse_args()
    metric_value = train(training_options)
    print(f'> metric value of the last epoch: {metric_value}')

if __name__ == '__main__':
    main_exit_code = SUCCESS_CODE
    try:
        main()
    except Exception as e:
        print(f'> something went wrong: {str(e)}')
        traceback.print_exception(type(e), e, e.__traceback__)
        main_exit_code = FAILED_CODE
    exit(main_exit_code)
Command line options parsing
The argparse library provides a simple, easy-to-configure solution. The object returned by parse_args is an argparse.Namespace object that holds the values of the options (e.g., training_options.batch_size).
# Command line options parser.
def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument('-b', '--batch-size', default=32, type=int, help='training batch size per GPU')
    parser.add_argument('-e', '--epochs', default=1, type=int, help='number of epochs')
    parser.add_argument('-l', '--loss', default='mse', type=str, help='loss func: mse | mae')
    parser.add_argument('-o', '--optimizer', default='adamw', type=str, help='optimizer: sgd | adam | adamw')
    parser.add_argument('-r', '--lr', default=1e-4, type=float, help='learning rate')
    training_options = parser.parse_args()
    return training_options
Loss function mapping
A classic technique for dynamically executing a function specified from command-line options is a function mapping: the loss function names are mapped to the corresponding PyTorch functions.
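A minimal sketch of such a mapping, consistent with the loss names accepted by the command-line options (the exact dictionary is defined in the complete train.py; mapping 'mae' to PyTorch's l1_loss is an assumption here):
from torch.nn import functional as F

# Assumed mapping from the loss option names to the PyTorch functional losses.
_LOSS_MAPPING = {
    'mse': F.mse_loss,
    'mae': F.l1_loss,
}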
Thus, in the model class, the loss function is executed simply by selecting the function and calling it: _LOSS_MAPPING[self.training_options.loss](x_hat, x).
Training optimizer factory
A classic technique for dynamically instantiating objects specified from command-line options is the factory design pattern. To dynamically instantiate the optimizer object, we propose to implement the following function:
from torch.nn import Module
from torch.optim import Optimizer, SGD, Adam, AdamW

# Instantiate the optimizer specified by its name (factory pattern).
def create_optimizer(model: Module, optimizer_name: str, lr: float) -> Optimizer:
    match optimizer_name:
        case 'sgd':
            optimizer = SGD(model.parameters(), lr=lr)
        case 'adam':
            optimizer = Adam(model.parameters(), lr=lr)
        case 'adamw':
            optimizer = AdamW(model.parameters(), lr=lr)
        case _:
            raise ValueError(f'unsupported optimizer: {optimizer_name}')
    return optimizer
Enhanced training loop
Finally, we pass the parsed options to the training loop:
import optuna  # Needed for the optuna.trial.Trial annotation below.
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl

def train(training_options: argparse.Namespace, trial: optuna.trial.Trial = None) -> float:
    class LitAutoEncoder(pl.LightningModule):
        def __init__(self, training_options: argparse.Namespace):
            super().__init__()
            self.training_options = training_options
            self.encoder = nn.Sequential(
                nn.Linear(28 * 28, 64),
                nn.ReLU(),
                nn.Linear(64, 3))
            self.decoder = nn.Sequential(
                nn.Linear(3, 64),
                nn.ReLU(),
                nn.Linear(64, 28 * 28))

        def forward(self, x):
            embedding = self.encoder(x)
            return embedding

        def configure_optimizers(self):
            optimizer = create_optimizer(self, self.training_options.optimizer, self.training_options.lr)
            return optimizer

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = _LOSS_MAPPING[self.training_options.loss](x_hat, x)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, val_batch, batch_idx):
            x, y = val_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = _LOSS_MAPPING[self.training_options.loss](x_hat, x)
            # The metric must be computed the same way across the trials in order to compare them!
            metric = _LOSS_MAPPING['mse'](x_hat, x) if self.training_options.loss != 'mse' else loss
            self.log_dict({'val_loss': loss, 'val_metric': metric})

    # data
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=training_options.batch_size,
                              num_workers=NB_DATA_WORKERS, persistent_workers=True)
    val_loader = DataLoader(mnist_val, batch_size=32,
                            num_workers=NB_DATA_WORKERS, persistent_workers=True)

    # model
    model = LitAutoEncoder(training_options)

    # training
    trainer = pl.Trainer(accelerator='auto',
                         max_epochs=training_options.epochs)
    trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)
    return float(trainer.callback_metrics['val_metric'])
Find the complete file at this address.
Optuna hyperparameter values generator
In general, HPO libraries offer two main functionalities:
- A mechanism, called a sampler, for generating hyperparameter values from sets specified by the user. A trial is a set of hyperparameter values selected by the sampler, which are then used to train a machine learning model.
- An early stopping mechanism, called a pruner, which stops a training run (and therefore a trial) that does not seem likely to improve on the best metric value obtained so far.
HPO for the previous example is implemented by creating a new module - optuna_bootstrap.py - which drives the HPO (sampling), and by slightly modifying the training loop to support Optuna's pruning functionality. In this example, the pruning algorithm is Hyperband and the sampling algorithm is the Tree-structured Parzen Estimator (TPE), which belongs to the family of Bayesian optimization (BO) methods. This pairing, also known as BOHB, is reputed to be robust (reference).
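The pruning hook itself does not appear in the snippets above; the trial argument of the train function is meant for it. As a hedged sketch (not the tutorial's exact code, and the build_trainer helper is purely illustrative), the trial can be wired to the Lightning Trainer through Optuna's PyTorchLightningPruningCallback, which reports the monitored metric at each validation epoch and prunes the trial when the pruner decides to stop it:
import optuna
from optuna.integration import PyTorchLightningPruningCallback
import pytorch_lightning as pl

# Sketch: attach a pruning callback to the Trainer when a trial is provided.
# Depending on your Optuna version, the callback may live in the separate
# optuna-integration package rather than in optuna.integration.
def build_trainer(epochs: int, trial: optuna.trial.Trial = None) -> pl.Trainer:
    callbacks = []
    if trial is not None:
        callbacks.append(PyTorchLightningPruningCallback(trial, monitor='val_metric'))
    return pl.Trainer(accelerator='auto', max_epochs=epochs, callbacks=callbacks)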
As HPO is particularly time-consuming, Optuna supports parallelization: several processes can work on the same HPO study, collaborating with each other and sharing their results (concurrent access to the study log file is supported). In this tutorial, parallelization is achieved by executing several instances of the optuna_bootstrap.py script at the same time.
Optuna Bootstrap
Study creation
Before starting HPO, it is necessary to create the study and its log file. The create_study function, which is embedded in the optuna_bootstrap.py module, is called first, once and for all. Running a function of a Python module requires a few adjustments, for example: python -c "from optuna_bootstrap import create_study ; create_study()" 'optuna_journal.log' 'study_name'
import sys
import optuna
# Create the Optuna HPO study:
# Usage example: python -c "from optuna_bootstrap import create_study ; create_study()" 'optuna_journal.log' 'study_name'
def create_study() -> None:
    journal_file_path = sys.argv[1]
    study_name = sys.argv[2]
    storage, sampler, pruner = create_optuna_conf(journal_file_path)
    print(f"> creating optuna journal file '{journal_file_path}'")
    optuna.study.create_study(study_name=study_name, storage=storage, load_if_exists=False, sampler=sampler,
                              direction='minimize', pruner=pruner)
    print('> Done')
Settings
Once the study is created, HPO can begin. The HPO settings are a few constants defined at the beginning of the script and used in the create_optuna_conf function:
# SETTINGS
RANDOM_SEED = 42
SAMPLER = optuna.samplers.TPESampler
PRUNER = optuna.pruners.HyperbandPruner
MIN_NUMBER_EPOCHS = 1
MAX_NUMBER_EPOCHS = 5
# Create the Optuna storage, sampler and pruner.
def create_optuna_conf(journal_file_path: str) -> tuple:
    storage = optuna.storages.JournalStorage(optuna.storages.JournalFileStorage(journal_file_path))
    sampler = SAMPLER(seed=RANDOM_SEED, multivariate=True)
    pruner = PRUNER(min_resource=MIN_NUMBER_EPOCHS, max_resource=MAX_NUMBER_EPOCHS)
    return storage, sampler, pruner
Hyperparameter value spaces
The hyperparameter value spaces are defined in the get_training_options function, thanks to the suggest_categorical, suggest_int and suggest_float methods. During the HPO, these methods also fetch the hyperparameter values chosen by the sampler algorithm.
# Define the space values for each hyperparameter to be optimized and fetch their value for each trial.
def get_training_options(trial: optuna.trial.Trial):
    options = dict()
    options['loss'] = trial.suggest_categorical(name='loss', choices=['mae', 'mse'])
    options['optimizer'] = trial.suggest_categorical(name='optimizer', choices=['sgd', 'adam', 'adamw'])
    options['batch_size'] = int(pow(2, trial.suggest_int(name='pow_batch_size', low=4, high=10, step=1)))
    options['epochs'] = MAX_NUMBER_EPOCHS
    options['lr'] = trial.suggest_float(name='lr', low=1e-6, high=1., log=True)
    return options
HPO objective
The objective of HPO is to find the optimal values of a set of hyperparameters with respect to a given training metric. The objective function calls the training loop, gives it the training options that derive from the trial (note the translation into an argparse.Namespace object) and returns the metric value computed at the end of the training. The get_task_id function disambiguates the execution traces (print calls) when the HPO is parallelized.
import os
import argparse
import train
# Return the task id.
def get_task_id() -> str:
    if 'SLURM_PROCID' in os.environ:
        return os.environ['SLURM_PROCID']
    else:
        return 'NO_ID'

# Define the HPO objective.
def objective(trial: optuna.trial.Trial) -> float:
    training_options = get_training_options(trial)
    print(f"> task #{get_task_id()} starting trial #{trial.number} with the following settings:\n{training_options}")
    metric_value = train.train(argparse.Namespace(**training_options), trial)
    return metric_value
HPO driver
The main function drives the HPO. First, it loads the previously created study with the load_study function, then it starts the optimization by calling the optimize function. This function returns when the time is up, according to the timeout parameter, whose value is expressed in seconds. Each optimization process is executed with three command-line arguments, like this: python optuna_bootstrap.py 'optuna_journal.log' 'study_name' 3600.
import sys
import traceback
import optuna
# Run the optimization.
# Running several instances of this script at the same time parallelizes the optimization,
# i.e. each process runs its own trials (its own sets of hyperparameter values).
# Usage example: python optuna_bootstrap.py 'optuna_journal.log' 'study_name' 3600
# Command line arguments:
# - optuna journal file path
# - study name
# - study duration (in seconds)
def main() -> int:
    journal_file_path = sys.argv[1]
    study_name = sys.argv[2]
    total_optimization_time = int(sys.argv[3])  # Unit: seconds.
    storage, sampler, pruner = create_optuna_conf(journal_file_path)
    study = optuna.study.load_study(study_name=study_name, storage=storage, sampler=sampler, pruner=pruner)
    try:
        study.optimize(func=objective, timeout=total_optimization_time)
        print(f"> task #{get_task_id()} ends")
        exit_code = train.SUCCESS_CODE
    except Exception as e:
        print(f"> [ERROR] task #{get_task_id()}: {str(e)}")
        traceback.print_exception(type(e), e, e.__traceback__)
        exit_code = train.FAILED_CODE
    return exit_code

if __name__ == '__main__':
    main_exit_code = main()
    exit(main_exit_code)
Initial trial
You probably already have some idea of the optimal hyperparameter values, especially from running train.py with its command-line options. To take advantage of your experience and speed up the HPO, Optuna lets you create such a trial and test it first. The trial configuration consists of a Python dictionary containing the names of the hyperparameters given in the get_training_options function and their values. This dictionary is then added to the study using the enqueue_trial function.
__INITIAL_TRIAL = {
    'loss': 'mse',
    'optimizer': 'adam',
    'pow_batch_size': 6,  # batch_size: 64
    'lr': 1.e-5
}

def create_study() -> None:
    journal_file_path = sys.argv[1]
    study_name = sys.argv[2]
    storage, sampler, pruner = create_optuna_conf(journal_file_path)
    print(f"> creating optuna journal file '{journal_file_path}'")
    study = optuna.study.create_study(study_name=study_name, storage=storage, load_if_exists=False, sampler=sampler,
                                      direction='minimize', pruner=pruner)
    # Set an initial trial based on best hyperparameter values already found.
    study.enqueue_trial(__INITIAL_TRIAL)
    print('> Done')
Tips
Adding a hand-written trial really accelerates the HPO, thanks to the pruning algorithm.
Find the complete file at this address.
Slurm job submitters
Finally, a Slurm script launches multiple processes that execute the optuna_bootstrap.py script. The Slurm script depends on the underlying cluster. This tutorial gives an example for Spirit[x], the IPSL CPU clusters; HAL, the IPSL GPU cluster; and Jean Zay, the IDRIS GPU cluster.
Tips
Each Slurm script is launched with the same command, which takes the same parameters as those expected by optuna_bootstrap.py. For example: sbatch XXX.slurm 'optuna_journal.log' 'study_name' 3600.
Note
Don't mind the AssertionError: can only test a child process raised while parallelizing the HPO. This exception comes from the MNIST data loading.
Spirit[x]
This Slurm script submits a single job on the zen16 (CPU) partition and runs two optuna_bootstrap.py processes (--ntasks-per-node=2); each process works on one CPU (--cpus-per-task=1).
#!/bin/bash
#SBATCH --job-name=optuna
#SBATCH --partition=zen16
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --error=optuna-%j.out
#SBATCH --output=optuna-%j.out
module purge
module load 'pytorch/2.1.2' # Or any other AI module.
# Enable the standard and error outputs of Python.
export PYTHONUNBUFFERED=1
# Run your process. The command line options of the Slurm script
# are passed to the Python script ($@).
srun python optuna_bootstrap.py ${@}
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}
Note
Running multiple processes is enabled by the srun command in front of the python command. Without srun, only one process is executed!
HAL
This Slurm script submits a single job on the batch (GPU) partition and runs one optuna_bootstrap.py process on one GPU (--gpus=1). Run the Slurm script again to execute another optuna_bootstrap.py process and thus parallelize the HPO.
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name=optuna
#SBATCH --gpus=1
#SBATCH --time=1:00:00
#SBATCH --error=optuna-%j.out
#SBATCH --output=optuna-%j.out
module purge
module load 'pytorch/2.1.2' # Or any other AI module.
# Enable the standard and error outputs of Python.
export PYTHONUNBUFFERED=1
# Run your process. The command line options of the Slurm script
# are passed to the Python script ($@).
python optuna_bootstrap.py ${@}
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}
Jean Zay
This Slurm script submits a single job on the gpu_p1 partition (GPU; -C v100-32g) and runs, on four nodes (--nodes=4), four optuna_bootstrap.py processes per node (--ntasks-per-node=4), each process working on one GPU (--gpus-per-task=1). Thus, 4 nodes of 4 V100 32 GB GPUs are entirely reserved, running a total of 16 processes.
Tips
If Optuna is not shipped with an AI module, you can still extend the module; see this procedure.
Tips
As the Jean Zay compute nodes don't have access to the Internet, you must run train.py on a head node once and for all, so as to download the MNIST data.
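For example, the data can be pre-downloaded from a head node with a one-liner (a minimal sketch, assuming the default download location used by train.py): python -c "from torchvision.datasets import MNIST; MNIST('', train=True, download=True)"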
Warning
--cpus-per-task is set according to the gpu_p1 partition hardware specifications. The number of CPU cores differs for the other partitions!
Warning
You must set the value of the highlighted line (--account) according to your project:
#!/bin/bash
#SBATCH --account=my_project
#SBATCH -C v100-32g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=10
#SBATCH --hint=nomultithread
#SBATCH --time=01:00:00
#SBATCH --job-name=opti
#SBATCH --error=optuna-%j.out
#SBATCH --output=optuna-%j.out
module purge
module load pytorch-gpu/py3/2.2.0
# Enable the standard and error outputs of Python.
export PYTHONUNBUFFERED=1
# Run your process. The command line options of the Slurm script
# are passed to the Python script ($@).
srun python optuna_bootstrap.py ${@}
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}
Note
Running multiple processes is enabled by the srun command in front of the python command. Without srun, only one process is executed!
Tips
The get_task_id function could be replaced by the idr_torch package, using its rank variable.
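A hedged sketch of such a replacement (assuming the idr_torch package is available in the loaded module and exposes a rank attribute):
import idr_torch

# Return the task id from the Slurm-derived rank provided by idr_torch.
def get_task_id() -> str:
    return str(idr_torch.rank)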