
Parallel hyperparameter optimization with Optuna

Optuna is a hyperparameter optimization (HPO) library that eases the search for optimal machine learning hyperparameter values with respect to one or more monitored training metrics. This tutorial aims to give a simple example of parallel HPO for a classic machine/deep learning training. The example can easily be adapted to your own design because the Optuna-specific code is developed in its own Python module: the training code and the Optuna code are separated. The training code can therefore be developed first and, when HPO comes into play, the Optuna module can be added. This tutorial is thus divided into two parts: the training loop, then the Optuna hyperparameter values generator.

Warning

This example does not support multi-node GPU data-distributed training (this is not a limitation of Optuna). Although this is possible, it requires more complex code.

Find the complete train script at this address and the HPO driving script at this address.

Training loop

Let's start with the classical example from PyTorch Lightning: training an autoencoder on the MNIST dataset.

# Adapted from https://www.pytorchlightning.ai/
# All rights reserved.
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl


# SETTINGS
NB_DATA_WORKERS = 2


def train() -> float:
    class LitAutoEncoder(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(28 * 28, 64),
                nn.ReLU(),
                nn.Linear(64, 3))
            self.decoder = nn.Sequential(
                nn.Linear(3, 64),
                nn.ReLU(),
                nn.Linear(64, 28 * 28))

        def forward(self, x):
            embedding = self.encoder(x)
            return embedding

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optimizer

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = F.mse_loss(x_hat, x)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, val_batch, batch_idx):
            x, y = val_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = F.mse_loss(x_hat, x)
            self.log_dict({'val_loss': loss, 'val_metric': loss})

    # data
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=32,
                              num_workers=NB_DATA_WORKERS, persistent_workers=True)
    val_loader = DataLoader(mnist_val, batch_size=32,
                            num_workers=NB_DATA_WORKERS, persistent_workers=True)

    # model
    model = LitAutoEncoder()

    # training
    trainer = pl.Trainer(accelerator='auto',
                         max_epochs=10)

    trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)
    return float(trainer.callback_metrics['val_metric'])

Command line options

Instead of hard-coding the training settings and messing up your version control system (e.g., git), we classically implement a command-line option support mechanism (e.g., --batch-size=32).

Python name-main idiom

First, the script has to implement the if __name__ == "__main__" idiom in order to parse the command-line options, for example with the argparse library, then run the training:

import traceback
import argparse

# SETTINGS
SUCCESS_CODE = 0
FAILED_CODE = 1


# Wrapper for command line invocation (without Optuna HPO); supports options (e.g., --help).
# Usage example: python train.py --epochs=10 --lr=0.00005
def main() -> None:
    training_options = parse_args()
    metric_value = train(training_options)
    print(f'> metric value of the last epoch: {metric_value}')


if __name__ == '__main__':
    main_exit_code = SUCCESS_CODE
    try:
        main()
    except Exception as e:
        print(f'> something went wrong: {str(e)}')
        traceback.print_exception(type(e), e, e.__traceback__)
        main_exit_code = FAILED_CODE
    exit(main_exit_code)

Command line options parsing

The argparse library provides a simple, easy-to-configure solution. The object returned by parse_args is an argparse.Namespace that holds the option values as attributes (e.g., training_options.batch_size).

# Command line options parser.
def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument('-b', '--batch-size', default=32, type=int, help='training batch size per GPU')
    parser.add_argument('-e', '--epochs'    , default=1, type=int, help='number of epochs')
    parser.add_argument('-l', '--loss'      , default='mse', type=str, help='loss func: mse | mae')
    parser.add_argument('-o', '--optimizer' , default='adamw', type=str, help='optimizer: sgd | adam | adamw')
    parser.add_argument('-r', '--lr'        , default=1e-4, type=float, help='learning rate')
    training_options = parser.parse_args()
    return training_options

Loss function mapping

A classic technique for dynamically executing functions specified from command-line options is function mapping. To dynamically execute the loss function, we propose to map the loss function names to the corresponding PyTorch functions:

_LOSS_MAPPING = {
    'mse': F.mse_loss,
    'mae': F.l1_loss,
}

Thus, in the model class, the loss function is executed simply by selecting the function and calling it: _LOSS_MAPPING[self.training_options.loss](x_hat, x).
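
As a standalone illustration of this mapping technique, here is a minimal sketch; the tensors below are hypothetical placeholders standing in for the reconstruction and the input batch:

import torch
from torch.nn import functional as F

_LOSS_MAPPING = {
    'mse': F.mse_loss,
    'mae': F.l1_loss,
}

# Placeholder tensors standing in for x_hat and x.
x_hat = torch.rand(4, 28 * 28)
x = torch.rand(4, 28 * 28)
loss_name = 'mae'  # e.g., the value of training_options.loss
loss = _LOSS_MAPPING[loss_name](x_hat, x)  # selects F.l1_loss and calls it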

Training optimizer factory

A classic technique for dynamically instantiating objects specified from command-line options is the factory design pattern. To dynamically instantiate the optimizer object, we propose to implement the following function:

from torch.nn import Module
from torch.optim import Optimizer, SGD, Adam, AdamW

def create_optimizer(model: Module, optimizer_name: str, lr: float) -> Optimizer:
    match optimizer_name:
        case 'sgd':
            optimizer = SGD(model.parameters(), lr=lr)
        case 'adam':
            optimizer = Adam(model.parameters(), lr=lr)
        case 'adamw':
            optimizer = AdamW(model.parameters(), lr=lr)
        case _:
            raise ValueError(f'unsupported optimizer: {optimizer_name}')
    return optimizer

Enhanced training loop

Finally, we pass the parsed options to the training loop:

import optuna  # Needed for the optional trial parameter used by the pruning support.

def train(training_options: argparse.Namespace, trial: optuna.trial.Trial = None) -> float:
    from torch import nn
    from torch.utils.data import DataLoader
    from torch.utils.data import random_split
    from torchvision.datasets import MNIST
    from torchvision import transforms
    import pytorch_lightning as pl

    class LitAutoEncoder(pl.LightningModule):
        def __init__(self, training_options: argparse.Namespace):
            super().__init__()
            self.training_options = training_options
            self.encoder = nn.Sequential(
                nn.Linear(28 * 28, 64),
                nn.ReLU(),
                nn.Linear(64, 3))
            self.decoder = nn.Sequential(
                nn.Linear(3, 64),
                nn.ReLU(),
                nn.Linear(64, 28 * 28))

        def forward(self, x):
            embedding = self.encoder(x)
            return embedding

        def configure_optimizers(self):
            optimizer = create_optimizer(self, self.training_options.optimizer, self.training_options.lr)
            return optimizer

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = _LOSS_MAPPING[self.training_options.loss](x_hat, x)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, val_batch, batch_idx):
            x, y = val_batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = _LOSS_MAPPING[self.training_options.loss](x_hat, x)
            # The metric must be computed in the same manner across trials, so that they can be compared!
            metric = _LOSS_MAPPING['mse'](x_hat, x) if self.training_options.loss != 'mse' else loss
            self.log_dict({'val_loss': loss, 'val_metric': metric})

    # data
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=training_options.batch_size,
                              num_workers=NB_DATA_WORKERS, persistent_workers=True)
    val_loader = DataLoader(mnist_val, batch_size=32,
                            num_workers=NB_DATA_WORKERS, persistent_workers=True)

    # model
    model = LitAutoEncoder(training_options)

    # training
    trainer = pl.Trainer(accelerator='auto',
                         max_epochs=training_options.epochs)

    trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)
    return float(trainer.callback_metrics['val_metric'])

Find the complete file at this address.

Optuna hyperparameter values generator

In general, HPO libraries offer two main functionalities:

  • A mechanism, called sampler, for generating hyperparameter values from sets specified by the user. A trial is a set of hyperparameter values selected by the sampler, which are then used to train a machine learning model.
  • An early stopping mechanism, called pruner, is used to stop training (and therefore a trial) that does not seem to go beyond the best metric value currently obtained.

The HPO for the previous example is implemented by creating a new module - optuna_bootstrap.py - which drives the HPO (sampling), and by slightly modifying the training loop to support Optuna's pruning functionality. In this example, the pruning algorithm is Hyperband and the sampling algorithm is the Tree-structured Parzen Estimator (TPE), which is part of the Bayesian optimization (BO) methods. This pairing, also known as BOHB, is reputed to be robust (reference).
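
The modified training loop is available in the complete train.py linked above; as an illustration only, a minimal sketch of the pruning support could look like the following, assuming the PyTorchLightningPruningCallback integration is available (depending on your Optuna version, it may require the separate optuna-integration package):

from optuna.integration import PyTorchLightningPruningCallback

# Inside train(): build the Trainer callbacks according to the optional trial argument.
callbacks = []
if trial is not None:
    # Reports 'val_metric' to Optuna after each validation epoch and raises
    # optuna.TrialPruned when the pruner decides to stop the trial early.
    callbacks.append(PyTorchLightningPruningCallback(trial, monitor='val_metric'))

trainer = pl.Trainer(accelerator='auto',
                     max_epochs=training_options.epochs,
                     callbacks=callbacks)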

As HPO is particularly time-consuming, Optuna supports parallelization: several processes can work on the same HPO study, collaborating with each other and sharing their results (concurrent access to the study log file is supported). In this tutorial, parallelization is obtained by executing the optuna_bootstrap.py script as many times as desired.

Optuna Bootstrap

Study creation

Before starting HPO, it is necessary to create the study and its log file. The create_study function, which is embedded in the optuna_bootstrap.py module, is called once and for all, before any optimization process. Running a function of a Python module from the command line requires a few adjustments, for example: python -c "from optuna_bootstrap import create_study ; create_study()" 'optuna_journal.log' 'study_name'

import sys
import optuna

# Create the Optuna HPO study:
# Usage example: python -c "from optuna_bootstrap import create_study ; create_study()" 'optuna_journal.log' 'study_name'
def create_study() -> None:
    journal_file_path = sys.argv[1]
    study_name = sys.argv[2]
    storage, sampler, pruner = create_optuna_conf(journal_file_path)
    print(f"> creating optuna journal file '{journal_file_path}'")
    optuna.study.create_study(study_name=study_name, storage=storage, load_if_exists=False, sampler=sampler,
                              direction='minimize', pruner=pruner)
    print('> Done')

Settings

Once the study is created, HPO can begin. The HPO settings are a few constant variables, defined at the beginning of the script, that are used in the create_optuna_conf function:

# SETTINGS
RANDOM_SEED = 42
SAMPLER = optuna.samplers.TPESampler
PRUNER = optuna.pruners.HyperbandPruner
MIN_NUMBER_EPOCHS = 1
MAX_NUMBER_EPOCHS = 5

# Configure the Optuna storage, sampler and pruner.
def create_optuna_conf(journal_file_path: str) -> tuple:
    storage = optuna.storages.JournalStorage(optuna.storages.JournalFileStorage(journal_file_path))
    sampler = SAMPLER(seed=RANDOM_SEED, multivariate=True)
    pruner = PRUNER(min_resource=MIN_NUMBER_EPOCHS, max_resource=MAX_NUMBER_EPOCHS)
    return storage, sampler, pruner

Hyperparameters value spaces

The hyperparameter value spaces are defined in the get_training_options function, thanks to the suggest_categorical, suggest_int and suggest_float functions. During the HPO, these same functions also fetch the hyperparameter values selected by the sampling algorithm.

# Define the space values for each hyperparameter to be optimized and fetch their value for each trial.
def get_training_options(trial: optuna.trial.Trial):
    options = dict()
    options['loss'] = trial.suggest_categorical(name='loss', choices=['mae', 'mse'])
    options['optimizer'] = trial.suggest_categorical(name='optimizer', choices=['sgd', 'adam', 'adamw'])
    options['batch_size'] = int(pow(2, trial.suggest_int(name='pow_batch_size', low=4, high=10, step=1)))
    options['epochs'] = MAX_NUMBER_EPOCHS
    options['lr'] = trial.suggest_float(name='lr', low=1e-6, high=1., log=True)
    return options

HPO objective

The objective of HPO is to find the optimal values of a set of hyperparameters with respect to a given training metric. The objective function calls the training loop, gives it the training options derived from the trial (note the translation into an argparse.Namespace object) and returns the metric value computed at the end of the training. The get_task_id function disambiguates the execution traces (print calls) when the HPO is parallelized.

import os
import argparse
import train


# Return the task id.
def get_task_id() -> str:
    if 'SLURM_PROCID' in os.environ:
        return os.environ['SLURM_PROCID']
    else:
        return 'NO_ID'


# Define the HPO objective.
def objective(trial: optuna.trial.Trial) -> float:
    training_options = get_training_options(trial)
    print(f"> task #{get_task_id()} starting trial #{trial.number} with the following settings:\n{training_options}")
    metric_value = train.train(argparse.Namespace(**training_options), trial)
    return metric_value

HPO driver

The main function drives the HPO. First, it loads the previously created study with the load_study function, then it starts the optimization by calling the optimize function. This function returns when the time is up, according to the timeout parameter, whose value is expressed in seconds. Each optimization process is executed with three command-line arguments, like this: python optuna_bootstrap.py 'optuna_journal.log' 'study_name' 3600.

import sys
import traceback
import optuna


# Run the optimization.
# Running this script several times at the same time enables the optimization to be parallelized,
# i.e., each script owns its own unique set of hyperparameter values (trial).
# Usage example: python optuna_bootstrap.py 'optuna_journal.log' 'study_name' 3600
# Command line arguments:
# - optuna journal file path
# - study name
# - study duration (in seconds)
def main() -> int:
    journal_file_path = sys.argv[1]
    study_name = sys.argv[2]
    total_optimization_time = int(sys.argv[3]) # Unit: seconds.
    storage, sampler, pruner = create_optuna_conf(journal_file_path)
    study = optuna.study.load_study(study_name=study_name, storage=storage, sampler=sampler, pruner=pruner)
    try:
        study.optimize(func=objective, timeout=total_optimization_time)
        print(f"> task #{get_task_id()} ends")
        exit_code = train.SUCCESS_CODE
    except Exception as e:
        print(f"> [ERROR] task #{get_task_id()}: {str(e)}")
        traceback.print_exception(type(e), e, e.__traceback__)
        exit_code = train.FAILED_CODE
    return exit_code


if __name__ == '__main__':
    main_exit_code = main()
    exit(main_exit_code)

Initial trial

You probably have some idea of the optimal hyperparameter values, especially after using train.py and its command-line options. To take advantage of your experience and speed up the HPO, Optuna lets you create a trial and test it first. The trial configuration consists of a Python dictionary containing the names of the hyperparameters given in the get_training_options function and their values. This dictionary is then added to the study using the enqueue_trial function.

__INITIAL_TRIAL = {
    'loss': 'mse',
    'optimizer': 'adam',
    'pow_batch_size': 6, # batch_size: 64
    'lr': 1.e-5
}


def create_study() -> None:
    journal_file_path = sys.argv[1]
    study_name = sys.argv[2]
    storage, sampler, pruner = create_optuna_conf(journal_file_path)
    print(f"> creating optuna journal file '{journal_file_path}'")
    study = optuna.study.create_study(study_name=study_name, storage=storage, load_if_exists=False, sampler=sampler,
                                      direction='minimize', pruner=pruner)
    # Set an initial trial based on best hyperparameter values already found.
    study.enqueue_trial(__INITIAL_TRIAL)
    print('> Done')

Tips

Adding hand-written trials really accelerates the HPO, thanks to the pruning algorithm.

Find the complete file at this address.

Slurm job submitters

Finally, the Slurm script launches multiple processes that execute the optuna_bootstrap.py script. The Slurm script depends on the underlying cluster. This tutorial gives you an example for Spirit[x], the IPSL's CPU clusters; HAL, the IPSL's GPU cluster; and Jean Zay, the IDRIS' GPU cluster.

Tips

Each Slurm script is executed with the same instruction which requires the same parameters expected by optuna_bootstrap.py. For example: sbatch XXX.slurm 'optuna_journal.log' 'study_name' 3600.

Note

Don't mind the AssertionError: can only test a child process raised while parallelizing the HPO. This exception comes from the MNIST data generator.

Spirit[x]

This Slurm script submits a single job on the partition zen16 (CPU) and runs two optuna_bootstrap.py processes (--ntasks-per-node=2), each process works on one CPU (--cpus-per-task=1).

#!/bin/bash
#SBATCH --job-name=optuna
#SBATCH --partition=zen16
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --error=optuna-%j.out
#SBATCH --output=optuna-%j.out

module purge
module load 'pytorch/2.1.2' # Or any other AI module.

# Enable the standard and error outputs of Python.
export PYTHONUNBUFFERED=1

# Run your process. The command line options of the Slurm script
# are passed to the Python script ($@).
srun python optuna_bootstrap.py ${@}
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}

Note

Running multiple processes is enabled by placing the srun command in front of the python command. Without the srun command, only one process is executed!

HAL

This Slurm script submits a single job on the partition batch (GPU) and runs one optuna_bootstrap.py process on one GPU (--gpus=1). Run the Slurm script again in order to execute another optuna_bootstrap.py process and achieve the parallelization of the HPO.

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name=optuna
#SBATCH --gpus=1
#SBATCH --time=1:00:00
#SBATCH --error=optuna-%j.out
#SBATCH --output=optuna-%j.out

module purge
module load 'pytorch/2.1.2' # Or any other AI module.

# Enable the standard and error outputs of Python.
export PYTHONUNBUFFERED=1

# Run your process. The command line options of the Slurm script
# are passed to the Python script ($@).
python optuna_bootstrap.py ${@}
returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}

Jean Zay

This Slurm script submits a single job on the partition gpu_p1 (GPU; -C v100-32g) and runs, on four nodes (--nodes=4), four optuna_bootstrap.py processes per node (--ntasks-per-node=4), each process working on one GPU (--gpus-per-task=1). Thus, 4 nodes of 4 V100 32 GB GPUs are entirely reserved, running an overall of 16 processes.

Tips

If Optuna is not shipped with an AI module, you can still extend the module; see this procedure.

Tips

As the Jean Zay compute nodes don't have access to the Internet, you must run train.py on a head node once, beforehand, so as to download the MNIST data.
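
Alternatively, if you only want to trigger the download without a full training run, a minimal sketch (run once from a head node, in the same working directory as train.py, since the training loop uses MNIST('', ...)) is:

from torchvision.datasets import MNIST

# Download and cache the MNIST archives so that the compute nodes can reuse them offline.
MNIST('', train=True, download=True)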

Warning

--cpus-per-task is set according to the partition gpu_p1 hardware specifications. The number of CPU cores differs for the other partitions!

Warning

You must set the value of the --account line according to your project:

#!/bin/bash

#SBATCH --account=my_project
#SBATCH -C v100-32g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=10
#SBATCH --hint=nomultithread
#SBATCH --time=01:00:00
#SBATCH --job-name=opti
#SBATCH --error=optuna-%j.out
#SBATCH --output=optuna-%j.out

module purge
module load pytorch-gpu/py3/2.2.0

# Enable the standard and error outputs of Python.
export PYTHONUNBUFFERED=1

# Run your process. The command line options of the Slurm script
# are passed to the Python script ($@).
srun python optuna_bootstrap.py ${@}

returned_code=$?
echo "> script completed with exit code ${returned_code}"
exit ${returned_code}

Note

Running multiple processes is enabled by placing the srun command in front of the python command. Without the srun command, only one process is executed!

Tips

On Jean Zay, the get_task_id function should be replaced by the idr_torch package, using its rank variable.
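
As an illustration, a minimal sketch of this substitution (assuming the idr_torch package provided by the IDRIS modules exposes a rank attribute):

import idr_torch

# Return the task id, as resolved by idr_torch from the Slurm environment.
def get_task_id() -> str:
    return str(idr_torch.rank)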