Introduction

Welcome to the documentation of GNN4PP, a tool that enables simple and quick training of Graph Neural Networks for predicting molecular properties.

With this package we aim to make the creation of machine learning models for molecular property prediction modular and their training quick and easy. In the following, some examples of the main functionalities of our package are shown.

Single Model Training

from pytorch_lightning import seed_everything
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
from pytorch_lightning.callbacks import ModelCheckpoint
import hydra

from source import LightningModels, Architecture
from source.LightningDataModule import LitDataModule
from utils.Callbacks import TestResultCallback
import utils.omegaconf_resolvers


@hydra.main(config_path="./configs", config_name="IQT_MON_RON")
def main(cfg):
    seed_everything(cfg.random_seed, workers=True)
    data_module = LitDataModule(config=cfg)
    architecture = Architecture.create_GNN(cfg, data_module)
    early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0, patience=cfg.Training.early_stopping_patience,
                                      verbose=True)
    result_callback = TestResultCallback(cfg)
    checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
    trainer = Trainer(max_epochs=cfg.Training.epochs, callbacks=[early_stopping, checkpoint, result_callback],
                      deterministic=True)
    model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                               cfg.Datamodule.Dataset.target_properties)
    trainer.fit(model, data_module)
    trainer.test(model, data_module)


if __name__ == "__main__":
    main()

The important dependencies you need to know about are PyTorch Lightning and Hydra. PyTorch Lightning is a library that provides a lot of high-level functionality for machine learning. Its most important tool is the Trainer, a class that implements much of the boilerplate code needed for training and testing. The other important classes are the LightningModule and the LightningDataModule. The LightningDataModule takes care of preparing and processing the data and provides dataloaders to the LightningModule during training. The LightningModule contains the model as well as the code that is specific to training that particular model.
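To make this division of labor concrete, here is a toy sketch (unrelated to GNN4PP's own classes) of how the three pieces fit together: the LightningDataModule hands out dataloaders, the LightningModule holds the model and its training logic, and the Trainer runs the loop.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class ToyDataModule(pl.LightningDataModule):
    # prepares the data and provides dataloaders to the Trainer
    def setup(self, stage=None):
        x, y = torch.randn(64, 8), torch.randn(64, 1)
        self.dataset = TensorDataset(x, y)

    def train_dataloader(self):
        return DataLoader(self.dataset, batch_size=16)


class ToyModel(pl.LightningModule):
    # holds the model and the code specific to training it
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# The Trainer implements the actual training loop:
# pl.Trainer(max_epochs=1).fit(ToyModel(), datamodule=ToyDataModule())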

Hydra is a package that helps manage configuration, which is very useful for keeping track of the hyperparameters used in past experiments.

Now to explain the use of our package:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
from pytorch_lightning.callbacks import ModelCheckpoint
import hydra

First we import the Trainer and some callbacks that implement additional functionality from PyTorch Lightning, as well as our config management tool Hydra.

from source import LightningModels, Architecture
from source.LightningDataModule import LitDataModule
from utils.Callbacks import TestResultCallback
import utils.omegaconf_resolvers

From our package we import LightningModels, which contains the definitions of the LightningModules, and Architecture, where the layer architecture of the model is defined. Then we import the DataModule and a callback which takes care of saving the test results after training. Finally we have to import the omegaconf_resolvers, which take care of interpreting some expressions in the config files.
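Such a resolver module registers custom OmegaConf resolvers when it is imported. The snippet below only illustrates the mechanism; the resolver name and logic are made up, and the real resolvers in utils/omegaconf_resolvers may differ.

from omegaconf import OmegaConf

# Hypothetical resolver: importing the module executes this registration, after which
# an expression like ${num_targets:${Datamodule.Dataset.target_properties}} could be
# used inside the YAML config files.
OmegaConf.register_new_resolver("num_targets", lambda props: len(props), replace=True)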

@hydra.main(config_path="./configs", config_name="IQT_MON_RON")

In this line we define where the configs are stored and which config is passed to the main function when we run this script. The config provides the default values for running the script, but these can also be overridden from the command line like this:

>python train_single_model.py Training.epochs=10
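Several values can also be overridden at once, using any keys from the config file shown further below:

>python train_single_model.py Training.epochs=10 Model.dim=128 batch_size=32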

Now let’s get to the body of our main function where we instantiate our objects and perform the training.

seed_everything(cfg.random_seed, workers=True)
data_module = LitDataModule(config=cfg)
architecture = Architecture.create_GNN(cfg, data_module)

First we set the random seed for reproducibility and instantiate our DataModule and model architecture.

early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0, patience=cfg.Training.early_stopping_patience,
                                          verbose=True)
result_callback = TestResultCallback(cfg)
checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
trainer = Trainer(max_epochs=cfg.Training.epochs, callbacks=[early_stopping, checkpoint, result_callback],
                  deterministic=True)

Next, the callbacks for early stopping, result saving and model checkpointing are instantiated, and then the trainer is created with these callbacks passed to it. Callbacks implement functionality that is universal rather than specific to a single model and can therefore be reused across models. This is why their code does not live in the respective LightningModule; instead, they are passed to the trainer.
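Because callbacks only rely on the generic Trainer hooks, writing a new one is straightforward. The following toy example (it is not GNN4PP's TestResultCallback, which takes care of saving the test results) shows the idea:

import pytorch_lightning as pl


class PrintTestMetrics(pl.Callback):
    # a reusable callback: the on_test_end hook runs for any LightningModule,
    # so the logic does not have to live inside a specific model class
    def on_test_end(self, trainer, pl_module):
        # callback_metrics holds the metrics logged during testing
        for name, value in trainer.callback_metrics.items():
            print(f"{name}: {float(value):.4f}")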

model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                               cfg.Datamodule.Dataset.target_properties)
trainer.fit(model, data_module)
trainer.test(model, data_module)

Finally we instantiate the LightningModule and then call trainer.fit() and trainer.test() on the model with its corresponding data module. This is all the code we need to train a model and save its results.

[Image: example of the save directory created by a training run]

By default the resulting output is saved in a directory relative to where the code is located: CodeDirectory/Training/NameOfDataset/Date/Time

In .hydra the configs used for this run are saved. In lightning_logs the logs of the training process are saved, which can be inspected using TensorBoard:

>tensorboard --logdir CodeDirectory/Training/NameOfDataset/Date/Time/lightning_logs

The checkpoints of the model are saved there as well. In single_model parity plots and the model scores are saved. Results.xlsx contains the predicted and true target values for all molecules in the test set.
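The saved checkpoints can be reloaded later via standard PyTorch Lightning functionality. The sketch below assumes an illustrative checkpoint path and that the LightningModule saves its hyperparameters; if it does not, the constructor arguments of LitRegressionModel have to be passed to load_from_checkpoint explicitly.

from source import LightningModels

# illustrative path; the actual file name depends on the run
ckpt = "Training/NameOfDataset/Date/Time/lightning_logs/version_0/checkpoints/epoch=10-step=500.ckpt"

model = LightningModels.LitRegressionModel.load_from_checkpoint(ckpt)
model.eval()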

Finally let’s have a look at the config file:

hydra:
  run:
    dir: ./Training/${Datamodule.Dataset.target_data_dir}/${now:%Y-%m-%d}/${now:%H-%M-%S}


# General
random_seed: 1234
classification: False
batch_size: 16
use_gpu: False

Training:
  # Training Defaults
  epochs: 300
  early_stopping_patience: 30


Datamodule:
  # Data Module Defaults
  data_scaling: "standard"
  ext_test: True
  val_percent: 0.15
  test_percent: 0.15
  Dataset:
    # Dataset Defaults
    target_data_dir: ""
    target_properties: []
    save_dir_id: ""

Model:
  # Model Defaults
  pooling: "add"
  dim: 64
  conv: 2
  dropout_rate: 0.1
  lrate: 0.001
  lratefactor: 0.8
  lrpatience: 3
  multi_output_MLP: False

The dir property determines where the output is saved. The entries in the General section are variables which are needed in multiple places or are not tied to a specific object. This includes the random seed for reproducibility, whether the task is regression or classification, the batch size and whether GPUs should be used for training.

In the Training section, variables relating to the trainer and its callbacks are defined, e.g. the number of epochs or the patience for early stopping.

Next, in the Datamodule section, the variables for the LightningDataModule and the underlying torch dataset are defined, e.g. whether an external test set is used, how large the validation and test splits should be, etc.
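To make the data_scaling option concrete, the following small sketch (not GNN4PP's actual code) shows what "standard" scaling of the target values means; the training-set mean and std computed here correspond to the data_module.mean and data_module.std that are passed to the LightningModule above.

import torch

# targets are standardized with the training-set statistics and mapped back for reporting
y_train = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
mean, std = y_train.mean(dim=0), y_train.std(dim=0)

y_scaled = (y_train - mean) / std   # what the model is trained on
y_restored = y_scaled * std + mean  # how predictions are mapped back to the original scale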

Finally, in the Model section, variables related to the LightningModule and the architecture are defined. LightningModule-specific variables include those related to the optimizer, such as the learning rate. Examples of architecture-specific variables are the dimension of the hidden state (dim) or the number of convolution layers in the model (conv).
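As an illustration of how these optimizer settings are typically consumed inside a LightningModule, here is a sketch; the actual LitRegressionModel may differ, e.g. in the choice of optimizer or scheduler, but lratefactor and lrpatience map naturally onto a ReduceLROnPlateau scheduler.

import torch
import pytorch_lightning as pl


class ExampleRegressor(pl.LightningModule):
    # sketch only, not GNN4PP's LitRegressionModel
    def __init__(self, cfg, architecture):
        super().__init__()
        self.cfg = cfg
        self.model = architecture

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.cfg.Model.lrate)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer,
            factor=self.cfg.Model.lratefactor,   # shrink the learning rate by this factor
            patience=self.cfg.Model.lrpatience,  # epochs without val_loss improvement
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "monitor": "val_loss"},
        }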

K-Fold Cross Validation

K-Fold cross-validation works very similarly to normal training. However, instead of the normal PyTorch Lightning Trainer you have to use a custom trainer for which we implemented K-Fold cross-validation training.

import pytorch_lightning as pl
from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
from pytorch_lightning.callbacks import ModelCheckpoint
import hydra

from source import LightningModels, Architecture
from source.LightningDataModule import LitKFoldDataModule
from utils.KFoldLoop import KFoldTrainer
import utils.omegaconf_resolvers


@hydra.main(config_path="./configs", config_name="IQT_MON_RON")
def main(cfg):
    pl.seed_everything(0, workers=True)
    data_module = LitKFoldDataModule(config=cfg)
    architecture = Architecture.create_GNN(cfg)
    early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0, patience=cfg.Training.early_stopping_patience,
                                      verbose=True)
    checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
    trainer = KFoldTrainer(k_fold_training=True, num_folds=cfg.Training.num_folds, max_epochs=cfg.Training.epochs,
                           callbacks=[early_stopping, checkpoint], deterministic=True, num_sanity_val_steps=0)
    model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                              cfg.Datamodule.Dataset.target_properties)
    trainer.fit(model, data_module)


if __name__ == "__main__":
    main()

As you can see, the code looks very similar; the notable differences are:

  1. The Trainer is replaced by a custom KFoldTrainer

  2. The LitDataModule is replaced by LitKFoldDataModule

  3. We do not pass a result callback and do not call trainer.test(...); testing is performed automatically at the end of training.
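The number of folds is read from the config as cfg.Training.num_folds, so, provided your config defines this key, it can be overridden from the command line just like before; the script name below is only illustrative:

>python train_kfold_model.py Training.num_folds=5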

Ensemble Training

To train an ensemble of models you can use the LitEnsemble class. A simple example training an ensemble looks like this:

import pytorch_lightning as pl
import hydra

from source import LightningModels, Ensemble, Architecture, LightningDataModule
import utils.omegaconf_resolvers


@hydra.main(config_path="./configs", config_name="solubility")
def test_ensemble(cfg):
    pl.seed_everything(cfg.random_seed, workers=True)
    data_module = LightningDataModule.LitDataModule(cfg)
    # create models
    models = {}
    for i in range(cfg.Training.num_models):
        temp_architecture = Architecture.create_GNN(cfg, datamodule=data_module)
        # every model gets newly shuffled data
        temp_datamodule = LightningDataModule.LitDataModule(cfg)
        temp_model = LightningModels.LitRegressionModel(model=temp_architecture, model_id=i, mean=temp_datamodule.mean,
                                                        std=temp_datamodule.std,
                                                        target_columns=cfg.Datamodule.Dataset.target_properties,
                                                        batch_size=cfg.Model.batch_size)
        temp_params = {
            "monitor": "val_loss",
            "min_delta": 0,
            "patience": cfg.Training.early_stopping_patience,
            "max_epochs": cfg.Training.epochs,
            "datamodule": temp_datamodule,

            # can be used to provide already trained models
            "best_model_path": None
        }
        models[temp_model] = temp_params

    ensemble = Ensemble.LitEnsemble(models=models, params=cfg)
    ensemble.train()


if __name__ == '__main__':
    test_ensemble()

It is again very similar to the single-model example: the modules used are the same, but we replace the trainer with a LitEnsemble object which takes care of the training and testing. Additionally, instead of passing a single DataModule and Module, we create a dictionary which contains a Module, a DataModule and parameters for each model that should be trained. We then pass this dictionary to the LitEnsemble class and call train. Similar to K-Fold cross-validation, trainer.test(...) does not have to be called explicitly; testing is performed automatically.