Introduction
============

Welcome to the documentation of GNN4PP, a tool that enables simple and quick training of Graph Neural Networks to predict molecular properties. With this package we aim to make the creation of machine learning models for molecular property prediction modular and their training quick and easy. In the following, some examples of the main functionalities of our package are shown.

Single Model Training
=====================

::

    from pytorch_lightning import seed_everything
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
    from pytorch_lightning.callbacks import ModelCheckpoint
    import hydra

    from source import LightningModels, Architecture
    from source.LightningDataModule import LitDataModule
    from utils.Callbacks import TestResultCallback
    import utils.omegaconf_resolvers


    @hydra.main(config_path="./configs", config_name="IQT_MON_RON")
    def main(cfg):
        seed_everything(cfg.random_seed, workers=True)

        data_module = LitDataModule(config=cfg)
        architecture = Architecture.create_GNN(cfg, data_module)

        early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0,
                                          patience=cfg.Training.early_stopping_patience, verbose=True)
        result_callback = TestResultCallback(cfg)
        checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
        trainer = Trainer(max_epochs=cfg.Training.epochs,
                          callbacks=[early_stopping, checkpoint, result_callback],
                          deterministic=True)

        model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                                   cfg.Datamodule.Dataset.target_properties)

        trainer.fit(model, data_module)
        trainer.test(model, data_module)


    if __name__ == "__main__":
        main()

The important dependencies which you need to know about are `PyTorch Lightning`_ and Hydra_. `PyTorch Lightning`_ is a library that provides a lot of high-level functionality for machine learning. Its most important tool is the **Trainer**; this class implements much of the boilerplate code needed for training and testing. The other important classes are the **LightningModule** and the **LightningDataModule**. The LightningDataModule takes care of the preparation and processing of the data and provides dataloaders to the LightningModule during training. The LightningModule contains the model as well as the code that is specific to training that particular model. Hydra_ is a package that helps manage configuration, which is very useful for keeping track of the hyperparameters used in past experiments.

Now to explain the use of our package:

::

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
    from pytorch_lightning.callbacks import ModelCheckpoint
    import hydra

First we import the Trainer and some callbacks which implement additional functionality from `PyTorch Lightning`_, as well as our config management tool Hydra_.

::

    from source import LightningModels, Architecture
    from source.LightningDataModule import LitDataModule
    from utils.Callbacks import TestResultCallback
    import utils.omegaconf_resolvers

From our package we import the LightningModels, which contain the definitions of the LightningModules, and the Architecture, where the layer architecture of the model is defined. Then we import the DataModule and a callback which takes care of saving the test results after training. Finally, we have to import the :code:`omegaconf_resolvers`, which take care of interpreting some definitions in the config files.
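For context, such resolvers are small functions registered with OmegaConf that are evaluated whenever a config value contains a ``${name:...}`` expression. The snippet below is only a minimal sketch of how a resolver could be registered; the resolver name ``join`` and its behaviour are hypothetical examples and not necessarily what :code:`utils.omegaconf_resolvers` actually defines.

::

    # Minimal sketch of a custom OmegaConf resolver (hypothetical example,
    # not necessarily what utils.omegaconf_resolvers registers).
    from omegaconf import OmegaConf

    # After registration, a config value such as "${join:MON,RON}" would
    # resolve to the string "MON_RON" when the config is accessed.
    OmegaConf.register_new_resolver("join", lambda *parts: "_".join(parts))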
::

    @hydra.main(config_path="./configs", config_name="IQT_MON_RON")

In this line we define where the configs are saved and which config is passed to the main function when we run this script. The config provides the default values for running the script, but these can also be overwritten from the command line like this:

::

    >python train_single_model.py Training.epochs=10

Now let's get to the body of our main function, where we instantiate our objects and perform the training.

::

    seed_everything(cfg.random_seed, workers=True)

    data_module = LitDataModule(config=cfg)
    architecture = Architecture.create_GNN(cfg, data_module)

First we set the random seed for reproducibility and instantiate our DataModule and model architecture.

::

    early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0,
                                      patience=cfg.Training.early_stopping_patience, verbose=True)
    result_callback = TestResultCallback(cfg)
    checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
    trainer = Trainer(max_epochs=cfg.Training.epochs,
                      callbacks=[early_stopping, checkpoint, result_callback],
                      deterministic=True)

Next, the callbacks for early stopping, result saving and model checkpointing are instantiated. Then the trainer is instantiated and the callbacks are passed to it. Callbacks are classes that implement functionality which is universal rather than specific to a single model and can thus be reused across models. This is why we do not put their code in the respective LightningModule but pass them to the trainer instead.

::

    model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                               cfg.Datamodule.Dataset.target_properties)

    trainer.fit(model, data_module)
    trainer.test(model, data_module)

Finally, we instantiate the LightningModule and then call :code:`trainer.fit()` and :code:`trainer.test()` on the model with its corresponding data module. This is all the code we need to train a model and save its results.

.. image:: save_directory.PNG

By default the resulting output is saved in a directory relative to where the code is located: :file:`CodeDirectory/Training/NameOfDataset/Date/Time`. In :file:`.hydra` the configs used for this model run are saved. In :file:`lightning_logs` the logs of the training process are saved, which can be inspected using TensorBoard: :code:`>tensorboard --logdir CodeDirectory/Training/NameOfDataset/Date/Time/lightning_logs`. Additionally, the checkpoints of the model are saved here. In :file:`single_model` parity plots and the model scores are saved. :file:`Results.xlsx` contains the predicted and true target values for all molecules in the test set.
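The checkpoints stored in :file:`lightning_logs` can be reloaded later, for example to make predictions without retraining. The following is only a rough sketch using PyTorch Lightning's standard :code:`load_from_checkpoint` classmethod; the checkpoint path is purely illustrative, and reloading this way assumes that :code:`LitRegressionModel` stores its constructor arguments as hyperparameters, which may or may not match the actual implementation.

::

    # Hypothetical sketch: reload a trained model from a saved checkpoint.
    # The path below only illustrates the directory layout described above.
    from source import LightningModels

    ckpt_path = "Training/NameOfDataset/Date/Time/lightning_logs/version_0/checkpoints/best.ckpt"

    # load_from_checkpoint is a standard PyTorch Lightning classmethod; it only
    # works out of the box if the module saved its hyperparameters on creation.
    model = LightningModels.LitRegressionModel.load_from_checkpoint(ckpt_path)
    model.eval()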
Finally, let's have a look at the config file:

::

    hydra:
      run:
        dir: ./Training/${Datamodule.Dataset.target_data_dir}/${now:%Y-%m-%d}/${now:%H-%M-%S}

    # General
    random_seed: 1234
    classification: False
    batch_size: 16
    use_gpu: False

    Training:
      # Training Defaults
      epochs: 300
      early_stopping_patience: 30

    Datamodule:
      # Data Module Defaults
      data_scaling: "standard"
      ext_test: True
      val_percent: 0.15
      test_percent: 0.15

      Dataset:
        # Dataset Defaults
        target_data_dir: ""
        target_properties: []
        save_dir_id: ""

    Model:
      # Model Defaults
      pooling: "add"
      dim: 64
      conv: 2
      dropout_rate: 0.1
      lrate: 0.001
      lratefactor: 0.8
      lrpatience: 3
      multi_output_MLP: False

The :code:`dir` property determines where the output is saved. The properties in the *General* section are variables which are needed in multiple places or are not related to a specific object: the random seed for reproducibility, whether the task is regression or classification, the batch size, and whether GPUs should be used for training. In the *Training* section, variables which relate to the trainer and its callbacks are defined, e.g. the number of epochs or the patience for early stopping. Next, in the *Datamodule* section, the variables for the LightningDataModule and the underlying torch dataset are defined, e.g. whether an external test set is used, how large the validation set should be, etc. Finally, in the *Model* section, variables related to the LightningModule and the architecture are defined. LightningModule-specific variables include those related to the optimizer, such as the learning rate. Examples of architecture-specific variables are the dimension of the hidden state (*dim*) and the number of convolution layers in the model (*conv*).

K-Fold Cross Validation
=======================

K-Fold cross validation works very similarly to normal training. However, instead of the normal `PyTorch Lightning`_ Trainer you have to use a custom Trainer for which we implemented K-Fold cross-validation training.

::

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
    from pytorch_lightning.callbacks import ModelCheckpoint
    import hydra

    from source import LightningModels, Architecture
    from source.LightningDataModule import LitKFoldDataModule
    from utils.KFoldLoop import KFoldTrainer
    import utils.omegaconf_resolvers


    @hydra.main(config_path="./configs", config_name="IQT_MON_RON")
    def main(cfg):
        pl.seed_everything(0, workers=True)

        data_module = LitKFoldDataModule(config=cfg)
        architecture = Architecture.create_GNN(cfg)

        early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0,
                                          patience=cfg.Training.early_stopping_patience, verbose=True)
        checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
        trainer = KFoldTrainer(k_fold_training=True, num_folds=cfg.Training.num_folds,
                               max_epochs=cfg.Training.epochs,
                               callbacks=[early_stopping, checkpoint],
                               deterministic=True, num_sanity_val_steps=0)

        model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                                   cfg.Datamodule.Dataset.target_properties)

        trainer.fit(model, data_module)


    if __name__ == "__main__":
        main()

As you can see, the code looks very similar; the notable differences are:

#. The Trainer is replaced by a custom KFoldTrainer.
#. The LitDataModule is replaced by LitKFoldDataModule (the fold splitting this implies is sketched below).
#. We do not pass a result callback and do not call :code:`trainer.test(...)`; the testing is performed automatically at the end of training.
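To make the fold mechanics more concrete, here is a small, self-contained sketch of how k-fold splitting works in principle. It uses scikit-learn's :code:`KFold` purely as an illustration of the concept; the actual splitting inside :code:`LitKFoldDataModule` and :code:`KFoldTrainer` may be implemented differently.

::

    # Conceptual illustration of k-fold splitting (not the package's implementation).
    import numpy as np
    from sklearn.model_selection import KFold

    samples = np.arange(10)          # stand-in for a dataset of 10 molecules
    kfold = KFold(n_splits=5, shuffle=True, random_state=1234)

    for fold, (train_idx, val_idx) in enumerate(kfold.split(samples)):
        # In each fold, one fifth of the data is held out for validation/testing
        # and a fresh model would be trained on the remaining samples.
        print(f"fold {fold}: train={train_idx}, val={val_idx}")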
Ensemble Training
=================

To train an ensemble of models you can use the LitEnsemble class. A simple example training an ensemble looks like this:

::

    import pytorch_lightning as pl
    import hydra

    from source import LightningModels, Ensemble, Architecture, LightningDataModule
    import utils.omegaconf_resolvers


    @hydra.main(config_path="./configs", config_name="solubility")
    def test_ensemble(cfg):
        pl.seed_everything(cfg.random_seed, workers=True)
        data_module = LightningDataModule.LitDataModule(cfg)

        # create models
        models = {}
        for i in range(cfg.Training.num_models):
            temp_architecture = Architecture.create_GNN(cfg, datamodule=data_module)
            # every model gets newly shuffled data
            temp_datamodule = LightningDataModule.LitDataModule(cfg)
            temp_model = LightningModels.LitRegressionModel(model=temp_architecture,
                                                            model_id=i,
                                                            mean=temp_datamodule.mean,
                                                            std=temp_datamodule.std,
                                                            target_columns=cfg.Datamodule.Dataset.target_properties,
                                                            batch_size=cfg.Model.batch_size)
            temp_params = {
                "monitor": "val_loss",
                "min_delta": 0,
                "patience": cfg.Training.early_stopping_patience,
                "max_epochs": cfg.Training.epochs,
                "datamodule": temp_datamodule,
                # can be used to provide trained models
                "best_model_path": None
            }
            models[temp_model] = temp_params

        ensemble = Ensemble.LitEnsemble(models=models, params=cfg)
        ensemble.train()


    if __name__ == '__main__':
        test_ensemble()

It is again very similar to the single model example. The modules used are the same, but we replace the trainer with a LitEnsemble object which takes care of the training and testing. Additionally, instead of passing the DataModule and Module a single time, we create a dictionary which contains a Module, a DataModule and parameters for each model that should be trained. Then we pass this dictionary to the LitEnsemble class and call :code:`train()`. Similar to K-Fold cross-validation, :code:`trainer.test(...)` does not have to be called explicitly but is called automatically.

.. _Hydra: https://hydra.cc/docs/intro/
.. _PyTorch Lightning: https://pytorch-lightning.readthedocs.io/en/stable/