Introduction
============

Welcome to the documentation of GNN4PP, a tool that enables simple and quick training of Graph Neural Networks to predict molecular properties. With this package we aim to make the creation of machine learning models for molecular property prediction modular and their training quick and easy. In the following, some examples of the main functionalities of our package are shown.

Single Model Training
=====================

::

    from pytorch_lightning import seed_everything
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
    from pytorch_lightning.callbacks import ModelCheckpoint
    import hydra

    from source import LightningModels, Architecture
    from source.LightningDataModule import LitDataModule
    from utils.Callbacks import TestResultCallback
    import utils.omegaconf_resolvers


    @hydra.main(config_path="./configs", config_name="IQT_MON_RON")
    def main(cfg):
        seed_everything(cfg.random_seed, workers=True)

        data_module = LitDataModule(config=cfg)
        architecture = Architecture.create_GNN(cfg, data_module)

        early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0,
                                          patience=cfg.Training.early_stopping_patience, verbose=True)
        result_callback = TestResultCallback(cfg)
        checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
        trainer = Trainer(max_epochs=cfg.Training.epochs,
                          callbacks=[early_stopping, checkpoint, result_callback],
                          deterministic=True)

        model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                                   cfg.Datamodule.Dataset.target_properties)

        trainer.fit(model, data_module)
        trainer.test(model, data_module)


    if __name__ == "__main__":
        main()

The important dependencies which you need to know about are `PyTorch Lightning`_ and Hydra_. `PyTorch Lightning`_ is a library that provides a lot of high-level functionality for machine learning. Its most important tool is the **Trainer**; this class implements much of the boilerplate code needed for training and testing. The other important classes are the **LightningModule** and the **LightningDataModule**. The LightningDataModule takes care of the preparation and processing of the data and provides dataloaders to the LightningModule during training. The LightningModule contains the model as well as the code that is specific to training that particular model. Hydra_ is a package that helps manage configuration, which is very useful for keeping track of the hyperparameters used in past experiments.

Now to explain the use of our package:

::

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
    from pytorch_lightning.callbacks import ModelCheckpoint
    import hydra

First we import the Trainer and some callbacks which implement additional functionality from `PyTorch Lightning`_, as well as our config management tool Hydra_.

::

    from source import LightningModels, Architecture
    from source.LightningDataModule import LitDataModule
    from utils.Callbacks import TestResultCallback
    import utils.omegaconf_resolvers

From our package we import the LightningModels, which contain the definitions of the LightningModules, and the Architecture, where the layer architecture of the model is defined. Then we import the DataModule and a callback which takes care of saving the test results after training. Finally, we have to import the :code:`omegaconf_resolvers`, which take care of interpreting some definitions in the config files.
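For context, such resolvers are small functions registered with OmegaConf that are evaluated whenever a config value contains a ``${name:...}`` expression. The snippet below is only a minimal sketch of how a resolver could be registered; the resolver name ``join`` and its behaviour are hypothetical examples and not necessarily what :code:`utils.omegaconf_resolvers` actually defines.

::

    # Minimal sketch of a custom OmegaConf resolver (hypothetical example,
    # not necessarily what utils.omegaconf_resolvers registers).
    from omegaconf import OmegaConf

    # After registration, a config value such as "${join:MON,RON}" would
    # resolve to the string "MON_RON" when the config is accessed.
    OmegaConf.register_new_resolver("join", lambda *parts: "_".join(parts))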
::

    @hydra.main(config_path="./configs", config_name="IQT_MON_RON")

In this line we define where the configs are saved and which config is passed to the main function when we run this script. The config provides the default values for running the script, but these can also be overwritten from the command line like this:

::

    >python train_single_model.py Training.epochs=10

Now let's get to the body of our main function, where we instantiate our objects and perform the training.

::

    seed_everything(cfg.random_seed, workers=True)

    data_module = LitDataModule(config=cfg)
    architecture = Architecture.create_GNN(cfg, data_module)

First we set the random seed for reproducibility and instantiate our DataModule and model architecture.

::

    early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0,
                                      patience=cfg.Training.early_stopping_patience, verbose=True)
    result_callback = TestResultCallback(cfg)
    checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
    trainer = Trainer(max_epochs=cfg.Training.epochs,
                      callbacks=[early_stopping, checkpoint, result_callback],
                      deterministic=True)

Next, the callbacks for early stopping, result saving and model checkpointing are instantiated. Then the trainer is instantiated and the callbacks are passed to it. Callbacks are classes that implement functionality which is universal rather than specific to a single model and can thus be reused across models. This is why we do not put their code in the respective LightningModule but pass them to the trainer instead.

::

    model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                               cfg.Datamodule.Dataset.target_properties)

    trainer.fit(model, data_module)
    trainer.test(model, data_module)

Finally, we instantiate the LightningModule and then call :code:`trainer.fit()` and :code:`trainer.test()` on the model with its corresponding data module. This is all the code we need to train a model and save its results.

.. image:: save_directory.PNG

By default the resulting output is saved in a directory relative to where the code is located: :file:`CodeDirectory/Training/NameOfDataset/Date/Time`. In :file:`.hydra` the configs used for this model run are saved. In :file:`lightning_logs` the logs of the training process are saved, which can be inspected using TensorBoard: :code:`>tensorboard --logdir CodeDirectory/Training/NameOfDataset/Date/Time/lightning_logs`. Additionally, the checkpoints of the model are saved here. In :file:`single_model` parity plots and the model scores are saved. :file:`Results.xlsx` contains the predicted and true target values for all molecules in the test set.
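The checkpoints stored in :file:`lightning_logs` can be reloaded later, for example to make predictions without retraining. The following is only a rough sketch using PyTorch Lightning's standard :code:`load_from_checkpoint` classmethod; the checkpoint path is purely illustrative, and reloading this way assumes that :code:`LitRegressionModel` stores its constructor arguments as hyperparameters, which may or may not match the actual implementation.

::

    # Hypothetical sketch: reload a trained model from a saved checkpoint.
    # The path below only illustrates the directory layout described above.
    from source import LightningModels

    ckpt_path = "Training/NameOfDataset/Date/Time/lightning_logs/version_0/checkpoints/best.ckpt"

    # load_from_checkpoint is a standard PyTorch Lightning classmethod; it only
    # works out of the box if the module saved its hyperparameters on creation.
    model = LightningModels.LitRegressionModel.load_from_checkpoint(ckpt_path)
    model.eval()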
Finally, let's have a look at the config file:

::

    hydra:
      run:
        dir: ./Training/${Datamodule.Dataset.target_data_dir}/${now:%Y-%m-%d}/${now:%H-%M-%S}

    # General
    random_seed: 1234
    classification: False
    batch_size: 16
    use_gpu: False

    Training:
      # Training Defaults
      epochs: 300
      early_stopping_patience: 30

    Datamodule:
      # Data Module Defaults
      data_scaling: "standard"
      ext_test: True
      val_percent: 0.15
      test_percent: 0.15

      Dataset:
        # Dataset Defaults
        target_data_dir: ""
        target_properties: []
        save_dir_id: ""

    Model:
      # Model Defaults
      pooling: "add"
      dim: 64
      conv: 2
      dropout_rate: 0.1
      lrate: 0.001
      lratefactor: 0.8
      lrpatience: 3
      multi_output_MLP: False

The :code:`dir` property determines where the output is saved. The properties in the *General* section are variables which are needed in multiple places or are not related to a specific object: the random seed for reproducibility, whether the task is regression or classification, the batch size, and whether GPUs should be used for training. In the *Training* section, variables which relate to the trainer and its callbacks are defined, e.g. the number of epochs or the patience for early stopping. Next, in the *Datamodule* section, the variables for the LightningDataModule and the underlying torch dataset are defined, e.g. whether an external test set is used, how large the validation set should be, etc. Finally, in the *Model* section, variables related to the LightningModule and the architecture are defined. LightningModule-specific variables include those related to the optimizer, such as the learning rate. Examples of architecture-specific variables are the dimension of the hidden state (*dim*) and the number of convolution layers in the model (*conv*).

K-Fold Cross Validation
=======================

K-Fold cross validation works very similarly to normal training. However, instead of the normal `PyTorch Lightning`_ Trainer you have to use a custom Trainer for which we implemented K-Fold cross-validation training.

::

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping as LitEarlyStopping
    from pytorch_lightning.callbacks import ModelCheckpoint
    import hydra

    from source import LightningModels, Architecture
    from source.LightningDataModule import LitKFoldDataModule
    from utils.KFoldLoop import KFoldTrainer
    import utils.omegaconf_resolvers


    @hydra.main(config_path="./configs", config_name="IQT_MON_RON")
    def main(cfg):
        pl.seed_everything(0, workers=True)

        data_module = LitKFoldDataModule(config=cfg)
        architecture = Architecture.create_GNN(cfg)

        early_stopping = LitEarlyStopping(monitor="val_loss", min_delta=0,
                                          patience=cfg.Training.early_stopping_patience, verbose=True)
        checkpoint = ModelCheckpoint(monitor="val_loss", verbose=True)
        trainer = KFoldTrainer(k_fold_training=True, num_folds=cfg.Training.num_folds,
                               max_epochs=cfg.Training.epochs,
                               callbacks=[early_stopping, checkpoint],
                               deterministic=True, num_sanity_val_steps=0)

        model = LightningModels.LitRegressionModel(architecture, 0, data_module.mean, data_module.std,
                                                   cfg.Datamodule.Dataset.target_properties)

        trainer.fit(model, data_module)


    if __name__ == "__main__":
        main()

As you can see, the code looks very similar; the notable differences are:

#. The Trainer is replaced by a custom KFoldTrainer.
#. The LitDataModule is replaced by LitKFoldDataModule (the fold splitting this implies is sketched below).
#. We do not pass a result callback and do not call :code:`trainer.test(...)`; the testing is performed automatically at the end of training.
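To make the fold mechanics more concrete, here is a small, self-contained sketch of how k-fold splitting works in principle. It uses scikit-learn's :code:`KFold` purely as an illustration of the concept; the actual splitting inside :code:`LitKFoldDataModule` and :code:`KFoldTrainer` may be implemented differently.

::

    # Conceptual illustration of k-fold splitting (not the package's implementation).
    import numpy as np
    from sklearn.model_selection import KFold

    samples = np.arange(10)          # stand-in for a dataset of 10 molecules
    kfold = KFold(n_splits=5, shuffle=True, random_state=1234)

    for fold, (train_idx, val_idx) in enumerate(kfold.split(samples)):
        # In each fold, one fifth of the data is held out for validation/testing
        # and a fresh model would be trained on the remaining samples.
        print(f"fold {fold}: train={train_idx}, val={val_idx}")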
Ensemble Training
=================

To train an ensemble of models you can use the LitEnsemble class. A simple example training an ensemble looks like this:

::

    import pytorch_lightning as pl
    import hydra

    from source import LightningModels, Ensemble, Architecture, LightningDataModule
    import utils.omegaconf_resolvers


    @hydra.main(config_path="./configs", config_name="solubility")
    def test_ensemble(cfg):
        pl.seed_everything(cfg.random_seed, workers=True)
        data_module = LightningDataModule.LitDataModule(cfg)

        # create models
        models = {}
        for i in range(cfg.Training.num_models):
            temp_architecture = Architecture.create_GNN(cfg, datamodule=data_module)
            # every model gets newly shuffled data
            temp_datamodule = LightningDataModule.LitDataModule(cfg)
            temp_model = LightningModels.LitRegressionModel(model=temp_architecture,
                                                            model_id=i,
                                                            mean=temp_datamodule.mean,
                                                            std=temp_datamodule.std,
                                                            target_columns=cfg.Datamodule.Dataset.target_properties,
                                                            batch_size=cfg.Model.batch_size)
            temp_params = {
                "monitor": "val_loss",
                "min_delta": 0,
                "patience": cfg.Training.early_stopping_patience,
                "max_epochs": cfg.Training.epochs,
                "datamodule": temp_datamodule,
                # can be used to provide trained models
                "best_model_path": None
            }
            models[temp_model] = temp_params

        ensemble = Ensemble.LitEnsemble(models=models, params=cfg)
        ensemble.train()


    if __name__ == '__main__':
        test_ensemble()

It is again very similar to the single model example. The modules used are the same, but we replace the trainer with a LitEnsemble object which takes care of the training and testing. Additionally, instead of passing the DataModule and Module a single time, we create a dictionary which contains a Module, a DataModule and parameters for each model that should be trained. Then we pass this dictionary to the LitEnsemble class and call :code:`train()`. Similar to K-Fold cross-validation, :code:`trainer.test(...)` does not have to be called explicitly but is called automatically.

.. _Hydra: https://hydra.cc/docs/intro/
.. _PyTorch Lightning: https://pytorch-lightning.readthedocs.io/en/stable/