molprop.pyg_molgraph.PyGMolgraphDataset

class molprop.pyg_molgraph.PyGMolgraphDataset(root, transform=None, pre_filter=None, args=None, mol_features=None, without_target=False)

Bases: InMemoryDataset

PyG dataset with attributed molecular graphs.

Parameters:

root – Root directory where the dataset should be saved.

static collate(data_list: Sequence[BaseData]) Tuple[BaseData, Dict[str, Tensor] | None]

Collates a list of Data or HeteroData objects to the internal storage format of InMemoryDataset.

copy(idx: slice | Tensor | ndarray | Sequence | None = None) InMemoryDataset

Performs a deep-copy of the dataset. If idx is not given, will clone the full dataset. Otherwise, will only clone a subset of the dataset from indices idx. Indices can be slices, lists, tuples, and a torch.Tensor or np.ndarray of type long or bool.

cpu(*args: str) InMemoryDataset

Moves the dataset to CPU memory.

cuda(device: int | str | None = None) InMemoryDataset

Moves the dataset to CUDA memory.

download() None

Downloads the dataset to the self.raw_dir folder.

get(idx: int) BaseData

Gets the data object at index idx.

get_summary() Any

Collects summary statistics for the dataset.

property has_download: bool

Checks whether the dataset defines a download() method.

property has_process: bool

Checks whether the dataset defines a process() method.

index_select(idx: slice | Tensor | ndarray | Sequence) Dataset

Creates a subset of the dataset from specified indices idx. Indices idx can be a slicing object, e.g., [2:5], a list, a tuple, or a torch.Tensor or np.ndarray of type long or bool.
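The accepted index forms can be illustrated with a plain-Python analogue (a hypothetical helper, not part of the API) that resolves each form to a list of positions:

```python
def resolve_indices(idx, n):
    """Resolve the index forms accepted by index_select against a dataset of length n."""
    if isinstance(idx, slice):
        return list(range(n))[idx]                        # slicing object, e.g. [2:5]
    if len(idx) > 0 and all(isinstance(i, bool) for i in idx):
        return [i for i, keep in enumerate(idx) if keep]  # boolean mask
    return list(idx)                                      # long indices (list or tuple)

print(resolve_indices(slice(2, 5), 10))          # [2, 3, 4]
print(resolve_indices([True, False, True], 3))   # [0, 2]
print(resolve_indices((0, 4, 2), 10))            # [0, 4, 2]
```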

len() int

Returns the number of data objects stored in the dataset.

load(path: str, data_cls: Type[BaseData] = Data) None

Loads the dataset from the file path path.

property num_classes: int

Returns the number of classes in the dataset.

property num_edge_features: int

Returns the number of features per edge in the dataset.

property num_features: int

Returns the number of features per node in the dataset. Alias for num_node_features.

property num_node_features: int

Returns the number of features per node in the dataset.

print_summary(fmt: str = 'psql') None

Prints summary statistics of the dataset to the console.

Parameters:

fmt (str, optional) – Summary table format. Any table format supported by the tabulate package can be used. (default: "psql")

process() None

Processes the dataset to the self.processed_dir folder.

property processed_file_names: str

The names of the files in the self.processed_dir folder that must be present in order to skip processing.

property processed_paths: List[str]

The absolute filepaths that must be present in order to skip processing.

property raw_file_names: str

The names of the files in the self.raw_dir folder that must be present in order to skip downloading.

property raw_paths: List[str]

The absolute filepaths that must be present in order to skip downloading.
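How these properties drive the skip-download and skip-processing checks can be sketched with a minimal plain-Python mimic; the class and file names below are illustrative, not the real InMemoryDataset:

```python
import os

class MiniDataset:
    # Minimal mimic of the skip-download / skip-processing logic.
    # File names are hypothetical placeholders.
    raw_file_names = ["mols.csv"]
    processed_file_names = ["data.pt"]

    def __init__(self, root):
        self.raw_dir = os.path.join(root, "raw")
        self.processed_dir = os.path.join(root, "processed")

    @property
    def raw_paths(self):
        # Absolute/relative paths that must exist to skip downloading.
        return [os.path.join(self.raw_dir, f) for f in self.raw_file_names]

    @property
    def processed_paths(self):
        # Paths that must exist to skip processing.
        return [os.path.join(self.processed_dir, f) for f in self.processed_file_names]

    def needs_download(self):
        return not all(os.path.exists(p) for p in self.raw_paths)

    def needs_processing(self):
        return not all(os.path.exists(p) for p in self.processed_paths)

ds = MiniDataset("./data")
print(ds.raw_paths)
print(ds.processed_paths)
```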

classmethod save(data_list: Sequence[BaseData], path: str) None

Saves a list of data objects to the file path path.

shuffle(return_perm: bool = False) Dataset | Tuple[Dataset, Tensor]

Randomly shuffles the examples in the dataset.

Parameters:

return_perm (bool, optional) – If set to True, will also return the random permutation used to shuffle the dataset. (default: False)
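A plain-Python analogue (a hypothetical helper, not the real method) shows how the returned permutation relates to the shuffled order, namely shuffled[j] == items[perm[j]]:

```python
import random

def shuffle_with_perm(items, return_perm=False, seed=0):
    # Mimics Dataset.shuffle: optionally returns the permutation used,
    # where shuffled[j] == items[perm[j]]. The seed is for reproducibility
    # of this sketch only.
    rng = random.Random(seed)
    perm = list(range(len(items)))
    rng.shuffle(perm)
    shuffled = [items[i] for i in perm]
    return (shuffled, perm) if return_perm else shuffled

items = ["a", "b", "c", "d"]
shuffled, perm = shuffle_with_perm(items, return_perm=True)
assert all(shuffled[j] == items[perm[j]] for j in range(len(items)))
```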

to(device: int | str) InMemoryDataset

Performs device conversion of the whole dataset.

to_datapipe() Any

Converts the dataset into a torch.utils.data.DataPipe.

The returned instance can then be used with PyG's built-in DataPipes for batching graphs as follows:

from torch_geometric.datasets import QM9

dp = QM9(root='./data/QM9/').to_datapipe()
dp = dp.batch_graphs(batch_size=2, drop_last=True)

for batch in dp:
    pass

See the PyTorch tutorial for further background on DataPipes.

to_on_disk_dataset(root: str | None = None, backend: str = 'sqlite', log: bool = True) OnDiskDataset

Converts the InMemoryDataset to an OnDiskDataset variant. Useful for distributed training and hardware instances with a limited amount of shared memory.

Parameters:

root (str, optional) – Root directory where the dataset should be saved. If set to None, will save the dataset in root/on_disk. Note that it is important to specify root to account for different dataset splits. (default: None)

backend (str) – The Database backend to use. (default: "sqlite")

log (bool, optional) – Whether to print any console output while processing the dataset. (default: True)