Some of my work was inspired by the following work: https://www.kaggle.com/code/priyanagda/simple-gcn
deepchem's BBBP (Blood-brain Barrier Penetration) dataset
"The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in development of drugs targeting central nervous system."
- 1st column: row number (not needed for modeling)
- 2nd column: compound name
- 3rd column: p_np (boolean labels for permeability of blood brain barrier)
- 4th column: SMILES (simplified molecular-input line-entry system)
The dataset contains the names and SMILES strings of 2050 different compounds.
Sample from the dataset:
The dataset's target values (permeability) are distributed as follows:
Reference: https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html
- Featurizer: Transforming raw input data into processed data
- MolGraphConvFeaturizer: a featurizer for general graph convolutional networks on molecules
- Stratified Train/Valid/Test Split: Splitting the dataset into train, validation, and test sets
- 80% for training
- 10% for validation
- 10% for testing
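The 80/10/10 stratified split above can be sketched as follows. This is a minimal NumPy sketch of the idea, not DeepChem's actual splitter, and the `labels` array is an illustrative stand-in for the p_np column:

```python
import numpy as np

def stratified_split(labels, frac_train=0.8, frac_valid=0.1, seed=0):
    """Split indices so each class keeps the same proportions in every subset."""
    rng = np.random.default_rng(seed)
    train, valid, test = [], [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        n_train = int(frac_train * len(idx))
        n_valid = int(frac_valid * len(idx))
        train.extend(idx[:n_train])
        valid.extend(idx[n_train:n_train + n_valid])
        test.extend(idx[n_train + n_valid:])
    return np.array(train), np.array(valid), np.array(test)

# Toy binary labels mimicking p_np (1 = permeable, 0 = not permeable)
labels = np.array([1] * 80 + [0] * 20)
train, valid, test = stratified_split(labels)
```

Splitting per class (rather than over the whole index range) keeps the permeable/non-permeable ratio roughly the same in all three subsets, which matters here because the BBBP labels are imbalanced.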
I passed edge_index (the graph connectivity) into the GCN model for training.
Here, edge_index has the shape [2, num_edges].
Each compound can be seen as an unweighted and undirected graph as the example shown below:
edge_index is stored in COO format: each column holds one directed edge, with the source node in the first row and the target node in the second row. Because the molecular graph is undirected, every bond appears twice, once in each direction.
Example:
- 1st row (sources): [0, 1, 1, 2]
- 2nd row (targets): [1, 0, 2, 1]
- read column by column: node0 -> node1, node1 -> node0, node1 -> node2, node2 -> node1
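For the three-node example above, edge_index can be written out directly. This is a NumPy sketch; PyTorch Geometric uses a `torch.long` tensor of the same shape:

```python
import numpy as np

# COO connectivity for the path graph 0 - 1 - 2.
# Each undirected bond is stored as two directed edges (columns).
edge_index = np.array([
    [0, 1, 1, 2],   # source nodes
    [1, 0, 2, 1],   # target nodes
])

# Shape is [2, num_edges]: 2 undirected bonds -> 4 directed edges.
num_edges = edge_index.shape[1]
```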
Reference: https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html
Why is the validation loss lower than the training loss for the first few epochs of training?
In the beginning, I did not understand why the validation loss was lower than the training loss for the first few epochs of training. I had initially assumed that the training loss should be lower than the validation loss, yet I found no data leakage into my validation dataset. So I decided to research the cause of this phenomenon further.
Below are possible reasons for having a lower validation loss than the training loss.
- Dropout is applied during training but not during validation. I used dropout after my convolution layer. Remember, dropout is a regularization technique that randomly zeroes out some of the neurons, which discourages overfitting but also makes the training-time predictions noisier and the training loss higher. During validation the full network is used, so the model can fit the data better and the validation loss can come out lower than the training loss.
- Weight decay (L2 regularization) penalizes the training objective but not the validation loss. I used weight decay through the optimizer. Remember, regularization adds a penalty on the weights to the objective being minimized during training, while the validation loss measures only the prediction error. This, too, can push the training loss above the validation loss.
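The dropout effect above can be illustrated with a toy forward pass. This is a NumPy sketch with made-up activations and weights, not the actual GCN: in training mode units are randomly zeroed (inverted dropout), so the loss is noisy, while in evaluation mode the full network is used and the loss can be lower:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))          # toy hidden activations
w = rng.normal(size=(16, 1)) * 0.1     # toy output weights
y = x @ w                              # noise-free targets, for illustration only

def forward(h, training, p=0.5):
    """Inverted dropout: active only in training mode, identity in eval mode."""
    if training:
        mask = rng.random(h.shape) > p
        h = h * mask / (1 - p)          # rescale so the expected activation is unchanged
    return h @ w

mse_train = np.mean((forward(x, training=True) - y) ** 2)
mse_eval = np.mean((forward(x, training=False) - y) ** 2)
# eval-mode loss is lower: the full network is used, with no dropout noise
```

In PyTorch, the same switch happens through `model.train()` and `model.eval()`, which is why a validation pass evaluates a slightly "stronger" network than the one the training loss is computed on.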
[1] https://pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/