Some of my work was inspired by the following work: https://www.kaggle.com/code/priyanagda/simple-gcn
deepchem's BBBP (Blood-brain Barrier Penetration) dataset
"The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in development of drugs targeting central nervous system."
- 1st column: row number (not needed for modeling)
- 2nd column: compound name
- 3rd column: p_np (boolean labels for permeability of blood brain barrier)
- 4th column: SMILES (simplified molecular-input line-entry system)
The dataset contains the names and SMILES strings of 2050 different compounds.
Sample from the dataset:
The dataset's target values (permeability) are distributed as follows:
Reference: https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html
- Featurizer: Transforming raw input data into processed data
- MolGraphConvFeaturizer: a featurizer for general graph convolutional networks on molecules
- Stratified Train/Valid/Test Split: Splitting the dataset into train, validation, and test sets
- 80% for training
- 10% for validation
- 10% for testing
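The 80/10/10 stratified split above can be sketched as follows. This is a minimal NumPy sketch of the idea, not DeepChem's actual splitter, and the `labels` array is an illustrative stand-in for the p_np column:

```python
import numpy as np

def stratified_split(labels, frac_train=0.8, frac_valid=0.1, seed=0):
    """Split indices so each class keeps the same proportions in every subset."""
    rng = np.random.default_rng(seed)
    train, valid, test = [], [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        n_train = int(frac_train * len(idx))
        n_valid = int(frac_valid * len(idx))
        train.extend(idx[:n_train])
        valid.extend(idx[n_train:n_train + n_valid])
        test.extend(idx[n_train + n_valid:])
    return np.array(train), np.array(valid), np.array(test)

# Toy binary labels mimicking p_np (1 = permeable, 0 = not permeable)
labels = np.array([1] * 80 + [0] * 20)
train, valid, test = stratified_split(labels)
```

Splitting per class (rather than over the whole index range) keeps the permeable/non-permeable ratio roughly the same in all three subsets, which matters here because the BBBP labels are imbalanced.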
I passed edge_index (the graph connectivity) into the GCN model for training.
Here, edge_index has the shape [2, num_edges].
Each compound can be seen as an unweighted and undirected graph as the example shown below:
edge_index is stored in COO format: each column holds one directed edge, with the source node in the first row and the target node in the second row. Because the molecular graph is undirected, every bond appears twice, once in each direction.
Example:
- 1st row (sources): [0, 1, 1, 2]
- 2nd row (targets): [1, 0, 2, 1]
- read column by column: node0 -> node1, node1 -> node0, node1 -> node2, node2 -> node1
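For the three-node example above, edge_index can be written out directly. This is a NumPy sketch; PyTorch Geometric uses a `torch.long` tensor of the same shape:

```python
import numpy as np

# COO connectivity for the path graph 0 - 1 - 2.
# Each undirected bond is stored as two directed edges (columns).
edge_index = np.array([
    [0, 1, 1, 2],   # source nodes
    [1, 0, 2, 1],   # target nodes
])

# Shape is [2, num_edges]: 2 undirected bonds -> 4 directed edges.
num_edges = edge_index.shape[1]
```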
Reference: https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html
Why is the validation loss lower than the training loss for the first few epochs of training?
In the beginning, I did not understand why the validation loss was lower than the training loss for the first few epochs of training. I had initially assumed that the training loss should be lower than the validation loss, yet I found no data leakage into my validation dataset. So I decided to research the cause of this phenomenon further.
Below are possible reasons for having a lower validation loss than the training loss.
- Dropout is applied during training but not during validation. I used dropout after my convolution layer. Remember, dropout is a regularization technique that randomly zeroes out some of the neurons, which discourages overfitting but also makes the training-time predictions noisier and the training loss higher. During validation the full network is used, so the model can fit the data better and the validation loss can come out lower than the training loss.
- Weight decay (L2 regularization) penalizes the training objective but not the validation loss. I used weight decay through the optimizer. Remember, regularization adds a penalty on the weights to the objective being minimized during training, while the validation loss measures only the prediction error. This, too, can push the training loss above the validation loss.
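The dropout effect above can be illustrated with a toy forward pass. This is a NumPy sketch with made-up activations and weights, not the actual GCN: in training mode units are randomly zeroed (inverted dropout), so the loss is noisy, while in evaluation mode the full network is used and the loss can be lower:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))          # toy hidden activations
w = rng.normal(size=(16, 1)) * 0.1     # toy output weights
y = x @ w                              # noise-free targets, for illustration only

def forward(h, training, p=0.5):
    """Inverted dropout: active only in training mode, identity in eval mode."""
    if training:
        mask = rng.random(h.shape) > p
        h = h * mask / (1 - p)          # rescale so the expected activation is unchanged
    return h @ w

mse_train = np.mean((forward(x, training=True) - y) ** 2)
mse_eval = np.mean((forward(x, training=False) - y) ** 2)
# eval-mode loss is lower: the full network is used, with no dropout noise
```

In PyTorch, the same switch happens through `model.train()` and `model.eval()`, which is why a validation pass evaluates a slightly "stronger" network than the one the training loss is computed on.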
[1] https://pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/