
GCN-for-BBBP-Prediction

Graph Convolutional Networks (GCN) for Predicting Blood-Brain Barrier Permeability of Compounds

Some of my work was inspired by the following notebook: https://www.kaggle.com/code/priyanagda/simple-gcn


Dataset Used:

DeepChem's BBBP (Blood-Brain Barrier Penetration) dataset

"The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in development of drugs targeting central nervous system."

  • 1st column: row number (not used for modeling)
  • 2nd column: compound name
  • 3rd column: p_np (binary label for blood-brain barrier permeability)
  • 4th column: SMILES (simplified molecular-input line-entry system) string

The dataset contains the names and SMILES strings of 2,050 different compounds.

Sample from the dataset:

[image: sample rows from the dataset]

The dataset's target values (permeability) are distributed as follows: [image: label distribution]

Reference: https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html
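
For reference, here is a minimal sketch of loading the raw BBBP data with pandas. It assumes the MoleculeNet CSV mirror used by DeepChem's loaders, which may differ from the exact loading code in this project.

```python
# Minimal sketch (assumption: raw CSV from the MoleculeNet/DeepChem mirror).
import pandas as pd

url = "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv"
df = pd.read_csv(url)

print(df.shape)             # roughly (2050, 4)
print(df.columns.tolist())  # expected: ['num', 'name', 'p_np', 'smiles']
print(df.head())
```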

Pre-processing of Dataset

  • Featurizer: transforms the raw input data (SMILES strings) into processed, model-ready features
  • Stratified Train/Valid/Test Split: splits the dataset into training, validation, and test sets (see the sketch below)
    • 80% for training
    • 10% for validation
    • 10% for testing
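
A minimal sketch of the 80/10/10 stratified split using scikit-learn; the project itself may rely on DeepChem's splitters instead, so treat this as illustrative only.

```python
# Hedged sketch: 80/10/10 split, stratified on the binary p_np label.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("BBBP.csv")  # or the MoleculeNet URL shown above

# Carve off 20% of the rows, preserving the label distribution.
train_df, rest_df = train_test_split(
    df, test_size=0.2, stratify=df["p_np"], random_state=42)

# Split the held-out 20% evenly into validation and test sets (10% each overall).
valid_df, test_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df["p_np"], random_state=42)

print(len(train_df), len(valid_df), len(test_df))
```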

edge_index

I used edge_index as the graph-connectivity input passed into the GCN model for training.

Here, edge_index has shape [2, num_edges].

Each compound can be seen as an unweighted, undirected graph, as in the example shown below: [image: a molecule represented as an undirected graph]

In edge_index, the first row lists the source nodes and the second row lists the target nodes; each column is one directed edge. Because the molecular graph is undirected, every bond appears twice, once in each direction.

Example:

[image: edge_index example]

  • 1st row (source nodes): [0, 1, 1, 2]
  • 2nd row (target nodes): [1, 0, 2, 1]
  • Read column-wise, the edges are node0 -> node1, node1 -> node0, node1 -> node2, node2 -> node1, i.e. the two bonds 0-1 and 1-2, each in both directions.
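
A short sketch of how such an edge_index can be built from a SMILES string with RDKit and PyTorch; the actual featurization code in this project may differ.

```python
# Hedged sketch: build an undirected edge_index ([2, num_edges]) from a SMILES string.
import torch
from rdkit import Chem

def smiles_to_edge_index(smiles: str) -> torch.Tensor:
    mol = Chem.MolFromSmiles(smiles)
    src, dst = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        # Add each bond in both directions so the graph is undirected.
        src += [i, j]
        dst += [j, i]
    return torch.tensor([src, dst], dtype=torch.long)

print(smiles_to_edge_index("CCO"))  # a 3-atom chain, matching the example above
# tensor([[0, 1, 1, 2],
#         [1, 0, 2, 1]])
```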

Reference: https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html

GCN Model Architecture

[image: GCN model architecture]
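
The exact architecture is shown in the figure above. As a rough, illustrative sketch only (layer count and sizes are my assumptions, not the pictured architecture), a PyTorch Geometric GCN with dropout after the convolutions could look like this:

```python
# Illustrative GCN classifier in PyTorch Geometric (sizes/depth are assumptions).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GCN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.lin = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)  # dropout after the conv layer
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)  # graph-level readout over each molecule
        return self.lin(x)

# Weight decay (L2 regularization) is typically applied through the optimizer, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
```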

Result

[image: results]

Testing the Trained Model

[images: test results]

Discussion

Why is the validation loss lower than the training loss for the first few epochs of training?

At first, I did not understand why the validation loss was lower than the training loss for the first few epochs of training; I had initially assumed the training loss should always be lower than the validation loss. I did not find any data leakage into my validation set, so I decided to research the cause of this phenomenon further.

Below are possible reasons why the validation loss can be lower than the training loss.

  1. Dropout is applied during training but not during validation. I used dropout after my convolution layers. Dropout is a regularization technique that randomly zeroes out some activations, so the training loss is computed on a randomly "thinned" network, which pushes it upward. During validation, dropout is disabled and the full network is used, so the validation loss can come out lower (a small demo follows this list).
  2. Regularization is applied during training but not during validation. I used weight decay (L2 regularization) on the model's weights. The regularization penalty affects only the training objective; the validation loss carries no such penalty term, so it can be lower than the training loss.
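
A tiny standalone demonstration of point 1, showing how a dropout layer behaves in training mode versus evaluation mode:

```python
# Dropout is active in train mode and disabled in eval mode.
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)
x = torch.ones(6)

drop.train()
print(drop(x))  # some entries zeroed; survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity: dropout is off, as during validation
```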

References:

[1] https://pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/

[2] https://towardsdatascience.com/what-your-validation-loss-is-lower-than-your-training-loss-this-is-why-5e92e0b1747e#:~:text=The%20regularization%20terms%20are%20only,loss%20than%20the%20training%20set.
