< Paper Reproduction >

Protein Representation Learning by Geometric Structure Pretraining

Author: Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang

Reproduced by: Seungwoo Ryu

Not really well reproduced :(
- Deficiency of domain knowledge on protein
- Some arbitrary interpretations not mentioned on the original paper

Suppose all the snippets below start from your own root directory.
- Downloaded folder name is assumed to be a 'GearNet'.

Pretraining Dataset

Instead of using AlphaFoldDB(805K) for pretraining, I used Swiss-Prot(540K) protein dataset.
Disparity of the pretraining dataset can make subtle (or considerable) difference b/w the result of original paper and that of mine.
Can download the data at Here, or by
```
wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/swissprot_pdb_v3.tar -P ./
```
The expressions/schema of dataset might follow how doc1 or doc2 expresses each protein.

Downstream Dataset

Special Preprocessing on EC & GO
For EC Number Prediction and GO Term Prediction:
- First introducted by Paper
- Caution!
  - It is not possible to use their original data at all.
    - As this paper used contact map as a feature for the model, they didn't use explicit coordinate information of atoms. Therefore, their preprocessed files do not offer any info. about intact 3D coordinates which is essential on GearNet(-Variants). Even the .tfrecords files offered on Data section of the github page only contain information of contact map.
    - The code of the paper offers preprocessing code in preprocessing/data_collection.sh. However, the code in the 20th line
      wget https://cdn.rcsb.org/resources/sequence/clusters/bc-95.out -O $DATA_DIR/bc-95.out
      shows an error with the message Not Found. The requested URL was not found on this server.. Therefore, retrieving necessary information from original PDB file is impossible, and the command afterward is useless.
- My strategy is:
  1. Extract the pdb names from the data split given on the paper and gather all.
  2. Based on the collection of the name, download pdb file one by one from the web.
  3. Extract 3D coordinates information from the downloaded files.
  - After following these steps,
    EC: {'train': 4, 'valid': 3, 'test': 0} sets are inevitably omitted from the original dataset.
    GO: {'train': 18, 'valid': 1, 'test': 1} sets are inevitably omitted from the original dataset.
    whose coordinates are expressed awkward.
- Download the split info. of original paper by:
```
git clone https://github.com/flatironinstitute/DeepFRI
mkdir -p downstream/dataset/EC_GO
cp -r DeepFRI/preprocessing/data/* downstream/dataset/EC_GO/
```

For Fold Classification

First introduced by Paper

Can download the data at Here or by

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1chZAkaZlEBaOcjHQ3OUOdiKZqIn36qar' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1chZAkaZlEBaOcjHQ3OUOdiKZqIn36qar" -O HomologyTAPE.zip && rm -rf /tmp/cookies.txt

For Reaction Classification

Was introduced in a same paper introduced in Fold Classification

Can download the data at Here or by

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1udP6_90WYkwkvL1LwqIAzf9ibegBJ8rI' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1udP6_90WYkwkvL1LwqIAzf9ibegBJ8rI" -O ProtFunct.zip && rm -rf /tmp/cookies.txt

After running all the codes above, preparation for data is all done!

Preparation for the -

Environment


conda create -n GearNet python=3.8.5 
conda activate GearNet
pip install -r requirements.txt

Dataset

└ For Pretraining

mkdir -p uniprot/dataset
tar -xf swissprot_pdb_v3.tar -C ./uniprot/dataset

mkdir -p uniprot/interim
python GearNet/preprocess/preprocess_pt.py --data_dir ./uniprot/dataset --save_dir ./uniprot/interim

As mentioned before, the dataset the model is pretrained on is different from the original one.
- Swiss-Prot data does not have information about resolution (Appendix G).
- The only standard used for filtering: Incorrect records such as 53.353-100.177 at the position of coordinate information.
  - 4121 proteins among 542380 are excluded.
- Additionally, I excluded 3000 datasets for validation.
- So, the final number of data in train set is 535259.

└ For Downstream Task

Although datasets are already prepared in advance following published papers,
we need to pre-process more than those as we need 'coordinate' information for GearNet(-variants).
To extract coordinates info. from raw pdb files and make inputs for model, implement:
```
bash GearNet/preprocess/run_downstream.sh
```

└ Or You can download the preprocessed data from

Locate all the downloaded folders on the root directory.

https://drive.google.com/drive/folders/1aE3TPok3YfF-P5mchIbUmMe3195PlY9S?usp=sharing

Experiment

Following the original paper, all the experiments are set in a DistributedDataParallel(DDP) setting.

Pretraining

bash main.sh pretrain

Can manully change options on main.sh script for other options.
- For example, if you want to...
  
  Pretrain the GearNet-Edge model with MultiviewContrastiveLearning objective on GPU #0,1
  
  set options as
```
gpu="0 1"
enc_model="GearNet-Edge"
task_idx=0
```

Downstream

bash main.sh downstream

Can manually change options on main.sh script, likewise.
If you want to load pre-trained weights for inference, set load option to True
- Because I couldn't train a large model, I don't have any pretrained model to load which is trained on Pretraining objectives.

P.S.

It is not an official code for the original paper, and there are some points that I interpreted arbitrarily in case I don't get clearly on the paper. Furthermore, when specific reference codes/data which were mentioned on the paper don't work well, I coded/preprocessed by myself. Therefore, there is tremendous possibility that this code doesn't fit to the paper of the original authors. They plan to make their code public when their awesome paper is once accepted, so it might be better to refer theirs. I will fix the errors caused by my mis-understanding once their code is open to public.

tryumanshow / gearnet-reproduce-just4fun Goto Github PK

gearnet-reproduce-just4fun's Introduction

Protein Representation Learning by Geometric Structure Pretraining

Pretraining Dataset

Downstream Dataset

Preparation for the -

Environment

Dataset

└ For Pretraining

└ For Downstream Task

└ Or You can download the preprocessed data from

Experiment

Pretraining

Downstream

P.S.

gearnet-reproduce-just4fun's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent