< Paper Reproduction >
Author: Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang
Reproduced by: Seungwoo Ryu
- Not really well reproduced :(
- Deficiency of domain knowledge on protein
- Some arbitrary interpretations not mentioned on the original paper
- Suppose all the snippets below start from your own root directory.
- Downloaded folder name is assumed to be a 'GearNet'.
- Instead of using
AlphaFoldDB
(805K) for pretraining, I usedSwiss-Prot
(540K) protein dataset.
Disparity of the pretraining dataset can make subtle (or considerable) difference b/w the result of original paper and that of mine.
Can download the data at Here, or bywget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/swissprot_pdb_v3.tar -P ./
- The expressions/schema of dataset might follow how doc1 or doc2 expresses each protein.
-
Special Preprocessing on EC & GO
-
For
EC Number Prediction
andGO Term Prediction
:-
First introducted by Paper
-
Caution!
- It is not possible to use their original data at all.
- As this paper used
contact map
as a feature for the model, they didn't use explicit coordinate information of atoms. Therefore, their preprocessed files do not offer any info. about intact 3D coordinates which is essential on GearNet(-Variants). Even the.tfrecords
files offered onData
section of the github page only contain information of contact map. - The code of the paper offers preprocessing code in
preprocessing/data_collection.sh
. However, the code in the 20th lineshows an error with the messagewget https://cdn.rcsb.org/resources/sequence/clusters/bc-95.out -O $DATA_DIR/bc-95.out
Not Found. The requested URL was not found on this server.
. Therefore, retrieving necessary information from original PDB file is impossible, and the command afterward is useless.
- As this paper used
- It is not possible to use their original data at all.
-
My strategy is:
- Extract the pdb names from the data split given on the paper and gather all.
- Based on the collection of the name, download pdb file one by one from the web.
- Extract 3D coordinates information from the downloaded files.
- After following these steps,
EC: {'train': 4, 'valid': 3, 'test': 0} sets are inevitably omitted from the original dataset.
GO: {'train': 18, 'valid': 1, 'test': 1} sets are inevitably omitted from the original dataset.
whose coordinates are expressed awkward.
-
Download the split info. of original paper by:
git clone https://github.com/flatironinstitute/DeepFRI mkdir -p downstream/dataset/EC_GO cp -r DeepFRI/preprocessing/data/* downstream/dataset/EC_GO/
-
-
For
Fold Classification
- First introduced by Paper
- Can download the data at Here or by
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1chZAkaZlEBaOcjHQ3OUOdiKZqIn36qar' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1chZAkaZlEBaOcjHQ3OUOdiKZqIn36qar" -O HomologyTAPE.zip && rm -rf /tmp/cookies.txt
-
For
Reaction Classification
- Was introduced in a same paper introduced in
Fold Classification
- Can download the data at Here or by
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1udP6_90WYkwkvL1LwqIAzf9ibegBJ8rI' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1udP6_90WYkwkvL1LwqIAzf9ibegBJ8rI" -O ProtFunct.zip && rm -rf /tmp/cookies.txt
- Was introduced in a same paper introduced in
-
After running all the codes above, preparation for data is all done!
conda create -n GearNet python=3.8.5
conda activate GearNet
pip install -r requirements.txt
mkdir -p uniprot/dataset
tar -xf swissprot_pdb_v3.tar -C ./uniprot/dataset
mkdir -p uniprot/interim
python GearNet/preprocess/preprocess_pt.py --data_dir ./uniprot/dataset --save_dir ./uniprot/interim
- As mentioned before, the dataset the model is pretrained on is different from the original one.
- Swiss-Prot data does not have information about resolution (Appendix G).
- The only standard used for filtering: Incorrect records such as
53.353-100.177
at the position of coordinate information.- 4121 proteins among 542380 are excluded.
- Additionally, I excluded 3000 datasets for validation.
- So, the final number of data in train set is
535259
.
- Although datasets are already prepared in advance following published papers,
we need to pre-process more than those as we need 'coordinate' information for GearNet(-variants). - To extract coordinates info. from raw pdb files and make inputs for model, implement:
bash GearNet/preprocess/run_downstream.sh
-
Locate all the downloaded folders on the root directory.
https://drive.google.com/drive/folders/1aE3TPok3YfF-P5mchIbUmMe3195PlY9S?usp=sharing
- Following the original paper, all the experiments are set in a DistributedDataParallel(DDP) setting.
bash main.sh pretrain
- Can manully change options on
main.sh
script for other options.-
For example, if you want to...
Pretrain
theGearNet-Edge
model withMultiviewContrastiveLearning
objective onGPU #0,1
set options as
gpu="0 1" enc_model="GearNet-Edge" task_idx=0
-
bash main.sh downstream
- Can manually change options on
main.sh
script, likewise. - If you want to load pre-trained weights for inference, set
load
option toTrue
- Because I couldn't train a large model, I don't have any pretrained model to load which is trained on
Pretraining objectives
.
- Because I couldn't train a large model, I don't have any pretrained model to load which is trained on
- It is not an official code for the original paper, and there are some points that I interpreted arbitrarily in case I don't get clearly on the paper. Furthermore, when specific reference codes/data which were mentioned on the paper don't work well, I coded/preprocessed by myself. Therefore, there is tremendous possibility that this code doesn't fit to the paper of the original authors. They plan to make their code public when their awesome paper is once accepted, so it might be better to refer theirs. I will fix the errors caused by my mis-understanding once their code is open to public.