deepgraphlearning / gearnet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)

License: MIT License

Python 98.25% Dockerfile 1.75%
graph-neural-networks pre-training protein-representation-learning

gearnet's Introduction

GearNet: Geometry-Aware Relational Graph Neural Network

This is the official codebase of the paper

Protein Representation Learning by Geometric Structure Pretraining, ICLR'2023

[ArXiv] [OpenReview]

Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang

and the paper

Enhancing Protein Language Models with Structure-based Encoder and Pre-training, ICLR'2023 MLDD Workshop

[ArXiv] [OpenReview]

Zuobai Zhang, Minghao Xu, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang

News

  • [2023/10/17] Please check the latest version of the ESM-GearNet paper and code implementation!!

  • [2023/03/14] The code for ESM_GearNet has been released with our latest paper.

  • [2023/02/25] The code for GearNet_Edge_IEConv & Fold3D dataset has been released.

  • [2023/02/01] Our paper has been accepted by ICLR'2023! We have released the pretrained model weights here.

  • [2022/11/20] We added a scheduler to downstream.py and provided a config file for training GearNet-Edge on EC with a single GPU. You can now reproduce the results in the paper.

Overview

GeomEtry-Aware Relational Graph Neural Network (GearNet) is a simple yet effective structure-based protein encoder. It encodes spatial information by adding different types of sequential or structural edges and then performs relational message passing on protein residue graphs, which can be further enhanced by an edge message passing mechanism. Though conceptually simple, GearNet augmented with edge message passing can achieve very strong performance on several benchmarks in a supervised setting.
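As a rough illustration of the graph construction described above, the sketch below builds typed edges over a list of C-alpha coordinates. The specific choices (sequence offsets from -2 to 2, each as its own relation, a 10 Å radius, and 10 nearest neighbours) are assumptions made for this example, and the name build_residue_edges is hypothetical. The TorchDrug implementation may apply additional filtering, so treat this as a sketch rather than the reference behaviour.

```python
import math


def _distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def build_residue_edges(ca_coords, max_seq_offset=2, radius=10.0, k=10):
    """Return (edges, num_relation) for a list of C-alpha coordinates.

    Each edge is a tuple (source, target, relation). All defaults are
    illustrative assumptions, not values read from the repository.
    """
    n = len(ca_coords)
    offsets = list(range(-max_seq_offset, max_seq_offset + 1))
    radius_rel = len(offsets)   # relation id used for radius edges
    knn_rel = radius_rel + 1    # relation id used for k-nearest-neighbour edges
    edges = set()
    for i in range(n):
        # Sequential edges: every sequence offset is its own relation type.
        for rel, offset in enumerate(offsets):
            j = i + offset
            if 0 <= j < n:
                edges.add((i, j, rel))
        # Spatial neighbours of residue i, nearest first.
        neighbours = sorted(
            (_distance(ca_coords[i], ca_coords[j]), j) for j in range(n) if j != i
        )
        for dist, j in neighbours:
            if dist < radius:
                edges.add((i, j, radius_rel))
        for _, j in neighbours[:k]:
            edges.add((i, j, knn_rel))
    return edges, knn_rel + 1


# Four residues on a straight line, 3.8 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
edges, num_relation = build_residue_edges(coords, radius=5.0, k=1)
```

A relational message passing layer would then aggregate neighbours separately for each relation id, which is why the number of relations matters for layer shapes.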

Five different geometric self-supervised learning methods based on protein structures are further proposed to pretrain the encoder: Multiview Contrast, Residue Type Prediction, Distance Prediction, Angle Prediction, and Dihedral Prediction. By extensively benchmarking these pretraining techniques on diverse downstream tasks, we set up a solid starting point for pretraining protein structure representations.
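To make the geometric targets concrete, the sketch below computes the three quantities that the distance, angle, and dihedral tasks are named after, from plain 3D coordinates. It only shows the geometry; how the pretraining tasks sample residues or edges, and whether they discretise these values into bins, is not covered here. All function names are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _norm(a):
    return math.sqrt(_dot(a, a))


def distance(p, q):
    """Euclidean distance between points p and q."""
    return _norm(_sub(p, q))


def angle(p, q, r):
    """Angle at q between the directions q->p and q->r, in radians."""
    u, v = _sub(p, q), _sub(r, q)
    return math.atan2(_norm(_cross(u, v)), _dot(u, v))


def dihedral(p, q, r, s):
    """Signed dihedral angle between planes (p, q, r) and (q, r, s), in radians."""
    b1, b2, b3 = _sub(q, p), _sub(r, q), _sub(s, r)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.atan2(_norm(b2) * _dot(b1, n2), _dot(n1, n2))
```

For example, distance((0, 0, 0), (3, 4, 0)) is 5.0, and the angle at the origin between the x and y axes is pi / 2.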

This codebase is based on PyTorch and TorchDrug (TorchProtein). It supports training and inference with multiple GPUs. The documentation and implementation of our methods can be found in the docs of TorchDrug. To adapt our model to your setting, you can follow the step-by-step tutorials in TorchProtein.

Installation

You may install the dependencies via either conda or pip. Generally, GearNet works with Python 3.7/3.8 and PyTorch version >= 1.8.0.

From Conda

conda install torchdrug pytorch=1.8.0 cudatoolkit=11.1 -c milagraph -c pytorch-lts -c pyg -c conda-forge
conda install easydict pyyaml -c conda-forge

From Pip

pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip install torchdrug
pip install easydict pyyaml

Using Docker

First, make sure Docker is set up with GPU support (guide).

Next, build the Docker image. Docker requires image names to be lowercase, so the tag used here is gearnet:

docker build . -t gearnet

Once the image is built, you can run training commands inside a container with the following command:

docker run -it -v /path/to/dataset/directory/on/disk:/root/scratch/ --gpus all gearnet bash

Reproduction

Training From Scratch

To reproduce the results of GearNet, use the following command. Alternatively, you may use --gpus null to run GearNet on a CPU. All datasets are downloaded automatically by the code. The first run takes longer because the dataset has to be preprocessed.

# Run GearNet on the Enzyme Commission dataset with 1 GPU
python script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0]

We provide the hyperparameters for each experiment in configuration files. All the configuration files can be found in config/*.yaml.

To run GearNet with multiple GPUs, use the following commands.

# Run GearNet on the Enzyme Commission dataset with 4 GPUs
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0,1,2,3]

# Run ESM_GearNet on the Enzyme Commission dataset with 4 GPUs
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/ESM_gearnet.yaml --gpus [0,1,2,3]

# Run GearNet_Edge_IEConv on the Fold3D dataset with 4 gpus
# You need to first install the latest version of torchdrug from source. See https://github.com/DeepGraphLearning/torchdrug.
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/Fold3D/gearnet_edge_ieconv.yaml --gpus [0,1,2,3]

Pretraining and Finetuning

By default, we use the AlphaFold Database for pretraining. To pretrain GearNet-Edge with Multiview Contrast, use the following command. Similarly, all datasets are downloaded automatically and preprocessed the first time you run the code.

# Pretrain GearNet-Edge with Multiview Contrast
python script/pretrain.py -c config/pretrain/mc_gearnet_edge.yaml --gpus [0]

# Pretrain ESM_GearNet with Multiview Contrast
python script/pretrain.py -c config/pretrain/mc_esm_gearnet.yaml --gpus [0]

After pretraining, you can load the model weight from the saved checkpoint via the --ckpt argument and then finetune the model on downstream tasks.

# Finetune GearNet-Edge on the Enzyme Commission dataset
python script/downstream.py -c config/downstream/EC/gearnet_edge.yaml --gpus [0] --ckpt <path_to_your_model>

You can find the pretrained model weights here, including those pretrained with Multiview Contrast, Residue Type Prediction, Distance Prediction, Angle Prediction and Dihedral Prediction.

Results

Here are the results of GearNet with and without pretraining on standard benchmark datasets. All results were obtained with 4 A100 GPUs (40GB). Note that results may differ slightly if the model is trained with 1 GPU and/or a smaller batch size. For EC and GO, the provided config files are for 4 GPUs with a batch size of 2 on each one. If you run the model on 1 GPU, you should set the batch size to 8. More detailed results are listed in the paper.
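The batch size advice above follows from keeping the effective batch size (per-GPU batch size times the number of GPUs) constant. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def per_gpu_batch_size(effective_batch_size, num_gpus):
    """Per-GPU batch size that preserves a given effective batch size."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if effective_batch_size % num_gpus != 0:
        raise ValueError("effective batch size must be divisible by num_gpus")
    return effective_batch_size // num_gpus


# Reference setup from the text: 4 GPUs with batch size 2 on each.
reference_effective = 4 * 2

print(per_gpu_batch_size(reference_effective, 4))  # 2
print(per_gpu_batch_size(reference_effective, 1))  # 8
```

With 2 GPUs the same rule would give a per-GPU batch size of 4; whether that fits in memory depends on the hardware.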

Method                            EC      GO-BP   GO-MF   GO-CC
GearNet                           0.730   0.356   0.503   0.414
GearNet-Edge                      0.810   0.403   0.580   0.450
Multiview Contrast                0.874   0.490   0.654   0.488
Residue Type Prediction           0.843   0.430   0.604   0.465
Distance Prediction               0.839   0.448   0.616   0.464
Angle Prediction                  0.853   0.458   0.625   0.473
Dihedral Prediction               0.859   0.458   0.626   0.465
ESM_GearNet                       0.883   0.491   0.677   0.501
ESM_GearNet (Multiview Contrast)  0.894   0.516   0.684   0.5016

Citation

If you find this codebase useful in your research, please cite the following papers.

@inproceedings{zhang2022protein,
  title={Protein representation learning by geometric structure pretraining},
  author={Zhang, Zuobai and Xu, Minghao and Jamasb, Arian and Chenthamarakshan, Vijil and Lozano, Aurelie and Das, Payel and Tang, Jian},
  booktitle={International Conference on Learning Representations},
  year={2023}
}
@article{zhang2023enhancing,
  title={A Systematic Study of Joint Representation Learning on Protein Sequences and Structures},
  author={Zhang, Zuobai and Wang, Chuanrui and Xu, Minghao and Chenthamarakshan, Vijil and Lozano, Aurelie and Das, Payel and Tang, Jian},
  journal={arXiv preprint arXiv:2303.06275},
  year={2023}
}

gearnet's People

Contributors

inc0, oxer11

gearnet's Issues

Dealing with proteins with multiple chains

For proteins with multiple chains, did you split them by chain and input the splits into the model one by one, or directly input the whole proteins?

In the section "F ADDITIONAL EXPERIMENTAL RESULTS ON EC AND GO PREDICTION - Pretraining on different datasets" of your paper, you wrote:

Specifically, we extract 123,505
experimentally-determined protein structures from PDB whose resolutions are between 0.0 and 2.5
angstroms, and we further extract 305,265 chains from these proteins to construct the final dataset

which seems to imply that you trained the model on a set of single protein chains. However, you also ran experiments on Enzyme Commission number prediction. To my knowledge, many enzymes contain more than one chain. It seems impossible to split an enzyme into separate chains and feed them into the model one at a time, since that would hardly predict the enzyme type correctly.

pretrain dataset

Hi, I downloaded about 38k PDB files using the config files, but the paper says the pretraining dataset has 805k structures. Is this expected? Many thanks!

Attribution information of the Fold3D dataset

Hi, may I ask if you have a more detailed description of the data structure of the protein .hdf5 file of the Fold3D dataset? I find it contains much information about the protein, but I am not sure what some of them mean.

Any plans on releasing the model's weights?

Hi,
Thank you for your great work. Do you plan on releasing the model's weights any time soon? The README doesn't seem to mention any pretrained model. This would be very helpful to quickly get representations for new sequences.

Node classification tasks

Hi! First of all great job! I have been trying to do node classification in residue view, using my own node labels. However, I haven't been able to configure the NodePropertyPrediction task to use those labels instead of predicting the residue features. Do you have any guidance on how I can proceed to do this? Any help is appreciated

Asking about implementation of series connection of PLM & GNN in the FusionNetwork.

Hi, I've learned a lot from this great work. Thank you for presenting it in the paper and here!

I wanted to ask about the implementation of the series connection of PLM & GNN in the FusionNetwork. In the PLM+GNN paper (Zhang, Z. et al. Enhancing Protein Language Models with Structure-based Encoder and Pre-training. arXiv (2023) doi:10.48550/arxiv.2303.06275), the authors tested three ways of fusing PLM & GNN and decided to use the series connection. The series connection is described as

Series: we replace the node features of GearNet with the output of ESM-1b and use the output of GearNet as final representations.

In the implementation of FusionNetwork, I saw that it indeed uses the output of ESM-1b as the node features of GearNet, but it then seems to use the output of GearNet concatenated with the output of ESM-1b as the final representation (pasted below). Which of the two did the authors find most effective? Should one use the GearNet output alone or the concatenated output?

    def forward(self, graph, input, all_loss=None, metric=None):
        output1 = self.sequence_model(graph, input, all_loss, metric)
        node_output1 = output1.get("node_feature", output1.get("residue_feature"))
        output2 = self.structure_model(graph, node_output1, all_loss, metric)
        node_output2 = output2.get("node_feature", output2.get("residue_feature"))
        node_feature = torch.cat([node_output1, node_output2], dim=-1)
        graph_feature = torch.cat([
            output1['graph_feature'], 
            output2['graph_feature']
        ], dim=-1)
        return {
            "graph_feature": graph_feature,
            "node_feature": node_feature
        }

If possible, could you please share some configurations for trying out the "cross" style (quoted below) of fusing PLM & GNN? I am interested in testing this option and would like to know the transformer configurations (number of layers, hidden dims, number of heads) that you have tried.

Cross: we concatenate the output of ESM-1b and GearNet and then feed them into a transformer to perform cross-attention between modalities. The output of the transformer will be used as final representations.

max_length=100 for TruncateProtein in pretrain config files

Hello! Amazing work here. I am curious about a detail in the setup of the different pretraining tasks specified in the config files (.yaml files). In the configs of the self-prediction tasks, a TruncateProtein transform with max_length=100 is applied to the AlphaFoldDB dataset, while in the config of the Multiview Contrast task there is none. Is a similar truncation specified implicitly somewhere else for the MC task? Is truncating with max_length=100 needed to reproduce the results for pretraining on the self-prediction tasks?
Thank you!

A dataset not found when I run "python script/pretrain.py -c config/pretrain/mc_gearnet_edge.yaml --gpus [0]"

It seems that the file located at "https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000006548_3702_ARATH_v2.tar" really doesn't exist. When I entered this URL in my browser, it also reported that the file doesn't exist.

14:43:55   Downloading https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000006548_3702_ARATH_v2.tar to /home/horace/scratch/protein-datasets/alphafold/UP000006548_3702_ARATH_v2.tar
Traceback (most recent call last):
  File "script/pretrain.py", line 50, in <module>
    dataset = core.Configurable.load_config_dict(cfg.dataset)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/core/core.py", line 269, in load_config_dict
    return cls(**new_config)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/core/core.py", line 288, in wrapper
    return init(self, *args, **kwargs)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/datasets/alphafolddb.py", line 122, in __init__
    tar_file = utils.download(self.urls[species_id], path, md5=self.md5s[species_id])
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/utils/file.py", line 31, in download
    urlretrieve(url, save_file)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

About The Pre-training Process

Hey, sorry to trouble you with a question about the pre-training details of GearNet. :)

After pre-training on AlphaFold, do you freeze the model's parameters and only update the prediction head? Or do you update the pre-trained model and its prediction head together?

shape mismatch

I encountered a shape mismatch issue during runtime.

File "/home/admin/anaconda3/envs/test_env/lib/python3.7/site-packages/torchdrug-0.2.0-py3.7.egg/torchdrug/layers/conv.py", line 813, in message_and_aggregate
    return update.view(graph.num_node, self.num_relation * self.input_dim)
RuntimeError: shape '[975, 472]' is invalid for input of size 312000

protein structure:

print(protein, protein.node_feature.shape)  # PackedProtein(batch_size=1, num_atoms=[51], num_bonds=[975], num_residues=[51])   torch.Size([51, 21])
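The numbers in the error message can be checked directly. The sketch below only redoes the arithmetic from the traceback. Reading 472 as num_relation * input_dim comes from the quoted line of conv.py, while treating 320 as the width the tensor actually has is an inference from the sizes, not something confirmed by the library. The helper name is hypothetical.

```python
def explain_view_mismatch(num_rows, expected_width, actual_size):
    """Compare the element count a view() asks for with what the tensor holds."""
    requested = num_rows * expected_width
    actual_width, remainder = divmod(actual_size, num_rows)
    return {
        "requested_elements": requested,
        "actual_elements": actual_size,
        # Width per row that the data really has, if it divides evenly.
        "actual_width": actual_width if remainder == 0 else None,
    }


report = explain_view_mismatch(num_rows=975, expected_width=472, actual_size=312000)
print(report)
```

The view asks for 975 * 472 = 460200 elements but the tensor holds 312000, which is 975 rows of width 320. Note also that 975 equals num_bonds in the printed protein rather than the 51 residues, which would be consistent with the failure happening in an edge-level message passing step where the expected input width differs from the features actually supplied.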

secondary structure evaluation

Hi, thank you for your amazing work!

I am trying to evaluate GearNet on secondary structure dataset, but it gives me this error:

AttributeError: 'PackedProtein' object has no attribute 'node_position'

I think it is because the secondary structure dataset doesn't provide node_position, which GearNet needs.

Is there any other way I can evaluate secondary structure on gearnet?

Thank you.

Error in Fold3D config file

Hi! Thank you for your great work!! I just have a quick question. I am trying to use GearNet with the Fold3D dataset using the configuration file you provided, but I keep getting this error (screenshot attached). If I remove mlp_batch_norm and mlp_dropout the code runs, but the model doesn't seem to train properly. I would really appreciate it if you could give me your input on that, or let me know what I am doing incorrectly.
Thanks a lot!!

multi-gpu training fails

Hello,

Running

python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0,1,2,3]

does not succeed with following log:

20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
Loading /home/chenshoufa/scratch/protein-datasets/EnzymeCommission/enzyme_commission.pkl.gz:  64%|██████████████████████████████████████████▉                        | 11854/18515 [08:49<20:55,  5.30it/s]
Killing subprocess 1350247
Killing subprocess 1350248
Killing subprocess 1350249
Killing subprocess 1350250
Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chenshoufa/anaconda3/envs/gear/bin/python', '-u', 'script/downstream.py', '--local_rank=3', '-c', 'config/downstream/EC/gearnet.yaml', '--gpus', '[0,1,2,3]']' died with <Signals.SIGKILL: 9>.

Could you help me with this issue?

RuntimeError: addmm: Argument #3 (dense), for training ESM_GearNet on EC

Hi, thank you for sharing the project. I am trying to reproduce the results of the ESM_GearNet model, and I ran into a problem when fine-tuning it on the EC downstream task (the error is the one quoted in the title).
I was able to pretrain the model on AlphaFoldDB but failed to fine-tune it or train it from scratch on EC.

Reproduce the result for ProteinBERT

Hi, thank you for sharing the work and answering questions. I recently tried to reproduce the ProteinBERT results shown in your paper. However, the performance when directly using the given config file is only about 0.079.
The loss is pretty low on training, validation, and testing, but it seems the model isn't able to classify data correctly. The downstream task is EC.
Do you have any suggestions fixing this issue?

I also tried using HuggingFace ProtBERT to rerun the experiments. The result is also around 0.078, again with low loss values. Would you be willing to give any advice on this as well?

Many thanks for your answer!

error in mc_esm_gearnet

Howdy, thank you for your awesome work in Enhancing Protein Language Models with Structure-based Encoder and Pre-training. I am running the pretraining experiment now, and I am facing the error "Can't find atom_feature in features.atom".

It seems like it cannot recognize the atom_feature: null or bond_feature: null. Do I need to change the source code for implementing these two arguments?

Any help would be appreciated!

Training details of basic GearNet on Fold prediction task

Hi, may I ask if you can provide your training details of basic GearNet (without IEConv) on the Fold prediction task?
The results of your paper are in page 8:
28.4 42.6 95.3 for test_fold, test_superfamily, test_family.

Is it here https://github.com/DeepGraphLearning/GearNet/blob/main/config/downstream/Fold3D/gearnet.yaml? I didn't find the basic GearNet model architecture in this repo.

Thanks!

About Dataset

Hello Dear Author!
I have the following questions about the datasets used in the experiments:

  1. For the Enzyme Commission dataset, I downloaded the data but only got the PDB files and the PDB indices of the training set.
    How do I get the labels? I guess the suffix of the PDB ID stands for the label? For example, does 2FOR-A stand for A?
    The same question applies to Gene Ontology (GO).
  2. Why does the AlphaFold dataset have training, validation, and test sets?

By the way, I tried using torchDrug, but had a slightly different experience than PyG.

Low performance on training from scratch on a single GPU

Hi, I am trying to reproduce the experiments, but there are large gaps between my reproduced results and those in the paper.
Reproduced:
GearNet:
EC: 0.514 (200 epochs)
GO-BP: 0.176 (146 epochs)
GO-CC: 0.145 (84 epochs)
GearNet-Edge:
EC: 0.404 (163 epochs)
GO-BP: 0.255 (100 epochs)
GO-CC: 0.163 (107 epochs)

I use the same configuration and hyperparameters as provided in the repo. Training runs on a single GPU, and some of the experiments are still training.

Many thanks

UserWarning: Unknown value

Hi,

Thanks for your wonderful work.

When running

# Run GearNet on the Enzyme Commission dataset with 1 GPU
python script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0]

I met the following log:

/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `PT`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)                                                                                                                                           
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value ` PT`
  warnings.warn("Unknown value `%s`" % x)                                                                                                                                                                  
Constructing proteins from pdbs:   1%|█▏                                                                                                                             | 172/19198 [00:29<1:01:27,  5.16it/s]
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `COB`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)                                   
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `COB`
  warnings.warn("Unknown value `%s`" % x)                                                                                                                                                                  
Constructing proteins from pdbs:   1%|█▏                                                                                                                               | 183/19198 [00:30<57:46,  5.48it/s]
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `Be`
  warnings.warn("Unknown value `%s`" % x)                                                            
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `ADP`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)                                                                                                                                           
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `ADP`
  warnings.warn("Unknown value `%s`" % x)                                                                                                                                                                  
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `BEF`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)                                   
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `BEF`
  warnings.warn("Unknown value `%s`" % x)                                                                                                                                                                  
Constructing proteins from pdbs:   1%|█▏                                                                                                                             | 186/19198 [00:31<1:01:54,  5.12it/s]
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `1NB`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)                                   
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `1NB`

Is it normal?

Thanks in advance.
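The names in these warnings (for example ADP) look like ligands or other hetero groups rather than standard amino acids, which would explain why the parser does not recognise them. If the warnings are only noise for your use case, one option is to filter them with Python's standard warnings module before the dataset is constructed. This is a sketch: the message prefixes are copied from the log above, and silence_unknown_residue_warnings is a hypothetical helper name, not part of TorchDrug.

```python
import warnings


def silence_unknown_residue_warnings():
    """Ignore UserWarnings whose message starts with the prefixes seen in the log."""
    for prefix in ("Unknown residue", "Unknown value"):
        # The message argument is a regular expression matched against the
        # start of the warning text.
        warnings.filterwarnings("ignore", message=prefix, category=UserWarning)


# Call this once, before the dataset is built.
silence_unknown_residue_warnings()
```

Whether silencing is appropriate depends on the task: the log says unknown residues are treated as glycine, so it may be worth checking how many entries are affected before hiding the messages.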

Config For Dataset Fold

Hello, I'd like to know whether I can get the configuration file for training the fold dataset?

About Fold and Reaction dataset in torchdrug

Hi, sorry for bothering you about the TorchDrug API. I want to reproduce your results on Fold Classification and Reaction. However, I find that the Fold dataset in torchdrug/datasets/fold/Fold doesn't contain protein structure data; it basically contains sequences and labels. According to my understanding, GearNet is a structure-based method and the pretraining tasks are also structure-based, so I am confused about the current situation.
Besides, I can't find the Reaction dataset in TorchDrug. Could you please tell me which dataset you used and share the training config, as you did for EC and GO?

Sorry again for adding to your trouble. Thank you for open sourcing such a great work.

Pre-training on different datasets

Hello, you discussed the results of pre-training on different datasets in the appendix.
As we can see in Table 8, the performance is comparable when pre-training on real PDB structures or on AlphaFold predictions (v1 or v2), yet the PDB set has only about 300,000 structures while the AlphaFold set has about 800,000.
Why did the authors use the larger AlphaFold set in the main text? Also, in theory a larger dataset should give better pre-training results, so why does this not hold in Table 8?

Edge_list set to [0,0,0]

Hi, thanks for your work!

In the Fold3D dataset class, why is the edge_list field set to an empty edge list of [[0,0,0]] when the input hdf5 files are loaded (on line 85 in dataset.py)? I'm trying to load in my own protein graphs into the GearNetIEConv model to get an output embedding, and if I don't set edge_list=[[0,0,0]], I run into an IndexOutOfBounds error later on when the protein graph is getting passed through the "message" function of the GeometricRelationalGraphConv layer.

confusion on epochs

Hi!
I am wondering about the number of epochs used in the experiments. The paper states that EC is trained for 200 epochs, but the config sets it to 50. Should I change the number of epochs to 200 to reproduce the experiment?

Thanks for your help!

Asking about the default config option `save_interval: 5` when pretraining on AlphaDB

Hi there, I noticed that the default config option save_interval: 5, when read by the script pretrain.py, makes the model train on one pickled part of AlphaDB (consisting of 220k proteins) for 5 epochs, then on another pickled part for 5 epochs, and so on. (This option also controls the interval at which the model is saved, although that could also be adjusted independently.)

Could you provide a bit of insight into why it is set this way? Is it necessary for some practical reason to train for enough epochs on one pickle before moving to the next one? Thank you!
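The behaviour described in this question can be written down as a small schedule. The sketch below is based only on the description above, not on the actual pretrain.py: the rotation through parts, the num_parts parameter, and the function name are all assumptions made for illustration.

```python
def pretraining_schedule(num_epoch, save_interval, num_parts):
    """List (start_epoch, epochs_in_chunk, part_index) for each training chunk.

    Assumes one dataset part is used per chunk of `save_interval` epochs and
    that parts are visited in round-robin order.
    """
    schedule = []
    for chunk, start in enumerate(range(0, num_epoch, save_interval)):
        epochs_in_chunk = min(save_interval, num_epoch - start)
        schedule.append((start, epochs_in_chunk, chunk % num_parts))
    return schedule


for start, length, part in pretraining_schedule(num_epoch=12, save_interval=5, num_parts=2):
    print(f"epochs {start}..{start + length - 1}: part {part}, then save a checkpoint")
```

Under these assumptions, tying the part switch to the save interval means each checkpoint corresponds to a whole number of passes over a single part, and a part is loaded from disk only once per chunk.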

The pre-trained GearNet-Edge model for Fold Classification

Thank you for your amazing work! I found that for the Fold Classification task, the GearNet-Edge model was implemented based on the GearNetIEConv script rather than the GearNet script, which has some detail differences (e.g., extra input embedding and ieconv layers). Based on this, I would like to ask whether you could provide the pretrained GearNet-Edge model based on multiview contrast learning and the GearNetIEConv script for Fold Classification (rather than based on GearNet script for EC task)? Thank you.

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Thank you for your great work. However, when I tried to pretrain, I encountered the following error.
python script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus null
Traceback (most recent call last):
  File "script/downstream.py", line 11, in <module>
    from torchdrug import core, models, tasks, datasets, utils
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchdrug/models/__init__.py", line 10, in <module>
    from .esm import EvolutionaryScaleModeling
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchdrug/models/esm.py", line 6, in <module>
    import esm
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/esm/__init__.py", line 8, in <module>
    from .data import Alphabet, RobertaAlphabet, BatchConverter, FastaBatchedDataset # noqa
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/esm/data.py", line 11, in <module>
    from torchvision.datasets.utils import download_url
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchvision/__init__.py", line 5, in <module>
    from torchvision import datasets
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchvision/datasets/__init__.py", line 1, in <module>
    from ._optical_flow import KittiFlow, Sintel, FlyingChairs, FlyingThings3D, HD1K
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchvision/datasets/_optical_flow.py", line 26, in <module>
    class FlowDataset(ABC, VisionDataset):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
Here is information of my environment:
torch 1.11.0
torchdrug 0.2.0
pyg 2.0.4

Asking how to obtain the new graph based on contrastive learning

Hello, I am not very experienced at reading code, so I have some trouble understanding the model.
(I am very interested in your work, so apologies for the many questions.)
According to the mc_gearnet_edge.yaml file, Multiview Contrast in the model is followed by a multi-layer perceptron. However, the output of Multiview Contrast is divided into output1 and output2, consisting of graph features and node features, while the MLP takes only one input.

  1. What is the input to the MLP?
  2. What is the model in the MultiviewContrast module?
    def __init__(self, model, crop_funcs, noise_funcs, num_mlp_layer=2, activation="relu", tau=0.07):
        super(MultiviewContrast, self).__init__()
    Is it GeometryAwareRelationalGraphNeuralNetwork?
  3. In addition, at which step do you obtain the new graph based on contrastive learning mentioned in your article? (Because the MultiviewContrast module has two outputs, I don't know which one is better.)

Looking forward to your reply very much!

How can I load pretrained weights from a checkpoint to continue pretraining?

Hello! Thanks for your great work!
For some reason, I couldn't run the whole training loop in your pretraining scripts, but I got some checkpoints such as "model_epoch_25.pth". How can I load this checkpoint and continue my pretraining?
Looking forward to your reply!

A question about the downstream tasks

Hi authors, great work. I notice that GearNet is evaluated on four downstream tasks: 1) EC number prediction, 2) GO term prediction, 3) Fold classification, and 4) Reaction classification. I am interested in the GO term prediction task. Could you release the corresponding dataset for this task? Thanks!

Pre-trained weight for ESM-GearNet

Howdy, thank you for sharing this amazing work. May I ask if you have any plans to release the pre-trained weights for the ESM-GearNet model? Many thanks!

What information do the hidden dimensions represent?

Hello! GearNet is really good work, but I have a question. I see that the hidden dimensions in the config file are set to [512,512,512,512,512,512]. Since I don't know much about how graph neural networks work, I would like to know what information each of these dimensions represents. Thank you!
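
As a rough sketch of how such a list is usually read in a stacked GNN encoder: each entry is the output feature size of one message-passing layer, so six entries mean six layers of width 512. The helpers below are hypothetical and are not the torchdrug implementation. They take `input_dim=21` and `concat_hidden=True` from the config dump shown elsewhere on this page, and they assume that concatenating hidden states makes the final representation as wide as the sum of all layer widths.

```python
def layer_shapes(input_dim, hidden_dims):
    """Return (in_dim, out_dim) for each layer of a stacked encoder."""
    dims = [input_dim] + list(hidden_dims)
    return list(zip(dims[:-1], dims[1:]))


def output_dim(hidden_dims, concat_hidden):
    """Width of the final representation under the stated assumption."""
    return sum(hidden_dims) if concat_hidden else hidden_dims[-1]


hidden_dims = [512, 512, 512, 512, 512, 512]
shapes = layer_shapes(21, hidden_dims)
print(len(shapes), shapes[0], shapes[-1])
print(output_dim(hidden_dims, True), output_dim(hidden_dims, False))
```

Under these assumptions the individual numbers do not carry separate meanings; they only set how many features each layer produces.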

Request for the pretrained model and instructions on getting own proteins' embeddings.

Thanks for the wonderful work!

I am trying to use the learned embeddings for a downstream protein classification problem on my own datasets. Since training the model requires a good HPC, I am wondering:

  1. whether you could kindly upload your pretrained model.
  2. could you explain how to generate the training and testing datasets (the pkl.gz file) from our own PDB files.
  3. based on the generated pkl.gz file in Q2, how to apply the trained model to get the final embedding vectors (512 dimensions) for our own PDB files.

Custom dataset. Data preprocessing

Thank you so much for your outstanding work!

I'm interested in your models and would like to run them on some custom datasets. Unfortunately, I haven't found any instructions on how to preprocess the raw data. Could you please tell me whether it is possible to run your models on custom datasets? And if so, where can I find your preprocessing script?

Thank you!

atom view

I was wondering how atom view is implemented? I'm getting a shape mismatch.

In mc_gearnet_edge.yaml I changed the view and entity level to 'atom' and the input dimension to 38, since I find 38 atom types in the torchdrug Protein class.
Is there another setting I need to change?

Thanks for creating GearNet!

ValueError: Unknown value `CHI_SQUAREPLANAR`. Available vocabulary is `range(0, 4)`

15:52:05   Config file: ./config/downstream/GO-BP/gearnet_yy.yaml
15:52:05   {'dataset': {'branch': 'BP',
             'class': 'GeneOntology',
             'path': '/scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/',
             'test_cutoff': 0.95,
             'transform': {'class': 'ProteinView', 'view': 'residue'}},
 'engine': {'batch_size': 2, 'gpus': [0], 'log_interval': 1000},
 'metric': 'f1_max',
 'optimizer': {'class': 'AdamW', 'lr': 0.0001, 'weight_decay': 0},
 'output_dir': '/scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein_output/downstream/GO-BP',
 'task': {'class': 'MultipleBinaryClassification',
          'criterion': 'bce',
          'graph_construction_model': {'class': 'GraphConstruction',
                                       'edge_feature': 'gearnet',
                                       'edge_layers': [{'class': 'SequentialEdge',
                                                        'max_distance': 2},
                                                       {'class': 'SpatialEdge',
                                                        'min_distance': 5,
                                                        'radius': 10.0},
                                                       {'class': 'KNNEdge',
                                                        'k': 10,
                                                        'min_distance': 5}],
                                       'node_layers': [{'class': 'AlphaCarbonNode'}]},
          'metric': ['auprc@micro', 'f1_max'],
          'model': {'batch_norm': True,
                    'class': 'GearNet',
                    'concat_hidden': True,
                    'hidden_dims': [512, 512, 512, 512, 512, 512],
                    'input_dim': 21,
                    'num_relation': 7,
                    'readout': 'sum',
                    'short_cut': True},
          'num_mlp_layer': 3},
 'train': {'num_epoch': 200}}
15:52:05   Downloading https://zenodo.org/record/6622158/files/GeneOntology.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology.zip
15:53:38   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO
15:53:41   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology/train.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology
15:56:21   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology/valid.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology
15:56:37   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology/test.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology

Constructing proteins from pdbs:   0%|          | 0/36635 [00:00<?, ?it/s]/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `HOH`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `HOH`
  warnings.warn("Unknown value `%s`" % x)
[15:56:55] Explicit valence for atom # 6 O, 3, is greater than permitted
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `BIS`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `BIS`
  warnings.warn("Unknown value `%s`" % x)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `EPE`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `EPE`
  warnings.warn("Unknown value `%s`" % x)

Constructing proteins from pdbs:   0%|          | 3/36635 [00:00<54:08, 11.28it/s]/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `SO4`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `SO4`
  warnings.warn("Unknown value `%s`" % x)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `PO4`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `PO4`
  warnings.warn("Unknown value `%s`" % x)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `BME`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `BME`
  warnings.warn("Unknown value `%s`" % x)

Constructing proteins from pdbs:   0%|          | 5/36635 [00:00<1:06:38,  9.16it/s]/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `Fe`
  warnings.warn("Unknown value `%s`" % x)

Constructing proteins from pdbs:   0%|          | 5/36635 [00:00<1:10:20,  8.68it/s]
Traceback (most recent call last):
  File "/scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/script/downstream.py", line 56, in <module>
    dataset = core.Configurable.load_config_dict(cfg.dataset)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/core/core.py", line 269, in load_config_dict
    return cls(**new_config)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/core/core.py", line 288, in wrapper
    return init(self, *args, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/datasets/gene_ontology.py", line 72, in __init__
    self.load_pdbs(pdb_files, verbose=verbose, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/dataset.py", line 750, in load_pdbs
    protein = data.Protein.from_molecule(mol, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/utils/decorator.py", line 192, in wrapper
    return obj(*args, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py", line 185, in from_molecule
    protein = Molecule.from_molecule(mol, atom_feature=atom_feature, bond_feature=bond_feature,
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/utils/decorator.py", line 192, in wrapper
    return obj(*args, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/molecule.py", line 189, in from_molecule
    feature += func(atom)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py", line 77, in atom_default
    onehot(atom.GetChiralTag(), chiral_tag_vocab) + \
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py", line 47, in onehot
    raise ValueError("Unknown value `%s`. Available vocabulary is `%s`" % (x, vocab))
ValueError: Unknown value `CHI_SQUAREPLANAR`. Available vocabulary is `range(0, 4)`

Dear developers,

Thanks for your great work. When I try a quick fine-tuning run via python script/downstream.py -c ./config/downstream/EC/gearnet.yaml --gpus [0], the above error is raised before model training starts (for both EC and GO-BP). I would appreciate your help in resolving it.
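
The last frames of the traceback show a one-hot encoder rejecting a value that is not in its vocabulary. The sketch below is a simplified stand-in for that kind of function, not the torchdrug source; the `allow_unknown` parameter and the use of a plain string for the chirality tag are assumptions made for illustration. It shows why a tag outside the expected set raises, and one way to tolerate it by mapping unknown values to an extra slot.

```python
def onehot(x, vocab, allow_unknown=False):
    """One-hot encode x against vocab, optionally reserving a last slot for unknowns."""
    vocab = list(vocab)
    if x in vocab:
        index = vocab.index(x)
    elif allow_unknown:
        index = len(vocab)
    else:
        raise ValueError("Unknown value `%s`. Available vocabulary is `%s`" % (x, vocab))
    size = len(vocab) + (1 if allow_unknown else 0)
    return [1 if i == index else 0 for i in range(size)]


chiral_vocab = range(4)  # mirrors `range(0, 4)` in the error message
known = onehot(1, chiral_vocab)

try:
    onehot("CHI_SQUAREPLANAR", chiral_vocab)
    error_text = None
except ValueError as err:
    error_text = str(err)

tolerant = onehot("CHI_SQUAREPLANAR", chiral_vocab, allow_unknown=True)
print(known, tolerant)
print(error_text)
```

A plausible cause, which the thread does not confirm, is that the installed RDKit reports chirality tags that the installed torchdrug vocabulary does not cover, in which case aligning the two package versions would be the first thing to try.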

Request for guidance on preprocessing PDB files for model input

Hello,
I have come across your fascinating GitHub repository on protein structure pre-training, and I am excited to explore its potential for my own research. I noticed that the provided data is in HDF5 format, and there is no preprocessing code available for PDB files. I would like to use my own PDB files for inference with your model, but I am unsure how to preprocess them to match the expected input format.

Would you be able to provide some guidance or share a sample preprocessing script for converting PDB files to the required HDF5 format? This would greatly help me and other researchers who are interested in utilizing your work for various applications.

Thank you for your time and for sharing your valuable work with the community. I am looking forward to your response and any assistance you can provide.

Error: General Union types are not currently supported. Only Union[T, NoneType] (i.e. Optional[T]) is supported.: File "/home/lvqy/anaconda3/envs/ZernikeMetric/lib/python3.8/site-packages/torch_cluster/rw.py", line 18

Hi, when I execute the command:

python script/pretrain.py -c config/pretrain/mc_gearnet_edge.yaml --gpus [0]

An error occurs. How can I solve this problem?

RuntimeError:
General Union types are not currently supported. Only Union[T, NoneType] (i.e. Optional[T]) is supported.:
File "/home/lvqy/anaconda3/envs/ZernikeMetric/lib/python3.8/site-packages/torch_cluster/rw.py", line 18
num_nodes: Optional[int] = None,
return_edge_indices: bool = False,
) -> Union[Tensor, Tuple[Tensor, Tensor]]:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
"""Samples random walks of length :obj:walk_length from all node indices
in :obj:start in the graph given by :obj:(row, col) as described in the

input data

Dear authors, thanks for your great works.
I have some questions about the data. I tried to visualize the input protein sequence by calling to_sequence() on each batch. Here is the figure showing the sequence.
[image]
I wonder why there are so many .Gs, since . is a separator for multiple sequences (DeepGraphLearning/torchdrug#151). Also, after deleting all the ., the length of the remaining sequence is the same as the number of nodes in the graph. Could you help explain why there are so many .Gs? Many thanks!

Non-deterministic embeddings

Hi!

I was wondering if there is any reason that the GearNetIEConv encoder would return variable embeddings for the same input file. I encountered this using my own data, but when I set a torch manual_seed, the embeddings became constant for the same input. And is this expected to have any effect on model performance?

Thanks for your help!
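
One common reason an encoder gives different outputs for the same input until a seed is fixed is that some of its parameters (or a stochastic layer) are drawn at random on every run, for instance layers that a loaded checkpoint does not cover. Whether that applies to GearNetIEConv here is an assumption; the thread does not say. A plain-Python toy, with a hypothetical `toy_embedding` function standing in for the encoder, shows the effect:

```python
import random


def toy_embedding(features, dim=4, seed=None):
    """Project features with freshly drawn random weights (hypothetical toy encoder)."""
    rng = random.Random(seed)  # seed=None draws from system entropy
    weights = [[rng.gauss(0.0, 1.0) for _ in features] for _ in range(dim)]
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


features = [0.5, -1.0, 2.0]
unseeded_a = toy_embedding(features)
unseeded_b = toy_embedding(features)
seeded_a = toy_embedding(features, seed=0)
seeded_b = toy_embedding(features, seed=0)
print(unseeded_a == unseeded_b, seeded_a == seeded_b)
```

If this is the mechanism, a fixed seed only makes the random part repeatable; it does not make those randomly drawn weights meaningful, so it is worth checking that every layer of the encoder was actually loaded from the checkpoint.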

An error occurring when using the StepLR scheduler on the FOLD3D dataset

Hi, thank you for your amazing work!
I tried to reproduce the GearNet results on the Fold3D dataset, following the original .yaml file, in which the StepLR scheduler is specified. However, the following error occurred when using the scheduler. What causes this? Thank you!

15:28:38 #train: 12312, #valid: 736, #test: 718
Traceback (most recent call last):
  File "script/downstream.py", line 74, in <module>
    solver, scheduler = util.build_downstream_solver(cfg, dataset)
  File "/GearNet-new/util.py", line 121, in build_downstream_solver
    scheduler = core.Configurable.load_config_dict(cfg.scheduler)
  File "/torchdrug/lib/python3.8/site-packages/torchdrug/core/core.py", line 269, in load_config_dict
    return cls(**new_config)
  File "/torchdrug/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/torchdrug/lib/python3.8/site-packages/torchdrug/core/core.py", line 288, in wrapper
    return init(self, *args, **kwargs)
  File "/torchdrug/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 367, in __init__
    super(StepLR, self).__init__(optimizer, last_epoch, verbose)
  File "/torchdrug/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 367, in __init__
    super(StepLR, self).__init__(optimizer, last_epoch, verbose)
  File "/torchdrug/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 367, in __init__
    super(StepLR, self).__init__(optimizer, last_epoch, verbose)
  [Previous line repeated 991 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object
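
The same `super(StepLR, self).__init__(...)` frame repeating is the pattern Python produces when the name used inside `super(...)` no longer refers to the class that defines the method. The sketch below reproduces that pattern in plain Python. It is an illustration under assumptions, not a confirmed diagnosis of this setup: `Scheduler`, `make_wrapped`, and the rebinding step are hypothetical stand-ins.

```python
class Scheduler:
    def __init__(self, step_size):
        self.step_size = step_size


class StepLR(Scheduler):
    def __init__(self, step_size):
        # The name `StepLR` is looked up when the method runs, not when it is defined.
        super(StepLR, self).__init__(step_size)


def make_wrapped(cls):
    """Hypothetical wrapper that builds a subclass carrying the same name."""
    return type(cls.__name__, (cls,), {})


before = StepLR(10).step_size  # works while the name still means the original class

Original = StepLR
# Rebinding the name to the subclass makes `super(StepLR, self)` inside
# Original.__init__ resolve back to Original, so __init__ calls itself.
StepLR = make_wrapped(Original)

try:
    StepLR(10)
    outcome = "ok"
except RecursionError as err:
    outcome = type(err).__name__

print(before, outcome)
```

If something similar is happening here, binding the wrapped class to a different name, or using the zero-argument `super()` form in the base class, avoids the loop.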

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary....

When I run
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/GO-BP/gearnet_edge.yaml --gpus [0,1,2,3] --ckpt
on worker*1 Tesla-V100-SXM2-32GB:4 GPU, 47 CPU, I got the error:

[219013] [E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[219014] [E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[219015] [E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
[219016] Traceback (most recent call last):
[219017] File "/hubozhen/GearNet/script/downstream.py", line 75, in <module>
[219018] train_and_validate(cfg, solver, scheduler)
[219019] File "/hubozhen/GearNet/script/downstream.py", line 30, in train_and_validate
[219020] solver.train(**kwargs)
[219021] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/core/engine.py", line 155, in train
[219022] loss, metric = model(batch)
[219023] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219024] return forward_call(*input, **kwargs)
[219025] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
[219026] output = self.module(*inputs[0], **kwargs[0])
[219027] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219028] return forward_call(*input, **kwargs)
[219029] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 279, in forward
[219030] pred = self.predict(batch, all_loss, metric)
[219031] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 300, in predict
[219032] output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
[219033] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219034] return forward_call(*input, **kwargs)
[219035] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/models/gearnet.py", line 99, in forward
[219036] edge_hidden = self.edge_layers[i](line_graph, edge_input)
[219037] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219038] return forward_call(*input, **kwargs)
[219039] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 92, in forward
[219040] output = self.combine(input, update)
[219041] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 438, in combine
[219042] output = self.batch_norm(output)
[219043] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219044] return forward_call(*input, **kwargs)
[219045] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
[219046] world_size,
[219047] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 42, in forward
[219048] dist._all_gather_base(combined_flat, combined, process_group, async_op=False)
[219049] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2070, in _all_gather_base
[219050] work = group._allgather_base(output_tensor, input_tensor)
[219051] RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[219052] /opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
[219053] index1 = local_index // local_inner_size + offset1
[219054] /opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
[219055] index1 = local_index // local_inner_size + offset1
[219056] [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[219057] terminate called after throwing an instance of 'std::runtime_error'
[219058] what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[219059] [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[219060] terminate called after throwing an instance of 'std::runtime_error'
[219061] what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[219062] /opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/data/graph.py:1667: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
[219063] edge_in_index = local_index // local_inner_size + edge_in_offset
[219064] [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[219065] terminate called after throwing an instance of 'std::runtime_error'
[219066] what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
[219067] WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 21 closing signal SIGTERM
[219068] ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary: /opt/anaconda3/envs/manifold/bin/python
[219069] Traceback (most recent call last):
[219070] File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[219071] "__main__", mod_spec)
[219072] File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 85, in _run_code
[219073] exec(code, run_globals)
[219074] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
[219075] main()
[219076] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
[219077] launch(args)
[219078] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
[219079] run(args)
[219080] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
[219081] )(*cmd_args)
[219082] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
[219083] return launch_agent(self._config, self._entrypoint, list(args))
[219084] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
[219085] failures=result.failures,
[219086] torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
[219087] ===================================================
[219088] /hubozhen/GearNet/script/downstream.py FAILED
[219089] ---------------------------------------------------
[219090] Failures:
[219091] [1]:
[219092] time : 2022-12-12_09:41:02
[219093] host : pytorch-7c3c96f1-d9hcm
[219094] rank : 2 (local_rank: 2)
[219095] exitcode : -6 (pid: 22)
[219096] error_file: <N/A>
[219097] traceback : Signal 6 (SIGABRT) received by PID 22
[219098] [2]:
[219099] time : 2022-12-12_09:41:02
[219100] host : pytorch-7c3c96f1-d9hcm
[219101] rank : 3 (local_rank: 3)
[219102] exitcode : -6 (pid: 23)
[219103] error_file: <N/A>
[219104] traceback : Signal 6 (SIGABRT) received by PID 23
[219105] ---------------------------------------------------
[219106] Root Cause (first observed failure):
[219107] [0]:
[219108] time : 2022-12-12_09:41:02
[219109] host : pytorch-7c3c96f1-d9hcm
[219110] rank : 0 (local_rank: 0)
[219111] exitcode : -6 (pid: 20)
[219112] error_file: <N/A>
[219113] traceback : Signal 6 (SIGABRT) received by PID 20
[219114] ===================================================

Someone said this happens when loading large data. I find that the utilization of these four GPUs is 100%.
However, when I ran the same procedure on another V100 machine (worker*1:
Tesla-V100-SXM-32GB: 4 GPU, 48 CPU), it worked fine.
This confuses me.
