microsoft / graphormer

Graphormer is a general-purpose deep learning backbone for molecular modeling.

License: MIT License

Python 94.08% Cython 1.48% Shell 4.44%
graph transformer deep-learning ai4science molecule-simulation

graphormer's Introduction

Graphormer is a deep learning package that allows researchers and developers to train custom models for molecule modeling tasks. It aims to accelerate the research and application in AI for molecule science, such as material discovery, drug discovery, etc. Project website.

Advanced pre-trained versions of Graphormer are available exclusively on Azure Quantum Elements.

Hiring

Hiring is temporarily frozen and will reopen soon. Please stay tuned.

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on AI for Molecule Science, please send your resume to [email protected].

Highlights in Graphormer v2.0

  • The model, code, and script used in the Open Catalyst Challenge are available.
  • Pre-trained models on PCQM4M and PCQM4Mv2 are available; more pre-trained models are coming soon.
  • Supports interface and datasets of PyG, DGL, OGB, and OCP.
  • Supports fairseq backbone.
  • Documentation is online!

What's New:

03/10/2022

  1. We uploaded a technical report that describes improved benchmarks on PCQM4M and the Open Catalyst Project.

12/22/2021

  1. Graphormer v2.0 is released. Enjoy!

12/10/2021

  1. Graphormer has won the Open Catalyst Challenge. The technical talk can be found through this link.
  2. The NeurIPS 2021 slides can be found through this link.
  3. The new release of Graphormer is coming soon as a general molecule modeling toolkit, with the models used on the OC dataset, a complete pre-trained model zoo, a flexible data interface, and higher training efficiency.

09/30/2021

  1. Graphormer has been accepted by NeurIPS 2021.
  2. We're hiring! Please contact shuz[at]microsoft.com for more information.

08/03/2021

  1. Code and scripts are released.

06/16/2021

  1. Graphormer has won 1st place in the quantum prediction track of the Open Graph Benchmark Large-Scale Challenge (KDD CUP 2021) [Competition Description] [Competition Result] [Technical Report] [Blog (English)] [Blog (Chinese)]

Get Started

Our primary documentation is at https://graphormer.readthedocs.io/ and is generated from this repository. It contains instructions for getting started, training new models, and extending Graphormer with new model types and tasks.

Next you may want to read:

  • Examples showing command line usage of common tasks.

Requirements and Installation

Setup with Conda

bash install.sh

Citation

Please kindly cite this paper if you use the code:

@article{shi2022benchmarking,
  title={Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets},
  author={Yu Shi and Shuxin Zheng and Guolin Ke and Yifei Shen and Jiacheng You and Jiyan He and Shengjie Luo and Chang Liu and Di He and Tie-Yan Liu},
  journal={arXiv preprint arXiv:2203.04810},
  year={2022},
  url={https://arxiv.org/abs/2203.04810}
}

@inproceedings{
ying2021do,
title={Do Transformers Really Perform Badly for Graph Representation?},
author={Chengxuan Ying and Tianle Cai and Shengjie Luo and Shuxin Zheng and Guolin Ke and Di He and Yanming Shen and Tie-Yan Liu},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021},
url={https://openreview.net/forum?id=OeWooOxFwDa}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

graphormer's People

Contributors

chrisxu2016, dreaming-panda, guolinke, jndean, lithiumda, mavisguan, microsoftopensource, shiyu1994, volltin, youjiacheng, zhengsx

graphormer's Issues

Report errors with validation dataloader

Hi, I am trying to reproduce the results on ogb-lsc. It reports errors at about 90% of the iterations in the first epoch. It seems there is something wrong with the validation dataloader. Could you give me some suggestions to fix it? Thanks!

bugs in model.py line 144.

    in_degree, out_degree = batched_data.in_degree, batched_data.in_degree

should be:

    in_degree, out_degree = batched_data.in_degree, batched_data.out_degree
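
Why the two attributes should differ can be seen on a small directed graph. The sketch below counts in- and out-degrees from an edge list; the helper name `degrees` and the example edges are illustrative and not taken from the repository.

```python
def degrees(num_nodes, edges):
    """Return (in_degree, out_degree) lists for a directed edge list of (src, dst) pairs."""
    in_degree = [0] * num_nodes
    out_degree = [0] * num_nodes
    for src, dst in edges:
        out_degree[src] += 1  # edge leaves src
        in_degree[dst] += 1   # edge enters dst
    return in_degree, out_degree


# Directed path 0 -> 1 -> 2 plus an extra edge 0 -> 2.
in_deg, out_deg = degrees(3, [(0, 1), (1, 2), (0, 2)])
print(in_deg)   # [0, 1, 2]
print(out_deg)  # [2, 1, 0]
```

When every edge is stored in both directions, the two lists coincide, so reading `in_degree` twice only changes the result on directed inputs.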

Encoding problem

Which part of the code implements the encoding method described in the paper?

Unable to import PygPCQM4MDataset

Hi, I am having trouble importing PygPCQM4MDataset. I ran the line from ogb.lsc.pcqm4m_pyg import PygPCQM4MDataset and it threw this error: ImportError: cannot import name 'smiles2graph' from 'ogb.utils' (/usr/local/lib/python3.7/dist-packages/ogb/utils/__init__.py). I am trying to run the code on Colab and have installed its requirements, but this error still pops up.

About reproducing PCBA result

Hi authors, thanks for your great work. While trying to reproduce the No. 1 result on the ogb-pcba leaderboard, I could not find the checkpoints mentioned in your paper that were pre-trained for the PCBA task. I therefore turned to the PCQM checkpoint you provided for the PCQM task. But loading that checkpoint fails, even after I changed the hidden dimension and FFN dimension from 1024 to 768:

    RuntimeError: Error(s) in loading state_dict for Graphormer:
        size mismatch for atom_encoder.weight: copying a param with shape torch.Size([4737, 768]) from checkpoint, the shape in current model is torch.Size([4609, 768]).
        size mismatch for edge_encoder.weight: copying a param with shape torch.Size([769, 32]) from checkpoint, the shape in current model is torch.Size([1537, 32]).

Thus, may I ask two questions about the reproducing process:

  1. Can you provide the checkpoints that can be used to reproduce the PCBA result?
  2. Is there a reason why the code cannot load the previous PCQM checkpoints even though having changed the ffn and hidden dimension?

Looking forward to your reply.
Thank you!
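
In the error above the second dimension of each tensor already matches (768 and 32); only the number of embedding rows differs, so changing the hidden or FFN width alone would not be expected to resolve it. A minimal sketch for listing such mismatches before loading is shown below. It works on plain name-to-shape mappings; the helper name `find_shape_mismatches` and the entry `some.other.weight` are made up for illustration.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for parameters whose shapes differ.

    Both arguments map parameter names to shape tuples. With PyTorch they could be
    built as {k: tuple(v.shape) for k, v in state_dict.items()}.
    """
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes copied from the error message; "some.other.weight" is a hypothetical matching entry.
checkpoint_shapes = {
    "atom_encoder.weight": (4737, 768),
    "edge_encoder.weight": (769, 32),
    "some.other.weight": (768, 768),
}
model_shapes = {
    "atom_encoder.weight": (4609, 768),
    "edge_encoder.weight": (1537, 32),
    "some.other.weight": (768, 768),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, ckpt, model in mismatches:
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```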

train from scratch on molecule datasets

Hello, I am trying to use Graphormer on other commonly used datasets from MoleculeNet (https://moleculenet.org/datasets-1) to check the performance, such as BACE, BBBP, etc. I have used the default hparams in the script of molhiv, but the results are horrible...

  1. Have you tried your model on these datasets without a pre-trained model? Do you have any suggestions on the hparams for these datasets if we want to train from scratch? I am trying to find out why the results are so bad...
  2. For molhiv without a pre-trained model, I used the script provided in the examples folder, omitted the "checkpoint_path" argument, and trained for 100 epochs. But the best val score is only around 0.763 and the corresponding test score is only 0.636... I don't know what went wrong. Have you tried using Graphormer directly on molhiv without a pre-trained model? How is the performance?
    Thank you.

a question about code

    def convert_to_single_emb(x, offset=512):
        feature_num = x.size(1) if len(x.size()) > 1 else 1
        feature_offset = 1 + \
            torch.arange(0, feature_num * offset, offset, dtype=torch.long)
        x = x + feature_offset
        return x

Could you tell me why this function was added? I am very confused about it.
thanks!
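
One reading of this function, based only on the code quoted above: column i of x is shifted by 1 + i * offset, so each feature column lands in its own block of indices and index 0 stays unused. The sketch below restates that arithmetic in pure Python for a 2-D list of integers; the name `convert_to_single_emb_rows` and the sample rows are illustrative, and the stated purpose (sharing one embedding table across features) is an interpretation rather than something the issue confirms.

```python
def convert_to_single_emb_rows(rows, offset=512):
    """Pure-Python restatement of the quoted function for a 2-D list of ints."""
    feature_num = len(rows[0])
    # Same values as 1 + torch.arange(0, feature_num * offset, offset)
    feature_offset = [1 + i * offset for i in range(feature_num)]
    return [[value + shift for value, shift in zip(row, feature_offset)]
            for row in rows]


rows = [[5, 0, 2],
        [5, 3, 2]]
shifted = convert_to_single_emb_rows(rows)
print(shifted)  # [[6, 513, 1027], [6, 516, 1027]]
```

Note that the raw value 2 in the third column and the raw value 5 in the first column can no longer collide after the shift, as long as every raw value is below the offset.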

Changing entry.py for MisconfigurationException error

Hi! This is Stella from Seoul National University, I'm getting a lot of help from your code.
I have a question about entry.py, line 87.
Originally it reads metric = 'valid_' + get_dataset(dm.dataset_name)['metric'],
but when I run the model I get this error:
'pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='valid_mae') not found in the returned metrics: ['train_loss']. HINT: Did you call self.log('valid_mae', value) in the LightningModule?'

So I changed line 87 to metric = 'train_loss',
and it runs well.

I'm worried that I'm doing something wrong. Is this the right way to modify the code?
Here is some useful information about my project:

  1. task: regression
  2. input type : integer (originally continuous value, but discretized)
  3. target type : real value
  4. eval metric : rmse
  5. features from data.py:
       • 'num_class': 1,
       • 'loss_fn': F.l1_loss,
       • 'metric': 'mae',
       • 'metric_mode': 'min',

Training Graphormer-Small on dataset PCQM4M and Understanding of Edge Encoder

Hello,
I have read your code and have some questions.

  1. How do I train Graphormer-Small on the PCQM4M dataset? Could you provide a sample, e.g. a shell script?
  2. How should I understand the dimension nn.Embedding(512 * 3 + 1, num_heads, padding_idx=0) of the edge encoder? The shape of edge_input at the end of the function collator in collator.py (before making a Batch format) is [n_graph, n_node, n_node, multi_hop_max_dist, n_edge_features], where n_edge_features=3. So where do the multiplication by 512 and the addition of 1 come from?
    Thanks a lot!
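
The size 512 * 3 + 1 is consistent with the convert_to_single_emb function quoted in other issues on this page, assuming the edge features pass through that shift (offset 512 per feature, plus a leading 1 that leaves index 0 free for padding_idx=0). The arithmetic can be sketched as follows; `embedding_rows` and `index_range` are hypothetical helper names, and the assumption that raw feature values lie in [0, 512) is mine.

```python
def embedding_rows(num_features, offset=512, reserved=1):
    """Rows needed when each feature owns `offset` indices and index 0 is reserved."""
    return num_features * offset + reserved


def index_range(feature_idx, offset=512):
    """Inclusive index range used by one feature after the shift by 1 + i * offset."""
    start = 1 + feature_idx * offset
    return start, start + offset - 1


print(embedding_rows(3))  # 1537 = 512 * 3 + 1
for i in range(3):
    print(i, index_range(i))
# 0 (1, 512)
# 1 (513, 1024)
# 2 (1025, 1536)
```

The same formula with nine features gives 4609, the value that appears as 512*9+1 and as the model-side atom_encoder shape elsewhere on this page.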

Feature offset

Hi. What does the function convert_to_single_emb do? And why should an offset be added to the original features? Thank you.

Unable to reproduce results

I'm trying to reproduce the reported results on OGB and ZINC datasets, but I failed to achieve the performance.

I first ran the provided script hiv.sh directly to train a Graphormer on the MolHiv dataset without pre-training. The final AUC is 73.10%. Then I followed the instructions and hyper-parameter settings in the paper to do pre-training. I pre-trained on PCQM4M for 20 epochs (until the loss converged) and fine-tuned the model on MolHiv for 8 epochs (as specified in the script). The best result turned out to be 76.25%.

Despite some improvement, the final AUC is not as high as reported in the paper. I also tried to reproduce the result on ZINC via the example script, but the best MAE is 0.1576, which is worse (higher) than the 0.122 reported in the paper.

I'm wondering what I am missing that leads to this poor performance. Could you share more reproduction details? My Python environment is listed below:

pytorch==1.9.0
pytorch-geometric==1.7.2
pytorch-scatter==2.0.8
pytorch-sparse==0.6.11
pytorch-lightning==1.3.0
ogb==1.3.1
cudatoolkit==11.1

I'd really appreciate it if someone could share their reproduced results and give me some suggestions.

Cannot Reproduce Result of ZINC

Hi,
I trained some models on PCQM4M-LSC, ogbg-molhiv, and ZINC following the settings in the paper, and the results on PCQM4M-LSC and ogbg-molhiv match the paper. I also ran the experiment on ZINC several times, but the MAE is always above 0.14 (with or without adding --intput_dropout_rate 0), whereas it should be about 0.12 according to the paper. Here is my command:

python3 entry.py --dataset_name ZINC --hidden_dim 80 --ffn_dim 80 --num_heads 8 --tot_updates 400000 --batch_size 256 --warmup_updates 40000 --precision 16 --intput_dropout_rate 0 --gradient_clip_val 5 --num_workers 8 --gpus 1 --accelerator ddp --max_epochs 10000

Edge Understanding of Graphormer

It looks like the only use of edge encodings right now is to change the bias of the attention. Consider a toy example with just two identical nodes and a single edge between them, where the edge label (edge feature) is either 0 or 1, and a binary classification task where the desired prediction is the label of this edge. How would the network solve this task? Just using the edge encoding as an attention bias should not be enough here, right? I am asking because I successfully applied Graphormer to a similar task, but now I am not exactly sure how it works.
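
The concern in this toy setting can be made concrete with a deliberately simplified single-head attention step. The sketch below is not the Graphormer implementation: it ignores any virtual node, degree encodings, and multiple layers, and the bias values standing in for the two edge labels are invented. It only shows that a bias on the attention scores cannot change a weighted average of identical value vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, bias, values):
    """One query's output: softmax(scores + bias) applied to the value vectors."""
    weights = softmax([s + b for s, b in zip(scores, bias)])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


values = [[0.3, -1.2], [0.3, -1.2]]  # two identical nodes give identical values
scores = [0.5, 0.5]                  # and identical query-key scores

# Hypothetical attention biases produced by edge label 0 and edge label 1.
out_label0 = attend(scores, bias=[0.0, -2.0], values=values)
out_label1 = attend(scores, bias=[0.0, 3.0], values=values)

same = all(math.isclose(a, b) for a, b in zip(out_label0, out_label1))
print(same)  # True: the bias changes the weights but not the averaged output
```

Under these assumptions the edge label is invisible in the output, so whatever lets the full model solve such a task would have to come from components this sketch leaves out.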

Performance of Graphormer on traditional GNN benchmarks

Hi! Thanks for your great work! I wonder how Graphormer performs on some traditional GNN graph classification benchmarks (such as the ones used in the original GIN paper). I've tried to apply Graphormer to my task, but the result is not ideal without pre-training. Are pre-training and a large dataset necessary for Graphormer's strong performance?

Evidential deep learning and other feature requests

Dear Graphormer authors,

thanks for this great piece of software!
I have some feature requests.

Can you please add the functionality for evidential deep learning?
See article:
ACS Cent. Sci. 2021, 7, 8, 1356–1367

Please add the 10 smaller datasets from MoleculeNet to the benchmarks. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression.
See https://ogb.stanford.edu/docs/graphprop/

Please add functionality for molecular representation pre-training via attribute masking
See Strategies for Pre-training Graph Neural Networks

Please add metrics described in the Regression Metrics Guide

As the manual selection of parameters for a graph neural network is difficult, please add support
for some of the automated machine learning techniques.
See for example techniques described in AutoGL

Many thanks.

question about code

Hello:
I have a small question about running your code.

image

Why is the else branch executed rather than the if branch?

Reproduce Validate MAE

Hi,

Thanks for your interesting work. I have a problem regarding the evaluation. I downloaded your checkpoints from here, then ran the following commands as described in the README (for the all_fold_seed0 checkpoint):

conda activate graphormer-lsc
export arch="--ffn_dim 768 --hidden_dim 768 --attention_dropout_rate 0.1 --dropout_rate 0.1 --n_layers 12 --peak_lr 2e-4 --edge_type multi_hop --multi_hop_max_dist 20 --weight_decay 0.0 --intput_dropout_rate 0.0"
export ckpt_path="checkpoints"
export ckpt_name="all_fold_seed0.ckpt"
bash inference.sh

The output log is:

Global seed set to 1
 > PCQM4M-LSC loaded!
{'num_class': 1, 'loss_fn': <function l1_loss at 0x7fc2381b3950>, 'metric': 'mae', 'metric_mode': 'min', 'evaluator': <ogb.lsc.pcqm4m.PCQM4MEvaluator object at 0x7fc1995d2110>, 'dataset': MyPygPCQM4MDataset2(3803453), 'max_node': 128}
 > dataset info ends
total params: 47167841
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Global seed set to 1
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
len(val_dataloader) 1487
Validating: 100%|██████████| 1487/1487 [03:04<00:00,  7.42it/s]
0.027769196778535843
--------------------------------------------------------------------------------
DATALOADER:0 VALIDATE RESULTS
{'valid_mae': 0.027769196778535843}
--------------------------------------------------------------------------------
[{'valid_mae': 0.027769196778535843}]

I assumed that I would get results close to the "validate MAE" column of Table 1, but they are different. Am I missing something?

Thanks for your help.

Feature Requests & Voting Hub

This issue keeps all feature requests on one page.

Note to contributors: If you want to work on a requested feature, re-open the linked issue. Everyone is welcome to work on any of the issues below.

Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and create new entries on this page with links to those issues. The one exception is issues marked good first issue. These should be left open so they are discoverable by new contributors.


Call for voting

We would like to call for votes here to prioritize these requests.
If a feature request is important to you, you can vote for it as follows:

  1. Get the issue (feature request) number.
  2. Search for that number in this issue to check whether a vote for it already exists.
  3. If the vote exists, add 👍 to it.
  4. If the vote doesn't exist, create a new one by replying to this thread and including the number in your reply.


Efficiency related

  • Highly efficient shortest path implementation (#69)
  • Compact Pre-trained Graphormer-base on PCQM4M (#80)

New features:

  • Molecule Feature Extraction by RDKit (#71)
  • Package Graphormer to PyPI (#72)
  • Windows support (#144)

New algorithms:

  • Examples of node classification and link prediction (#75)

Objective and metric functions:

New pre-trained models:

  • Pre-trained Graphormer v2.0 for ogbg-molpcba (#70)
  • Pre-trained Graphormer v2.0 on OC20 (#73)
  • Pre-trained Graphormer-large v2.0 on PCQM4M (#74)

Input enhancements:

  • Sparse Graph Representation (#82)

Bug fixes:

Node classification

What do I have to modify in the code to try the model on ogbn-proteins? (Wrapper, collator, ...)

PreTrained Models

Hello!


Any chance of uploading pre-trained models for the different experiments?
Thanks a lot!

Using the code for the new dataset that we made.

I am trying to use a Graphormer on brain intelligence regression problems. So, we are using brain connectivity as an input graph, and we are trying to solve the graph regression problem.

And we set the whole edge and node features as an integer.

We are facing these errors below.
expected tensor for argument #1 'indices' to have scalar type long but got torch.FloatTensor instead

So I tried changing the code in model.py (originally line 162) in various ways.
image

However, after I changed X.type to torch.cuda.LongTensor, I still get another error, this time related to CUDA.

image
image

Can you help me with how to solve the problem?

Thank You
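
The first error says that an embedding lookup received floating-point values where integer (long) indices are required. The second error is only visible in the screenshots, so its cause is not known here; one common cause after casting is an index outside the embedding table, and the sketch below checks for both conditions on plain Python values before any tensor is built. The helper name `check_embedding_indices` and the table size 4609 are illustrative assumptions.

```python
def check_embedding_indices(values, num_embeddings):
    """Return a list of problems that would make an embedding lookup fail."""
    problems = []
    for pos, value in enumerate(values):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"position {pos}: {value!r} is not an integer index")
        elif not 0 <= value < num_embeddings:
            problems.append(f"position {pos}: {value} is outside [0, {num_embeddings})")
    return problems


# Hypothetical feature values: one float, one too large, one negative.
problems = check_embedding_indices([1, 2.0, 5000, -1], num_embeddings=4609)
for line in problems:
    print(line)
```

On tensors, the cast itself can be written as x.long(); running the model once on CPU usually gives a more readable message than the CUDA error if an index is out of range.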

problem when using algos.pyx

Hello,
I got a problem while using https://github.com/microsoft/Graphormer/blob/main/graphormer/algos.pyx
Here is the error :

Traceback (most recent call last):

  File "C:\Users\James\AppData\Local\Temp/ipykernel_1060/863204263.py", line 1, in <module>
    shortest_path_result, path = algos.floyd_warshall(adj.numpy())

  File "algos.pyx", line 19, in algos.floyd_warshall
    cdef numpy.ndarray[long, ndim=2, mode='c'] path = numpy.zeros([n, n], dtype=numpy.int64)

ValueError: Buffer dtype mismatch, expected 'long' but got 'long long'

I'm using Windows-64bit and Python 3.8.2. Numpy version is 1.21.2
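
The message is consistent with a platform difference in C integer sizes: on 64-bit Windows a C long is 4 bytes, while numpy.int64 corresponds to long long there. A common remedy, not confirmed in this thread, is to declare the Cython buffers with a fixed-width type such as numpy.int64_t. The sketch below first prints the sizes on the current machine and then gives a pure-Python fallback for all-pairs shortest paths on an unweighted graph. The fallback is an assumption about what the call is used for; its sentinel for unreachable pairs and its `via` matrix may differ from what algos.pyx returns.

```python
import ctypes

# `long` is 4 bytes on 64-bit Windows and 8 bytes on most 64-bit Linux/macOS builds;
# `long long` is 8 bytes on all mainstream platforms.
print("sizeof(long)      =", ctypes.sizeof(ctypes.c_long))
print("sizeof(long long) =", ctypes.sizeof(ctypes.c_longlong))

UNREACHABLE = float("inf")  # illustrative sentinel, not necessarily the one in algos.pyx


def floyd_warshall(adj):
    """All-pairs shortest path lengths for a 0/1 adjacency matrix (list of lists).

    Returns (dist, via): dist[i][j] is the hop count from i to j, and via[i][j]
    is an intermediate node on a shortest path, or -1 if the path is direct.
    """
    n = len(adj)
    dist = [[0 if i == j else (1 if adj[i][j] else UNREACHABLE) for j in range(n)]
            for i in range(n)]
    via = [[-1] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    via[i][j] = k
    return dist, via


# Path graph 0 - 1 - 2 with an isolated node 3.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
dist, via = floyd_warshall(adj)
print(dist[0][2], via[0][2])  # 2 1
```

The pure-Python loop is cubic in the number of nodes and much slower than compiled code, so it is only practical for small graphs or for checking results.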

On the example usage of the Graphormer encoder

Hi, thank you for your exciting work on Graphormer. I am curious about the mechanisms of this model. I tried to instantiate the example Encoder layer, with the data import lines commented out.

It seems the multi-head attention module is not imported, and I am not sure whether the MHA module under Graphormer is customized or not. I am mostly curious about the implementation of the spatial encoding part.

Would it be possible for you to provide a toy example? A forward pass for a random 10x10 node matrix would do.

About OGB submission

Thank you for your leaderboard submission. Please provide the exact command to reproduce your leaderboard results.

algos AttributeError

Hello:
I got an error when running your code. Here it is:

"Graphormer/OGB-LSC/graphormer/src/ogb_wrapper.py", line 34, in preprocess_item
    all_rel_pos_3d_with_noise = torch.from_numpy(algos.bin_rel_pos_3d_1(item.all_rel_pos_3d, noise=noise)).long()
AttributeError: module 'algos' has no attribute 'bin_rel_pos_3d_1'

how to speed up one epoch

Hello,
Thank you for sharing; this is great work!
When I trained on my data with your code, I found a problem.
At the start of every epoch, it takes a long time before the GPU utilization rises.
What might be the reason for this?

How can I do graph regression with graphormer?

Hi! this is Stella from Seoul National University.
I'd like to ask how I can implement a regression task with Graphormer.
I adapted the ogb module for our data and set num_class to -1, like the other regression datasets.
I then ran into a problem when editing the model dimensions in model.py, lines 62 to 75.
image
I think that 512*9+1 is something like a vocabulary size, calculated as 512 * (number of categories of node features) + 1.
Is my guess right? In issue #32 you said that it should be greater than the number of classes across all categories. How should I set this number for a regression task? Maybe the number of graphs?

Thank you!

Report errors from pytorch_lightning

I followed the conda setup in the README file. However, an error occurs when importing pytorch_lightning:
image
Could you give me some ideas how to fix it? Thanks!
