graphdeeplearning / benchmarking-gnns
Repository for benchmarking graph neural networks
Home Page: https://arxiv.org/abs/2003.00982
License: MIT License
Hi,
First of all, congratulations on your great work.
Maybe I've caught a minor mistake. It seems that you're leaving out the nearest neighbor when computing the edges list in the file data/superpixels.py (method: compute_edges_list):
knns = np.argpartition(A, new_kth - 1, axis=-1)[:, new_kth:-1]
knn_values = np.partition(A, new_kth - 1, axis=-1)[:, new_kth:-1]
I think it should be
knns = np.argpartition(A, new_kth, axis=-1)[:, new_kth+1:]
knn_values = np.partition(A, new_kth, axis=-1)[:, new_kth+1:]
Could you please verify that?
Thanks.
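The difference between the two slices can be checked on a small example. The sketch below is an illustration only: it assumes `A` is a square matrix in which larger entries mean closer neighbours, and the helper name `compare_slices` is hypothetical. It relies only on what `np.argpartition` guarantees, namely that every entry after position `k` is at least as large as the entry at position `k`, in no particular order.

```python
import numpy as np

def compare_slices(A, kth):
    """Return the column indices kept by the current and the proposed slicing."""
    num_nodes = A.shape[0]
    new_kth = num_nodes - kth
    # Current code: partition at new_kth - 1, then drop whatever lands in the last column.
    current = np.argpartition(A, new_kth - 1, axis=-1)[:, new_kth:-1]
    # Proposed code: partition at new_kth, keep everything strictly after it.
    proposed = np.argpartition(A, new_kth, axis=-1)[:, new_kth + 1:]
    return current, proposed

rng = np.random.default_rng(0)
A = rng.random((12, 12))                 # random values, so ties are not expected
current, proposed = compare_slices(A, kth=9)

top_9 = np.argsort(A, axis=-1)[:, -9:]   # indices of the 9 largest entries per row
top_8 = np.argsort(A, axis=-1)[:, -8:]   # indices of the 8 largest entries per row
```

Both slices keep `kth - 1` columns. The proposed slice always holds the `kth - 1` largest entries of each row; the current slice holds `kth - 1` of the `kth` largest, and which one is dropped depends on the unspecified order left by the partition.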
Thanks for providing the benchmark!
As a Ph.D. student working on GNNs in China, I'd like to ask you to consider providing an additional mirror for the datasets, e.g., on GitHub or another website, since Dropbox may not be accessible for us.
(Just to mention, similar problems have happened in other packages as well, e.g., see pyg-team/pytorch_geometric#1116 (comment))
Is it possible to add a graph-level feature (one vector per graph) to be used together with the graph itself to predict the label, in the graph classification / graph regression settings? If so, how can I do that? I'm not very experienced with torch and tensorflow. Thank you in advance.
Hi, when I ran your script https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/script_one_code_to_rull_them_all.sh
I found all the processes were running on gpu0.
Your code selects the GPU by setting environment variables inside the script:
def gpu_setup(use_gpu, gpu_id):
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    if torch.cuda.is_available() and use_gpu:
        print('cuda available with GPU:',torch.cuda.get_device_name(0))
        device = torch.device("cuda")
    else:
        print('cuda not available')
        device = torch.device("cpu")
    return device
But on my machine with 8 GPUs, it caused the same issue as https://discuss.pytorch.org/t/why-setting-cuda-visible-devices-within-the-code-doesn-t-work/31826
All processes run only on gpu:0.
Other people have said that setting the variables outside of the code, or before importing anything GPU-related, works better.
Do you have the same issue, or does everything run normally for you?
I think using the methods in https://pytorch.org/docs/master/notes/cuda.html may be better and more compatible.
Thanks for your answer!
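One way to set the variables outside of the training code is to launch each run from a small wrapper that puts them in the child's environment. This is a minimal sketch, not the repository's launcher: `run_on_gpu` is a hypothetical helper, and the child command here only echoes the variable so the effect is visible without a GPU.

```python
import os
import subprocess
import sys

def run_on_gpu(gpu_id, argv):
    """Run argv in a child process whose environment exposes only the given GPU."""
    env = dict(os.environ)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)   # in place before the child imports torch
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)

# Stand-in for a training command: print what the child process sees.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_on_gpu(3, child)
visible = result.stdout.strip()
```

In a real run, `argv` would be the training command, and the selected card would then be the only device visible to that process.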
Hi,
Thanks for sharing this framework. It was the missing tool and is a big step for GNN research.
If we have a single graph and the task is to classify nodes, how should we prepare the dataset?
In other words, how should one graph be split into train, val and test sets for node classification with GNNs?
Thanks,
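For a single graph, a common approach is to keep the whole graph and split the nodes instead, using one boolean mask per subset. The sketch below assumes this mask convention; the function name, the 60/20/20 ratios and the seed are illustrative choices, not something this repository prescribes.

```python
import random

def split_node_masks(num_nodes, train_frac=0.6, val_frac=0.2, seed=41):
    """Return boolean train/val/test masks that partition the node indices."""
    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    train_ids = set(order[:n_train])
    val_ids = set(order[n_train:n_train + n_val])
    train_mask = [i in train_ids for i in range(num_nodes)]
    val_mask = [i in val_ids for i in range(num_nodes)]
    # Every node that is in neither train nor val goes to test.
    test_mask = [not (t or v) for t, v in zip(train_mask, val_mask)]
    return train_mask, val_mask, test_mask

train_mask, val_mask, test_mask = split_node_masks(10)
```

The model then runs on the full graph in every phase, while the loss and the metrics are computed only on the nodes selected by the corresponding mask.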
Why is the reduce function of GCNLayer implemented as a custom function instead of fn.mean?
layers/gcn_layer.py, lines 18-20:
def reduce(nodes):
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}
I replaced it with fn.mean and got a large speedup. Is there a bug in fn.mean?
Hi,
I've been trying to run the Graph Regression Demo on ZINC: https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/main_molecules_graph_regression.ipynb
I ran the script as it is on Google Colab, and I ran into an error in the following code:
# """
# USER CONTROLS
# """
if notebook_mode == True:
    #MODEL_NAME = '3WLGNN'
    #MODEL_NAME = 'RingGNN'
    MODEL_NAME = 'GatedGCN'
    #MODEL_NAME = 'MoNet'
    #MODEL_NAME = 'GCN'
    # MODEL_NAME = 'GAT'
    # MODEL_NAME = 'GraphSage'
    # MODEL_NAME = 'DiffPool'
    # MODEL_NAME = 'MLP'
    # MODEL_NAME = 'GIN'
    DATASET_NAME = 'ZINC'
    out_dir = 'out/molecules_graph_regression/'
    root_log_dir = out_dir + 'logs/' + MODEL_NAME + "_" + DATASET_NAME + "_" + time.strftime('%Hh%Mm%Ss_on_%b_%d_%Y')
    root_ckpt_dir = out_dir + 'checkpoints/' + MODEL_NAME + "_" + DATASET_NAME + "_" + time.strftime('%Hh%Mm%Ss_on_%b_%d_%Y')
    print("[I] Loading data (notebook) ...")
    dataset = LoadData(DATASET_NAME)
    trainset, valset, testset = dataset.train, dataset.val, dataset.test
    print("[I] Finished loading.")
Error:
[I] Loading data (notebook) ...
[I] Loading dataset ZINC...
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-15-38be5d2cb47d> in <module>()
23
24 print("[I] Loading data (notebook) ...")
---> 25 dataset = LoadData(DATASET_NAME)
26 trainset, valset, testset = dataset.train, dataset.val, dataset.test
27 print("[I] Finished loading.")
1 frames
/content/benchmarking-gnns/data/data.py in LoadData(DATASET_NAME)
23 # handling for (ZINC) molecule dataset
24 if DATASET_NAME == 'ZINC' or DATASET_NAME == 'ZINC-full':
---> 25 return MoleculeDataset(DATASET_NAME)
26
27 # handling for the TU Datasets
/content/benchmarking-gnns/data/molecules.py in __init__(self, name)
180 self.name = name
181 data_dir = 'data/molecules/'
--> 182 with open(data_dir+name+'.pkl',"rb") as f:
183 f = pickle.load(f)
184 self.train = f[0]
FileNotFoundError: [Errno 2] No such file or directory: 'data/molecules/ZINC.pkl'
I tried digging in and noticed that there isn't actually any file named ZINC.pkl at data/molecules in this repository.
Is there a way out?
I got this issue when I ran the following command:
python main_superpixels_graph_classification.py --dataset CIFAR10 --gpu_id 3 --seed 41 --config configs/superpixels_graph_classification_GAT_CIFAR10_100k.json
cuda available with GPU: GeForce GTX 1080
[I] Loading dataset CIFAR10...
Traceback (most recent call last):
  File "main_superpixels_graph_classification.py", line 430, in <module>
    main()
  File "main_superpixels_graph_classification.py", line 314, in main
    dataset = LoadData(DATASET_NAME)
  File "/mnt/DISK10T/zhanghm/benchmarkingGNN/benchmarking-gnns/data/data.py", line 21, in LoadData
    return SuperPixDataset(DATASET_NAME)
  File "/mnt/DISK10T/zhanghm/benchmarkingGNN/benchmarking-gnns/data/superpixels.py", line 272, in __init__
    f = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated
Could assistance be provided for addressing this issue?
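A truncated pickle usually means that the download or copy of the dataset file was interrupted. A quick way to confirm that before rerunning training is to try loading the file on its own. This is a generic sketch with a hypothetical helper name; the path in the usage note is an assumption about where the file lives.

```python
import pickle

def pickle_is_complete(path):
    """Return True if the file at path unpickles fully, False if it is cut short."""
    try:
        with open(path, "rb") as f:
            pickle.load(f)
    except (pickle.UnpicklingError, EOFError):
        # Raised when the stream ends before the pickle does.
        return False
    return True
```

If `pickle_is_complete('data/superpixels/CIFAR10.pkl')` returns `False`, downloading the file again is the likely fix. Loading the dataset this way still needs the same packages (for example dgl) that the training script imports.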
Hello,
Which GNN models can be applied to weighted graphs, i.e., take edge features/weights into consideration? Is there any GNN for unsupervised (not semi-supervised) clustering? And how large can my graph be?
Your answer will be much appreciated :))
Including the training and inference speed, along with the memory usage, is important.
Looking at the data generation for the molecules task, I only see the eigvec assignment to the 'pos_enc' node features, but cannot find where random sign flipping is implemented. Is this in the codebase somewhere?
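For context, eigenvectors are only defined up to sign, so a common trick is to multiply each eigenvector (each column of the positional-encoding matrix) by a random +1 or -1 at training time. The sketch below illustrates that idea on plain Python lists; it is an assumption about what "random sign flipping" refers to, and the function name is hypothetical. It does not show where, or whether, the repository applies it.

```python
import random

def random_sign_flip(pos_enc, rng=random):
    """Multiply each column of pos_enc (rows = nodes, columns = eigenvectors) by +1 or -1."""
    num_cols = len(pos_enc[0])
    # One random sign per eigenvector, shared by all nodes.
    signs = [1.0 if rng.random() >= 0.5 else -1.0 for _ in range(num_cols)]
    return [[value * s for value, s in zip(row, signs)] for row in pos_enc]

pos_enc = [[0.5, -0.1], [-0.5, 0.3], [0.0, -0.2]]
flipped = random_sign_flip(pos_enc, random.Random(0))
```

Drawing the signs once per batch (rather than once per dataset) is what makes this act as data augmentation.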
Hi,
I use "data/molecules.py" to load the ZINC dataset.
But I got confused when computing num_atom_type and num_bond_type.
The dataset is said to have 28 different atom types and 4 bond types. However, the numbers I actually get are 21 and 3.
Does anyone know why?
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
{1, 2, 3}
Here is my code:
zinc_dataset = MoleculeDatasetDGL()
node_set = set()
edge_set = set()
for data in zinc_dataset.train:
    g, label = data
    node_set = node_set.union( set(g.ndata['feat'].numpy()) )
    edge_set = edge_set.union( set(g.edata['feat'].numpy()) )
print(node_set)
print(edge_set)
Hi, when I run the following file
https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/data/superpixels/prepare_superpixels_MNIST.ipynb
the cell
if not os.path.isfile('superpixels.zip'):
    print('downloading..')
    !curl https://www.dropbox.com/s/y2qwa77a0fxem47/superpixels.zip?dl=1 -o superpixels.zip -J -L -k
    !tar -xvf superpixels.zip -C ../
else:
    print('File already downloaded')
fails with this error:
tar: This does not look like a tar archive
Maybe we should change the command
!tar -xvf superpixels.zip -C ../
to
!unzip superpixels.zip -d ../
Thank you!
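The same extraction can be done from Python, which avoids depending on whether `tar` on a given system accepts zip archives. A minimal sketch using only the standard library; `extract_archive` is a hypothetical helper, and the file names in the usage note mirror the notebook cell.

```python
import tarfile
import zipfile

def extract_archive(path, dest):
    """Extract a zip or tar archive into dest, choosing by file content, not extension."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest)
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            archive.extractall(dest)
    else:
        raise ValueError(f"{path} is neither a zip nor a tar archive")
```

In the notebook this would be called as `extract_archive('superpixels.zip', '../')`.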
Thank you for your work in providing a standard repository for graph benchmarking.
Some applications might require using the SMILES strings to build different types of graphs than the ones provided by the benchmarking platform. I know that the ZINC dataset comes from the JT-VAE paper, where the SMILES are provided. However, the benchmark uses only a subset of the original dataset, and the train-val-test split is different.
I tried going from the DGLGraph back to SMILES, but that is not possible since I don't know which node label corresponds to which atom.
The version 2 update commit removes the DiffPool layer and net implementation. What happened to this structure?
Hi, thanks for creating this awesome project. However, I've encountered some problems reproducing the results presented in the paper and on the leaderboard. Generally speaking, the results are very unstable in my experiments.
On the ZINC dataset, here are my results for the GAT model:
My results for the GCN model:
I don't think we can neglect these fluctuations, since the errors are sometimes around 0.01, and some recent SOTA models outperform the best model in the benchmark paper by only 0.01.
It seems that setting random seeds does not really work in this case. Are there any other requirements for reproduction? Do you think such fluctuation is tolerable?
After training a GCN model on the ZINC dataset, I reloaded the latest checkpoint and tried to reproduce the test MAE with the evaluate_network_sparse function. Unfortunately, the test MAE from the checkpoint (0.3670) differs from the test MAE in the result file (0.3757). Can results be reproduced from checkpoints? Could you please share some code for the testing pipeline, in case I missed an important step during reproduction?
Note: I changed readout to 'sum'. Apart from that, I didn't modify any hyper-parameters.
My experiments were run on a Quadro P6000.
Can you give me some suggestions on how to build a heterogeneous graph with node features and edge features and do edge classification? Thank you very much!
Hi, thank you for sharing the project with us.
I'm confused: although I fixed the random seed as your config file does, I still got different results between runs, just as if no seed had been set.
Does fixing the seed ensure reproducibility, i.e., should it give the same accuracy on every run?
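For reference, a typical seeding routine looks like the sketch below. It is not the repository's code; `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they are installed. Even with all of this in place, some GPU kernels are non-deterministic, so small run-to-run differences can remain.

```python
import random

def seed_everything(seed):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)            # ignored when CUDA is unavailable
        torch.backends.cudnn.deterministic = True   # prefer deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False      # disable autotuned kernel selection
    except ImportError:
        pass

seed_everything(41)
first = random.random()
seed_everything(41)
second = random.random()   # same value as `first`, since the generator was reseeded
```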
Hi, I am trying to reproduce the results on the ZINC dataset with GraphSAGE but I keep getting test MAE values around 0.58 instead of the 0.43 reported in the paper, with all settings at default (L=4 layers). I am also getting a slightly different number of parameters (104167 instead of 105301). The other networks work fine.
Thanks!
Hello~ I want to add a new dataset. I saw that the MNIST (superpixel) data folder contains only mnist_75sp_test.pkl and mnist_75sp_train.pkl, but I do not know how to convert my image data into such XXX.pkl files.
Thanks!
Hi, I think the code (or the dataset, not sure) for loading the node classification datasets SBM_CLUSTER and SBM_PATTERN is buggy. I got a "too many values to unpack" error when I tried to reproduce the results for SBM. Below is the log:
(benchmark_gnn) -bash-4.2$ python main_SBMs_node_classification.py --dataset SBM_CLUSTER --gpu_id 0 --seed 41 --config 'configs/SBMs_node_clustering_MLP_CLUSTER.json'
Using backend: pytorch
cuda available with GPU: TITAN X (Pascal)
[I] Loading dataset SBM_CLUSTER...
Traceback (most recent call last):
  File "main_SBMs_node_classification.py", line 409, in <module>
    main()
  File "main_SBMs_node_classification.py", line 307, in main
    dataset = LoadData(DATASET_NAME)
  File "/afs/ece.cmu.edu/usr/yaoj/Desktop/benchmarking-gnns/data/data.py", line 33, in LoadData
    return SBMsDataset(DATASET_NAME)
  File "/afs/ece.cmu.edu/usr/yaoj/Desktop/benchmarking-gnns/data/SBMs.py", line 132, in __init__
    f = pickle.load(f)
  File "/afs/ece.cmu.edu/usr/yaoj/anaconda3/envs/benchmark_gnn/lib/python3.7/site-packages/dgl/graph_index.py", line 53, in __setstate__
    num_nodes, readonly, src, dst = state
ValueError: too many values to unpack (expected 4)
I am reading superpixels_visualization_mnist.ipynb and am confused by the function compute_edges_list(A, kth=8+1):
knns = np.argpartition(A, new_kth-1, axis=-1)[:, new_kth:-1]
In that case we ignore the last column. I think it is better to use
knns = np.argpartition(A, new_kth, axis=-1)[:, new_kth+1:]
Let me know if I am wrong. Thank you very much!
Hi~ first I would like to thank all the contributors for such an awesome job!
I'm not familiar with the "dgl" module, so I am confused about parts of the code.
I think the adjacency matrix should take part in the computation, but I didn't find it in the code.
Also, I'm not sure about the meaning of g.edata['feat']. For example, what does it stand for in the SBM datasets?
Thanks in advance for your reply.
Have a nice day!
Hi!
I have a fork of the repo where I made the changes needed to add ChebNet. Would you be interested in integrating it if I open a pull request, or do you prefer to keep the repo as your own code for the time being?
Best,
Axel
Hi, looking at your code in
benchmarking-gnns/data/superpixels.py
Lines 114 to 144 in 92c762a
it seems that the superpixel graph is a directed graph, not an undirected one. Is that right?
In Figure 1 of your paper, MNIST and CIFAR10 are drawn as undirected graphs, but based on your code shouldn't they be directed graphs?
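A k-nearest-neighbour relation is not symmetric in general: node j can be among the k nearest neighbours of i without i being among those of j. The sketch below shows how one could test an edge list for symmetry and add the missing reverse edges. It is an illustration with hypothetical helper names, not a description of what the benchmark does.

```python
def is_symmetric(edges):
    """True if every directed edge (u, v) has its reverse (v, u) in the list."""
    edge_set = set(edges)
    return all((v, u) in edge_set for u, v in edge_set)

def symmetrize(edges):
    """Return the sorted edge list with all missing reverse edges added."""
    edge_set = set(edges)
    edge_set.update((v, u) for u, v in edges)
    return sorted(edge_set)

knn_edges = [(0, 1), (1, 0), (2, 1)]   # node 2 picks node 1, but not the other way round
undirected = symmetrize(knn_edges)
```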
An open discussion: With the inclusion of leaderboards, would we like to keep updating them with new results as and when they hit arXiv? Should there be a protocol for leaderboard submissions and validating the results?
E.g. https://arxiv.org/abs/2006.07846, https://arxiv.org/abs/2004.05718
The paper claims that introducing so-called graph size normalization can boost performance in various graph machine learning tasks. However, I can't find the implementation of this simple and effective normalization. Could anyone point out exactly where it is implemented?
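As a point of reference while looking for the implementation: graph size normalization, as one reading of the paper goes, rescales the features of every node by 1/sqrt(V), where V is the number of nodes of the graph the node belongs to. The sketch below computes those per-node factors for a batch of graphs. The function name is hypothetical and the formula is an assumption based on that reading, not the repository's code.

```python
import math

def graph_size_norm_factors(graph_sizes):
    """Return one factor 1/sqrt(V) per node, for graphs concatenated in batch order."""
    factors = []
    for num_nodes in graph_sizes:
        factors.extend([1.0 / math.sqrt(num_nodes)] * num_nodes)
    return factors

# Two graphs with 4 and 9 nodes: the first 4 factors are 1/2, the next 9 are 1/3.
factors = graph_size_norm_factors([4, 9])
```

Node features of a batched graph would then be multiplied row by row with these factors.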
Hi,
When I was loading the SBM_PATTERN dataset, I got the following error:
  File "main_SBMs_node_classification.py", line 445, in <module>
    main()
  File "main_SBMs_node_classification.py", line 333, in main
    dataset = LoadData(DATASET_NAME)
  File "/mnt/DISK_10T/zhanghm/Benchmarking_gnn/benchmarking-gnns/data/data.py", line 36, in LoadData
    return SBMsDataset(DATASET_NAME)
  File "/mnt/DISK_10T/zhanghm/Benchmarking_gnn/benchmarking-gnns/data/SBMs.py", line 161, in __init__
    with open(data_dir+name+'.pkl',"rb") as f:
AttributeError: Can't get attribute 'HeteroPickleStates' on <module 'dgl.heterograph_index' from '/home/zhanghm/anaconda3/envs/benchmark_gnn/lib/python3.7/site-packages/dgl/heterograph_index.py'>
For the remaining datasets, I can successfully load them.
Could assistance be provided to help with this issue?
Thanks,
Hi,
Thank you for the great work! This work is really wonderful. When I try to use GCN model for node classification by running:
python main_SBMs_node_classification.py --dataset SBM_PATTERN --gpu_id 0 --seed 41 --config 'configs/SBMs_node_clustering_GCN_PATTERN_100k.json'
I found that when I set dgl_builtin to false, the test acc is 63.77, which is consistent with the results reported in the paper; however, when I set dgl_builtin to true, the test acc became 85.56.
I do not think this behavior is normal. But I did not figure out why the performances are so different after struggling for some time. I would appreciate it if you could help me. Thank you! Have a nice day!
Best,
Yongcheng
It is very difficult to understand, from your paper and GitHub, what we are actually predicting and where the datasets come from.
For ZINC, you mention "constrained solubility", but I can't find any reference to it in the ZINC dataset that your paper cites. It is not clear whether it is a computed or a measured property, or what method is used to measure/compute it. Can you state the name of the property more clearly, and make the ZINC IDs available so they can be checked? Additionally, ZINC has 230 million molecules, but you only use 12,000. How did you select the ones to include?
For CIFAR10 and MNIST, you do not mention how the images were clustered into superpixels, what method was used and what is the average resolution of the resulting image.
For PATTERN and CLUSTER, it is not mentioned what are the patterns that we are looking to find, what is the average degree of the graphs, what is the diameter distribution, what is the diameter and degree of the patterns, etc.
These pieces of information are useful for evaluating whether the performance of models is satisfactory or not. I feel that the current description leaves us unable to truly understand why certain networks are better than others, and makes the benchmarking of GNNs more about "beating the benchmarks" than about increasing discriminative ability.
Thank you, and great work on the paper; it was really needed in the GNN community.
Hi,
Thank you very much for sharing the project! I have a question about SBM-Cluster. It is a semi-supervised node classification task, and the paper says that one single label is used for each community (for training?), but in the code it seems that all labels are used for each training graph. Did I misunderstand something?
Thank you for sharing this repo with us!
Could you please add an explicit LICENSE file to the repo so that it's clear under what terms the content is provided, and under what terms user contributions are licensed?
[...] without a license, the default copyright laws apply, meaning that you
retain all rights to your source code and no one may reproduce, distribute,
or create derivative works from your work. If you're creating an open source
project, we strongly encourage you to include an open source license.
Thanks!
Hoping for your answers. Thanks!
Hi,
Thanks for the great repo and paper.
I'm interested in the CLUSTER and PATTERN experiments and want to reproduce them using pytorch-geometric, but I'm having a hard time doing so, even though I'm using the same loss and accuracy functions and tried to use the same hyperparameters.
Is there a PyTorch Geometric implementation of these experiments? If so, can you please point me to it? If not, can you give me some general hints on how to reproduce them?
Thank you in advance!
Hi, I encountered an error when I loaded the MNIST.pkl and CIFAR10.pkl
<stdout>:cuda available with GPU: Tesla P100-PCIE-16GB
<stdout>:[I] Loading dataset CIFAR10...
<stderr>:Traceback (most recent call last):
<stderr>: File "tools/train_superpixels_graph_classification.py", line 444, in <module>
<stderr>: main()
<stderr>: File "tools/train_superpixels_graph_classification.py", line 325, in main
<stderr>: dataset = LoadData(DATASET_NAME, data_dir=args.dataDir)
<stderr>: File "/var/storage/shared/nextmsra/sys/jobs/application_1583577754071_28152/gnn/data/data.py", line 18, in LoadData
<stderr>: return SuperPixDataset(DATASET_NAME, data_dir)
<stderr>: File "/var/storage/shared/nextmsra/sys/jobs/application_1583577754071_28152/gnn/data/superpixels.py", line 271, in __init__
<stderr>: f = pickle.load(f)
<stderr>: File "/tmp/cache/python/lib/python3.7/site-packages/dgl/graph_index.py", line 53, in __setstate__
<stderr>: num_nodes, readonly, src, dst = state
<stderr>:ValueError: too many values to unpack (expected 4)
Here is my environment.
cuda version :10.0
pytorch version : pytorch 1.4 with cudatoolkit 10.0
dgl version: v0.4.3
I wonder if the new dgl version causes the inconsistency, because everything runs fine with dgl v0.4.2, but I can't find how to explicitly downgrade my dgl to v0.4.2.
Thanks for solving this :)
Alternatively, could you publish the code that generates the dgl graphs from the CIFAR and MNIST datasets?
Thank you
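Since the pickled graphs appear to depend on the DGL version they were written with, it can help to fail early with a clear message when the installed version differs from the one known to work. A minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper names are hypothetical, and 0.4.2 is simply the version reported as working above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_version(package, expected):
    """Raise a readable error if package is missing or not at the expected version."""
    found = installed_version(package)
    if found != expected:
        raise RuntimeError(f"{package} {expected} expected, found {found}")
```

Calling `check_version("dgl", "0.4.2")` before loading the dataset turns the unpacking error into an explicit version message. With pip, an exact version is requested as `pip install dgl==0.4.2`, provided that release is available for the platform.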
Hi. First of all, thank you for your great work for establishing a benchmark for GNN.
We are now trying to visualize data from the ZINC dataset using rdkit.
However, we could not find the dictionary of atom types for the ZINC data.
We think the file named atom_dict.pickle contains it, but we cannot open it with pickle because it raises the following error:
AttributeError: Can't get attribute 'Dictionary' on <module '__main__' from 'open_pickle.py'>
So, could you share how to open the atom_dict.pickle file, or share the dictionary for the ZINC atoms? We would be very grateful.
Thank you.
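The `AttributeError` arises because pickle stores only a reference to the class (module name plus class name), so the class has to be importable when the file is loaded. If the original class definition is not at hand, one workaround is to map the missing name to a placeholder class while unpickling. This is a sketch under assumptions: the placeholder below is not the original `Dictionary` class, and which attributes the loaded object carries depends on how the file was written.

```python
import pickle

class Dictionary:
    """Placeholder for the missing class; pickled attributes are restored onto it."""

class StubUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect the missing class to the placeholder; resolve everything else normally.
        if name == "Dictionary":
            return Dictionary
        return super().find_class(module, name)

def load_with_stub(fileobj):
    """Unpickle from an open binary file, substituting the placeholder class."""
    return StubUnpickler(fileobj).load()
```

A file would be loaded with `with open('atom_dict.pickle', 'rb') as f: atom_dict = load_with_stub(f)`, after which `vars(atom_dict)` lists whatever attributes were stored.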
I have tried installing the environment as specified in the guide, but I get a lot of package conflicts when running conda env create -f environment_cpu.yml. Environment details: OS: Windows 10, no GPU. I don't get the same conflicts when running it on a Linux VM. Any idea why?
benchmarking-gnns/data/superpixels.py
Line 270 in 31a2e0e
This error occurred when I ran the following code on the command line:
!python /content/main_superpixels_graph_classification.py --dataset MNIST --gpu_id 0 --config '/content/configs/superpixels_graph_classification_GatedGCN_MNIST_100k.json' # for GPU
When I change the code to the following, I still get the same error:
self.train, self.val, self.test = pickle.load(f)
How can I solve this error? Thanks!
Hello! Thank you for sharing this project!
I have a question concerning adding new datasets.
I have followed the pipeline described in the docs to prepare a dataset for molecular regression. My dataset consists of SMILES strings and their respective scores, so I'm using MoleculeCSVDataset from the DGL library to convert molecules to graphs. I'm also using CanonicalAtomFeaturizer and CanonicalBondFeaturizer to extract node and edge features respectively. In the end I get a feature matrix g.ndata['feat'] of size n x 74 and g.edata['feat'] of size e x 12 for each graph. However, I get shape mismatches during the forward pass.
Is my understanding of the featurization process correct or should it be done differently?
Many thanks in advance and apologies for the trivial question.
Hello, thank you for this wonderful work.
I printed the size of batch_x in "train_molecules_graph_regression.py", and it is [batch_num, 1]. But in Table 10 of the paper, the dimension is 28 for the ZINC dataset. Is there anything wrong with this dataset? Looking forward to your reply, thanks! :)
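One possible explanation, stated as an assumption rather than a confirmed answer: a feature of size `[num_nodes, 1]` is consistent with each atom being stored as a single integer type index, with 28 being the number of distinct types (the length of the one-hot vector, or the number of rows of an embedding table) rather than the stored width. The sketch below shows the correspondence between the two representations; the function name is hypothetical.

```python
def one_hot(indices, num_types):
    """Expand integer type indices into one-hot rows of length num_types."""
    rows = []
    for idx in indices:
        row = [0] * num_types
        row[idx] = 1
        rows.append(row)
    return rows

atom_types = [0, 5, 27]                       # one integer per node, as stored
encoded = one_hot(atom_types, num_types=28)   # one row of length 28 per node
```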
In GraphSage, at each layer each node updates its embedding using a fixed number (S_i) of neighbors. Is it possible in this framework to choose the S_i parameters for the 2 layers in GraphSage? I don't see a way to do so.
Thank you anyway.
I am a newbie. Is the GatedGCN in the layers the GatedGCN-E version? And what is the difference between them?
From my understanding, you randomly generated 10k graphs of random sizes for training.
For training with a citation network, a single graph A is passed into the GNN.
For training with SBM_Cluster, do you combine these randomly generated graphs (A_{i}) into a big diagonal block matrix? (like how we aggregate graphs for a graph classification problem, but without the pooling layer?)
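The block-diagonal view described in the question can be sketched as follows. This illustrates the idea only (several graphs treated as one large graph with no edges between them); it is not taken from the repository, and the function name is hypothetical.

```python
def block_diagonal(adjacencies):
    """Stack square adjacency matrices (lists of lists) along the diagonal."""
    total = sum(len(a) for a in adjacencies)
    big = [[0] * total for _ in range(total)]
    offset = 0
    for a in adjacencies:
        for i, row in enumerate(a):
            for j, value in enumerate(row):
                big[offset + i][offset + j] = value
        offset += len(a)
    return big

A1 = [[0, 1], [1, 0]]                     # 2-node graph
A2 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # 3-node path graph
batched = block_diagonal([A1, A2])        # 5 x 5 matrix
```

Message passing on `batched` never crosses graph boundaries, because the off-diagonal blocks are zero.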
Running the notebook 'main_molecules_graph_regression.ipynb' as it is, you get the error "CalledProcessError: Command 'jupyter nbconvert --to script main_molecules_graph_regression.ipynb' returned non-zero exit status 1." when the last cell runs the utils/cleaner_main script, specifically when it calls subprocess.check_output. The actual error is masked by the check_output function; in my case I needed to install the "jupyter_contrib_nbextensions" library.
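To see the underlying error instead of only the exit status, the output of the failing command can be captured and reported. A minimal sketch with a hypothetical helper name; the child command below is a stand-in that fails on purpose.

```python
import subprocess
import sys

def run_and_report(argv):
    """Run argv; on failure, return the child's stderr instead of hiding it."""
    try:
        subprocess.run(argv, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as err:
        return err.stderr
    return ""

# Stand-in for the nbconvert call: writes a message to stderr and exits with status 1.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('missing module'); sys.exit(1)"]
message = run_and_report(failing)
```

Printing `message` (or re-raising with it attached) shows what the command itself complained about.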
Hi,
We are the publishers of OpenBioLink, a large-scale biomedical heterogeneous knowledge graph consisting of over 5 million edges. Would you be interested in OpenBioLink as a contribution to the existing datasets?
Hi! Thanks for this great work!
I was just curious as to what the 'z' variable in line 59 of graph_transformer_layer.py is. I cannot seem to find this variable in the paper. It seems to divide the output of the multi-head attention. Is there a specific reason why this is necessary?
Thanks :)
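A possible reading, offered as an assumption since the code itself is not shown here: if the attention scores are exponentiated but not normalized when messages are formed, then dividing the weighted sum by `z`, the sum of those exponentiated scores, is what turns the weights into a softmax. The sketch below checks that equivalence on plain numbers; all names are illustrative.

```python
import math

def attention_via_z(scores, values):
    """Weighted sum with unnormalized exp(score) weights, divided by their sum z."""
    weights = [math.exp(s) for s in scores]
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    z = sum(weights)
    return weighted_sum / z

def attention_via_softmax(scores, values):
    """The same quantity computed with explicitly normalized softmax weights."""
    z = sum(math.exp(s) for s in scores)
    return sum(math.exp(s) / z * v for s, v in zip(scores, values))

scores, values = [0.1, 1.2, -0.3], [1.0, 2.0, 3.0]
```

Up to floating-point rounding, both functions return the same number for the same inputs.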
Dear Authors,
In Benchmarking Graph Neural Networks, you say that max pooling should be more powerful than average pooling (Appendix E.2). Why does this statement hold? Is there a theoretical guarantee, or is it just an empirical conclusion?
Best,
Ken
Hi, thank you for providing a great benchmark!
I have a question about the CLUSTER dataset. As stated in the paper, the input attributes for all nodes are zero except for the 6 "seed" nodes which have one-hot input features [1,2,3,4,5,6]. Are these the only node attributes that are fed into the GNNs? I wonder if non-zero node attributes are used/needed for the non-seed nodes?
Thanks!
Hi, thank you for your wonderful work.
I have been using the ZINC, SBM_CLUSTER and SBM_PATTERN datasets since this past February. Today, I found that the paper has been updated substantially, so I looked through the datasets.
I noticed that the ZINC and SBM_CLUSTER pkl file links are as they were, but the SBM_PATTERN link has changed. Is it still safe to use the SBM_PATTERN pkl file from February?
Or could you tell me what has changed since then?