
graphdeeplearning / benchmarking-gnns

2.4K 59.0 450.0 3.29 MB

Repository for benchmarking graph neural networks

Home Page: https://arxiv.org/abs/2003.00982

License: MIT License

Python 13.18% Jupyter Notebook 83.35% Shell 3.47%
graph-representation-learning graph-neural-networks benchmark-framework graph-deep-learning pytorch dgl deep-learning

benchmarking-gnns's People

Contributors

barclayii, chaitjo, vijaydwivedi75, xbresson


benchmarking-gnns's Issues

k-nn edge filter

Hi,

First of all, congratulations on your great work.

Maybe I've caught a minor mistake. It seems that you're leaving out the nearest neighbor when computing the edges list in the file data/superpixels.py (method: compute_edges_list):

knns = np.argpartition(A, new_kth - 1, axis=-1)[:, new_kth:-1] 
knn_values = np.partition(A, new_kth - 1, axis=-1)[:, new_kth:-1]  

I think it should be

knns = np.argpartition(A, new_kth, axis=-1)[:, new_kth+1:] 
knn_values = np.partition(A, new_kth, axis=-1)[:, new_kth+1:]  

Could you please verify that?

Thanks.
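
For reference, a minimal NumPy sketch comparing the two slices on a toy matrix. The matrix `A`, the helper name `topk_slices`, and the sizes are made up for illustration; only the two `np.argpartition` expressions come from the snippets above. `np.argpartition` only guarantees that the element at position `kth` is where it would be after sorting, so the columns to its right hold the larger values in no particular order.

```python
import numpy as np

def topk_slices(A, kth):
    """Return neighbor indices from the current slice and from the proposed slice."""
    num_nodes = A.shape[0]
    new_kth = num_nodes - kth
    # Current code: columns new_kth..n-2, i.e. kth-1 of the kth largest entries.
    # The entry that happens to land in the last column is dropped.
    current = np.argpartition(A, new_kth - 1, axis=-1)[:, new_kth:-1]
    # Proposed code: columns new_kth+1..n-1, i.e. exactly the kth-1 largest entries.
    proposed = np.argpartition(A, new_kth, axis=-1)[:, new_kth + 1:]
    return current, proposed

rng = np.random.default_rng(0)
A = rng.permutation(144).reshape(12, 12).astype(float)  # distinct toy similarities
current, proposed = topk_slices(A, kth=9)
print(current.shape, proposed.shape)
```

Under these assumptions both slices return the same number of neighbors per row; the difference is that the proposed slice is always the set of largest entries, while the current slice is some subset of a slightly larger set.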

Additional Mirror for Datasets

Thanks for providing the benchmark!

As a Ph.D. student working on GNNs in China, I'd like to ask you to consider providing an additional mirror for the datasets, e.g., on GitHub or another website, since Dropbox may not be accessible for us.

(Just to mention, similar problems have happened in other packages as well, e.g., see pyg-team/pytorch_geometric#1116 (comment))

Graph features

Is it possible to add a graph-level feature (one vector per graph) to be used together with the graph itself to predict the label in the graph classification / graph regression settings? If so, how can I do it? I'm not very experienced with torch and tensorflow. Thank you in advance.

Always use gpu0

Hi, when I ran your script https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/script_one_code_to_rull_them_all.sh
I found that all the processes were running on gpu0.
Your code selects the GPU by setting environment variables inside the program:

def gpu_setup(use_gpu, gpu_id):
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  

    if torch.cuda.is_available() and use_gpu:
        print('cuda available with GPU:',torch.cuda.get_device_name(0))
        device = torch.device("cuda")
    else:
        print('cuda not available')
        device = torch.device("cpu")
    return device

But on my machine with 8 GPUs, this causes the same issue as https://discuss.pytorch.org/t/why-setting-cuda-visible-devices-within-the-code-doesn-t-work/31826
All processes run only on gpu:0.

I also found other people saying that it is better to set these variables outside the code, or before importing anything that touches the GPU.
Do you see the same issue, or does everything run normally for you?
I think using the methods in https://pytorch.org/docs/master/notes/cuda.html may be better and more compatible.
Thanks for your answer!
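
A minimal sketch of the "set it outside the code" idea mentioned above: the launcher puts `CUDA_VISIBLE_DEVICES` into the child process's environment before Python (and therefore torch) starts. The helper names `env_for_gpu` and `launch` are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys

def env_for_gpu(gpu_id):
    """Copy the current environment and pin the child process to one GPU."""
    env = os.environ.copy()
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env

def launch(script_args, gpu_id):
    """Start a Python script with the GPU selection already in its environment."""
    return subprocess.Popen([sys.executable, *script_args], env=env_for_gpu(gpu_id))

env = env_for_gpu(3)
print(env["CUDA_VISIBLE_DEVICES"])
```

The shell equivalent is to prefix the command with `CUDA_VISIBLE_DEVICES=3`. Inside a process started this way, the selected card is the only visible device.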

Preparing a dataset for node classification

Hi,
Thanks for sharing this framework. It was the missing tool and a big step for the gnn research.

If we have a single graph and the task is to classify its nodes, how should we prepare the dataset?
In other words, how do we split one graph into train, val and test sets for node classification with GNNs?

Thanks,
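
One common approach for a single graph (not necessarily how this repository handles it) is to keep the graph whole and assign every node to exactly one of three boolean masks. The sketch below assumes random splitting; the function name `split_node_masks` and the fractions are illustrative.

```python
import numpy as np

def split_node_masks(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly assign each node to exactly one of train / val / test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    masks = {name: np.zeros(num_nodes, dtype=bool) for name in ("train", "val", "test")}
    masks["train"][perm[:n_train]] = True
    masks["val"][perm[n_train:n_train + n_val]] = True
    masks["test"][perm[n_train + n_val:]] = True
    return masks

masks = split_node_masks(num_nodes=10)
print({name: int(mask.sum()) for name, mask in masks.items()})
```

The model still sees the full graph in every phase; the masks only decide which nodes contribute to the loss or to the evaluation metric.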

GCNLayer reduce function

Why is the reduce function of GCNLayer implemented as a custom function instead of using fn.mean?

layers/gcn_layer.py line 18~20

def reduce(nodes):
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

I replaced it with fn.mean and got a large speedup. Is there a bug in fn.mean?

File Not Found Error on running main_molecules_graph_regression.ipynb

Hi,

I've been trying to run the Graph Regression Demo on ZINC: https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/main_molecules_graph_regression.ipynb

I ran the script as it is on Google Colab, and I ran into an error in the following code:

# """
#     USER CONTROLS
# """
if notebook_mode == True:
    
    #MODEL_NAME = '3WLGNN'
    #MODEL_NAME = 'RingGNN'
    MODEL_NAME = 'GatedGCN'
    #MODEL_NAME = 'MoNet'
    #MODEL_NAME = 'GCN'
    # MODEL_NAME = 'GAT'
    # MODEL_NAME = 'GraphSage'
    # MODEL_NAME = 'DiffPool'
    # MODEL_NAME = 'MLP'
    # MODEL_NAME = 'GIN'

    DATASET_NAME = 'ZINC'

    out_dir = 'out/molecules_graph_regression/'
    root_log_dir = out_dir + 'logs/' + MODEL_NAME + "_" + DATASET_NAME + "_" + time.strftime('%Hh%Mm%Ss_on_%b_%d_%Y')
    root_ckpt_dir = out_dir + 'checkpoints/' + MODEL_NAME + "_" + DATASET_NAME + "_" + time.strftime('%Hh%Mm%Ss_on_%b_%d_%Y')

    print("[I] Loading data (notebook) ...")
    dataset = LoadData(DATASET_NAME)
    trainset, valset, testset = dataset.train, dataset.val, dataset.test
    print("[I] Finished loading.")

Error:

[I] Loading data (notebook) ...
[I] Loading dataset ZINC...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-15-38be5d2cb47d> in <module>()
     23 
     24     print("[I] Loading data (notebook) ...")
---> 25     dataset = LoadData(DATASET_NAME)
     26     trainset, valset, testset = dataset.train, dataset.val, dataset.test
     27     print("[I] Finished loading.")

1 frames
/content/benchmarking-gnns/data/data.py in LoadData(DATASET_NAME)
     23     # handling for (ZINC) molecule dataset
     24     if DATASET_NAME == 'ZINC' or DATASET_NAME == 'ZINC-full':
---> 25         return MoleculeDataset(DATASET_NAME)
     26 
     27     # handling for the TU Datasets

/content/benchmarking-gnns/data/molecules.py in __init__(self, name)
    180         self.name = name
    181         data_dir = 'data/molecules/'
--> 182         with open(data_dir+name+'.pkl',"rb") as f:
    183             f = pickle.load(f)
    184             self.train = f[0]

FileNotFoundError: [Errno 2] No such file or directory: 'data/molecules/ZINC.pkl'

I tried digging in and noticed that there isn't actually any file named ZINC.pkl at data/molecules in this repository.
Is there a way out?
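
The traceback shows that the loader opens `data/molecules/ZINC.pkl` directly, so the file has to exist locally before `LoadData` is called. A small pre-flight check, sketched below with a hypothetical helper `require_dataset_file`, turns the bare `FileNotFoundError` into a more explicit message; where the file is obtained from is not covered by this thread.

```python
import os

def require_dataset_file(data_dir, name):
    """Return the path of <data_dir>/<name>.pkl or raise with an explicit hint."""
    path = os.path.join(data_dir, name + ".pkl")
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} is missing; the dataset file has to be downloaded or "
            "generated before the loader can open it."
        )
    return path

# Example (raises if the file is absent):
# require_dataset_file("data/molecules/", "ZINC")
```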

_pickle.UnpicklingError: pickle data was truncated

I got this issue when I run the following command:
python main_superpixels_graph_classification.py --dataset CIFAR10 --gpu_id 3 --seed 41 --config configs/superpixels_graph_classification_GAT_CIFAR10_100k.json

cuda available with GPU: GeForce GTX 1080
[I] Loading dataset CIFAR10...
Traceback (most recent call last):
  File "main_superpixels_graph_classification.py", line 430, in <module>
    main()
  File "main_superpixels_graph_classification.py", line 314, in main
    dataset = LoadData(DATASET_NAME)
  File "/mnt/DISK10T/zhanghm/benchmarkingGNN/benchmarking-gnns/data/data.py", line 21, in LoadData
    return SuperPixDataset(DATASET_NAME)
  File "/mnt/DISK10T/zhanghm/benchmarkingGNN/benchmarking-gnns/data/superpixels.py", line 272, in __init__
    f = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated

Could assistance be provided for addressing this issue?
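
This error generally means the pickle stream ended before the unpickler reached its end marker, which is what an incomplete download or copy looks like. The sketch below reproduces that on in-memory data; `is_complete_pickle` is a hypothetical helper, and it should only be used on files you trust, since unpickling executes code.

```python
import io
import pickle

def is_complete_pickle(data):
    """Return True if `data` unpickles cleanly, False if it ends too early."""
    try:
        pickle.load(io.BytesIO(data))
        return True
    except (pickle.UnpicklingError, EOFError):
        return False

payload = pickle.dumps(list(range(1000)))
print(is_complete_pickle(payload))
print(is_complete_pickle(payload[: len(payload) // 2]))
```

For a multi-gigabyte dataset file, comparing the size on disk with the size reported by the download source is a cheaper first check than unpickling.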

GNN for weighted graph clustering

Hello,
Which GNN models can be applied to weighted graphs, i.e., take edge features/weights into consideration? Is there any GNN for unsupervised clustering (not semi-supervised)? And how large can my graph be?
Your answer will be much appreciated :))

Implementation of RandSign EigVec?

Looking at the data generation for the molecules task, I only see the eigenvector assignment to the 'pos_enc' node features, but I cannot find where the random sign flipping is implemented. Is it in the codebase somewhere?
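
For readers unfamiliar with the idea, here is a sketch of what random sign flipping of eigenvectors usually means. It is not taken from the repository; `random_sign_flip` and the toy path graph are assumptions made for illustration. Each eigenvector is only defined up to sign, so multiplying a column by -1 yields an equally valid eigenvector.

```python
import numpy as np

def random_sign_flip(eigvecs, rng):
    """Multiply each eigenvector (column) by an independent random sign."""
    signs = rng.choice([-1.0, 1.0], size=eigvecs.shape[1])
    return eigvecs * signs

# Laplacian of a 4-node path graph as a toy input.
A = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)
flipped = random_sign_flip(eigvecs, np.random.default_rng(0))
```

Such a flip is typically drawn anew for every batch during training, so that the model cannot rely on one arbitrary sign choice.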

Question about the num_atom_type and num_bond_type on zinc dataset

Hi,

I use "data/molecules.py" to load the zinc dataset.
But I got confused when computing the num_atom_type and num_bond_type.

It is said that there are 28 different atom types and four bond types. However, the actual numbers I got are 21 and 3.

Does anyone know why?

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
{1, 2, 3}

Here is my code:

    zinc_dataset = MoleculeDatasetDGL()
    node_set = set()
    edge_set = set()
    for data in zinc_dataset.train:
        g, label = data
        node_set = node_set.union( set(g.ndata['feat'].numpy()) )
        edge_set = edge_set.union( set(g.edata['feat'].numpy()) )
    print(node_set)
    print(edge_set)

tar cannot extract zip file!

Hi, when I run the following file
https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/data/superpixels/prepare_superpixels_MNIST.ipynb

the cell

if not os.path.isfile('superpixels.zip'):
    print('downloading..')
    !curl https://www.dropbox.com/s/y2qwa77a0fxem47/superpixels.zip?dl=1 -o superpixels.zip -J -L -k
    !tar -xvf superpixels.zip -C ../
else:
    print('File already downloaded')

fails with the error

tar: This does not look like a tar archive

Maybe we should change the command

!tar -xvf superpixels.zip -C ../

to

!unzip superpixels.zip -d ../

Thank you!
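
If a shell-independent alternative is preferred, the archive can also be unpacked with Python's standard `zipfile` module. The sketch below is self-contained: it builds a tiny placeholder archive first, and the helper name `extract_zip` and the file names inside the archive are made up.

```python
import os
import tempfile
import zipfile

def extract_zip(archive_path, dest_dir):
    """Extract a zip archive and return the names of its members."""
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} is not a zip archive")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Demo on a throwaway archive with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "superpixels.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("superpixels/readme.txt", "placeholder")
    names = extract_zip(archive_path, os.path.join(tmp, "out"))
    extracted_ok = os.path.isfile(os.path.join(tmp, "out", "superpixels", "readme.txt"))
```

The original `tar -xvf` call can work on systems whose `tar` is bsdtar, which reads zip archives, while GNU tar does not; that may explain why the cell works for some users and not for others.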

Can you provide the SMILES with labels for the ZINC dataset?

Thank you for your work in providing a standard repository for graph benchmarking

Some applications might require using the SMILES to build different types of graphs than the one provided by the benchmarking platform. I know that the ZINC dataset comes from the JT-VAE paper, where the SMILES are provided. However, this paper uses only a subset of the original dataset, and the train-val-test split is different.

I tried going from the DGLGraph back to SMILES, but it is not possible since I don't know which node label corresponds to which atom.

DiffPool Layer & Net

The version 2 update commit removes the DiffPool layer and net implementation. What happened to this architecture?

Results fluctuate heavily

Hi, thanks for creating this awesome project. However, I've encountered some problems reproducing the results presented in the paper and the leaderboard. Generally speaking, results are very unstable in my experiments.

1. Results fluctuate heavily.

On the ZINC dataset, here are my results for the GAT model:

  • Trial 1: Mean test MAE = 0.38158. Per seed (95, 41, 12, 35; same order below): 0.3972, 0.3788, 0.3800, 0.3703
  • Trial 2: Mean test MAE = 0.37115. Per seed: 0.3636, 0.3785, 0.3769, 0.3656

My results for the GCN model:

  • Trial 1: Mean test MAE = 0.37575. Per seed: 0.3770, 0.3659, 0.3742, 0.3859
  • Trial 2: Mean test MAE = 0.38070. Per seed: 0.3946, 0.3668, 0.3764, 0.3850

I don't think we can neglect these fluctuations, since the differences are sometimes around 0.01, and some recent SOTA models outperform the best model in the benchmark paper by only about 0.01.

It seems that setting random seeds does not really work in this case. Are there any other requirements for reproduction? Do you think such fluctuation is tolerable?

2. I cannot reproduce results using checkpoints.

After training a GCN model on the ZINC dataset, I reloaded the latest checkpoint and tried to reproduce the test MAE through the evaluate_network_sparse function. Unfortunately, the test MAE from the checkpoint (0.3670) differs from the test MAE in the result file (0.3757). Can results be reproduced from checkpoints? Could you please provide the code of your testing pipeline, in case I missed some important steps?

Note: I changed readout to 'sum'. Otherwise, I didn't modify any hyper-parameters.
My experiments were conducted on a Quadro P6000.
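
To put the spread in numbers, the per-seed GAT values quoted above can be summarized as mean and sample standard deviation. This is only arithmetic on the figures reported in this issue; the variable names are arbitrary.

```python
import statistics

gat_runs = {
    "trial 1": [0.3972, 0.3788, 0.3800, 0.3703],
    "trial 2": [0.3636, 0.3785, 0.3769, 0.3656],
}

# (mean, sample standard deviation) of the test MAE over the four seeds.
summary = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in gat_runs.items()
}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.5f} std={std:.5f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a gap of about 0.01 between two models lies within seed-to-seed noise.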

Setting seed does not work

Hi, thank you for sharing the project with us.
I'm confused: although I fixed the random seed as your config file does, I still get different results between runs, just as if no seed were set.
Does fixing the seed ensure reproducibility, i.e., should I get the same accuracy every time?
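
A sketch of seeding every random number generator a typical PyTorch training script touches. `seed_everything` is a hypothetical helper, not the repository's function, and the torch part is skipped if torch is not installed. Even with all of this, some GPU kernels may be nondeterministic, so bit-identical runs are not guaranteed.

```python
import random

import numpy as np

def seed_everything(seed):
    """Seed Python, NumPy and (if available) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic algorithms and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(41)
first = (random.random(), np.random.rand())
seed_everything(41)
second = (random.random(), np.random.rand())
print(first == second)
```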

GraphSage reproducibility

Hi, I am trying to reproduce the results on the ZINC dataset with GraphSAGE but I keep getting test MAE values around 0.58 instead of the 0.43 reported in the paper, with all settings at default (L=4 layers). I am also getting a slightly different number of parameters (104167 instead of 105301). The other networks work fine.

Thanks!

SBM_CLUSTER and SBM_PATTERN unable to unpack

Hi, I think the code (or the dataset, I'm not sure) for loading the node classification datasets SBM_CLUSTER and SBM_PATTERN is buggy. I got a "too many values to unpack" error when I tried to reproduce the results for SBM. Below is the log:

(benchmark_gnn) -bash-4.2$ python main_SBMs_node_classification.py --dataset SBM_CLUSTER --gpu_id 0 --seed 41 --config 'configs/SBMs_node_clustering_MLP_CLUSTER.json'
Using backend: pytorch
cuda available with GPU: TITAN X (Pascal)
[I] Loading dataset SBM_CLUSTER...
Traceback (most recent call last):
  File "main_SBMs_node_classification.py", line 409, in <module>
    main()
  File "main_SBMs_node_classification.py", line 307, in main
    dataset = LoadData(DATASET_NAME)
  File "/afs/ece.cmu.edu/usr/yaoj/Desktop/benchmarking-gnns/data/data.py", line 33, in LoadData
    return SBMsDataset(DATASET_NAME)
  File "/afs/ece.cmu.edu/usr/yaoj/Desktop/benchmarking-gnns/data/SBMs.py", line 132, in __init__
    f = pickle.load(f)
  File "/afs/ece.cmu.edu/usr/yaoj/anaconda3/envs/benchmark_gnn/lib/python3.7/site-packages/dgl/graph_index.py", line 53, in __setstate__
    num_nodes, readonly, src, dst = state
ValueError: too many values to unpack (expected 4)

Question about converting super-pixel image as a graph

I am reading superpixels_visualization_mnist.ipynb and am confused about the function compute_edges_list(A, kth=8+1):

knns = np.argpartition(A, new_kth-1, axis=-1)[:, new_kth:-1]

We ignore the last column in that case. I think it is better to use

knns = np.argpartition(A, new_kth, axis=-1)[:, new_kth+1:].

Let me know if I am wrong. Thank you very much!

About adj matrix and g.edata['feat']

Hi, first I would like to thank all contributors for such awesome work!
I'm not familiar with the "dgl" module, so I'm confused about parts of the code.
I think the adjacency matrix should take part in the computation, but I couldn't find it in the code.
Also, I'm not sure about the meaning of g.edata['feat']. For example, what does it stand for in the SBM datasets?
Thanks in advance for your reply.
Have a nice day!

Enhancement, adding a model

Hi!

I have a fork of the repo where I made the corresponding changes to add the ChebNet. Would you be interested to integrate it if I make a pull request for that or do you prefer keeping the repo as your own code production for the time being?

Best,
Axel

The question about directed and undirected graphs

Hi, I looked at this part of your code:

if self.use_mean_px:
    A = compute_adjacency_matrix_images(coord, mean_px) # using super-pixel locations + features
else:
    A = compute_adjacency_matrix_images(coord, mean_px, False) # using only super-pixel locations
edges_list, edge_values_list = compute_edges_list(A) # NEW
N_nodes = A.shape[0]
mean_px = mean_px.reshape(N_nodes, -1)
coord = coord.reshape(N_nodes, 2)
x = np.concatenate((mean_px, coord), axis=1)
edge_values_list = edge_values_list.reshape(-1) # NEW # TO DOUBLE-CHECK !
self.node_features.append(x)
self.edge_features.append(edge_values_list) # NEW
self.Adj_matrices.append(A)
self.edges_lists.append(edges_list)

for index in range(len(self.sp_data)):
    g = dgl.DGLGraph()
    g.add_nodes(self.node_features[index].shape[0])
    g.ndata['feat'] = torch.Tensor(self.node_features[index]).half()
    for src, dsts in enumerate(self.edges_lists[index]):
        # handling for 1 node where the self loop would be the only edge
        # since, VOC Superpixels has few samples (5 samples) with only 1 node
        if self.node_features[index].shape[0] == 1:
            g.add_edges(src, dsts)
        else:
            g.add_edges(src, dsts[dsts!=src])

It seems to me that the superpixel graph is a directed graph, not an undirected one. Is that right?
In Figure 1 of your paper, MNIST and CIFAR10 are drawn as undirected graphs, but based on your code they should be directed?
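
A toy illustration of why a k-nearest-neighbor edge list is directed in general: "j is among the k nearest neighbors of i" does not imply the reverse. The 1-D points and the helper `knn_edges` are made up for this sketch and are unrelated to the repository's data.

```python
import numpy as np

def knn_edges(points, k):
    """Directed edges (src, dst) from each point to its k nearest neighbors."""
    dist = np.abs(points[:, None] - points[None, :])
    np.fill_diagonal(dist, np.inf)  # exclude self loops
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return {(src, int(dst)) for src in range(len(points)) for dst in neighbors[src]}

points = np.array([0.0, 1.0, 3.0])
edges = knn_edges(points, k=1)

# Edges whose reverse is missing, and the symmetrized (undirected) edge set.
one_way = {(s, d) for (s, d) in edges if (d, s) not in edges}
undirected = edges | {(d, s) for (s, d) in edges}
print(sorted(edges), sorted(one_way))
```

Here point 2 picks point 1 as its nearest neighbor, but point 1 picks point 0, so the edge 2 → 1 has no counterpart unless the edge set is explicitly symmetrized.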

Where is the implementation of `Graph Size Normalization`?

This paper claimed that the introduction of so-called graph size normalization can boost performance in various graph machine learning tasks. But, I can't find the implementation of this simple and effective normalization. Therefore, could anyone point out the exact place of this implementation?
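
The thread does not say where this lives in the repository. As a reading aid only, here is a sketch under the assumption that the normalization scales every node feature vector by 1/sqrt(number of nodes in its graph); that formula is an assumption on my part, and `graph_size_norm` is a hypothetical name.

```python
import numpy as np

def graph_size_norm(h):
    """Scale node features of one graph by 1 / sqrt(number of nodes). Assumed formula."""
    num_nodes = h.shape[0]
    return h / np.sqrt(num_nodes)

h = np.ones((4, 2))          # 4 nodes with 2 features each
normed = graph_size_norm(h)
print(normed)
```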

Issue with loading SBM_PATTERN

Hi,

When I was loading the SBM_PATTERN dataset, I got the following error:

  File "main_SBMs_node_classification.py", line 445, in <module>
    main()
  File "main_SBMs_node_classification.py", line 333, in main
    dataset = LoadData(DATASET_NAME)
  File "/mnt/DISK_10T/zhanghm/Benchmarking_gnn/benchmarking-gnns/data/data.py", line 36, in LoadData
    return SBMsDataset(DATASET_NAME)
  File "/mnt/DISK_10T/zhanghm/Benchmarking_gnn/benchmarking-gnns/data/SBMs.py", line 161, in __init__
    with open(data_dir+name+'.pkl',"rb") as f:
AttributeError: Can't get attribute 'HeteroPickleStates' on <module 'dgl.heterograph_index' from '/home/zhanghm/anaconda3/envs/benchmark_gnn/lib/python3.7/site-packages/dgl/heterograph_index.py'>

For the remaining datasets, I can successfully load them.

Could assistance be provided to help with this issue?

Thanks,

Inconsistent performance by setting dgl_builtin=True in GCNLayer

Hi,

Thank you for the great work! This work is really wonderful. When I try to use GCN model for node classification by running:
python main_SBMs_node_classification.py --dataset SBM_PATTERN --gpu_id 0 --seed 41 --config 'configs/SBMs_node_clustering_GCN_PATTERN_100k.json'
I found that when I set dgl_builtin to false, the test acc is 63.77, which is consistent with the results reported in the paper; however, when I set dgl_builtin to true, the test acc became 85.56.

I do not think this behavior is normal. But I did not figure out why the performances are so different after struggling for some time. I would appreciate it if you could help me. Thank you! Have a nice day!

Best,
Yongcheng

Please give more details about the provenance of datasets

From your paper and GitHub, it is very difficult to understand what we are predicting and where the datasets come from.

For ZINC, you mention "constrained solubility", but I can't find any reference to it in the ZINC dataset that your paper references. It is not clear whether it is a computed property or a measured one, and what method is used to measure/compute it. Can you state the name of the property more clearly, and make the ZINC IDs available so they can be checked? Additionally, ZINC has 230 million molecules, but you only use 12,000. How did you select the ones to include?

For CIFAR10 and MNIST, you do not mention how the images were clustered into superpixels, what method was used and what is the average resolution of the resulting image.

For PATTERN and CLUSTER, it is not mentioned what are the patterns that we are looking to find, what is the average degree of the graphs, what is the diameter distribution, what is the diameter and degree of the patterns, etc.

These pieces of information are useful to evaluate if the performance of models is satisfactory or not. I feel that the current description leaves us blind to truly understand why certain networks are better than others and make the benchmarking of GNNs more about "beating the benchmarks" than increasing the discriminative abilities.

Thank you, and great work on the paper, it was really needed in the GNN community

Question on SBM-Cluster

Hi,

Thank you for sharing the project! I have a question on SBM-Cluster. It is a semi-supervised node classification task, and the paper says that a single label is used for each community (for training?), but in the code it seems that all labels are used for each training graph. Did I misunderstand something?

Please add a license to this repo

Thank you for sharing this repo with us!

Could you please add an explicit LICENSE file to the repo so that it's clear
under what terms the content is provided, and under what terms user
contributions are licensed?

Per GitHub docs on licensing:

[...] without a license, the default copyright laws apply, meaning that you
retain all rights to your source code and no one may reproduce, distribute,
or create derivative works from your work. If you're creating an open source
project, we strongly encourage you to include an open source license.

Thanks!

implementation using Pytorch geometric

Hi,

Thanks for the great repo and paper.

I'm interested in the CLUSTER and PATTERN experiments, and I want to reproduce them using pytorch-geometric, but I'm having a hard time doing this, even though I'm using the same loss and accuracy functions and tried to use the same hyperparameters.

Is there any pytorch-geometric implementation of these experiments? If so, can you please point me to it? If not, can you give me general hints on how to reproduce them?

Thank you in advance!

pickle.load error

Hi, I encountered an error when I loaded the MNIST.pkl and CIFAR10.pkl

<stdout>:cuda available with GPU: Tesla P100-PCIE-16GB
<stdout>:[I] Loading dataset CIFAR10...
<stderr>:Traceback (most recent call last):
<stderr>:  File "tools/train_superpixels_graph_classification.py", line 444, in <module>
<stderr>:    main() 
<stderr>:  File "tools/train_superpixels_graph_classification.py", line 325, in main
<stderr>:    dataset = LoadData(DATASET_NAME, data_dir=args.dataDir)
<stderr>:  File "/var/storage/shared/nextmsra/sys/jobs/application_1583577754071_28152/gnn/data/data.py", line 18, in LoadData
<stderr>:    return SuperPixDataset(DATASET_NAME, data_dir)
<stderr>:  File "/var/storage/shared/nextmsra/sys/jobs/application_1583577754071_28152/gnn/data/superpixels.py", line 271, in __init__
<stderr>:    f = pickle.load(f)
<stderr>:  File "/tmp/cache/python/lib/python3.7/site-packages/dgl/graph_index.py", line 53, in __setstate__
<stderr>:    num_nodes, readonly, src, dst = state
<stderr>:ValueError: too many values to unpack (expected 4)

Here is my environment.
cuda version :10.0
pytorch version : pytorch 1.4 with cudatoolkit 10.0
dgl version: v0.4.3

I wonder if the new dgl version causes the inconsistency,
because everything runs well with dgl v0.4.2, but I can't find how to explicitly downgrade my dgl to v0.4.2.
Thanks for solving this :)
Alternatively, could you publish the code that generates the dgl graphs from the CIFAR and MNIST datasets?
Thank you
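
Since the pickled graphs appear to be tied to a specific dgl version (the reporter loads them fine with v0.4.2 but not v0.4.3), a quick guard before loading can make the mismatch obvious. The helpers `installed_version` and `check_version` are hypothetical, and the expected version is taken from the report above, not from the repository.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_version(package, expected):
    """Return a warning message on a mismatch, or None if the versions agree."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found != expected:
        return f"{package} {found} is installed, but {expected} was expected"
    return None

print(check_version("dgl", "0.4.2"))
```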

How to open atom_dict.pickle of ZINC dataset?

Hi. First of all, thank you for your great work establishing a benchmark for GNNs.

We are trying to visualize data from the ZINC dataset using rdkit.

However, we could not find the proper atom type dictionary for the ZINC data.

We think that the file named atom_dict.pickle contains it, but we cannot open it with pickle because it raises the following error:

AttributeError: Can't get attribute 'Dictionary' on <module 'main' from 'open_pickle.py'>

So, could you share how to open the atom_dict.pickle file, or share the dictionary of ZINC atoms? We would be very grateful.

Thank you.
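
This kind of `AttributeError` generally means the pickle references a class (`Dictionary` here) that is not defined in the module doing the loading. Defining a class with that name before calling `pickle.load` is one option; another, sketched below, is an unpickler that substitutes a generic placeholder for any class it cannot find. `Placeholder`, `LenientUnpickler`, the demo module, and the `word2idx` attribute are all hypothetical, and the real layout of `atom_dict.pickle` is unknown to me.

```python
import io
import pickle
import sys
import types

class Placeholder:
    """Stand-in for a class whose definition is unavailable at load time."""
    def __setstate__(self, state):
        self.__dict__.update(state if isinstance(state, dict) else {"state": state})

class LenientUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        try:
            return super().find_class(module, name)
        except (AttributeError, ImportError):
            return Placeholder  # class or module not found: fall back

def lenient_load(path):
    """Load a trusted pickle file, tolerating missing class definitions."""
    with open(path, "rb") as f:
        return LenientUnpickler(f).load()

# Demo: pickle an object whose class lives in a throwaway module, then remove the module.
demo_module = types.ModuleType("demo_missing_module")
demo_module.Dictionary = type("Dictionary", (), {"__module__": "demo_missing_module"})
sys.modules["demo_missing_module"] = demo_module
obj = demo_module.Dictionary()
obj.word2idx = {"C": 0, "O": 1}  # made-up attribute for the demo
data = pickle.dumps(obj)
del sys.modules["demo_missing_module"]

restored = LenientUnpickler(io.BytesIO(data)).load()
print(vars(restored))
```

Inspecting `vars(...)` of the restored object shows which attributes the original class stored, which is usually enough to recover a mapping. As always, only unpickle files from a source you trust.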

Package conflicts on Windows 10

I tried installing the environment as specified in the guide, but I get a lot of package conflicts when running conda env create -f environment_cpu.yml. Environment details: OS: Windows 10, no GPU. I don't get the same conflicts when running it on a Linux VM. Any idea why?

f = pickle.load(f) ValueError: too many values to unpack (expected 4)

f = pickle.load(f)

This error occurred when I ran the following code on the command line:
!python /content/main_superpixels_graph_classification.py --dataset MNIST --gpu_id 0 --config '/content/configs/superpixels_graph_classification_GatedGCN_MNIST_100k.json' # for GPU

When I change the code to the following, there is still the same error
self.train, self.val, self.test = pickle.load(f)

How to solve this error? thanks!

Question regarding adding custom molecules dataset

Hello! Thank you for sharing this project!

I have a question concerning adding new datasets.

I have followed the pipeline described in docs to prepare a dataset for molecular regression. My dataset consists of SMILES strings and their respective scores so I'm using MoleculeCSVDataset method from the DGL library to convert molecules to graphs. I'm also using CanonicalAtomFeaturizer and CanonicalBondFeaturizer for extracting node and edge features respectively. In the end I get a feature matrix g.ndata['feat'] of size n x 74 and g.edata['feat'] of size e x 12 for each graph. However, I get shape mismatches during the forward pass.

Is my understanding of the featurization process correct or should it be done differently?

Many thanks in advance and apologies for the trivial question.

Node feat dim of ZINC dataset

Hello, thank you for this wonderful work.
I printed the size of batch_x in "train_molecules_graph_regression.py", and it is [batch_num, 1]. But in Table 10 of the paper, the dimension is 28 for the ZINC dataset. Is there anything wrong with this dataset? Looking forward to your reply, thanks! :)
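
One way both numbers can be consistent, stated as an assumption and not confirmed in this thread: each node stores a single integer atom-type index, and 28 is the size of the dictionary those indices are drawn from, which an embedding layer then maps to dense vectors. The NumPy sketch below mimics such a lookup; `hidden_dim` and the sample indices are arbitrary.

```python
import numpy as np

num_atom_type = 28   # dictionary size quoted in the question
hidden_dim = 8       # arbitrary size for this sketch
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(num_atom_type, hidden_dim))

# Integer atom types, one per node: shape [num_nodes, 1].
batch_x = np.array([[0], [5], [5], [20]])

# Embedding lookup: each index selects one row of the table.
h = embedding_table[batch_x.squeeze(-1)]
print(batch_x.shape, h.shape)
```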

graphsage parameters

In GraphSage, at each layer each node updates its embedding using a fixed number (S_i) of sampled neighbors. Is it possible in this framework to choose the S_i parameters for the 2 layers in GraphSage? I don't see how.
Thank you anyway

GatedGCN version

I am a newbie. Is the GatedGCN in the layers folder the GatedGCN-E version? And what is the difference between them?

Adjacency Matrix for SBM_Cluster Training

From my understanding, you randomly generated 10k graphs of random sizes for training.
For training with a citation network, a single graph A is passed into the GNN.
For training with SBM_Cluster, do you combine these randomly generated graphs (A_{i}) into a big diagonal block matrix? (like how we aggregate graphs for a graph classification problem, but without the pooling layer?)
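
The block-diagonal picture described in the question can be sketched as follows. Whether the repository batches graphs exactly this way is not confirmed in this thread; `block_diagonal` and the two toy adjacency matrices are illustrative only.

```python
import numpy as np

def block_diagonal(adjs):
    """Stack adjacency matrices along the diagonal of one larger matrix."""
    total = sum(a.shape[0] for a in adjs)
    out = np.zeros((total, total), dtype=adjs[0].dtype)
    offset = 0
    for a in adjs:
        n = a.shape[0]
        out[offset:offset + n, offset:offset + n] = a
        offset += n
    return out

A1 = np.array([[0, 1], [1, 0]])
A2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
batched = block_diagonal([A1, A2])
print(batched)
```

Because the off-diagonal blocks are zero, message passing on the batched matrix never mixes nodes of different graphs, so for node classification the per-node outputs can be used directly without a pooling step.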

cleaner_main error, add jupyter_contrib_nbextensions to requirements

Running the notebook 'main_molecules_graph_regression.ipynb' as is, you get the error "CalledProcessError: Command 'jupyter nbconvert --to script main_molecules_graph_regression.ipynb' returned non-zero exit status 1." when the last cell runs the utils/cleaner_main script, specifically when it calls subprocess.check_output. The actual error is masked by check_output; in my case I needed to install the "jupyter_contrib_nbextensions" library.
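
To see the underlying error instead of only the exit status, the command can be run with its stderr captured and printed. This is a generic sketch using the standard library; `run_and_report` is a hypothetical helper, and the failing child command below merely simulates a tool that exits with an error message.

```python
import subprocess
import sys

def run_and_report(cmd):
    """Run a command, and print its stderr if it exits with a non-zero status."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result

# Simulated failure: the child process writes a message to stderr and exits with status 1.
result = run_and_report([sys.executable, "-c", "import sys; sys.exit('missing dependency')"])
```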

Contribution: OpenBioLink

Hi,

We are the publishers of OpenBioLink, a large-scale biomedical heterogeneous knowledge graph consisting of over 5 million edges. Would you be interested in OpenBioLink as a contribution to the existing datasets?

Question about 'z' variable

Hi! Thanks for this great work!

I was just curious as to what the 'z' variable in line 59 of graph_transformer_layer.py is. I cannot seem to find this variable in the paper. It seems to divide the output of the multi-head attention. Is there a specific reason why this is necessary?

Thanks :)
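
A possible reading, stated as an assumption since line 59 is not shown in this thread: if `z` is the per-node sum of the unnormalized attention weights, then dividing the weighted sum of values by `z` is simply the softmax denominator applied in a second step. The NumPy sketch below checks that equivalence on random numbers; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=5)        # raw attention scores for 5 neighbors
values = rng.normal(size=(5, 3))   # value vectors of those neighbors

# Two-step form: sum the exp-weighted values, then divide by the sum of weights.
weights = np.exp(scores)
wV = (weights[:, None] * values).sum(axis=0)
z = weights.sum()
two_step = wV / z

# One-step form: softmax over the scores, then a weighted average of the values.
softmax = weights / weights.sum()
one_step = softmax @ values
print(np.allclose(two_step, one_step))
```

Splitting the softmax this way fits message-passing code, where the numerator and the denominator can each be computed as a sum over incoming edges.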

Node features for CLUSTER dataset

Hi, thank you for providing a great benchmark!

I have a question about the CLUSTER dataset. As stated in the paper, the input attributes for all nodes are zero except for the 6 "seed" nodes which have one-hot input features [1,2,3,4,5,6]. Are these the only node attributes that are fed into the GNNs? I wonder if non-zero node attributes are used/needed for the non-seed nodes?

Thanks!

SBM_PATTERN data changed?

Hi, thank you for your wonderful work.
I have been using the ZINC, SBM_CLUSTER, and SBM_PATTERN datasets since last February. Today I found that the paper has been updated substantially, so I looked through the datasets again.
I noticed that the ZINC and SBM_CLUSTER pkl file links are unchanged, but the SBM_PATTERN link has changed. Is it still safe to use the SBM_PATTERN pkl file from February?
Or could you tell me what has changed since then?
