
snap-stanford / ogb

1.9K stars · 41 watchers · 395 forks · 4.34 MB

Benchmark datasets, data loaders, and evaluators for graph machine learning

Home Page: https://ogb.stanford.edu

License: MIT License

Python 100.00%
graph-machine-learning graph-neural-networks deep-learning datasets

ogb's Introduction



Overview

The Open Graph Benchmark (OGB) is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Datasets cover a variety of graph machine learning tasks and real-world applications. The OGB data loaders are fully compatible with popular graph deep learning frameworks, including PyTorch Geometric and Deep Graph Library (DGL). They provide automatic dataset downloading, standardized dataset splits, and unified performance evaluation.

OGB aims to provide graph datasets that cover important graph machine learning tasks, diverse dataset scale, and rich domains.

Graph ML Tasks: We cover three fundamental graph machine learning tasks: prediction at the level of nodes, links, and graphs.

Diverse scale: Small-scale graph datasets can be processed within a single GPU, while medium- and large-scale graphs might require multiple GPUs or clever sampling/partition techniques.

Rich domains: Graph datasets come from diverse domains ranging from scientific ones to social/information networks, and also include heterogeneous knowledge graphs.

OGB is an ongoing effort, and we are planning to increase our coverage in the future.

Installation

You can install OGB using Python's package manager pip. If you have previously installed ogb, please make sure you update to version 1.3.6. The release notes are available here.

Requirements

  • Python>=3.6
  • PyTorch>=1.6
  • DGL>=0.5.0 or torch-geometric>=2.0.2
  • Numpy>=1.16.0
  • pandas>=0.24.0
  • urllib3>=1.24.0
  • scikit-learn>=0.20.0
  • outdated>=0.2.0

Pip install

The recommended way to install OGB is using Python's package manager pip:

pip install ogb
python -c "import ogb; print(ogb.__version__)"
# This should print "1.3.6". Otherwise, please update the version by
pip install -U ogb

From source

You can also install OGB from source. This is recommended if you want to contribute to OGB.

git clone https://github.com/snap-stanford/ogb
cd ogb
pip install -e .

Package Usage

We highlight two key features of OGB, namely, (1) easy-to-use data loaders, and (2) standardized evaluators.

(1) Data loaders

We prepare easy-to-use PyTorch Geometric and DGL data loaders that handle dataset downloading as well as standardized dataset splitting. Below, using PyTorch Geometric, we see that a few lines of code are sufficient to prepare and split the dataset! Needless to say, you can enjoy the same convenience with DGL!

from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.loader import DataLoader

# Download and process data at './dataset/ogbg_molhiv/'
dataset = PygGraphPropPredDataset(name = 'ogbg-molhiv')

split_idx = dataset.get_idx_split() 
train_loader = DataLoader(dataset[split_idx['train']], batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=32, shuffle=False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=32, shuffle=False)
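
For context, here is a minimal sketch of how these loaders are typically consumed in a training loop. The TinyGNN model below is purely illustrative (it is not the reference OGB baseline); only the dataset and train_loader objects come from the snippet above.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool
from ogb.graphproppred.mol_encoder import AtomEncoder

# Illustrative toy model: encode atoms, one GCN layer, mean-pool, linear head.
class TinyGNN(torch.nn.Module):
    def __init__(self, hidden_dim=64, num_tasks=1):
        super().__init__()
        self.atom_encoder = AtomEncoder(hidden_dim)
        self.conv = GCNConv(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, num_tasks)

    def forward(self, data):
        x = self.atom_encoder(data.x)
        x = F.relu(self.conv(x, data.edge_index))
        x = global_mean_pool(x, data.batch)
        return self.head(x)

model = TinyGNN(num_tasks=dataset.num_tasks)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for batch in train_loader:
    optimizer.zero_grad()
    out = model(batch)
    # ogbg-molhiv is single-task binary classification; labels have shape [batch_size, 1].
    loss = F.binary_cross_entropy_with_logits(out, batch.y.float())
    loss.backward()
    optimizer.step()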

(2) Evaluators

We also prepare standardized evaluators for easy evaluation and comparison of different methods. The evaluator takes input_dict (a dictionary whose format is specified in evaluator.expected_input_format) as input, and returns a dictionary storing the performance metric appropriate for the given dataset. The standardized evaluation protocol allows researchers to reliably compare their methods.

from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molhiv')
# You can learn the input and output format specification of the evaluator as follows.
# print(evaluator.expected_input_format) 
# print(evaluator.expected_output_format) 
input_dict = {'y_true': y_true, 'y_pred': y_pred}
result_dict = evaluator.eval(input_dict) # E.g., {'rocauc': 0.7321}
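
As a self-contained illustration (the arrays below are random placeholders, not real model outputs), the ogbg-molhiv evaluator expects y_true and y_pred of shape (num_graphs, num_tasks):

import numpy as np
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molhiv')

# Random placeholder labels and scores for 100 graphs and 1 task, for illustration only.
y_true = np.random.randint(0, 2, size=(100, 1))
y_pred = np.random.rand(100, 1)

result_dict = evaluator.eval({'y_true': y_true, 'y_pred': y_pred})
print(result_dict)  # e.g. {'rocauc': 0.51}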

Citing OGB / OGB-LSC

If you use OGB or OGB-LSC datasets in your work, please cite our papers (Bibtex below).

@article{hu2020ogb,
  title={Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure},
  journal={arXiv preprint arXiv:2005.00687},
  year={2020}
}
@article{hu2021ogblsc,
  title={OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Ren, Hongyu and Nakata, Maho and Dong, Yuxiao and Leskovec, Jure},
  journal={arXiv preprint arXiv:2103.09430},
  year={2021}
}

ogb's People

Contributors

asarigun, barcavin, bowenliu16, danielegrattarola, dongkwan-kim, epsilon-deltta, hongbo-miao, hyren, jamesmyatt, jqmcginnis, kh4l, lucacappelletti94, lukecavabarrett, lukelin-web, mberr, milesial, puririshi98, rusty1s, sangyx, skepsun, taolbr1993, tempoxylophone, v-shaoningli, weihua916, yelrose


ogb's Issues

Platform-agnostic data loaders?

Are there plans for providing a vanilla numpy array loading mechanism for these datasets? This would open up usage of OGB for platforms like TensorFlow or JAX.
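
For reference, OGB also ships framework-agnostic dataset classes (e.g. GraphPropPredDataset, NodePropPredDataset) that return plain numpy arrays, which can be fed into TensorFlow or JAX pipelines. A minimal sketch:

from ogb.graphproppred import GraphPropPredDataset

# The library-agnostic loader: no PyTorch Geometric or DGL required.
dataset = GraphPropPredDataset(name='ogbg-molhiv')
split_idx = dataset.get_idx_split()

graph, label = dataset[0]  # graph is a dict of numpy arrays
print(graph.keys())        # e.g. edge_index, edge_feat, node_feat, num_nodes
print(graph['edge_index'].shape)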

Performances of MLP and standard GNNs on graphproppred

In benchmarking a few different approaches on graphproppred datasets, I got some surprising results which I thought might be interesting to others.

I'm not sure they best fit here (if not, please tell me), but it's probably the first spot people will look when first training models on OGB.

Specifically,

  1. Naively incorporating bond information does not appear to increase performance on any task except one (BACE).
  2. A simple MLP is surprisingly competitive with GNN approaches on most datasets, even outperforming them on one (BBBPTask).

Performance details can be found in the following table (showing ROC-AUC as computed by the OGB evaluator). Note that PCBATask is still being evaluated, having finished 80/100 runs for the GCNBlock and all runs for the MetaLayerBlock. PCBA runs are extremely slow at the moment, at ~3h/run.

| Model | BACETask | BBBPTask | CLINTOXTask | HIVTask | MUVTask | PCBATask | TOX21Task | TOXCASTTask |
|---|---|---|---|---|---|---|---|---|
| MetaLayerBlock | 0.748352 | 0.94185 | 0.99287 | 0.808189 | 0.795011 | 0.850489 | 0.781248 | 0.666357 |
| GCNBlock | 0.691575 | 0.918152 | 0.99463 | 0.789897 | 0.803885 | 0.848058 | 0.782434 | 0.681095 |
| GINBlock | 0.683883 | 0.930698 | 0.99754 | 0.822764 | 0.801272 | NaN | 0.775268 | 0.661059 |
| GIN0Block | 0.709341 | 0.923031 | 0.995917 | 0.790234 | 0.806313 | NaN | 0.778939 | 0.655804 |
| GraphSAGEBlock | 0.708791 | 0.933984 | 0.99498 | 0.793706 | 0.810183 | NaN | 0.782426 | 0.672336 |
| MLPBlock | 0.690659 | 0.953998 | 0.989598 | 0.76105 | 0.769725 | NaN | 0.770448 | 0.659917 |

And here's the ranking for each of the finished tasks, plus the total ranking. On average, all GNN methods perform similarly (mean ranks between 2.42 and 3.57).

| Ranking | BACETask | BBBPTask | CLINTOXTask | HIVTask | MUVTask | TOX21Task | TOXCASTTask | total |
|---|---|---|---|---|---|---|---|---|
| MetaLayerBlock | 1 | 2 | 5 | 2 | 5 | 3 | 3 | 3 |
| GCNBlock | 4 | 6 | 4 | 5 | 3 | 1 | 1 | 3.428571 |
| GINBlock | 6 | 4 | 1 | 1 | 4 | 5 | 4 | 3.571429 |
| GIN0Block | 2 | 5 | 2 | 4 | 2 | 4 | 6 | 3.571429 |
| GraphSAGEBlock | 3 | 3 | 3 | 3 | 1 | 2 | 2 | 2.428571 |
| MLPBlock | 5 | 1 | 6 | 6 | 6 | 6 | 5 | 5 |

That's not an exhaustive evaluation! Some details on my current setup:

  • I'm doing a random search over hyperparameters, but only for 100 parameter configs
  • I'm training for a fixed 100 epochs, with batch size 128 and Adam (lr 0.01); no weight decay or learning rate decay (both on my to-do list, as is early stopping)
  • Constructed models are all similar: A 100-dimensional bond and atom encoder (from OGB), followed by 1-5 layers with 16-128 units. These can use batchnorm, dropout, and jumping knowledge depending on the hyperparameters. After that, node features are globally mean-aggregated and a two-layer MLP outputs the final prediction.
  • The GIN models use two MLP layers with 16 or 32 units; on 8/14 task × (GIN, GIN0) combinations the best model uses 16 units.
  • The MetaLayer is based on the Graph Network paper (and, of course, PyTorch Geometric's implementation of it). Its edge model is a two-layer MLP based on edge features and adjacent node features, while the node model uses two two-layer MLPs: the first maps node and edge features for each adjacent node, which are then projected onto the current node (using mean aggregation), after which the second MLP is applied. It can use residual connections (with appropriate up/downscaling); that's also a hyperparameter.
  • The GIN result on TOX21 (0.775) fits pretty well with the graph prediction baseline @rusty1s added (0.761); differences in parameters and training suffice to explain that (128 vs 32 batch_size, his use of edge information, different learning rate)

Other interesting findings:

  • About half of the best-performing GNN/task combinations use batchnorm, almost exclusively on HIV, PCBA (those that have been evaluated), and TOX21/TOXCAST.
  • Only one best-performing model used dropout (MUV/MetaLayer, which doesn't perform well compared to other GNNs)
  • Depth varies wildly for the best-performing HP configs: 4 layers is the most common (13 in total), closely followed by 1 layer (11x); the others occur between 5 and 8 times.
  • 5/8 of the best MetaLayer parameter configs use the residual connection. Interestingly, those that don't also have 3, 4, and 5 layers, so residuals don't appear to be a necessity for deeper models.
  • On BACE, BBBP, MUV, and both TOX21 and TOXCAST, overfitting seems to be very easy. Regularization might dominate here.

I'm wondering whether you had similar experiences when training the benchmarking models.

ogbl-ppa version

Hi OGB team,
We found that ogbl-ppa has been updated. What is the difference between the new version and v4?

Separate leaderboard by transductive and inductive learning

Hi,

I noticed that GraphSAINT on OGB-products was running under an inductive learning setting, where the training graph only consists of training and validation nodes. However, ClusterGCN, full-batch GraphSAGE in the example implementations, and DGL's implementation of GraphSAGE all follow a transductive setting where the test nodes are included in the training graph as well (albeit without labels).

I think separating the two scenarios in different leaderboards would be fairer. What do you think?

Thanks.

SSL certificate verification problem

When running

from ogb.graphproppred import GraphPropPredDataset
dataset = GraphPropPredDataset(name='ogbg-molesol')

I receive the following error:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1108)

I can circumvent the error by inserting the following command at the top of graphproppred/dataset.py, but I don't think it is ideal:

ssl._create_default_https_context = ssl._create_unverified_context

Since this error is intermittent, it might be a problem on my side, although I have not encountered the error before when using other packages.
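
If editing the installed package is undesirable, a similarly hacky but more local workaround (untested here) is to override the default HTTPS context in your own script before triggering the download. Note that this still disables certificate verification, so it should only be a stopgap:

import ssl

# WARNING: disables certificate verification for downloads in this process.
ssl._create_default_https_context = ssl._create_unverified_context

from ogb.graphproppred import GraphPropPredDataset
dataset = GraphPropPredDataset(name='ogbg-molesol')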

OGB on DGL 0.5

Hi,
Is it possible to read OGB to a DGL graph in DGL 0.5?

Redundant self connections in ogbn-products

Dear OGB Team,

Thanks for providing these awesome datasets!

I noticed that for the ogbn-products dataset, there are some redundant self connections. For example, the edge (15064,15064) occurred twice in the edge_index array.

Here is a list of the row and column indices of the repeated edges. After removing the repeated edges, it gives me 123718152 (doubled) edges instead of the original 123718280 edges.

Best,
Tedzhouhk

The ogbn-arxiv dataset is not available?

Dear OGB team,

I am trying to process the "ogbn-arxiv" dataset; after creating a dataloader with
dataset = PygGraphPropPredDataset(name = "ogbn-arxiv"), I get the following response:

  File "test_ogb_data.py", line 4, in <module>
    dataset = PygGraphPropPredDataset(name = "ogbn-arxiv")
  File "/home/zhaohuan/anaconda3/lib/python3.7/site-packages/ogb/graphproppred/dataset_pyg.py", line 25, in __init__
    raise ValueError(error_mssg)
ValueError: Invalid dataset name ogbn-arxiv.
Available datasets are as follows:
ogbg-molbace
ogbg-molbbbp
ogbg-molclintox
ogbg-molmuv
ogbg-molpcba
ogbg-molsider
ogbg-moltox21
ogbg-moltoxcast
ogbg-molhiv
ogbg-molesol
ogbg-molfreesolv
ogbg-mollipo
ogbg-molchembl
ogbg-ppa

Can you give any suggestion for this problem? Thanks very much.
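
A likely cause, worth noting for other readers: ogbn-arxiv is a node property prediction dataset, so it lives in ogb.nodeproppred rather than ogb.graphproppred. A minimal sketch of loading it with the PyTorch Geometric variant:

from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name='ogbn-arxiv')
split_idx = dataset.get_idx_split()   # dict with 'train', 'valid', 'test' node indices
data = dataset[0]                     # the full graph as a single Data object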

Issues with DGL version of `ogbg-molpcba`

Hi @weihua916 and team, thanks for the cool work! I was playing around with graph classification on the molecular datasets, and encountered an error with ogbg-molpcba where the label provided with each graph is not a single value but a tensor instead (and has multiple nan values). Here is simple code to reproduce and demonstrate the differences between the DGL and PyG variants of the dataset:

(P.S. for comparison, I provide code to show that DGL and PyG variants of ogbg-molhiv are working as intended.)

import ogb
from ogb.graphproppred import PygGraphPropPredDataset, DglGraphPropPredDataset

# molhiv
dgl_dataset = DglGraphPropPredDataset(name = 'ogbg-molhiv')
pyg_dataset = PygGraphPropPredDataset(name = 'ogbg-molhiv')
print(dgl_dataset[0])
print(pyg_dataset[0])

# molpcba
dgl_dataset = DglGraphPropPredDataset(name = 'ogbg-molpcba')
pyg_dataset = PygGraphPropPredDataset(name = 'ogbg-molpcba')
print(dgl_dataset[0])
print(pyg_dataset[0])

For molhiv, the outputs are:

(DGLGraph(num_nodes=19, num_edges=40,
          ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)}),
 tensor([0]))

Data(edge_attr=[40, 3], edge_index=[2, 40], x=[19, 9], y=[1, 1])

For molpcba, the outputs are:

(DGLGraph(num_nodes=20, num_edges=44,
          ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)}),
 tensor([0., 0., nan, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., nan, 0.,
         0., 0., nan, 0., 0., 0., 0., 0., 0., 0., 0., 0., nan, 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., nan, 0., 0., 0., nan, 0., nan, 0., 0., nan, 0., 0., 0.,
         0., 0., 0., 0., nan, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., nan, 0., 0., 0., nan, 0., 0.,
         0., 0., 1., nan, 0., 0., nan, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan], dtype=torch.float64))

Data(edge_attr=[40, 3], edge_index=[2, 40], x=[19, 9], y=[1, 1])

Note that I am using the latest version of ogb (1.1.1).

Leaderboards for (unsupervised) node embedding

Hi,
great initiative !

Are you planning (or may I propose) to add a section on unsupervised node embedding?

This is a topic into which a lot of effort has gone, but I am not sure we have robust comparisons of the plethora of methods developed.

Custom datasets

When developing a new model, it's usually necessary to create some small, temporary, synthetic datasets for debugging, before experimenting with real datasets. What is the recommended way of connecting up our own custom datasets to the rest of OGB? For instance, should we create our own version of GraphPropPredDataset?

Automated code review for Open Graph Benchmark

Hi!
I am a member of the team developing monocodus — a service that performs automatic code review of GitHub pull requests to help organizations ensure a high quality of code.
We’ve developed some useful features to the moment, and now we’re looking for early users and feedback to find out what we should improve and which features the community needs the most.

We ran monocodus on a pre-created fork of your repo on GitHub https://github.com/monocodus-demonstrations/ogb/pulls, and it found some potential issues. I hope that this information will be useful to you and would be happy to receive any feedback here or on my email [email protected].

If you want to try our service, feel free to follow the link: https://www.monocodus.com
The service is entirely free of charge for open source projects. Hope you’ ll like it :)

ogbn_arxiv: invalid gradient error if the directed edges are not converted to undirected

Hi @rusty1s, for the ogbn_arxiv dataset, I commented out edge_index = to_undirected(edge_index, data.num_nodes) and set args.use_sage to True, and a runtime error occurred when loss.backward() was executed.

RuntimeError: Function torch::autograd::CppNode returned an invalid gradient at index 3 - got [169342, 256] but expected shape compatible with [169343, 256]

Any ideas about this? Thanks!

DGL Example for ogbn-proteins

Hi OGB team,

I've adapted the PyG example for ogbn-proteins to DGL. The implementation yields a similar performance:

| | PyG | DGL |
|---|---|---|
| Best train ROCAUC | 0.71 | 0.72 |
| Best validation ROCAUC | 0.68 | 0.71 |
| Best test ROCAUC | 0.65 | 0.67 |

On average the PyG example takes 510s for training per epoch and the DGL implementation takes 140s for training per epoch.

If you have not come up with a DGL example for ogbn-proteins, I'm willing to open a PR. Thanks.

[RFC] Add system metrics to the leaderboard

Hi OGB folks! With more and more GNN models being proposed and submitted to your leaderboard, I wonder whether you would like to add some system metrics too. From my experience, although some models are not as accurate as others, they can be significantly faster or less system-demanding. Adding metrics like training time or memory consumption could also motivate people to develop models that are cheaper and more practical (not always promoting GPT-X-style work). What do you think? If you think it's a good idea, I'm happy to contribute more thoughts on how to standardize the benchmarking process.

About Leaderboard Submission

Hi OGB team,

For leaderboard submission, will you also consider submissions accompanied by a short report rather than a full paper? There can be cases where people develop models based on existing approaches with slight modifications for a particular dataset. In such cases, there might not be enough novelty for a full paper.

ogbg-code: AST and DAGs

In the paper you write that the dataset consists of trees. Is it correct that, by adding the next_token edges (but no inverse edges), we obtain DAGs? Or is it possible that the AST order is different from the token sequence order so that there might be cycles?

Thank you for taking the benchmarking initiative, the datasets are great!

question on batch size in GraphSAINT

Hi,
I am trying to run the graphSAINT implementation in OGB (https://github.com/snap-stanford/ogb/blob/master/examples/nodeproppred/products/graph_saint.py) on the ogbn-products dataset and I wanted to clarify something about the batch size in this implementation.

In the code, the default batch size is 20000 and the function below uses this batch size (I do not pass any batch size in the command line).

loader = GraphSAINTRandomWalkSampler(sampler_data, batch_size=args.batch_size, walk_length=args.walk_length,
num_steps=args.num_steps, sample_coverage=0, save_dir=dataset.processed_dir)

However, in the train(model, loader, optimizer, device) function, in the for loop (for data in loader:), I am printing the information for each data object, and the output is attached below. It looks like each mini-batch contains ~75K nodes (not the 20K passed as the batch_size argument to GraphSAINTRandomWalkSampler). Could you please help me understand these two different values (75K and 20K)? Does the batch_size argument of GraphSAINTRandomWalkSampler not refer to how many nodes a mini-batch should contain? If so, what does it refer to?

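One plausible reading, assuming PyTorch Geometric's GraphSAINTRandomWalkSampler semantics: batch_size is the number of random-walk root nodes per sampled subgraph, not the number of nodes in the subgraph. Each walk contributes its root plus walk_length further nodes, so a subgraph can contain up to batch_size * (walk_length + 1) nodes before duplicates are merged, which would be consistent with the ~75K nodes observed. A sketch of that arithmetic (the walk_length value is assumed for illustration):

batch_size = 20000   # number of random-walk roots per subgraph (the sampler's batch_size)
walk_length = 3      # assumed walk length, for illustration

# Upper bound on nodes in one sampled subgraph; duplicates reduce the actual count.
max_nodes_per_subgraph = batch_size * (walk_length + 1)
print(max_nodes_per_subgraph)  # 80000, of which roughly 75K remain after deduplication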

Problem of using undirected graph in ogbn-arxiv, ogbn-papers100m and ogbn-mag

Hi guys,

I see the code on the leaderboard are using undirected graphs for ogbn-arxiv, ogbn-papers100m and ogbn-mag, and I have a question about it.

The nodes in the three datasets are split by time (year of publication); however, using an undirected graph may cause data leakage: we cannot predict the property of older papers using newer papers. So simply adding reverse edges to the full graph is not reasonable.

One possible way of handling this is to make papers in one year only see papers from the current year and previous years. Is that correct?

Many thanks.

full-batch vs sampling on products dataset

Hi,
For the full batch training (gnn.py), I notice that products dataset is used with T.ToSparseTensor(). In contrast, the sampling-based techniques (ClusterGCN and GraphSaint) do not involve converting products to sparse tensor format. Could you please help me understand this difference? Thank you!

Why is the result for GCN so low for ogbn-proteins?

The results for GCNs are suspiciously low, just by examination.

When I run it (with default settings), I got:

Highest Train: 82.77 ± 0.21
Highest Valid: 79.02 ± 0.30
  Final Train: 82.72 ± 0.17
   Final Test: 71.96 ± 0.59

Judging from the variance numbers reported, these numbers are very incongruous with the leaderboard result of 65.11 ± 1.52.

Just a thought: was it perhaps run with epochs=500? At that point, the default GCN seems to be around 65%.

num_layer not passed to GNN

Hi team,

Thanks for the nice work!

I found that the num_layer argument in the graph property prediction scripts is not passed to GNN. See line and line. This occurs in all three example tasks (code, ppa, and mol), which means tuning num_layers from the command line does not work as expected.

why cluster-gin was removed?

Hi OGB team, hope you are doing great! I noticed that you removed cluster-gin for the ogbn_proteins dataset two days ago; why is that? Thank you.

Replacing GCNConv as defined in ogbn-protein example with torch_geometric.nn.GCNConv causes error

Hi,
I wanted to know why the GCNConv provided with torch_geometric.nn does not work with the examples. Using the layer definition provided in the examples works fine. This is the error that I am getting. I would really appreciate it if someone could shed some light on this issue.

TypeError Traceback (most recent call last)
in
16 for epoch in range(1, 1 + args.epochs):
17
---> 18 loss = train(model, x, adj, y_true, train_idx, optimizer)
19
20 if epoch % args.eval_steps == 0:

in train(model, x, adj, y_true, train_idx, optimizer)
4
5 optimizer.zero_grad()
----> 6 out = model(x, adj)[train_idx]
7 loss = criterion(out, y_true[train_idx].to(torch.float))
8 loss.backward()

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)

in forward(self, x, adj)
7
8 def forward(self, x, adj):
----> 9 x = self.init(x,adj)
10 x = self.dense(x,adj)
11 x = self.out(x,adj)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)

/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/gcn_conv.py in forward(self, x, edge_index, edge_weight)
100 if self.normalize:
101 edge_index, norm = self.norm(edge_index, x.size(
--> 102 self.node_dim), edge_weight, self.improved, x.dtype)
103 else:
104 norm = edge_weight

/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/gcn_conv.py in norm(edge_index, num_nodes, edge_weight, improved, dtype)
71 if edge_weight is None:
72 edge_weight = torch.ones((edge_index.size(1), ), dtype=dtype,
---> 73 device=edge_index.device)
74
75 fill_value = 1 if not improved else 2

TypeError: ones() received an invalid combination of arguments - got (tuple, device=method, dtype=torch.dtype), but expected one of:

  • (tuple of ints size, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
  • (tuple of ints size, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)


Submitting new models or datasets to OGB leaderboard

Hi Team,

I'm wondering what is the procedure of submitting new models or datasets to the OGB leaderboard. Do I simply make pull requests? How would the leaderboard page be updated?

EDIT: I'm referring to #1 and #9 since I think both issues have the same concern about contributing new datasets to OGB.

Thanks.

Potential exploding gradient issue

Hi OGB team,

Thanks for the great work in creating the benchmark!

While working with ogbl-ppa for my own research, I find that the existing full-batch training for GCN frequently leads to a sudden performance drop with a sharply increased loss, which I suspect is due to exploding gradients. I managed to alleviate the issue with gradient clipping, i.e.,
clip_grad_norm_(model.parameters(), 1.0)
clip_grad_norm_(predictor.parameters(), 1.0)
The performance of GCN improves from ~11% (the same as in the paper) to ~18%.

I haven't gotten around to checking whether other datasets have the same problem, but I think it is better to be aware of the issue : )
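
For reference, a minimal, self-contained sketch of where the clipping calls sit relative to backward() and the optimizer step. The model and predictor here are toy stand-ins, not the actual GCN and link predictor from the example code:

import torch
from torch.nn.utils import clip_grad_norm_

# Toy stand-ins for the encoder and link predictor.
model = torch.nn.Linear(16, 16)
predictor = torch.nn.Linear(16, 1)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(predictor.parameters()), lr=0.01)

x = torch.randn(8, 16)
target = torch.randint(0, 2, (8, 1)).float()

optimizer.zero_grad()
loss = torch.nn.functional.binary_cross_entropy_with_logits(predictor(model(x)), target)
loss.backward()

# Clip gradients after backward() and before step() to dampen exploding gradients.
clip_grad_norm_(model.parameters(), 1.0)
clip_grad_norm_(predictor.parameters(), 1.0)

optimizer.step()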

[Remark] Dataset splitting section of ogbn-arxiv

Hello OGB community,

Really nice effort of standardizing the ML on Graphs with a benchmark!

As I read and play around with OGB, I noted that in the Dataset Splitting section for ogbn-arxiv you essentially describe the splitting used for the ogbn-products dataset again.

You might want to update it. :)

Best,
Makis

ogbl-ddi

solved. embedding.pt has to be generated first using Node2Vec.

ogbg-molchembl dataset fails due to default pickle_protocol

How to repro:

ds = GraphPropPredDataset('ogbg-molchembl', root='/tmp/ogb_datasets') fails at the torch.save step of the pre_process method, because pickle "cannot serialize a string larger than 4 gb".

What I've done:

I've tried setting torch.serialization.DEFAULT_PROTOCOL = 4 (which according to this adds support for large objects) before the call above, but this did not help -- I think it should be passed as an argument to torch.save.
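
If the default protocol is indeed the culprit, passing pickle_protocol=4 directly to torch.save should allow objects larger than 4 GB. A minimal sketch (the data_dict and path are placeholders):

import torch

data_dict = {'example': torch.zeros(3)}  # placeholder for the actual processed data

# Pickle protocol 4 (Python >= 3.4) supports serializing objects larger than 4 GB.
torch.save(data_dict, 'data_processed.pt', pickle_protocol=4)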

which version of ogbn-proteins dataset did you use in cluster_gin.py file?

Hello, OGB team, hope you are doing great! I just downloaded the example code for ogbn-proteins and ran cluster_gin.py. I found that you don't use node features, and that the node species information has been changed from the previous one-hot encoding (version 3) to a taxonomy ID. However, in the cluster_gin.py file, the statement cluster_data.data.x = cluster_data.data.x.to(torch.float) is incorrect because there is no attribute called x now. You can check this when you set the argument use_node_features to True. Another question: is it possible for us to use the one-hot encoding features provided previously? We have no idea what the taxonomy ID of each protein represents, nor whether the similarity between two proteins can be expressed by the difference between their taxonomy IDs. Thank you for replying in advance and have a good one!

OGB papers 100M required Hw

Hi,

In order to run the papers100M node prediction task, what are the CPU hardware requirements?

  1. DDR
  2. Disk space
  3. ...

RuntimeError while run full_batch.py

Hi,
I ran python full_batch.py on the ogbn_arxiv dataset, and a runtime error occurred. I have tried searching for a solution, but I still don't know why it happened. Has anyone run into the same problem and can offer some help? Thank you in advance.
Has anyone run into the same problem can give some help? Thank you in advance.

RuntimeError:

__init__(torch.torch_sparse.storage.SparseStorage self, Tensor? row, Tensor? rowptr, Tensor? col, Tensor? value, (int, int)? sparse_sizes, Tensor? rowcount, Tensor? colptr, Tensor? colcount, Tensor? csr2csc, Tensor? csc2csr, bool is_sorted) -> (None):
Expected a value of type 'Optional[Tensor]' for argument 'row' but instead found type 'int'.
:
File "/home/Evan/PyEnv_3.7/lib/python3.7/site-packages/torch_sparse/storage.py", line 283
col = idx % num_cols

    return SparseStorage(row=row, rowptr=None, col=col, value=self._value,
           ~~~~~~~~~~~~~ <--- HERE
                         sparse_sizes=(num_rows, num_cols), rowcount=None,
                         colptr=None, colcount=None, csr2csc=None,

More protein information for ogbl-ppa

Hi OGB team,

Thank you for the great work. I'm working on ogbl-ppa, and I'm wondering if you can share information (such as protein display names, sequences, etc.) on the proteins used in constructing ogbl-ppa? That way, the dataset would be interesting to a broader community, such as people in bioinformatics.

About the training score of ogbl-ppa

Hi OGB team,
I wonder why, in the example code, the negative edges of the validation set are used for evaluation on the training set instead of random negative sampling.

Load smaller dataset

Hi,

Is there a way to load a smaller dataset? I'm trying to run a sample test on my machine to make sure my code works before moving it to the server, but my machine only has 16 GB of memory, which apparently is not enough, and the Python script ends up being killed by the OS. I looked at the docs and the source code, but there doesn't seem to be anything.

$ /usr/bin/time -v python test.py
Loading necessary files...
Command terminated by signal 9
	Command being timed: "python test.py"
	User time (seconds): 155.79
	System time (seconds): 23.27
	Percent of CPU this job got: 96%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:05.40
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 12009284
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1085
	Minor (reclaiming a frame) page faults: 2240431
	Voluntary context switches: 3797
	Involuntary context switches: 6358
	Swaps: 0
	File system inputs: 2584032
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

homo_data.edge_index has redundant edges for ogbn_mag dataset

Hi OGB Team, hope you are doing great.
For the ogbn-mag dataset, I found that homo_data.edge_index in the cluster_gcn.py file has many redundant edges of the same edge type. For example, edge (196525, 1199171) occurs many times, as shown in the attached screenshot. Are there any potential bugs in the group_hetero_graph(data.edge_index_dict, data.num_nodes_dict) call? Thank you for answering in advance!

Unable to import PygGraphPropPredDataset

Dear Team Members,

Thank you very much for all the efforts. However, I am having some issues whenever I try to import PygGraphPropPredDataset. I have attached a screenshot for your consideration. Could you please help me out?

Configuration:

  • Python 3.7.3
  • PyTorch 1.4.0
  • DGL 0.4.3
  • Numpy 1.18.4
  • pandas 1.0.1
  • urllib3 1.25.9
  • scikit-learn 0.22.1


I look forward to hearing from you soon.

Thank you,
@ashikrafi

Small error in example in docs

Thanks a lot for making this collection of datasets. I found a small error in the docs.

On https://ogb.stanford.edu/docs/nodeprop/ , in the example code of the DGL loader, the code example is

from ogb.nodeproppred import DglNodePropPredDataset

dataset = NodePropPredDataset(name = d_name)

but should be

from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset(name = d_name)

Species Information in ogbn-proteins

Hi OGB team,

According to the description of ogbn-proteins, the nodes are proteins from 8 species. I'm wondering why the species information is not included for node features. Thanks.

Evaluator for ogbl-collab

Hi, it seems that the example code initializes the evaluator for ogbl-collab with evaluator = Evaluator(name='ogbl-ppa') here.
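
Presumably the intended initialization is the one below; each link-prediction Evaluator reports its own expected metric and input format, so this is easy to check:

from ogb.linkproppred import Evaluator

evaluator = Evaluator(name='ogbl-collab')
print(evaluator.expected_input_format)
print(evaluator.expected_output_format)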

PCA matrices of "ogbn-products"?

Hi,
Thanks again for this wonderful project.
The node features in ogbn-products are a PCA decomposition of the original sparse bag-of-words vectors.

Do you still have, by chance, the PCA matrices that can be used to convert between the BoW-vectors and the 100-dim vectors?
Without them, this dataset is "frozen" in its current state and cannot be used for transfer learning, learning from multiple datasets at the same time, or other linguistic baselines and analyses (for example, we cannot use pre-trained word embeddings because we don't have the original bag-of-words).

Thanks a lot!
Uri

Nan value in the labels of "ogbg-mol-tox21" dataset

Hi OGB Team,

I'm using the ogbg-mol-tox21 dataset, but it seems there are some missing values in the labels.

ipdb> dataset = PygGraphPropPredDataset(name='ogbg-mol-tox21')
ipdb> dataset
PygGraphPropPredDataset(7831)
ipdb> dataset[0]
Data(edge_attr=[34, 3], edge_index=[2, 34], id=[1], x=[16, 9], y=[1, 12])
ipdb> dataset[0].y
tensor([[0., 0., 1., nan, nan, 0., 0., 1., 0., 0., 0., 0.]], dtype=torch.float64)
ipdb> dataset[1].y
tensor([[0., 0., 0., 0., 0., 0., 0., nan, 0., nan, 0., 0.]], dtype=torch.float64)
ipdb> dataset[3].y
tensor([[0., 0., 0., 0., 0., 0., 0., nan, 0., nan, 0., 0.]], dtype=torch.float64)

Would you please take a look at this problem?
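
For what it's worth, the NaNs indicate that not every molecule is measured on all 12 Tox21 tasks, and the usual handling (as in the OGB example code) is to mask unlabeled entries out of the loss. A minimal sketch with placeholder predictions:

import torch

y_true = dataset[0].y                 # shape [1, 12]; may contain NaN entries
y_pred = torch.zeros_like(y_true)     # placeholder model output (logits)

is_labeled = ~torch.isnan(y_true)     # True where a task label actually exists
loss = torch.nn.functional.binary_cross_entropy_with_logits(
    y_pred[is_labeled].float(), y_true[is_labeled].float())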
