graph-learning-benchmarks / gli Goto Github PK

🗂 Graph Learning Indexer: a contributor-friendly and metadata-rich platform for graph learning benchmarks. Dataloading, Benchmarking, Tagging, and more!

Home Page: https://graph-learning-benchmarks.github.io/gli/

License: MIT License

Python 55.92% Jupyter Notebook 43.81% Makefile 0.27%

benchmark graph graph-datasets graph-neural-networks machine-learning graph-neural-network

gli's People

Contributors

Stargazers

Watchers

Forkers

jn-huang jiaqima greatsnoopyme tu-yiwen jupiterepoch zy-liang tingwl0122 remylau xingjian-zhang uestc-chen hookk yi-liang-leon wood-ghost lwangjt huskydoge muzz-yasir jason-csc xinyaoqiu jackliuyiyao

gli's Issues

store `time_window` directly in task.json

Current:

task_data_no_pre_stored_neg_edge = {
    "train_time_window": (1963, 2017), # window of (edge_year in train)
    "valid_time_window": (2018, 2018),
    "test_time_window": (2019, 2019)
}
np.savez_compressed("ogbl-collab_task_runtime_sampling.npz", **task_data_no_pre_stored_neg_edge)

with open("./task_runtime_sampling.json", "w") as fp:
    json.dump(task_no_pre_stored_neg_edge, fp, indent=4)

    "train_time_window": {
        "file": "ogbl-collab_task.npz",
        "key": "train_time_window"
    }

Let's store it in this way:

    "train_time_window": [1963, 2017]

Datatype issue when calling "get_single_graph" and "get_multi_graph" function in graph.py

In the function get_single_graph, when we want to assign the node feature, the original code: g.ndata[attr] = array
will give error message if the array variable has no device attribute.

For example, when the data type of array is scipy_csr_matrix, I will need to convert it to a torch tensor first (to give it device attribute). So we can either read in tensor-only array or modify it in get_single_graph.

Attached is an example.

.

Field name should be consistent

https://github.com/Graph-Learning-Benchmarks/GLB-Repo/blob/b9a5de70b48886e8d8c2cc4c0e661a015ca6c500/examples/ogb_data/graph_prediction/ogbg-molhiv/metadata.json#L27-L38
The field name should be _NodeList and _EdgeList.

[BUG] GLB installation error (dgl error)

Describe the bug
Unable to install "dgl-cu102" when installing glb.
I saw some changes have been made in setup.py and requirements.py, but I cannot successfully install glb after this change.

To Reproduce
pip install -e .

Screenshots

[FEATURE REQUEST] Distinguishing the contributions of original dataset and transformed dataset.

Is your feature request related to a problem? Please describe.
Currently, by separating data and task, we are able to distinguish the contribution of creating a dataset and the contribution of creating a task on an existing dataset. However, there is another type of contribution which transforms an existing dataset (e.g., by imputing some of the missing features) to form a new dataset. The original dataset and transformed dataset cannot have the same metadata.json or npz files so we cannot store both versions in the same folder. When the transformed dataset is more popular and we store the transformed version, it is tricky that which contribution we should add in the citation under metadata.json.

This problem is due to the fact that we still lack a mechanism of distinguishing the contributions of original dataset and transformed dataset.

Describe the solution you'd like
Perhaps the easiest solution could be having a "Previous Versions" section in the README.md for transformed dataset, with citations to the previous versions. And if some previous versions of the datasets are also stored in the repository, we can further add a link to that folder near the corresponding citation.

We do not need to be perfect about tracking the version information when submitting the datasets. We can always use the community power to supplement/correct the version tracking information as the project will be open to PR.

[FEATURE REQUEST] Change train_set to train_mask

Is your feature request related to a problem? Please describe.
After loading dataset, we need to get train_mask, val_mask and test_mask. In current implementation, we need do the following:
train_mask = g.ndata["train_set"]
However, some of dgl's dataset uses train_mask as key. For example: https://docs.dgl.ai/generated/dgl.data.CoraGraphDataset.html?highlight=coragraphdataset

Describe the solution you'd like
Change key from train_set to train_mask

Accommodate data loading for data under `datasets` folder with urls.json

Is your feature request related to a problem? Please describe.
Accommodate data loading following the change in #36 and #63

Describe the solution you'd like
Make the data loader able to download npz files from urls.json.

OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

The problem may relate to dgl. My python version is 3.9, pytorch is 1.11.0 (py3.9_cuda11.3_cudnn8.2.0_0), dgl is dgl-cuda11.3. CUDA version is 11.5(on greatlake).
This error has something to do with cuda although I didn't use gpu yet.

To reproduce, simply run example.py with "python3 example.py --task {NodeClassification,TimeDepenedentLinkPrediction,GraphClassification}"

File "/home/huangjin/GLB-Repo/example.py", line 4, in
import glb
File "/home/huangjin/GLB-Repo/glb/init.py", line 2, in
from . import dataloading
File "/home/huangjin/GLB-Repo/glb/dataloading.py", line 3, in
from dgl import DGLGraph
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/init.py", line 13, in
from .backend import load_backend, backend_name
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/backend/init.py", line 95, in
load_backend(get_preferred_backend())
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/backend/init.py", line 41, in load_backend
from .._ffi.base import load_tensor_adapter # imports DGL C library
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/_ffi/base.py", line 44, in
_LIB, _LIB_NAME, _DIR_NAME = _load_lib()
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/_ffi/base.py", line 34, in _load_lib
lib = ctypes.CDLL(lib_path[0])
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/ctypes/init.py", line 374, in init
self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

Change `urls.json` structure

Is your feature request related to a problem?
To download npz files from remote repository, I need to know what file(s) a json needs. For example, if the user only wants to load a CORA graph, the only needed file is cora.npz, which is in metadata.json. But it is inefficient for us to collect the needed files from JSON because they may have very different structures (e.g., different task types).

Describe the solution you'd like
In urls.json, add a hierarchy:

Change

{
	"cora.npz": "https://www.dropbox.com/s/os68aa8zptwht0f/cora__cora.npz?dl=0",
	"cora_task.npz": "https://www.dropbox.com/s/79jqjylqj2fw6h3/cora__cora_task.npz?dl=0"
}

{
	"metadata": {
		"cora.npz": "https://www.dropbox.com/s/os68aa8zptwht0f/cora__cora.npz?dl=0"
	},
	"task": {
		"cora_task.npz": "https://www.dropbox.com/s/79jqjylqj2fw6h3/cora__cora_task.npz?dl=0"
	}
}

Describe alternatives you've considered
Use a depth-first-search function to collect all the required files.

Add dependency in `setup.py`

Is your feature request related to a problem? Please describe.
Currently, we need to manually install dependency (e.g., PyTorch, glb, etc.).

Describe the solution you'd like
We can specify the requirement in setup.py. So users only need to run it once to use our codes.

Describe alternatives you've considered
We can also use requirements.txt

[BUG] WN18RR has a wrong README.md

Describe the bug
WN18RR has a wrong README.md with a title "FB15K". Other OpenKE datasets seem to have the correct README.md.

Additional context
I've moved most of the OpenKE datasets from /examples/ to /datasets/ in PR #75. WN18RR is the only one being kept in both /examples/ and /datasets/ as it has the smallest npz files. Please fix the README.md in both /examples/WN18RR/ and /datasets/WN18RR/.

Current data format does not support negative samples.

The current data format does not support negative samples for LinkPrediction because they are not indexable.

[BUG] Cannot read in obgb-molfreesov dataset and Rename twitch-gamers dataset

Describe the bug
As titled.
Also, datasets/twitch-gamers has the wrong name and datasets/twitch-gamers/urls.json has the wrong urls.

To Reproduce
You can try running the command: python3 tags.py --dataset ogbg-molfreesolv --task obgb-molfreesolv_task
at GLB-Repo/glb.

Expected behavior

Move negative edges in link prediction to task.json

Is your feature request related to a problem? Please describe.
ogbl-collab is a homogeneous graph. However, we are now storing it as a heterogeneous graph, whose edges include "positive edge" and "negative edge". We would like a graph to be saved as heterogeneous only if it is heterogeneous by definition.

Describe the solution you'd like

Move the negative edges in ogbl-collab to task.json.
Change the format of LinkPrediction

[FEATURE REQUEST] Multi-fold splits or random splits in NodeClassification tasks

Is your feature request related to a problem? Please describe.

When writing train_set, val_set and test_set for the dataset, i found that not all the datasets has only one split of training set, validation set and test set. For cora, there is only one split, therefore the shape of its graph.ndata['train_mask'] is [2708]. However, for actor, there are 10 splits of the set, therefore its graph.train_mask shape is [7600,10]. Another case is that, in the non-homophilous datasets like arxiv-year, its split style is by setting the split ratio and then randomly assign nodes into one of three sets (for nodaclassification task).

Describe the solution you'd like

After the meeting with Jiaqi, Jiaqi put forward some solution to it.
For the dataset like cora, which has only one split style of datasets, everything keeps the same.

For the dataset like actor, which has multiple ways of splitting datasets, we can add one key named num_fold to the task.json to denote there are num_fold ways to split the dataset (10 for the actor dataset). Further, we can change the key of train_set, val_set and test_set into train_FOLD, val_FOLD and test_FOLD. In the actor.ipynb, we can change the dict: task_data into

task_data = {
    "train_0": train_set,
    "val_0": val_set,
    "test_0": test_set，
    "train_1": train_set,
    "val_1": val_set,
    "test_1": test_set，
    ......
    "train_9": train_set,
    "val_9": val_set,
    "test_9": test_set，
}

30 values in total.

For the dataset like arxiv-year, which has no fixed way of splitting datasets but only randomly generate the splits with some pre-set ratio, we can delete the train_set, val_set and test_set in task.json and replace them with train_ratio, val_ratio and test_ratio.

Additional Context

According to Jiaqi, the dataloader will need to adapt such change.

Heterogeneous graph should specify triplet type in metadata.json

Currently, we store the OGB link prediction data as a heterogeneous graph that has two edge groups

PositiveEdge
NegativeEdge

but single node group

Node

However, DGL requires different edge groups to have different relation triplets. So we need to either specify the triplet in metadata.json for each edge group or simply label pos/neg as features.

import dgl
import torch as th
# Create a heterograph with 3 node types and 3 edges types.
graph_data = {
   ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
   ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([2, 3])),
   ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
}
g = dgl.heterograph(graph_data)

`GraphClassification`

https://github.com/Graph-Learning-Benchmarks/GLB-Repo/blob/c8191543b1f1d4f334f1b29ffc124ce732ccf032/examples/ogb_data/graph_prediction/ogbg-molhiv/task.json#L3

This should be GraphClassification here.

Add `.gitignore`

set "device" to gpu in read_glb_graph lead to an error

When I try to set 'device = '0' in read_glb_graph. An error will occur:

Traceback (most recent call last):
File "/home/huangjin/GLB-Repo/benchmark/gcn/train.py", line 168, in
main(args)
File "/home/huangjin/GLB-Repo/benchmark/gcn/train.py", line 43, in main
g = glb.graph.read_glb_graph(metadata_path=metadata_path[args.dataset], device=device)
File "/home/huangjin/GLB-Repo/glb/graph.py", line 127, in read_glb_graph
return get_single_graph(data, device, hetero=hetero)
File "/home/huangjin/GLB-Repo/glb/graph.py", line 58, in get_single_graph
g.ndata[attr] = _to_tensor(array)
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/view.py", line 84, in setitem
self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4122, in _set_n_repr
raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
dgl._ffi.base.DGLError: Cannot assign node feature "NodeFeature" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.
Namespace(dataset='citeseer', dropout=0.5, gpu=0, lr=0.01, n_epochs=200, n_hidden=16, n_layers=1, weight_decay=0.0005, self_loop=True)

Maybe we should assign:
g.ndata[attr] = _to_tensor(array, device=device) at line 58
and,
g.edata[attr] = _to_tensor(array, device=device) at line 61
to avoid the problem
Here is an example:

Some `*.npz` file is too large to host on Github

Many datasets are too large to store on Github. I suggest we delete all .npz files for being consistent. After all, they can be generated through notebooks and are not readable.

Fix typo `SUPPORTED_TASK_TYPES`

https://github.com/Graph-Learning-Benchmarks/GLB-Repo/blob/1d0502ab82d10907695c88f3156d93eb43a51709/glb/task.py#L12

Add GLB base class GLBGraph with `node_to_dense()` and `edge_to_dense()`

Is your feature request related to a problem? Please describe.
Users may want to convert node/edge features to dense tensors.

Describe the solution you'd like
Add a base class of GLB. (DGLGraph->GLBGraph) with two member functions:

node_to_dense(feat=..., node_group=...)  # 2 optional arguments
edge_to_dense(feat=..., edge_group=...)

Describe alternatives you've considered
Add a member function to_dense() that converts all features to dense.

[BUG] Cannot load `ogbg-molfreesolv`

To Reproduce

(glb-py36) jimmyzxj@voyager:~/glb/GLB-Repo$ python3 tmp.py 
Using backend: pytorch
/home/jimmyzxj/glb/GLB-Repo/datasets/ogbg-molfreesolv/ogbg-molfreesolv.npz already exists.
/home/jimmyzxj/glb/GLB-Repo/datasets/ogbg-molfreesolv/ogbg-molfreesolv_task.npz already exists.
ogbg-molfreesolv dataset
Traceback (most recent call last):
  File "tmp.py", line 2, in <module>
    g = glb.dataloading.get_glb_graph("ogbg-molfreesolv")
  File "/home/jimmyzxj/glb/GLB-Repo/glb/dataloading.py", line 64, in get_glb_graph
    return read_glb_graph(metadata_path, device=device, verbose=verbose)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 116, in read_glb_graph
    data = _dfs_read_file(pwd, data, device="cpu")
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 288, in _dfs_read_file
    data = _dfs_read_file_helper(pwd, d, device)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 301, in _dfs_read_file_helper
    entry = _dfs_read_file_helper(pwd, d[k], device=device)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 301, in _dfs_read_file_helper
    entry = _dfs_read_file_helper(pwd, d[k], device=device)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 296, in _dfs_read_file_helper
    array = file_reader.get(path, d.get("key"), device)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/utils.py", line 70, in get
    array = unwrap_array(array)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/utils.py", line 39, in unwrap_array
    return array.all()
  File "/home/jimmyzxj/.local/lib/python3.6/site-packages/numpy/core/_methods.py", line 57, in _all
    return umr_all(a, axis, dtype, out, keepdims)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

EDIT: This issue has been fixed in a previous update.

Expected behavior
glb.dataloading.get_glb_graph() should return a list of graphs.

Additional context
In addition, the metadata.json seems to have problematic structure.

{
    "description": "ogbg-molfreesolv dataset",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Numpy ndarray of shape (num_nodes, nodefeat_dim), where nodefeat_dim is the dimensionality of node features and i-th row represents the feature of i-th node. This can be None if no input node features are available.",
                "type": "int",
                "format": "SparseTensor",
                "file": "ogbg-molfreesolv.npz",
                "key": "node_feats"
            }
        },
        "Edge": {
            "_Edge": {
                "file": "ogbg-molfreesolv.npz",
                "key": "edge"
            },
            "EdgeFeature": {
                "description": "Numpy ndarray of shape (num_edges, edgefeat_dim), where edgefeat_dim is the dimensionality of edge features and i-th row represents the feature of i-th edge. This can be None if no input edge features are available.",
                "type": "int",
                "format": "SparseTensor",
                "file": "ogbg-molfreesolv.npz",
                "key": "edge_feats"
            }
        },
        "Graph": {
            "_NodeList": {
                "file": "ogbg-molfreesolv.npz",
                "key": "node_list"
            }
        },
        "GraphLabel": {
            "file": "ogbg-molfreesolv.npz",
            "type": "int",
            "format": "Tensor",
            "key": "graph_class"
        }
    },
    "citation": "@inproceedings{Wu2018Stanford,\ntitle={Moleculenet: a benchmark for molecular machine learning},\nauthor={Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh SPappu, Karl Leswing, and Vijay Pande},\nbooktitle={Chemical Science},\npages={513=520},\nyear={2018}\n}",
    "is_heterogeneous": false
}

The GraphLabel should be placed inside Graph

Complete the task description

@JasonHezhengFan could you please complete the descriptions of GraphClassification and TimeDependentLinkPrediction in CONTRIBUTING.md, following the example of NodeClassification? Thanks!

Link to the section: https://github.com/Graph-Learning-Benchmarks/GLB-Repo/blob/main/CONTRIBUTING.md#overview-1

Update task.json format

Let's change the task.json to the new format:

"feature": [
        "Node/NodeFeature"
    ],

[FEATURE REQUEST] Clean up repo history

Is your feature request related to a problem? Please describe.
The current repo is relatively large due to npz files in the commit history. The HEAD is about only 10 MB while the entire history is about 140 MB.

Describe the solution you'd like
Clean up the commit history using BFG Repo-Cleaner.

OGB examples missing README.md

Hi @JasonHezhengFan, could you add the README.md files to the OGB examples?

Redundant examples

There are five examples of node classifications. They have basically the same structures. Let's delete redundant ones.
We should flatten ogb directory.

Test if README.md exists

Add a test that README.md should exist in the directory of metadata.json, similarly as test_if_has_essential_json.

Update heterogeneous graph API documentation

Hi @xingjian-zhang, could you update the GLB Data Format section in CONTRIBUTING.md for heterogeneous graph API?

edge shape in cora.ipynb[BUG]

@JasonHezhengFan
Describe the bug
The tensor edge in cora.ipynb has the shape of (10556,2). However, in CONTRIBUTING.md, the shape of such tensor should be (2,10556)

To Reproduce

# edge = torch.stack(graph.edges()).numpy().T
edge = torch.stack(graph.edges()).numpy()

Expected behavior
Not sure which shape should the edge be.

Screenshots
Code picture

CONTRIBUTING.md

Add dataloading tests

Is your feature request related to a problem? Please describe.
We want to verify that glb can load all datasets

Describe the solution you'd like

Check all datasets if glb/ changed
Check new dataset if new dataset is added but glb/ remains unchanged. (Done by PR #110 )

Additional context
This might be helpful:
https://github.com/marketplace/actions/changed-files

Create a datasets folder and move most datasets there

[FEATURE REQUEST] Update documentation for NodeClassification with multi-split or random split

Is your feature request related to a problem? Please describe.
Following issue #73, we added support for multi-split and random split in PR #95. We need to update CONTRIBUTING.md for the API of the NodeClassification task.

Potential bug in `test_if_has_essential_json`?

@JasonHezhengFan Could you have a check why test_if_has_essential_json passed for this PR? Currently the task json files are not in the same directory as metadata.json. So this test should have failed?

Originally posted by @jiaqima in #43 (comment)

Sparse torch array does not support indexing

Using backend: pytorch
OGBg-molhiv dataset.
Traceback (most recent call last):
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/memory_profiler.py", line 1349, in <module>
    exec_with_profiler(script_filename, prof, args.backend, script_args)
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/memory_profiler.py", line 1250, in exec_with_profiler
    exec(compile(f.read(), filename, 'exec'), ns, ns)
  File "example.py", line 79, in <module>
    main()
  File "example.py", line 69, in main
    g, task, datasets = prepare_dataset(*path_dict[task_name])
  File "example.py", line 50, in prepare_dataset
    g = glb.graph.read_glb_graph(metadata_path=metadata_path)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 114, in read_glb_graph
    return get_multi_graph(data, device)
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/memory_profiler.py", line 1186, in wrapper
    val = prof(func)(*args, **kwargs)
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/memory_profiler.py", line 759, in f
    return func(*args, **kwds)
  File "/home/jimmyzxj/glb/GLB-Repo/glb/graph.py", line 82, in get_multi_graph
    graphs.append(dgl.node_subgraph(g, node_list[i]))
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/dgl/subgraph.py", line 146, in node_subgraph
    induced_nodes.append(_process_nodes(ntype, nids))
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/dgl/subgraph.py", line 139, in _process_nodes
    return F.astype(F.nonzero_1d(F.copy_to(v, graph.device)), graph.idtype)
  File "/home/jimmyzxj/miniconda3/envs/py39/lib/python3.9/site-packages/dgl/backend/pytorch/tensor.py", line 307, in nonzero_1d
    x = th.nonzero(input, as_tuple=False).squeeze()
NotImplementedError: Could not run 'aten::nonzero' with arguments from the 'SparseCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::nonzero' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

CPU: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/build/aten/src/ATen/RegisterCPU.cpp:18433 [kernel]
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ADInplaceOrView: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradCPU: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradCUDA: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradXLA: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradLazy: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradXPU: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradMLC: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradHPU: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradNestedTensor: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradPrivateUse1: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradPrivateUse2: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
AutogradPrivateUse3: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/VariableType_0.cpp:8931 [autograd kernel]
Tracer: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/torch/csrc/autograd/generated/TraceType_0.cpp:10285 [kernel]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811723911/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

List of non-essential tests.

This issue maintains a list of tests that are non-essential in the sense that an error here will lead to an error in a more high-level test (e.g., test of get_glb_dataset), so the error will still be catched without these tests. But the existence of these more low-level tests could enhance the user experience for debugging contributed datasets.

Add a test ensuring that there is no unexpected attribute in metadata.json and task.json. May need to maintain a list of predefined attributes that are allowed.
Test the validity of json file formats.
- No trailing comma after the last entry value within each pair of curly braces.
- The json file can be successfully loaded.

Defer GPU loading

Is your feature request related to a problem? Please describe.
#48
#50

Describe the solution you'd like
The graph and its attributes are moved to another device as late as possible.

[FEATURE REQUEST] Add regression task for Graph and Node task

Describe the solution you'd like

Add new base class: RegressionTask
Add two classes GraphRegressionTask, NodeRegressionTask.

Negative samples in LinkPrediction cannot be indexed

adding to previous: these neg edges are not included in the original edge index (can think of it as generated negative examples, so we can't index 'em back), that's what I meant by loss info in terms of neg edges.

Originally posted by @JasonHezhengFan in #5 (comment)

[BUG] Cannot read downloaded .npz files successfully for OpenKE datasets

Describe the bug
I cannot read in downloaded .npz files for some OpenKE datasets at /GLB-Repo/datasets, such as FB13and WN18RR.
I didn't test all of the OpenKE datasets, but I could successfully run codes for cora, citeseer, and PubMed.

(A PR #86 is created to update dataset preparation in /GLB-Repo/glb/tags.py )

To Reproduce
Run python3 tags.py --metadata FB13--task task at /GLB-Repo/glb

Expected behavior

Comments
The error occurs at _dfs_read_file in graph.py.
array = file_reader.get(path, d.get("key"), device) cannot merge path, keys and device name successfully.

Add link prediction task implementation

Cannot load ogbn-mag

I found in ogbn-mag.npz, the paper_class data is a dict rather than a Tensor. Please make sure that every entry stored in .npz can be read as an array/tensor.

PaperNode_id <class 'numpy.ndarray'>
paper_feats <class 'numpy.ndarray'>
paper_class <class 'dict'>
paper_year <class 'dict'>
AuthorNode_id <class 'numpy.ndarray'>
InstitutionNode_id <class 'numpy.ndarray'>
FieldOfStudy_id <class 'numpy.ndarray'>
author_institution_id <class 'numpy.ndarray'>
author_paper_id <class 'numpy.ndarray'>
paper_paper_id <class 'numpy.ndarray'>
paper_FieldOfStudy_id <class 'numpy.ndarray'>
author_institution_edge <class 'numpy.ndarray'>
author_paper_edge       <class 'numpy.ndarray'>
paper_paper_edge        <class 'numpy.ndarray'>
paper_FieldOfStudy_edge <class 'numpy.ndarray'>
node_list               <class 'numpy.ndarray'>
edge_list               <class 'numpy.ndarray'>

[FEATURE REQUEST] README template

Is your feature request related to a problem? Please describe.
We would like the README.md in each dataset to have similar formats.

Describe the solution you'd like
Add a template README.md.

When training on GPU, device conflict occurs

When I try to train on gpu (device = '0'), I have the following error:

File "/home/huangjin/GLB-Repo/benchmark/gcn/train.py", line 164, in
main(Args)
File "/home/huangjin/GLB-Repo/benchmark/gcn/train.py", line 39, in main
data = glb.dataloading.combine_graph_and_task(g, task)
File "/home/huangjin/GLB-Repo/glb/dataloading.py", line 13, in combine_graph_and_task
return glb.dataset.node_classification_dataset_factory(graph, task)
File "/home/huangjin/GLB-Repo/glb/dataset.py", line 55, in node_classification_dataset_factory
return NodeClassificationDataset()
File "/home/huangjin/GLB-Repo/glb/dataset.py", line 27, in init
super().init(name=task.description, force_reload=True)
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/data/dgl_dataset.py", line 99, in init
self._load()
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/data/dgl_dataset.py", line 191, in _load
self.process()
File "/home/huangjin/GLB-Repo/glb/dataset.py", line 42, in process
self.g.ndata[dataset] = mask.bool()
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/view.py", line 84, in setitem
self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
File "/home/huangjin/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4122, in _set_n_repr
raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
dgl._ffi.base.DGLError: Cannot assign node feature "train_set" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.
Namespace(dataset='citeseer', dropout=0.5, gpu=0, lr=0.01, n_epochs=200, n_hidden=16, n_layers=1, weight_decay=0.0005, self_loop=True)

I think the problem is because that g and mask are on the different device. Maybe we need to set "device" as an argument into combine_graph_and_task, node_classification_dataset_factory etc here to change "mask" to the same device as g.

example as follows:

[BUG][Cannot load .npz files for new datasets]

Describe the bug
I can only read in actor.npz, but not the remaining ones(actor_task.npz, chameleon.npz and chameleon_task.npz, squirrel.npz and squirrel_task.npz ).

To Reproduce
python3 tags.py --dataset actor --task task
python3 tags.py --dataset chareleon --task task
You might need to run these commands multiple times to download both files.
(The code will download dataset.npz and read it first, possibly halt the running process.)

Expected behavior

Comments
I believe there are some format errors in the remaining .npz files or these files are corrupted?
The file-reading codes seem to be right because there are no problems when reading cora, citeseer, and pubmed datasets/tasks.

Graph classification task uses too much memory

Add `is_heterogeneous` attribute in metadata.json

Is your feature request related to a problem? Please describe.
#61

Describe the solution you'd like
Add a new binary attribute in metadata.json: is_heterogeneous. This adds redundancy to our data format and helps limit the abuse of heterogeneous graph storage.

Use urls.json for storage

[FEATURE REQUEST] Update tests to accommodate `/datasets` and the new dataloader

Is your feature request related to a problem? Please describe.
Most of the datasets have been moved/copied to the folder /datasets in PR #75. The dataloader is also updated in PR #81 to accommodate the new urls.json file in each dataset. The unit tests should be updated to accommodate these changes.

Describe the solution you'd like
Update unit tests to accommodate the changes mentioned above.

graph-learning-benchmarks / gli Goto Github PK

gli's People

Contributors

Stargazers

Watchers

Forkers

gli's Issues

Recommend Projects

Recommend Topics

Recommend Org