Comments (13)
from graphlearn-for-pytorch.
It may be because the igbh-large dataset has 2 additional node types ('conference' and 'journal') that do not exist in igbh-tiny/small/medium, and we do not process them in dataset.py. I will try to fix it.
@kaixuanliu How was the data partitioned? Partitioning the dataset on each of the four nodes independently may cause this problem, since there is randomness in the partitioning process. If the dataset was partitioned on each node independently in your experiment, try partitioning it on one node and copying the partitioned data to the rest.
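To illustrate why independent partitioning can break distributed training, here is a hedged, self-contained sketch (the `random_partition` function is a toy stand-in, not the library's actual partitioner): two machines that partition independently draw different random shuffles and end up with conflicting views of which partition owns which node, while partitioning once and copying the result keeps every machine consistent.

```python
import random

def random_partition(num_nodes: int, num_parts: int, seed: int) -> dict:
    """Toy partitioner: shuffle node IDs, then split them into equal chunks."""
    rng = random.Random(seed)
    ids = list(range(num_nodes))
    rng.shuffle(ids)
    chunk = num_nodes // num_parts
    owner = {}
    for p in range(num_parts):
        for nid in ids[p * chunk : (p + 1) * chunk]:
            owner[nid] = p  # node nid lives in partition p
    return owner

# Two machines partitioning independently use different RNG state ...
view_a = random_partition(8, 2, seed=1)
view_b = random_partition(8, 2, seed=2)
disagreements = sum(view_a[n] != view_b[n] for n in range(8))
print("nodes the two machines disagree on:", disagreements)

# ... whereas partitioning once and copying (same inputs, same result)
# guarantees every machine sees identical ownership.
assert random_partition(8, 2, seed=1) == view_a
```

With a shared result (the NFS setup mentioned below, or partition-then-copy), every worker indexes the same partition book, so this source of randomness is ruled out.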
I use NFS and just partition the dataset once.
I see, using NFS should be fine.
But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.
I will try to reproduce this problem.
> But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.
Yes, I checked this; that part is OK. And I root-caused the bug. Here is the problem: when a partition has no neighbors for the input seeds, the sampler falls back to using the input seeds as the sampled neighbor output. In distributed training, we then need to get the partition book of the sampled output, so we end up looking up the dst node partition book using src node global IDs, which causes an index-out-of-bounds error.
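To make the failure mode concrete, here is a hedged, self-contained sketch (the `partition_book` layout, node-type names, and `lookup_dst_partitions` are illustrative, not the library's actual API). When the destination node type has fewer nodes than the source type, echoing the src seeds back as "sampled neighbors" means indexing the dst partition book with IDs that exceed its length:

```python
import torch

# Hypothetical partition books: partition_book[ntype][global_id] -> partition id.
# The dst type ("paper") has fewer nodes than the src type ("author"), so
# "author" global IDs can exceed the length of the "paper" partition book.
partition_book = {
    "author": torch.tensor([0, 1, 0, 1, 0, 1, 0, 1]),  # 8 author nodes
    "paper":  torch.tensor([0, 1, 0, 1]),              # only 4 paper nodes
}

def lookup_dst_partitions(sampled_nbrs: torch.Tensor) -> torch.Tensor:
    # Distributed training needs the owning partition of every sampled dst node.
    return partition_book["paper"][sampled_nbrs]

input_seeds = torch.tensor([5, 7])                   # "author" global IDs
sampled_nbrs = torch.tensor([], dtype=torch.int64)   # this partition found nothing

# Buggy fallback: reuse the src seeds as the sampled output.
if sampled_nbrs.numel() == 0:
    sampled_nbrs = input_seeds  # author IDs masquerading as paper IDs

try:
    lookup_dst_partitions(sampled_nbrs)  # indexes a size-4 book with IDs 5 and 7
except IndexError as e:
    print("index out of bounds:", e)
```

The lookup only goes wrong on the fallback path; with genuinely sampled dst IDs the indices are always in range, which is why the bug surfaces only when some partition has no neighbors, as with the igbh-large-only node types discussed above.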
Thanks for your feedback. I agree this is the problem. Will seek a solution.
It seems DGL handles this kind of situation with a different approach: dgl reference
Yes, we are considering using an empty tensor when sampling nothing.
It seems that just using empty tensors fixes this, and no other modification is necessary in my environment. Would you like to try it first? I will push it after the holiday if there are no further problems.
Here:

```python
if nbrs.numel() == 0:
    # nbrs, nbrs_num = input_seeds, torch.ones_like(input_seeds)
    # if self.with_edge:
    #     edge_ids = -1 * nbrs_num
    nbrs = torch.tensor([], dtype=torch.int64, device=self.device)
    nbrs_num = torch.zeros_like(input_seeds, dtype=torch.int64, device=self.device)
    edge_ids = torch.tensor([], dtype=torch.int64, device=self.device) if self.with_edge else None
```

And before Here, add:

```python
if output.nbr.size(0) > 0:
```
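The two changes fit together: returning an empty neighbor tensor with per-seed counts of zero keeps downstream offsets consistent, and the `size(0)` guard skips the partition-book lookup entirely when nothing was sampled. A hedged, self-contained sketch of that interplay (the `sample_with_empty_fallback` helper and the flat partition book are illustrative stand-ins, not the library's actual code):

```python
import torch

def sample_with_empty_fallback(input_seeds, adjacency):
    """Return (nbrs, nbrs_num), where nbrs_num[i] counts neighbors of seed i."""
    per_seed = [adjacency.get(int(s), []) for s in input_seeds]
    flat = [n for lst in per_seed for n in lst]
    if len(flat) == 0:
        # The fix: empty tensor + zero counts instead of echoing the seeds back.
        nbrs = torch.tensor([], dtype=torch.int64)
        nbrs_num = torch.zeros_like(input_seeds, dtype=torch.int64)
    else:
        nbrs = torch.tensor(flat, dtype=torch.int64)
        nbrs_num = torch.tensor([len(lst) for lst in per_seed], dtype=torch.int64)
    return nbrs, nbrs_num

partition_book = torch.tensor([0, 1, 0, 1])  # 4 dst nodes
seeds = torch.tensor([5, 7])                 # src global IDs beyond the dst book
nbrs, nbrs_num = sample_with_empty_fallback(seeds, adjacency={})

# Guard mirroring `if output.nbr.size(0) > 0`: only consult the partition
# book when something was actually sampled, so the out-of-bounds lookup
# from the old fallback can no longer happen.
if nbrs.size(0) > 0:
    partitions = partition_book[nbrs]
else:
    partitions = torch.tensor([], dtype=torch.int64)

print(nbrs.numel(), int(nbrs_num.sum()))  # → 0 0
```

Keeping `nbrs_num` the same length as the seeds (all zeros) rather than empty means callers that compute per-seed offsets from the counts still see one entry per seed.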
A few other minor changes were needed, and I have verified the fix for 2 epochs on the igbh-large dataset. A PR has been submitted. FYI.
Closed by #49