Comments (13)
from graphlearn-for-pytorch.
It may be because the igbh-large dataset has 2 additional node types ('conference' and 'journal') that do not exist in igbh-tiny/small/medium, and we do not process them in dataset.py. I will try to fix it.
@kaixuanliu How was the data partitioned? Partitioning the dataset on each of the four nodes independently may cause this problem, since there is randomness in the partitioning process. If the dataset was partitioned on each node independently in your experiment, try partitioning it on one node and copying the partitioned data to the rest.
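To illustrate why independent partitioning can break distributed training, here is a hedged, self-contained sketch (the `random_partition` function is a toy stand-in, not the library's actual partitioner): two machines that partition independently draw different random shuffles and end up with conflicting views of which partition owns which node, while partitioning once and copying the result keeps every machine consistent.

```python
import random

def random_partition(num_nodes: int, num_parts: int, seed: int) -> dict:
    """Toy partitioner: shuffle node IDs, then split them into equal chunks."""
    rng = random.Random(seed)
    ids = list(range(num_nodes))
    rng.shuffle(ids)
    chunk = num_nodes // num_parts
    owner = {}
    for p in range(num_parts):
        for nid in ids[p * chunk : (p + 1) * chunk]:
            owner[nid] = p  # node nid lives in partition p
    return owner

# Two machines partitioning independently use different RNG state ...
view_a = random_partition(8, 2, seed=1)
view_b = random_partition(8, 2, seed=2)
disagreements = sum(view_a[n] != view_b[n] for n in range(8))
print("nodes the two machines disagree on:", disagreements)

# ... whereas partitioning once and copying (same inputs, same result)
# guarantees every machine sees identical ownership.
assert random_partition(8, 2, seed=1) == view_a
```

With a shared result (the NFS setup mentioned below, or partition-then-copy), every worker indexes the same partition book, so this source of randomness is ruled out.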
I use NFS and just partition the dataset once.
I see, using NFS should be fine.
But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.
I will try to reproduce this problem.
> But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.
Yes, I checked this; that part is OK. And I root-caused the bug. Here is the problem: when a partition has no neighbors for the input seeds, the sampler falls back to using the input seeds as the sampled neighbor output. In distributed training, we then need to get the partition book of the sampled output, so we end up looking up the dst node partition book using src node global IDs, which causes an index-out-of-bounds error.
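To make the failure mode concrete, here is a hedged, self-contained sketch (the `partition_book` layout, node-type names, and `lookup_dst_partitions` are illustrative, not the library's actual API). When the destination node type has fewer nodes than the source type, echoing the src seeds back as "sampled neighbors" means indexing the dst partition book with IDs that exceed its length:

```python
import torch

# Hypothetical partition books: partition_book[ntype][global_id] -> partition id.
# The dst type ("paper") has fewer nodes than the src type ("author"), so
# "author" global IDs can exceed the length of the "paper" partition book.
partition_book = {
    "author": torch.tensor([0, 1, 0, 1, 0, 1, 0, 1]),  # 8 author nodes
    "paper":  torch.tensor([0, 1, 0, 1]),              # only 4 paper nodes
}

def lookup_dst_partitions(sampled_nbrs: torch.Tensor) -> torch.Tensor:
    # Distributed training needs the owning partition of every sampled dst node.
    return partition_book["paper"][sampled_nbrs]

input_seeds = torch.tensor([5, 7])                   # "author" global IDs
sampled_nbrs = torch.tensor([], dtype=torch.int64)   # this partition found nothing

# Buggy fallback: reuse the src seeds as the sampled output.
if sampled_nbrs.numel() == 0:
    sampled_nbrs = input_seeds  # author IDs masquerading as paper IDs

try:
    lookup_dst_partitions(sampled_nbrs)  # indexes a size-4 book with IDs 5 and 7
except IndexError as e:
    print("index out of bounds:", e)
```

The lookup only goes wrong on the fallback path; with genuinely sampled dst IDs the indices are always in range, which is why the bug surfaces only when some partition has no neighbors, as with the igbh-large-only node types discussed above.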
Thanks for your feedback. I agree this is the problem. Will seek a solution.
It seems DGL handles this kind of situation with a different approach: dgl reference
Yes, we are considering using an empty tensor when sampling nothing.
It seems that just using empty tensors fixes this, and no other modification is necessary in my environment. Would you like to try it first? I will push it after the holiday if there are no further problems.
Here:

```python
if nbrs.numel() == 0:
    # nbrs, nbrs_num = input_seeds, torch.ones_like(input_seeds)
    # if self.with_edge:
    #     edge_ids = -1 * nbrs_num
    nbrs = torch.tensor([], dtype=torch.int64, device=self.device)
    nbrs_num = torch.zeros_like(input_seeds, dtype=torch.int64, device=self.device)
    edge_ids = torch.tensor([], dtype=torch.int64, device=self.device) if self.with_edge else None
```

And before Here, add:

```python
if output.nbr.size(0) > 0:
```
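The two changes fit together: returning an empty neighbor tensor with per-seed counts of zero keeps downstream offsets consistent, and the `size(0)` guard skips the partition-book lookup entirely when nothing was sampled. A hedged, self-contained sketch of that interplay (the `sample_with_empty_fallback` helper and the flat partition book are illustrative stand-ins, not the library's actual code):

```python
import torch

def sample_with_empty_fallback(input_seeds, adjacency):
    """Return (nbrs, nbrs_num), where nbrs_num[i] counts neighbors of seed i."""
    per_seed = [adjacency.get(int(s), []) for s in input_seeds]
    flat = [n for lst in per_seed for n in lst]
    if len(flat) == 0:
        # The fix: empty tensor + zero counts instead of echoing the seeds back.
        nbrs = torch.tensor([], dtype=torch.int64)
        nbrs_num = torch.zeros_like(input_seeds, dtype=torch.int64)
    else:
        nbrs = torch.tensor(flat, dtype=torch.int64)
        nbrs_num = torch.tensor([len(lst) for lst in per_seed], dtype=torch.int64)
    return nbrs, nbrs_num

partition_book = torch.tensor([0, 1, 0, 1])  # 4 dst nodes
seeds = torch.tensor([5, 7])                 # src global IDs beyond the dst book
nbrs, nbrs_num = sample_with_empty_fallback(seeds, adjacency={})

# Guard mirroring `if output.nbr.size(0) > 0`: only consult the partition
# book when something was actually sampled, so the out-of-bounds lookup
# from the old fallback can no longer happen.
if nbrs.size(0) > 0:
    partitions = partition_book[nbrs]
else:
    partitions = torch.tensor([], dtype=torch.int64)

print(nbrs.numel(), int(nbrs_num.sum()))  # → 0 0
```

Keeping `nbrs_num` the same length as the seeds (all zeros) rather than empty means callers that compute per-seed offsets from the counts still see one entry per seed.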
A few other minor changes were needed, and I have verified the fix for 2 epochs on the igbh-large dataset. A PR has been submitted. FYI.
Closed by #49