alibaba / graphlearn-for-pytorch Goto Github PK

A GPU-accelerated graph learning library for PyTorch, facilitating the scaling of GNN training and inference.

License: Apache License 2.0

CMake 0.69% Python 68.76% C++ 12.97% Cuda 17.25% Shell 0.32%

deep-learning distributed gpu pytorch graph-neural-networks

graphlearn-for-pytorch's Introduction

GraphLearn-for-PyTorch(GLT) is a graph learning library for PyTorch that makes distributed GNN training and inference easy and efficient. It leverages the power of GPUs to accelerate graph sampling and utilizes UVA to reduce the conversion and copying of features of vertices and edges. For large-scale graphs, it supports distributed training on multiple GPUs or multiple machines through fast distributed sampling and feature lookup. Additionally, it provides flexible deployment for distributed training to meet different requirements.

Highlighted Features
Architecture Overview
Installation
Quick Tour
- Accelarating PyG model training on a single GPU.
- Distributed training
License

Highlighted Features

GPU acceleration

GLT provides both CPU-based and GPU-based graph operators such as neighbor sampling, negative sampling, and feature lookup. For GPU training, GPU-based graph operations accelerate the computation and reduce data movement by a considerable amount.
Scalable and efficient distributed training

For distributed training, we implement multi-processing asynchronous sampling, pin memory buffer, hot feature cache, and use fast networking technologies (PyTorch RPC with RDMA support) to speed up distributed sampling and reduce communication. As a result, GLT can achieve high scalability and support graphs with billions of edges.
Easy-to-use API

Most of the APIs of GLT are compatible with PyG/PyTorch, so you can only need to modify a few lines of PyG's code to get the acceleration of the program. For GLT specific APIs, they are compatible with PyTorch and there is complete documentation and usage examples available.
Large-scale real-world GNN models

We focus on real-world scenarios and provide distributed GNN training examples on large-scale graphs. Since GLT is compatible with PyG, you can use almost any PyG's model as the base model. We will also continue to provide models with practical effects in industrial scenarios.
Easy to extend

GLT directly uses PyTorch C++ Tensors and is easy to extend just like PyTorch. There are no extra restrictions for CPU or CUDA based graph operators, and adding a new one is straightforward. For distributed operations, you can write a new one in Python using PyTorch RPC.
Flexible deployment

Graph Engine(Graph operators) and PyTorch engine(PyTorch nn modules) can be deployed either co-located or separated on different machines. This flexibility enables you to deploy GLT to your own environment or embed it in your project easily.

Architecture Overview

The main goal of GLT is to leverage hardware resources like GPU/NVLink/RDMA and characteristics of GNN models to accelerate end-to-end GNN training in both the single-machine and distributed environments.

In the case of multi-GPU training, graph sampling and CPU-GPU data transferring could easily become the major performance bottleneck. To speed up graph sampling and feature lookup, GLT implements the Unified Tensor Storage to unify the memory management of CPU and GPU. Based on this storage, GLT supports both CPU-based and GPU-based graph operators such as neighbor sampling, negative sampling, feature lookup, subgraph sampling etc. To alleviate the CPU-GPU data transferring overheads incurred by feature collection, GLT supports caching features of hot vertices in GPU memory, and accessing the remaining feature data (stored in pinned memory) via UVA. We further utilize the high-speed NVLink between GPUs expand the capacity of GPU cache.

As for distributed training, to prevent remote data access from blocking the progress of model training, GLT implements an efficient RPC framework on top of PyTorch RPC and adopts asynchronous graph sampling and feature lookup operations to hide the network latency and boost the end-to-end training throughput.

To lower the learning curve for PyG users, the APIs of key abstractions in GLT, such as dataset and dataloader, are designed to be compatible with PyG. Thus PyG users can take full advantage of GLT's acceleration capabilities by only modifying very few lines of code.

For model training, GLT supports different models to fit different scales of real-world graphs. It allows users to collocate model training and graph sampling (including feature lookup) in the same process, or separate them into different processes or even different machines. We provide two example to illustrate the training process on small graphs: single GPU training example and multi-GPU training example. For large-scale graphs, GLT separates sampling and training processes for asynchronous and parallel acceleration, and supports deployment of sampling and training processes on the same or different machines. Examples of distributed training can be found in distributed examples.

Installation

Requirements

cuda
python>=3.6
torch(PyTorch)
torch_geometric, torch_scatter, torch_sparse. Please refer to PyG for installation.

Pip Wheels

# glibc>=2.14, torch>=1.13
pip install graphlearn-torch

Build from source

Install Dependencies

git submodule update --init
sh install_dependencies.sh

Python

Build

python setup.py bdist_wheel
pip install dist/*

Build in CPU-mode

WITH_CUDA=OFF python setup.py bdist_wheel
pip install dist/*

sh scripts/run_python_ut.sh

C++

If you need to test C++ operations, you can only build the C++ part.

Build

cmake .
make -j

sh scripts/run_cpp_ut.sh

Quick Tour

Accelarating PyG model training on a single GPU.

Let's take PyG's GraphSAGE on OGBN-Products as an example, you only need to replace PyG's torch_geometric.loader.NeighborSampler by the graphlearn_torch.loader.NeighborLoader to benefit from the the acceleration of model training using GLT.

import torch
import graphlearn_torch as glt
import os.path as osp

from ogb.nodeproppred import PygNodePropPredDataset

# PyG's original code preparing the ogbn-products dataset
root = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'products')
dataset = PygNodePropPredDataset('ogbn-products', root)
split_idx = dataset.get_idx_split()
data = dataset[0]

# Enable GLT acceleration on PyG requires only replacing
# PyG's NeighborSampler with the following code.
glt_dataset = glt.data.Dataset()
glt_dataset.build(edge_index=data.edge_index,
                  feature_data=data.x,
                  sort_func=glt.data.sort_by_in_degree,
                  split_ratio=0.2,
                  label=data.y,
                  device=0)
train_loader = glt.loader.NeighborLoader(glt_dataset,
                                         [15, 10, 5],
                                         split_idx['train'],
                                         batch_size=1024,
                                         shuffle=True,
                                         drop_last=True,
                                         as_pyg_v1=True)

The complete example can be found in examples/train_sage_ogbn_products.py.

While building the glt_dataset, the GPU where the graph sampling operations are performed is specified by parameter device. By default, the graph topology are stored in pinned memory for ZERO-COPY access. Users can also choose to stored the graph topology in GPU by specifying graph_mode='CUDA in graphlearn_torch.data.Dataset.build. The split_ratio determines the fraction of feature data to be cached in GPU. By default, GLT sorts the vertices in descending order according to vertex indegree and selects vetices with higher indegree for feature caching. The default sort function used as the input parameter for graphlearn_torch.data.Dataset.build is graphlearn_torch.data.reorder.sort_by_in_degree. Users can also customize their own sort functions with compatible APIs.

Distributed training

For PyTorch DDP distributed training, there are usually several steps as follows:

First, load the graph and feature from partitions.

import torch
import os.path as osp
import graphlearn_torch as glt

# load from partitions and create distributed dataset.
# Partitions are generated by following script:
# `python partition_ogbn_dataset.py --dataset=ogbn-products --num_partitions=2`

root = osp.join(osp.dirname(osp.realpath(__file__)), '..', '..', 'data', 'products')
glt_dataset = glt.distributed.DistDataset()
glt_dataset.load(
  num_partitions=2,
  partition_idx=int(os.environ['RANK']),
  graph_dir=osp.join(root, 'ogbn-products-graph-partitions'),
  feature_dir=osp.join(root, 'ogbn-products-feature-partitions'),
  label_file=osp.join(root, 'ogbn-products-label', 'label.pt') # whole label
)
train_idx = torch.load(osp.join(root, 'ogbn-products-train-partitions',
                                'partition' + str(os.environ['RANK']) + '.pt'))

Second, create distributed neighbor loader based on the dataset above.

# distributed neighbor loader
train_loader = glt.distributed.DistNeighborLoader(
  data=glt_dataset,
  num_neighbors=[15, 10, 5],
  input_nodes=train_idx,
  batch_size=batch_size,
  drop_last=True,
  collect_features=True,
  to_device=torch.device(rank % torch.cuda.device_count()),
  worker_options=glt.distributed.MpDistSamplingWorkerOptions(
    num_workers=nsampling_proc_per_train,
    worker_devices=[torch.device('cuda', (i + rank) % torch.cuda.device_count())
                    for i in range(nsampling_proc_per_train)],
    worker_concurrency=4,
    master_addr='localhost',
    master_port=12345, # different from port in pytorch training.
    channel_size='2GB',
    pin_memory=True
  )
)

Finally, define DDP model and run.

from torch.nn.parallel import DistributedDataParallel
from torch_geometric.nn import GraphSAGE

# DDP model
model = GraphSAGE(
  in_channels=num_features,
  hidden_channels=256,
  num_layers=3,
  out_channels=num_classes,
).to(rank)
model = DistributedDataParallel(model, device_ids=[rank])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# training.
for epoch in range(0, epochs):
  model.train()
  for batch in train_loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)[:batch.batch_size].log_softmax(dim=-1)
    loss = F.nll_loss(out, batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
  dist.barrier()

The training scripts for 2 nodes each with 2 GPUs are as follows:

# node 0:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=0 --master_addr=xxx dist_train_sage_supervised.py

# node 1:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=1 --master_addr=xxx dist_train_sage_supervised.py

Full code can be found in distributed training example.

License

Apache License 2.0

graphlearn-for-pytorch's People

Contributors

Stargazers

Watchers

graphlearn-for-pytorch's Issues

Mathematical inequivalence introduced by GLT Sampler vs. DGL Sampler?

🐛 Describe the bug

We observe that there are some differences in samplers that leads to computational / mathematical inequivalences between GLT's implementation of IGBH training vs. IGB's "official" implementation, when running the training code. More specifically, we observe that, compared to IGB's DGL sampler, GLT/PyG sampler leads to unnecessary message passing & aggregation and therefore results in different computation results.

In a high level, there are two differences that leads to the computational inequivalence here:

Given a path a->b->c, at each layer, GLT is updating all nodes at this layer and higher layer. For instance, at layer 0, we need to update node embeddings for both b and c. However, in DGL, only node b's embedding will be updated. Therefore, if we consider the final embedding generated for node c, GLT computes it as h_2(c) = h_1(c) + h_1(b) = (h_0(c) + h_0(b)) + (h_0(b) + h_0(a)), while DGL computes it as h_2(c) = h_1(c) + h_1(b) = h_0(c) + (h_0(b) + h_0(a)).
When creating blocks for layer $i-1$, DGL not only uses the destination nodes from layer $i$ as seed nodes, but also includes all subsequent $i+1\cdots n$ destination nodes as seed nodes.

Take the following graph as an example:

3 -> 9, 9 -> 12, 8 -> 12

If the path 3 -> 9 -> 12 is being sampled, then the message passing path in GLT is:

at first layer, we compute 3 -> 9 and 9 -> 12
at second layer, we compute 9 (value updated with node 3's feature) -> 12

Thus, from the above computation, we see that feature 9 is included twice.

However, in DGL, the sampling process is a bit different:

We start with a seed node set containing only our target node 12: [12, ]
At first hop (corresponding to the model's second layer), we add 9, since 9 is randomly chosen as 12's source node, added to the seed node set: [12, 9]
At second hop (corresponding to the model's first layer), we add:
node 8, since node 12 is in the seed node set
node 3, since node 9 is in the seed node set
And therefore we results in [12, 9, 8, 3] nodes, with graph 3 -> 9 -> 12 <- 8.

Therefore, in the first layer, we compute 3 -> 9 and 8 -> 12; and in the second layer, we additionally compute 9 -> 12. This computation is the algorithm that is implemented in the paper (IGB's official repository), and is mathematically different from GLT's current implementation.

Environment

GLT version: latest, built from main branch.
PyG version: 2.3.1
PyTorch version: 1.14.0
OS: Linux
Python version: 3.8.10
CUDA/cuDNN version: 11.8
Any other relevant information

[Model] R-GAT model with IGBH dataset

🚀 The feature, motivation and pitch

Add R-GAT model refer to https://github.com/snap-stanford/ogb/blob/master/examples/lsc/mag240m/rgnn.py
on IGBH dataset (https://github.com/IllinoisGraphBenchmark/IGB-Datasets).

Alternatives

No response

Additional context

No response

Add supports for weighted edge sampling

🚀 The feature, motivation and pitch

In large graphs with more than ten million edges, the need for edge sampling is inevitable. Hopefully, GLT can support sampling based on edge weights, so as to better utilize edge features.

Alternatives

No response

Additional context

No response

index out of bounds for partition book List

🐛 Describe the bug

I ran into this problem when running distributed training for igbh-large dataset. Just keep a record here, if you meet the same problem or have solved this, please let me know~

task failed: index 40437200 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 43094749 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 36034227 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 41991547 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 44125491 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 31882725 is out of bounds for dimension 0 with size 4490

and the out of bound error is form here:

cmd lines:
node0:
python dist_train_rgnn.py --num_nodes=4 --node_rank=0 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node1:
python dist_train_rgnn.py --num_nodes=4 --node_rank=1 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node2:
python dist_train_rgnn.py --num_nodes=4 --node_rank=2 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node3:
python dist_train_rgnn.py --num_nodes=4 --node_rank=3 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19

Environment

Environment
GLT version: 0.2.0(build from latest source code)
PyG version: 2.3.1
PyTorch version: 1.13.1+cpu
OS: Ubuntu 22.04.2 LTS
Python version: 3.8.16
CUDA/cuDNN version: N/A

Error handling in distributed training

🚀 The feature, motivation and pitch

Properly handle the errors occurred during the training:

Stop issuing new RPC request when an previous one failed
Terminate training once an error happened
Properly clean up the shared graph data

Consider both the mp mode and the collocated mode.

Alternatives

No response

Additional context

No response

Figure out where the `None` is from

🐛 Describe the bug

On the igbh large example, one place to fix #54 is by replacing None to empty tensor. But where this None's from is yet to be figured out.

Environment

GLT version:
PyG version:
PyTorch version:
OS:
Python version:
CUDA/cuDNN version:
Any other relevant information

Cannot build from source for C++ operations

🐛 Describe the bug

Hi,

I followed the guide and tried to build the code from the source. I am using GCC 10.4.0 and the exact version of GoogleTest branch (through 'git submodule update --init')

Here is the error message:
'[ 76%] Built target test_vineyard
/usr/bin/ld: CMakeFiles/test_shm_queue.dir/test/cpp/test_shm_queue.cu.o: in function testing::AssertionResult testing::internal::CmpHelperOpFailure<int, int>(char const*, char const*, int const&, int const&, char const*)': tmpxft_000058eb_00000000-6_test_shm_queue.cudafe1.cpp:(.text._ZN7testing8internal18CmpHelperOpFailureIiiEENS_15AssertionResultEPKcS4_RKT_RKT0_S4_[_ZN7testing8internal18CmpHelperOpFailureIiiEENS_15AssertionResultEPKcS4_RKT_RKT0_S4_]+0x90): undefined reference to testing::Message::GetString() const'
'

Environment

GLT version: Latest main branch
PyG version: 2.3
PyTorch version: 2.0.0 + cu11.7
OS: ubuntu 22.04 LTS
Python version: 3.9.16
CUDA/cuDNN version: --- Driver Version: 530.30.02 , CUDA Version: 12.1
Any other relevant information

[Feat] Remove redundancy in storage and computation caused by NeighborLoader and new model implementation in PyG.

🚀 The feature, motivation and pitch

We found that the NeighborLoader version of the GraphSAGE model in PyG takes a subgraph composed of multiple hops of sampling as the input and updates it at each layer. This leads to a lot of redundancy in storage and computation, see pyg-team/pytorch_geometric#3799. And this part becomes a bottleneck when using GLT's GPU sampling.

We should remove this redundancy to improve the performance of e2e training.

Alternatives

After 2.3 version, PyG has support Hierarchical Neighborhood Sampling to extend classical Neighborhood Sampling by collecting additional information about number of sampled nodes and edges per each hop, and add num_sampled_nodes_per_hop and num_sampled_edges_per_hop in basic gnn model to trim the layer data.
This will reduce the redundancy in storage and computation, see pyg-team/pytorch_geometric#7331.

We can also support Hierarchical Neighborhood Sampling in GLT's NeighborSampler and DistNeighborSampler and modify examples to avoid performance losses caused by redundant computation.

Additional context

[Doc] User Guide on Alibaba Cloud

📚 The doc issue

This issue adds the user manual for training GNN models with GLT in single or distributed mode with the products DSW and DLC on Alibaba Cloud-PAI(Platform of Artificial Intelligence).

Suggest a potential alternative/fix

No response

Why CUDA Graph not enabled in training process?

🚀 The feature, motivation and pitch

CUDA Graph Jit Optimization could only be used in inference process. Why it couldn't be used in training process?

Alternatives

No response

Additional context

No response

Cannot install from pip

🐛 Describe the bug

I met an error when trying to install from pip:
ERROR: Could not find a version that satisfies the requirement graphlearn-torch (from versions: none)
ERROR: No matching distribution found for graphlearn-torch

Environment

GLT version:
PyG version: 2.4.0
PyTorch version: 2.1.0
OS: Window
Python version: 3.10.13
CUDA/cuDNN version: 11.8
Any other relevant information

[Feat] Distributed Sparse Backend

🚀 The feature, motivation and pitch

Background

The GNN convolutions generates a lot of memory expansion during message passing and hard to make an optimal use of parallelization resources due to the sparsity, which result in insufficient computational performance and too high peak memory.
geSpMM, geSDDMM integrate graph operator, matrix calculation and reduce operator into one sparse kernel, to reduce kernel launch times and usage of memory, and then improve performance.

Objective

Distributed Sparse Backend using Sparse Matrix Multiplication to express convolutions in GNN, replacing the commonly used Message Passing paradigm, and supporting high distributed sparse convolution.

Moreover, we can optimize the parallel implementation of the kernel based on the sparsity and feature dimensions of the input data.
When the graph data or model is too large, we can use data parallelism, model parallelism, and pipeline parallelism for distributed optimization.

Tasks

This work includes the following major tasks, we will enrich each specific task into detailed subtasks.

Phase 1: Implementations

Sparse Matrix representation: Convert graph data in GNN into sparse matrix format for efficient matrix computation like multiplication, softmax...
Sparse Matrix computation kernels: like geSpMM, geSDDMM, EdgeSoftmax..
GNN models: Implement basic GNN models and LLM-GNN models with Sparse kernels to improve computation efficiency and reduce peak memory .
Distributed sparse modules: For commonly used GNN models, using DP, MP, PP to implement the most efficient distributed sparse convs, just like Megatron.

Phase 2: Performance optimizations

Kernel optimization: Optimize parallelization of kernels for different workloads, half-precision and mixed-precision.
Computation graph capture and compilation optimization: using TorchDynamo or other techniques to capture GNN operators and dynamic sparse shapes, enrich HLO to support lowering the sparse kernels mentioned above, and optimize based on input graph.
Memory optimization: using techniques like CPU offload-ZERO.
Distributed optimization: more efficient parallelism, cache..

Alternatives

No response

Additional context

No response

Single-node Multi-GPU training throws CUDA failure: an illegal memory access was encountered.

🐛 Describe the bug

Hello, we're trying to modify the examples/igbh/train_rgnn.py so that it supports single-node multi-GPU training. However, when trying to follow the OGBN single-node multi-GPU training example, we encountered some CUDA failure /workspace/graphlearn/graphlearn_torch/csrc/cuda/unified_tensor.cu:351: 'an illegal memory access was encountered' errors when loading the first batch of data. Here is a min rep code for the issue (we removed the validation & test dataset for simplicity):

import argparse
import torch
import warnings
import os

import graphlearn_torch as glt

from dataset import IGBHeteroDataset
from rgnn import RGNN

torch.manual_seed(42)
warnings.filterwarnings("ignore")
    

def run(proc_id, devices, glt_dataset, train_idx, etypes, node_features, args):
    # examples/multi_gpu/train_sage_ogbn_papers100m.py line 35-39
    os.environ['MASTER_ADDR'] = "127.0.0.1"
    os.environ['MASTER_PORT'] = "12365"
    torch.distributed.init_process_group(
        "nccl", rank=proc_id, world_size=len(devices)
    )
    torch.cuda.set_device(proc_id)
    device = torch.device(proc_id)

    # examples/multi_gpu/train_sage_ogbn_papers100m.py line 41: splitting train_idx according to devices
    train_idx = train_idx.split(train_idx.size(0) // len(devices))[proc_id]

    train_loader = glt.loader.NeighborLoader(
        glt_dataset,
        [int(fanout) for fanout in args.fan_out.split(",")],
        ("paper", train_idx),
        batch_size=args.batch_size,
        shuffle=True,
        drop_last=False,
        device=device
    )

    model = RGNN(
        etypes,
        node_features,
        args.hidden_channels,
        args.num_classes,
        num_layers=args.num_layers,
        dropout=0.2,
        model=args.model,
        heads=args.num_heads,
        node_type='paper').to(device)
    
    # examples/multi_gpu/train_sage_ogbn_papers100m.py line 55: torch DDP
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device.index], find_unused_parameters=True)

    loss_fcn = torch.nn.CrossEntropyLoss().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)

    for epoch in range(args.epochs):
        model.train()
        for batch in train_loader:
            batch_size = batch['paper'].batch_size
            out = model(batch.x_dict, batch.edge_index_dict)[:batch_size]
            y = batch['paper'].y[:batch_size]
            loss = loss_fcn(out, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        torch.cuda.synchronize()
        torch.distributed.barrier()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--path', type=str, default="/data",
            help='path containing the datasets')
    parser.add_argument('--dataset_size', type=str, default='tiny',
            choices=['tiny', 'small', 'medium', 'large', 'full'],
            help='size of the datasets')
    parser.add_argument('--num_classes', type=int, default=19,
            choices=[19, 2983], help='number of classes')
    parser.add_argument('--in_memory', type=int, default=0,
            choices=[0, 1], help='0:read only mmap_mode=r, 1:load into memory')
    # Model
    parser.add_argument('--model', type=str, default='rgat',
                                            choices=['rgat', 'rsage'])
    # Model parameters
    parser.add_argument('--fan_out', type=str, default='10,10')
    parser.add_argument('--batch_size', type=int, default=5120)
    parser.add_argument('--hidden_channels', type=int, default=128)
    parser.add_argument('--learning_rate', type=int, default=0.01)
    parser.add_argument('--epochs', type=int, default=20)
    parser.add_argument('--num_layers', type=int, default=2)
    parser.add_argument('--num_heads', type=int, default=4)
    parser.add_argument('--log_every', type=int, default=5)
    parser.add_argument("--edge_dir", type=str, default='in')

    parser.add_argument("--gpu_devices", type=str, default="0,1")
    args = parser.parse_args()

    # in specifying GPU groups with NVLinks,
    # we separate each GPU id by comma, and separate the GPU groups by semicolon
    gpu_groups = [[int(idx) for idx in group.split(",")] for group in args.gpu_devices.split(";")]
    num_gpus = len(sum(gpu_groups, []))

    igbh_dataset = IGBHeteroDataset(
        args.path, 
        args.dataset_size, 
        args.in_memory,
        args.num_classes==2983)
    # init graphlearn_torch Dataset.
    glt_dataset = glt.data.Dataset(edge_dir=args.edge_dir)

    glt_dataset.init_graph(
        edge_index=igbh_dataset.edge_dict,
        graph_mode='ZERO_COPY'
        # examples/multi_gpu/train_sage_ogbn_papers100m.py line 99: graph_mode = zero copy
    )

    glt_dataset.init_node_features(
        node_feature_data=igbh_dataset.feat_dict,
        # examples/multi_gpu/train_sage_ogbn_papers100m.py line 102, default with_gpu is True
        with_gpu=True,
        # examples/multi_gpu/train_sage_ogbn_papers100m.py line 105
        split_ratio=0.15 * min(num_gpus, 4),
        # examples/multi_gpu/train_sage_ogbn_papers100m.py line 106, create DeviceGroups
        device_group_list=[
            glt.data.DeviceGroup(idx, group) 
            for idx, group in enumerate(gpu_groups)
        ]
    )

    # no change to initializing node labels
    glt_dataset.init_node_labels(node_label_data={'paper': igbh_dataset.label})

    etypes = igbh_dataset.etypes
    node_features = igbh_dataset.feat_dict['paper'].shape[1]

    # examples/multi_gpu/train_sage_ogbn_papers100m.py line 111: train_idx.share_memory_()
    train_idx = igbh_dataset.train_idx
    train_idx.share_memory_()

    torch.multiprocessing.spawn(run,
             args=(sum(gpu_groups, []), glt_dataset, train_idx, etypes, node_features, args),
             nprocs=len(sum(gpu_groups, [])))

After placing the above code under examples/igbh and running it with command python3 min_rep.py --path /data --dataset_size small --num_classes 2983 --epochs 3 --log_every 1, it outputs the following error messages:

CUDA failure /workspace/graphlearn/graphlearn_torch/csrc/cuda/unified_tensor.cu:351: 'an illegal memory access was encountered'
CUDA failure /workspace/graphlearn/graphlearn_torch/csrc/cuda/unified_tensor.cu:351: 'an illegal memory access was encountered'
Traceback (most recent call last):
  File "min_rep.py", line 158, in <module>
    torch.multiprocessing.spawn(run,
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 149, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

Could you look into this issue and share some insights on fixing it? We have been using the default dataset.py and rgnn.py provided under examples/igbh for running this code.

Environment

GLT version: v0.2.0, built from source based on the current main branch
PyG version: 2.3.1
PyTorch version: 1.14.0
OS: Linux (Docker image)
Python version: 3.8.10
CUDA/cuDNN version: 11.8
Any other relevant information

Add supports for weighted edge sampling

🚀 The feature, motivation and pitch

In large graphs with more than ten million edges, the need for edge sampling is inevitable. Hopefully, GLT can support sampling based on edge weights, so as to better utilize edge features.

Alternatives

No response

Additional context

No response

[Model] Add link prediction models.

🚀 The feature, motivation and pitch

GLT has provided both stand-alone and distributed examples of node classification models, but examples of link prediction models are not yet available. So we should add link prediction GNN examples based on LinkNeighborLoader and DistLinkNeighborLoader like PyG's. in detail, these tasks are as follows

homo GraphSAGE refer to https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_sage_unsup_ppi.py
hetero GraphSAGE
docs https://github.com/alibaba/graphlearn-for-pytorch/blob/main/docs/get_started/link_pred.md

Alternatives

No response

Additional context

No response

AttributeError: module 'graphlearn_torch.py_graphlearn_torch' has no attribute 'SampleQueue'

🐛 Describe the bug

When I run igbh example for distributed CPU training
python dist_train_rgnn.py --num_nodes=2 --node_rank=0 --num_training_procs=2 --master_addr=localhost --model='rgat' --dataset_size='tiny' --num_classes=19
and
python dist_train_rgnn.py --num_nodes=2 --node_rank=1 --num_training_procs=2 --master_addr=localhost --model='rgat' --dataset_size='tiny' --num_classes=19,
it returns error with info
Traceback (most recent call last): File "/mnt/disk1/kaixuan/anaconda3/envs/gltorch-cpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap fn(i, *args) File "/home/kaixuan/ws/pyG-work/graphlearn-for-pytorch/examples/igbh/dist_train_rgnn.py", line 84, in run_training_proc train_loader = glt.distributed.DistNeighborLoader( File "/mnt/disk1/kaixuan/anaconda3/envs/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_neighbor_loader.py", line 92, in __init__ super().__init__( File "/mnt/disk1/kaixuan/anaconda3/envs/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_loader.py", line 177, in __init__ self._channel = ShmChannel(self.worker_options.channel_capacity, File "/mnt/disk1/kaixuan/anaconda3/envs/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/channel/shm_channel.py", line 48, in __init__ self._queue = pywrap.SampleQueue(capacity, shm_size) AttributeError: module 'graphlearn_torch.py_graphlearn_torch' has no attribute 'SampleQueue'

Environment

GLT version: 0.2.0 (build from latest source code by adjust WITH_CUDA to OFF in setup.py)
PyG version: 2.3.1
PyTorch version: 1.13.1+cpu
OS: Rocky Linux release 8.6 (Green Obsidian)
Python version:3.8
CUDA/cuDNN version: not installed
Any other relevant information

does not support pyg sampler for hetero-graph

🐛 Describe the bug

When I replace sampler from gltorch to pyg (add as_pyg_v1=True in glt.loader.NeighborLoader), and run training example for igbh dataset, it returned error like this:
`Traceback (most recent call last):
File "train_rgnn.py", line 194, in
train(model, device, train_loader, val_loader, test_loader, args)
File "train_rgnn.py", line 72, in train
for batch in train_dataloader:
File "/mnt/disk1/kaixuan/anaconda3/envs/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/loader/neighbor_loader.py",
line 104, in next
return self.sampler.sample_pyg_v1(seeds)
File "/mnt/disk1/kaixuan/anaconda3/envs/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/sampler/neighbor_sampler.py
", line 414, in sample_pyg_v1
srcs = inducer.init_node(srcs)
TypeError: init_node(): incompatible function arguments. The following argument types are supported:
1. (self: graphlearn_torch.py_graphlearn_torch.CPUHeteroInducer, seed: Dict[str, at::Tensor]) -> Dict[str, at::Tensor]

Invoked with: <graphlearn_torch.py_graphlearn_torch.CPUHeteroInducer object at 0x7f9658039270>, tensor([ 9612, 26762, 38719,
..., 11742, 47627, 47025])`

cmd line:
python train_rgnn.py --model='rgat' --dataset_size='tiny' --num_classes=19 --cpu_mode

I looked through the code, and find the definition in sample_pyg_v1 only supports homogeneous graph, while glt supports both homo and hetero graph. So do you have plan to support pyg sampler for hetero graph?

Environment

GLT version: 0.2.0(build from latest source code)
PyG version: 2.3.1
PyTorch version: 1.13.1+cpu
OS: Rocky Linux release 8.7 (Green Obsidian)
Python version: 3.8.16
CUDA/cuDNN version: N/A
Any other relevant information

[Feat] In-bound (In-edge) Sampling Support

🚀 The feature, motivation and pitch

GLT now only support out-bound sampling, i.e. sampling from src to dst, while in some cases in-bound sampling, i.e. sampling from dst to src, may perform better. IMO, it is necessary to support the option of in-bound sampling.

Alternatives

No response

Additional context

No response

[CUDA] Optimize GPU memory allocation/deallocation in CUDA Operators such as neighbor sampling.

🚀 The feature, motivation and pitch

GLT's CUDA Operations such as GPU sampling contains numerous cudaFree calls, which may negatively impact performance. One potential solution is to implement a GPU memory pool to manage memory allocation and deallocation instead of directly calling cudaMalloc(Async)/cudaFree(Async).

Alternatives

No response

Additional context

No response

[PyG] Support direct data loading from PyG Data&HeteroData format

🚀 The feature, motivation and pitch

To leverage pre-processing modules and datasets from PyG, we need to implement a data loading function that transforms PyG Data&HeteroData format to GLT's format. The function is expected to be convenient and concise for users.

Alternatives

No response

Additional context

No response

[Bug] some progress may hang at global_barrier when initializing the torch rpc cluster

🐛 Describe the bug

When running the server-client mode example with two client nodes, each containing two processes, there are chances that client 0 may hang and raise a timeout error during the global_barrier call.

Log on client node 1:
-- [Client 1] Initializing client ... -- [Client 0] Initializing client ... -- [Client 1] Initializing training process group of PyTorch ... ERROR:root:Failed to respond to 'global_barrier' in time, got error Followers ['dist-train-supervised-sage-client-1', 'dist-train-supervised-sage-client-3'] timed out in _all_gather after 15.00 seconds. The first exception is RPCErr:1:RPC ran for more than set timeout (15000 ms) and will now be marked with an error -- [Client 0] Initializing training process group of PyTorch ...
Log on client node 2:
--- Launching client processes ... -- [Client 3] Initializing client ... -- [Client 2] Initializing client ... -- [Client 2] Initializing training process group of PyTorch ... -- [Client 3] Initializing training process group of PyTorch ...
The logs may indicate that client 0 does not receive responses from the other clients in time.

Environment

GLT version: 0.2.1
PyG version: /
PyTorch version: 1.13
OS: ubuntu
Python version: 3.8
CUDA/cuDNN version: /

Examples have fixed default number of processes

🐛 Describe the bug

Currently, the IGBH example sets the number of spawned processes based on solely an argparse variable. In the case of MultiGPU training, this is not quite intuitive, as from the example calls, it seems that the script is inferring the number of processes on GPU via CUDA_VISIBLE_DEVICES. However, this is not true, as this value is only derived from the argparse input, with 2 as default.

https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/dist_train_rgnn.py#L254

Environment

GLT version: 1.0
PyG version:
PyTorch version:
OS:
Python version:
CUDA/cuDNN version:
Any other relevant information

Crashed when running distributed CPU training using 2 nodes

🐛 Describe the bug

Hi @husimplicity @baoleai , I updated the latest code, and can build pure CPU version GLTorch. And I have managed to run CPU-version distributed training on one single node. However, when I run distributed training on different nodes, it returned error like this:
`Traceback (most recent call last):
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_loader.py", line 227, in 'del'
self.shutdown()
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_loader.py", line 230, in shutdown
if self._shutdowned:
AttributeError: 'DistNeighborLoader' object has no attribute '_shutdowned'
Exception ignored in: <function DistLoader.del at 0x148225431550>
Traceback (most recent call last):
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_loader.py", line 227, in 'del'
self.shutdown()
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_loader.py", line 230, in shutdown
if self._shutdowned:
AttributeError: 'DistNeighborLoader' object has no attribute '_shutdowned'
Traceback (most recent call last):
File "dist_train_rgnn.py", line 300, in
torch.multiprocessing.spawn(
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/global/panfs01/users/liukaixu/graphlearn-for-pytorch/examples/igbh/dist_train_rgnn.py", line 84, in run_training_proc
train_loader = glt.distributed.DistNeighborLoader(
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_neighbor_loader.py", line 92, in init
super().init(
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/distributed/dist_loader.py", line 177, in init
self._channel = ShmChannel(self.worker_options.channel_capacity,
File "/panfs/users/liukaixu/env/gltorch-cpu/lib/python3.8/site-packages/graphlearn_torch/channel/shm_channel.py", line 48, in init
self._queue = pywrap.SampleQueue(capacity, shm_size)
AttributeError: module 'graphlearn_torch.py_graphlearn_torch' has no attribute 'SampleQueue'`

cmd lines are:
`#node 0:
python dist_train_rgnn.py --num_nodes=2 --node_rank=0 --num_training_procs=2 --master_addr=eedq004 --model='rgat' --dataset_size='tiny' --num_classes=19 --cpu_mode

#node 1:
python dist_train_rgnn.py --num_nodes=2 --node_rank=1 --num_training_procs=2 --master_addr=eedq004 --model='rgat' --dataset_size='tiny' --num_classes=19 --cpu_mode`

where eedq004 is the hostname of node 0.

Environment

GLT version: 0.2.0 (build from latest source code)
PyG version: 2.3.1
PyTorch version: 1.13.1+cpu
OS: Rocky Linux release 8.7 (Green Obsidian)
Python version: 3.8.16
CUDA/cuDNN version: No
Any other relevant information

[Workflow] Add CI/CD workflow

🚀 The feature, motivation and pitch

Add CI/CD workflows on self-hosted runner with cuda support.

Alternatives

No response

Additional context

No response

RPC Failure after 1st epoch of training on IGBH-tiny and IGBH-small

🐛 Describe the bug

After a single epoch of training on the IGBH dataset, Graphlearn consistently encounters an RPC timeout.

Rank00 | Epoch 000 | Loss 529.0624 | Train Acc 69.31 | Val Acc 74.61 | Time 0:01:51 | GPU 3415.3 MB                                                                                                                                                  
  5%|██████████▍                                                                                                                                                                                                     | 1/20 [01:51<35:09, 111.02s/it]ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error
ERROR:root:coroutine task failed: RPCErr:1:RPC ran for more than set timeout (180000 ms) and will now be marked with an error

Environment

GLT version: 1.0
PyG version:
PyTorch version: 2.0
OS: Linux
Python version: 3.10
CUDA/cuDNN version: 12.0
Any other relevant information

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

🐛 Describe the bug

There is a breaking change for indices and the indexed tensor since pytorch1.13, see pytorch/pytorch#90194

Environment

PyTorch version: >=1.13

[Bug] CUDA failure: 'invalid configuration argument' when batch_size is 1 or 2

🐛 Describe the bug

python train_rgnn.py
model size: 3.967MB
0%| | 0/20 [00:00<?, ?it/s]CUDA failure /codelab/graphlearn-for-pytorch/graphlearn_torch/csrc/cuda/random_sampler.cu:222: 'invalid configuration argument'

Environment

GLT version:
PyG version:
PyTorch version:
OS:
Python version:
CUDA/cuDNN version:
Any other relevant information

alibaba / graphlearn-for-pytorch Goto Github PK

graphlearn-for-pytorch's Introduction

Highlighted Features

Architecture Overview

Installation

Requirements

Pip Wheels

Build from source

Install Dependencies

Python

C++

Quick Tour

Accelarating PyG model training on a single GPU.

Distributed training

License

graphlearn-for-pytorch's People

Contributors

Stargazers

Watchers

Forkers

graphlearn-for-pytorch's Issues

🐛 Describe the bug

Environment

🚀 The feature, motivation and pitch

Alternatives

Additional context

🚀 The feature, motivation and pitch

Alternatives

Additional context

🐛 Describe the bug

Environment

🚀 The feature, motivation and pitch

Alternatives

Additional context

🐛 Describe the bug

Environment

🐛 Describe the bug

Environment

🚀 The feature, motivation and pitch

Alternatives

Additional context

📚 The doc issue

Suggest a potential alternative/fix

🚀 The feature, motivation and pitch

Alternatives

Additional context

🐛 Describe the bug

Environment

🚀 The feature, motivation and pitch

Background

Objective

Tasks

Alternatives

Additional context

🐛 Describe the bug

Environment

🚀 The feature, motivation and pitch

Alternatives

Additional context

🚀 The feature, motivation and pitch

Alternatives

Additional context

🐛 Describe the bug

Environment

🐛 Describe the bug

Environment

🚀 The feature, motivation and pitch

Alternatives

Additional context

🚀 The feature, motivation and pitch

Alternatives

Additional context

🚀 The feature, motivation and pitch

Alternatives

Additional context

🐛 Describe the bug

Environment

🐛 Describe the bug

Environment

🐛 Describe the bug