
DIFFormer: Diffusion-based (Graph) Transformers

The official implementation of the ICLR 2023 spotlight paper "DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion".

Related material: [Paper], [Blog Chinese | English], [Video]

DIFFormer is a general-purpose encoder that computes instance representations while accounting for their latent or observed interactions.

This work builds on NodeFormer (NeurIPS22), a scalable Transformer for large graphs with linear complexity.

What's New

[2023.03.01] We release an early version of our code for node classification.

[2023.03.09] We release code for image/text classification and spatial-temporal prediction.

[2023.07.03] I gave a talk at the LOG seminar on scalable graph Transformers. See the online video here.

Model Overview

DIFFormer is motivated by an energy-constrained diffusion process that encodes a batch of instances into structured representations. At each step, the model first estimates pairwise influence (i.e., attention) among arbitrary instance pairs (whether or not they are connected by an input graph) and then updates instance embeddings by feature propagation. The feed-forward process can be viewed as a diffusion process that minimizes a global energy.
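In schematic form (our paraphrase; see the paper for the precise statement), each layer performs one explicit Euler step of the diffusion $\partial Z(t)/\partial t = (S(Z(t)) - I)\,Z(t)$ with step size $\tau$:

$Z^{(k+1)} = (1-\tau)\,Z^{(k)} + \tau\,S^{(k)} Z^{(k)}$

where $S^{(k)}$ is the attention-induced diffusivity matrix estimated at step $k$, and the update decreases a global energy that trades off fidelity to the input features against smoothness across instance pairs.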

[Figure: the energy-constrained diffusion process]

Specifically, DIFFormer's architecture is depicted in the following figure: one DIFFormer layer comprises global attention, a GCN convolution, and a residual link. The global attention is our key design and comes in two instantiations: DIFFormer-s and DIFFormer-a.

[Figure: DIFFormer layer architecture (global attention + GCN convolution + residual link)]

We implement the model in difformer.py, where DIFFormer-s (resp. DIFFormer-a) corresponds to kernel = 'simple' (resp. kernel = 'sigmoid'). The difference between the two versions lies in the global attention computation: DIFFormer-s requires only $O(N)$ complexity, while DIFFormer-a requires $O(N^2)$, as illustrated by the figure below, where red marks the computational bottleneck.

[Figure: attention computation of DIFFormer-s ($O(N)$) vs. DIFFormer-a ($O(N^2)$)]
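For intuition, here is a minimal sketch (our illustration, not the repository's exact code, which includes additional terms) of why the 'simple' kernel admits $O(N)$ attention: the numerator $Q(K^\top V)$ is computed by contracting keys with values first, so the $N \times N$ attention matrix is never materialized.

    import torch

    def simple_kernel_attention(qs, ks, vs):
        # Sketch of the 'simple' kernel; the repository's implementation adds extra terms.
        # qs, ks: [N, H, M] queries/keys; vs: [N, H, D] values.
        # Contract keys with values first: O(N) cost, no N x N matrix.
        kvs = torch.einsum("lhm,lhd->hmd", ks, vs)            # [H, M, D]
        attn_num = torch.einsum("nhm,hmd->nhd", qs, kvs)      # [N, H, D]
        # Row normalizer q_i . (sum_j k_j), also computable in O(N).
        ks_sum = ks.sum(dim=0)                                # [H, M]
        normalizer = torch.einsum("nhm,hm->nh", qs, ks_sum)   # [N, H]
        return attn_num / normalizer.unsqueeze(-1)            # [N, H, D]

By contrast, the 'sigmoid' kernel applies a nonlinearity to each pairwise score $\sigma(q_i^\top k_j)$, which forces materializing all $N^2$ entries.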

Where can DIFFormer be used?

We focus on three different types of tasks in our experiments: graph-based node classification, image and text classification, and spatial-temporal prediction. Beyond these scenarios, DIFFormer can be used as a general-purpose encoder for various applications including but not limited to:

  • Encoding node features and graph structures: given node features $X$ and graph adjacency $A$, output node embeddings $Z$ or predictions $\hat Y$
      model = DIFFormer(in_channels, hidden_channels, out_channels, use_graph=True)
      z = model(x, edge_index) # x: [num_nodes, in_channels], edge_index: [2, E], z: [num_nodes, out_channels]
  • Encoding instances (w/o graph structures): given instance features $X$ that are independent samples, output instance embeddings $Z$ or predictions $\hat Y$
      model = DIFFormer(in_channels, hidden_channels, out_channels, use_graph=False)
      z = model(x, edge_index=None) # x: [num_inst, in_channels], z: [num_inst, out_channels]
  • As a plug-in encoder backbone for computing latent-space representations within a larger framework for various downstream tasks (generation, prediction, decision-making, etc.); see the sketch below.
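A hypothetical illustration of this last use case (DownstreamModel and head are our illustrative names, not part of the repository):

    import torch.nn as nn

    class DownstreamModel(nn.Module):
        # Illustrative (hypothetical) wrapper: DIFFormer as a latent-space encoder backbone.
        def __init__(self, in_channels, hidden_channels, num_classes):
            super().__init__()
            self.encoder = DIFFormer(in_channels, hidden_channels, hidden_channels,
                                     use_graph=True)
            self.head = nn.Linear(hidden_channels, num_classes)  # task-specific decoder

        def forward(self, x, edge_index):
            z = self.encoder(x, edge_index)  # latent instance representations
            return self.head(z)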

Dependencies

Our implementation is based on PyTorch and PyTorch Geometric. Please refer to requirements.txt in each folder for the required packages.

Datasets

We apply our model to three different tasks and consider different datasets.

  • For node classification and image/text classification, we provide easy access to the datasets used via Google Drive, except for the two large graph datasets OGBN-Proteins and Pokec, which are downloaded automatically when running the training/evaluation code.

(For the two image datasets, CIFAR and STL, we use a self-supervised pretrained model (ResNet-18) to obtain image embeddings as input features; a sketch of this step is given at the end of this section.)

  • For spatial-temporal prediction, the datasets are downloaded automatically via PyTorch Geometric Temporal.

Follow the instructions here to get the datasets ready for running our code.
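As a sketch of the image-feature extraction mentioned above (an assumption for illustration: we load torchvision's ImageNet weights as a stand-in, whereas the paper uses a self-supervised pretrained model, and the repository's actual preprocessing may differ):

    import torch
    import torchvision

    # Illustrative stand-in weights; the paper uses a self-supervised
    # pretrained ResNet-18 to produce the input embeddings.
    resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    resnet.fc = torch.nn.Identity()  # drop the classifier head, keep 512-d features
    resnet.eval()

    with torch.no_grad():
        images = torch.randn(8, 3, 32, 32)  # e.g., a batch of CIFAR-sized images
        embeddings = resnet(images)         # [8, 512] input features for DIFFormer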

How to run our code

  1. Install the required packages according to requirements.txt in each folder (note that the required packages differ across tasks).

  2. Create a folder ../data and download the datasets from here (OGBN-Proteins, Pokec, and the three spatial-temporal datasets are downloaded automatically).

  3. To train the model from scratch and evaluate it on specific datasets, refer to the run.sh scripts in each folder.

  4. To directly reproduce the results on the two large datasets (training can be time-consuming), we also provide DIFFormer checkpoints for OGBN-Proteins and Pokec. Download the trained models into ../model/ and run the script node classification/run_test_large_graph.sh to reproduce the results.

  • For Pokec, to reproduce our reported result, one needs to download the fixed splits from here to ../data/pokec/split_0.5_0.25.

Citation

If you find our code useful, please consider citing our work:

      @inproceedings{wu2023difformer,
        title={DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion},
        author={Qitian Wu and Chenxiao Yang and Wentao Zhao and Yixuan He and David Wipf and Junchi Yan},
        booktitle={International Conference on Learning Representations (ICLR)},
        year={2023}
      }


difformer's Issues

Bug in attention computation

When kernel is 'simple', the attention variable in full_attention_conv has a bug: its shape differs from the kernel='sigmoid' case.
I think line 43 of difformer.py should drop the trailing .unsqueeze(2); the shape then appears correct.

code

Great work! After reading the code I have some questions: to process images, does the code need to be modified, e.g., the dimensions?

error

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Cora datasets

I ran your node classification code on the Cora dataset.
The performance is not as expected.
Are there any hyperparameters that should be changed in the provided code?

These are the results when I ran python main.py:

Run 05:
Highest Train: 14.29
Highest Valid: 31.60
Highest Test: 31.90
Chosen epoch: 498
Final Train: 14.29
Final Test: 31.90
All runs:
Highest Train: 14.29 ± 0.00
Highest Test: 31.90 ± 0.00
Highest Valid: 31.60 ± 0.00
Final Train: 14.29 ± 0.00
Final Test: 31.90 ± 0.00

Cora, Citeseer, Pubmed

Running the provided shell scripts on these three datasets does not reproduce the results in the paper. Could this be related to package versions? With the versions suggested in requirements.txt, I get: site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

About Attention Calculation

Thanks for sharing your code!
I am currently encountering some issues.

I tried calling the function full_attention_conv:

x = torch.randn(25, 4, 16)
a = full_attention_conv(x, x, x, kernel='simple', output_attn=True)

It reports an error:
RuntimeError: The size of tensor a (25) must match the size of tensor b (4) at non-singleton dimension 1

DIFFormer trains too slowly

Hi, thank you very much for sharing such excellent work! I have recently been using DIFFormer for spatial-temporal prediction. Since the model can currently only process one training sample at a time, batching is not possible and training is very slow. Is there a good way around this? Any advice would be greatly appreciated.

Batch computation of Difformer

Hi! Thanks for presenting such interesting work and sharing your code! I'm very interested in DIFFormer and would like to conduct spatial-temporal prediction tasks based on it. In practice, however, the input for spatial-temporal datasets may have shape [batch_size, sequence_length, number_of_nodes, feature_dimension], where each mini-batch is split along the sequence dimension rather than the node dimension as in typical GNN problems. Is DIFFormer applicable in this case, and what adjustments to your implementation might be needed?

Looking forward to your reply and I appreciate it greatly!

How to reproduce the results in Fig.2(b), increasing model depth K?

Could you provide the specific parameter settings for reproducing the results in Fig. 2(b) when the model depth K is large? I had problems even at depth 16, i.e., --num_layers 16.

python main.py --dataset cora --method difformer --rand_split_class --lr 0.001 --weight_decay 0.01 --dropout 0.2 --num_layers 16 --hidden_channels 64 --num_heads 1 --kernel simple --use_graph --use_bn --use_residual --alpha 0.5 --runs 1 --epochs 500 --seed 123 --device 0

The output accuracy is 29.40%, achieved at the 8th epoch. I have tried tuning weight_decay and dropout, but nothing helps.

diffusion

Your work is very interesting, combining diffusion with energy and graphs. However, I did not see in the code a graph-generation-style denoising process or a computation of the decreasing energy. Could you point out the code corresponding to the Energy Function and the diffusivity S in the paper?
Many thanks!

NaN when training DIFFormer-a?

Hi, your work is great and I'm very interested in it. However, when trying to transfer it to my own task, I found that training easily produces NaN when kernel='sigmoid'.

I looked at the code of full_attention_conv and found that in the sigmoid branch, the denominator attention_normalizer may contain zero entries, which causes DIFFormer-a to produce NaN when computing attention. (In contrast, in the 'simple' branch, line 38 has attention_normalizer += torch.ones_like(attention_normalizer) * N, which guarantees a non-zero denominator.)

elif kernel == 'sigmoid':
        # numerator
        attention_num = torch.sigmoid(torch.einsum("nhm,lhm->nlh", qs, ks))  # [N, L, H]

        # denominator
        all_ones = torch.ones([ks.shape[0]]).to(ks.device)
        attention_normalizer = torch.einsum("nlh,l->nh", attention_num, all_ones)
        attention_normalizer = attention_normalizer.unsqueeze(1).repeat(1, ks.shape[0], 1)  # [N, L, H]

        # compute attention and attentive aggregated results
        attention = attention_num / attention_normalizer
        attn_output = torch.einsum("nlh,lhd->nhd", attention, vs)  # [N, H, D]

Is there a good way to avoid these NaNs? (I have considered adding a small eps to the denominator, or using torch.nan_to_num(); is there a better approach?)
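For reference, one possible fix along the lines suggested above (a sketch, not an official patch from the maintainers) is to clamp the denominator away from zero before dividing:

    # Sketch: avoid division by zero in the sigmoid-kernel attention.
    eps = 1e-6
    attention = attention_num / attention_normalizer.clamp(min=eps)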

CIFAR/STL-10 training results

Hi, I have a question. After loading the data and running comparisons across different models, the results with GCN and other models basically match the paper, but the results after training with DIFFormer deviate substantially from those reported. The packages were installed according to requirements.txt; where might things have gone wrong? Thanks very much for any help.

Cannot load pokec dataset

Hello. It seems that scipy cannot read the mat file of the pokec dataset. Could you please help me?

Traceback (most recent call last):
File "/home/wangxiyuan/DIFFormer/node classification/main-batch.py", line 43, in
dataset = load_dataset(args.data_dir, args.dataset, args.sub_dataset)
File "/home/wangxiyuan/DIFFormer/node classification/dataset.py", line 109, in load_dataset
dataset = load_pokec_mat(data_dir)
File "/home/wangxiyuan/DIFFormer/node classification/dataset.py", line 312, in load_pokec_mat
fulldata = scipy.io.loadmat(f'{data_dir}pokec.mat')
File "/home/wangxiyuan/miniconda3/lib/python3.10/site-packages/scipy/io/matlab/_mio.py", line 225, in loadmat
MR, _ = mat_reader_factory(f, **kwargs)
File "/home/wangxiyuan/miniconda3/lib/python3.10/site-packages/scipy/io/matlab/_mio.py", line 74, in mat_reader_factory
mjv, mnv = _get_matfile_version(byte_stream)
File "/home/wangxiyuan/miniconda3/lib/python3.10/site-packages/scipy/io/matlab/_miobase.py", line 251, in _get_matfile_version
raise ValueError('Unknown mat file type, version %s, %s' % ret)
ValueError: Unknown mat file type, version 32, 99
