pygod-team / pygod


A Python Library for Graph Outlier Detection (Anomaly Detection)

Home Page: https://pygod.org

License: BSD 2-Clause "Simplified" License

Python 100.00%
outlier-detection anomaly-detection graph-anomaly-detection machine-learning security-tools opensource deeplearning python graphmining pytorch graph-neural-networks fraud-detection toolkit

pygod's Introduction

PyGOD is a Python library for graph outlier detection (anomaly detection). This exciting yet challenging field has many key applications, e.g., detecting suspicious activities in social networks [1] and security systems [2].

PyGOD includes 10+ graph outlier detection algorithms. For consistency and accessibility, PyGOD is developed on top of PyTorch Geometric (PyG) and PyTorch, and follows the API design of PyOD. See examples below for detecting outliers with PyGOD in 5 lines!

PyGOD is featured for:

  • Unified APIs, detailed documentation, and interactive examples across various graph-based algorithms.
  • Comprehensive coverage of 10+ graph outlier detectors.
  • Full support of detections at multiple levels, such as node-, edge-, and graph-level tasks.
  • Scalable design for processing large graphs via mini-batch and sampling.
  • Streamlined data processing with PyG: fully compatible with PyG data objects.

Outlier Detection Using PyGOD with 5 Lines of Code:

# train a dominant detector
from pygod.detector import DOMINANT

model = DOMINANT(num_layers=4, epoch=20)  # hyperparameters can be set here
model.fit(train_data)  # input data is a PyG data object

# get outlier scores on the training data (transductive setting)
score = model.decision_score_

# predict labels and scores on the testing data (inductive setting)
pred, score = model.predict(test_data, return_score=True)

Citing PyGOD:

Our software paper and benchmark paper are publicly available. If you use PyGOD or BOND in a scientific publication, we would appreciate citations to the following papers:

@article{JMLR:v25:23-0963,
  author  = {Kay Liu and Yingtong Dou and Xueying Ding and Xiyang Hu and Ruitong Zhang and Hao Peng and Lichao Sun and Philip S. Yu},
  title   = {{PyGOD}: A {Python} Library for Graph Outlier Detection},
  journal = {Journal of Machine Learning Research},
  year    = {2024},
  volume  = {25},
  number  = {141},
  pages   = {1--9},
  url     = {http://jmlr.org/papers/v25/23-0963.html}
}
@inproceedings{NEURIPS2022_acc1ec4a,
 author = {Liu, Kay and Dou, Yingtong and Zhao, Yue and Ding, Xueying and Hu, Xiyang and Zhang, Ruitong and Ding, Kaize and Chen, Canyu and Peng, Hao and Shu, Kai and Sun, Lichao and Li, Jundong and Chen, George H and Jia, Zhihao and Yu, Philip S},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
 pages = {27021--27035},
 publisher = {Curran Associates, Inc.},
 title = {{BOND}: Benchmarking Unsupervised Outlier Node Detection on Static Attributed Graphs},
 url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/acc1ec4a9c780006c9aafd595104816b-Paper-Datasets_and_Benchmarks.pdf},
 volume = {35},
 year = {2022}
}

or:

Liu, K., Dou, Y., Ding, X., Hu, X., Zhang, R., Peng, H., Sun, L. and Yu, P.S., 2024. PyGOD: A Python library for graph outlier detection. Journal of Machine Learning Research, 25(141), pp.1-9.
Liu, K., Dou, Y., Zhao, Y., Ding, X., Hu, X., Zhang, R., Ding, K., Chen, C., Peng, H., Shu, K., Sun, L., Li, J., Chen, G.H., Jia, Z., and Yu, P.S., 2022. BOND: Benchmarking unsupervised outlier node detection on static attributed graphs. Advances in Neural Information Processing Systems, 35, pp.27021-27035.

Installation

Note on PyG and PyTorch Installation: PyGOD depends on torch and torch_geometric (including its optional dependencies). To streamline the installation, PyGOD does NOT install these libraries for you. Please install them yourself before running PyGOD:

  • torch>=2.0.0
  • torch_geometric>=2.3.0

It is recommended to use pip for installation. Please make sure the latest version is installed, as PyGOD is updated frequently:

pip install pygod            # normal install
pip install --upgrade pygod  # or update if needed

Alternatively, you could clone the repository and install it from source:

git clone https://github.com/pygod-team/pygod.git
cd pygod
pip install .

Required Dependencies:

  • python>=3.8
  • numpy>=1.24.3
  • scikit-learn>=1.2.2
  • scipy>=1.10.1
  • networkx>=3.1

Quick Start for Outlier Detection with PyGOD

"A Blitz Introduction" demonstrates the basic API of PyGOD using the DOMINANT detector. Note that the API is consistent across all other algorithms.


API Cheatsheet & Reference

Full API Reference: (https://docs.pygod.org). API cheatsheet for all detectors:

  • fit(data): Fit the detector with train data.
  • predict(data): Predict on test data (train data if not provided) using the fitted detector.

Key Attributes of a fitted detector:

  • decision_score_: The outlier scores of the input data. Outliers tend to have higher scores.
  • label_: The binary labels of the input data. 0 stands for inliers and 1 for outliers.
  • threshold_: The determined threshold for binary classification. Scores above the threshold are outliers.
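To make the relationship between these three attributes concrete, here is a minimal, library-free sketch. It assumes the threshold is the score below which a fraction (1 - contamination) of the training scores fall, a convention used by PyOD-style detectors; the exact rule PyGOD applies may differ. The helper name, the scores, and the contamination value are made up for illustration.

```python
def threshold_from_contamination(scores, contamination):
    """Pick a threshold so that about `contamination` of the scores lie above it."""
    ranked = sorted(scores)
    n_outliers = int(round(contamination * len(ranked)))
    # index of the largest score that is still treated as an inlier
    index = max(len(ranked) - n_outliers - 1, 0)
    return ranked[index]

# hypothetical outlier scores for ten nodes (higher means more anomalous)
scores = [0.10, 0.20, 0.15, 0.90, 0.12, 0.18, 0.11, 0.95, 0.14, 0.16]
threshold = threshold_from_contamination(scores, contamination=0.2)

# scores strictly above the threshold are labeled 1 (outlier), the rest 0 (inlier)
labels = [1 if s > threshold else 0 for s in scores]
```

With these numbers, two of the ten nodes end up labeled as outliers, matching the assumed contamination of 0.2.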

Input of PyGOD: Please pass in a PyG Data object. See PyG data processing examples.

Implemented Algorithms

Abbr        Year  Backbone    Sampling  Ref
SCAN        2007  Clustering  No        [3]
GAE         2016  GNN+AE      Yes       [4]
Radar       2017  MF          No        [5]
ANOMALOUS   2018  MF          No        [6]
ONE         2019  MF          No        [7]
DOMINANT    2019  GNN+AE      Yes       [8]
DONE        2020  MLP+AE      Yes       [9]
AdONE       2020  MLP+AE      Yes       [9]
AnomalyDAE  2020  GNN+AE      Yes       [10]
GAAN        2020  GAN         Yes       [11]
DMGD        2020  GNN+AE      Yes       [12]
OCGNN       2021  GNN         Yes       [13]
CoLA        2021  GNN+AE+SSL  Yes       [14]
GUIDE       2021  GNN+AE      Yes       [15]
CONAD       2022  GNN+AE+SSL  Yes       [16]
GADNR       2024  GNN+AE      Yes       [17]

How to Contribute

You are welcome to contribute to this exciting project.

See contribution guide for more information.


PyGOD Team

PyGOD is a great team effort by researchers from UIC, IIT, BUAA, ASU, and CMU. Our core team members include:

Kay Liu (UIC), Yingtong Dou (UIC), Yue Zhao (CMU), Xueying Ding (CMU), Xiyang Hu (CMU), Ruitong Zhang (BUAA), Kaize Ding (ASU), Canyu Chen (IIT).

Reach out to us by submitting an issue report or sending an email to [email protected].


Reference

[1]Dou, Y., Liu, Z., Sun, L., Deng, Y., Peng, H. and Yu, P.S., 2020, October. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM).
[2]Cai, L., Chen, Z., Luo, C., Gui, J., Ni, J., Li, D. and Chen, H., 2021, October. Structural temporal graph neural networks for anomaly detection in dynamic graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM).
[3]Xu, X., Yuruk, N., Feng, Z. and Schweiger, T.A., 2007, August. Scan: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[4]Kipf, T.N. and Welling, M., 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
[5]Li, J., Dani, H., Hu, X. and Liu, H., 2017, August. Radar: Residual Analysis for Anomaly Detection in Attributed Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI).
[6]Peng, Z., Luo, M., Li, J., Liu, H. and Zheng, Q., 2018, July. ANOMALOUS: A Joint Modeling Approach for Anomaly Detection on Attributed Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI).
[7]Bandyopadhyay, S., Lokesh, N. and Murty, M.N., 2019, July. Outlier aware network embedding for attributed networks. In Proceedings of the AAAI conference on artificial intelligence (AAAI).
[8]Ding, K., Li, J., Bhanushali, R. and Liu, H., 2019, May. Deep anomaly detection on attributed networks. In Proceedings of the SIAM International Conference on Data Mining (SDM).
[9]Bandyopadhyay, S., Vivek, S.V. and Murty, M.N., 2020, January. Outlier resistant unsupervised deep architectures for attributed network embedding. In Proceedings of the International Conference on Web Search and Data Mining (WSDM).
[10]Fan, H., Zhang, F. and Li, Z., 2020, May. AnomalyDAE: Dual autoencoder for anomaly detection on attributed networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[11]Chen, Z., Liu, B., Wang, M., Dai, P., Lv, J. and Bo, L., 2020, October. Generative adversarial attributed network anomaly detection. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM).
[12]Bandyopadhyay, S., Vishal Vivek, S. and Murty, M.N., 2020. Integrating network embedding and community outlier detection via multiclass graph description. Frontiers in Artificial Intelligence and Applications, (FAIA).
[13]Wang, X., Jin, B., Du, Y., Cui, P., Tan, Y. and Yang, Y., 2021. One-class graph neural networks for anomaly detection in attributed networks. Neural computing and applications.
[14]Liu, Y., Li, Z., Pan, S., Gong, C., Zhou, C. and Karypis, G., 2021. Anomaly detection on attributed networks via contrastive self-supervised learning. IEEE transactions on neural networks and learning systems (TNNLS).
[15]Yuan, X., Zhou, N., Yu, S., Huang, H., Chen, Z. and Xia, F., 2021, December. Higher-order Structure Based Anomaly Detection on Attributed Networks. In 2021 IEEE International Conference on Big Data (Big Data).
[16]Xu, Z., Huang, X., Zhao, Y., Dong, Y., and Li, J., 2022. Contrastive Attributed Network Anomaly Detection with Data Augmentation. In Proceedings of the 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
[17]Roy, A., Shu, J., Li, J., Yang, C., Elshocht, O., Smeets, J. and Li, P., 2024. GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM).

pygod's People

Contributors

aha12345678, ahmed3amerai, canyuchen, cshjin, kaize0409, kayzliu, oldpanda, parthapratimbanik, xiyanghu, xyvivian, yingtongdou, yzhao062, zhiming-xu

pygod's Issues

Inconsistency with BOND paper

I ran main.py multiple times with DOMINANT (from https://github.com/pygod-team/pygod/tree/main/benchmark).
I found that, although the hyperparameter setting is consistent with the BOND paper (https://arxiv.org/pdf/2206.10071.pdf), the results on inj_cora (AUC: 0.7566±0.0332 (0.7751)) and inj_amazon (AUC: 0.7147±0.0006 (0.7152)) are significantly different from those in Table 3 of the BOND paper, which are 82.7±5.6 (84.3) on inj_cora and 81.3±1.0 (82.2) on inj_amazon.
Do you have any advice on how to reproduce the results of the BOND paper?

Degraded performance of ANEMONE and CoLA on weibo dataset

Describe the bug
The weibo dataset was retrieved as provided by the load_data() method in PyGOD. ANEMONE and CoLA are in beta and are called from pygod.models. When running the ANEMONE and CoLA methods on the weibo dataset, the average AUCROC score is less than 0.15 (ANEMONE: 0.0764±0.0273 (0.1391); CoLA: 0.0750±0.0192 (0.1442)).

To Reproduce

from pygod.models import ANEMONE, CoLA
from pygod.metrics import eval_roc_auc
from pygod.utils import load_data

model = ANEMONE()
data = load_data("weibo")
data.y = data.y.bool()
model.fit(data)
outlier_scores = model.decision_function(data)
auc_score = eval_roc_auc(data.y.numpy(), outlier_scores)

Default parameters are used for the two models. The benchmark code from benchmark/main.py was also used with a few modifications; the hyperparameters that were changed are the learning rate and the hidden dimensions.

Expected behavior
I would expect AUCROC scores for ANEMONE and CoLA to be above 0.5, similar to other datasets I ran the benchmark on (books, reddit, enron). It is performing significantly worse on the weibo dataset.

Additional context
Applying the fix mentioned in #43 did not seem to change performance much.

Add tutorials for hyperparameter tuning

Hi, the wide collection of unsupervised algorithms is amazing. But if there aren't sufficient examples on tuning them, other developers may never use it.

I am planning to use these algorithms on publicly available graphs and write tutorials on them.

I have substantial experience in deep learning but not in graph neural networks. I can pull this off with a sufficient amount of help on the underlying algorithms.

`load_data` error in benchmark

Describe the bug

To Reproduce
Steps to reproduce the behavior:

  1. Go to the benchmark directory: cd benchmark/
  2. Run python main.py in the benchmark directory.

Expected behavior
Within the main function, it should load the data with args.dataset. However, the path is incorrect (it is the default data path).
The cached data are not available for a fresh run.

Desktop (please complete the following information):

  • Linux

Additional context

Quick fix:
Replace

data = torch.load('data/' + args.dataset + '.pt')

with

from pygod.utils.utility import load_data
...
data = load_data(args.dataset)

benchmark/main.py torch.mean(auc) throws error

Describe the bug

"Recall: {:.4f}±{:.4f} ({:.4f})".format(torch.mean(auc),

the above code throws the following error:

Traceback (most recent call last):
  File "main.py", line 78, in <module>
    main(args)
  File "main.py", line 49, in main
    "Recall: {:.4f}±{:.4f} ({:.4f})".format(torch.mean(auc),
TypeError: mean(): argument 'input' (position 1) must be Tensor, not list
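The TypeError above says that torch.mean received a Python list rather than a Tensor. One option is to wrap the list with torch.tensor(...) before aggregating. The sketch below takes a dependency-free route instead and aggregates the per-run floats with the standard library. The helper name, the sample values, and the reading of the third number as the maximum over runs are assumptions for illustration, not the benchmark's actual code.

```python
import statistics

def summarize(name, values):
    """Format per-run metric values as 'name: mean±std (max)'."""
    mean = statistics.mean(values)
    # sample standard deviation needs at least two runs
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return "{}: {:.4f}±{:.4f} ({:.4f})".format(name, mean, std, max(values))

recall = [0.28, 0.29, 0.27]  # hypothetical recall values from three runs
line = summarize("Recall", recall)
print(line)
```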

To Reproduce
Steps to reproduce the behavior:
Run python main.py --model dominant --dataset inj_cora from the benchmark directory.

Expected behavior
After running python main.py --model dominant --dataset inj_cora, it should show the following result:

100%|█████████████████████████████████████████████| 20/20 [05:44<00:00, 17.22s/it]
inj_cora DOMINANT AUC: 0.7666±0.0013 (0.7676) AP: 0.1830±0.0015 (0.1842) Recall: 0.2819±0.0032 (0.2899)

Desktop (please complete the following information):

  • OS: Windows 10
  • PyGOD Version 1.0.0
  • GPU: NVIDIA GeForce GTX 1050

dependency of models

Could you specify the dependencies that each of your implemented models uses in this thread?

For instance,

dominant:

  • XXX>=0.3.2

MLPAE bug when set contamination=0.03 during model initialization

File "/hdisk2/pygod_benchmark/pygod/models/mlpae.py", line 137, in fit
    self._process_decision_scores()
  File "/hdisk2/pygod_benchmark/pygod/models/base.py", line 278, in _process_decision_scores
    100 * (1 - self.contamination))
  File "<__array_function__ internals>", line 6, in percentile
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3733, in percentile
    a, q, axis, out, overwrite_input, interpolation, keepdims)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3853, in _quantile_unchecked
    interpolation=interpolation)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3404, in _ureduce
    a = np.asanyarray(a)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/_tensor.py", line 678, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

Enabling different hidden dimension for attribute autoencoder and structure autoencoder

Is your feature request related to a problem? Please describe.
For now, some detectors (e.g., GUIDE) have two separate autoencoders for attribute and structure, but the two autoencoders share the same hidden layer dimension. In many cases, there is a significant difference between the dimension of the node attributes and the dimension of the structure information (e.g., the adjacency matrix). Using the same hidden dimension may hamper the performance of the detectors.

Describe the solution you'd like
Enabling different hidden dimension for attribute autoencoder and structure autoencoder

About node embedding function

Hi, could you please provide a function that returns the trained node embeddings, so that I can feed the embeddings to a machine learning classifier such as an SVM?

Best wishes!

Pygod does not work in a subprocess

Describe the bug
Hi, I am trying to run an example of PyGOD in a subprocess, and it does not work for me.

To Reproduce

from torch.multiprocessing import Process

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

import torch
from pygod.generator import gen_contextual_outliers, gen_structural_outliers
from pygod.utils import load_data
from pygod.models import AnomalyDAE



def f(data):
    model = AnomalyDAE()
    print('started model fitting')
    model.fit(data)
    print('model fit succesful')

if __name__ == '__main__':
    data = Planetoid('./data/Cora', 'Cora', transform=T.NormalizeFeatures())[0]
    data, ya = gen_contextual_outliers(data, n=100, k=50)
    data, ys = gen_structural_outliers(data, m=10, n=10)
    data.y = torch.logical_or(ys, ya).int()

    data = load_data('inj_cora')
    data.y = data.y.bool()
    p = Process(target=f, args=(data,))
    p.start()
    p.join()

Expected behavior

The model does not fit for me

Desktop (please complete the following information):

  • OS: all os and systems
  • python: 3.8

adone get unexpected keyword argument

Running examples\adone.py for replication

C:\Users\yuezh\Anaconda3\envs\torch19\python.exe C:/Users/yuezh/PycharmProjects/pygod/examples/adone.py
training...
Traceback (most recent call last):
  File "C:/Users/yuezh/PycharmProjects/pygod/examples/adone.py", line 35, in <module>
    model.fit(data)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\adone.py", line 158, in fit
    act=self.act).to(self.device)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\adone.py", line 331, in __init__
    act=act)
TypeError: __init__() got an unexpected keyword argument 'in_channels'

Same parameters but results vary

Describe the bug
Excellent work! I am confused about why the resulting AUC can still vary so much when the same parameters are used for training.

To Reproduce
For example, my AUC in one training run with DOMINANT is 0.83, but in the next run it becomes 0.90. Why is the gap so big?
I hope to get your guidance.
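The issue does not state a cause. One common source of run-to-run variation in neural detectors is unseeded randomness, such as weight initialization and sampling. The sketch below is a hypothetical helper, not part of PyGOD, that seeds the generators it can find. It does not guarantee identical results, since some GPU operations are nondeterministic.

```python
import importlib
import random

def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    for name in ("numpy", "torch"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # library not installed, nothing to seed
        if name == "numpy":
            module.random.seed(seed)
        else:
            module.manual_seed(seed)  # seeds PyTorch's default generators

# seeding twice with the same value reproduces the same random draw
set_seed(0)
first = random.random()
set_seed(0)
second = random.random()
```

Calling set_seed before constructing and fitting a detector is one way to check whether the gap between runs comes from random initialization.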

batch operation?

I think for now everything is handled as a full graph. Do we need to add funcs for batch operations or samplers?

GUIDE Bug on Cora dataset

File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 158, in fit
    x_, s_ = self.model(x, s, edge_index)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 369, in forward
    s_ = self.struct_ae(s, edge_index)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 394, in forward
    s = layer(s, edge_index)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 411, in forward
    out = self.propagate(edge_index, s=self.w2(s))
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: expected scalar type Float but found Long

Query on Anomaly Prediction and Outlier labels

Hi,

Given a graph object passed to the prediction API, what do the outlier labels, described here as outlier_labels (numpy array of shape (n_samples,)), indicate from a graph perspective?

Do the 1 and 0 entries in the numpy array indicate which nodes in the graph are anomalous or normal? For example, given the labels
[0 0 0 ... 0 0 0], does each 0 value pertain to a node in the graph?

So how should this prediction output be interpreted from a graph perspective? Thanks in advance.
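As a small illustration of one possible reading, the sketch below assumes a node-level detector whose label array has one entry per node, in the same order as the node features, with 0 for inliers and 1 for outliers as in the API cheatsheet. The label values are made up.

```python
# hypothetical prediction output for a graph with six nodes
labels = [0, 0, 1, 0, 1, 0]

# position i in the array corresponds to node i in the graph
outlier_nodes = [i for i, label in enumerate(labels) if label == 1]
inlier_nodes = [i for i, label in enumerate(labels) if label == 0]

print("outlier node indices:", outlier_nodes)
```

Under that assumption, the indices in outlier_nodes can be used to look up the flagged nodes' features or edges in the original graph.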

Connection Error when calling pygod.utils.load_data()

Describe the bug
When calling pygod.utils.load_data(), sometimes it returns the following error message:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Additional context
Please refer to 1 and 2 for potential fixing approaches.

OCGNN code issue

  • The model will run 3* default_epochs during training.
  • In fit() function, the epoch and loss value print should move into the verbose condition.
  • The current performance looks weird on wiki and Cora datasets (see the shared document).
  • Unused code at lines 236, 260-263.
  • Correct the torch.Tensor type hint in docstrings.

ONE does not accept negative value

ONE appears to throw an error if the input x contains negative values. If this is expected, we should probably mention it somewhere.

check pygod/test/test_one.py

remove external (non-core-python) library `argparse` as a dependency

Describe the bug

The current dependencies include installing an external library: argparse.

Note that argparse is part of the core Python libraries. There is no need to install it like other external libraries.

🔥 EDIT: The library (argparse) that you are installing from PyPI is no longer maintained as it is now a part of standard python3. See my comment here.

See further details:

argparse>=1.4.0
numpy>=1.19.4
scikit-learn>=0.22.1
scipy>=1.5.2
setuptools>=50.3.1.post20201107
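To illustrate the point, the sketch below parses benchmark-style flags using only the argparse module that ships with Python 3, with no PyPI package involved. The --model and --dataset flags mirror the benchmark command quoted in another issue; the parser description and defaults are hypothetical.

```python
import argparse  # standard library module, no pip install required

def build_parser():
    """Build a hypothetical parser with benchmark-style flags."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark CLI")
    parser.add_argument("--model", default="dominant")
    parser.add_argument("--dataset", default="inj_cora")
    return parser

# parse an explicit argument list instead of sys.argv so the sketch is self-contained
args = build_parser().parse_args(["--model", "dominant", "--dataset", "inj_cora"])
print(args.model, args.dataset)
```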

Dominant model data loading and training problems

@kayzliu
When I write the Dominant example, I find the following issues. Please fix/answer them accordingly.

  1. The current process_graph function is dedicated to the BlogCatalog dataset, we need to write a general dataloader that could handle any PyG data object. The preprocessing code for BlogCatalog can be put into the dominant.py under /example.
  2. When I run model.fit(), train_loss became NaN after 5-6 epochs.
  3. How is the outlier label of BlogCatalog generated?
  4. Should we train the model on clean data and evaluate it on data with outliers?

BOND data possible inconsistency

Describe the bug
In the BOND paper, it is said that all the datasets are undirected, except Weibo.

Note that Weibo is a directed graph; the remaining datasets used in our benchmark are undirected graphs.

However, the load_data function returns directed PyG graphs (only "reddit" is undirected for some reason). Here is the output of the is_undirected method:

inj_cora False
inj_amazon False
inj_flickr False
weibo False
reddit True
disney False
books False
enron False

To Reproduce
Here is a colab notebook to reproduce the output above
https://colab.research.google.com/drive/1mNXh66Ac2hUduHvzKtGifC7_huBgCf-5?usp=sharing

Expected behavior
I expected the data to be consistent with what is stated in the paper. Please let me know if I misunderstood something or it's indeed a mistake. Thanks!
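For readers unfamiliar with the check reported above, the following library-free sketch shows what "undirected" means for an edge list: every edge (u, v) must also appear as (v, u). PyG offers Data.is_undirected() and torch_geometric.utils.to_undirected for real graphs. The helpers below are simplified stand-ins, and whether the benchmark datasets are meant to be symmetrized is not something this thread confirms.

```python
def is_undirected(edges):
    """Return True if every edge (u, v) also appears as (v, u)."""
    edge_set = set(edges)
    return all((v, u) in edge_set for (u, v) in edge_set)

def to_undirected(edges):
    """Add the reverse of every edge and drop duplicates."""
    return sorted(set(edges) | {(v, u) for (u, v) in edges})

directed = [(0, 1), (1, 2)]          # toy directed edge list
symmetric = to_undirected(directed)  # adds (1, 0) and (2, 1)

print(is_undirected(directed), is_undirected(symmetric))
```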

Is there any method to track metrics like loss?

Is it possible to build a metric tracker into the logger function?
I am looking for visualization tools to track the loss or scores during each epoch, and I found the embedded logger function, which can print this information. Do you think it is possible to add metric tracking inside fit()/logger()?

BR

Unclear if pygod is for supervised outlier detection only

It is unclear from the Readme or from the documentation whether one can perform outlier detection without having any labels. The Blitz Intro in the docs makes it clear that it works for supervised learning but how about out-of-the-box unsupervised outlier/anomaly detection?

Problem for CoLA and ANEMONE models.

The code for masking the target node is wrong. The target node is the first node in the subgraph after the random-walk sampling, but the code masks the last node. Fixing the bug improves the performance of CoLA and ANEMONE by 2%.

Wrong code in CoLA (lines 361-364):
batch_feature = torch.cat(
    (batch_feature[:, :-1, :],
     added_feat_zero_row,
     batch_feature[:, -1:, :]), dim=1)

Correct code:
batch_feature = torch.cat(
    (added_feat_zero_row,
     batch_feature[:, 1:, :],
     batch_feature[:, 0:1, :]), dim=1)

Wrong code in ANEMONE (lines 288-289 and 429-430):
bf = torch.cat(
    (bf[:, :-1, :], added_feat_zero_row, bf[:, -1:, :]), dim=1)

Correct code:
bf = torch.cat(
    (added_feat_zero_row, bf[:, 1:, :], bf[:, 0:1, :]), dim=1)
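The difference between the two concatenations is easier to see on a toy example. The sketch below mimics the slicing along dim=1 with plain Python lists, using strings in place of feature rows. It only illustrates the index order described in this issue and is not the library's code.

```python
def mask_wrong(nodes, zero):
    # mirrors cat(x[:, :-1], zero, x[:, -1:]): the zero row lands before the LAST node
    return nodes[:-1] + [zero] + nodes[-1:]

def mask_fixed(nodes, zero):
    # mirrors cat(zero, x[:, 1:], x[:, 0:1]): the zero row takes the FIRST slot,
    # and the first (target) node is moved to the end
    return [zero] + nodes[1:] + nodes[0:1]

subgraph = ["target", "n1", "n2", "n3"]  # target node first, as the issue states
wrong = mask_wrong(subgraph, "ZERO")
fixed = mask_fixed(subgraph, "ZERO")

print(wrong)
print(fixed)
```

In the first variant the target keeps its original position and stays unmasked; in the second, the zero row takes the target's place and the target is appended at the end.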

Hard to run benchmark scripts directly

Is your feature request related to a problem? Please describe.
I tried to run the benchmark scripts locally after installing the repo but failed.

Here's how I setup the environment.

First, I ran

  1. pip install -r requirements.txt
  2. python setup.py install

to install the repo and the dependencies. However, the following errors were raised when I tried to run python main.py under pygod/benchmark/:

  • ModuleNotFoundError: No module named 'tqdm'
  • ModuleNotFoundError: No module named 'torch_geometric'
  • ModuleNotFoundError: No module named 'pyod'
  • ImportError: 'NeighborSampler' requires either 'pyg-lib' or 'torch-sparse'
  • AttributeError: 'DOMINANT' object has no attribute 'decision_scores_'. Did you mean: 'decision_score_'?

where the last one can be fixed by #80, but I still have to install the missing modules with commands

pip install tqdm
pip install torch
pip install torch-geometric
pip install pyod
pip install torch-sparse
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cpu.html

then I can run the benchmark script.

Describe the solution you'd like
It would be better to have a dedicated requirements.txt file inside folder pygod/benchmark/ containing all the required dependencies.

Describe alternatives you've considered
N/A

Additional context
N/A

Out of memory

Describe the bug
Hi, except with the GCNAE model, I keep running into out-of-memory issues with the other models, even when setting the batch size to a very low value. The attempted allocation is always around 600 GB for a batch with around 400k nodes.

RuntimeError                              Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_17284\2902826942.py in <module>
      5 
      6 model = AnomalyDAE(gpu=0, batch_size=8, verbose=True, contamination=0.05)
----> 7 model.fit(batch)

~\anaconda3\lib\site-packages\pygod\models\anomalydae.py in fit(self, G, y_true)
    143         """
    144         G.node_idx = torch.arange(G.x.shape[0])
--> 145         G.s = to_dense_adj(G.edge_index)[0]
    146 
    147         # automated balancing by std

~\anaconda3\lib\site-packages\torch_geometric\utils\to_dense_adj.py in to_dense_adj(edge_index, batch, edge_attr, max_num_nodes)
     46     size = [batch_size, max_num_nodes, max_num_nodes]
     47     size += list(edge_attr.size())[1:]
---> 48     adj = torch.zeros(size, dtype=edge_attr.dtype, device=edge_index.device)
     49 
     50     flattened_size = batch_size * max_num_nodes * max_num_nodes

RuntimeError: CUDA out of memory. Tried to allocate 597.53 GiB (GPU 0; 16.00 GiB total capacity; 1.32 GiB already allocated; 12.78 GiB free; 1.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
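The traceback shows to_dense_adj allocating a tensor of size [batch_size, max_num_nodes, max_num_nodes], so the memory grows quadratically with the number of nodes passed to fit, regardless of the detector's batch_size argument. A back-of-the-envelope estimate, assuming 4 bytes per entry (float32, which is an assumption about the dtype), lands close to the reported figure:

```python
def dense_adj_gib(num_nodes, bytes_per_entry=4):
    """Memory in GiB for a dense num_nodes x num_nodes matrix."""
    return num_nodes * num_nodes * bytes_per_entry / 2**30

estimate = dense_adj_gib(400_000)  # roughly 596 GiB for 400k nodes
print("{:.2f} GiB".format(estimate))
```

This suggests that the allocation comes from the dense adjacency matrix itself rather than from the mini-batch size.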

A problem about structural reconstruction

While reading the papers and the PyGOD code, I found a problem in how some algorithms reconstruct structural information:

$$ \hat{A}=\sigma(\pmb z\pmb z^T) $$

where z is the learned graph embedding, $\sigma$ is the sigmoid function, and $\hat{A}$ is the reconstructed adjacency matrix. One term of the objective function is

$$ \Vert A-\hat{A}\Vert_F^2 $$

where $A$ is the adjacency matrix. But the diagonal elements of $\hat{A}$ are close to 1 because

$$ \hat{A}_{ii}=\sigma(z_iz_i^T) $$

So I think we should add self-loops to $A$ when reconstructing:

$$ \Vert(A+I)-\hat{A}\Vert_F^2 $$

In the PyGOD code, I haven't found this consideration. I modified the code of DOMINANT in this way and found a performance improvement on some datasets.
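The argument about the diagonal can be checked numerically. Since $z_iz_i^T$ is a squared norm and therefore non-negative, $\sigma(z_iz_i^T)$ is at least 0.5 and approaches 1 as the norm grows, while $A_{ii}=0$ when $A$ has no self-loops. The sketch below uses made-up embeddings and plain Python; it is not taken from the DOMINANT implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reconstructed_diagonal(embeddings):
    """Return sigma(z_i . z_i) for each embedding row z_i."""
    return [sigmoid(sum(v * v for v in z)) for z in embeddings]

z = [[0.0, 0.0], [1.0, -1.0], [3.0, 2.0]]  # hypothetical node embeddings
diag = reconstructed_diagonal(z)

# every diagonal entry is at least 0.5, so a target of A_ii = 0 is never reachable
print(["{:.4f}".format(d) for d in diag])
```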

LOF model

The benchmark paper compared the LOF method. Do you support this method and is it compatible with the pygod framework?

Data shape issue in AnomalyDAE

replicate by running examples/anomalydae.py

Please make sure the example could run :)

predicting for probability
Traceback (most recent call last):
File "C:/Users/yuezh/PycharmProjects/pygod/examples/anomalydae.py", line 39, in <module>
prob = model.predict_proba(data)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\base.py", line 176, in predict_proba
test_scores = self.decision_function(G)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 360, in decision_function
A_hat, X_hat = self.model(attrs, adj)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 169, in forward
A_hat, embed_x = self.structure_AE(x, edge_index)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 70, in forward
embed_x = self.attention_layer(x, edge_index)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch_geometric\nn\conv\gat_conv.py", line 230, in forward
num_nodes=num_nodes)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch_geometric\utils\loop.py", line 144, in add_self_loops
edge_index = torch.cat([edge_index, loop_index], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 2 and 2708 (The offending index is 0)
