Coder Social home page Coder Social logo

graphsl's Introduction

Documentation Status MIT license PyPI version Downloads DOI DOI

GraphSL: Graph Source Localization Library

This is the source code of the GraphSL library to support the research of the graph source localization problem.

Introduction

Problem Definition

image

Graph diffusion is a fundamental task in graph learning, which aims to predict future information diffusions given information sources. Its inverse problem is graph source localization, which is an extremely important topic even though rarely explored: it focuses on the detection of information sources given their future information diffusions. As illustrated in the above figure, graph diffusion seeks to predict the information diffusion ${b,c,d,e}$ from a source node $b$, whereas graph source localization aims to identify the source node $b$ from the information diffusion ${b,c,d,e}$. Graph source localization spans a broad spectrum of promising research and real-world applications such as rumor detection, tracking of sources for computer viruses, and failure detection in smart grids. Hence, the graph source localization problem demands attention and extensive investigations from machine learning researchers.

Due to its importance, some open-source tools have been developed to support research on the graph source localization problem. Two recent examples are cosasi and RPaSDT. However, they do not support various simulations of information diffusion, and they also miss real-world benchmark datasets and state-of-the-art source localization approaches. To fill this gap, we propose a new library GraphSL: the first one to include real-world benchmark datasets and recent source localization methods to our knowledge, enabling researchers and practitioners to evaluate novel techniques against appropriate baselines easily. These methods do not require prior assumptions about the source (e.g. single source or multiple sources) and can handle graph source localization based on various diffusion simulation models such as Independent Cascade (IC) and Linear Threshold (LT). Our GraphSL library is standardized: for instance, tests of all source inference methods return a Metric object, which provides five performance metrics (accuracy, precision, recall, F-score, and area under ROC curve) for performance evaluation.

Our GraphSL library targets both developers and practical users: they are free to add algorithms and datasets for personal needs by following the guidelines in the "Contact" section of README.md.

Approaches

image

Existing methods can be categorized into two groups: Prescribed methods and Graph Neural Networks (GNN)-based methods.

Prescribed methods rely on hand-crafted rules and heuristics. For instance, LPSI assumes that nodes surrounded by larger proportions of infected nodes are more likely to be source nodes. NetSleuth employs the Minimum Description Length principle to identify the optimal set of source nodes and virus propagation ripple. OJC identifies a set of nodes (Jordan cover) that cover all observed infected nodes with the minimum radius.

GNN-based methods learn rules from graph data in an end-to-end manner by capturing graph topology and neighboring information. For example, GCNSI utilizes LPSI to enhance input and then applies Graph Convolutional Networks (GCN) for source identification; IVGD introduces a graph residual scenario to make existing graph diffusion models invertible, and it devises a new set of validity-aware layers to project inferred sources to feasible regions. SLVAE uses forward diffusion estimation and deep generative models to approximate source distribution, leveraging prior knowledge for generalization under arbitrary diffusion patterns.

Benchmark Datasets

Dataset #Node #Edge
Karate 34 78
Dolphins 62 159
Jazz 198 2,742
Network Science 1,589 2,742
Cora-ML 2,810 7,981
Power Grid 4,941 6,594

Aside from methods, we also provide six benchmark datasets to facilitate the research of the graph source localization problem. All datasets are introduced as follows:

  1. Karate: Karate depicts the social ties among members of a university karate club.

  2. Dolphins: Dolphins represents a social network of bottlenose dolphins, with edges indicating frequent associations between dolphins.

  3. Jazz: Jazz illustrates a collaboration network among Jazz musicians, where edges signify instances of playing together in a band.

  4. Network Science: Network Science portrays a coauthorship network of scientists engaged in network theory and experimentation, with each edge representing co-authorship of a paper.

  5. Cora-ML: Cora-ML is a portal network of computer science research papers obtained through machine learning techniques.

  6. Power Grid: Power Grid delineates the topology network of the Western States Power Grid in the United States.

Installation

Install GraphSL using pip:

pip install GraphSL

Or, clone the repo and install requirements:

pip install -r requirements.txt

Quickstart

Now, you can import and use GraphSL in your Python code.

from GraphSL.GNN.SLVAE.main import SLVAE
from GraphSL.GNN.IVGD.main import IVGD
from GraphSL.GNN.GCNSI.main import GCNSI
from GraphSL.Prescribed import LPSI, NetSleuth, OJC
from GraphSL.utils import load_dataset, diffusion_generation, split_dataset,download_dataset,visualize_source_prediction
import os
curr_dir = os.getcwd()
# download datasets
download_dataset(curr_dir)
# load datasets ('karate', 'dolphins', 'jazz', 'netscience', 'cora_ml', 'power_grid')
data_name = 'karate'
graph = load_dataset(data_name, data_dir=curr_dir)
# generate diffusion
dataset = diffusion_generation(graph=graph, infect_prob=0.3, diff_type='IC', sim_num=100, seed_ratio=0.2)
# split into training and test sets
adj, train_dataset, test_dataset = split_dataset(dataset)

# LPSI
print("LPSI:")
lpsi = LPSI()

# train LPSI
alpha, thres, auc, f1, pred = lpsi.train(adj, train_dataset)
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

# test LPSI
metric = lpsi.test(adj, test_dataset, alpha, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

# NetSleuth
print("NetSleuth:")
netSleuth = NetSleuth()

# train NetSleuth
k, auc, f1 = netSleuth.train(adj, train_dataset)
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

# test NetSleuth
metric = netSleuth.test(adj, test_dataset, k)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

# OJC
print("OJC:")
ojc = OJC()

# train OJC
Y, auc, f1 = ojc.train(adj, train_dataset)
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

# test OJC
metric = ojc.test(adj, test_dataset, Y)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

# GCNSI
print("GCNSI:")
gcnsi = GCNSI()

# train GCNSI
gcnsi_model, thres, auc, f1, pred = gcnsi.train(adj, train_dataset)
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

# visualize training predictions
pred = (pred >= thres)
visualize_source_prediction(adj,pred[:,0],train_dataset[0][:,0].numpy(),save_dir=curr_dir,save_name="GCNSI_source_prediction")


# test GCNSI
metric = gcnsi.test(adj, test_dataset, gcnsi_model, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

# IVGD
print("IVGD:")
ivgd = IVGD()

# train IVGD diffusion
diffusion_model = ivgd.train_diffusion(adj, train_dataset)

# train IVGD
ivgd_model, thres, auc, f1, pred = ivgd.train(
    adj, train_dataset, diffusion_model)
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

# visualize training predictions
pred = (pred >= thres)
visualize_source_prediction(adj,pred[:,0],train_dataset[0][:,0].numpy(),save_dir=curr_dir,save_name="IVGD_source_prediction")

# test IVGD
metric = ivgd.test(adj, test_dataset, diffusion_model, ivgd_model, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

# SLVAE
print("SLVAE:")
slave = SLVAE()

# train SLVAE
slvae_model, seed_vae_train, thres, auc, f1, pred = slave.train(
    adj, train_dataset)
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

# visualize training predictions
pred = (pred >= thres)
visualize_source_prediction(adj,pred[:,0],train_dataset[0][:,0].numpy(),save_dir=curr_dir,save_name="SLVAE_source_prediction")

# test SLVAE
metric = slave.infer(test_dataset, slvae_model, seed_vae_train, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

We also provide a tutorial to help you get started and check the expected results.

Documentation

Official documentation, including a detailed API reference, is available on Read the Docs.

Citation

If you use this package in your research, please consider citing our work as follows:

@article{Wang2024,
doi = {10.21105/joss.06796},
url = {https://doi.org/10.21105/joss.06796},
year = {2024}, publisher = {The Open Journal},
volume = {9},
number = {99},
pages = {6796},
author = {Junxiang Wang and Liang Zhao},
title = {GraphSL: An Open-Source Library for Graph Source Localization Approaches and Benchmark Datasets},
journal = {Journal of Open Source Software} }

Contact

We welcome your contributions! If you’d like to contribute your datasets or algorithms, please submit a pull request consisting of an atomic commit and a brief message describing your contribution.

For a new dataset, please upload it to the data folder. The file should be a dictionary object saved by pickle. It contains a key "adj_mat" with the value of a graph adjacency matrix (sprase numpy array with the CSR format).

For a new algorithm, please determine whether it belongs to prescribed methods or GNN-based methods: if it belongs to the prescribed methods, add your algorithm as a new class in the GraphSL/Prescribed.py. Otherwise, please upload it as a folder under the GraphSL/GNN folder. Typically, the algorithm should include a "train" function and a "test" function, and the "test" function should return a Metric object.

Feel free to Email me ([email protected]) if you have any questions. Bug reports and feedback can be directed to the Github issues page.

Version Log

Version 0.11 removes the memetracker and the digg datasets, improves the IVGD method, and creates random seeds for reproducibility.

Version 0.12 adds the datasets downloader.

Version 0.13 adds the visualization of source predictions.

Version 0.14 uses the num_thres (i.e. number of thresholds to try) instead of specifying the thres_list (i.e. threshold list) for LPSI, GCNSI, IVGD and SLVAE. Moreover, GCNSI, IVGD and SLVAE are improved to run on CUDA if applicable.

Version 0.15 makes all methods run on CUDA if applicable, replaces the diffusion model of IVGD and the encoder of SLVAE, and revises the generation of diffusion.

graphsl's People

Contributors

xianggebenben avatar dependabot[bot] avatar mbeyss avatar

Stargazers

 avatar  avatar  avatar Luca avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

mbeyss

graphsl's Issues

关于数据集的问题

你好,很高兴看到您的成果。在GraphSL的方法中,您提供了8个基准数据集,请问这些数据集是怎么生成的,能否给出相应的方法?

[JOSS Review] Example Usage

related to this JOSS Review

While you have a very nice example that shows how to use all of your algorithms it could really need some comments and structure. I would suggest turning this into a jupyter notebook, where you can better explain and show what you are doing.

[Joss Review] Community guidelines

related to this JOSS Review

Please expand the Contact section of the README fot include

guidelines for third parties wishing to

  1. Contribute to the software
  2. Report issues or problems with the software
  3. Seek support

3rd one seems covered and the other two are no brainers (pull request and issues in this repo). Just state it explicitly please.

format according to pep8

The code contains many violation of pep8, which make the code harder to read. Formatting should ideally be done automatically (e.g. black, flake8, autopep8), and enforced on commits (pre-commit).

related to this JOSS review

[JOSS Review] Comments on paper.md

related to this JOSS Review

From the JOSS checklist

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?

The summary is hard to understand for a reader, who does not know the graph diffusion. I suggest do make the purpose a bit clearer, e.g. by putting the explanation what graph diffusion models are useful for ("simulating information spread") earlier and elaborating this more.

  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?

Here the basic problem (graph source localization) is nicely introduced in a way that also non experts can understand . Figure 1 is rather useful for that.
The relation to others work becomes fairly clear, as you use other peoples software in a sort of unified framework.
For me, it is however not clear who the target audience is, is it developers for other algorithms for graph source localization, that want to compare their solution to others, or is it users of such algorithms, that want to find out the most suitable algorithm for their work? What I am missing for both cases is some steps how to do that, i.e. how to incorporate a new algorithm, or a new dataset. With that also the "problems the software is designed to solve" is a bit unclear, technically it solves the graph source localization problem, but it remains unclear what the exact use of GraphSL is, that other softwares in this field do not cover.

  • State of the field: Do the authors describe how this software compares to other commonly-used packages?

As it includes other packages, this is straight forward. However, as explained earlier, the usefulness of this (and the datasets) does not become clear

  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?

The writing is generally of good quality. A few minor remarks:

  • line 27: "up-to-date state-of-the-art" seems redundant

  • in Methods and Benchmark Datasets the past tense is used, when describing other approaches, here present tense should be used.

  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

This looks mostly fine. I would suggest to add reference to a recent review paper or so explaining graph source localization (techniques), to help the reader to dive deeper in to the field.
I feel like there are missing references for the data from table.

More general

  • line 33: "Independent Cascade" and "Linear Threshold" are not explained. references might also help
  • lines 33-36: the formulation here is not really clear. I would suggest to elaborate this further, This could go well with my earlier points
  • Figure 2:
    • the image has some sort of gradient to the left, background should either be white or transparent
    • the image description does not feel very useful. Suggestion: "The hierarchical structure of the GraphSL library. In total 6 algorithms are implemented, which can be devided into two categories: Prescribed Methods that rely on ... and GNN-based Methods which ..." From the image alone, along with the description the reader should be able to extract the most important information
  • Table 1:
    • Caption has to be improved
    • The relevant information is lost in to much detail (two many numbers displayed, average Degree is redundant)
      • right align numbers and reduce number of significant digits (two, maybe three)
      • I would suggest to only have Nodes and Average Degree in there (average Degree needs to be fixed number of digits, as is, I would loose all digits after the decimal separator
  • Line 55: There needs to be some sort of explanation of the datasets, what a seed deiffusion pair is and how to generate them.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.