tag-benchmark's Introduction

CS-TAG

CS-TAG is a project for sharing public text-attributed graph (TAG) datasets and benchmarking the performance of different baseline methods. We welcome contributions of further datasets that are valuable for TAG research.

Datasets 🔔

We collect and construct 8 TAG datasets from ogbn-arxiv, amazon, dblp and goodreads. You can find the uploaded datasets under 'Files and versions' in CSTAG. Each dataset folder contains a csv file (which stores the text attributes of the dataset), a pt file (which holds the DGL graph), and a Feature folder (which stores the text embeddings we extracted with the PLM). You can use the initial node features we created, or extract node features yourself with our code. For a more detailed walkthrough of the process, please click here. 😎
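As a rough sketch of the folder layout described above, the snippet below assembles the paths one would expect for a single dataset and reports which are missing. The root `data/CSTAG`, the graph file `<name>.pt`, and the feature file name follow the pattern used by the example commands in this README; the CSV name `<name>.csv` and the helpers `expected_files` and `missing_files` are assumptions made for illustration, so check them against the actual download.

```python
from pathlib import Path


def expected_files(root, name, feature_tag="roberta_base_512_cls"):
    """Return the files a CS-TAG dataset folder is expected to contain.

    The CSV file name is an assumption; the .pt and .npy names follow the
    paths used in the example commands of this README.
    """
    folder = Path(root) / name
    return {
        # Text attributes of the dataset (assumed file name).
        "text_csv": folder / f"{name}.csv",
        # DGL graph file.
        "graph_pt": folder / f"{name}.pt",
        # Text embeddings extracted from the PLM.
        "features": folder / "Feature" / f"{name}_{feature_tag}.npy",
    }


def missing_files(root, name):
    """List the expected files that are not present on disk."""
    return [str(p) for p in expected_files(root, name).values() if not p.exists()]


if __name__ == "__main__":
    for key, path in expected_files("data/CSTAG", "Photo").items():
        print(f"{key}: {path}")
    print("missing:", missing_files("data/CSTAG", "Photo"))
```

Running this before training gives a quick check that the download step is complete for the dataset you plan to use.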

Environments

You can quickly install the corresponding dependencies with conda:

conda env create -f environment.yml

Pipeline 🎮

Below we describe how to use our repository to run the experiments reported in the paper. We are also adjusting the style of the repository to make it easier to use. (Please complete the 'Datasets and Feature' part above first.)

1. GNN for Node Classification/Link Prediction

You can use 'ogbn-arxiv', 'Children', 'History', 'Fitness', 'Photo', 'Computers', 'webkb-cornell', 'webkb-texas', 'webkb-washington' and 'webkb-wisconsin' for the '--data_name'.

python GNN/GNN.py --data_name=Photo --dropout=0.2 --lr=0.005 --model_name=SAGE --n-epochs=1000 --n-hidden=256 --n-layers=3 --n-runs=5 --use_PLM=data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy
python GNN/GNN_Link.py --use_PLM=data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy --path=data/CSTAG/Photo/LinkPrediction/ --graph_path=data/CSTAG/Photo/Photo.pt --gnn_model=GCN
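To run the node classification command over several datasets, one option is to generate the command lines programmatically. The sketch below only builds and prints the commands; it does not launch them. The flags and values are copied from the example above. The helper `gnn_command` is hypothetical, and the assumption that every dataset ships a feature file named `<name>_roberta_base_512_cls.npy` is not confirmed by this README.

```python
import shlex

# Dataset names listed above as valid values for --data_name.
DATA_NAMES = [
    "ogbn-arxiv", "Children", "History", "Fitness", "Photo", "Computers",
    "webkb-cornell", "webkb-texas", "webkb-washington", "webkb-wisconsin",
]


def gnn_command(data_name, model_name="SAGE", feature_tag="roberta_base_512_cls"):
    """Build the GNN/GNN.py command for one dataset as an argument list."""
    if data_name not in DATA_NAMES:
        raise ValueError(f"unknown data_name: {data_name}")
    # Assumed naming pattern, taken from the Photo example.
    feature = f"data/CSTAG/{data_name}/Feature/{data_name}_{feature_tag}.npy"
    return [
        "python", "GNN/GNN.py",
        f"--data_name={data_name}",
        "--dropout=0.2",
        "--lr=0.005",
        f"--model_name={model_name}",
        "--n-epochs=1000",
        "--n-hidden=256",
        "--n-layers=3",
        "--n-runs=5",
        f"--use_PLM={feature}",
    ]


if __name__ == "__main__":
    for name in ["Photo", "Computers", "History"]:
        print(shlex.join(gnn_command(name)))
```

Each list can be passed to `subprocess.run` once the paths have been verified; the hyperparameters shown are those of the Photo example and may need tuning for other datasets.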

2. PLM for Classification Tasks

CUDA_VISIBLE_DEVICES=0,1 /usr/bin/env python sweep/dist_runner.py LMs/trainLM.py --att_dropout=0.1 --cla_dropout=0.1 --dataset=Computers_RS --dropout=0.1 --epochs=4 --eq_batch_size=180 --eval_patience=20000 --grad_steps=1 --label_smoothing_factor=0.1 --lr=4e-05 --model=Deberta --per_device_bsz=60 --per_eval_bsz=1000 --train_ratio=0.2 --val_ratio=0.1 --warmup_epochs=1 --gpus=0,1 --wandb_name OFF --wandb_id OFF 

3. TMLM for PreTraining

To be updated and debugged.

4. TDK for PreTraining

To be updated and debugged.

5. TCL for PreTraining

CUDA_VISIBLE_DEVICES=0,1 /usr/bin/env python sweep/dist_runner.py LMs/Train_Command/train_CL.py --PrtMode=TCL --att_dropout=0.1 --cla_dropout=0.1 --dataset=Photo_RS --dropout=0.1 --epochs=5 --eq_batch_size=60 --per_device_bsz=15 --grad_steps=2 --lr=5e-05 --model=Bert --warmup_epochs=1 --gpus=0,1 --cache_dir=exp/TCL/Photo/Bert_base/

6. TMDC for Training

To be updated and debugged.

Create Your Model

If you want to add your own model to this code base, you can follow the steps below:

Add your GNN model:

  1. In GNN/model/GNN_library, define your model (you can refer to the code for models such as GCN and GAT).
  2. In the args_init() function in GNN/model/GNN_arg.py, check that it contains all the parameters your model needs. If any are missing, add them to this function.
  3. Import the model you defined in GNN/GNN.py and add it to the gen_model() function. You can then run the corresponding code to perform the node classification task.
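The three steps above can be sketched as a small name-to-constructor dispatch. This is an illustration of the pattern only: the real models are DGL/PyTorch modules, and the actual signatures of args_init() and gen_model() in this repository are not shown in this README, so the signatures, the `MODELS` table, and the `MyGNN` class below are all assumptions.

```python
import argparse


class GCN:
    """Stand-in for an existing model (the real ones are DGL/PyTorch modules)."""

    def __init__(self, n_hidden, n_layers, dropout):
        self.n_hidden, self.n_layers, self.dropout = n_hidden, n_layers, dropout


class MyGNN(GCN):
    """Step 1: a hypothetical new model, defined next to the existing ones."""


def args_init(argv=None):
    """Step 2: make sure every parameter the model needs has a flag."""
    p = argparse.ArgumentParser()
    p.add_argument("--model_name", default="GCN")
    p.add_argument("--n-hidden", type=int, default=256)
    p.add_argument("--n-layers", type=int, default=3)
    p.add_argument("--dropout", type=float, default=0.2)
    return p.parse_args(argv)


# Hypothetical lookup table from --model_name to a model class.
MODELS = {"GCN": GCN, "MyGNN": MyGNN}


def gen_model(args):
    """Step 3: map --model_name to a constructor and build the model."""
    if args.model_name not in MODELS:
        raise ValueError(f"unknown model_name: {args.model_name}")
    return MODELS[args.model_name](args.n_hidden, args.n_layers, args.dropout)


if __name__ == "__main__":
    model = gen_model(args_init(["--model_name=MyGNN", "--n-hidden=64"]))
    print(type(model).__name__, model.n_hidden)
```

Keeping the lookup in one place means a new model only requires a class definition, any extra flags, and one new entry in the dispatch.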

Add your PLM model:

  1. Go to the LM/Model/ path and create a folder named after your model. Define __init__.py and config.py in it (see how these two files are defined in the other folders).
  2. Add the parameters you need to the parser() function in lm_utils.
  3. If your model can't be loaded from huggingface, pass the path to your model's folder via the 'pretrain_path' parameter.
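The role of 'pretrain_path' in step 3 can be illustrated with a small fallback rule: use the local folder when one is given, otherwise use a hub identifier. The sketch below is an assumption about how such a parameter is typically consumed, not the code in lm_utils; the function bodies, the `HUB_NAMES` table, and the choice of hub identifiers are hypothetical.

```python
import argparse

# Hypothetical mapping from --model values to Hugging Face hub identifiers.
# Which checkpoints the repository actually uses is not stated in this README.
HUB_NAMES = {
    "Bert": "bert-base-uncased",
    "Deberta": "microsoft/deberta-base",
}


def parser(argv=None):
    """Parse the two flags relevant to choosing a checkpoint."""
    p = argparse.ArgumentParser()
    p.add_argument("--model", default="Bert")
    # Local folder holding the model; leave unset to load from the hub.
    p.add_argument("--pretrain_path", default=None)
    return p.parse_args(argv)


def resolve_checkpoint(args):
    """Prefer the local folder when given, otherwise use the hub name."""
    if args.pretrain_path:
        return args.pretrain_path
    if args.model not in HUB_NAMES:
        raise ValueError(f"no hub name known for {args.model}; set --pretrain_path")
    return HUB_NAMES[args.model]


if __name__ == "__main__":
    print(resolve_checkpoint(parser(["--model=Deberta"])))
    print(resolve_checkpoint(parser(["--model=Bert", "--pretrain_path=/tmp/my_model"])))
```

The resolved string could then be handed to a loader such as the transformers `from_pretrained` methods, which accept either a hub identifier or a local directory.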

Main experiments in CS-TAG

Representation learning on TAGs often depends on two types of models: Graph Neural Networks (GNNs) and Language Models. For the latter, we typically use Pretrained Language Models (PLMs) to encode the text. For the GNNs, we follow the DGL toolkit and implement them in the GNN library. For the PLMs, we follow the huggingface trainer and implement them in a single shared pipeline. We acknowledge that no comparison between the two types of baselines is absolutely fair.

Citation

If you use our datasets, please consider citing our work:

@article{yan2023comprehensive,
  title={A Comprehensive Study on Text-attributed Graphs: Benchmarking and Rethinking},
  author={Yan, Hao and Li, Chaozhuo and Long, Ruosong and Yan, Chao and Zhao, Jianan and Zhuang, Wenwen and Yin, Jun and Zhang, Peiyan and Han, Weihao and Sun, Hao and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={17238--17264},
  year={2023}
}
