lcs2-iiitd / hyphen

[NeurIPS 2022 Oral (Spotlight)] Public Wisdom Matters! Discourse-Aware Hyperbolic Fourier Co-Attention for Social-Text Classification

Home Page: https://arxiv.org/abs/2209.13017

License: MIT License

Python 99.95% Shell 0.05%
fake-news-detection fourier-transform hate-speech-detection hyperbolic-geometry text-classification explainability co-attention fake-news-dataset hgcn hyperbolic-embeddings

Hyphen

Implementation of Public Wisdom Matters! Discourse-Aware Hyperbolic Fourier Co-Attention for Social-Text Classification, accepted at NeurIPS 2022, as an Oral (Spotlight) paper.

Authors: Karish Grover, S.M. Phaneendra Angara, Md. Shad Akhtar, Tanmoy Chakraborty

🛠 Dependencies and Installation

  • torch 1.11.0
  • geoopt 0.1.1 (Hyperbolic optimization and Maths)
  • Penman 1.2.1 (Abstract Meaning Representation)
  • dgl_cu111 0.8.1 (Graphs)
  • Other required packages in requirements.txt
# git clone this repository
git clone https://github.com/LCS2-IIITD/Hyphen
cd Hyphen

# install python dependencies
pip3 install -r requirements.txt

⚡️ Quick inference

We have performed extensive experiments and ablation studies across 4 tasks and 10 datasets: fake news detection (antivax, politifact, gossipcop), hate speech detection (hasoc), sarcasm detection (figlang_twitter, figlang_reddit), and rumour detection (twitter16, rumoureval, pheme, twitter15).

Download the final preprocessed pickle files for all 10 datasets from here, and save them as data/{d_name}_preprocessed.pkl. Next, to run the complete Hyphen-hyperbolic model on the politifact dataset, use the following script:

python3 run.py --manifold PoincareBall --lr 0.001 --dataset politifact  --batch-size 32 --epochs 5 --max-sents 20 --max-coms 10 --max-com-len 10 --max-sent-len 10 --log-path logging/run

Finally, to track the evolution of loss, accuracy, and other metrics throughout the training process, use tensorboard as follows:

tensorboard --logdir logging/run
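
Before launching the run above, it may help to confirm that the downloaded pickle files sit where the save pattern says they should. The following is a minimal sketch, assuming the data/{d_name}_preprocessed.pkl pattern and the ten dataset names listed earlier; the helper name missing_pickles is hypothetical and not part of the repository.

```python
from pathlib import Path

# Dataset names as listed in the task overview above.
DATASETS = [
    "antivax", "politifact", "gossipcop", "hasoc",
    "figlang_twitter", "figlang_reddit",
    "twitter16", "rumoureval", "pheme", "twitter15",
]


def missing_pickles(data_dir="data", datasets=DATASETS):
    """Return the expected pickle paths that do not exist yet.

    Hypothetical helper: it only checks file presence and does not
    open or validate the pickle contents.
    """
    root = Path(data_dir)
    expected = [root / f"{name}_preprocessed.pkl" for name in datasets]
    return [path for path in expected if not path.is_file()]
```

Calling missing_pickles() from the repository root returns an empty list once all ten files are in place.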

🛃 Custom dataset processing

Start with a CSV file of your custom dataset, named d_name, containing the columns id, text, comments, and labels.

  • Each row of this CSV file corresponds to a news post identified by id, with content text, the class given in labels, and public discourse comments, where individual comments are separated by ::.
  • The comments are stored as a single string of the form {comm_1}::{comm_2}:: ... ::{comm_n}. Save this CSV file in the data folder as data/{d_name}/{d_name}.csv.
  • We provide these raw dataset files for all 10 datasets in the data folder (e.g. data/politifact/politifact.csv); check them out to understand the format better.
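
The CSV layout described above can be sketched with Python's standard csv module. This is only an illustration of the column layout and the :: separator: the helper names (make_row, write_rows, read_rows) are hypothetical, and the label value and its type in the usage example are assumptions.

```python
import csv
import io

COMMENT_SEP = "::"
COLUMNS = ["id", "text", "comments", "labels"]


def make_row(post_id, text, comments, label):
    """Build one CSV row; the comments are joined into a single string."""
    for comment in comments:
        if COMMENT_SEP in comment:
            raise ValueError("a comment must not contain the '::' separator")
    return {"id": post_id, "text": text,
            "comments": COMMENT_SEP.join(comments), "labels": label}


def write_rows(rows, handle):
    """Write a header line followed by the given rows."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_rows(handle):
    """Read rows back, splitting the comments column into a list."""
    rows = []
    for row in csv.DictReader(handle):
        raw = row["comments"]
        row["comments"] = raw.split(COMMENT_SEP) if raw else []
        rows.append(row)
    return rows


# Usage with an in-memory buffer; a real file would be opened with newline="".
buffer = io.StringIO()
write_rows([make_row("post_1", "Some news text.", ["first", "second"], 1)], buffer)
buffer.seek(0)
print(read_rows(buffer)[0]["comments"])
```

Rejecting comments that already contain :: keeps the split on reading unambiguous; how the provided datasets handle that case is not stated here.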

Abstract Meaning Representation (AMR) creation and merging

First, we need to convert the user comments into AMRs. Generate the AMRs for all user comments in a dataset by running the following script from the root folder, Hyphen. Running AMR generation on a GPU is faster:

python3 amr/amr_gen.py --dataset {d_name} --max-comments 50

This generates the AMR graphs for the user comments and saves them at {d_name}_amr/{d_name}_amr_csv/ as {d_name}_{post_id}.csv files, where each CSV file contains the generated AMR graphs for up to max-comments comments of one social media post. Next, relabel the attribute and instance variable names across all AMRs:

python3 amr/amr_var.py --dataset {d_name}

This saves the resulting AMR graphs in Penman notation as {d_name}.amr.penman at {d_name}_amr/{d_name}_amr_coref/. Each .penman file contains the relabelled AMR graphs. Next, we perform inter-comment coreference resolution across the AMR graphs of the multiple comments corresponding to one social media post:

python3 amr/amr_coref/amr_coref.py --dataset {d_name}

After coreference resolution, we get the file {d_name}_amr/{d_name}_amr_coref.json. Edges labelled :COREF are then added between the nodes present in a coreference cluster. Finally, we add the dummy node and its edges, completing the final step in merging the AMRs to form the macro-AMR:

python3 amr/amr_dummy.py --dataset {d_name}
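
As a rough illustration of the merging idea described above, the sketch below joins per-comment graphs, given as plain (source, role, target) triples, into one edge list. It is not the repository's implementation: the function name merge_amrs, the :COMMENT role for dummy-node edges, and the choice to chain consecutive cluster members with :COREF edges are all assumptions made for the example.

```python
def merge_amrs(comment_graphs, coref_clusters, dummy="dummy"):
    """Merge per-comment AMR graphs into a single edge list.

    comment_graphs: list of (root, triples) pairs, where triples are
        (source, role, target) tuples whose variable names are already
        unique across comments (the relabelling step above).
    coref_clusters: list of lists of variable names that corefer.
    """
    merged = []
    for root, triples in comment_graphs:
        merged.extend(triples)
        # Hypothetical role name: link the dummy node to each comment root.
        merged.append((dummy, ":COMMENT", root))
    for cluster in coref_clusters:
        # Assumption: chain consecutive cluster members with :COREF edges.
        for left, right in zip(cluster, cluster[1:]):
            merged.append((left, ":COREF", right))
    return merged


# Two toy comment graphs whose second nodes refer to the same entity.
first = ("c1_a", [("c1_a", ":ARG0", "c1_b")])
second = ("c2_a", [("c2_a", ":ARG1", "c2_b")])
for edge in merge_amrs([first, second], [["c1_b", "c2_b"]]):
    print(edge)
```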

Running amr_dummy.py gives us the final merged AMRs for all the news posts at {d_name}_amr/{d_name}_amr_merge/{d_name}_{id}.amr.penman. Read the paper for more details on the AMR merging process. Specify the GloVe embedding path in amr_dgl.py, then convert the generated macro-AMRs to subgraphs in DGL format using:

python3 amr/amr_dgl.py --dataset {d_name} --test-split 0.1

This creates {d_name}.pkl, {d_name}_train.pkl and {d_name}_test.pkl files, in which every sample is of the form:

{'label':label, 'graph': dgl_graph, 'content': content, 'id': name, 'subgraphs':subgraphs}
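
A quick sanity check of the generated files could look like the sketch below. It assumes that each pickle holds an iterable of per-sample dicts with the keys shown above, which the text does not state explicitly; the helper names are hypothetical. Unpickling the real files also requires dgl to be installed, since the samples contain DGL graphs.

```python
import pickle

# Keys taken from the sample layout shown above.
EXPECTED_KEYS = {"label", "graph", "content", "id", "subgraphs"}


def check_samples(samples):
    """Return the indices of samples that lack any of the expected keys."""
    return [index for index, sample in enumerate(samples)
            if not EXPECTED_KEYS.issubset(sample)]


def load_and_check(path):
    """Load a pickle file and report the indices of incomplete samples."""
    # Assumption: the pickle holds an iterable of per-sample dicts.
    with open(path, "rb") as handle:
        samples = pickle.load(handle)
    return check_samples(samples)
```

An empty result means every sample carries all five keys.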

Final dataset preprocessing

Once the AMR graphs are prepared, we bring the news sentences and AMRs together and pass them through one last preprocessing step, which mainly includes shuffling, a few transformations, and train-test splits. Specify the GloVe embedding path in preprocess.py, then run:

python3 preprocess.py --dataset {d_name}

This creates the final dataset pickle file as data/preprocessed_{d_name}.pkl. The other intermediate files are available here for reference, for all 10 datasets.
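
The custom-dataset steps above can be collected into one ordered list of commands. The sketch below only assembles and prints them; the script names and flags are copied from the commands in this section, while the function name custom_dataset_commands and the dataset name my_dataset are hypothetical.

```python
import shlex


def custom_dataset_commands(d_name, max_comments=50, test_split=0.1):
    """Return the preprocessing commands in the order described above."""
    return [
        ["python3", "amr/amr_gen.py", "--dataset", d_name,
         "--max-comments", str(max_comments)],
        ["python3", "amr/amr_var.py", "--dataset", d_name],
        ["python3", "amr/amr_coref/amr_coref.py", "--dataset", d_name],
        ["python3", "amr/amr_dummy.py", "--dataset", d_name],
        ["python3", "amr/amr_dgl.py", "--dataset", d_name,
         "--test-split", str(test_split)],
        ["python3", "preprocess.py", "--dataset", d_name],
    ]


# Print the commands for a hypothetical dataset name.
for argv in custom_dataset_commands("my_dataset"):
    print(shlex.join(argv))
```

To execute the steps instead of printing them, each argv list could be passed to subprocess.run(argv, check=True) from the repository root, so that a failing step stops the sequence. The GloVe paths in amr_dgl.py and preprocess.py still have to be set by hand beforehand.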

🔂 Training

Specify the GloVe embedding path in main.py. Next, to run Hyphen-hyperbolic on the politifact dataset, use the following script:

python3 run.py --manifold PoincareBall --lr 0.001 --dataset politifact  --batch-size 32 --epochs 5 --max-sents 20 --max-coms 10 --max-com-len 10 --max-sent-len 10 --log-path logging/run

You may also try various ablations of Hyphen. To run Hyphen-euclidean w/o Fourier, use the following script:

python3 run.py --no-fourier --manifold Euclidean --lr 0.001 --dataset politifact  --batch-size 32 --epochs 5 --max-sents 20 --max-coms 10 --max-com-len 10 --max-sent-len 10 --log-path logging/run

To run Hyphen-hyperbolic w/o Comments, use the following script:

python3 run.py --no-comments --manifold PoincareBall --lr 0.001 --dataset politifact  --batch-size 32 --epochs 5 --max-sents 20 --max-coms 10 --max-com-len 10 --max-sent-len 10 --log-path logging/run

To run Hyphen-hyperbolic w/o Content, use the following script:

python3 run.py --no-content --manifold PoincareBall --lr 0.001 --dataset politifact  --batch-size 32 --epochs 5 --max-sents 20 --max-coms 10 --max-com-len 10 --max-sent-len 10 --log-path logging/run

Use the command-line arguments specified in run.py to set the hyperparameters and to experiment with other ablations of Hyphen. Finally, to track the evolution of loss, accuracy, and other metrics throughout the training process, use tensorboard as follows:

tensorboard --logdir logging/run
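
When comparing several ablations, the training commands above can be generated rather than typed by hand. The sketch below is a hypothetical helper that uses only the flags appearing in the commands of this section; the function name run_command and its keyword arguments are assumptions, and run.py may accept further options that are not covered here.

```python
import shlex

# Ablation flags taken from the commands above.
ABLATION_FLAGS = {"no-fourier", "no-comments", "no-content"}


def run_command(dataset="politifact", manifold="PoincareBall", ablation=None,
                lr=0.001, batch_size=32, epochs=5, log_path="logging/run"):
    """Assemble the argument list for one training run."""
    argv = ["python3", "run.py"]
    if ablation is not None:
        if ablation not in ABLATION_FLAGS:
            raise ValueError(f"unknown ablation: {ablation}")
        argv.append(f"--{ablation}")
    argv += ["--manifold", manifold, "--lr", str(lr), "--dataset", dataset,
             "--batch-size", str(batch_size), "--epochs", str(epochs),
             "--max-sents", "20", "--max-coms", "10",
             "--max-com-len", "10", "--max-sent-len", "10",
             "--log-path", log_path]
    return argv


# Reproduce the Hyphen-euclidean w/o Fourier command shown above.
print(shlex.join(run_command(manifold="Euclidean", ablation="no-fourier")))
```

Giving each run its own log_path (for example logging/no-fourier, a hypothetical directory name) would keep the runs apart on disk; pointing tensorboard --logdir at the parent folder should then show them as separate runs.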

📚 Sentence-level Fact-checked Annotated dataset

Hyphen fine-grained explainability evaluation - Annotated dataset release! 💿

We hereby release the annotated Politifact dataset. The dataset is available here. This is the first-ever sentence-level fact-checked dataset. Find more details about the released dataset, its format, and the annotation process in the ReadMe file.

Abstract: Fake news 📰 is often generated by manipulating only a small part of the true information, i.e. entities, relations, small parts of a sentence, or a paragraph. Certain true information may also be present in the news piece to make it more appealing to the public, so it is crucial to distinguish between the true and fake parts of a piece of information. We therefore utilise and release a sentence-level fact-checked annotated dataset. We annotate the Politifact dataset with ground truth evidence corresponding to different parts of the news text, by referring to the fact-checking websites Politifact and Gossipcop and to other trustworthy online sources.

📞 Contact

If you have any questions or issues, please feel free to reach out to Karish Grover at [email protected].

✏️ Citation

If you think that this work is helpful, please feel free to leave a star ⭐️ and cite our paper:

@article{grover2022public,
  title={Public Wisdom Matters! Discourse-Aware Hyperbolic Fourier Co-Attention for Social Text Classification},
  author={Grover, Karish and Angara, SM and Akhtar, Md Shad and Chakraborty, Tanmoy},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={9417--9431},
  year={2022}
}
