
Official repository for "Zoom Out and Observe: News Environment Perception for Fake News Detection", ACL 2022.


News Environment Perception

This is the official repository of the paper:

Zoom Out and Observe: News Environment Perception for Fake News Detection

Qiang Sheng, Juan Cao, Xueyao Zhang, Rundong Li, Danding Wang, and Yongchun Zhu

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)

PDF / Poster / Code / Chinese Video / Chinese Blog / English Blog

Datasets

The experimental datasets are in the dataset folder, including the Chinese dataset and the English dataset. Note that you can download the datasets only after submitting an "Application to Use the Datasets for News Environment Perceived Fake News Detection".

Code

Key Requirements

python==3.6.10
torch==1.6.0
transformers==4.0.0

Preparation

Step 1: Obtain the representations of posts and news environment items

Step 1.1: Prepare the SimCSE model

Due to GitHub's file size limits, the SimCSE training data is hosted on Google Drive. Download the dataset file (i.e., [dataset]_train.txt) and move it into preprocess/SimCSE/train_SimCSE/data in this repo. Then:

cd preprocess/SimCSE/train_SimCSE

# Configure the dataset.
sh train.sh

You can also train the SimCSE model on your own dataset.
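For readers unfamiliar with SimCSE, the following is a minimal, dependency-free sketch of the contrastive (InfoNCE) objective that unsupervised SimCSE is built on. It is not the code run by train.sh. The function names, the toy vectors, and the temperature of 0.05 are illustrative assumptions, and it assumes each sentence is encoded twice (for example with different dropout masks) so that the two views of one sentence form the positive pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(views_a, views_b, temperature=0.05):
    """Mean InfoNCE loss; views_a[i] and views_b[i] encode the same sentence.

    Hypothetical helper for illustration only (temperature is an assumption).
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the matching view.
        total += log_denominator - logits[i]
    return total / len(views_a)


# Toy 2-d "embeddings": matched views should give a lower loss than swapped ones.
aligned = simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is small when each sentence is closest to its own second view and large when it is closer to another sentence's view, which is the property that makes the resulting representations usable for similarity ranking in later steps.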

Step 1.2: Obtain the texts' representations

cd preprocess/SimCSE

# Configure the dataset.
sh run.sh

Step 2: Construct the macro & micro environment

Get the macro environment and rank its internal items by similarity to the post:

cd preprocess/NewsEnv

# Configure the specific T days of the macro environment.
sh run.sh
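As a rough illustration of what this step computes, the sketch below selects the news items published within T days before a post and ranks them by cosine similarity to the post's representation. It is a simplified stand-in, not the code in preprocess/NewsEnv: the dictionary fields (`time`, `vec`, `id`), the function name `build_macro_env`, and the toy data are all hypothetical.

```python
import math
from datetime import datetime, timedelta


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_macro_env(post, news_items, t_days=3):
    """Return news items from the t_days before the post, most similar first.

    Hypothetical helper: field names and the default t_days are assumptions.
    """
    window_start = post["time"] - timedelta(days=t_days)
    in_window = [
        item for item in news_items
        if window_start <= item["time"] < post["time"]
    ]
    return sorted(
        in_window,
        key=lambda item: cosine(post["vec"], item["vec"]),
        reverse=True,
    )


post = {"time": datetime(2021, 5, 10), "vec": [1.0, 0.0]}
news = [
    {"id": "a", "time": datetime(2021, 5, 9), "vec": [0.2, 0.8]},
    {"id": "b", "time": datetime(2021, 5, 8), "vec": [0.9, 0.1]},
    {"id": "c", "time": datetime(2021, 4, 1), "vec": [1.0, 0.0]},   # too old
    {"id": "d", "time": datetime(2021, 5, 11), "vec": [1.0, 0.0]},  # after the post
]
macro_ids = [item["id"] for item in build_macro_env(post, news, t_days=3)]
print(macro_ids)  # ['b', 'a']
```

One plausible way to derive a micro environment from this ranking is to keep only a top slice of it; the exact rule is whatever preprocess/NewsEnv implements.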

Step 3: Prepare for the specific detectors

This step prepares the inputs required by the specific detectors. There are six base models in our paper, and their preparation dependencies are as follows:

Group       Model     Input (Tokenization)   Special Preparation
Post-Only   Bi-LSTM   Word Embeddings        -
Post-Only   EANN      Word Embeddings        Event Adversarial Training
Post-Only   BERT      BERT's Tokens          -
Post-Only   BERT-Emo  BERT's Tokens          Emotion Features
"Zoom-In"   DeClarE   Word Embeddings        Fact-checking Articles
"Zoom-In"   MAC       Word Embeddings        Fact-checking Articles

The table above involves five preprocessing steps in total: (1) Tokenization by Word Embeddings, (2) Tokenization by BERT, (3) Event Adversarial Training, (4) Emotion Features, and (5) Fact-checking Articles. We describe each of the five in turn.

Tokenization by Word Embeddings

This tokenization depends on external pretrained word embeddings. In our paper, we use sgns.weibo.bigram-char (download URL) for Chinese and glove.840B.300d (download URL) for English.

cd preprocess/WordEmbeddings

# Configure the dataset and your local word-embeddings filepath. 
sh run.sh
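To make the idea concrete, here is a minimal sketch of tokenizing with a pretrained embedding file: parse the text-format vectors, build a vocabulary, and map tokens to padded id sequences. It is not the script in preprocess/WordEmbeddings. The helper names, the `<pad>` and `<unk>` entries, and the toy three-dimensional file are assumptions, and it assumes a whitespace-separated "word v1 v2 ..." layout with an optional "count dim" header line.

```python
import io


def load_embeddings(handle):
    """Parse 'word v1 v2 ...' lines, skipping an optional 'count dim' header."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) <= 2:  # header line ("<count> <dim>") or blank line
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_vocab(vectors):
    """Assign ids to words, reserving 0 for padding and 1 for unknown tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for word in vectors:
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# A toy stand-in for a real embedding file such as glove.840B.300d.
toy_file = io.StringIO("2 3\nfake 0.1 0.2 0.3\nnews 0.4 0.5 0.6\n")
vocab = build_vocab(load_embeddings(toy_file))
token_ids = encode(["fake", "news", "spreads"], vocab, max_len=5)
print(token_ids)  # [2, 3, 1, 0, 0]
```

Words missing from the embedding file map to the unknown id, so the choice of embedding file directly determines how much of each post the word-embedding models can see.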

Tokenization by BERT

cd preprocess/BERT

# Configure the dataset and the pretrained model
sh run.sh

Event Adversarial Training

cd preprocess/EANN

# Configure the dataset and the event number
sh run.sh

Emotion Features

cd preprocess/Emotion/code/preprocess

# Configure the dataset
sh run.sh

Fact-checking Articles

There are two preparation steps for fact-checking articles:

  1. Retrieve the most relevant articles for every post. Specifically, we have retrieved each post's top-10 relevant articles, restricted to articles published BEFORE the post; the results are saved in the preprocess/BM25/data folder. For more implementation details, refer to preprocess/BM25/[dataset].ipynb.
  2. Tokenize the fact-checking articles by word embeddings:

cd preprocess/WordEmbeddings

# Configure the dataset and your local word-embeddings filepath. Set the data_type as 'article'.
sh run.sh
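The retrieval in item 1 above can be sketched as follows: score articles against a post with BM25, but only consider articles published before the post. This is a simplified illustration, not the notebook in preprocess/BM25. The field names, the integer timestamps, the helper names, the Lucene-style IDF variant, and the `k1` and `b` defaults are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Lucene-style IDF, which stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def retrieve_before(post, articles, top_k=10):
    """Rank only articles published before the post; return the top_k ids."""
    earlier = [a for a in articles if a["time"] < post["time"]]
    if not earlier:
        return []
    scores = bm25_scores(post["tokens"], [a["tokens"] for a in earlier])
    ranked = sorted(zip(scores, earlier), key=lambda pair: pair[0], reverse=True)
    return [article["id"] for _, article in ranked[:top_k]]


post = {"time": 5, "tokens": ["vaccine", "rumor"]}
articles = [
    {"id": "x", "time": 3, "tokens": ["vaccine", "rumor", "debunked"]},
    {"id": "y", "time": 4, "tokens": ["weather", "report"]},
    {"id": "z", "time": 9, "tokens": ["vaccine", "rumor"]},  # published after the post
]
top_ids = retrieve_before(post, articles, top_k=1)
print(top_ids)  # ['x']
```

Filtering by time before scoring keeps later articles (here "z") out of the candidate pool entirely, so a detector never sees evidence that did not exist when the post was published.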

Training and Inferring

cd model

# Configure the dataset and the parameters of the model
sh run.sh

After that, the results and classification reports will be saved in ckpts/[dataset]/[model].

Citation

If you find our dataset and code helpful, please cite the following ACL 2022 paper:

@inproceedings{NEP,
    title = "Zoom Out and Observe: News Environment Perception for Fake News Detection",
    author = "Sheng, Qiang  and
      Cao, Juan  and
      Zhang, Xueyao  and
      Li, Rundong and
      Wang, Danding  and
      Zhu, Yongchun",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    publisher = "Association for Computational Linguistics"
}

Because the HuffPost part of the English news environment is based on the News Category Dataset, please also cite the following, as the Kaggle page requires:

@dataset{misra2018news,
  title={News Category Dataset},
  author={Misra, Rishabh},
  year = {2018},
  month = {06},
  doi = {10.13140/RG.2.2.20331.18729}
}
@book{misra2021sculpting,
  author = {Misra, Rishabh and Grover, Jigyasa},
  year = {2021},
  month = {01},
  pages = {},
  title = {Sculpting Data for ML: The first act of Machine Learning},
  isbn = {9798585463570}
}
