This repository contains resources for the article Capturing the Style of Fake News, presented in the AI for Social Impact track at AAAI 2020 in New York. The research was carried out within the HOMADOS project at the Institute of Computer Science, Polish Academy of Sciences.
The resources available here are the following:
- a corpus including credible and non-credible (fake) news documents,
- code for a credibility classifier based on stylometric features,
- code for a credibility classifier based on neural networks.
If you need any more information, consult the paper or contact its author!
NOTE: A new and improved version (v2.0) of this corpus was developed to create the Credibilator browser extension and is available in its repository.
The corpus generated in this research contains 103,219 documents from 18 credible and 205 non-credible sources, selected based on the work of PolitiFact and Pew Research Center.
The folder NewsStyleCorpus contains the following files necessary to retrieve the pages constituting the corpus from the WayBackMachine archive:
- corpusSources.tsv: a tab-separated list of all documents in the corpus, each with the website (domain) it comes from, its credibility label, the original page URL and the address under which the document is currently available in the archive,
- CredibilityCorpusDownloader.java: sample Java code that retrieves HTML documents from the given address list and converts them to plain text, following the procedure described in the article,
- foldsCV.tsv: a list of fold identifiers for the documents from corpusSources.tsv (in the same order) for the three CV scenarios described in the paper: document-based, topic-based and source-based.
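Because the fold identifiers are listed in the same order as the documents, a fold file can be turned into train/test index lists by row position. The following is a minimal Python sketch of that idea, not the evaluation code used in the paper. It assumes that foldsCV.tsv has one column per CV scenario and no header row; the column index, the helper names `read_tsv` and `cv_splits`, and the in-memory sample data are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Return the non-empty rows of a tab-separated file as lists of strings."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


def cv_splits(fold_ids):
    """Yield (fold, train_indices, test_indices), holding out one fold at a time."""
    by_fold = defaultdict(list)
    for index, fold in enumerate(fold_ids):
        by_fold[fold].append(index)
    for fold in sorted(by_fold):
        test = by_fold[fold]
        train = [i for i, f in enumerate(fold_ids) if f != fold]
        yield fold, train, test


# Tiny in-memory stand-in for foldsCV.tsv; the real column layout may differ.
sample_folds = io.StringIO("1\t2\t1\n2\t2\t1\n1\t1\t2\n2\t1\t2\n")
rows = read_tsv(sample_folds)

scenario_column = 0  # assumption: the first column holds one scenario's folds
fold_ids = [row[scenario_column] for row in rows]
splits = list(cv_splits(fold_ids))
```

The row indices in each split refer to positions in corpusSources.tsv, so the same indices can be used to select the corresponding documents and labels.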
Downloading the whole corpus takes several hours. In order to limit the load on the WayBackMachine infrastructure and retrieve all the pages (some may be temporarily unavailable), you should perform the process in stages. You can select just part of the corpus for download by modifying the address list.
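One way to stage the download is to cut the address list into smaller files and process them one at a time. The sketch below is only an illustration in Python: it assumes the address list has no header row, and the function names, the `stage` file prefix and the stage size are hypothetical. How the downloader is pointed at a particular list depends on CredibilityCorpusDownloader.java and is not shown here.

```python
def split_into_stages(lines, stage_size):
    """Split the address-list lines into consecutive chunks of stage_size."""
    if stage_size < 1:
        raise ValueError("stage_size must be at least 1")
    return [lines[i:i + stage_size] for i in range(0, len(lines), stage_size)]


def write_stages(source_path, stage_size, prefix="stage"):
    """Write each chunk of the address list to its own file; return the paths."""
    with open(source_path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    paths = []
    stages = split_into_stages(lines, stage_size)
    for number, stage in enumerate(stages, start=1):
        path = f"{prefix}{number:03d}.tsv"
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(stage) + "\n")
        paths.append(path)
    return paths
```

For example, `write_stages("corpusSources.tsv", 5000)` would produce files named stage001.tsv, stage002.tsv and so on, each of which can be retried separately if some pages are temporarily unavailable.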
The implementation of the stylometric classifier is available in two folders:
- NewsFeatures is a Java application for generating the stylometric features (through the Main.main() procedure) for a given text corpus. It uses Stanford CoreNLP and an extended version of the General Inquirer word list, to be found in NewsFeatures/resources.
- R contains a script in R for building a glmnet model on the generated features and performing evaluation according to the CV scenarios.
The folder BiLSTMAvg contains source code of the document-averaged BiLSTM neural network. The following files are included:
- model.py with the BiLSTMAvg model implemented in TensorFlow/Keras,
- functions.py with utility functions,
- run.py showing how to use the above to replicate the cross-validation evaluation as shown in the article.
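The term "document-averaged" suggests that scores computed for smaller units of a document are averaged into a single document-level score. The sketch below illustrates only that aggregation step in plain Python. It assumes the units are sentences and that each one comes with a vector of class probabilities; the function names and the label order are hypothetical, and model.py remains the authoritative implementation.

```python
def document_score(sentence_probs):
    """Average per-sentence class probabilities into one document-level vector."""
    if not sentence_probs:
        raise ValueError("a document needs at least one sentence")
    n_classes = len(sentence_probs[0])
    totals = [0.0] * n_classes
    for probs in sentence_probs:
        if len(probs) != n_classes:
            raise ValueError("all sentences must have the same number of classes")
        for k, p in enumerate(probs):
            totals[k] += p
    return [t / len(sentence_probs) for t in totals]


def predict_label(sentence_probs, labels=("credible", "non-credible")):
    """Return the label with the highest averaged probability (label order assumed)."""
    scores = document_score(sentence_probs)
    return labels[scores.index(max(scores))]
```

Averaging in this way gives every sentence the same weight regardless of its length, which is one possible design choice among several.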
The code was tested on Python 3.6.8 with TensorFlow 1.14. Java code for converting the News Style Corpus to the format used by BiLSTMAvg or the BERT baseline is provided as DataConversion.java.
The model uses word2vec embeddings trained on the Google News corpus. You can download them here or use your own method for token representation.
- The corpus data are released under the CC BY-NC-SA 4.0 licence.
- The code is released under the GNU GPL 3.0 licence.
- The extended GI dictionary is based on the original General Inquirer dictionary; see its page for copyright information.