MisInfoText

This repository includes information and resources for automatic misinformation detection, which has become an increasingly important topic in Natural Language Processing. Here, misinformation refers to the distribution of false information (as opposed to facts) in the context of news. One approach to detecting misinformation automatically is to combine machine learning and computational linguistic techniques to build predictive models from labeled news articles (fake/real or false/true items). Such models can then be applied to an unseen news article to score its veracity based on its linguistic characteristics. At the Discourse Processing Lab, we have started a project to collect data suitable for automatic misinformation detection from a variety of quality sources. In particular, we make sure that our data collections provide the full text of each news article together with a reliable veracity label indicating its truth content.
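As an illustration of the kind of predictive model described above, here is a minimal sketch of a bag-of-words Naive Bayes classifier. It is not the lab's actual model: the class name `NaiveBayesVeracity`, the tokenizer, and the four training sentences are all made up for this example, and a real system would use richer linguistic features and far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesVeracity:
    """Multinomial Naive Bayes over word counts with add-one smoothing.

    Hypothetical helper for illustration only.
    """

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of articles
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalized log-probability of each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, made-up training items (not drawn from any dataset listed in this README).
train_texts = [
    "shocking miracle cure doctors hate this secret",
    "you won't believe this shocking secret revealed",
    "the council approved the budget on tuesday",
    "the ministry published the quarterly report on tuesday",
]
train_labels = ["false", "false", "true", "true"]

model = NaiveBayesVeracity().fit(train_texts, train_labels)
prediction = model.predict("shocking secret cure revealed")
```

The model scores an unseen text purely from word frequencies seen under each label, which is the simplest instance of scoring veracity "based on linguistic characteristics".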

The MisInfoText repository consists of three major sections:

  • List of datasets that we have collected by scraping fact-checking websites, whose basic function is to find and tag suspicious news.
  • List of other datasets that have been published in previous NLP papers and are useful for building misinformation detection models.
  • List of potential fact-checking websites that we have not yet used in our data collection effort but could be employed in future work.

Data scraped from fact-checking websites

We are developing a large dataset containing instances of fact-checked news articles. To do so, we use automatic scrapers to crawl fact-checking websites for false/true claims and headlines, links to the original news articles spreading them, and the veracity labels assigned by the fact-checkers. The details of this process depend on the structure of each fact-checking website and on how much information it includes about the sources of the news item under discussion. Currently, we have scraped and cleaned data from four fact-checking websites (Snopes, Buzzfeed, Politifact and Emergent).
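To make the scraped fields concrete, the following sketch shows one possible record layout for a fact-checked item. The class `FactCheckedItem`, its field names, and the example URLs are assumptions made for illustration; they do not reflect the actual column names in the published data tables.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FactCheckedItem:
    """One scraped item (hypothetical schema, not the lab's actual format)."""
    claim: str              # claim or headline discussed by the fact-checker
    veracity_label: str     # label assigned by the fact-checker
    fact_checker: str       # e.g. "snopes", "politifact"
    fact_check_url: str     # page on the fact-checking website
    source_url: str         # link to the original news article
    article_text: str = ""  # full text of the original article, if retrieved

    def is_complete(self) -> bool:
        """An item is usable only with both full text and a veracity label."""
        return bool(self.article_text.strip()) and bool(self.veracity_label.strip())


def to_jsonl(items):
    """Serialize items as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(item), ensure_ascii=False) for item in items)


# Placeholder records with made-up values.
items = [
    FactCheckedItem(
        claim="Example claim A",
        veracity_label="false",
        fact_checker="snopes",
        fact_check_url="https://example.org/fact-check/a",
        source_url="https://example.org/news/a",
        article_text="Placeholder article text.",
    ),
    FactCheckedItem(
        claim="Example claim B",
        veracity_label="true",
        fact_checker="politifact",
        fact_check_url="https://example.org/fact-check/b",
        source_url="https://example.org/news/b",
    ),
]

usable = [item for item in items if item.is_complete()]
```

Filtering on `is_complete()` mirrors the requirement stated in the introduction that every item should carry both the full article text and a veracity label.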

We have published three large datasets through our lab website. These still require manual verification of the alignment between the claims and the original news texts (both are available in our data tables). Our annotators have manually verified this alignment for a relatively small portion of the data (the Snopes312, BuzzfeedUSE and BuzzfeedTop collections) and published the resulting dataset through another GitHub repository.
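Manual verification can be prioritized with a rough automatic check. The sketch below is one possible heuristic, not the procedure used by the annotators: it measures how many of a claim's content words appear in the article text and flags low-overlap pairs for review. The function names, the stopword list, and the 0.5 threshold are all assumptions.

```python
import re

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "was", "that", "for"}


def content_tokens(text):
    """Return the set of lowercase word tokens that are not stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def claim_coverage(claim, article_text):
    """Fraction of the claim's content words that also occur in the article."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(article_text)) / len(claim_tokens)


def needs_manual_check(claim, article_text, threshold=0.5):
    """Flag claim/article pairs whose word overlap falls below the threshold."""
    return claim_coverage(claim, article_text) < threshold


# Made-up examples.
claim = "The city council banned plastic bags"
aligned = "On Monday the city council voted and banned plastic bags in all shops."
unrelated = "The football team won the championship after a dramatic final."

aligned_score = claim_coverage(claim, aligned)
unrelated_score = claim_coverage(claim, unrelated)
```

Word overlap cannot tell whether an article supports or refutes a claim, so a check like this can only rank items for human review, not replace it.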

Please refer to the following papers for details on the data collection process and our automatic misinformation detection experiments:

We will soon publish a larger dataset that we have verified by recruiting annotators on the Figure Eight platform. Please stay tuned for more to come!

We also welcome suggestions for inclusion of other datasets in our list. Please see what follows and let us know if you think your dataset can be listed too!

List of veracity-labeled text collections

| Data paper | Size and type | Labeling system | Notes |
| --- | --- | --- | --- |
| Allcott H and Gentzkow M (2017) | 156 news articles | 5-way (false to true) | Collected from Snopes, Politifact and Buzzfeed fact-checking pages. Focused on the 2016 US election. |
| Ferreira W and Vlachos A (2016) | 1,612 news articles | 2-way (false/true) | Collected from Emergent.org. Unbalanced. Originally developed for stance detection. |
| Rubin VL, Conroy NJ, Chen Y and Cornwell S (2016) | 360 news articles | 2-way (satirical/legitimate) | Balanced by topic and label. A variety of topics. |
| Zhang AX, Ranganathan A, Metz SE, et al. (2018) | 40 news articles | Multiple (credibility indicators) | Continuous effort with the future goal of annotating 10,000 articles. |
| Perez-Rosas V, Kleinberg B, Lefevre A and Mihalcea R (2017) | 480 news articles | 2-way (fake/legitimate) | Balanced by topic and label. Fake items were artificially generated by Turkers. |
| Perez-Rosas V, Kleinberg B, Lefevre A and Mihalcea R (2017) | 200 news articles | 2-way (fake/legitimate) | Balanced by topic and label. Focused on celebrity stories. |
| Wang WY (2017) | 12.8K short statements | 6-way (false to true) | Collected using the Politifact API. |
| Thorne J, Vlachos A, Christodoulopoulos C and Mittal A (2018) | 185K short statements and supporting/refuting Wikipedia documents | 2-way (original/mutated) | Originally developed for stance detection. Mutated claims were artificially generated. |
| Asr FT, Taboada M (2019) | 1,380 news articles | 4-way (false, true, mixture, no factual content) | Collected using a pivot Buzzfeed dataset. Focused on the US election topic. |
| Asr FT, Taboada M (2019) | 33 news articles | 4-way (false, true, mixture, no factual content) | Collected using a pivot Buzzfeed dataset. A variety of topics. |
| Asr FT, Taboada M (2019) | 312 news articles | 5-way (false to true) | Collected from Snopes. Balanced by label. A variety of topics. Includes stance information (articles for or against a labeled claim). |
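Because the collections above use different labeling systems (2-way, 5-way, 6-way), combining them usually requires mapping labels onto a common scale. The sketch below shows one way to do this, assuming each system is an ordered scale from false to true. The scale names, the exact label strings (in particular those of the 5-way scale), and the function names are assumptions for illustration and should be checked against each dataset's documentation.

```python
# Ordered from least to most truthful. Label strings are assumptions.
LABEL_SCALES = {
    "six_way": ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"],
    "five_way": ["false", "mostly false", "mixture", "mostly true", "true"],
    "binary": ["false", "true"],
}


def veracity_score(label, scale_name):
    """Map a label to a score in [0, 1] by its position on an ordered scale."""
    scale = LABEL_SCALES[scale_name]
    position = scale.index(label.strip().lower())  # raises ValueError if unknown
    return position / (len(scale) - 1)


def to_binary(label, scale_name, cutoff=0.5):
    """Collapse a label to 'true'/'false'; return None for the exact midpoint."""
    score = veracity_score(label, scale_name)
    if score == cutoff:
        return None  # e.g. "mixture" has no clear binary counterpart
    return "true" if score > cutoff else "false"


half_true = to_binary("half-true", "six_way")
mixture = to_binary("mixture", "five_way")
pants_fire_score = veracity_score("pants-fire", "six_way")
```

Returning `None` for midpoint labels keeps ambiguous items out of a binary training set instead of forcing them into one class; whether to drop or keep such items is a design choice that depends on the experiment.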

List of potential fact-checking websites

We have investigated the following sources in our search for fact-checking websites that can serve as pivots for automatic data collection. Items marked with a star (*) are either currently being scraped at our lab or may be considered in the future. Our starting point for finding verified fact-checkers is Poynter.

| Website | Notes on automatic scraping possibility |
| --- | --- |
| BOOM | Poor format for data extraction; labels almost never present |
| Check Your Fact* | Clear format; there appears to be a pattern, so data extraction could work |
| Factcheck.org* | Good format in the debunking fact section |
| Ferret Fact Service* | Data extraction may work; labels (mfalse, false, true, mtrue, halftrue) |
| Full Fact | Few labels, but a verdict may be inferable from the paragraph on the side |
| Lead Stories | No pattern suitable for automatic data extraction |
| Pesa Check* | Clear labels, clear format and pattern; extraction may be possible |
| PolitiFact* | Currently using |
| Rappler | No labels; would be extremely difficult to extract |
| RMIT ABC Fact Check | Many different labels all over the place; very hard to generalize |
| Snopes* | Currently using |
| South Asia Check | No labels; format unclear |
| The Conversation FactCheck | Format unclear; pattern very inconsistent |
| TheJournal.ie Fact Check* | Clear pattern and labels; possibility for data extraction |
| The Washington Post Fact Checker* | Labels and a clear pattern; possibility for data extraction |
| VoxCheck | Very in-depth analysis, few labels; not necessarily news stories but long-term coverage of an event; data will be very difficult to extract |
| AP Fact Check | No labels; no apparent pattern |
| Climate Feedback* | Very clear pattern, format and sources; however, a variety of labels |
| FactCheck Northern Ireland | No labels, format or pattern |

