
fausa / bias_detection_in_journalism


A natural language processing project: use NewsAPI to gather the URLs of news articles, scrape the article text from those URLs, and train a bias classifier. The classifier is then run on news sources that AllSides Media rates as "centered" to test the validity of that rating.

Jupyter Notebook 94.19% HTML 5.74% Python 0.07%
dataset latent-dirichlet-allocation machine-learning news nonnegative-matrix-factorization sentiment-analysis support-vector-machine svc-model text-analysis topic


Bias Detection in American Journalism

Authors: Aaron Carr, Azucena Faus, Dave Friesen

Company Industry: News Media

Company Size: 3 (startup)

GitHub Repository

-- Programming Languages:

Python, MySQL

-- Project Status: [Completed]

Abstract:

There is a general mistrust of news media, and the public tends to follow news that mirrors its own political biases. But there is also a growing appetite for unbiased news and growing subscriptions to centered news sources.

Problem Statement:

The question of mainstream media bias is a significant area of contention in U.S. politics. At the same time, the political 'inclination' of a news outlet often aligns with the personal perspectives of its audience. This overlap provides a unique opportunity and our project's objective: To leverage sophisticated text mining methods and supporting data science techniques to classify the political bias of leading online news sources. This classification will be based on a combination of "politically polarizing" terms as identified through an impartial (academic) source, as well as sentiment analysis context around the use of these terms (Liu et al., 2022).

Objective:

The objective of developing this classifier is to advise our clients, the executives of a news source that prides itself on being considered politically “centered.” We would use the classifier to analyze their content on a continuous basis and report back the overall (average) political “lean” of their articles. The metric for overall publication lean is based on an average of the left/right-leaning probabilities per article. This feedback gives our client actionable insights so they can ensure their overall political lean metric remains centered. Note: The company is not real, but all data, analyses, and developed models are. See references for data sources.
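The averaging step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `publication_lean`, the signed score, and the `tolerance` band that defines "center" are all assumptions, and the probabilities in the example are made up.

```python
from statistics import mean


def publication_lean(p_right_per_article, tolerance=0.1):
    """Average per-article P(right) and map it to a coarse label.

    `tolerance` is a hypothetical half-width for the "center" band;
    the README does not state the threshold the authors used.
    """
    if not p_right_per_article:
        raise ValueError("need at least one article probability")
    avg_right = mean(p_right_per_article)
    # Signed score in [-1, 1]: negative leans left, positive leans right.
    score = 2 * avg_right - 1
    if score > tolerance:
        label = "right"
    elif score < -tolerance:
        label = "left"
    else:
        label = "center"
    return score, label


# Illustrative probabilities only, not project results.
score, label = publication_lean([0.35, 0.60, 0.48, 0.55])
print(f"{score:+.2f} {label}")
```

A signed score keeps the direction of the lean visible at a glance, and sorting articles by their individual probabilities would show which ones pull the average away from zero.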

Goals:

The success of this project will contribute to the achievement of the following goals:

  1. Using article content from Fox News, CNN, Breitbart, and The Washington Post, along with media bias information from AllSides Media (n.d.), develop a classifier with at least a 90% F1 Score on testing data that can differentiate "left" from "right" leaning articles.

  2. Train a model to predict political lean on unseen news articles that are sourced from "centered" news outlets and confirm the bias ratings from AllSides Media (n.d.).

  3. Make recommendations to our client, The Hill, on whether their online journalism remains centered or has shifted in political lean, as well as provide solutions and next steps.

Ultimately, the goal of this project is to provide clients who wish to stay true to ethical, unbiased journalism with a bias rating for their news content throughout the year, giving them the opportunity to find out which articles are causing a shift in lean and take the necessary steps to mitigate it.

Name of your selected dataset:

Queried news articles from CNN, Fox News, Breitbart, and The Washington Post covering most of May and part of June 2023.

Description of your selected dataset (data source, number of variables, size of dataset, etc.):

News articles will be sourced via a REST API called NewsAPI, which logs information on “current and historic news articles published by over 80,000 worldwide sources” (NewsAPI, n.d.). Out of the many possible attributes returned for each API query, this project will use six: source name, author, title, url, publishedAt, and content. Prior to data preprocessing, an additional feature (article_text) will be used to store the text scraped from each article's URL. The size of the final dataset (N = 4,026) was limited by both time restrictions and web scraping access to specific sites or articles. Queries for topics of political interest are used to gather articles from explicitly chosen sources. Independent studies show the political lean for each of these sources (CNN, “left”; Fox News, “right”; The Washington Post, “left”; Breitbart News, “right”), and this will help with training and validation of our classifier (AllSides, n.d.; Ralph & Relman, 2018).
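The collection step could look roughly like the sketch below, which builds a query URL for NewsAPI's `/v2/everything` endpoint and keeps the six attributes named above plus an empty `article_text` slot and a lean label. The helper names (`build_query_url`, `to_rows`), the source identifiers, and the exact source-name strings are assumptions for illustration; the project's notebooks may do this differently. No request is sent here, so the code runs without an API key.

```python
from urllib.parse import urlencode

NEWSAPI_EVERYTHING = "https://newsapi.org/v2/everything"

# Lean labels as stated in the dataset description (AllSides, n.d.).
# The name strings are assumed to match what NewsAPI returns.
SOURCE_LEAN = {
    "CNN": "left",
    "The Washington Post": "left",
    "Fox News": "right",
    "Breitbart News": "right",
}

# Article-level attributes kept in addition to the source name.
FIELDS = ("author", "title", "url", "publishedAt", "content")


def build_query_url(topic, source_ids, api_key, page=1):
    """Return a GET URL for one page of results on a topic."""
    params = {
        "q": topic,
        "sources": ",".join(source_ids),
        "language": "en",
        "pageSize": 100,
        "page": page,
        "apiKey": api_key,
    }
    return f"{NEWSAPI_EVERYTHING}?{urlencode(params)}"


def to_rows(response_json):
    """Flatten a parsed NewsAPI response into labeled rows."""
    rows = []
    for article in response_json.get("articles", []):
        name = (article.get("source") or {}).get("name")
        if name not in SOURCE_LEAN:
            continue  # skip sources without a known lean label
        row = {"source_name": name}
        row.update({field: article.get(field) for field in FIELDS})
        row["article_text"] = None  # filled in later by the web scraper
        row["lean"] = SOURCE_LEAN[name]
        rows.append(row)
    return rows


# Hypothetical source ids and a placeholder key.
print(build_query_url("immigration", ["cnn", "fox-news"], "YOUR_API_KEY"))
```

Labeling each row by its source at collection time means every article inherits its outlet's rating, which is the assumption the training labels rest on.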

Data Sources:

Master Persisted Train/Test Dataset

Associated Press Dataset

The_Hill_Dataset

Methods Used

  • Exploratory data analysis (EDA)
  • Text data preprocessing (e.g., normalization, tokenization)
  • Term frequency-inverse document frequency (TF-IDF) vectorization
  • Train/test split
  • Classification
  • Machine learning
  • Hyperparameter tuning
  • Model evaluation (e.g., confusion matrix, F1)
  • Sentiment Analysis
  • Topic Modeling
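To make the evaluation step concrete, here is a small sketch of how an F1 score follows from confusion-matrix counts for a two-class left/right problem. It is written in plain Python for illustration; treating "right" as the positive class is an assumption, and the README does not say whether the 90% target refers to a per-class or an averaged F1. The labels in the example are invented.

```python
def confusion_counts(y_true, y_pred, positive="right"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented labels for illustration only.
y_true = ["left", "right", "right", "left", "right"]
y_pred = ["left", "right", "left", "left", "right"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn, round(f1_from_counts(tp, fp, fn), 2))
```

In this toy case one "right" article is missed, so recall drops to 2/3 while precision stays at 1, which is the kind of imbalance a confusion matrix exposes and a single accuracy figure can hide.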

References
