Media Bias x ChatGPT

An ML model and app for detecting bias in media and AI-generated content on a set of topics.

This repository explores approaches to classifying topic, bias and political bias in sentences sourced from various media outlets, using OpenAI ADA embeddings. It additionally explores the political bias of content generated by ChatGPT, using the model trained on human-labeled data.

Dependencies

The code in this repository utilizes the following packages:

numpy
pandas
matplotlib
scipy
plotly
seaborn
scikit-learn
umap-learn
openai
tiktoken

The accompanying web app additionally depends on streamlit, which was used to build it and is necessary to run it locally.

Data and Modeling

The data used for training the machine learning models was obtained from the BABE dataset on Kaggle. I used the largest dataset of sentences labeled by human experts (SG2). Approaches to topic, bias and outlet bias classification were explored with sklearn and I found that:

a neighbors-based (distance-based) classifier works best for topic classification, due to the nature of ADA embeddings
LogisticRegression was the best classifier for bias, closely followed by RandomForest and MLP classifiers
an MLP classifier performed best for the political (outlet) bias prediction of the sentences

Additionally, all model hyperparameters were tuned using a grid search with 5-fold cross-validation. Because the classes in the dataset are imbalanced, particularly the topic labels, the weighted F1 score was used as the main metric to assess model performance across the board.
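
The tuning setup described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration and not the project's actual code: the data is synthetic (make_classification stands in for ADA embeddings), and the parameter grid, the cosine metric and the variable names are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced 3-class data standing in for sentence embeddings (assumption).
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_classes=3,
    weights=[0.7, 0.2, 0.1],
    random_state=0,
)

# Hypothetical grid; the values actually searched in the project are not listed here.
param_grid = {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]}

# 5-fold cross-validated grid search scored with the weighted F1 metric.
search = GridSearchCV(
    KNeighborsClassifier(metric="cosine"),  # cosine distance is an assumption
    param_grid,
    scoring="f1_weighted",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With scoring="f1_weighted", each class's F1 score is weighted by its number of true samples, so the score reflects the class distribution rather than treating a rare topic the same as a common one.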

Repository structure

The Project.ipynb notebook in the top-level directory covers the entire process of this project with explanations and visualizations.

The notebooks/ directory contains all the weird and random steps I took in the process of data exploration and model selection / hyperparameter tuning, including many that didn't make it into the final project deliverable.

All models and data used can be found in the data/ and models/ directories.

The top-level .py files in the repository contain all the code needed to run the streamlit app.

ChatGPT bias

The main question I tried to answer in the final deliverable of this project is whether content generated by ChatGPT is perceived as politically biased relative to content generally reported by media outlets. For this purpose, I prompted ChatGPT to produce several sentences on a small set of topics present in the training data. All content generated by ChatGPT was classified as non-biased, but in terms of political bias, left-leaning classifications were more prevalent. The accompanying interactive app can be used to test any content with a valid OpenAI API key!

Running the app locally

Clone this repo and install the dependencies with

pip install -r requirements.txt

Run the app with

streamlit run app.py

If you have an OpenAI API key and want to try out the content analyzer, all you need to do is create a new file called .env in the app directory and store your API key in it as:

OPENAI_API_KEY = 'your-api-key-here'

The app will then read the key from your local environment and open the content analyzer section!
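
How the app loads the key is not shown in this README, so the following is only a sketch of one way a .env file in the format above could be read using the standard library. The helper names load_env_file and get_openai_key are hypothetical and are not taken from the app's code.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY = 'value' lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_key(path=".env"):
    """Prefer a key already set in the environment, else fall back to the file."""
    return os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
```

Calling get_openai_key() from the app directory returns the key if it is available, or None otherwise, which an app could use to decide whether to show the content analyzer section.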

Live App

Some of the interactive functionality, like exploring the data and fitting different models, is available in a streamlit app. Unfortunately, the content analyzer can only run locally for the time being due to the OpenAI API key restrictions.
