This repository contains code and documentation for studying bias in data, using Wikipedia articles on politicians from different countries. Supplemental data comes from the Population Reference Bureau's (PRB) country population data, with population figures as of mid-2015, and from estimated article quality scores produced by a publicly available machine learning system called ORES. Provenance, copyright, and licensing information is provided in this README file.
NOTE: The code and text of this repository are governed by the MIT License included in the LICENSE file.
The Wikipedia article data is available under Creative Commons BY. The ORES scores data is available under Creative Commons 0. The PRB data is copyrighted, so it is not included in this repo. You can download it directly from the source listed in the Data Provenance section below.
For this work, data on Wikipedia articles about politicians from different countries is obtained from figshare, and a score for each article's referenced revision is obtained using the ORES API.
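As a rough illustration of the scoring step, the sketch below builds a batched ORES request and pulls the predicted quality class out of the response. The endpoint URL, the `enwiki` context, the `wp10` model name, and the response layout are assumptions based on the ORES v3 API and should be checked against current Wikimedia documentation. The function names are hypothetical and are not taken from this repository's code.

```python
import json
import urllib.parse
import urllib.request

# Assumed ORES v3 endpoint; verify against current Wikimedia documentation.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}/"


def build_ores_url(rev_ids, context="enwiki", model="wp10"):
    """Build a scoring URL for a batch of revision IDs."""
    query = urllib.parse.urlencode(
        {"models": model, "revids": "|".join(str(r) for r in rev_ids)}
    )
    return ORES_ENDPOINT.format(context=context) + "?" + query


def extract_predictions(payload, context="enwiki", model="wp10"):
    """Map each revision ID to its predicted class, or None if scoring failed."""
    scores = payload.get(context, {}).get("scores", {})
    predictions = {}
    for rev_id, models in scores.items():
        # Assumed layout: a "score" entry on success, an "error" entry otherwise.
        score = models.get(model, {}).get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


def fetch_predictions(rev_ids, context="enwiki", model="wp10"):
    """Request scores over HTTP and return {rev_id: prediction}."""
    with urllib.request.urlopen(build_ores_url(rev_ids, context, model)) as resp:
        payload = json.load(resp)
    return extract_predictions(payload, context, model)
```

Sending revision IDs in batches rather than one per request keeps the number of HTTP calls manageable. Revisions that come back without a score are kept as `None` so they can be filtered out or logged later.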
In addition, population information for various countries as of mid-2015 is obtained from the Population Reference Bureau (PRB) website.
The end-to-end process of this analysis work is broken down into three steps:
- Data Acquisition
- Data Processing
- Analysis and Visualization
Readers who want to reproduce the data acquisition step will need to download the Wikipedia article data and the country population data and place them in the Raw Data directory after cloning this repository.
The source code and data (raw and processed) are organized as follows:
The source code is located in this GitHub repo's source code file, which contains documentation and code for all three steps mentioned above.
- Raw Data contains two CSV files: one with the Wikipedia article data on politicians from different countries, and the other with the ORES score for the final revision of each of these articles.
- Processed Data is a CSV file that combines the article, population, and ORES score data.
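To make the combined file more concrete, here is a minimal sketch of how article rows might be joined with population and ORES data. The field names (`country`, `rev_id`, `population`, `article_quality`) and the helper `combine_records` are hypothetical, and dropping unmatched rows is an assumption. The repository's actual processing code may handle them differently.

```python
def combine_records(articles, populations, predictions):
    """Join article rows with country population and ORES prediction.

    articles:    list of dicts with at least "country" and "rev_id" keys
    populations: dict mapping country name to population
    predictions: dict mapping revision ID to predicted quality class

    Rows with no matching population or no prediction are skipped
    (an inner join); this is an assumption made for illustration.
    """
    combined = []
    for row in articles:
        population = populations.get(row["country"])
        quality = predictions.get(row["rev_id"])
        if population is None or quality is None:
            continue
        combined.append(
            {**row, "population": population, "article_quality": quality}
        )
    return combined
```

The resulting list of dicts can be written out with `csv.DictWriter` from the standard library to produce a single processed CSV file.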