This project forked from yinleon/localnewsdataset

The documentation and scripts for the Local News Dataset

License: MIT License

Local News Dataset 2018

By Leon Yin
On 2018-08-14

Introduction

This dataset is a machine-readable directory of state-level newspapers, TV stations and magazines. In addition to basic information such as the name of the outlet and the state it is located in, all available information about each outlet's web presence, social media accounts (Twitter, YouTube, Facebook) and owner is scraped, too.

The sources of this dataset are usnpl.com (newspapers and magazines by state), stationindex.com (TV stations by state and by owner), and the homepages of the media corporations Meredith, Sinclair, Nexstar, Tribune and Hearst.

This dataset was inspired by ProPublica's Congress API. I hope that this dataset will serve a similar purpose as a starting point for research and applications, as well as a bridge between datasets from social media, news articles and online communities.

While you use this dataset, if you see irregularities, questionable entries, or missing outlets, please submit an issue on GitHub or contact me on Twitter. I'd love to hear how this dataset is put to work.

Happy hunting

For an in-depth introduction, specs, data sheet, and quickstart, check out the Jupyter Notebook in nbs/local_news_dataset.ipynb.

What does the data look like?

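|   | name | state | website | domain | twitter | youtube | facebook | owner | medium | source | collection_date |
|---|------|-------|---------|--------|---------|---------|----------|-------|--------|--------|-----------------|
| 0 | KWHE | HI | http://www.kwhe.com/ | kwhe.com | NaN | NaN | NaN | LeSea | TV station | stationindex | 2018-08-02 14:55:24.612585 |
| 1 | WGVK | MI | http://www.wgvu.org/ | wgvu.org | NaN | NaN | NaN | Grand Valley State University | TV station | stationindex | 2018-08-02 14:55:24.612585 |
| 2 | KNIC-CD | TX | NaN | NaN | NaN | NaN | NaN | Univision | TV station | stationindex | 2018-08-02 14:55:24.612585 |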

You can also browse the dataset on Google Sheets,
or look at the raw dataset on GitHub,
or just browse the Jupyter Notebook's tech specs.

How is this Repo Organized?

The nbs directory has examples of how to use this dataset. The dataset was created in Python. The scripts to re-create and update the dataset are in the py directory. In addition to the state and name of each media outlet, I also collect their web domain and social (Twitter, Facebook, YouTube) IDs where available.

Methodology

Several websites are scraped using the requests and beautifulsoup Python packages. The column names from each source are then normalized, and the results are merged.
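The normalize-and-merge step can be sketched roughly as follows. This is only an illustration and not the code in the py directory: the raw column names, the `COLUMN_MAP` mapping, the `normalize` helper and the example rows are all hypothetical, and `SCHEMA` is a subset of the dataset's real columns. It uses only the standard library and skips the scraping itself.

```python
# Hypothetical raw rows, shaped the way two different scrapers might emit them.
usnpl_rows = [
    {"Name": "Example Gazette", "State": "TX", "Website": "http://www.example.com/"},
]
stationindex_rows = [
    {"station": "KXXX", "st": "TX", "url": "http://www.kxxx.example/", "Owner": "Example Media"},
]

# Assumed mapping from each source's raw column names to the shared names.
COLUMN_MAP = {
    "usnpl": {"Name": "name", "State": "state", "Website": "website"},
    "stationindex": {"station": "name", "st": "state", "url": "website", "Owner": "owner"},
}

# A subset of the dataset's columns, enough to show the idea.
SCHEMA = ["name", "state", "website", "owner", "source"]


def normalize(rows, source):
    """Rename each row's keys to the shared schema and tag it with its source."""
    mapping = COLUMN_MAP[source]
    normalized = []
    for row in rows:
        record = dict.fromkeys(SCHEMA)  # fields a source lacks stay None
        for raw_key, value in row.items():
            if raw_key in mapping:
                record[mapping[raw_key]] = value
        record["source"] = source
        normalized.append(record)
    return normalized


# Once every source shares the same columns, merging is a concatenation.
merged = normalize(usnpl_rows, "usnpl") + normalize(stationindex_rows, "stationindex")
```

Fields that a source does not provide are left as None here, which plays the same role as the NaN values in the dataset.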

Gotchas

There can be several entries with the same domain.
Why? Certain city-level publications are subdomains of larger state-level sites. There is a preprocessed version for domain-level analysis here: https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018_for_domain_analysis.csv
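To see which domains are shared, one option is to group outlet names by the domain column. The sketch below is a minimal illustration with standard-library Python. The first two rows are made-up examples, the last two come from the sample rows shown earlier, and it does not reproduce how the preprocessed file was built.

```python
from collections import defaultdict

# Rows reduced to the columns that matter here. The "Ledger" outlets and the
# example-news.com domain are hypothetical; None stands in for NaN.
rows = [
    {"name": "State Ledger", "website": "http://www.example-news.com/", "domain": "example-news.com"},
    {"name": "Riverton Ledger", "website": "http://riverton.example-news.com/", "domain": "example-news.com"},
    {"name": "KWHE", "website": "http://www.kwhe.com/", "domain": "kwhe.com"},
    {"name": "KNIC-CD", "website": None, "domain": None},
]

# Collect the outlet names seen for each domain.
names_by_domain = defaultdict(list)
for row in rows:
    if row["domain"]:  # skip outlets with no known domain
        names_by_domain[row["domain"]].append(row["name"])

# Keep only the domains that more than one outlet shares.
shared = {domain: names for domain, names in names_by_domain.items() if len(names) > 1}
```

Any analysis that counts or joins on domain should decide up front whether such shared domains are treated as one outlet or several.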

Using the Dataset

The dataset can be downloaded from the raw GitHub file in a web browser, or from the command line:

wget https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv

The dataset can also be loaded directly into a Pandas DataFrame.

import pandas as pd

url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'
df_local_news = pd.read_csv(url)

Acknowledgements

I'd like to acknowledge the work of the people behind usnpl.com and stationindex.com for compiling lists of local media outlets. Andreu Casas and Gregory Eady provided invaluable comments to improve this dataset for public release. Leon Yin is a member of the SMaPP Lab at NYU. Thank you to Josh Tucker, Jonathan Nagler, Richard Bonneau and my colleague Nicole Baram.

Citation

If this dataset is helpful to you, please cite it as:

@misc{leon_yin_2018_1345145,
  author       = {Leon Yin},
  title        = {Local News Dataset},
  month        = aug,
  year         = 2018,
  doi          = {10.5281/zenodo.1345145},
  url          = {https://doi.org/10.5281/zenodo.1345145}
}
