
agmadera / nsa-stories


This project forked from lifewinning/nsa-stories


Every document published from the Snowden archive (updated regularly)

HTML 26.75% CSS 29.34% JavaScript 41.73% Python 2.19%

nsa-stories's Introduction

This repository, maintained by Josh Begley in collaboration with Margot Williams, contains a simple index of every document published from the Snowden archive.

I thought it might be fun to learn some simple scraping by adding contextual information to this index, so I wrote a few really basic scripts to do that.

meta-tags.py adds the article description and the search tags that were attached to each article when it was originally published.
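As a rough illustration of that kind of scraping, here is a minimal sketch that pulls a description and keyword tags out of a page's `<meta>` elements using only the standard library. It is not the actual contents of meta-tags.py; the names `MetaTagParser` and `extract_meta` are hypothetical, and it assumes the article exposes `description` (or `og:description`) and `keywords` meta tags.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name=...> / <meta property=...> content values."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        content = attrs.get("content")
        if key and content is not None:
            self.meta[key.lower()] = content.strip()


def extract_meta(html):
    """Return the description and a list of tags found in an HTML string."""
    parser = MetaTagParser()
    parser.feed(html)
    # Assumption: fall back to the Open Graph description if there is no plain one.
    description = parser.meta.get("description") or parser.meta.get("og:description", "")
    keywords = parser.meta.get("keywords", "")
    tags = [t.strip() for t in keywords.split(",") if t.strip()]
    return {"description": description, "tags": tags}
```

In practice the HTML string would come from fetching each article URL listed in the index; the fetching step is left out here so the sketch stays self-contained.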

pull-pdfs.py does the same thing as meta-tags.py, and also goes through the articles in nsa-archive.csv looking for links to PDFs (or links to pages with documents hosted on The Intercept's documents page). Still need to figure out how to include links that point only to DocumentCloud pages.
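The PDF-hunting step might look something like the sketch below, which collects every `<a href>` in an article's HTML and keeps the ones whose path ends in `.pdf`. This is an assumption about the approach, not the real pull-pdfs.py; `LinkParser` and `find_pdf_links` are made-up names, and links to DocumentCloud pages (which do not end in `.pdf`) would still be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_pdf_links(html, base_url):
    """Return absolute URLs of links that point at PDF files, in page order."""
    parser = LinkParser()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        # Check the path only, so query strings like ?x=1 do not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in found:
            found.append(absolute)
    return found
```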

intercept-only.py and add-titles.py also do the same thing, but only for articles on The Intercept. This is actually mostly why I started working with this data: I thought The Intercept's documents page was really sparse and wanted to add some contextual information about the original documents without writing up a summary for every single one. So each document gets a link to the original article and a short summary of that article alongside it. (Note that my index is missing 36 documents currently on The Intercept's documents page; this is presumably because the articles they came from aren't cited in nsa-archive.csv.)

Still need to change the scripts to deduplicate the outputs of meta-tags.py and pull-pdfs.py.
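One possible way to handle the deduplication, sketched under the assumption that both scripts produce rows (dicts) sharing a column that identifies the article. The column name `url` and the function `merge_rows` are hypothetical; the real outputs may be keyed differently.

```python
def merge_rows(*row_sets, key="url"):
    """Merge several lists of row dicts into one row per key value.

    The first non-empty value seen for a field wins; later rows only
    fill in fields that are still missing.
    """
    merged = {}
    for rows in row_sets:
        for row in rows:
            k = row.get(key, "").strip()
            if not k:
                continue  # skip rows that cannot be matched to an article
            target = merged.setdefault(k, {})
            for field, value in row.items():
                if value and not target.get(field):
                    target[field] = value
    return list(merged.values())
```

For example, `merge_rows(meta_rows, pdf_rows)` would return one combined row per article, carrying the description and tags from one script and the PDF link from the other.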

EFF and ACLU have done the bulk of this work to date. Let Josh know if you notice something that's missing.
