

This project is a fork of unitedstates/inspectors-general.


Collecting reports from Inspectors General across the federal government.

Home Page: http://oversight.io

License: Creative Commons Zero v1.0 Universal


Inspectors General

A project to collect reports from the offices of Inspectors General across the federal government.

Done so far:

Thanks to Matt Rumsey for compiling a spreadsheet of IG offices. The left-hand column tracks who's completed which IG. A yellow highlight means it's a high priority.

Eric Mill has written a piece explaining what the project is and why it's being done. From it:

What's an inspector general?

Just about every agency in the federal government has an independent unit, usually called the Office of the Inspector General, dedicated to independent oversight. This includes regular audits of the agency's spending, monitoring of active government contractors and investigations into wasteful or corrupt agency practices. They ask tough questions, carry guns and sue people.

Scraping IG reports

Python 3: This project uses Python 3, and is tested on Python 3.4.0. If you don't have Python 3 installed, check out pyenv and pyenv-virtualenvwrapper for easily installing and switching between multiple versions of Python.

Dependencies: You'll need to have pdftotext installed. On Ubuntu, apt-get install poppler-utils. On OS X, install it via MacPorts with port install poppler, or via Homebrew with brew install poppler.
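
A quick way to confirm the dependency is available before scraping (a minimal sketch using only Python's standard library):

import shutil

# pdftotext ships with poppler; the scrapers rely on it for text extraction.
if shutil.which("pdftotext") is None:
    raise SystemExit("pdftotext not found -- install poppler first")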

To run an individual IG scraper, just execute its file directly. For example:

./inspectors/usps.py

This will fetch the current year's reports from the Inspector General for the US Postal Service and write them to disk, along with JSON metadata.

If you want to go back further, use --year to fetch a single year, or --since to fetch from a given year onwards:

./inspectors/usps.py --since=2009

If you want to run multiple IG scrapers in a row, use the igs script:

./igs

By default, the igs script runs all scrapers. It takes the following arguments:

  • --safe: Limit scrapers to those declared in safe.yml. The idea is that "safe" scrapers can be run by clients who wish to fully automate their report pipeline, stably and without human intervention when new IGs are added.
  • --only: Limit scrapers to a comma-separated list of names. For example, --only=opm,epa will run inspectors/opm.py and inspectors/epa.py in turn.

Using the data

Reports are broken up by IG, and by year. So a USPS IG report from 2013 with a scraper-determined ID of no-ar-13-010 will create the following files:

/data/usps/2013/no-ar-13-010/report.json
/data/usps/2013/no-ar-13-010/report.pdf
/data/usps/2013/no-ar-13-010/report.txt

Metadata for a report is at report.json. The original report will be saved at report.pdf (the extension matches the original file's, so it may not be .pdf). The text from the report will be extracted to report.txt.
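
For example, reading a saved report's metadata back requires only the standard library (a minimal sketch, using the USPS report above and a path relative to the project root):

import json

# Follows the data/<inspector>/<year>/<report_id>/ layout described above.
with open("data/usps/2013/no-ar-13-010/report.json") as f:
    metadata = json.load(f)

print(metadata["title"], metadata["published_on"])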

Common options

Every scraper will accept the following options:

  • --year: A YYYY year, only fetch reports from this year.
  • --since: A YYYY year, only fetch reports from this year onwards.
  • --debug: Print extra output to STDOUT. (Can be quite verbose when downloading.)

Contributing a Scraper

The easiest way is to start by copying scraper.py.template to inspectors/[inspector].py, where "[inspector]" is the filename-friendly handle of the IG office you want to scrape. For example, our scraper for the US Postal Service's IG is usps.py.

The template has a suggested workflow and set of methods, but all your task needs to do is:

  • start execution in a run(options) method, and
  • call inspector.save_report(report) for every report

This will automatically save reports to disk in the right place, extract text, and avoid re-downloading files. options will be a dict parsed from any included command line flags.

You will also need this line at the bottom:

utils.run(run) if (__name__ == "__main__") else None

It's encouraged to use inspector.year_range(options) to obtain the range of desired years, and to obey that range during scraping; a minimal sketch of a scraper putting these pieces together follows.
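
This sketch assumes only the interface described above (utils.run, inspector.save_report, inspector.year_range) and an import layout of from utils import utils, inspector; fetch_reports is a hypothetical placeholder for your IG-specific fetching and parsing:

#!/usr/bin/env python
from utils import utils, inspector

def fetch_reports(year):
    # Hypothetical placeholder: replace with real fetching and parsing of
    # the IG's website. Should return an iterable of report dicts (see
    # "Report metadata" below).
    return []

def run(options):
    year_range = inspector.year_range(options)
    for year in year_range:
        for report in fetch_reports(year):
            inspector.save_report(report)

utils.run(run) if (__name__ == "__main__") else None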

Scrapers are welcome to use any command line flags they want, except those used by the igs runner. Currently, that's --safe and --only.

Report metadata

The report object must be a dict that contains the following required fields:

  • inspector - The handle you chose for the IG. e.g. "usps"
  • inspector_url - The IG's primary website URL.
  • agency - The handle of the agency the report relates to. This can be the same value as inspector, but it may differ -- some IGs monitor multiple agencies.
  • agency_name - The full text name of an agency, e.g. "United States Postal Service"
  • report_id - A string usable as an ID for the report.
  • title - Title of report.
  • url - Link to report.
  • published_on - Date of publication, in YYYY-MM-DD format.

You can include any other fields you think worth keeping.

The report_id only needs to be unique within that IG, so you can make it up from other fields. It does need to come out the same every time you run the script. In other words, don't auto-increment a number -- if the IG doesn't give you a unique ID already, append other fields together into a consistent, unique ID.
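For instance, a sketch of assembling the required fields, with a deterministic report_id derived from stable fields (all values and the slug helper here are illustrative, not real report data):

import re

def slug(text):
    # Reduce a string to a lowercase, filename-friendly slug.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

title = "Postal Vehicle Maintenance Audit"  # illustrative title
published_on = "2013-05-02"

report = {
    "inspector": "usps",
    "inspector_url": "https://www.uspsoig.gov/",
    "agency": "usps",
    "agency_name": "United States Postal Service",
    # No ID supplied by the IG, so derive one from stable fields --
    # the same inputs always produce the same report_id.
    "report_id": "%s-%s" % (published_on, slug(title)),
    "title": title,
    "url": "https://www.uspsoig.gov/reports/example.pdf",  # illustrative
    "published_on": published_on,
}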

Finally, err towards errors: have your scraper choke and die on unexpected input. Better to be forced to discover it that way than for incomplete or inaccurate data to be silently saved.

Public domain

This project is dedicated to the public domain. As spelled out in CONTRIBUTING:

The project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.

