New Data: Nominations (congress, CLOSED)

unitedstates commented on July 28, 2024
New Data: Nominations

from congress.

Comments (13)

wilson428 commented on July 28, 2024

I can take a stab here unless @dwillis wants to go first. My original python scripts submitted POST requests by congress and crawled the HTML comments on the results pages for the nominations, which contain the most structured information. Do you have a recommendation for where in this project to start? Can I piggyback on any existing tasks?
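As a rough illustration of the "crawl the HTML comments" approach, here is a minimal sketch in Python. The sample markup and the `extract_comments` helper are assumptions made up for this example; they are not the original scripts and not real THOMAS output.

```python
import re

# Matches <!-- ... --> across lines; non-greedy so adjacent comments stay separate.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_comments(html):
    """Return the stripped text of every HTML comment found in the page."""
    return [text.strip() for text in COMMENT_RE.findall(html)]


# Hypothetical sample page, invented for illustration only.
sample = "<p>Nominee</p><!-- nominee: Jane Doe --><!-- status: confirmed -->"
comments = extract_comments(sample)
```

A regular expression is enough for a sketch like this; a real scraper would likely need to cope with malformed markup as well.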

dwillis commented on July 28, 2024

Have at it; I can take a look in a few days, I bet. Start with the bill scraper for reference.

konklone commented on July 28, 2024

A quick summary of how the bill stuff works: the bill scraper is divided into two parts, bills.py and bill_info.py. bills.py takes care of paginating through lists (figuring out which IDs to go fetch details for), and then makes repeated calls to bill_info.py for details on the individual bills it identified.

bills.py makes use of a little processing function we put into utils, utils.process_set, which takes a set of IDs (bill IDs) and a function to call for each one (bill_info.fetch_bill). It expects each call to that function to return a small dict with a couple of keys (like 'ok'), and when it's done it produces a report of how many it processed.
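To make that pattern concrete, here is a simplified sketch of a process_set-style helper. It is not the project's actual utils.process_set: the name `process_set_sketch`, the report format, and the stub fetch function are all assumptions for illustration.

```python
def process_set_sketch(ids, fetch_func, options):
    """Call fetch_func for each ID and tally the results.

    Hypothetical stand-in for illustration, not the real utils.process_set.
    """
    ok, failed = [], []
    for item_id in ids:
        # Each call is expected to return a small dict with an 'ok' key.
        result = fetch_func(item_id, options)
        if result.get("ok"):
            ok.append(item_id)
        else:
            failed.append(item_id)
    print("Processed %d: %d ok, %d failed" % (len(ids), len(ok), len(failed)))
    return {"ok": ok, "failed": failed}


def stub_fetch(item_id, options):
    """Stub fetcher: pretends only House bills succeed."""
    return {"ok": item_id.startswith("hr")}


report = process_set_sketch(["hr1-113", "s2-113", "hr3-113"], stub_fetch, {})
```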

bills.py isn't called directly. It offers a run method that accepts an options dict, and the "run" script calls that method on whichever task you name. So you ./run bills to call bills.py's run method, with the options dict built from the command line flags.
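A minimal sketch of that dispatch idea, assuming flags of the form `--key=value` or bare `--key`. The `parse_flags` and `dispatch` names are hypothetical, and the real "run" script may convert flag values differently.

```python
import importlib


def parse_flags(args):
    """Turn ['--congress=109', '--raise'] into {'congress': '109', 'raise': True}."""
    options = {}
    for arg in args:
        if not arg.startswith("--"):
            continue
        key, sep, value = arg[2:].partition("=")
        # A bare flag such as --raise becomes True; values stay strings.
        options[key] = value if sep else True
    return options


def dispatch(argv, load=importlib.import_module):
    """argv looks like ['bills', '--congress=113']; call that module's run(options)."""
    task_name, flags = argv[0], argv[1:]
    module = load(task_name)
    return module.run(parse_flags(flags))


class _FakeTask:
    """Stand-in for a task module so the sketch runs without the project."""

    @staticmethod
    def run(options):
        return options


result = dispatch(["bills", "--congress=113", "--raise"], load=lambda name: _FakeTask)
```

The `load` parameter exists only so the example can run without a real bills module on the path.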

So, my advice - you could start off by doing a nomination_info.py with its own run method that takes an ID (e.g. "pn67-113" for PN67 of the 113th Congress), and write that script to go fetch details for the given nomination. Once you feel good about how that works, make a nominations.py that probably is mostly a copy of bills.py with some things changed, which uses nomination_info as its workhorse for each nomination it discovers.
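For the ID format mentioned above, a small sketch of splitting "pn67-113" into its parts might look like this. The `split_nomination_id` helper is hypothetical and only handles the simple form shown in the example; split or joint nominations may need a different pattern.

```python
import re

# Simple form only: "pn" + nomination number + "-" + congress number.
NOMINATION_ID_RE = re.compile(r"^pn(\d+)-(\d+)$", re.IGNORECASE)


def split_nomination_id(nomination_id):
    """Return (number, congress) as strings, or raise on anything unexpected."""
    match = NOMINATION_ID_RE.match(nomination_id)
    if not match:
        raise ValueError("Unrecognized nomination ID: %r" % nomination_id)
    number, congress = match.groups()
    return number, congress


number, congress = split_nomination_id("pn67-113")
```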

The nominations pages don't look like they have a whole lot of metadata, so this is a great place to contribute, I think. Happy to help with anything as it comes up.

wilson428 commented on July 28, 2024

Thanks for the rundown, that's really helpful. Just committed a first go at parsing the nomination pages. More soon.

wilson428 commented on July 28, 2024

I've got nominations.py working and fetching nominations. Needs work catching joint nominations and split nominations, so I don't think I should add to README yet, but testable here.

Currently only fetching civilian nominations. Military nominations appear much more rote, and much more of a pain since you'll get 800 people nominated in one swoop, so I'm inclined not to include them.

 ./run nominations --congress=109 --limit=10

JoshData commented on July 28, 2024

This is a great start!

konklone commented on July 28, 2024

This is a super great start. I'm traveling this weekend and can't give it much real testing time, but just looking over the commits, a couple thoughts -

  • Choke on any invalid or unexpected input and crash the script by raising an Exception. No better way to keep us attentive to bugs THOMAS introduces. utils.process_set, once you use it, will auto-catch exceptions, note them at the end, and email them if you've got credentials in config.yml. If you pass --raise, it'll let it crash the script.
  • Hopefully, the little test fetch_nomination calls at the bottom could be rendered moot by using the --nomination_id flag? Its only purpose is to aid in development that way and reduce the chances of random test lines getting committed uncommented.
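The "choke, but let the runner catch it" behavior from the first bullet could be sketched roughly like this. The `run_each` and `strict_fetch` names are assumptions, and the real utils.process_set also reports and emails errors, which this sketch leaves out.

```python
def run_each(ids, fetch_func, options):
    """Run fetch_func per ID, collecting exceptions unless options['raise'] is set."""
    errors = []
    for item_id in ids:
        try:
            fetch_func(item_id, options)
        except Exception as exc:
            if options.get("raise"):
                # Mirrors the --raise idea: let the error crash the script.
                raise
            errors.append((item_id, str(exc)))
    return errors


def strict_fetch(item_id, options):
    """Hypothetical fetcher that refuses anything it doesn't recognize."""
    if not item_id.startswith("pn"):
        raise Exception("Unexpected nomination ID: %s" % item_id)
    return {"ok": True}


errors = run_each(["pn1-113", "bogus"], strict_fetch, {})
```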

I'm so happy this is happening - I've been wanting to use this data for a while.

I wonder how matchable-up this is with the Plum Book?

wilson428 commented on July 28, 2024

Thank you! I definitely want to match with the Plum Book so that one can filter by pay grade. Also going to add an "is_cabinet" variable and so forth in order to pare down the thousands of nominees into the most important ones. Need to do some stuff for my day job today but can probably work in the --nomination_id flag and a few other things.

wilson428 commented on July 28, 2024

Made a little progress with converting text from nomination pages into fielded data, plus some error catching as Eric suggested. Still more to do, like fielding the result of the nomination and calculating the number of days elapsed between nomination and conclusion.
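The elapsed-days calculation is straightforward once both dates are parsed. A minimal sketch, assuming the dates are available as ISO "YYYY-MM-DD" strings (the field format and the `days_elapsed` helper are assumptions, not what the scraper actually emits):

```python
from datetime import datetime


def days_elapsed(received, concluded):
    """Days between two ISO-format date strings (an assumed format)."""
    start = datetime.strptime(received, "%Y-%m-%d").date()
    end = datetime.strptime(concluded, "%Y-%m-%d").date()
    return (end - start).days


# Hypothetical dates, chosen only to exercise the function.
elapsed = days_elapsed("2013-01-03", "2013-02-03")
```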

konklone commented on July 28, 2024

Awesome! And thinking more on the Plum Book thing - you can probably take an approach similar to what we did with legislator/committee info. The project will just always clone the newest version of unitedstates/congress-legislators in order to do the ID crosswalking that the govtrack XML output requires. You could do the same with the Plum Book data from its own repo, if you wanted to have that data available to you in the nominations script.

On Sun, Feb 3, 2013 at 2:11 PM, Chris Wilson wrote:

> Made a little progress with converting text from nomination pages into fielded data, plus some error catching as Eric suggested. Still more to do, like fielding the result of the nomination and calculating the number of days elapsed between nomination and conclusion.


konklone commented on July 28, 2024

I just put this scraper through a ton of work, and it now produces reliable data on single nominations and batch military nominations, normalizes committee names, and is set up to choke on any unexpected data. I've tested it from the 111th onwards, and am in the process of downloading nomination data for earlier Congresses to fix any bugs on older stuff (nomination pages aren't yet in the cache directories I have).

I'll update the README to include it as one of the major things you can get with this project.

@wilson428, thank you so much for starting this -- I am so glad I did not have to solve the awful parsing problems, HTML comment extraction, URL construction, and session mgmt + POSTing stuff. THOMAS' pages for nominations are way worse than for bills, but I feel good about this data now.

wilson428 commented on July 28, 2024

Accidentally commented before with the wrong GitHub identity, but happy to do it. Looks like a huge improvement you made. Thx!

konklone commented on July 28, 2024

For anyone using the nominations scraper, you'll want to do an update - I just patched it to drop the automatic caching of the search results page, and I fixed up the POST request it makes to include the same max range that THOMAS uses (5000) and to include military nominations.
