openstates / openstates-scrapers

source for Open States scrapers

Home Page: https://openstates.org

License: GNU General Public License v3.0

Python 99.76% Shell 0.16% Dockerfile 0.07%
python government scrapers states united-states hacktoberfest

openstates-scrapers's People

Contributors

bfossen-ce, brandonlewis, braykuka, chrisyamas, colbyreed, csnardi, divergentdave, ehtishamsabir, estaub, hiteshgarg14, in-vincible, jamesturk, jessemortenson, jmcarp, johnseekins, judgejudes, linzjax, markolson, mattgrayson, mikejs, mileswwatkins, newageairbender, paultag, rshorey, schneidy, shivansh-bajaj, showerst, sroomf, tamilyn, twneale

openstates-scrapers's Issues

TX (and others?): Missing committee_id in legislator roles

In some states, when referring to committee positions, legislator roles have the corresponding committee_id included along with the committee name. However, this doesn't appear to be the case in Texas. I'm still looking to see if it's an issue elsewhere.

Is this difficult to standardize? For StatesLege, I would like to use this ID as a very direct route to drill down to a committee's details and its membership from a legislator's list of committee positions (roles)...

Basically, I'd like the capability to treat legislator "roles" as a rough inverse of the committee "members" (where "leg_id" is always present). Another option would be to break the legislator's committee positions out into their own array, since "roles" isn't exclusively committee positions.

Feature Requests for Event API (more standardization) (gcode #222)

Are all times still encoded as UTC or are they local times now? I was looking at Utah and I thought I saw some meeting times were 3 or 4am once I localized them ...

Can we set the "link" field to use the first sources.url if there isn't one set already?

For states besides TX, can we scrape the meeting announcement text into notes wherever possible?

Ditto for scraping committee and chamber strings for the "participants" wherever possible (like in TX).

A big huge request would be to standardize on some kind of "title" for the event. The "description" field varies greatly from state to state. Some include the type, date, and time in the string, others just have the title. Would be great if this title had normalized capitalization (I'm staring at you, Kentucky!).

Basically, I'm trying to build a consistent, concise summary text for each event to display in a limited amount of space. Ideally, I could get a lot of this for free from the API with a little more standardization on either the "description" field or the addition of a new "title" or "summary" field ...

"(S) Business & Commerce\n When: 7/28/11 - 10:00 AM\n Location: E1.016 (Hearing Room)"
"(H) Higher Education\n When: 7/30/11 - 9:00 AM\n Location: E2.543 - (CANCELED)"

Alternatively, a "title" field could simply be "Business & Commerce" and I can cook up the rest from the other data you've inserted into the event dictionary.

Greg


http://code.google.com/p/openstates/issues/detail?id=222
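
The link fallback and the summary text requested above could be handled client-side along these lines. This is only a sketch: the "title", "when", and "location" keys are assumptions made for illustration (a "title" field is exactly what the request asks for and may not exist), and the URL is a placeholder.

```python
def event_link(event):
    """Return the event's link, falling back to the first source URL."""
    if event.get("link"):
        return event["link"]
    sources = event.get("sources") or []
    return sources[0]["url"] if sources else None


def event_summary(event):
    """Build a short display string from separate fields.

    Prefers the hypothetical "title" field and falls back to "description".
    """
    parts = [event.get("title") or event.get("description", "")]
    if event.get("when"):
        parts.append("When: " + event["when"])
    if event.get("location"):
        parts.append("Location: " + event["location"])
    return "\n".join(p for p in parts if p)


# Illustrative event; field names and URL are assumptions, not API output.
example = {
    "title": "Business & Commerce",
    "when": "7/28/11 - 10:00 AM",
    "location": "E1.016 (Hearing Room)",
    "sources": [{"url": "http://example.com/hearing"}],
}
```

With a plain "title" available, the rest of the summary can be assembled from whatever other fields a state provides, which is the point of the request.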

A bug searching for bills with a session ID that has spaces (gcode #224)

So I'm working on making the StatesLege bill tracking multi-state / multi-session aware. But California is giving me hell. Can't that legislature figure out that I want session IDs that are succinct?

http://openstates.sunlightlabs.com/api/v1/bills/?q=budget&search_window=session:20092010&state=ca

Works great.

http://openstates.sunlightlabs.com/api/v1/bills/?q=budget&search_window=session:20092010%20Special%20Session%201&state=ca

Doesn't work at all, even though there are actually budget bills in that special session ... remove the session filter altogether and you'll find a few that should have matched up ... here's one for example.

{
    "title": "An act relating to the Budget Act of 2008.", 
    "created_at": "2010-07-09 17:28:38", 
    "updated_at": "2011-01-04 16:58:01", 
    "chamber": "lower", 
    "state": "ca", 
    "session": "20092010 Special Session 1", 
    "type": [
        "bill"
    ], 
    "subjects": [], 
    "bill_id": "AB 11"
}, 

http://code.google.com/p/openstates/issues/detail?id=224
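
To rule out client-side encoding as the cause, the failing URL can be rebuilt with the standard library. This is a sketch only; the helper name is made up for illustration, and it says nothing about how the server parses the session value.

```python
from urllib.parse import quote, urlencode

BASE = "http://openstates.sunlightlabs.com/api/v1/bills/"


def bill_search_url(query, state, session):
    """Build a bill search URL with the session passed via search_window."""
    params = {
        "q": query,
        "search_window": "session:" + session,
        "state": state,
    }
    # quote_via=quote encodes spaces as %20 instead of "+";
    # safe=":" keeps the colon literal, as in the URLs above.
    return BASE + "?" + urlencode(params, safe=":", quote_via=quote)


url = bill_search_url("budget", "ca", "20092010 Special Session 1")
```

The result is identical to the second URL above, so the spaces are being percent-encoded correctly on the way out.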

Request - Staffer Scrapers/Data/API

If Congress can have it, maybe states can too? I'd like to see if we can start scraping for legislative staffers (names, titles, and any contact information).

I anticipate that this may be running too far down the rabbit hole, but perhaps in some states this is trivial to collect.

In the past I have collected staffer info for Texas using the House Research Organization's periodically published staffer directory:

http://www.hro.house.state.tx.us/pdf/focus/staff82.pdf

I know there are several [barf] commercial options out there that have more detailed information. In Texas, these include Telicon and the Texas State Directory.

Right now, I've built my own database of Texas staffers (using the above) ... it has names, titles, phone numbers, email addresses, and the legislator ID they work for.

Missing committee IDs in legislator committee positions

While looking at some legislators in Texas, I see committee positions/roles for them, but the committee ID attribute is missing.

{
  "last_name": "Anderson",
  "updated_at": "2011-08-30 01:59:35",
  "sources": [
    {
      "url": "http://www.legdir.legis.state.tx.us/MemberInfo.aspx?Chamber=H&Code=A2215"
    }
  ],
  "full_name": "Rodney Anderson",
  "old_roles": {},
  "+district_address": " P.O. Box 2910\nAustin, TX 78768",
  "first_name": "Rodney",
  "middle_name": "",
  "district": "106",
  "id": "TXL000370",
  "state": "tx",
  "votesmart_id": "117471",
  "party": "Republican",
  "leg_id": "TXL000370",
  "active": true,
  "transparencydata_id": "bdd473b540274af38046b19a727dc1d1",
  "photo_url": "http://www.legdir.legis.state.tx.us/FlashCardDocs/images/House/small/A2215.jpg",
  "+capital_address": " P.O. Box 2910\nAustin, TX 78768\n(512) 463-0694",
  "roles": [
    {
      "term": "82",
      "end_date": null,
      "district": "106",
      "level": "state",
      "country": "us",
      "chamber": "lower",
      "state": "tx",
      "party": "Republican",
      "type": "member",
      "start_date": null
    },
    {
      "term": "82",
      "end_date": null,
      "level": "state",
      "country": "us",
      "chamber": "lower",
      "state": "tx",
      "committee": "Economic & Small Business Development",
      "position": "member",
      "type": "committee member",
      "start_date": null
    },
    {
      "term": "82",
      "end_date": null,
      "level": "state",
      "country": "us",
      "chamber": "lower",
      "state": "tx",
      "committee": "Land & Resource Management",
      "position": "member",
      "type": "committee member",
      "start_date": null
    }
  ],
  "country": "us",
  "created_at": "2011-01-14 22:24:08",
  "level": "state",
  "chamber": "lower",
  "suffixes": ""
}
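
Until the IDs are present in the data, a client could detect and patch the gap itself. The sketch below assumes a list of committee dicts with "id", "chamber", and "committee" keys; that shape and the "TXC000001" ID are assumptions for illustration, not confirmed API output.

```python
def roles_missing_committee_id(legislator):
    """Return roles that name a committee but carry no committee_id."""
    return [
        role for role in legislator.get("roles", [])
        if role.get("committee") and not role.get("committee_id")
    ]


def fill_committee_ids(legislator, committees):
    """Attach committee_id to roles by matching (chamber, committee name).

    `committees` is an assumed list of dicts with "id", "chamber", "committee".
    """
    index = {(c["chamber"], c["committee"]): c["id"] for c in committees}
    for role in roles_missing_committee_id(legislator):
        key = (role.get("chamber"), role["committee"])
        if key in index:
            role["committee_id"] = index[key]
    return legislator


# Trimmed-down version of the record above, plus a hypothetical committee.
legislator = {
    "leg_id": "TXL000370",
    "roles": [
        {"type": "member", "chamber": "lower", "district": "106"},
        {"type": "committee member", "chamber": "lower",
         "committee": "Land & Resource Management"},
    ],
}
committees = [
    {"id": "TXC000001", "chamber": "lower",
     "committee": "Land & Resource Management"},
]
```

Exact name matching is fragile, which is one more reason to have the scraper emit the ID directly.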

IN is missing roll call votes

Indiana makes its votes available only as scanned PDFs, which currently prevents us from capturing legislator roll calls.

noted on status page

KY is missing roll call votes

Kentucky makes its votes available only as scanned PDFs, which currently prevents us from capturing legislator roll calls.

StatesLege: JSON array of available states, stateID, status(?)

On the initial startup of the app, it will show a list of the available states that the user can select. To do that, I'll need an API interface to a JSON array of dictionaries. I would imagine this is pretty close to the backend data used for the detailed status page you have already.

At minimum, the JSON representation would need to contain the state name and state ID.

Preferably, it would also include some value indicating the state's data status ("ready", "experimental", "unsupported", or some related integer value).

And in an ideal world, it would also have a flag for feature availability - whether or not votes, events, and subjects are available. (Presumably all states will have legislators, committees, and bills)
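
As a rough illustration of the requested shape, the list could look like the structure below. Every key name is an assumption, and the entries are placeholders that do not describe any real state's status.

```python
# Placeholder entries only; they do not describe any state's real status.
STATES = [
    {"id": "aa", "name": "State A", "status": "ready",
     "features": {"votes": True, "events": True, "subjects": False}},
    {"id": "bb", "name": "State B", "status": "unsupported",
     "features": {"votes": False, "events": False, "subjects": False}},
]


def selectable_states(states, allowed=("ready", "experimental")):
    """Return (id, name) pairs for states the app should list, sorted by name."""
    return sorted(
        ((s["id"], s["name"]) for s in states if s["status"] in allowed),
        key=lambda pair: pair[1],
    )
```

A client would fetch this once at startup, show only the selectable states, and use the feature flags to hide tabs a state cannot populate.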

missing votes in TX bills (gcode #165)

gcombs 2/26/11

In Texas, bills that have official recorded votes don't have any votes listed when returned by openstates. Examples are SCR 17 and SCR 12. The House and Senate journal pages are in PDF (pisses me off!) but the votes are there.

mstephens 3/17/11

Some of these cases have been fixed (e.g. SCR 12) while others haven't (e.g. SCR 17). As long as the only way to get votes from TX is via their journals I'm not sure we're going to achieve 100% coverage, but we'll keep looking at it.

gcombs 3/24/11

I don't know if this is part of the problem of missing votes, but it might have to do with committee:passed (if I remember correctly). We've got a test for the Committee reporting favorably without amendments, but perhaps we shouldn't test on the whole string, since they can report favorably as engrossed (with amendments).

gcombs 4/20/11

I have a perl script that I've used for previous sessions that will churn through the whole journal to pull every single record vote taken. I don't remember how good it is at picking up the associated bill ID, though.

To the extent that you might think it's useful to peruse, I'm happy to send it. Obviously it's not python, but it might behoove us to look at a bill's votes from the opposite end of the food chain ... grab all the votes we can, whenever they add a new set of journal pages, and then try to match that up to the bill ID.

It's not as elegant or efficient as looking for one set of votes surrounding a given bill, but we might get more coverage this way.

as reported by Greg Combs @ http://code.google.com/p/openstates/issues/detail?id=165

[StatesLege] Districts and Base64 ... catalog of ideas and requests

First, a little bad-good news about Base64...

  • After some quick math, base64 is only saving us about 10% on bandwidth, since our original starting data's floating precision isn't super-awesome to begin with... once we add the encoding, an 8-byte double costs almost as much as a 9 to 11 digit string. The irony is that if we simply remove the whitespace from the entire coordinates array in JSON, we save about 8%!
  • One of the other objectives of going with base64 would be that I could import the entire binary array into the iPhone map without much post-processing it at all. At this point, however, I'd say that's still not enough to justify keeping base64 around. That isn't to say that I don't appreciate all the work you put into getting us here. I do, but in terms of maintenance and such ... I don't see that many people would have much use for it, given that it's not saving anyone a whole lot.
  • By not using base64, I no longer need you to provide a numberOfPoints value, nor do we need to alter the format of the coordinates dictionary coming from the boundary-service.
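
The size comparison above can be reproduced roughly as follows. This is a sketch with made-up coordinates; actual savings depend on how many digits the real boundary data carries.

```python
import base64
import json
import struct


def encoded_sizes(coords):
    """Compare payload sizes for a flat list of float coordinates."""
    # Pack as little-endian doubles: 8 bytes per value before base64.
    packed = struct.pack("<%dd" % len(coords), *coords)
    return {
        "base64": len(base64.b64encode(packed)),
        "json": len(json.dumps(coords)),
        "json_compact": len(json.dumps(coords, separators=(",", ":"))),
    }


# Illustrative coordinates, not taken from any real district.
sizes = encoded_sizes([32.842977, -96.682154, 32.739680, -96.841401])
```

Base64 expands every 3 bytes into 4 characters, so each 8-byte double costs a bit under 11 characters, which is why it barely beats a 9 to 11 digit decimal string.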

Region Span: something I've mentioned in a previous email, but I'm just summarizing ...

  • If I haven't irritated you enough by frequently changing my mind, ... I liked your centroid addition so much it would be awesome if we could change it just a wee bit to something like this ...
"region": {
    "center": {"lat": 32.842977, "lon": -96.682154},
    "span": {"latDelta": 0.103297, "lonDelta": 0.159247}
    }
  • Where latDelta and lonDelta values are determined using the difference between the district shape's bounding box coordinates. Something like the following formula:
    latDelta = absoluteValue(maximumLatitude - minimumLatitude);
    lonDelta = absoluteValue(maximumLongitude - minimumLongitude);
  • Google and Apple use the span value to set the zoom level of their maps, centered around the center coordinate. So chances are other folks besides me might have a use for it.
  • On the other hand, if this is a pain in the ass to revisit, and since we won't be doing base64 after all, I can get at these minimum and maximum values (the bounding box) with minor effort by iterating through the points array on my end. I figured if it's easy enough to do there, that'll mean we don't have to do it on the device every time someone views a district map.
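
The span formula above could be computed from the points array roughly like this. It is a sketch: the center here is the midpoint of the bounding box, which is an assumption and will generally differ from a true polygon centroid.

```python
def region_for(points):
    """Compute a map region from (lat, lon) pairs via the bounding box."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    min_lat, max_lat = min(lats), max(lats)
    min_lon, max_lon = min(lons), max(lons)
    return {
        # Bounding-box midpoint (assumption); not the district's centroid.
        "center": {"lat": (min_lat + max_lat) / 2,
                   "lon": (min_lon + max_lon) / 2},
        "span": {"latDelta": abs(max_lat - min_lat),
                 "lonDelta": abs(max_lon - min_lon)},
    }


region = region_for([(1.0, 10.0), (3.0, 14.0), (2.0, 12.0)])
```

Doing this once on the server saves every client from walking the full points array each time a district map is shown.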

Some final touches to the districts API, and legislators too

  • Can we add another direct access point to the single district details? Since we're using the boundary_id as the unique identifier for district objects, I'd like another option to hit it up directly using just a single boundary_id value, as in: /districts/sldl-tx-state-house-district-74/
  • Once the district API is announced, can we copy this district identifier over to the legislators' detail view?
  • If we go through with this, it might make more sense if we renamed that value to district_id, since anyone using this API won't really know what we mean by "boundary". Granted, I remember asking you to call it boundary_id, and the reality is that it doesn't make a bit of difference on my end whether it's "boundary_id" or "chumbawumba" ... so if you feel like renaming it, or anything else in the API, for aesthetics while it's still unpublished, feel free to do so.

ban import *

Generally, it's considered bad form to "import *" in python code outside of the console; I suggest rewriting those modules that use it; a quick regex suggests they are:

scripts/ak/get_legislation.py
scripts/al/get_legislation.py
scripts/ca/get_legislation.py
scripts/ct/get_legislation.py
scripts/fl/get_legislation.py
scripts/ga/get_legislation.py
scripts/ky/get_legislation.py
scripts/mn/get_legislation.py
scripts/mo/get_legislation.py
scripts/nc/get_legislation.py
scripts/nd/get_legislation.py
scripts/nh/get_legislation.py
scripts/pa/get_legislation.py
scripts/pyutils/unicodecsv.py
scripts/sd/get_legislation.py
scripts/tests/test_legislation.py
scripts/tx/get_legislation.py
scripts/ut/get_legislation.py
scripts/vt/get_legislation.py
scripts/wv/get_legislation.py

It should be quite simple to do so, and I can do it if it's agreed that it's a good idea.
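
A quick check of this kind could look like the sketch below. It is not necessarily the regex used to build the list above, and it only catches the `from module import *` form at the start of a line.

```python
import re

# Matches lines such as "from foo.bar import *", allowing leading whitespace.
STAR_IMPORT = re.compile(r"^\s*from\s+[\w.]+\s+import\s+\*", re.MULTILINE)


def has_star_import(source):
    """Return True if the source text contains a star import."""
    return bool(STAR_IMPORT.search(source))


def files_with_star_imports(paths):
    """Return the subset of paths whose contents contain a star import."""
    hits = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            if has_star_import(handle.read()):
                hits.append(path)
    return hits
```

Working on raw text rather than a parsed syntax tree means the check still runs on files that the current interpreter cannot parse.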

active=false param results in empty result sets for Legislators

Searching for legislators through the URL-based interface with an active=false param produces empty result sets. The documentation for this param: http://openstates.sunlightlabs.com/api/legislators/#legislator-search suggests that the 'active' criteria indicates whether the search should be restricted to currently-active legislators. In other words, active=false should find both active and inactive legislators, producing a superset of the data returned using active=true.

term=[anything other than most recent] param seems to produce similarly empty results sets, although I haven't tested this as thoroughly yet.

Phil Hickey
LegiNation developer
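
The reading of the documentation described above amounts to the following filter. This is a sketch of the expected semantics only, not the API's actual implementation.

```python
def search_legislators(legislators, active=True):
    """Filter legislators per the reading above.

    active=True restricts results to currently-active legislators;
    active=False applies no restriction, so it returns a superset.
    """
    if active:
        return [leg for leg in legislators if leg.get("active")]
    return list(legislators)


# Illustrative records with made-up IDs.
people = [
    {"leg_id": "XXL000001", "active": True},
    {"leg_id": "XXL000002", "active": False},
]
```

Under this reading, an empty result for active=false is only possible when the unrestricted search is empty too.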

MA is missing roll call votes

Massachusetts does not provide a source for individual roll call votes. Unfortunately, this means we are unable to provide this data for now.

State-specific (TX) Crash in district api

This is a weird one ...

This crashes:
http://openstates.sunlightlabs.com/api/v1/districts/tx/?apikey=blahblah

These don't:
http://openstates.sunlightlabs.com/api/v1/districts/ca/?apikey=blahblah
http://openstates.sunlightlabs.com/api/v1/districts/mn/?apikey=blahblah
http://openstates.sunlightlabs.com/api/v1/districts/az/?apikey=blahblah
http://openstates.sunlightlabs.com/api/v1/districts/ma/?apikey=blahblah

Don't know, but it might be because I tried something stupid when pulling up data for Texas (I accidentally added an argument that's only appropriate when we're requesting a specific district from the api... in this case, I added that argument while requesting summary results from the whole state)

The error that now returns is this (even when omitting my boneheaded argument from before):
Piston/0.2.2 (Django 1.3) crash report: Traceback (most recent call last): File "/ext/openstates/src/openstates/billy/site/api/handlers.py", line 99, in new_read obj = old_read(*args, **kwargs) File "/ext/openstates/src/openstates/billy/site/api/handlers.py", line 542, in read leg_dict[(leg['chamber'], leg['district'])].append(leg) KeyError: 'chamber'
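
The traceback points at a legislator record with no 'chamber' key. One defensive way to do the same grouping is sketched below; this is not the actual billy handler code, just an illustration of tolerating incomplete records.

```python
from collections import defaultdict


def group_by_seat(legislators):
    """Group legislators by (chamber, district), setting aside bad records."""
    by_seat = defaultdict(list)
    skipped = []
    for leg in legislators:
        chamber, district = leg.get("chamber"), leg.get("district")
        if chamber is None or district is None:
            # Record lacks the keys the grouping needs; keep it for review.
            skipped.append(leg)
            continue
        by_seat[(chamber, district)].append(leg)
    return dict(by_seat), skipped


# Illustrative records; the second one mimics the failing case.
seats, skipped = group_by_seat([
    {"leg_id": "TXL000370", "chamber": "lower", "district": "106"},
    {"leg_id": "TXL000999"},
])
```

Returning the skipped records makes it easy to find which Texas legislator is missing its chamber.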

MN is missing senate roll calls

Minnesota does not provide legislator roll calls for votes taken in the Senate so we are lacking detailed votes for the MN Senate.

Utah legislator photos either broken link or huge!

The house photo urls are completely broken.

http://www.utah.gov/house/photoStreamer.jpg?DEEBL

Should actually be:

http://le.utah.gov/images/legislator/deebl.jpg

Even with that fixed url, the photos they have are large enough to cover a pin head and that's about it. Seriously, I think my watch has more pixels.

The senate photos, however, are freaking enormous ... 500x700 pixels. I don't think I quite need to see folks in HD. All I'm asking for is a little consistency from Utah here... and about 49+2 other states.
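
For the broken house links, a rewrite based on the single example above might look like this. It assumes every house photo follows the same pattern (query string code, lowercased, under /images/legislator/), which has not been verified beyond that one URL.

```python
from urllib.parse import urlsplit


def fixed_house_photo_url(broken_url):
    """Rewrite a photoStreamer URL to the le.utah.gov image URL.

    Assumes the legislator code is the whole query string, e.g. "DEEBL".
    """
    code = urlsplit(broken_url).query
    if not code:
        return None
    return "http://le.utah.gov/images/legislator/%s.jpg" % code.lower()


fixed = fixed_house_photo_url("http://www.utah.gov/house/photoStreamer.jpg?DEEBL")
```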


Here's one for your viewing pleasure: (image not preserved)

ability to retrieve multiple bill objects by their bill_ids in one query (gcode #219)

reported by mjohnson @ http://code.google.com/p/openstates/issues/detail?id=219

This is how I can retrieve a list of bills with a single request in Real Time Congress' RESTful API:

Getting the fields 'actions' and 'bill_id' from ['hr1-112', 'hr2-112', 'hr3-112']:

http://api.realtimecongress.org/api/v1/bills.json?apikey=XXXXXX&bill_id__in=hr1-112%7Chr2-112%7Chr3-112&sections=actions,bill_id

This is currently not available in Open States.

Developers will use this API to make online applications that display bills in a comparative fashion. Unfortunately, the current searchable attributes and custom fieldsets in Open States do not offer a straightforward solution to retrieve specific fields of multiple bills with one request. This is because they do not guarantee the retrieval of the correct bills. The bill_id and session are much more reliable parameters to use.

If multiple bills (only requesting a few fields from each) cannot be retrieved with one query, developers are forced to iterate through a list of bill_ids in a loop. Each iteration would hit the API server with the overhead of an additional request.

Best Regards,
Matt
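
For reference, the Real Time Congress query string shown above can be built as follows. The helper name is made up, and the `bill_id__in` and `sections` parameters belong to that API; Open States does not offer them, which is the point of this request.

```python
from urllib.parse import urlencode


def multi_bill_query(bill_ids, fields, apikey):
    """Build a query string asking for several bills in one request."""
    params = {
        "apikey": apikey,
        "bill_id__in": "|".join(bill_ids),  # pipe-separated list of IDs
        "sections": ",".join(fields),       # comma-separated field names
    }
    # safe="," keeps commas literal; the pipe is still encoded as %7C.
    return urlencode(params, safe=",")


query = multi_bill_query(["hr1-112", "hr2-112", "hr3-112"],
                         ["actions", "bill_id"], "XXXXXX")
```

One request of this kind replaces a loop of per-bill requests, each of which carries its own round-trip overhead.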

lxml-ize Alaska (gcode #105)

at this point we're almost done migrating all states away from BeautifulSoup/html5lib.

only Alaska remains, it works but next update should replace BSoup with lxml

[StatesLege] Pruning or omitting old/inactive committees

Some states have committees that are no longer in service, and some have (lots of) duplicates like Florida. Same committee name but different IDs. I don't mind if we keep the old committees around so much, as long as we can have an "active" flag like we do for legislators, that way I can omit them from the query results.

StatesLege: display_name for legislative sessions

We need a shorter and prettier way to handle legislative session strings. I display them to indicate different sessions for bill search results. Texas and several others are easy, like "82" or "821" ... California must hate me because theirs looks like "20102011 Special Session 4a Dot Com Alpha Gamma Zulu". Those are unsightly when listing the sessions with available data to the user in a small interface menu.


From @jamesturk

I'm thinking our best option will be to just add a display_name,
that'll often be the same as name and it is a little annoying, but the
other option is to make the programmatic name change whenever we want
to tweak the display a little. Sound good to you?


From @grgcombs

Absolutely ... I think the display_name, even if it's hand-crafted, is a good approach.
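
On the client side, the display_name approach could be consumed like this. It is a sketch: the metadata layout and the hand-crafted label are assumptions for illustration, not the published format.

```python
def session_label(session_details, session):
    """Return the display_name for a session, falling back to its name."""
    details = session_details.get(session, {})
    return details.get("display_name") or session


# Hypothetical metadata; the display_name value is made up.
SESSION_DETAILS = {
    "20092010 Special Session 1": {"display_name": "2009-2010, 1st Special"},
    "82": {},
}
```

The programmatic name stays stable for queries while the label can be tweaked freely.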

action classification issues in TX (gcode #217)

gcombs 6/10/11

Rather than post this as an additional comment to issue #164, I felt it warranted its own thread, given that it's only tangentially related but probably more important.

I'm still struggling with parsing action types/strings ... I'm including a list of TX action-related grievances below.

** [Most Important] ... committee passage isn't getting tagged at all in Texas, as far as I can tell. I imagine that's because you have to look at the _prefix_ of the action string ... "Reported favorably .... ", as in "Reported favorably as substituted", or "Reported favorably w/o amendment(s)" ... anything that starts with "Reported favorably" should yield either the "committee:passed" or "committee:passed:favorable" type, but instead it yields "other".

** "Read 2nd time" or even "Read 2nd time & passed to 3rd reading" should yield "bill:reading:2"

** I think we need a new action type. If the bill has sufficient signatures, the governor can't veto it (whether or not he wants to) ... but governor:signed doesn't seem to fit this, nor does "bill:veto_override:passed", it seems, because that implies there was a veto to begin with. In Texas, the action to look for is "Filed without the Governor's signature"

Other thoughts on some key events that might warrant further scrutiny beyond just the standard "other" action type:

action: "Reported engrossed" .... this means it's getting sent to the other chamber since it's passed through this chamber.

action: "Reported enrolled" .... this means it's met the requirements of one or both chambers ... simple resolutions don't have to go through the opposing chamber ... But bills certainly require both chambers to successfully pass identical legislation, and that's what this action signifies (i.e., legislative job is done).

action: "Filed with the Secretary of State" ... typically means that it's ready to appear on the ballot for the next election (I usually see this when it's intended to be a constitutional amendment)

action: "Effective immediately", or "Effective in 90 days --- mm/dd/yyyy", or "Effective on mm/dd/yyyy" ... the law is effective on that date.

gcombs 6/11/11

Thanks for making some changes to pick up more action types... I've got another one...

On senate bills, they have an action string of Filed, which you pick up correctly and categorize as bill:filed. On senate resolutions (at least simple resolutions) you won't see "Filed", but you will see the usual "Received by the Secretary of the Senate" ... which you can use as bill:filed instead.

Also, (still in senate resolutions), you correctly catch "Read & adopted" as bill:passed, but you can also include "bill:introduced" if you haven't found one of those yet.

http://openstates.sunlightlabs.com/api/v1/bills/tx/82/SR%20886
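
The prefix-based matching suggested in this thread could be sketched as below. This is not the scraper's actual code; the rules cover only the strings and action types mentioned above.

```python
# "Filed" must match exactly, otherwise "Filed without the Governor's
# signature" and "Filed with the Secretary of State" would be caught too.
EXACT = {
    "Filed": ["bill:filed"],
}

# Checked in order; the first matching prefix wins.
PREFIXES = [
    ("Reported favorably", ["committee:passed:favorable"]),
    ("Read 2nd time", ["bill:reading:2"]),
    ("Received by the Secretary of the Senate", ["bill:filed"]),
    ("Read & adopted", ["bill:passed"]),
]


def classify_action(action):
    """Return the action types for a TX action string, or ["other"]."""
    text = action.strip()
    if text in EXACT:
        return list(EXACT[text])
    for prefix, types in PREFIXES:
        if text.startswith(prefix):
            return list(types)
    return ["other"]
```

Strings with no agreed type yet, such as "Filed without the Governor's signature", fall through to "other" until a new action type is settled on.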

mstephens 6/17/11

I think most of these have been taken care of now (besides looking into adding new action types), though some won't show up until after tonight's run.

I think we need a new action type. If the bill has sufficient signatures, the governor can't veto it (whether or not he wants to) ... but governor:signed doesn't seem to fit this, nor does "bill:veto_override:passed", it seems, because that implies there was a veto to begin with. In Texas, the action to look for is "Filed without the Governor's signature"

I'm a little confused by this one. By sufficient signatures do you mean meeting a certain threshold on the final passage vote(s), or is there a separate signing process that we're not currently capturing?

gcombs 6/17/11

There are two cases, and I'm not sure if they use this same action string (filed w/o gov sig) for both, or just one, but the end result for the bill is the same...

  1. Both chambers pass the bill (in final passage) with a vote of at least 2/3rds majority. This is what it takes to override any vetoes. So it's a law, even if he doesn't like it. It would be like a preemptive veto override.
  2. Let's say it doesn't pass by 2/3rds, he's got three options... Veto it within 10 days, sign it within 10 days, or not sign it in 10 days. The latter two choices mean it's law. The veto means it goes back to the legislature ... That is, unless their 140 days (every other year) are over ... If they've already adjourned in that 10 day window, this veto is a big one ... The post-adjournment veto means the legislature has to wait a year and a half before they can introduce the bill way at the beginning of the process again.

http://code.google.com/p/openstates/issues/detail?id=217

enable experimental support for Georgia

legislator pages posed an issue as they are irregular, worked on @ PyCon '11

via contact @ NIMSP

"
And if you're looking to grab Georgia legislative data, during the session
they post daily bill status and vote updates on the web, but they don't
advertise it in any way. Here's the links:

http://www1.legis.ga.gov/legis/2011_12/list/BillSummary.xml
http://www1.legis.ga.gov/legis/2011_12/list/HouseVotes.xml
http://www1.legis.ga.gov/legis/2011_12/list/SenateVotes.xml
"

enable experimental support for Kansas

KS has an API, but there have been reports that it is down more than it is up

we're also waiting to hear back from the developers there on some requested enhancements

comment by rkiddy 4/15/11

The bills in KS do not seem to make a distinction between the types of authors. Oddly, most bills are authored by committee or by a very large group of individuals (probably the entire chamber). But all the authors are listed at the same level with no distinctions.

So, do they all get the "LEAD_AUTHOR", or do they all get "COAUTHOR"? Or do they get something else? The sponsor type is a mandatory parameter, so one does have to put something.

comment by rkiddy 5/1/11

I am corresponding with Dave Larson. He said the lead developer (named Austin) was on paternity leave and would get back in touch "on Monday." This was on April 14 and I have not heard anything.

I just pinged Dave again. We will see.

comment by rkiddy 6/21/11

I got a response back from the KS legislature. Well, I had to ping them again. But they did respond. They believe that many of the problems have been fixed.

There is reason to hope. For example, getting a bill list via the v2 API works now. So something good has happened.

Will report more anon.

add mimetype across all states (was: suggest different representations of bill versions with type keys)

as reported by rkiddy 4/5/11

For example, for AB 1 in CA's current session, here is one of the bill version objects:

    {
        "+short_title": "Education finance: CalWORKs Stage 3.", 
        "name": "20110AB198AMD", 
        "+type": [
            "bill", 
            "appropriation", 
            "fiscal committee"
        ], 
        "title": "An act relating to education finance, and making an appropriation therefor, to take effect immediately as an appropriation for the usual and current expenses of the state.", 
        "url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf", 
        "+subject": [
            "Education finance: CalWORKs Stage 3."
        ], 
        "+date": 1294963200.0
    }

Instead of:

        "url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf", 

I would suggest something like:

"url": {
    "application/pdf": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf",
    "text/html": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.html" }

This becomes more useful with other representations. For example, there are XML files available in the CA data. They are not stored in the same way as the pdf and html files, and I am not sure what MIME type to use for the particular flavor of XML they use, but the reference could be something like:

"xml/caml": "http://www.helpfulhosting.org/ca/20112012/pubinfo_20110115_Sat/BILL_VERSION_TBL_3.lob"
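
A scraper could build such a mapping by guessing the MIME type from each URL's extension. This is a sketch using Python's standard mimetypes module; the fallback type for unknown extensions is an assumption, and a custom type like the XML flavor above would still need to be assigned by hand.

```python
import mimetypes


def urls_by_mimetype(urls):
    """Map guessed MIME type to URL for each representation of a version."""
    by_type = {}
    for url in urls:
        mimetype, _encoding = mimetypes.guess_type(url)
        # Unknown extensions fall back to a generic type (assumption).
        by_type[mimetype or "application/octet-stream"] = url
    return by_type


versions = urls_by_mimetype([
    "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf",
    "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.html",
])
```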

comment by jturk

this would be nice to have but we can't break backwards compatibility at the moment, we'll investigate adding something like this when we roll out v2

in the meantime we could use plus fields to collect this data

comment by rkiddy 4/8/11

Noticed this in the documentation:

(http://openstates.sunlightlabs.com/docs/scrapers.html)
add_version(name, url, **kwargs)

Add a version of the text of this bill.
Parameters: 

    name – a name given to this version of the text,
    e.g. ‘As Introduced’, ‘Version 2’, ‘As amended’,
    ‘Enrolled’

    url – the location of this version on the state’s
    legislative website.

If multiple formats are provided, a good rule of thumb is
to prefer text, followed by html, followed by pdf/word/etc.

This seems to suggest that all of the references in the bills to versions should not have:

"url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf"

and should have, instead:

"url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.html"

The PDF is always there, but the HTML file is always there also. If it is to be preferred, can the references to the pdf files be changed?

This may be as simple as this:

diff --git a/openstates/ca/bills.py b/openstates/ca/bills.py
index fce1808..c88c700 100644
--- a/openstates/ca/bills.py
+++ b/openstates/ca/bills.py
@@ -308,7 +308,7 @@ class CABillScraper(BillScraper):

          versions = []

-            for link in page.xpath("//a[contains(@href, '.pdf')]"):
+            for link in page.xpath("//a[contains(@href, '.html')]"):
              date = link.xpath("string(../../td[2])").strip(" -")
              date = datetime.datetime.strptime(
                  date, '%m/%d/%Y').date()
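
More generally, the documented rule of thumb (text, then html, then pdf/word) could be applied when several formats are available. The sketch below is an illustration only; the extension list and its ordering are assumptions derived from that one sentence in the docs.

```python
import posixpath
from urllib.parse import urlsplit

# Most preferred first; anything else ranks last (assumption).
PREFERENCE = [".txt", ".html", ".htm", ".pdf", ".doc"]


def preferred_version_url(urls):
    """Pick the URL whose file extension ranks best in PREFERENCE."""
    def rank(url):
        ext = posixpath.splitext(urlsplit(url).path)[1].lower()
        return PREFERENCE.index(ext) if ext in PREFERENCE else len(PREFERENCE)
    return min(urls, key=rank) if urls else None


best = preferred_version_url([
    "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf",
    "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.html",
])
```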

comment by mstephens 4/18/11

California versions (for 2011 and beyond) now point to the HTML text

comment by rkiddy 4/18/11

That is great. Thanks for that.

For v2, I would still suggest the open-ended list of file types, as suggested above.

Just FYI, I have found that Kansas likes to publish things as ODT files. I suspect we want to be able to point to any file type a state might pick.

comment by gcombs 7/16/11

Speaking of California versions ... the HTML text header has a version name like "Introduced, blah blah blah" that looks scrapeworthy. Sadly, the names they give us for our version's "name" field are a bunch of gobbledyjunk, like "ABASDFASDFASDF12341234123"

[StatesLege] Missing legislator IDs in California Committee Positions/Members

Looking through some of California's committees, like CAC000196 and CAC000017, and a few others, I'm noticing that several of the members in the list are missing leg_ids.

From the looks of it, this is probably due to name matching when there's middle initials, as with "Ted W. Lieu" vs. "Ted Lieu", and "Norma J. Torres" vs. "Norma Torres".

Is this a simple fix?
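
If middle initials are the cause, one possible approach is to compare names on a normalized key that drops them. This is a sketch, not the matcher actually used; it can produce false matches when two legislators differ only by initial.

```python
def name_key(full_name):
    """Normalize a name for matching by dropping single-letter middle initials.

    The first and last tokens are always kept.
    """
    tokens = full_name.replace(".", " ").split()
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens).lower()
```

With this key, "Ted W. Lieu" and "Ted Lieu" compare equal, as do "Norma J. Torres" and "Norma Torres".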

UT legislator data issue (gcode #221)

Reported by [email protected], Jul 6, 2011

Hi,

I'm a developer with VoterVoice LLC and we just started to use the OpenStates scrapers to store the data into MongoDB. On comparing the data with what we've collected, we noticed some discrepancies.

We're a .NET shop and none of us here are well versed in Python, so we didn't (yet) delve into what the cause of the discrepancies might be.

I thought I should mention them on here in case the developers of the respective states might get a chance to correct the problems.

Pennsylvania State Representative Sandra Major (leg_id=PAL000089) has her name reversed (last_name=Sandra, first_name=Major).
Query: { "state": "pa", "last_name": "Sandra" }

Utah State Senator D. Chris Buttars (leg_id=UTL000031) is not in office any more. He has been replaced by Osmond Aaron since April 15th, but this isn't reflected in the data that is scraped.
I suspect that the legislators for the state of Utah are being scraped from this PDF (http://le.utah.gov/Documents/2010roster.pdf) which is out-dated instead of being scraped for their individual rosters from here for the Senate (http://www.utahsenate.org/aspx/roster.aspx) and here for the House (http://le.utah.gov/house2/representatives.jsp).

So far we've analyzed data in 7 states. If we come across any more discrepancies in the rest of the states, I'll post to this thread.

If there's anything meaningful we can contribute, please do let us know.

http://code.google.com/p/openstates/issues/detail?id=221
