as reported by rkiddy 4/5/11
For example, for AB 1 in CA's current session, here is one of the bill version objects:
{
"+short_title": "Education finance: CalWORKs Stage 3.",
"name": "20110AB198AMD",
"+type": [
"bill",
"appropriation",
"fiscal committee"
],
"title": "An act relating to education finance, and making an appropriation therefor, to take effect immediately as an appropriation for the usual and current expenses of the state.",
"url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf",
"+subject": [
"Education finance: CalWORKs Stage 3."
],
"+date": 1294963200.0
}
Instead of:
"url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf",
I would suggest something like:
"url": {
"application/pdf": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf",
"text/html": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.html"
}
This becomes more useful with other representations. For example, there are XML files available in the CA data. They are not stored in the same way as the PDF and HTML files, and I am not sure what MIME type to use for the particular flavor of XML they use, but the reference could be something like:
"xml/caml": "http://www.helpfulhosting.org/ca/20112012/pubinfo_20110115_Sat/BILL_VERSION_TBL_3.lob"
comment by jturk
This would be nice to have, but we can't break backwards compatibility at the moment. We'll investigate adding something like this when we roll out v2.
In the meantime we could use plus fields to collect this data.
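A rough sketch of what the plus-field approach might look like, working on a plain dict shaped like the version object above. The "+urls" field name and the with_alternate_urls helper are made up for illustration; nothing in the existing schema defines them. The existing "url" string is left alone so current consumers keep working.

```python
def with_alternate_urls(version, urls_by_type):
    """Return a copy of a version dict with a hypothetical "+urls" plus field."""
    updated = dict(version)  # leave the caller's dict untouched
    updated["+urls"] = dict(urls_by_type)  # "+urls" is an assumed name, not part of the schema
    return updated


base = ("http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/"
        "ab_1_bill_20110114_amended_asm_v98")
version = {"name": "20110AB198AMD", "url": base + ".pdf"}
updated = with_alternate_urls(version, {
    "application/pdf": base + ".pdf",
    "text/html": base + ".html",
})
```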
comment by rkiddy 4/8/11
Noticed this in the documentation:
(http://openstates.sunlightlabs.com/docs/scrapers.html)
add_version(name, url, **kwargs)
Add a version of the text of this bill.
Parameters:
name – a name given to this version of the text,
e.g. ‘As Introduced’, ‘Version 2’, ‘As amended’,
‘Enrolled’
url – the location of this version on the state’s
legislative website.
If multiple formats are provided, a good rule of thumb is
to prefer text, followed by html, followed by pdf/word/etc.
This seems to suggest that the version references in the bills should not have:
"url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.pdf"
and should have, instead:
"url": "http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/ab_1_bill_20110114_amended_asm_v98.html"
The PDF is always there, but so is the HTML file. If HTML is to be preferred, can the references to the PDF files be changed?
This may be as simple as this:
diff --git a/openstates/ca/bills.py b/openstates/ca/bills.py
index fce1808..c88c700 100644
--- a/openstates/ca/bills.py
+++ b/openstates/ca/bills.py
@@ -308,7 +308,7 @@ class CABillScraper(BillScraper):
versions = []
- for link in page.xpath("//a[contains(@href, '.pdf')]"):
+ for link in page.xpath("//a[contains(@href, '.html')]"):
date = link.xpath("string(../../td[2])").strip(" -")
date = datetime.datetime.strptime(
date, '%m/%d/%Y').date()
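More generally, the rule of thumb quoted from the documentation (text, then html, then pdf/word/etc.) could be sketched as a small helper that picks one URL out of several. This is only an illustration, not what the scraper does: the preferred_url name and the extension list are assumptions, and it ranks purely on the file extension in the URL path.

```python
import posixpath
from urllib.parse import urlparse

# Assumed ordering, following "prefer text, followed by html, followed by pdf/word/etc."
PREFERENCE = [".txt", ".html", ".htm", ".pdf", ".doc"]


def preferred_url(urls):
    """Pick the URL whose extension comes earliest in PREFERENCE."""
    def rank(url):
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        # Unknown extensions sort after everything in the list.
        return PREFERENCE.index(ext) if ext in PREFERENCE else len(PREFERENCE)
    return min(urls, key=rank)


base = ("http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/"
        "ab_1_bill_20110114_amended_asm_v98")
best = preferred_url([base + ".pdf", base + ".html"])
```

With the two CA URLs above, best is the .html one. An empty list raises ValueError from min, so a caller would need to handle that case.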
comment by mstephens 4/18/11
California versions (for 2011 and beyond) now point to the HTML text.
comment by rkiddy 4/18/11
That is great. Thanks for that.
For v2, I would still suggest the open-ended list of file types, as suggested above.
Just FYI, I have found that Kansas likes to publish things as ODT files. I suspect we want to be able to point to any file type a state might pick.
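To make the open-ended idea a bit more concrete, here is a minimal sketch that builds the MIME-type-keyed mapping from a list of URLs. The urls_by_mime_type helper and its overrides parameter are hypothetical. It guesses the type from the file extension with Python's standard mimetypes module, and anything it cannot guess falls back to "application/octet-stream" unless an explicit type is supplied, which is how an unusual format could still get its own key.

```python
import mimetypes


def urls_by_mime_type(urls, overrides=None):
    """Map MIME type -> URL, guessing each type from the URL's file extension."""
    overrides = overrides or {}  # optional {url: mime_type} for formats that can't be guessed
    result = {}
    for url in urls:
        mime_type = overrides.get(url) or mimetypes.guess_type(url)[0]
        # Two URLs with the same type would overwrite each other in this sketch.
        result[mime_type or "application/octet-stream"] = url
    return result


base = ("http://www.leginfo.ca.gov/pub/11-12/bill/asm/ab_0001-0050/"
        "ab_1_bill_20110114_amended_asm_v98")
version_urls = urls_by_mime_type([base + ".pdf", base + ".html"])
```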
comment by gcombs 7/16/11
Speaking of California versions ... the HTML text header has a version name like "Introduced, blah blah blah" that looks scrapeworthy. Sadly, the names they give us for our version's "name" field are a bunch of gobbledyjunk, like "ABASDFASDFASDF12341234123"