mysociety / parlparse
The scraper/parser that produces data for TheyWorkForYou, PublicWhip, etc
License: Other
The current (old) system runs in the early morning, checks for new data, and parses any it finds. As the UK data can go up much earlier, we should now check for it more regularly, and parse it immediately if found. We could also investigate fetching wrans/wms earlier too, and then potentially send alerts?
The last speech on http://www.theyworkforyou.com/debates/?id=2013-11-08a.537.0#g576.4 - the speaker has not been detected.
The member parsing part assumes that there's no brackets in the constituency name so Richmond (Yorks) fails at the moment. Need to use whatever fix the main name parser uses for this.
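A minimal sketch of one way to cope with brackets inside the constituency name. The `split_member` helper and the `Name (Constituency)` input shape are assumptions made for illustration, not the parser's actual code or the fix the main name parser uses; the idea is simply to match the final closing bracket with its own opening bracket instead of the first one found.

```python
def split_member(text):
    """Split 'Name (Constituency)' where the constituency may contain brackets.

    Hypothetical helper: walks backwards from the final ')' to find the '('
    that balances it. Returns (name, constituency), or (text, None) when
    there is no trailing bracketed group.
    """
    text = text.strip()
    if not text.endswith(")"):
        return text, None
    depth = 0
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                # text[i] opens the group closed by the final ')'
                return text[:i].strip(), text[i + 1:-1].strip()
    return text, None  # unbalanced brackets: leave the input untouched


print(split_member("A Member (Richmond (Yorks))"))
```

A regular expression that stops at the first `)` would cut the constituency short here, which is why the sketch counts bracket depth instead.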
In several places a Person has an other_name with an end_date of 9999-12-31. Would it not be better to simply omit the end_date for these?
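As a rough illustration of the suggestion, the sketch below omits the sentinel end_date from a person's other_names. The dictionary layout (a person with an `other_names` list of dicts) is an assumption about the data, and `drop_sentinel_end_dates` is a hypothetical helper, not something in the repository.

```python
SENTINEL_END = "9999-12-31"  # value used to mean "no end date"


def drop_sentinel_end_dates(person):
    """Return a copy of a person dict with open-ended end_dates omitted.

    Assumes the person may carry an 'other_names' list of dicts; other keys
    are copied through unchanged and the input is not modified.
    """
    cleaned = dict(person)
    if "other_names" in person:
        cleaned["other_names"] = [
            {
                key: value
                for key, value in name.items()
                if not (key == "end_date" and value == SENTINEL_END)
            }
            for name in person["other_names"]
        ]
    return cleaned


example = {
    "id": "example/person/1",  # made-up ID for illustration
    "other_names": [
        {"name": "Current Name", "end_date": "9999-12-31"},
        {"name": "Former Name", "end_date": "2010-05-05"},
    ],
}
print(drop_sentinel_end_dates(example)["other_names"])
```

Any consumer that currently relies on the sentinel would need to treat a missing end_date as "still current" instead.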
The members/constituencies.xml file is missing (it appears to have been replaced with a JSON file) which seems to be stopping PublicWhip updating. It's also listed as a file on http://parser.theyworkforyou.com/members.html .
(Going from http://www.parliament.uk/business/publications/hansard/lords/by-date/#session=26&year=2015&month=0&day=26 ) Will have to adapt new WA parser to fetch from http://www.parliament.uk/business/publications/written-questions-answers-statements/ where they are now based.
I haven't had a chance to investigate these yet properly, but Mike Rumbles, John Mason, Claudia Beamish, Pauline McNeill, and Angus MacDonald look like they've been given new IDs after the recent election, although they already had previous ones.
The format looks to be:
http://hansard.parliament.uk/$house/$date/$heading-uid/$heading-title/#contribution-$UID
where
$heading-uid is the UID of the last major or minor heading
$heading-title looks to be the camel cased and spaces removed title of the last major/minor heading
$UID is the UID of the current tag
$house is commons, lords etc
Should be able to do this by:
get_url_base style function overridden in each class
NB: This looks like it works for lords and westminster hall. PBCs seem to be different so need looking at.
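The URL format described above could be sketched along these lines. The class names, the `contribution_url` method, and the exact camel-casing rule in `heading_slug` are all assumptions for illustration; only the URL pattern and the idea of a `get_url_base` style function overridden per class come from the issue, and the UIDs in the example are placeholders.

```python
import re


def heading_slug(title):
    """Capitalise each word of a heading and remove spaces (assumed rule)."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "".join(word[:1].upper() + word[1:] for word in words)


class BaseParser:
    house = None  # set in each subclass

    def get_url_base(self, date):
        # Subclasses may override this if their section uses another layout
        return "http://hansard.parliament.uk/%s/%s" % (self.house, date)

    def contribution_url(self, date, heading_uid, heading_title, uid):
        return "%s/%s/%s/#contribution-%s" % (
            self.get_url_base(date),
            heading_uid,
            heading_slug(heading_title),
            uid,
        )


class CommonsParser(BaseParser):
    house = "commons"


class LordsParser(BaseParser):
    house = "lords"


print(LordsParser().contribution_url(
    "2016-01-26", "HEADING-UID", "Housing and planning bill", "SPEECH-UID"))
```

PBCs would presumably need their own subclass once their URL layout is understood.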
e.g. http://www.theyworkforyou.com/debates/?id=2016-05-03a.108.5 has "Ayes 288279, Noes 172158." and similar in the XML source non-text.
If we have to historically change party it is annoying that we have to update potentially a lot of XML files with a new member ID. It would be easier to store a person ID for each speech instead.
Having separate member XML files, okay; combining them in a python script that spots people who are the same cross-constituency or assembly, not great.
Hopefully superseded by #45, but currently e.g. http://www.theyworkforyou.com/pbc/2015-16/Investigatory_Powers_Bill/01-0_2016-03-24a.3.0 is not spotting MP names when they begin with a "Q" question indicator.
It's quite duplicated now.
There's quite a few in 679611_2017-01-27_03:37:30/CHAN99/CHAN99/CHAN99.xml where it's failing to spot the hon. Member for Tewksbury (Mr Roberson) among others.
It looks to be about 8 years old at the moment. We should either update it automatically or just remove it.
If the new parser rewrites a pre-Easter 2016 file, it needs to include a doctype/HTML entities header in case the file contains any entities.
“In yesterday's debates, hs_8Clause contained the "New Clause 1", "Clause 5" headings that precede the actual heading title; they're currently appearing as the last paragraph of the previous section.”
I'm not sure whether there's a process for merging these, beyond simply deleting 25098 and adjusting membership 80477…
We can presumably now reliably tell where the oral questions end in the XML, so we should mark that (probably easier to use new element than have a wrapper element given the flat nature) and use that on import rather than rely on the heading being a) all caps and b) in a manual list.
This would mean the uc_titles stuff could be dropped.
Basil McCrea left the UUP in February 2013 (http://www.bbc.co.uk/news/uk-northern-ireland-19828606) and formed NI21 in June 2013 (http://www.belfasttelegraph.co.uk/news/northern-ireland/basil-mccrea-and-john-mccallister-launch-new-political-party-ni21-29326127.html)
e.g. http://www.theyworkforyou.com/debates/?id=2016-01-12a.791.0#g794.1 – matched, Brandon Lewis, not matched. Further on, it is sometimes matched, sometimes not.
A parse error is raised from a recent debate.
Command:
./lazyrunall.py --date=2016-01-11 scrape parse debates
reporting:
'tag </b> tag out of place in :</b> To ask the Secretary of State if he will make a statement on safety in prisons and secure training centres.'
<< StampURL date:2016-01-11 col:573 aname:160111-0001.htm_spnew121 >>
Mark Ruskell (now MSP for Mid Scotland and Fife) was previously also MSP there in the Second Scottish Parliament, but has been given a new ID this time (25534 vs 14088)
Left the Ulster Unionists in January 2012 — http://www.bbc.co.uk/news/uk-northern-ireland-16765676
Joined UKIP in October 2012 — http://www.bbc.co.uk/news/uk-northern-ireland-19828606
It talks about all-members etc but doesn't explain where they are or how to get them. It's also lost links, I think.
The old text had "Data about members of parliament is stored in source control, so you browse it in a slightly different place." with a link to something like http://project.knowledgeforge.net/ukparse/svn/trunk/parlparse/members/
And later on it had more details in the rsync section, including a simple explanation of rsync which has gone (useful for new people), and more text on the different location of the member files: "The member data is available by rsync here. rsync ukparse.kforge.net::svn/parlparse/members/" (don't know if this is available from new location)
We could pick them up from the API easily enough, which would save a boring and easy to get wrong manual task.
http://www.data.parliament.uk/dataset/12 has the daily Hansard as it appears.
http://www.data.parliament.uk/dataset/14 has the bound volumes as they appear.
Have asked about time of update (yesterday doesn't appear to be present).
http://parser.theyworkforyou.com/ presumably should also have the standard mySociety footer added.
This is a github hosted page so the content is in the gh-pages branch of this repo.
“In general, I think it'd be good for the parsing functions to be more concrete about knowing they have processed everything they're expected to (and haven't found anything unexpected), so we can know we e.g. haven't lost speeches and the like.”
As a concrete example, parse_amendment calls parse_para_with_member, which will strip any containing Member elements, assuming they've been handled by the caller. What if they do <Amendment>Moved by <Member>Lord Foo</Member></Amendment>? The member would be silently dropped, leaving "Moved by" in the output.
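One possible shape for a stricter version is sketched below: the caller states how many Member elements it has already handled, and the function refuses to drop any it was not told about. `para_text`, its `members_handled` argument, and the `UnexpectedContent` exception are hypothetical names for illustration; this is not how parse_para_with_member is actually written.

```python
import xml.etree.ElementTree as ET


class UnexpectedContent(Exception):
    """Raised when a parsing function would silently discard content."""


def para_text(element, members_handled):
    """Return the element's text without its direct Member children.

    Raises UnexpectedContent if the number of Member elements found differs
    from the number the caller says it has already processed.
    """
    found = element.findall(".//Member")
    if len(found) != members_handled:
        raise UnexpectedContent(
            "%d Member element(s) found, caller handled %d"
            % (len(found), members_handled)
        )
    parts = [element.text or ""]
    for child in element:
        if child.tag != "Member":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text following the child is kept
    return "".join(parts).strip()


amendment = ET.fromstring(
    "<Amendment>Moved by <Member>Lord Foo</Member></Amendment>")
try:
    para_text(amendment, members_handled=0)
except UnexpectedContent as error:
    print("refused:", error)
```

With a check like this, the "Moved by" case fails loudly rather than losing the member.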
The Shirley-Anne Somerville (ID 25537) elected in Dunfermline in the recent Scottish Parliament elections is the same person (http://www.parliament.scot/msps/currentmsps/shirley-anne-somerville-msp.aspx / https://en.wikipedia.org/wiki/Shirley-Anne_Somerville) as the Shirley-Anne Somerville (ID 14100) from Lothians in the 3rd Scottish Parliament.
Should the handle_para code be removed, or updated?
If there's no current speech, nothing is added.
The following commands fail:
$ cd pyscraper
$ rsync -az --progress --exclude '.svn' --exclude 'tmp/' --relative data.theyworkforyou.com::parldata/cmpages/regmem/regmem2013-05-07.html ../../parldata
$ python lazyrunall.py --date=2013-05-07 parse regmem
There are problems parsing the member name on this page:
http://www.publications.parliament.uk/pa/cm/cmregmem/130507/register-of-members-financial-interestscampbell_ronnie.htm
…due to the stray “Register of Members' Financial Interests” text in the h2.
See also:
http://www.publications.parliament.uk/pa/cm/cmregmem/130507/part1contents.htm#R
Willie Coffey was re-elected as MSP for Kilmarnock and Irvine Valley, but given a new ID (25499 vs 13968)
parse_opposition looks for hs_2cDebatedMotion, hs_7SmCapsHdg, hs_2GenericHdg, but we have already seen a hs_2DebatedMotion, though this ties in with:
parse_debated_motion - should it be minor rather than major? There are sometimes two backbench debates, such as on http://www.theyworkforyou.com/debates/?d=2016-01-27
This is something we’re working on at @spudmind anyway, so thought we might as well add it to parlparse.
If we submit a PR, is it the sort of thing that would make it in? Thanks!
I guess there will come a time when there is more than 1...
https://github.com/mysociety/parlparse/blob/master/members/people.json shows Natalie McGarry (id uk.org.publicwhip/person/25303) as having a current_party of "Scottish National Party", but http://data.parliament.uk/membersdataplatform/services/mnis/members/query/name*McGarry/ and http://www.parliament.uk/biographies/commons/natalie-mcgarry/4428 show her as Independent (see also http://www.bbc.co.uk/news/uk-scotland-scotland-politics-34914067 ).
(affecting PublicWhip http://www.publicwhip.org.uk/mp.php?mpn=Natalie_McGarry&mpc=Glasgow_East&house=commons and http://www.theyworkforyou.com/mp/25303/natalie_mcgarry/glasgow_east )
With date ranges if needed.
Hansard recently re-used a PimsId which meant we assigned a speech to a retired Bishop. We should check that the membership returned by an ID lookup is valid as otherwise TWFY complains as its lookup does check this.
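A minimal sketch of such a validity check, assuming memberships are dicts with ISO-format `start_date` and `end_date` strings and are indexed by the external ID in a plain dict. Both function names and that data layout are assumptions for illustration, not the existing lookup code.

```python
from datetime import date


def membership_valid_on(membership, when):
    """True if the membership covers the given date (missing ends are open)."""
    start = date.fromisoformat(membership.get("start_date", "0001-01-01"))
    end = date.fromisoformat(membership.get("end_date", "9999-12-31"))
    return start <= when <= end


def lookup_membership(memberships_by_id, external_id, when):
    """Return the first membership for the ID that is valid on the date.

    Returns None when the ID is unknown or none of its memberships cover the
    speech date, so a re-used ID cannot match someone who has retired.
    """
    for membership in memberships_by_id.get(external_id, []):
        if membership_valid_on(membership, when):
            return membership
    return None


# Made-up data: an ID whose only membership ended in 2010
memberships = {
    "1234": [{"id": "m1", "start_date": "2000-01-01", "end_date": "2010-01-01"}],
}
print(lookup_membership(memberships, "1234", date(2016, 1, 11)))
```

Returning None lets the caller treat the speaker as unmatched instead of attaching the speech to the wrong person.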
At the moment we can rsync data.theyworkforyou.com::parldata to get the contents of parldata. Ideally we also want to be able to rsync data.theyworkforyou.com::parlparse to get the parlparse contents (i.e. the members XML files).
Required to finish documentation in #5.
Even structured data can have its issues, and we sometimes have to patch JSON (e.g. recently JSON containing some HTML of the form </tab<span>…</span>le>). patchtool works fine for this, but then if a new version comes in, it's very unlikely the patch will match as the JSON is probably stored as one line. We could pretty print it on saving which would make it easier to patch. Suggestion by Duncan.
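The pretty-printing idea could look something like this sketch, using only the standard `json` module. `pretty_json` is a hypothetical helper name, and sorting the keys is an extra assumption made here so that repeated downloads serialise identically.

```python
import json


def pretty_json(raw):
    """Re-serialise a JSON string with one value per line for easy patching."""
    data = json.loads(raw)
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


one_line = '{"b": 1, "a": [1, 2]}'
print(pretty_json(one_line), end="")
```

Because each value then sits on its own line, a line-based patch touches only the broken value and has a better chance of still applying when the rest of the file changes.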
The old parser used to do this, looks like. Any hs_Paras inside a Division look to currently be ignored (so links with #54 in that we didn't know this was being dropped).