The parlparse from mysociety

Upload content quicker

The current (old) system runs in the early morning, checks for new data, and parses any it finds. As the UK data can now go up much earlier, we should check for it more regularly and parse it immediately when found. We could also investigate fetching wrans/wms earlier, and then potentially send alerts?

"Madam Deputy Speaker" not matched

The last speech on http://www.theyworkforyou.com/debates/?id=2013-11-08a.537.0#g576.4 - the speaker has not been detected.

Shut down old UK parser

Make sure scraped data is up to date for whole archive we have
Make sure scraped data is parsed for all data we have
Remove old parser code

Standing committee member parser doesn't handle Richmond (Yorks)

The member parsing part assumes that there are no brackets in the constituency name, so Richmond (Yorks) currently fails. We need to use whatever fix the main name parser uses for this.
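A minimal sketch of one way to tolerate a bracketed constituency, assuming member lines look like "Surname, Title Forename (Constituency) (Party)". The line format, the regular expression and the name split_member_line are illustrative assumptions, not the fix the main name parser actually uses.

```python
import re

# Assumed line format: "Surname, Title Forename (Constituency) (Party)".
# The constituency group allows one level of nested brackets,
# so "Richmond (Yorks)" is captured whole.
MEMBER_RE = re.compile(
    r"^(?P<name>[^(]+?)\s*"
    r"\((?P<cons>(?:[^()]|\([^()]*\))+)\)\s*"
    r"\((?P<party>[^()]+)\)\s*$"
)


def split_member_line(line):
    """Return (name, constituency, party), or None if the line does not match."""
    m = MEMBER_RE.match(line.strip())
    if m is None:
        return None
    return m.group("name"), m.group("cons"), m.group("party")
```

Returning None on a non-match, rather than guessing, lets the caller report the line instead of silently mis-assigning a member.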

other_names with 9999-12-31 end_date

In several places a Person has an other_name with an end_date of 9999-12-31

Would it not be better to simply omit the end_date for these?
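As a rough illustration of the suggestion, the sentinel could be dropped when the data is written out. This is a sketch only: it assumes each person is a dict with an "other_names" list of dicts, and drop_open_end_dates is a hypothetical helper, not existing parlparse code.

```python
FAR_FUTURE = "9999-12-31"  # sentinel value mentioned in the issue


def drop_open_end_dates(person):
    """Return a copy of a person dict whose other_names omit the sentinel end_date."""
    cleaned = dict(person)
    if "other_names" in person:
        cleaned["other_names"] = [
            {k: v for k, v in name.items()
             if not (k == "end_date" and v == FAR_FUTURE)}
            for name in person["other_names"]
        ]
    return cleaned
```

Real end dates are left untouched; only the far-future placeholder is removed.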

members/constituencies.xml missing

The members/constituencies.xml file is missing (it appears to have been replaced with a JSON file) which seems to be stopping PublicWhip updating. It's also listed as a file on http://parser.theyworkforyou.com/members.html .

Lords no longer publishing HTML Hansard written answers/statements

(Going from http://www.parliament.uk/business/publications/hansard/lords/by-date/#session=26&year=2015&month=0&day=26 ) Will have to adapt new WA parser to fetch from http://www.parliament.uk/business/publications/written-questions-answers-statements/ where they are now based.

More duplicate MSPs

I haven't had a chance to investigate these yet properly, but Mike Rumbles, John Mason, Claudia Beamish, Pauline McNeill, and Angus MacDonald look like they've been given new IDs after the recent election, although they already had previous ones.

New XML parser doesn't set URLs

The URL format looks to be:

http://hansard.parliament.uk/$house/$date/$heading-uid/$heading-title/#contribution-$UID

where

$heading-uid is the UID of the last major or minor heading
$heading-title looks to be the camel cased and spaces removed title of the last major/minor heading
$UID is the UID of the current tag
$house is commons, lords etc

Should be able to do this by:

setting a last heading UID and last heading title variable whenever we generate a heading URL
passing in UIDs to avoid parsing hassle
having a get_url_base style function overridden in each class

NB: This looks like it works for lords and westminster hall. PBCs seem to be different so need looking at.
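The approach above could be sketched roughly as follows. Everything here is an assumption for illustration: the class names, the method names other than get_url_base, the trailing slash on heading URLs, and the exact reading of "camel cased" (each word capitalised, spaces and punctuation dropped).

```python
import re


def camel_case_title(title):
    """Capitalise each word and drop spaces and punctuation (assumed rule)."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "".join(w[:1].upper() + w[1:] for w in words)


class BaseParser:
    def __init__(self):
        # Remembered whenever a heading URL is generated.
        self.last_heading_uid = None
        self.last_heading_title = None

    def get_url_base(self):
        # Overridden in each subclass (commons, lords, ...).
        raise NotImplementedError

    def heading_url(self, date, uid, title):
        self.last_heading_uid = uid
        self.last_heading_title = camel_case_title(title)
        return "%s/%s/%s/%s/" % (
            self.get_url_base(), date, uid, self.last_heading_title)

    def speech_url(self, date, uid):
        # The UID is passed in directly to avoid re-parsing it.
        return "%s/%s/%s/%s/#contribution-%s" % (
            self.get_url_base(), date,
            self.last_heading_uid, self.last_heading_title, uid)


class CommonsParser(BaseParser):
    def get_url_base(self):
        return "http://hansard.parliament.uk/commons"
```

PBCs would need their own subclass once their URL scheme is understood.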

E&W division totals are smushed in with national ones.

e.g. http://www.theyworkforyou.com/debates/?id=2016-05-03a.108.5 has "Ayes 288279, Noes 172158." and similar in the XML source non-text.

Output person IDs in XML rather than member IDs

If we have to historically change party it is annoying that we have to update potentially a lot of XML files with a new member ID. It would be easier to store a person ID for each speech instead.

Parser misses out text when there is a tag in the questiontext tail

See http://www.theyworkforyou.com/debates/?id=2016-05-03b.14.5#g15.2

Related - #54
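For context, a small sketch of how tail text gets lost with ElementTree-style parsing. The QuestionText tag name and the sample markup are made up for illustration, and full_text is a hypothetical helper rather than the parser's own code.

```python
import xml.etree.ElementTree as ET


def full_text(element):
    """Concatenate the element's text, its descendants' text, and their tails."""
    return "".join(element.itertext())


sample = ET.fromstring(
    "<QuestionText>What steps <i>he</i> is taking to help.</QuestionText>"
)

# .text stops at the first child tag, so anything after <i>...</i>
# (the child's tail) is missed if only .text is read.
only_text = sample.text
everything = full_text(sample)
```

Here only_text holds just the words before the inline tag, while everything holds the whole question.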

Use better people format than the XML

Having separate member XML files, okay; combining them in a python script that spots people who are the same cross-constituency or assembly, not great.

PBC scraping fails to spot some names

Hopefully superseded by #45, but currently e.g. http://www.theyworkforyou.com/pbc/2015-16/Investigatory_Powers_Bill/01-0_2016-03-24a.3.0 is not spotting MP names when they begin with a "Q" question indicator.
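One possible stopgap, sketched under the assumption that the indicator is a "Q" followed by a question number (for example "Q 12" or "Q12") in front of the speaker's name. The pattern and the name strip_question_indicator are hypothetical, not existing parser code.

```python
import re

# Assumed indicator shape: "Q", optional space, digits, then whitespace.
# Requiring digits avoids touching names that merely start with "Q".
QUESTION_PREFIX_RE = re.compile(r"^\s*Q\s*\d+\s+")


def strip_question_indicator(speaker_text):
    """Remove a leading question indicator before name matching."""
    return QUESTION_PREFIX_RE.sub("", speaker_text, count=1)
```

The stripped text can then be handed to the usual name matcher unchanged.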

Factor JSON import resolving code together

It's quite duplicated now.

The last speech of the day is missing

e.g. bottom of http://www.theyworkforyou.com/debates/?id=2016-04-19b.889.0 or http://www.theyworkforyou.com/lords/?id=2016-04-19a.570.0

MP name/cons matching failures

There are quite a few in 679611_2017-01-27_03:37:30/CHAN99/CHAN99/CHAN99.xml where it's failing to spot the hon. Member for Tewksbury (Mr Roberson) among others.

MSP personal website information is never updated

It looks to be about 8 years old at the moment. We should either update it automatically or just remove it.

New parser should output doctype header for older rewrites

If the new parser rewrites a pre-Easter 2016 file, it needs to include a doctype/HTML entities header in case the file contains any entities.

Improve handling of Clause headings

“In yesterday's debates, hs_8Clause contained the "New Clause 1", "Clause 5" headings that precede the actual heading title; they're currently appearing as the last paragraph of the previous section.”

Ken Macintosh duplicate (person/14035 and person/25098)

I'm not sure whether there's a process for merging these, beyond simply deleting 25098 and adjusting membership 80477…

Mark end of oral questions

We can presumably now reliably tell where the oral questions end in the XML, so we should mark that (probably easier to use new element than have a wrapper element given the flat nature) and use that on import rather than rely on the heading being a) all caps and b) in a manual list.

This would mean uc_titles stuff could be dropped.

Basil McCrea move to NI21

Basil McCrea left the UUP in February 2013 (http://www.bbc.co.uk/news/uk-northern-ireland-19828606) and formed NI21 in June 2013 (http://www.belfasttelegraph.co.uk/news/northern-ireland/basil-mccrea-and-john-mccallister-launch-new-political-party-ni21-29326127.html)

First Deputy Chairman not matched

e.g. http://www.theyworkforyou.com/debates/?id=2016-01-12a.791.0#g794.1 – matched, Brandon Lewis, not matched. Further on, it is sometimes matched, sometimes not.

Parse error in debate

A parse error is raised from a recent debate.
Command:
./lazyrunall.py --date=2016-01-11 scrape parse debates

reporting:

'tag </b> tag out of place in :</b> To ask the Secretary of State if he will make a statement on safety in prisons and secure training centres.'
<< StampURL date:2016-01-11 col:573 aname:160111-0001.htm_spnew121 >>

Duplicate record for Mark Ruskell

Mark Ruskell (now MSP for Mid Scotland and Fife) was previously also MSP there in the Second Scottish Parliament, but has been given a new ID this time (25534 vs 14088)

David McNarry moved from UUP to UKIP in 2012

Left the Ulster Unionists in January 2012 — http://www.bbc.co.uk/news/uk-northern-ireland-16765676

Joined UKIP in October 2012 — http://www.bbc.co.uk/news/uk-northern-ireland-19828606

Broken instructions on parser.theyworkforyou.com

It talks about all-members etc but doesn't explain where they are or how to get them. It's also lost links, I think.

The old text had "Data about members of parliament is stored in source control, so you browse it in a slightly different place." with a link to something like http://project.knowledgeforge.net/ukparse/svn/trunk/parlparse/members/

And later on it had more details in the rsync section, including a simple explanation of rsync which has gone (useful for new people), and more text on the different location of the member files: "The member data is available by rsync here. rsync ukparse.kforge.net::svn/parlparse/members/" (don't know if this is available from new location)

Automate addition of new lords

We could pick them up from the API easily enough, which would save a boring, error-prone manual task.

Switch to using Parliament XML data

http://www.data.parliament.uk/dataset/12 has the daily Hansard as it appears.
http://www.data.parliament.uk/dataset/14 has the bound volumes as they appear.
Have asked about time of update (yesterday doesn't appear to be present).

Put standard footer on raw data page

http://parser.theyworkforyou.com/ presumably should also have the standard mySociety footer added.

This is a github hosted page so the content is in the gh-pages branch of this repo.

Improve robustness of parsing

“In general, I think it'd be good for the parsing functions to be more concrete about knowing they have processed everything they're expected to (and haven't found anything unexpected), so we can know we e.g. haven't lost speeches and the like.”

As a concrete example, parse_amendment calls parse_para_with_member which will strip any containing Member elements, assuming they've been handled by the caller. What if they do <Amendment>Moved by <Member>Lord Foo</Member></Amendment> – the member would be silently dropped, leaving "Moved by" in the output.
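One way to make that failure loud rather than silent is to check for unexpected child elements before stripping anything. This is a sketch only; check_all_handled and UnhandledElement are hypothetical names, and it uses the standard library's ElementTree rather than whatever the parser itself uses.

```python
import xml.etree.ElementTree as ET


class UnhandledElement(Exception):
    """Raised when an element contains a tag the caller did not expect."""


def check_all_handled(element, handled_tags):
    """Raise if any descendant of element has a tag outside handled_tags."""
    for child in element.iter():
        if child is element:
            continue
        if child.tag not in handled_tags:
            raise UnhandledElement(
                "Unexpected <%s> inside <%s>" % (child.tag, element.tag))


amendment = ET.fromstring(
    "<Amendment>Moved by <Member>Lord Foo</Member></Amendment>"
)
```

A function that does not expect Member elements would call check_all_handled(amendment, set()) and get an exception instead of quietly emitting "Moved by".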

process_hansard output can be misleading

e.g. if a "b" version is identical to an "a" version, it still says parsed to "b" version rather than "matched" or similar
It also says "parsing" files it is skipping.

Duplicate record for Shirley-Anne Somerville

The Shirley-Anne Somerville (ID 25537) elected in Dunfermline in the recent Scottish Parliament elections is the same person (http://www.parliament.scot/msps/currentmsps/shirley-anne-somerville-msp.aspx / https://en.wikipedia.org/wiki/Shirley-Anne_Somerville) as the Shirley-Anne Somerville (ID 14100) from Lothians in the 3rd Scottish Parliament.

parse_newdebate calls non-existent function

Should the handle_para code be removed, or updated?

parse_procedure assumes current_speech

If there's no current speech, nothing is added.

Unable to parse 2013-05-07 Register of Members' Financial Interests

The following commands fail:

$ cd pyscraper
$ rsync -az --progress --exclude '.svn' --exclude 'tmp/' --relative data.theyworkforyou.com::parldata/cmpages/regmem/regmem2013-05-07.html ../../parldata
$ python lazyrunall.py --date=2013-05-07 parse regmem

There are problems parsing the member name on this page:
http://www.publications.parliament.uk/pa/cm/cmregmem/130507/register-of-members-financial-interestscampbell_ronnie.htm

…due to the stray “Register of Members' Financial Interests” text in the h2.

Update public whip to pull from new parlparse location

Duplicate record for Willie Coffey

Willie Coffey was re-elected as MSP for Kilmarnock and Irvine Valley, but given a new ID (25499 vs 13968)

Check heading structure of parse_opposition/parse_debated_motion

parse_opposition looks for hs_2cDebatedMotion, hs_7SmCapsHdg, hs_2GenericHdg, but we have already seen a hs_2DebatedMotion, though this ties in with:

parse_debated_motion - should it be minor rather than major? There are sometimes two backbench debates, such as on http://www.theyworkforyou.com/debates/?d=2016-01-27

Add Register of Lords’ Interest

This is something we’re working on at @spudmind anyway, so thought we might as well add it to parlparse.

If we submit a PR, is it the sort of thing that would make it in? Thanks!

Missing Lords speech

http://www.theyworkforyou.com/lords/?id=2016-04-19a.547.2

parse_opposition/parse_debated_motion assume 0 or 1 following

I guess there will come a time when there is more than 1...

Natalie McGarry showing as SNP

In https://github.com/mysociety/parlparse/blob/master/members/people.json the entry for id uk.org.publicwhip/person/25303 shows Natalie McGarry as having a current_party of "Scottish National Party", but http://data.parliament.uk/membersdataplatform/services/mnis/members/query/name*McGarry/ and http://www.parliament.uk/biographies/commons/natalie-mcgarry/4428 show her as Independent (see also http://www.bbc.co.uk/news/uk-scotland-scotland-politics-34914067 ).

(affecting PublicWhip http://www.publicwhip.org.uk/mp.php?mpn=Natalie_McGarry&mpc=Glasgow_East&house=commons and http://www.theyworkforyou.com/mp/25303/natalie_mcgarry/glasgow_east )

Move names from memberships to persons

With date ranges if needed.

Check that memberships returned using ID lookup are valid

Hansard recently re-used a PimsId which meant we assigned a speech to a retired Bishop. We should check that the membership returned by an ID lookup is valid as otherwise TWFY complains as its lookup does check this.
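A minimal sketch of such a check, assuming memberships are dicts with optional ISO-format start_date and end_date strings. The field defaults and the name membership_valid_on are assumptions for illustration.

```python
from datetime import date


def membership_valid_on(membership, day):
    """Return True if day falls within the membership's date range.

    A missing start_date or end_date is treated as open-ended.
    """
    start = date.fromisoformat(membership.get("start_date", "0001-01-01"))
    end = date.fromisoformat(membership.get("end_date", "9999-12-31"))
    return start <= day <= end
```

After an ID lookup, the parser could call this with the date of the debate and fall back to name matching, or raise an error, when it returns False.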

Convert constituencies.xml to JSON

Create a parlparse (as opposed to parldata) rsync endpoint

At the moment we can rsync data.theyworkforyou.com::parldata to get the contents of parldata. Ideally we also want to be able to rsync data.theyworkforyou.com::parlparse to get the parlparse contents (ie the members XML files).

Required to finish documentation in #5.

Store scraped JSON pretty printed

Even structured data can have its issues, and we sometimes have to patch JSON (e.g. recently JSON containing some HTML of the form </tab<span>…</span>le>). patchtool works fine for this, but then if a new version comes in, it's very unlikely the patch will match as the JSON is probably stored as one line. We could pretty print it on saving which would make it easier to patch. Suggestion by Duncan.
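A short sketch of what pretty printing on save might look like with the standard json module. The helper name to_pretty_json is hypothetical, and sorting keys is an extra assumption intended to keep line order stable between versions.

```python
import json


def to_pretty_json(data):
    """Serialise data with one value per line so line-based patches can apply."""
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

With one key or list item per line, a patch against an old version has a much better chance of still matching when only an unrelated part of the JSON changes.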

Output division aye/noes as speech, and any end text

It looks like the old parser used to do this. Any hs_Paras inside a Division currently appear to be ignored (this links with #54, in that we didn't know they were being dropped).
