
parlparse's People

Contributors

ajparsons, andylolz, carlosrodriguezsevilla, dracos, edwardbetts, jacksonj04, jenmysoc, mashedkeyboard, mhl, myfanwynixon, sagepe, struan, tmtmtmtm

parlparse's Issues

Upload content quicker

The current (old) system runs in the early morning, checks for new data, and parses any it finds. As the UK data can now go up much earlier, we should check for it more regularly and parse it immediately when found. We could also investigate fetching wrans/wms earlier, and then potentially send alerts.

Shut down old UK parser

  • Make sure scraped data is up to date for whole archive we have
  • Make sure scraped data is parsed for all data we have
  • Remove old parser code

More duplicate MSPs

I haven't had a chance to investigate these properly yet, but Mike Rumbles, John Mason, Claudia Beamish, Pauline McNeill, and Angus MacDonald look like they've been given new IDs after the recent election, although they already had previous ones.

New XML parser doesn't set URLs

The URL format looks to be:

http://hansard.parliament.uk/$house/$date/$heading-uid/$heading-title/#contribution-$UID

where

  • $heading-uid is the UID of the last major or minor heading
  • $heading-title looks to be the title of the last major/minor heading, camel-cased with spaces removed
  • $UID is the UID of the current tag
  • $house is commons, lords etc

Should be able to do this by:

  • setting a last heading UID and last heading title variable whenever we generate a heading URL
  • passing in UIDs to avoid parsing hassle
  • having a get_url_base style function overridden in each class

NB: This looks like it works for lords and westminster hall. PBCs seem to be different so need looking at.
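
The approach above could be sketched roughly as follows. This is only an illustration, not the parser's actual code: the class names, `heading_slug`, `heading_url` and `speech_url` are hypothetical, and the exact camel-casing rule for $heading-title is a guess (capitalise each word, drop everything that is not a letter or digit). It also assumes heading URLs take the same form as speech URLs without the fragment.

```python
import re


def heading_slug(title):
    """Capitalise each word and drop spaces/punctuation (assumed rule)."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "".join(w[:1].upper() + w[1:] for w in words)


class UrlBuilder:
    """Remembers the last heading so speech URLs can point back at it."""

    house = None  # overridden in each subclass

    def __init__(self, date):
        self.date = date
        self.last_heading_uid = None
        self.last_heading_title = None

    def get_url_base(self):
        return "http://hansard.parliament.uk/%s/%s" % (self.house, self.date)

    def heading_url(self, uid, title):
        # Record the heading whenever a heading URL is generated.
        self.last_heading_uid = uid
        self.last_heading_title = heading_slug(title)
        return "%s/%s/%s/" % (self.get_url_base(), uid, self.last_heading_title)

    def speech_url(self, uid):
        if self.last_heading_uid is None:
            raise ValueError("speech seen before any heading")
        return "%s/%s/%s/#contribution-%s" % (
            self.get_url_base(),
            self.last_heading_uid,
            self.last_heading_title,
            uid,
        )


class CommonsUrlBuilder(UrlBuilder):
    house = "commons"


class LordsUrlBuilder(UrlBuilder):
    house = "lords"
```

UIDs are passed in by the caller, so nothing here has to parse them out of the source. PBCs would presumably need their own subclass once their format is understood.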

Output person IDs in XML rather than member IDs

If we have to change someone's party retrospectively, it is annoying to have to update potentially a lot of XML files with a new member ID. It would be easier to store a person ID for each speech instead.
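
A minimal sketch of why this helps, assuming a simple in-memory membership list. The IDs, parties, dates and the `membership_for` helper are all made up for illustration and do not reflect the real data layout. The speech only stores a person ID; the membership (and so the party) is looked up by date, so splitting a membership in two does not touch any speech.

```python
from datetime import date

# Hypothetical data: one person with two memberships after a party change.
MEMBERSHIPS = [
    {"id": "member/1001", "person_id": "person/1", "party": "Party A",
     "start": date(2015, 5, 8), "end": date(2016, 6, 30)},
    {"id": "member/1002", "person_id": "person/1", "party": "Party B",
     "start": date(2016, 7, 1), "end": date(9999, 12, 31)},
]


def membership_for(person_id, on_date, memberships=MEMBERSHIPS):
    """Return the membership a person held on a given date, or None."""
    for m in memberships:
        if m["person_id"] == person_id and m["start"] <= on_date <= m["end"]:
            return m
    return None


# The speech record never needs rewriting when memberships change.
speech = {"person_id": "person/1", "date": date(2016, 8, 1)}
current = membership_for(speech["person_id"], speech["date"])
```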

Use better people format than the XML

Having separate member XML files is okay; combining them in a Python script that spots people who are the same across constituencies or assemblies is not great.

MP name/cons matching failures

There are quite a few in 679611_2017-01-27_03:37:30/CHAN99/CHAN99/CHAN99.xml where it's failing to spot the hon. Member for Tewksbury (Mr Roberson) among others.

Improve handling of Clause headings

“In yesterday's debates, hs_8Clause contained the "New Clause 1", "Clause 5" headings that precede the actual heading title; they're currently appearing as the last paragraph of the previous section.”

Mark end of oral questions

We can presumably now reliably tell where the oral questions end in the XML, so we should mark that (it is probably easier to use a new element than a wrapper element, given the flat nature of the format) and use that on import, rather than relying on the heading being a) all caps and b) in a manual list.

This would mean uc_titles stuff could be dropped.
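
As a rough sketch of the idea, assuming a hypothetical empty `<oral-questions-end/>` marker element (the name and the sample XML below are invented for illustration), an importer could flag everything before the marker without looking at heading text at all. Treating everything before the marker as oral questions is a simplification.

```python
import xml.etree.ElementTree as ET

# Invented sample of the flat output format, with a hypothetical marker element.
SAMPLE = """<publicwhip>
<major-heading id="h1">Oral Answers to Questions</major-heading>
<minor-heading id="h2">Prisons</minor-heading>
<speech id="s1"><p>Question text</p></speech>
<oral-questions-end/>
<major-heading id="h3">Business of the House</major-heading>
<speech id="s2"><p>Statement text</p></speech>
</publicwhip>"""


def tag_oral_questions(root, marker="oral-questions-end"):
    """Return (id, in_oral_questions) for each element other than the marker."""
    # Without a marker, nothing is treated as oral questions.
    in_orals = any(el.tag == marker for el in root)
    out = []
    for el in root:
        if el.tag == marker:
            in_orals = False
            continue
        out.append((el.get("id"), in_orals))
    return out


flags = tag_oral_questions(ET.fromstring(SAMPLE))
```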

Parse error in debate

A parse error is raised from a recent debate.
Command:
./lazyrunall.py --date=2016-01-11 scrape parse debates

reporting:

'tag </b> tag out of place in :</b> To ask the Secretary of State if he will make a statement on safety in prisons and secure training centres.'
<< StampURL date:2016-01-11 col:573 aname:160111-0001.htm_spnew121 >>

Duplicate record for Mark Ruskell

Mark Ruskell (now MSP for Mid Scotland and Fife) was previously also MSP there in the Second Scottish Parliament, but has been given a new ID this time (25534 vs 14088)

Broken instructions on parser.theyworkforyou.com

It talks about all-members etc but doesn't explain where they are or how to get them. It's also lost links, I think.

The old text had "Data about members of parliament is stored in source control, so you browse it in a slightly different place." with a link to something like http://project.knowledgeforge.net/ukparse/svn/trunk/parlparse/members/

And later on it had more details in the rsync section, including a simple explanation of rsync which has gone (useful for new people), and more text on the different location of the member files: "The member data is available by rsync here. rsync ukparse.kforge.net::svn/parlparse/members/" (don't know if this is available from new location)

Improve robustness of parsing

“In general, I think it'd be good for the parsing functions to be more concrete about knowing they have processed everything they're expected to (and haven't found anything unexpected), so we can know we e.g. haven't lost speeches and the like.”

As a concrete example, parse_amendment calls parse_para_with_member, which strips any contained Member elements on the assumption that the caller has handled them. If the source has <Amendment>Moved by <Member>Lord Foo</Member></Amendment>, the member would be silently dropped, leaving "Moved by" in the output.
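
One way to make that stricter, sketched here with ElementTree. `flatten_para` and `UnexpectedContent` are hypothetical names, not the existing parse_para_with_member: the caller has to say explicitly which Member elements it has already dealt with, and anything else raises instead of vanishing.

```python
import xml.etree.ElementTree as ET


class UnexpectedContent(Exception):
    """Raised when a parsing function meets something it was not told about."""


def flatten_para(el, handled=()):
    """Flatten el to text, dropping only Member elements the caller has handled."""
    parts = [el.text or ""]
    for child in el:
        if child.tag != "Member":
            raise UnexpectedContent(
                "unexpected <%s> in <%s>" % (child.tag, el.tag))
        # Identity check: the caller must pass the very elements it processed.
        if not any(child is h for h in handled):
            raise UnexpectedContent(
                "unhandled Member %r in <%s>" % (child.text, el.tag))
        parts.append(child.tail or "")
    return "".join(parts).strip()


amendment = ET.fromstring(
    "<Amendment>Moved by <Member>Lord Foo</Member></Amendment>")
```

Calling `flatten_para(amendment)` raises, whereas `flatten_para(amendment, handled=[amendment.find("Member")])` returns the text once the caller has taken responsibility for the member.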

process_hansard output can be misleading

  • e.g. if a "b" version is identical to an "a" version, it still says parsed to "b" version rather than "matched" or similar
  • It also says "parsing" files it is skipping.

Unable to parse 2013-05-07 Register of Members' Financial Interests

The following commands fail:

$ cd pyscraper
$ rsync -az --progress --exclude '.svn' --exclude 'tmp/' --relative data.theyworkforyou.com::parldata/cmpages/regmem/regmem2013-05-07.html ../../parldata
$ python lazyrunall.py --date=2013-05-07 parse regmem

There are problems parsing the member name on this page:
http://www.publications.parliament.uk/pa/cm/cmregmem/130507/register-of-members-financial-interestscampbell_ronnie.htm

…due to the stray “Register of Members' Financial Interests” text in the h2.

See also:
http://www.publications.parliament.uk/pa/cm/cmregmem/130507/part1contents.htm#R
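
A possible workaround, sketched under the assumption that the stray text can simply be removed from the h2 text before name matching. `clean_member_heading` is a hypothetical helper and the heading string in the usage note is an invented example, not copied from the page.

```python
import re

# Matches the stray page title, with or without an apostrophe after "Members".
STRAY_TITLE = re.compile(
    r"register of members.? financial interests", re.IGNORECASE)


def clean_member_heading(text):
    """Remove the stray register title from a member heading."""
    text = text.replace("\u2019", "'")  # normalise curly apostrophes
    text = STRAY_TITLE.sub("", text)
    # Collapse whitespace left behind by the removal.
    return re.sub(r"\s+", " ", text).strip(" -:")
```

For example, `clean_member_heading("Register of Members' Financial Interests CAMPBELL, Mr Ronnie (Blyth Valley)")` would leave just the name and constituency for the existing matching code.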

Add Register of Lords’ Interest

This is something we’re working on at @spudmind anyway, so thought we might as well add it to parlparse.

If we submit a PR, is it the sort of thing that would make it in? Thanks!

Natalie McGarry showing as SNP

Create a parlparse (as opposed to parldata) rsync endpoint

At the moment we can rsync data.theyworkforyou.com::parldata to get the contents of parldata. Ideally we also want to be able to rsync data.theyworkforyou.com::parlparse to get the parlparse contents (i.e. the members XML files).

Required to finish documentation in #5.

Store scraped JSON pretty printed

Even structured data can have its issues, and we sometimes have to patch JSON (e.g. recently JSON containing some HTML of the form </tab<span>…</span>le>). patchtool works fine for this, but then if a new version comes in, it's very unlikely the patch will match as the JSON is probably stored as one line. We could pretty print it on saving which would make it easier to patch. Suggestion by Duncan.
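
A minimal sketch of what saving pretty-printed JSON might look like, using only the standard library. The function names are hypothetical, and sorting keys is an extra assumption intended to keep diffs between versions stable.

```python
import json


def to_pretty_json(data):
    """Serialise with one value per line so line-based patches can apply."""
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def save_pretty(data, path):
    """Write the pretty-printed JSON to disk as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(to_pretty_json(data))
```

With one value per line, a patch that fixes a single broken HTML string has a much better chance of still applying when a new version of the file changes other fields.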
