
commons-builder's Introduction

EveryPolitician

Data about every national legislature in the world, freely available for you to use

Repo summary

These are some of the key repos in the EveryPolitician family. There are others.

The repos for many of our scrapers are kept separately in github.com/everypolitician-scrapers.

Technical blog

The EveryPolitician bot's own page is a good jumping-off point to lots of semi-technical explanations of what's going on (it has its own blog on Medium). For example:

The bot is on twitter as @everypolitician

Contributing

If you have data for us, or know where to get it, please read our page about how to contribute.

Team

EveryPolitician is a mySociety project.

commons-builder's People

Contributors

alexdutton, crowbot, jacksonj04, mhl

commons-builder's Issues

Regression in label service template: only "en" being used.

The label_service.rq.liquid template is no longer including non-"en" languages in its output. e.g., when run against Chile, the queries being generated have the following diff:

diff --git a/executive/index-query-used.rq b/executive/index-query-used.rq
index 4a7e502..b86522f 100644
--- a/executive/index-query-used.rq
+++ b/executive/index-query-used.rq
@@ -44,5 +44,5 @@ SELECT DISTINCT ?executive ?executiveLabel ?adminArea ?adminAreaLabel ?adminArea
     }
   }
 
-  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,es". }
+  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
 } ORDER BY ?primarySort ?country ?adminAreaType ?executive ?position

This will be the result of merging #58, where the template wasn't updated to use config.languages instead of languages.
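As a minimal sketch of the intended behaviour, in plain Ruby rather than the actual Liquid template, the label service line can be built from whatever languages are configured. The `label_service_clause` helper is hypothetical and only illustrates that every configured language should reach the query.

```ruby
# Hypothetical helper: builds the label service line from a list of
# language codes (for example the value of config.languages).
def label_service_clause(languages)
  languages = Array(languages)
  raise ArgumentError, 'no languages configured' if languages.empty?

  %(SERVICE wikibase:label { bd:serviceParam wikibase:language "#{languages.join(',')}". })
end

puts label_service_clause(%w[en es])
```

Raising on an empty list is a design choice in this sketch: it makes a misconfigured template fail loudly instead of silently falling back to a single language.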

include ?district in the ORDER BY

For people who have multiple districts (e.g. the mayor of Rome) we're getting a lot of spurious changes depending on the order in which these rows are returned by the SPARQL query.

Ordering of query results bound variables should be consistent

When running queries, individual returned results sometimes have their bound variables in a different order, resulting in unnecessary diffs, which makes reviewing difficult.

These changes are apparent in both query results and the resulting Popolo JSON.
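One possible approach, sketched here under the assumption that each result row is available as a flat Ruby hash of variable name to value (real SPARQL JSON bindings are nested, so this is a simplification), is to normalise both the key order within each row and the order of the rows before anything is written to disk. `normalise_results` is a hypothetical name.

```ruby
require 'json'

# Returns rows with their keys in alphabetical order, and the rows themselves
# sorted by the given variables, so repeated runs serialise identically.
def normalise_results(rows, order_by)
  rows
    .map { |row| row.sort.to_h }
    .sort_by { |row| order_by.map { |var| row[var].to_s } }
end

# Two runs returning the same data in a different order (made-up IDs).
run_a = [{ 'position' => 'Q1', 'executive' => 'Q2' }, { 'executive' => 'Q1', 'position' => 'Q9' }]
run_b = [{ 'executive' => 'Q1', 'position' => 'Q9' }, { 'executive' => 'Q2', 'position' => 'Q1' }]

same = JSON.pretty_generate(normalise_results(run_a, %w[executive position])) ==
       JSON.pretty_generate(normalise_results(run_b, %w[executive position]))
puts same
```

Sorting in Ruby after the fact complements, rather than replaces, a complete ORDER BY in the query itself.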

People that died before taking office are erroneously thought current.

Tancredo Neves was the 10th president of Brazil, but died before taking office. As such, he never stopped being president by virtue of never starting, and so is deemed "current" by our current queries.

I think this should be modelled by adding a "start date: no value" qualifier to the P39 to signify "never started", and the executive — and maybe legislative — query(ies) updated to take this into consideration.

@tmtmtmtm suggested off-GitHub using a "subject has role: president-elect" qualifier in this case, which I think is helpfully descriptive, but complicates the query if we also have to look for "governor-elect", "senator-elect", etc.

This would also need documenting within https://www.wikidata.org/wiki/Wikidata:WikiProject_every_politician/Political_data_model

Ignore fictional people

Harriet Jones is picked up as a current Prime Minister of the United Kingdom, despite being an instance of 'fictional human' instead of 'person'. We should ensure all the people we find are instances of 'person'.

Handle the subsequent merging of Wikidata items used in metadata

Add a stage to the build process before it generates the popolo-m17n.json, that checks to see if any of the item IDs in the index.json files or in the boundary CSV files have since been merged, and a command that will update them in that situation. I think we're going to want to use the id-mapping-store to represent somehow that these new items represent the same thing as the old one, but I think that needs some discussion before we act on it.

Hard to debug query timeouts when query not yet written to disk

In Executive.list and Legislature.list, there are two occurrences where the query is run before it is written to disk, which means that if it times out or fails, it's difficult to determine what the failed query was.

This was discovered trying to debug #45.

The query should be recorded before these methods attempt to run it.
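A minimal sketch of the ordering being asked for, assuming a hypothetical `record_then_run` helper and a block standing in for the real HTTP call (neither is the gem's actual API):

```ruby
require 'fileutils'
require 'tmpdir'

# Writes the query to `path` first, then yields it to the caller's runner,
# so the query text survives even if the runner raises.
def record_then_run(query, path)
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, query)
  yield query
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'executive', 'index-query-used.rq')
  begin
    record_then_run('SELECT * WHERE { ?s ?p ?o } LIMIT 1', path) do |_query|
      raise IOError, 'simulated timeout' # stands in for the real request
    end
  rescue IOError => e
    puts "Query failed (#{e.message}); see #{File.basename(path)}"
  end
  puts File.exist?(path)
end
```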

Italian senators for life result in an additional legislative/index.json entry

Italy's senate (https://www.wikidata.org/wiki/Q633872) has two "has part" values, "member of the Italian senate" (https://www.wikidata.org/wiki/Q13653224) and "senatore a vita (senator for life)" (https://www.wikidata.org/wiki/Q826589), with the latter a subclass of the former. This means two entries in legislative/index.json at the moment, which is Bad. I think we want to exclude the latter, and then update the legislative membership query to also consider subclasses of the specific position item.

Generate summaries for each legislature and term

At the moment each term is stored in a folder such as legislative/Q12345/Q67890. Whilst entirely sensible for machines that can quickly parse the index files, this isn't much use to humans who may be interested in what each folder contains.

Proposal

As part of the build process, each of these subfolders (both the legislature and the term) should gain a summary.json (similar to EveryPolitician) with labels and headline statistics, as well as a readme.md with a human readable version of the same.

Why?

The machine readable summaries are immediately useful for downstream tools such as the Commons Explorer, and the human readable version makes pointing interested parties directly at a legislature or term folder on GitHub more feasible.

Support Hong Kong districts

Hong Kong districts (e.g. Central and Western District) aren't FLACSen, but instead are districts of Hong Kong. As it stands, there's no easy way to pull these out.

They're also not linked to the Hong Kong entity with P17 (country), but there is a P131+ chain (located in the administrative territorial entity).

Further note of caution: Hong Kong is a FLACS of China.

Issues with generate_legislative_index for South Korea

Running generate_legislative_index against everypolitician/proto-commons-south-korea results in a file that only lists The National Assembly (Q494162), and none of the FLACSen or cities.

If we loosen the query to also include P1001 (applies to jurisdiction) we'd pick up a few Provincial/City Government → Province/City relationships, but…

These Provincial/City governments are modelled as instances of local government, not legislature or legislative house as we had before. local government is also applied to e.g. Incheon Metropolitan City Office of Education.

I think we should add legislature types to the Provincial/City Governments, alongside their existing local government types.

We can probably cope with the looseness of ?body (wdt:P194/wdt:P527?)|^wdt:P1001 ?legislature, unless anyone thinks wdt:P194 should be the One True Way.

http://tinyurl.com/ybey28qp provides context.

https://gist.github.com/alexsdutton/0eb41f525d916453a0639bc4ea512a06 is legislature/index.json with both of these things loosened. It would include unhelpfully the Offices of Education if they had P1001s, which it would be reasonable for them to have.

Not enough flexibility for exec/leg index generation for Mexico

For Mexico, the intention is to include the largest nine SLACSen (second-level administrative country subdivisions) and not include any cities, as the cities don't have the legislatures or executives.

The index generation scripts rely upon select_admin_areas_for_country, which currently picks out the country, the FLACSen, and cities with populations over 250k.

We could achieve this by some combination of:

  1. Adding additional admin area Wikidata IDs to be included to config.json
  2. Adding population thresholds for SLACS and cities, with defaults of ∞ and 250k, varied for Mexico
  3. Manually curating Mexico's index.json files to include the required executives and legislatures.

I don't like (3) when the assumption is becoming that the index files are generated. (1) is simple and generic, but doesn't support encoding the why of why those admin areas are included (and it's JSON, so no comments).

Roles with multiple superclasses lead to duplicate membership objects

The President of Brazil has two superclasses: head of government, and president. The superclasses have disjoint inheritances, so we can't exclude one for being in the inheritance graph of the other.

At the moment our executive query returns multiple sets of bindings, one for each role superclass. We need a way to pick the more sensible of the two, and only output one membership object.

This problem may also apply to legislative roles.

Ability to pull in constituency/area metadata from Wikidata

commons-builder should be able to find constituency information for legislatures from Wikidata, so that:

  • we can do most of the data processing async from finding boundary shapefiles
  • we can ensure that the data in the shapefile attributes matches what's in Wikidata more easily (for completeness checks)
  • we can check the consistency of data in Wikidata

This will involve generating the CSV files (or a revamped form of the same data) and the associations between areas and positions. I'd also like to see seat counts on position/area pairs, so we can check we have enough seats as well as constituencies (though functional and at-large constituencies may complicate this).

Factor out queries into templates

The queries would be far more maintainable as erb templates, and we wouldn't butt up against Rubocop's class length cop for wikidata_queries (as we are doing for the proposed implementation in #50).

Accessing missing query template variables should raise an exception

As mentioned in 5463265, the regression in #59 wouldn't have occurred if attempting to access variables missing from the template context raised an exception.

I think this is precluded at the moment because using strict_variables (see Shopify/liquid#804) causes errors as per Shopify/liquid#828. The commit that fixes this (Shopify/liquid@5149cde) is part of a merged PR (Shopify/liquid#829) but isn't in master or the tagged release, and I haven't yet worked out where it went.
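To illustrate the behaviour being asked for without depending on Liquid, here is a plain-Ruby sketch. It uses `Kernel#format` with named references, which raises `KeyError` when the context lacks a referenced key. The `render_strict` helper and the template string are hypothetical stand-ins, not the gem's template code.

```ruby
# Hypothetical strict renderer: a missing variable is an error, not an
# empty string.
def render_strict(template, context)
  format(template, context)
end

template = 'SERVICE wikibase:label { bd:serviceParam wikibase:language "%<languages>s". }'

puts render_strict(template, { languages: 'en,es' })

begin
  # The context has the wrong shape, much like the regression described above.
  render_strict(template, { config: { languages: 'en,es' } })
rescue KeyError => e
  puts "Template error: #{e.message}"
end
```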

Null parent_ids should be warned about

The only area that should have a null parent_id is the country area, but often we have other areas with null parent_ids slipping through. This should be an easy thing to check.
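A minimal sketch of such a check, assuming each area is a hash with `id` and `parent_id` keys (the `id` key name and the `orphaned_area_ids` helper are assumptions, and the IDs other than Q96 are made up):

```ruby
# Returns the IDs of areas whose parent_id is nil, other than the country.
def orphaned_area_ids(areas, country_id)
  areas
    .select { |area| area['parent_id'].nil? && area['id'] != country_id }
    .map { |area| area['id'] }
end

areas = [
  { 'id' => 'Q96', 'parent_id' => nil },        # the country itself
  { 'id' => 'Q0000001', 'parent_id' => 'Q96' }, # made-up child area
  { 'id' => 'Q0000002', 'parent_id' => nil }    # made-up area that should be flagged
]

orphaned_area_ids(areas, 'Q96').each do |id|
  warn "WARNING: area #{id} has a null parent_id"
end
```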

Overrideable defaults for various modelling deviations

We have a number of cases where the data we're interested in for a particular country doesn't quite follow what we normally do. These include:

  • Wanting to pull out regional representation for something other than FLACS (#49, #53)
  • Varying population thresholds for cities (Brazil, to 1m)
  • Filtering by something other than country for relevant regional and local representation (#53)

I think therefore that there should be a few overrideable config options in config.json, that we can extend as we go:

  • regional admin area superclass (default of FLACS, which we can vary to "district of Hong Kong" for Hong Kong, and "SLACS" for Brazil)
  • city population threshold (default of 250k, which we can up to 1m for Brazil)
  • additional admin area IDs (default of []; it's already supported on a branch, but it would be implemented in a way consistent with the new config options)

The defaults would live in Commons::Builder::Config. The Config object would be passed to the query generation methods (and the additional_admin_area_ids would be picked up from that, instead of being passed explicitly).
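A standalone sketch of the defaults-plus-overrides idea follows. The option names and the `load_config` helper are illustrative assumptions, not the actual Commons::Builder::Config interface.

```ruby
require 'json'

# Hypothetical defaults; the key names are illustrative only.
DEFAULTS = {
  'regional_admin_area_type' => 'FLACS',
  'city_population_threshold' => 250_000,
  'additional_admin_area_ids' => []
}.freeze

# Overlays the options found in a config.json string on the defaults.
def load_config(json)
  DEFAULTS.merge(JSON.parse(json))
end

brazil = load_config('{"regional_admin_area_type": "SLACS", "city_population_threshold": 1000000}')
puts brazil['city_population_threshold']
puts brazil['additional_admin_area_ids'].inspect
```

Keeping the defaults in one place means a country's config.json only needs to list the options where it deviates.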

Term-specific position item IDs shouldn't appear in output

Countries that have term-specific position items end up with these items' IDs in the output, when we'd prefer the generic one. Using the generic one in the output would then match up against a static position item ID in the boundary index data, and make the output more consistent for consumers, whom we want to shield from the vagaries of term-specific positions.

Executive.list breaks when faced with multiple positions for one exec

In Northern Ireland, we — rightly or wrongly — pick up two HoG positions, the First Minister and the Deputy First Minister. This leads to Executive.list creating a new Executive for each and then trying to order them. As the Executive#executive_item_id is the same, it tries to order the positions, which are incomparable.

executives_unsorted.sort_by { |h| [h.executive_item_id, h.positions] }

Note: It's not enough to just implement Position#<=>, as it would still be returning duplicate Executives, when there should be one Executive with multiple positions.
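One way this could look, as a sketch only: group the rows by executive item ID, merge their positions, and sort on the ID alone so that positions never need to be compared. The `Position` and `Executive` structs below are simplified stand-ins for the gem's classes, `merge_executives` is a hypothetical name, and the IDs are made up.

```ruby
Position = Struct.new(:item_id, :label)
Executive = Struct.new(:executive_item_id, :positions)

# Collapses executives sharing an item ID into one with all their positions,
# then sorts by item ID alone.
def merge_executives(executives_unsorted)
  executives_unsorted
    .group_by(&:executive_item_id)
    .map { |item_id, group| Executive.new(item_id, group.flat_map(&:positions).sort_by(&:item_id)) }
    .sort_by(&:executive_item_id)
end

rows = [
  Executive.new('Q2', [Position.new('Q20', 'First Minister')]),
  Executive.new('Q1', [Position.new('Q10', 'President')]),
  Executive.new('Q2', [Position.new('Q21', 'Deputy First Minister')])
]

merge_executives(rows).each do |executive|
  puts "#{executive.executive_item_id}: #{executive.positions.map(&:label).join(', ')}"
end
```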

generate_executive_index often times out

The SPARQL query behind generate_executive_index often times out. Until now it's been possible to run it again and mostly have it succeed, but this isn't particularly sustainable or reliable.

The query should be simplified or split, or otherwise changed to give a higher probability of it working first time.

Example when run against Mexico (Q96):

bundler: failed to load command: generate_executive_index (/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index)
RestClient::Exceptions::ReadTimeout: Timed out reading data from server
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:733:in `rescue in transmit'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:647:in `transmit'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient.rb:71:in `post'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/lib/commons/builder/wikidata.rb:16:in `perform'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/lib/commons/builder/executive.rb:31:in `list'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/bin/generate_executive_index:11:in `<top (required)>'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index:23:in `load'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index:23:in `<top (required)>'

Use executive position_ids to generate the role in wikidata queries, not the superclass

As @mhl noted in e59dc73, we're now specifying the specific roles for executive positions in executive/index.json, e.g. 'Mayor of Busan', rather than the superclass 'mayor'. However, the queries set that position_id as the superclass, meaning they'll usually return data where the superclass is the same as the position itself. Update the queries so that they set the role, not the superclass, so we get a useful superclass from them.

Change COPYRIGHT to LICENCE

Currently licence conditions are recorded in the same directory as the shapefiles, i.e. in ./Boundaries//-COPYRIGHT
I think this should be changed to 'LICENCE', as it more accurately describes what is contained and, while perhaps a small thing, could affect how 'open' this data is seen to be.

Generate legislative/index.json automatically from Wikidata for a given country

Currently the legislative/index.json in proto-commons repos is being authored by hand in bits as part of the process of including each new directory of boundary files. We would like to generate it automatically from Wikidata to the extent that that's possible. Our first thought is that the mechanism for doing this should be a script in the bin directory of this repo.

In general, we think this is going to involve using the country Wikidata ID specified in the config.json file of the proto-commons repository and formulating a set of queries that use that as a starting point to find the Wikidata IDs and names of various associated entities (legislatures at various different levels, and the roles and terms associated with them).

The file should contain an item for the national level legislature(s), the legislature associated with each first level administrative country subdivision (FLACS) for the country, and an item for every city with a population over 250k people.

The first thing that this script will need to do is to find the national level legislature(s) for the country. There should be Wikidata queries already defined in the Legislative Explorer that can be used or adapted for this. It will also need to find the Wikidata item for the role of being a member of that legislature (Owen can give some guidance on useful queries here). It will also need to get term information if appropriate, and if not, then fall back to using a start and end date. There's a good example term query here. For identifying the FLACS and their legislatures, it looks like the Legislative Explorer should again have some starting queries that can be used or modified. Owen should be able to help with a query to find, from Wikidata, the cities with a population over 250k.
