everypolitician / commons-builder

Build scripts for Democratic Commons repositories

License: MIT License

Ruby 84.59% Shell 0.15% Liquid 15.26%

commons-builder's Introduction

EveryPolitician

Data about every national legislature in the world, freely available for you to use

everypolitician.org — data | about
Report an issue

Repo summary

These are some of the key repos in the EveryPolitician family. There are others.

everypolitician (this repo): contains no code, but is where issues/tickets for the whole project live
everypolitician-data: where the data is stored -- but if you want to download it, get it from:
- human? go via the EveryPolitician website
- program? use the RawGit CDN, via links in countries.json, which we explain here
viewer-static: the live website http://everypolitician.org (gh-pages)
viewer-sinatra: Sinatra app for generating a dynamic version of the EveryPolitician website
webhook-manager: sends out EveryPolitician WebHooks: register your URL here!
everypolitician-docs: documentation at http://docs.everypolitician.org/ (gh-pages)

rebuilder: rebuilds data from source

libraries for easily manipulating EveryPolitician data (useful for all devs, but we use the Ruby ones ourselves, of course!):
- Ruby: everypolitician-ruby and everypolitician-popolo.
- Python: everypolitician and everypolitician-popolo
handy gems we use when getting the data: wikidata-fetcher, wikisnakker, twitter_username_extractor, facebook_username_extractor, twitter_list, scraped_page_archive
gender-balance: repo for the Gender Balance website that crowdsources gender data for EveryPolitician
data_pr_change_summarizer: code used by the bot to review a data PR and leave a helpful summary as a comment

The repos for many of our scrapers are kept separately in github.com/everypolitician-scrapers.

Technical blog

The EveryPolitician bot's own page is a good jumping-off point to lots of semi-technical explanations of what's going on (it has its own blog on Medium). For example:

how the website is built (spoiler: viewer-sinatra → viewer-static)
how webhooks are used (you can easily register your app!)
how the scrapers run (many live on morph.io)

The bot is on twitter as @everypolitician

Contributing

If you have data for us, or know where to get it, please read our page about how to contribute.

Team

EveryPolitician is a mySociety project.

commons-builder's Issues

position-data query doesn't pick up everything legislative/executive index queries do

See e.g. everypolitician/proto-commons-united-kingdom#20 (review).

Also relevant:

Do we need to revisit the superintendencies of Brazil? They're more civil servant than head of government. The pragmatic reason for modelling them this way is that the position metadata query. ACTION: Write a ticket to make the position metadata queries not reliant on roles being a descendant of legislator or head of government

Regression in label service template: only "en" being used.

The label_service.rq.liquid template is no longer including non-"en" languages in its output. e.g., when run against Chile, the queries being generated have the following diff:

diff --git a/executive/index-query-used.rq b/executive/index-query-used.rq
index 4a7e502..b86522f 100644
--- a/executive/index-query-used.rq
+++ b/executive/index-query-used.rq
@@ -44,5 +44,5 @@ SELECT DISTINCT ?executive ?executiveLabel ?adminArea ?adminAreaLabel ?adminArea
     }
   }
 
-  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,es". }
+  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
 } ORDER BY ?primarySort ?country ?adminAreaType ?executive ?position

This will be the result of merging #58, where the template wasn't updated to use config.languages instead of languages.

If a person has a facebook url and multiple memberships, deduplicate the facebook urls

- persons[person_id][:links] << link if link
+ if link
+   persons[person_id][:links] << link unless persons[person_id][:links].include? link
+ end

Check whether the json has labels in all the expected languages

It would be nice to be able to see if there are labels missing for particular elements compared to the languages configured on the repository.
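
A minimal sketch of such a check, assuming each element carries a hash of labels keyed by language code. The `missing_label_languages` helper, the `name` key and the sample data are hypothetical and not part of commons-builder:

```ruby
# Hypothetical helper: report which configured languages have no usable label.
# `labels` is assumed to be a Hash such as { 'en' => 'Chile', 'es' => 'Chile' }.
def missing_label_languages(labels, configured_languages)
  configured_languages.reject do |lang|
    label = labels[lang]
    label.is_a?(String) && !label.strip.empty?
  end
end

# Illustrative data only; Q12345 is a placeholder ID.
element = { 'id' => 'Q12345', 'name' => { 'en' => 'Example', 'es' => '' } }
missing = missing_label_languages(element['name'], %w[en es])
warn "#{element['id']} is missing labels for: #{missing.join(', ')}" unless missing.empty?
```

Treating empty strings as missing is a design choice here; a stricter or looser rule may suit the real data better.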

include ?district in the ORDER BY

For people who have multiple districts (e.g. the Mayor of Rome) we're getting a lot of spurious changes depending on the order in which these rows are returned by the SPARQL query.

[g_l_i] Legislatures should only appear once in the index

Currently, cities which are also FLACSen are appearing twice in the index - they should appear once, as a FLACS.

The index generators should exclude FLACSen that aren't FLACS anymore

An example of this is Chaco (https://www.wikidata.org/wiki/Q15977818) in Paraguay, which has an end date qualifier on its P31 (instance of Department of Paraguay) and has a P576 (dissolved, abolished or demolished).

including seat counts causes Canada to time out

This is the reason for the reversion of 3ebd095

We suspect this is caused by terms that are missing data, e.g. no start date, end date or "replaces", but this needs further investigation.

Ordering of query results bound variables should be consistent

When running queries, individual returned results sometimes have their bound variables in a different order, resulting in unnecessary diffs, which makes reviewing difficult.

These changes are apparent in both query results and the resulting Popolo JSON.
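
One possible approach, sketched under the assumption that each result row is available as a Ruby Hash of variable name to value before it is written out. `normalise_bindings` is a hypothetical name, not an existing commons-builder method:

```ruby
require 'json'

# Hypothetical helper: rebuild every binding with its variables in sorted
# order, so two runs returning the same data serialise identically.
def normalise_bindings(bindings)
  bindings.map { |binding| binding.sort.to_h }
end

# The same row, as it might come back from two different runs.
run_one = [{ 'person' => 'Q1', 'district' => 'Q2' }]
run_two = [{ 'district' => 'Q2', 'person' => 'Q1' }]

puts JSON.generate(normalise_bindings(run_one))
puts JSON.generate(normalise_bindings(run_two))
```

This only fixes the order of variables within a row; the order of the rows themselves still depends on the query's ORDER BY.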

People that died before taking office are erroneously thought current.

Tancredo Neves was the 10th president of Brazil, but died before taking office. As such, he never stopped being president by virtue of never starting, and so is deemed "current" by our current queries.

I think this should be modelled by adding a "start date: no value" qualifier to the P39 to signify "never started", and the executive — and maybe legislative — query(ies) updated to take this into consideration.

@tmtmtmtm suggested off-GitHub using a "subject has role: president-elect" qualifier in this case, which I think is helpfully descriptive, but complicates the query if we also have to look for "governor-elect", "senator-elect", etc.

This would also need documenting within https://www.wikidata.org/wiki/Wikidata:WikiProject_every_politician/Political_data_model

Ignore fictional people

Harriet Jones is picked up as a current Prime Minister of the United Kingdom, despite being an instance of 'fictional human' instead of 'person'. We should ensure all the people we find are instances of 'person'.

If the `boundaries` directory of a repo contains a `build` directory, look inside that for metadata and boundaries directories

See everypolitician/proto-commons-south-korea#3 for an example

Handle the subsequent merging of Wikidata items used in metadata

Add a stage to the build process before it generates the popolo-m17n.json, that checks to see if any of the item IDs in the index.json files or in the boundary CSV files have since been merged, and a command that will update them in that situation. I think we're going to want to use the id-mapping-store to represent somehow that these new items represent the same thing as the old one, but I think that needs some discussion before we act on it.

Hard to debug query timeouts when query not yet written to disk

In Executive.list and Legislature.list, there are two occurrences where the query is run before it is written to disk, which means that if it times out or fails, it's difficult to determine what the failed query was.

This was discovered trying to debug #45.

The query should be recorded before these methods attempt to run it.
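
A rough sketch of the intended ordering, assuming the query text and an output path are both known before execution. `run_recorded_query` and the `runner` callable are hypothetical stand-ins for whatever Executive.list and Legislature.list actually use:

```ruby
require 'tmpdir'

# Hypothetical wrapper: persist the query text first, then run it.
# `runner` stands in for whatever sends the query to the query service.
def run_recorded_query(query, path, runner)
  File.write(path, query)
  runner.call(query)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'index-query-used.rq')
  # Simulate a timeout so the failure path can be seen.
  failing_runner = ->(_query) { raise 'Timed out reading data from server' }
  begin
    run_recorded_query('SELECT * WHERE { ?s ?p ?o } LIMIT 1', path, failing_runner)
  rescue RuntimeError => e
    warn "Query failed (#{e.message}); the query text is in #{path}"
  end
end
```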

if someone has multiple Facebook IDs in Wikidata, they appear twice in persons

Doesn't handle wikidata language codes with a hyphen

The language code gets included in variable names in the generated SPARQL query, which then results in a RestClient::BadRequest: 400 Bad Request error when that query is used in a request to the Wikidata query service.
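
A small sketch of one way round this, assuming the language code is interpolated directly into variable names. The `sparql_variable_suffix` helper and the `?name_` prefix are hypothetical:

```ruby
# Hypothetical helper: turn a Wikidata language code into something that is
# safe to embed in a SPARQL variable name (hyphens are not allowed there).
def sparql_variable_suffix(language_code)
  language_code.gsub(/[^A-Za-z0-9]/, '_')
end

%w[en zh-tw pt-br].each do |lang|
  puts "?name_#{sparql_variable_suffix(lang)}"
end
```

The original code would still be needed wherever the language itself is named, for example in the label service's language list.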

Italian senators for life result in an additional legislative/index.json entry

Italy's senate (https://www.wikidata.org/wiki/Q633872) has two "has part" values, "member of the Italian senate" (https://www.wikidata.org/wiki/Q13653224) and "senatore a vita (senator for life)" (https://www.wikidata.org/wiki/Q826589), with the latter a subclass of the former. This means two entries in legislative/index.json at the moment, which is Bad. I think we want to exclude the latter, and then update the legislative membership query to also consider subclasses of the specific position item.

Generate summaries for each legislature and term

At the moment each term is stored in a folder such as legislative/Q12345/Q67890. Whilst entirely sensible for machines who can quickly parse the index files, this isn't much use to humans who may be interested in what each folder contains.

Proposal

As part of the build process, each of these subfolders (both the legislature and the term) should gain a summary.json (similar to EveryPolitician) with labels and headline statistics, as well as a readme.md with a human readable version of the same.

Why?

The machine readable summaries are immediately useful for downstream tools such as the Commons Explorer, and the human readable version makes pointing interested parties directly at a legislature or term folder on GitHub more feasible.

Support Hong Kong districts

Hong Kong districts (e.g. Central and Western District) aren't FLACSen, but instead are districts of Hong Kong. As it stands, there's no easy way to pull these out.

They're also not related with P17 (country) to the Hong Kong entity, but there is a P131+ chain (located in the territorial administrative entity).

Further note of caution: Hong Kong is a FLACS of China.

Allow update or build of a single legislature or executive

Cross check the indexes of a proto-commons repository

Each position_item_id referenced in legislative/index.json or executive/index.json should appear at least once in boundaries/build/index.json; produce a warning for each one that doesn't appear.
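
A minimal sketch of the comparison, assuming the position item IDs have already been extracted from each index file into plain arrays. The helper name and sample IDs are hypothetical:

```ruby
# Hypothetical helper: list position item IDs that the legislative or
# executive index mentions but the boundaries index does not.
def positions_missing_from_boundaries(index_position_ids, boundary_position_ids)
  index_position_ids.uniq - boundary_position_ids
end

# Placeholder IDs for illustration.
legislative = %w[Q111 Q222]
executive = %w[Q333]
boundaries = %w[Q111 Q333]

positions_missing_from_boundaries(legislative + executive, boundaries).each do |id|
  warn "WARNING: #{id} does not appear in boundaries/build/index.json"
end
```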

Issues with generate_legislative_index for South Korea

Running generate_legislative_index against everypolitician/proto-commons-south-korea results in a file that only lists The National Assembly (Q494162), and none of the FLACSen or cities.

If we loosen the query to also include P1001 (applies to jurisdiction) we'd pick up a few Provincial/City Government → Province/City relationships, but…

These Provincial/City governments are modelled as instances of local government, not legislature or legislative house as we had before. local government is also applied to e.g. Incheon Metropolitan City Office of Education.

I think we should add legislature types to the Provincial/City Governments, alongside their existing local government types.

We can probably cope with the looseness of ?body (wdt:P194/wdt:P527?)|^wdt:P1001 ?legislature, unless anyone thinks wdt:P194 should be the One True Way.

http://tinyurl.com/ybey28qp provides context.

https://gist.github.com/alexsdutton/0eb41f525d916453a0639bc4ea512a06 is legislature/index.json with both of these things loosened. It would unhelpfully include the Offices of Education if they had P1001s, which it would be reasonable for them to have.

Support for "nature of statement: expected" qualifiers on term dates

To support membership dates with more nuance, commons-builder should pay attention to nature of statement: expected qualifiers on term dates, and encode them as expected_start and expected_end properties on memberships attached to those terms.
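
As an illustration only, the encoding could look something like the following. The shape of the `term` hash, the `*_expected` flags and the `start_date`/`end_date` fallback keys are all assumptions; only `expected_start` and `expected_end` come from the description above:

```ruby
# Hypothetical helper: `term` is assumed to look like
# { start: '2016-05-30', end: '2020-05-29', end_expected: true }.
def membership_date_properties(term)
  properties = {}
  if term[:start]
    key = term[:start_expected] ? :expected_start : :start_date
    properties[key] = term[:start]
  end
  if term[:end]
    key = term[:end_expected] ? :expected_end : :end_date
    properties[key] = term[:end]
  end
  properties
end

term = { start: '2016-05-30', end: '2020-05-29', end_expected: true }
puts membership_date_properties(term).inspect
```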

Not enough flexibility for exec/leg index generation for Mexico

For Mexico, the intention is to include the largest nine SLACSen (second-level administrative country subdivisions) and not include any cities, as the cities don't have the legislatures or executives.

The index generation scripts rely upon select_admin_areas_for_country, which currently picks out the country, the FLACSen, and cities with populations over 250k.

We could achieve this by some combination of:

1. Adding additional admin area Wikidata IDs to be included to config.json
2. Adding population thresholds for SLACS and cities, with defaults of ∞ and 250k, varied for Mexico
3. Manually curating Mexico's index.json files to include the required executives and legislatures

I don't like (3) when the assumption is becoming that the index files are generated. (1) is simple and generic, but doesn't support encoding the why of why those admin areas are included (and it's JSON, so no comments).

Enable commons-integrity checks to be run automatically from commons-builder

Presumably when the build or update command is run

Roles with multiple superclasses lead to duplicate membership objects

The President of Brazil has two superclasses: head of government, and president. The superclasses have disjoint inheritances, so we can't exclude one for being in the inheritance graph of the other.

At the moment our executive query returns multiple sets of bindings, one for each role superclass. We need a way to pick the more sensible of the two, and only output one membership object.

This problem may also apply to legislative roles.

where we're using p:/ps: instead of wdt:, remove any deprecated claims

Ability to pull in constituency/area metadata from Wikidata

commons-builder should be able to find constituency information for legislatures from Wikidata, so that:

we can do most of the data processing async from finding boundary shapefiles
we can ensure that the data in the shapefile attributes matches what's in Wikidata more easily (for completeness checks)
we can check the consistency of data in Wikidata

This will involve generating the CSV files (or a revamped form of the same data) and the associations between areas and positions. I'd also like to see seat counts on position/area pairs, so we can check we have enough seats as well as constituencies (though functional and at-large constituencies may complicate this).

Factor out finding executive/legislative positions

The query fragments to find executive and legislative positions should be factored out so that we don't duplicate them, e.g. in constituency queries (#63) and position data queries (#69).

Factor out queries into templates

The queries would be far more maintainable as erb templates, and we wouldn't butt up against Rubocop's class length cop for wikidata_queries (as we are doing for the proposed implementation in #50).

remove Gemfile.lock

@tmtmtmtm pointed out that you shouldn't generally include a Gemfile.lock in a repository that builds a gem: http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/

Accessing missing query template variables should raise an exception

As mentioned in 5463265, the regression in #59 wouldn't have occurred if attempting to access variables missing from the template context raised an exception.
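
To illustrate the failure mode being asked for, here is a plain-Ruby sketch that does not use Liquid at all: `Kernel#format` with a named reference raises `KeyError` when the key is absent, instead of silently rendering nothing. `render_label_service` is a hypothetical function, not the real template code:

```ruby
# Hypothetical stand-in for a template: a named format reference raises
# KeyError when the context lacks the variable.
def render_label_service(context)
  format('SERVICE wikibase:label { bd:serviceParam wikibase:language "%<languages>s". }', context)
end

puts render_label_service(languages: 'en,es')

begin
  # Wrong key, mimicking a mismatch between template and context.
  render_label_service(langs: 'en,es')
rescue KeyError => e
  warn "Template variable missing: #{e.message}"
end
```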

I think this is precluded at the moment because using strict_variables (see Shopify/liquid#804) causes errors as per Shopify/liquid#828. The commit that fixes this (Shopify/liquid@5149cde) is part of a merged PR (Shopify/liquid#829) but isn't in master or the tagged release, and I haven't yet worked out where it went.

The current built queries do not always include expected sub-regional legislatures

For example, in the United Kingdom (https://github.com/everypolitician/proto-commons-united-kingdom/blob/master/legislative/index-query-used.rq) the query does not include any city legislatures which would be expected.

Either queries need to be more open to different ways of modelling, or it needs to be configurable per-country to allow for queries which reflect how a country is actually modelled.

Include information on start dates and expected end dates for terms

[Description by @alexsdutton]

It would be useful downstream to be able to infer expected end dates for memberships. We can facilitate this by including start and (expected) end dates on terms in the legislative index. Downstream consumers can then associate these dates with memberships in the relevant popolo file.

Generate executive/index.json automatically from Wikidata for a given country

A fuller description of this enhancement to come later (Thursday).

http://tinyurl.com/ybkeb9bb is the beginnings of a query; we can borrow the query part for FLACS and cities from #13.

Sort order for position-data query not well-enough defined.

See everypolitician/proto-commons-india#67 (comment), in which it is apparent that ?positionSuperclass doesn't sort consistently.

Expose number of seats on a per-term basis

We'd like to include a % complete metric when viewing membership data for terms in commons-explorer, so the commons-builder query and model should be extended to expose these.

The data model is at https://www.wikidata.org/wiki/Wikidata:WikiProject_every_politician/Political_data_model, and the relevant parts can be found by searching for "number of seats".
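
Once seat counts are exposed, the metric itself is simple arithmetic. A sketch, with a hypothetical `percent_complete` helper that returns nil when no seat count is known:

```ruby
# Hypothetical helper: memberships found as a percentage of known seats.
# Returns nil when the seat count is unknown or zero.
def percent_complete(membership_count, seat_count)
  return nil if seat_count.nil? || seat_count.zero?

  (100.0 * membership_count / seat_count).round(1)
end

puts percent_complete(325, 650)
```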

Null parent_ids should be warned about

The only area that should have a null parent_id is the country area, but often we have other areas with null parent_ids slipping through. This should be an easy thing to check.
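
A sketch of the check, assuming areas are available as hashes with `id` and `parent_id` keys and that the country's own area ID is known. The helper name and sample IDs are hypothetical:

```ruby
# Hypothetical helper: find areas with a null parent_id other than the
# country area itself.
def areas_with_unexpected_null_parent(areas, country_id)
  areas.select { |area| area['parent_id'].nil? && area['id'] != country_id }
end

# Placeholder data for illustration.
areas = [
  { 'id' => 'Q1', 'parent_id' => nil },
  { 'id' => 'Q2', 'parent_id' => 'Q1' },
  { 'id' => 'Q3', 'parent_id' => nil }
]

areas_with_unexpected_null_parent(areas, 'Q1').each do |area|
  warn "WARNING: area #{area['id']} has a null parent_id"
end
```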

Overrideable defaults for various modelling deviations

We have a number of cases where the data we're interested in for a particular country doesn't quite follow what we normally do. These include:

Wanting to pull out regional representation for something other than FLACS (#49, #53)
Varying population thresholds for cities (Brazil, to 1m)
Filtering by something other than country for relevant regional and local representation (#53)

I think therefore that there should be a few overrideable config options in config.json, that we can extend as we go:

regional admin area superclass (default of FLACS, which we can vary to "district of Hong Kong" for Hong Kong, and "SLACS" for Brazil)
city population threshold (default of 250k, which we can up to 1m for Brazil)
additional admin area IDs (default of [], and it's already supported on a branch, but it would be implemented in a way consistent with the new config options)

The defaults would live in Commons::Builder::Config. The Config object would be passed to the query generation methods (and the additional_admin_area_ids would be picked up from that, instead of being passed explicitly).
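
A minimal sketch of how the defaults and per-country overrides might combine. The key names and the `default_config` and `load_config` functions are hypothetical, and the real options would be decided in Commons::Builder::Config:

```ruby
require 'json'

# Hypothetical defaults; 'FLACS' stands in for whatever identifier the
# real implementation would use.
def default_config
  {
    'regional_admin_area_superclass' => 'FLACS',
    'city_population_threshold' => 250_000,
    'additional_admin_area_ids' => []
  }
end

# Values in a country's config.json take precedence over the defaults.
def load_config(json_text)
  default_config.merge(JSON.parse(json_text))
end

brazil = load_config('{"regional_admin_area_superclass": "SLACS", "city_population_threshold": 1000000}')
puts brazil.inspect
```

A shallow merge is enough while every option is a scalar or a flat list; nested options would need a deeper merge.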

Configure threshold for inclusion of cities/regions on a per-country basis.

In a country's config.json it should be possible to specify the population threshold at which a city or region is considered for inclusion.

commons-builder/lib/commons/builder/queries/select_admin_areas_for_country.rq.liquid

Line 17 in e21c67c

FILTER (?population > 250000)

Term-specific position item IDs shouldn't appear in output

Countries that have term-specific position items end up with these items' IDs in the output, when we'd prefer the generic one. Using the generic one in the output would then match up against a static position item ID in the boundary index data, and makes the output more consistent for consumers, who we want to shield from the vagaries of term-specific positions.

Admin areas that have been dissolved are not ignored

For example, our executive index query for India returns results for Province of East Punjab, which was dissolved, and whose inclusion leads to unnecessary warnings.

Executive.list breaks when faced with multiple positions for one exec

In Northern Ireland, we — rightly or wrongly — pick up two HoG positions, the First Minister and the Deputy First Minister. This leads to Executive.list creating a new Executive for each and then trying to order them. As the Executive#executive_item_id is the same, it tries to order the positions, which are incomparable.

commons-builder/lib/commons/builder/executive.rb

Line 63 in 40886d4

executives_unsorted.sort_by { |h| [h.executive_item_id, h.positions] }

Note: It's not enough to just implement Position#<=>, as it would still be returning duplicate Executives, when there should be one Executive with multiple positions.
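
The grouping described in the note could be sketched as follows, using plain hashes rather than the real Executive and Position classes. The row shape and the helper name are assumptions:

```ruby
# Hypothetical helper: `rows` are assumed to be hashes with
# :executive_item_id and :position_item_id keys, one per result row.
def group_positions_by_executive(rows)
  grouped = rows.group_by { |row| row[:executive_item_id] }
  executives = grouped.map do |executive_id, group|
    {
      executive_item_id: executive_id,
      position_item_ids: group.map { |row| row[:position_item_id] }.uniq.sort
    }
  end
  # Sorting on the ID alone avoids comparing the positions themselves.
  executives.sort_by { |executive| executive[:executive_item_id] }
end

# Placeholder IDs for illustration.
rows = [
  { executive_item_id: 'Q200', position_item_id: 'Q20' },
  { executive_item_id: 'Q100', position_item_id: 'Q12' },
  { executive_item_id: 'Q100', position_item_id: 'Q11' }
]
group_positions_by_executive(rows).each { |executive| puts executive.inspect }
```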

generate_executive_index often times out

The SPARQL query behind generate_executive_index often times out. Until now it's been possible to run it again and mostly have it succeed, but this isn't particularly sustainable or reliable.

The query should be simplified or split, or otherwise changed to give a higher probability of it working first time.

Example when run against Mexico (Q96):

bundler: failed to load command: generate_executive_index (/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index)
RestClient::Exceptions::ReadTimeout: Timed out reading data from server
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:733:in `rescue in transmit'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:647:in `transmit'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient.rb:71:in `post'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/lib/commons/builder/wikidata.rb:16:in `perform'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/lib/commons/builder/executive.rb:31:in `list'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/bin/generate_executive_index:11:in `<top (required)>'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index:23:in `load'
  /home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index:23:in `<top (required)>'

Handle legislatures where membership is expressed as being for a specific term.

This is the case in Estonia (e.g. https://www.wikidata.org/wiki/Q616356) and the UK (https://www.wikidata.org/wiki/Q234182).

Use executive position_ids to generate the role in wikidata queries, not the superclass

As @mhl noted in e59dc73, we're now specifying the specific roles for executive positions in executive/index.json, e.g. 'Mayor of Busan', rather than the superclass 'mayor'. However, the queries set that position_id as the superclass, meaning they'll usually return data where the superclass is the same as the position itself. Update the queries so that they set the role, not the superclass, so we get a useful superclass from them.

Legislatures and executives are only included in output popolo if there are memberships associated with them

Like area information, we want to include all legislatures as organizations, whether or not there are currently memberships associated with them.

Change COPYRIGHT to LICENCE

Currently License conditions are recorded in the same directory as shapefiles - i.e. in ./Boundaries//-COPYRIGHT
I think this should be changed to 'LICENCE' as it more accurately describes what is contained and, while perhaps a small thing, could affect how 'open' this data is seen to be.

Output a warning message if any id keys appear more than once in the build process output

This would be if an ID appears more than once within a JSON output file.
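
A sketch of the duplicate check for one collection of items, assuming each item is a hash with an `id` key. The `duplicate_ids` helper and sample data are hypothetical:

```ruby
# Hypothetical helper: return every 'id' value that occurs more than once.
def duplicate_ids(items)
  items.group_by { |item| item['id'] }
       .select { |_id, group| group.size > 1 }
       .keys
end

# Placeholder data for illustration.
persons = [{ 'id' => 'Q1' }, { 'id' => 'Q2' }, { 'id' => 'Q1' }]
duplicate_ids(persons).each do |id|
  warn "WARNING: id #{id} appears more than once"
end
```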

Generate legislative/index.json automatically from Wikidata for a given country

Currently the legislative/index.json in proto-commons repos is being authored by hand in bits as part of the process of including each new directory of boundary files. We would like to generate it automatically from Wikidata to the extent that that's possible. Our first thought is that the mechanism for doing this should be a script in the bin directory of this repo.

In general, we think this is going to involve using the country Wikidata ID specified in the config.json file of the proto-commons repository and formulating a set of queries that use that as a starting point to find the Wikidata IDs and names of various associated entities (legislatures at various different levels, and the roles and terms associated with them).

The file should contain an item for the national level legislature(s), the legislature associated with each first level administrative country subdivision (FLACS) for the country, and an item for every city with a population over 250k people.

The first thing that this script will need to do is to find the national level legislature(s) for the country. There should be Wikidata queries already defined in the Legislative Explorer that can be used or adapted for this. It also will need to find the Wikidata item for the role of being a member of that legislature (Owen can give some guidance on useful queries here). It will also need to get term information if appropriate, and if not, then fall back to using a start and end date. There's a good example term query here. For identifying the FLACS, and their legislatures, it looks like again, the Legislative Explorer should have some starting queries that can be used or modified. Owen should be able to help with a query to find from wikidata the cities with population over 250k.