everypolitician / commons-builder
Build scripts for Democratic Commons repositories
License: MIT License
The only area that should have a null `parent_id` is the country area, but often we have other areas with null `parent_id`s slipping through. This should be an easy thing to check.
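A minimal sketch of such a check, assuming areas are available as hashes with `:id` and `:parent_id` keys. The `orphaned_areas` helper, the hash shape, and the IDs are hypothetical, not part of commons-builder:

```ruby
# Hypothetical helper: returns every area with a null parent_id,
# other than the country area itself.
def orphaned_areas(areas, country_id)
  areas.select { |area| area[:parent_id].nil? && area[:id] != country_id }
end

# Placeholder data; the IDs are illustrative only.
areas = [
  { id: 'country', parent_id: nil },        # the country: allowed to have no parent
  { id: 'region-a', parent_id: 'country' },
  { id: 'region-b', parent_id: nil },       # slipped through
]

orphans = orphaned_areas(areas, 'country')
orphans.each { |area| warn "Area #{area[:id]} has a null parent_id" }
```

Emitting warnings rather than raising keeps the build going while still surfacing the problem areas.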
Add a stage to the build process before it generates the popolo-m17n.json, that checks to see if any of the item IDs in the index.json files or in the boundary CSV files have since been merged, and a command that will update them in that situation. I think we're going to want to use the id-mapping-store to represent somehow that these new items represent the same thing as the old one, but I think that needs some discussion before we act on it.
Countries that have term-specific position items end up with these items' IDs in the output, when we'd prefer the generic one. Using the generic one in the output would then match up against a static position item ID in the boundary index data, and makes the output more consistent for consumers, who we want to shield from the vagaries of term-specific positions.
Presumably when the `build` or `update` command is run.
The President of Brazil has two superclasses: head of government, and president. The superclasses have disjoint inheritances, so we can't exclude one for being in the inheritance graph of the other.
At the moment our executive query returns multiple sets of bindings, one for each role superclass. We need a way to pick the more sensible of the two, and only output one membership object.
This problem may also apply to legislative roles.
As mentioned in 5463265, the regression in #59 wouldn't have occurred if attempting to access variables missing from the template context raised an exception.
I think this is precluded at the moment because using `strict_variables` (see Shopify/liquid#804) causes errors as per Shopify/liquid#828. The commit that fixes this (Shopify/liquid@5149cde) is part of a merged PR (Shopify/liquid#829) but isn't in master or the tagged release, and I haven't yet worked out where it went.
This is the case in Estonia (e.g. https://www.wikidata.org/wiki/Q616356) and the UK (https://www.wikidata.org/wiki/Q234182).
In Northern Ireland, we, rightly or wrongly, pick up two HoG positions, the First Minister and the Deputy First Minister. This leads to `Executive.list` creating a new `Executive` for each and then trying to order them. As the `Executive#executive_item_id` is the same, it tries to order the positions, which are incomparable.
Note: It's not enough to just implement `Position#<=>`, as it would still be returning duplicate `Executive`s, when there should be one `Executive` with multiple positions.
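One way to get a single executive with several positions is to group the query rows by executive item ID before building objects. The sketch below assumes rows are hashes with `:executive_item_id` and `:position_item_id` keys; `SketchExecutive`, `executives_from`, and the IDs are hypothetical stand-ins, not the real `Executive` class:

```ruby
# Hypothetical stand-in for the real Executive class.
SketchExecutive = Struct.new(:executive_item_id, :positions)

# Groups rows so that each executive appears once, carrying all its positions.
def executives_from(rows)
  grouped = rows.group_by { |row| row[:executive_item_id] }
  executives = grouped.map do |item_id, group|
    positions = group.map { |row| row[:position_item_id] }.uniq.sort
    SketchExecutive.new(item_id, positions)
  end
  # Sort on the ID only, so positions never need to be compared.
  executives.sort_by(&:executive_item_id)
end

# Placeholder IDs, for illustration only.
rows = [
  { executive_item_id: 'Q-executive', position_item_id: 'Q-first-minister' },
  { executive_item_id: 'Q-executive', position_item_id: 'Q-deputy-first-minister' },
]

executives = executives_from(rows)
```

Because the grouping happens before any ordering, the incomparable positions are never passed to a sort across executives.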
Currently License conditions are recorded in the same directory as shapefiles - i.e. in ./Boundaries//-COPYRIGHT
I think this should be changed to 'LICENCE' as it more accurately describes what is contained and, while perhaps a small thing, could affect how 'open' this data is seen to be.
As @mhl noted in e59dc73, we're now specifying the specific roles for executive positions in `executive/index.json`, e.g. 'Mayor of Busan', rather than the superclass 'mayor'. However, the queries set that `position_id` as the superclass, meaning they'll usually return data where the superclass is the same as the position itself. Update the queries so that they set the role, not the superclass, so we get a useful superclass from them.
At the moment each term is stored in a folder such as `legislative/Q12345/Q67890`. Whilst entirely sensible for machines that can quickly parse the `index` files, this isn't much use to humans who may be interested in what each folder contains.
As part of the build process, each of these subfolders (both the legislature and the term) should gain a `summary.json` (similar to EveryPolitician) with labels and headline statistics, as well as a `readme.md` with a human readable version of the same.
The machine readable summaries are immediately useful for downstream tools such as the Commons Explorer, and the human readable version makes pointing interested parties directly at a legislature or term folder on GitHub more feasible.
The queries would be far more maintainable as erb templates, and we wouldn't butt up against Rubocop's class length cop for `wikidata_queries` (as we are doing for the proposed implementation in #50).
The `label_service.rq.liquid` template is no longer including non-"en" languages in its output. For example, when run against Chile, the queries being generated have the following diff:
```diff
diff --git a/executive/index-query-used.rq b/executive/index-query-used.rq
index 4a7e502..b86522f 100644
--- a/executive/index-query-used.rq
+++ b/executive/index-query-used.rq
@@ -44,5 +44,5 @@ SELECT DISTINCT ?executive ?executiveLabel ?adminArea ?adminAreaLabel ?adminArea
 }
 }
- SERVICE wikibase:label { bd:serviceParam wikibase:language "en,es". }
+ SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
 } ORDER BY ?primarySort ?country ?adminAreaType ?executive ?position
```
This will be the result of merging #58, where the template wasn't updated to use `config.languages` instead of `languages`.
To support membership dates with more nuance, `commons-builder` should pay attention to `nature of statement: expected` qualifiers on term dates, and encode them as `expected_start` and `expected_end` properties on memberships attached to those terms.
The SPARQL query behind `generate_executive_index` often times out. Until now it's been possible to run it again and mostly have it succeed, but this isn't particularly sustainable or reliable.
The query should be simplified or split, or otherwise changed to give a higher probability of it working first time.
Example when run against Mexico (Q96):
```
bundler: failed to load command: generate_executive_index (/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index)
RestClient::Exceptions::ReadTimeout: Timed out reading data from server
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:733:in `rescue in transmit'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:647:in `transmit'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/rest-client-2.0.2/lib/restclient.rb:71:in `post'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/lib/commons/builder/wikidata.rb:16:in `perform'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/lib/commons/builder/executive.rb:31:in `list'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bundler/gems/commons-builder-f7551dc64c31/bin/generate_executive_index:11:in `<top (required)>'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index:23:in `load'
/home/user/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/bin/generate_executive_index:23:in `<top (required)>'
```
This would be if an ID appears more than once within a JSON output file.
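A rough sketch of such a check, assuming the output file holds arrays of objects that each carry an `id` key. The `duplicate_ids` helper and the sample document are hypothetical:

```ruby
require 'json'

# Hypothetical helper: returns each ID that occurs more than once in a list of items.
def duplicate_ids(items)
  items.map { |item| item['id'] }
       .group_by(&:itself)
       .select { |_id, occurrences| occurrences.size > 1 }
       .keys
end

# Placeholder document; the real output files may be shaped differently.
doc = JSON.parse('{"persons": [{"id": "Q1"}, {"id": "Q2"}, {"id": "Q1"}]}')
dups = duplicate_ids(doc['persons'])
dups.each { |id| warn "ID #{id} appears more than once" }
```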
Tancredo Neves was the 10th president of Brazil, but died before taking office. As such, he never stopped being president by virtue of never starting, and so is deemed "current" by our current queries.
I think this should be modelled by adding a "start date: no value" qualifier to the P39 to signify "never started", and the executive — and maybe legislative — query(ies) updated to take this into consideration.
@tmtmtmtm suggested off-GitHub using a "subject has role: president-elect" qualifier in this case, which I think is helpfully descriptive, but complicates the query if we also have to look for "governor-elect", "senator-elect", etc.
This would also need documenting within https://www.wikidata.org/wiki/Wikidata:WikiProject_every_politician/Political_data_model
Currently the `legislative/index.json` in `proto-commons` repos is being authored by hand in bits as part of the process of including each new directory of boundary files. We would like to generate it automatically from Wikidata to the extent that that's possible. Our first thought is that the mechanism for doing this should be a script in the `bin` directory of this repo.
In general, we think this is going to involve using the country Wikidata ID specified in the `config.json` file of the `proto-commons` repository and formulating a set of queries that use that as a starting point to find the Wikidata IDs and names of various associated entities (legislatures at various different levels, and the roles and terms associated with them).
The file should contain an item for the national level legislature(s), the legislature associated with each first level administrative country subdivision (FLACS) for the country, and an item for every city with a population over 250k people.
The first thing that this script will need to do is to find the national level legislature(s) for the country. There should be Wikidata queries already defined in the Legislative Explorer that can be used or adapted for this. It will also need to find the Wikidata item for the role of being a member of that legislature (Owen can give some guidance on useful queries here). It will also need to get term information if appropriate, and if not, then fall back to using a start and end date. There's a good example term query here. For identifying the FLACS and their legislatures, it looks like the Legislative Explorer should again have some starting queries that can be used or modified. Owen should be able to help with a query to find the cities with population over 250k from Wikidata.
See everypolitician/proto-commons-india#67 (comment), in which it is apparent that `?positionSuperclass` doesn't sort consistently.
We have a number of cases where the data we're interested in for a particular country doesn't quite follow what we normally do. These include:
I think therefore that there should be a few overrideable config options in `config.json`, that we can extend as we go:
The defaults would live in `Commons::Builder::Config`. The `Config` object would be passed to the query generation methods (and the `additional_admin_area_ids` would be picked up from that, instead of being passed explicitly).
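As a loose sketch of defaults merged with per-repository overrides: `SketchConfig` below is a hypothetical stand-in and not the real `Commons::Builder::Config`. The `languages` and `additional_admin_area_ids` names appear in these issues; the `city_population_threshold` key name and the default values are assumptions made for illustration.

```ruby
require 'json'

# Hypothetical stand-in for Commons::Builder::Config.
class SketchConfig
  # Assumed defaults; 'city_population_threshold' is an invented key name.
  DEFAULTS = {
    'languages' => ['en'],
    'additional_admin_area_ids' => [],
    'city_population_threshold' => 250_000,
  }.freeze

  def initialize(overrides = {})
    # Values from config.json win over the defaults.
    @values = DEFAULTS.merge(overrides)
  end

  def self.from_json(json)
    new(JSON.parse(json))
  end

  def languages
    @values.fetch('languages')
  end

  def additional_admin_area_ids
    @values.fetch('additional_admin_area_ids')
  end

  def city_population_threshold
    @values.fetch('city_population_threshold')
  end
end

config = SketchConfig.from_json('{"languages": ["en", "es"]}')
```

Passing one object like this to the query generation methods means new options can be added without changing every method signature.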
Like area information, we want to include all legislatures as organizations, whether or not there are currently memberships associated with them.
Italy's senate (https://www.wikidata.org/wiki/Q633872) has two `has part`s, "member of the Italian senate" (https://www.wikidata.org/wiki/Q13653224) and "senatore a vita (senator for life)" (https://www.wikidata.org/wiki/Q826589), with the latter a subclass of the former. This means two entries in `legislative/index.json` at the moment, which is Bad. I think we want to exclude the latter, and then update the legislative membership query to also consider subclasses of the specific position item.
See everypolitician/proto-commons-south-korea#3 for an example
In `Executive.list` and `Legislature.list`, there are two occurrences where the query is run before it is written to disk, which means that if it times out or fails, it's difficult to determine what the failed query was.
This was discovered trying to debug #45.
The query should be recorded before these methods attempt to run it.
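A minimal sketch of the ordering, assuming the query is a string and that something callable executes it. `run_recorded_query` and the `runner` argument are hypothetical, not the existing `Executive.list` or `Legislature.list` code:

```ruby
require 'tmpdir'

# Hypothetical wrapper: writes the query to disk first, then runs it,
# so the query file exists even if execution raises.
def run_recorded_query(query, path, runner)
  File.write(path, query)
  runner.call(query)
end

# Demonstration with a runner that always fails, standing in for a timeout.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'index-query-used.rq')
  failing_runner = ->(_query) { raise 'simulated timeout' }
  begin
    run_recorded_query('SELECT * WHERE { ?s ?p ?o } LIMIT 1', path, failing_runner)
  rescue RuntimeError => e
    warn "Query failed (#{e.message}); see #{path}"
  end
end
```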
See e.g. everypolitician/proto-commons-united-kingdom#20 (review).
Also relevant:
Do we need to revisit superintendencies of Brazil? They’re more civil servant than head of government. The pragmatic reason for modelling them this way is that the position metadata query. ACTION: Write a ticket to make the position metadata queries not reliant on roles being a descendant of legislator or head of government.
Harriet Jones is picked up as a current Prime Minister of the United Kingdom, despite being an instance of 'fictional human' instead of 'person'. We should ensure all the people we find are instances of 'person'.
Hong Kong districts (e.g. Central and Western District) aren't FLACSen, but instead are districts of Hong Kong. As it stands, there's no easy way to pull these out.
They're also not related with P17 (country) to the Hong Kong entity, but there is a P131+ chain (located in the territorial administrative entity).
Further note of caution: Hong Kong is a FLACS of China.
In a country's `config.json` it should be possible to specify the population threshold at which a city or region is considered for inclusion.
Currently, cities which are also FLACSen are appearing twice in the index - they should appear once, as a FLACS.
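One possible approach, sketched under the assumption that the FLACS and city IDs are already available as two arrays. The `dedupe_areas` helper and the IDs are hypothetical:

```ruby
# Hypothetical helper: an area that is both a FLACS and a city is kept
# only in the FLACS list.
def dedupe_areas(flacs_ids, city_ids)
  flacs = flacs_ids.uniq
  { flacs: flacs, cities: city_ids.uniq - flacs }
end

# Placeholder IDs; 'Q-b' is both a FLACS and a city.
areas = dedupe_areas(%w[Q-a Q-b], %w[Q-b Q-c])
```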
An example of this is Chaco (https://www.wikidata.org/wiki/Q15977818) in Paraguay, which has an end date qualifier on its P31 (instance of Department of Paraguay) and has a P576 (dissolved, abolished or demolished).
For people who have multiple districts (e.g. the Mayor of Rome) we're getting a lot of spurious changes depending on the order of these rows being returned by the SPARQL query.
@tmtmtmtm pointed out that you shouldn't generally include a Gemfile.lock in a repository that builds a gem: http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
It would be nice to be able to see if there are labels missing for particular elements compared to the languages configured on the repository.
We'd like to include a % complete metric when viewing membership data for terms in commons-explorer, so the commons-builder query and model should be extended to expose these.
The data model is at https://www.wikidata.org/wiki/Wikidata:WikiProject_every_politician/Political_data_model, and the relevant parts can be found by searching for "number of seats".
When running queries, individual returned results sometimes have their bound variables in a different order, resulting in unnecessary diffs, which makes reviewing difficult.
These changes are apparent in both query results and the resulting Popolo JSON.
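A sketch of one way to normalise results before writing them out, assuming each result row is a hash of variable name to value. The `stable_rows` helper and the sample rows are hypothetical:

```ruby
# Hypothetical helper: orders the keys within each row, then orders the rows
# themselves by their (key, value) pairs, so output no longer depends on the
# order in which the query service returned things.
def stable_rows(rows)
  rows
    .map { |row| row.sort.to_h }
    .sort_by { |row| row.map { |key, value| [key.to_s, value.to_s] } }
end

# The same two results, returned in different row and variable orders.
first_run  = [{ person: 'Q2', district: 'Q9' }, { district: 'Q8', person: 'Q1' }]
second_run = [{ person: 'Q1', district: 'Q8' }, { district: 'Q9', person: 'Q2' }]

normalised_first  = stable_rows(first_run)
normalised_second = stable_rows(second_run)
```

Converting values with `to_s` keeps the sort working when a variable is unbound (`nil`) in some rows.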
For example, our executive index query for India returns results for Province of East Punjab, which was dissolved, and whose inclusion leads to unnecessary warnings.
Running `generate_legislative_index` against everypolitician/proto-commons-south-korea results in a file that only lists The National Assembly (Q494162), and none of the FLACSen or cities.
If we loosen the query to also include P1001 (applies to jurisdiction) we'd pick up a few Provincial/City Government → Province/City relationships, but…
These Provincial/City governments are modelled as instances of local government, not legislature or legislative house as we had before. local government is also applied to e.g. Incheon Metropolitan City Office of Education.
I think we should add legislature types to the Provincial/City Governments, alongside their existing local government types.
We can probably cope with the looseness of `?body (wdt:P194/wdt:P527?)|^wdt:P1001 ?legislature`, unless anyone thinks `wdt:P194` should be the One True Way.
http://tinyurl.com/ybey28qp provides context.
https://gist.github.com/alexsdutton/0eb41f525d916453a0639bc4ea512a06 is `legislature/index.json` with both of these things loosened. It would unhelpfully include the Offices of Education if they had P1001s, which it would be reasonable for them to have.
```diff
- persons[person_id][:links] << link if link
+ if link
+   persons[person_id][:links] << link unless persons[person_id][:links].include? link
+ end
```
commons-builder should be able to find constituency information for legislatures from Wikidata, so that:
This will involve generating the CSV files (or a revamped form of the same data) and the associations between areas and positions. I'd also like to see seat counts on position/area pairs, so we can check we have enough seats as well as constituencies (though functional and at-large constituencies may complicate this).
This is the reason for the reversion of 3ebd095
We suspect this is caused by terms that are missing data, e.g. no start date, end date or "replaces", but it needs further investigation.
For Mexico, the intention is to include the largest nine SLACSen (second-level administrative country subdivisions) and not include any cities, as the cities don't have the legislatures or executives.
The index generation scripts rely upon `select_admin_areas_for_country`, which currently picks out the country, the FLACSen, and cities with populations over 250k.
We could achieve this by some combination of:
config.json
index.json
files to include the required executives and legislatures.
I don't like (3) when the assumption is becoming that the index files are generated. (1) is simple and generic, but doesn't support encoding the why of why those admin areas are included (and it's JSON, so no comments).
A fuller description of this enhancement to come later (Thursday).
http://tinyurl.com/ybkeb9bb is the beginnings of a query; we can borrow the query part for FLACS and cities from #13.
This gets included in variable names in the generated SPARQL query, which then results in a `RestClient::BadRequest: 400 Bad Request` error when that query is used in a request to the Wikidata Query Service.
[Description by @alexsdutton]
It would be useful downstream to be able to infer expected end dates for memberships. We can facilitate this by including start and (expected) end dates on terms in the legislative index. Downstream consumers can then associate these dates with memberships in the relevant popolo file.
Each `position_item_id` referenced in `legislative/index.json` or `executive/index.json` should appear at least once in `boundaries/build/index.json`; produce a warning for each one that doesn't appear.
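A rough sketch of the comparison, assuming the position item IDs have already been extracted from each index file into plain arrays (the real JSON structures are not shown here). The `missing_position_ids` helper and the IDs are hypothetical:

```ruby
# Hypothetical helper: position item IDs referenced by the legislative or
# executive indexes that never appear in the boundary index.
def missing_position_ids(referenced_ids, boundary_ids)
  referenced_ids.uniq - boundary_ids
end

# Placeholder IDs, for illustration only.
referenced = %w[Q-mp Q-mayor Q-mp]
in_boundaries = %w[Q-mp]

missing = missing_position_ids(referenced, in_boundaries)
missing.each do |id|
  warn "position_item_id #{id} does not appear in boundaries/build/index.json"
end
```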
For example, in the United Kingdom (https://github.com/everypolitician/proto-commons-united-kingdom/blob/master/legislative/index-query-used.rq) the query does not include any city legislatures which would be expected.
Either queries need to be more open to different ways of modelling, or it needs to be configurable per-country to allow for queries which reflect how a country is actually modelled.