Ensure id present for all Greek public bodies

@okfngr just noticed that in PR #43 a lot of public bodies were missing the key field (now called id).

Would it be possible to generate and add an id field to all records? The id field is required: the frontend cannot work without it.

We also seem to be missing jurisdiction codes (which are required) for:

gr/dpa
gr/adae
gr/asep
gr/esr
gr/synigoros
gr/minedu
gr/neagenia
gr/gsae
gr/culture
gr/gss
gr/gsrt
gr/minedu
gr/minedu
gr/minedu
gr/gak
gr/gak
gr/iky
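
A check along these lines could flag such records before they reach the frontend. This is only a minimal sketch, assuming the CSV rows have already been parsed into plain objects; the function name findMissingFields and the sample rows are hypothetical, not part of the project.

```javascript
// Hypothetical helper: report which required fields are empty on each record.
// Assumes rows were already parsed from CSV into plain objects.
function findMissingFields(records, requiredFields) {
  const problems = [];
  records.forEach((record, index) => {
    const missing = requiredFields.filter((field) => {
      const value = record[field];
      return value === undefined || value === null || String(value).trim() === "";
    });
    if (missing.length > 0) {
      // Rows are numbered from 1, not counting a header row.
      problems.push({ row: index + 1, missing });
    }
  });
  return problems;
}

// Illustrative rows only, not the real contents of the Greek CSV.
const sampleRows = [
  { id: "gr/dpa", jurisdiction_code: "" },
  { id: "", jurisdiction_code: "GR" },
  { id: "gr/iky", jurisdiction_code: "GR" },
];

console.log(findMissingFields(sampleRows, ["id", "jurisdiction_code"]));
```

Running something like this over each jurisdiction file would give a list of rows to fix, rather than relying on spotting the gaps by eye.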

Add the Italian (IT) public bodies from the Public Administration Index

Add the list of Italian public administrations using the data available in the CSV maintained by the Italian Public Administration Index.

Organisation identifiers (for discussion)

This is an idea that I've been thinking about for a while. I discussed it with @rgrp a couple of weeks ago and wanted to share it with the list to see what everyone thinks.

The short version: could public bodies be used to generate usable organisation identifiers?

Background

The IATI Standard is an XML-based format for sharing detailed information about aid projects. Fundamentally, the model shows resource flows from one organisation to another, with various classifications in between and many financial transactions as part of each project. For example:

activity (DFID -> World Health Organisation)
  - transaction (GBP 500 disbursed on 2013-05-01)
  - transaction (GBP 500 disbursed on 2013-07-05)

For the private sector and NGOs, the methodology for uniquely identifying organisations is:

Jurisdiction-National registration body-Number
e.g. for Oxfam GB, registered at the Charity Commission, with reg number 202918:
GB-CHC-202918

For governments, the following methodology is used:
Jurisdiction-OECD/DAC Agency code
e.g. for the UK's Department for International Development:
GB-1

For multilaterals, we use the following methodology:
OECD/DAC Channel code
e.g. for the World Bank's International Development Association (IDA):
44002

Problems

Agency codes

Agency codes only include donor agencies. So the Ministry of Finance in Botswana, for example, does not have a code.
Agency codes don't even include all donor agencies: for example, parts of the European Commission or the United States, even though they give aid, don't have their own identifier - they're categorised under Miscellaneous.
The process for adding new agency codes is slow (even if it took a day, that might be too long).

Channel codes

Channel codes only contain a subset of all of the multilateral / international / intergovernmental organisations in the world, and many of them are not listed in a very usable way. For example, the World Health Organisation has two codes:
a) World Health Organisation - core voluntary contributions account
b) World Health Organisation - assessed contributions
--> but there isn't one for just "World Health Organisation", for example if you're contracting them to deliver a project.

Many organisations publishing IATI data will therefore struggle to provide unique organisation identifiers for many of the public sector / international organisations that they are working with.

Rationale

Official lists of organisations should be used if possible.
Official lists of organisations don't exist in most cases.
The exact identifier assigned to an organisation is not fundamentally important (whether it's BW-1 or BW-21, the Botswana Ministry of Finance just needs a code).
Organisation identifiers should be cross-mapped to other codes / identifiers for those organisations so that the data is easily interoperable.

Proposal

Fuzzy reconciliation / text matching of organisations, with an API that assigns an existing identifier where available, and creates a new one where it's not available

Organisations (initially, preferably those with a large amount of data) throw four key pieces of data at the API:

organisation name (text) - e.g. MINISTRY OF FINANCE
organisation country (code) - e.g. BW (for Botswana)
language (code) - e.g. en
last recorded transaction with this organisation (date) - e.g. 2013-07-05

The API responds with one of the following (possibly using HTTP status codes?):
a) Organisation found => use code BW-1
b) Organisation not found => created code BW-21

It also stores the data about the last recorded transaction, so that other people know that the organisation may have existed on that date.
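
To make the lookup-or-create behaviour concrete, here is a rough in-memory sketch. It is an illustration under stated assumptions, not a proposed implementation: the class name IdentifierRegistry, its method names and the "found" / "created" status strings are all hypothetical, and matching is exact on a normalised name rather than fuzzy. New codes are simply numbered from 1 per country, so the BW-21 in the example above would not be reproduced.

```javascript
// In-memory sketch of the proposed lookup-or-create behaviour.
// Matching here is exact on a normalised name, NOT fuzzy; a real service
// would need a proper reconciliation step.
class IdentifierRegistry {
  constructor() {
    this.byKey = new Map();      // "BW|ministry of finance" -> entry
    this.nextNumber = new Map(); // "BW" -> next free number
  }

  // Trim, lowercase and collapse whitespace so trivial variants match.
  static normalise(name) {
    return name.trim().toLowerCase().replace(/\s+/g, " ");
  }

  // Returns { status, code } where status is "found" or "created".
  reconcile({ name, country, language, lastTransaction }) {
    const key = `${country}|${IdentifierRegistry.normalise(name)}`;
    let entry = this.byKey.get(key);
    let status = "found";
    if (!entry) {
      const number = this.nextNumber.get(country) || 1;
      this.nextNumber.set(country, number + 1);
      entry = { code: `${country}-${number}`, language, lastSeen: lastTransaction };
      this.byKey.set(key, entry);
      status = "created";
    } else if (lastTransaction > entry.lastSeen) {
      // ISO 8601 dates (YYYY-MM-DD) compare correctly as strings.
      entry.lastSeen = lastTransaction;
    }
    return { status, code: entry.code };
  }
}

const registry = new IdentifierRegistry();
const first = registry.reconcile({
  name: "MINISTRY OF FINANCE", country: "BW", language: "en", lastTransaction: "2013-05-01",
});
const second = registry.reconcile({
  name: "Ministry of  Finance", country: "BW", language: "en", lastTransaction: "2013-07-05",
});
console.log(first, second);
```

If HTTP status codes were used to distinguish the two outcomes, 200 (OK) for "found" and 201 (Created) for "created" would be natural candidates, but that is a design choice still open for discussion.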

Another source could be Charts of Accounts, existing lists (like those that exist on PB already), budget documents, and structured spending data, e.g. from OpenSpending.

Dealing with duplicates

This will probably lead to some duplicates being created. There could be some manual reconciliation for this. Organisations could have a primary identifier and several secondary identifiers that were used by duplicate organisations.

Dealing with changing organisations

Organisations can be created / deleted / merged in the real world. This should probably lead to:
a) created - a new identifier gets created;
b) merged - a new identifier gets created for the new organisation; and (manually) the old organisations are linked / related to the new organisation;
c) deleted - the identifier continues to exist, because old (and possibly future) data will still refer to it. However, it should be (manually) marked as no longer existing, pointing to a successor organisation if one exists (with some flag to explain whether it's a wholly .

Questions

Does this sound sensible? Is it a good idea? Is there a better alternative?
Will the fuzzy matching be accurate enough to be useful? Is it likely to assign organisations an incorrect code?
How should the identifiers be identified as being created by Public Bodies - just a prefix like PB-?

OECD-DAC codelists:

http://www.oecd.org/dac/stats/dacandcrscodelists.htm
IATI Standard:
http://iatistandard.org

Check we have everything from https://www.gov.uk/government/organisations

https://www.gov.uk/government/organisations

Use info from http://datahub.io/dataset/uk-public-bodies

Licence for whatdotheyknow data

Has the licence for the whatdotheyknow list of public bodies been established? We asked a few months ago and they didn't have one, although no doubt with a good nudge they would be happy to provide one.

Instructions for data contributors

This should probably go on the wiki once finished.

Fields

key names:
- should be url suitable: alphanumeric + '-' only
- use - rather than _
- use abbreviations where appropriate
use ISO-formatted dates / times
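
The key naming rule could be checked mechanically. The sketch below is one possible reading of "alphanumeric + '-' only"; the function name isValidKeyName is hypothetical.

```javascript
// Hypothetical check for the key naming rule: letters, digits and '-' only.
function isValidKeyName(name) {
  return /^[A-Za-z0-9-]+$/.test(name);
}

console.log(isValidKeyName("home-office")); // hyphen is allowed
console.log(isValidKeyName("home_office")); // underscore is rejected
console.log(isValidKeyName("home office")); // spaces are rejected
```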

To discuss

Do we need last modified and created?
Do we want both parent and parent_key?

What Public Bodies

National or local departments or agencies
(Probably) Not every school or fire station in existence.

Asides

Write up a description of the columns

Clean up and extend US data

Some duplicates at the moment and also many fewer bodies than there probably are!

https://github.com/okfn/publicbodies/blob/master/data/us.csv

Specify source code license

Is the license for the source code of this project (not the data, as that is a separate issue) specified somewhere? I couldn't find it. Please include a (preferably) open source license or, if there is already one, make it more evident (e.g. mention it in the README and/or include a COPYING.txt file).

Note: it may be necessary to:

Suggest a proposed license here; and
Obtain consent from each contributor of source code in this project to license his/her work under said proposed license.

United States csv

I'm on it!

Consider adding related-bodies/related-agencies to schema

To make the data set more useful, I think we should add a field to the schema for related bodies/agencies. Perhaps the field could be populated with the key values of the related bodies.

Thoughts?

Home page issue with Firefox

It looks like in Firefox the two main DIVs (the one with the jurisdictions and the sidebar on the right) overlap a bit. In both Chrome and Safari they are positioned correctly.

Document contributor workflow

Search support

Options

JS solr (lunrjs etc)
Separate solr
Google custom search (requires us to build a site-map or list everything on the front page)
No search

Normalize dates to ISO 8601

Integrate Swiss Federal Data

Wrote a quick scraper for the directory of Swiss federal entities, see https://scraperwiki.com/scrapers/public_bodies_of_the_swiss_federation/

Names are extracted in German only, but are available in French and Italian as well
Parsing of addresses/phone numbers could be improved
Not sure if everything needed is covered and present in the right form, just tried to guess from the CSV files available - feedback very welcome!

Link to CSV files broken

#51 made a change uppercasing jurisdiction codes, but links on the front page are lowercase.

Implement hierarchy browser

For countries for which we have a good tree structure being able to browse that tree in the UI would be very helpful.

Requirements:

Go from a public body to its parent body (done)
See a list of child bodies per public body
Present overview per jurisdiction in tree / forest form
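
The last two requirements both come down to turning flat records into a tree. A minimal sketch, assuming each record carries id and parent_id fields (names borrowed from the schema discussion, not confirmed as final); the function names and sample ids are hypothetical, and cycles in the parent links are not handled.

```javascript
// Build a forest (list of root nodes) from flat records linked by parent id.
// Field names `id` and `parent_id` are assumptions based on the schema discussion.
function buildForest(records) {
  const nodes = new Map();
  for (const record of records) {
    nodes.set(record.id, { ...record, children: [] });
  }
  const roots = [];
  for (const node of nodes.values()) {
    const parent = node.parent_id ? nodes.get(node.parent_id) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // No parent, or parent missing from the data: treat as a root.
      roots.push(node);
    }
  }
  return roots;
}

// Render the forest as indented lines of text, two spaces per level.
function renderForest(roots, depth = 0) {
  return roots.flatMap((node) => [
    `${"  ".repeat(depth)}${node.id}`,
    ...renderForest(node.children, depth + 1),
  ]);
}

// Illustrative records only.
const sampleBodies = [
  { id: "xx/ministry", parent_id: "" },
  { id: "xx/agency-a", parent_id: "xx/ministry" },
  { id: "xx/council", parent_id: "" },
  { id: "xx/agency-b", parent_id: "xx/ministry" },
];

console.log(renderForest(buildForest(sampleBodies)).join("\n"));
```

The children array on each node covers "list of child bodies per public body", and the list of roots covers the per-jurisdiction overview.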

Data for Quebec

I have a scraper for Quebec's public bodies (my boss authored it, and wants to contribute). It's written in Ruby and can be seen here: https://gist.github.com/jpmckinney/5022490

How do we go about integrating this?

Use Info from OpenTED

Re-Add google analytics

Lost them in node upgrade ...

German Public Bodies from FragDenStaat.de

The ever growing list of German public bodies on FragDenStaat.de can be accessed via the FragDenStaat.de API:

https://fragdenstaat.de/api/v1/publicbody/?format=json

It's a bit verbose. If CSV is a better fit, I can also provide a dump.

`npm run-script make` throws an error

It looks like npm install is enough to install the site now. npm run-script make fails since there’s no longer a site directory.

Switch to simple web app with templating

e.g. nodejs + nunjucks + deploy on heroku

Note we would still just load the raw CSV when the app loads - the Heroku 512 MB limit should be fine given the amount of data we have so far ...

datapackage.json

Lower case country leads to dead page

The search leads to links like publicbodies.org/gb which is an empty page but the front page leads to publicbodies.org/GB. These need to be harmonised.

I'm slightly confused by the website tagline

The Public Bodies tagline is "A URL for every part of government"

yet very non-governmental entities pop up on the UK list, e.g. ASDA

It would be a less catchy tagline, but perhaps "A URL for every FoI-able public sector organisation" would be more accurate and less confusing?

Push data to CKAN DataStore for querying

Support for sending corrections / additions

Several options:

Fork and pull (good for bulk corrections and submissions)
We could load the CSVs into google docs and have people edit then remerge
- perhaps we can / should have them permanently there
Submission of individual corrections (feedback form style) - Suggest the google forms hack approach (we'll just submit stuff into gforms via js ...) - cf http://github.com/okfn/opendatacensus which uses this technique for city submissions

List Bodies On Per-Country Pages

The index page is quite long, and atm ~75% is probably not relevant to a given user. I spent ~15 minutes working on splitting them out. Should I continue? Thoughts?

Integrate EU WhoIsWho data

http://europa.eu/whoiswho/public/

Lifecycle issues

Public bodies change frequently and it would be good to agree how to deal with this. I think having a sense of permanence for URLs is useful, so I suggest:

URLs for a body must never change
Title should not change. If a body changes its name then it should be handled as if it died and a new one was created.
When a body dies it should be marked as inactive.
If a body takes over the main role of a previous body, then the old body should have a 'redirect' to the new body stored with it.
If a body's abbreviation or other property changes then that is ok (e.g. DBIS -> BIS)
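
As an illustration of how the 'inactive' marker and the 'redirect' could work together, here is a small sketch. The field names active and redirect, the function name resolveCurrentBody and the sample ids are hypothetical; the suggestions above do not fix any of them.

```javascript
// Follow 'redirect' links from an inactive body to its current successor.
// `bodies` is a Map from id to record; `active` and `redirect` are
// hypothetical field names for the ideas in the list above.
function resolveCurrentBody(bodies, id) {
  const seen = new Set();
  let current = bodies.get(id);
  while (current && !current.active && current.redirect) {
    if (seen.has(current.id)) {
      throw new Error(`Redirect loop detected at ${current.id}`);
    }
    seen.add(current.id);
    current = bodies.get(current.redirect);
  }
  // May be undefined if the id (or a redirect target) is unknown.
  return current;
}

// Illustrative records only.
const sampleRegistry = new Map([
  ["xx/old-agency", { id: "xx/old-agency", active: false, redirect: "xx/new-agency" }],
  ["xx/new-agency", { id: "xx/new-agency", active: true, redirect: null }],
]);

console.log(resolveCurrentBody(sampleRegistry, "xx/old-agency").id);
```

The old URL keeps resolving to its own record, so permanence is preserved; the redirect is only extra information that a page or API consumer can choose to follow.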

New United States data source

https://github.com/GSA-OCSIT/govt-urls

Source: http://www.infodocket.com/2014/01/29/reference-list-of-government-urls-that-do-not-end-in-gov-or-mil-crawled-by-usa-gov/

Will ingest soon.

Add keys for US data

US data is missing key field in many cases - cf #39

Display country name not just code

Load country code info from e.g. http://data.okfn.org/data/country-codes and use them ...

Broken links

All the CSV downloads on the homepage link to "undefined.csv": http://publicbodies.org/

Basic tests

See https://github.com/okfn/opendatacensus/tree/master/tests for our preferred approach (using mocha, superagent etc)

Decide whether or not organizational units are in scope

Which of the following is in scope of the data for this project:

a) only organizations (as in org:Organization ); or
b) organization and their respective hierarchy of organizational units (as in org:OrganizationalUnit )?

Change key to use slug

Let's get rid of the randomly generated uuid parts in keys and use a slug instead.

Check that slugs are unique per jurisdiction
Implement the change
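
The uniqueness check could look roughly like this. It is a sketch under assumptions: the slug rule (lowercase, accents stripped, runs of other characters collapsed to '-') is one plausible choice rather than an agreed one, and the function names are hypothetical. Note that titles written entirely in a non-Latin script (e.g. Greek) would come out as an empty slug with this rule and would need transliteration or another source for the key.

```javascript
// Turn a title into a URL-friendly slug: lowercase ASCII letters, digits, '-'.
function slugify(title) {
  return title
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining accents
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")     // collapse everything else to '-'
    .replace(/^-+|-+$/g, "");        // trim leading/trailing '-'
}

// Return the "jurisdiction/slug" keys that occur more than once.
// Assumes each record has `jurisdiction_code` and `title` fields.
function findDuplicateSlugs(records) {
  const counts = new Map();
  for (const record of records) {
    const key = `${record.jurisdiction_code.toLowerCase()}/${slugify(record.title)}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].filter(([, count]) => count > 1).map(([key]) => key);
}

// Illustrative records only.
const sampleTitles = [
  { jurisdiction_code: "GB", title: "Home Office" },
  { jurisdiction_code: "GB", title: "Home  Office" },
  { jurisdiction_code: "US", title: "Home Office" },
];

console.log(findDuplicateSlugs(sampleTitles));
```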

Also:

What about renaming key => id?

Add to repository all scripts that load publicly available data

We should create "scripts/import/XX" directories as needed in the repository to hold scripts to update the data, where available from public sources. That way it would be much easier to keep the data up-to-date.

Data for China

Data from Shen: http://ubercheckout.com/cn.csv

Rework schema (list of headers) and document

@rossjones suggested: "Would it make sense for publicbodies.org to follow the popolo spec at http://popoloproject.com/data.html" (that link is now broken)

Correct link is: http://popoloproject.com/specs/organization.html

Seems a great idea!

Current fields

Current fields and suggested changes (e.g. to be in line with popolo as much as possible). Note the list of changes is in progress and incomplete.

title => name (in org name)
abbr => abbreviation
key => id (?)
category => classification
parent => DELETE (just have parent_id)
parent_key => parent_id
description
url
jurisdiction => DELETE (just have jurisdiction code)
jurisdiction_code = ISO two-letter code where one exists. Otherwise we coin one.
source => DELETE in favour of source URL (??)
source_url => keep
- make clear there is no point pointing at exactly the same API endpoint - much more useful to point at a specific location
- (??) DELETE entirely and just credit in contributor notes (we already have a bunch of different sources for data and as people add the problem will get worse)
- Could have multiple sources per entry (??)
address
contact => What's the difference from address?
email
tags => keep
- at the moment several of the files use tags (though not necessarily consistently)
created_at => DELETE (little value ...)
updated_at => DELETE (ditto)

Add:

other_names: semi-colon separated list of alternate names
founding_date: ISO 8601
dissolution_date: ISO 8601
image
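
For reference, the renames and deletions that are not marked with question marks above could be applied to existing records with something like the following. This is only a sketch of the in-progress list: the names FIELD_RENAMES, DROPPED_FIELDS and migrateRecord are hypothetical, and undecided fields (source, source_url, contact, tags) are passed through unchanged.

```javascript
// Renames proposed above (old name -> new name).
const FIELD_RENAMES = new Map([
  ["title", "name"],
  ["abbr", "abbreviation"],
  ["key", "id"],
  ["category", "classification"],
  ["parent_key", "parent_id"],
]);

// Fields proposed for deletion without a question mark.
const DROPPED_FIELDS = new Set(["parent", "jurisdiction", "created_at", "updated_at"]);

// Return a new record using the proposed field names.
// Fields not mentioned in either list are copied as-is.
function migrateRecord(record) {
  const migrated = {};
  for (const [field, value] of Object.entries(record)) {
    if (DROPPED_FIELDS.has(field)) continue;
    migrated[FIELD_RENAMES.get(field) || field] = value;
  }
  return migrated;
}

// Illustrative record only.
const oldRecord = {
  title: "Example Agency",
  abbr: "EA",
  key: "xx/example-agency",
  parent: "Example Ministry",
  parent_key: "xx/example-ministry",
  url: "http://example.org",
  created_at: "2013-01-01",
};

console.log(migrateRecord(oldRecord));
```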

Consider switch to JSON from CSV

Pros / Cons

(+) Greater flexibility, ability to directly match org spec
- In particular can handle multiple values, multiple identifiers
(-) Much bigger and less compact. Harder for people to work with (e.g. CSV usable in spreadsheets etc)
(-) More complexity (but perhaps necessary)

Connect with relevant FOI sites

Would be nice to link out from a given public body to all requests related to it on relevant FoI sites

/cc @wombleton NZ could be a test case for this ...

Slovenian government account holders

http://www.ujp.gov.si/dokumenti/dokument.asp?id=127 -- first excel links :)

Build to flat files and deploy to s3

Build
Deploy

Build

Let's use nunjucks

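var nunjucks = require('nunjucks');

// test.html must be resolvable by the environment's template loader.
var env = new nunjucks.Environment();
var tmpl = env.getTemplate('test.html');
console.log(tmpl.render({ username: "james" }));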

That should help users of other languages to browse for public bodies in their native language.
