openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia

Home Page: https://register.openownership.org

License: GNU Affero General Public License v3.0

Dockerfile 0.56% Ruby 28.91% JavaScript 8.33% CSS 0.66% HTML 0.97% Shell 0.81% SCSS 18.54% Haml 41.20% Procfile 0.01%
beneficial-ownership beneficial-ownership-data elasticsearch open-source

register's People

Contributors

bensymonds, bibianac, brendangatens, dependabot[bot], dominicsayers, james, jits, openownership-bot, philt, spacesnottabs, stephenabbott, stevenday, thomasmarshall, timcraft, tiredpixel


register's Issues

Refactor identifiers so that they have a common, fixed structure

SO THAT

  • I don’t have to reformat them and make source-specific judgements when I come to bulk export data
  • Every record in the database is consistent in its storage of identifiers
  • I can query all identifiers in a consistent way if needed

Background:
Currently identifiers are stored as a list of objects, where each object can have different attributes depending on the source. All the attributes of a particular identifier, taken together, are treated as the 'unique' value. Therefore, the only way to be sure what forms of identifier exist is to look at all of the data in the database. From a review of the current code, though, I can see we have the following different structures:

  • UK: document_id, company_number (child companies) or document_id, company_number, link (parent companies) or document_id, link (people)
  • DK: document_id, company_number (companies) or document_id, beneficial_owner_id (people)
  • SK: document_id, company_number (companies) or document_id, beneficial_owner_id (people)
  • UA: document_id, company_number (companies) or document_id, company_number, name (people, note this is bad as there's no solid guarantee of uniqueness)
  • EITI: document_id, name (both companies and people - again, this is bad for uniqueness)
  • BODS data: document_id, statement_id and any number of identifiers given in the data, which have at least one of scheme, scheme_name and then one or more of id, uri (companies and people)
  • OpenCorporates: jurisdiction_code, company_number

When we come to export these as BODS, we tend to take the document_id and either lookup an Org-Id scheme code, or declare it directly as the schemeName. We then combine the other parts of the identifier as the 'id'. We make special exceptions for OC identifiers and also add the register's internal id as another identifier.

As with #14 , I think we should probably move towards matching BODS' Identifier object.

A/C

  • All identifiers match the structure of a BODS Identifier
  • Identifiers are represented by a Rails model or a Ruby object (see OO-197: Model entity identifiers as classes to improve code quality), with documented fields
  • All importers create identifiers in the BODS format during import, making the conversion explicit and tested.
  • All importers contain and use the Org-Id scheme code(s) which refer to their specific data sources
  • We have a one-off batch process to update existing data
  • We can remove all of the identifier mapping code from the BODS export (except for basic renaming of fields from Ruby norms to JSON norms).
  • We have updated database indexes for querying identifiers and have removed any old indexes
  • Re-importing records still finds and updates the existing data, rather than creating dupes (and we have system tests for each importer to assert that).
  • document_ids are consistent with the new naming scheme we've introduced
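To make the target shape concrete, here is a minimal sketch of how an importer could emit BODS-shaped identifiers directly, rather than leaving the mapping to the export layer. The `ORG_ID_SCHEMES` table and the `document_id` keys below are illustrative assumptions, not the real mapping:

```ruby
# Illustrative only: every importer emits identifiers already matching the
# BODS Identifier object (scheme / schemeName / id), instead of the export
# layer reformatting source-specific hashes later.
ORG_ID_SCHEMES = {
  'GB PSC Snapshot' => 'GB-COH', # hypothetical document_id => Org-Id code
  'DK CVR'          => 'DK-CVR',
}.freeze

def to_bods_identifier(document_id:, id:)
  scheme = ORG_ID_SCHEMES[document_id]
  if scheme
    { 'scheme' => scheme, 'id' => id }
  else
    # No Org-Id code known: fall back to declaring the source as schemeName
    { 'schemeName' => document_id, 'id' => id }
  end
end
```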

Outputting BODS json for large merged people uses too much memory and times out

We recently added a link to our JSON versions of entities, which has resulted in Google crawling them.

With this, we've had a recurrence of the issues of memory consumption and request timeouts, because some of the entities have thousands of merged people and owned companies. This results in a lot of data, and a lot of memory needed to traverse the chains of ownerships.

On the page versions, we resolved this by paginating owned companies and merged people (independently). We could implement something similar within the JSON, but we'd need to:

  • Figure out how to specify the pagination in the response - currently we output a JSON list, we would presumably have to wrap that in an object with some extra parameters.
  • Document the pagination - we're effectively becoming more of an API here, so we need to document how it works.
  • Figure out how to actually paginate the data in the JSON - we do quite custom MongoDB queries for the page versions at the moment, but the equivalents of those queries are embedded in the graph traversal for the JSON. The same code is also used for the graph page and bulk export (and perhaps other things I can't remember).
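The first bullet could be sketched roughly like this (the envelope field names `page`, `per_page`, `total` and `statements` are assumptions, not an agreed API shape):

```ruby
# Sketch: wrap the current flat JSON list in a paginated envelope.
# Field names are placeholders to be agreed when we document the API.
def paginate_statements(statements, page:, per_page: 100)
  slice = statements.slice((page - 1) * per_page, per_page) || []
  {
    'page'       => page,
    'per_page'   => per_page,
    'total'      => statements.size,
    'statements' => slice,
  }
end
```

The harder part, as noted above, is producing the pages lazily from the graph traversal rather than materialising the whole list first.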

Model entity identifiers as classes to improve code quality

An Entity can have multiple identifiers – each is a single unique identifier, from a particular source, that helps us find and dedupe entities.

Currently, these are modelled as an Array of Hash objects. Whilst this works okay, we should consider having first class model classes for each kind. This will allow us to:

  • Transparently control the ordering of the serialisation (and thus avoid bugs like the one in OO-141).
  • Type check for particular kinds of identifiers (e.g. the OC identifier).
  • Control the construction of identifiers at the model level, rather than in importers, etc.
  • Provide utility methods etc. (e.g. like the ones currently in the Entity class for managing OC identifiers).
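A minimal sketch of what such a value class could look like (the class and attribute names are assumptions; attributes here follow the BODS Identifier shape). Fixing the key order in `to_h` gives deterministic serialisation, and a plain value object gives equality and type checks for free:

```ruby
# Illustrative first-class identifier value object (not the current model).
class Identifier
  attr_reader :scheme, :scheme_name, :id, :uri

  def initialize(scheme: nil, scheme_name: nil, id: nil, uri: nil)
    @scheme, @scheme_name, @id, @uri = scheme, scheme_name, id, uri
  end

  # Stable key order, nil values stripped: deterministic serialisation
  def to_h
    { scheme: scheme, schemeName: scheme_name, id: id, uri: uri }.compact
  end

  def ==(other)
    other.is_a?(Identifier) && to_h == other.to_h
  end
end
```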

sample_date is potentially wrong/inaccurate on UK & SK relationships

When doing OO-293 (adding start and end dates to all relationships) we realised that the 'sample_date' on relationships was being set from the 'start date' in PSC and SK data.

For reference: we think that 'sample_date' is intended to be a 'when was the information about this relationship actually declared' kind of date, and we display it under 'Provenance' with the label 'As of:' and the help text 'The date this information was known to be true'.

Being specific: in the UK data, sample_date is being set from the 'notified_on' field of the 'data' record (where 'data' is the info about the owning person or company), while in the SK data it's coming from 'PlatnostDo' (valid until) on the record we're currently processing from 'KonecniUzivateliaVyhod' (which translates roughly as 'ultimate beneficial owners': the list of people associated with a company).

To be perfectly accurate to the name of the field, I think in these cases we shouldn't save anything in it, because neither SK nor PSC actually tell us when the data was declared. However, we should decide if it's better to have a 'Don't Know' there than what we have at the moment, which is effectively a best guess from the later of the start date and end date. Both of these are strictly within the definition of 'The date this information was known to be true' but I don't think they're very helpful to the user.

OpenCorporates resolution deletes data from source

We found this when doing the SK geocoding:

Sprematec GMBH has an address in the source data: https://rpvs.gov.sk/rpvs/Partner/Partner/Detail/2213
OpenCorporates doesn't have an address for them: https://opencorporates.com/companies/de/P3210_P3212_HRB111443
Therefore, when we import the data and look them up with OC, we lose the address: https://register.openownership.org/entities/59c225c267e4ebf34031fb65
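The intended merge behaviour can be sketched in one line (method name and the hash-based shape are illustrative, not the actual resolver code): OpenCorporates data should fill gaps or update fields it actually has, never blank out fields the source already provided.

```ruby
# Sketch: OC values win where present, but nil/missing OC values
# never overwrite data we already imported from the source register.
def merge_oc_attributes(source_attrs, oc_attrs)
  source_attrs.merge(oc_attrs.compact)
end
```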

Acceptance criteria

Given we have fixed the bug

And we've re-run an import of all our raw data

When I look at Sprematec GMBH: https://register.openownership.org/entities/59c225c267e4ebf34031fb65

Then I see an address in the top right metadata

When I look at the download changelog page I see a description of the impact of this change on the register's dataset.

In particular, with stats on how many companies changed after this fix (and, by extension, how many didn't), broken down into:

  • how many changed because they were matched with OC when they weren't previously
  • how many just have source data added (i.e. blank fields with data in now)
  • how many have changed because OC's data has changed since we last looked them up

Hide ended statements in graphs by default

Example of current and former ownership and control relationships coexisting: https://register.openownership.org/entities/5b16a8b89dfc3fae18f62024/graph

In my experience the default behaviour of showing all relationships regardless of whether they have ended:

  • is a source of confusion for users;
  • gives the impression that the data is messier and less useful than it really is.

I would prefer to see only the current situation, with some mechanism to see past ownership positions.

Add and configure the lograge gem to streamline our logs

We recently ran over our 200MB daily limit of logs with Papertrail. They have some suggestions for reducing the size of the logs you produce: https://help.papertrailapp.com/kb/configuration/controlling-verbosity and, as a stopgap, I added a filter via the Papertrail settings to remove any logs about which views and partials were being rendered.

Some of the other suggestions make more sense long term, however, including installing the lograge gem (which collapses Rails' logs down to single lines per request) and perhaps disabling Action View logging altogether.
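For reference, a typical lograge setup looks something like this (a sketch; the exact options and which environment file it lives in are our choice to make):

```ruby
# config/environments/production.rb (or an initializer)
Rails.application.configure do
  config.lograge.enabled = true  # one key=value line per request
  config.lograge.formatter = Lograge::Formatters::KeyValue.new
  # Optionally silence Action View's render logging entirely:
  # config.action_view.logger = nil
end
```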

Replace the Provenance model entirely with RawDataProvenances

So that

  • I can show more detailed provenance (OO-509: As a user, I want a Provenance link to raw data from the relationship page)
  • I can output Source info for every statement in BODS
  • I can remove the code that deals with Provenance sources in the BODS export
  • I can show a statementDate on person and entity statements
  • I can show sources and statementDates on unknownPerson statements

Assumptions

  • UnknownPersonsEntities (or whatever code replaces them) will need to have RawDataProvenances given to them from Statements when they’re created.

Make it possible to store approximate dates in every date field

We currently use a library which provides a special database field for ISO8601 approximate dates (e.g. 2019-05). However, in developing the BODS import, I realised that it doesn’t really work in a way that supports this correctly.

The library allows us to parse dates like 2019-05, but it turns that into 2019-05-01 when it saves it in the database, effectively losing the ‘approximation’ from the source. This has some advantages in that it becomes comparable to other full dates (e.g. for sorting) but it seems important that we don’t lose the original intention of the source.

Relatedly, we only use this special date library on some dates and in some database tables. It seems like we should use them everywhere, or at least be consistent in which kinds of things are approximate and which aren’t (e.g. are statement dates approximate?).
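One possible approach (an assumption, not the current implementation, and the type name is invented): keep the original ISO8601 string so the source's precision survives, alongside a padded Date used only for sorting and comparison.

```ruby
require 'date'

# Sketch: preserve the source's approximate date ('2019' or '2019-05')
# for display/export, while still having a comparable Date for sorting.
ApproxDate = Struct.new(:original, :date) do
  def self.parse(str)
    parts = str.split('-').map(&:to_i)
    # Missing month/day are padded to 1 only for the sortable Date
    new(str, Date.new(parts[0], parts[1] || 1, parts[2] || 1))
  end

  def to_s
    original # display keeps the original precision
  end
end
```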

Update bootstrap-ui library

We are currently using an alpha version of the Bootstrap v5 library, which has some broken things in it (e.g. collapsible containers don't quite work). We should upgrade this to the latest stable v5 release.

Note: I tried this a few weeks back and it looks like the entire header is broken on the latest v5 release due to breaking changes from the alpha (in the nav component). So this will likely be a bigger task than expected.

Register v2: Company IDs starting with a zero

One new issue I've noticed is that we aren't transforming company IDs starting with a zero, so there are occasionally duplicates. I'll fix that for the import next week (end of May).

ElasticSearch: aggregation exception error

The Error

[400] {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"open_ownership_register_entities_development","node":"SLgzHYLrTfidSNSMn3-UeQ","reason":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory.","caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}},"status":400}

Stack Trace

elasticsearch-transport (6.1.0) lib/elasticsearch/transport/transport/base.rb:205:in `__raise_transport_error'
elasticsearch-transport (6.1.0) lib/elasticsearch/transport/transport/base.rb:323:in `perform_request'
elasticsearch-transport (6.1.0) lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
elasticsearch-transport (6.1.0) lib/elasticsearch/transport/client.rb:131:in `perform_request'
elasticsearch-api (6.1.0) lib/elasticsearch/api/actions/search.rb:187:in `search'
elasticsearch-model (6.0.0) lib/elasticsearch/model/searching.rb:51:in `execute!'
elasticsearch-model (6.0.0) lib/elasticsearch/model/response.rb:29:in `response'
elasticsearch-model (6.0.0) lib/elasticsearch/model/response/base.rb:34:in `total'
app/controllers/searches_controller.rb:12:in `show'
actionpack (5.2.4.3) lib/action_controller/metal/basic_implicit_render.rb:6:in `send_action'
actionpack (5.2.4.3) lib/abstract_controller/base.rb:194:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/rendering.rb:30:in `process_action'
actionpack (5.2.4.3) lib/abstract_controller/callbacks.rb:42:in `block in process_action'
activesupport (5.2.4.3) lib/active_support/callbacks.rb:132:in `run_callbacks'
actionpack (5.2.4.3) lib/abstract_controller/callbacks.rb:41:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/rescue.rb:22:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/instrumentation.rb:34:in `block in process_action'
activesupport (5.2.4.3) lib/active_support/notifications.rb:168:in `block in instrument'
activesupport (5.2.4.3) lib/active_support/notifications/instrumenter.rb:23:in `instrument'
activesupport (5.2.4.3) lib/active_support/notifications.rb:168:in `instrument'
actionpack (5.2.4.3) lib/action_controller/metal/instrumentation.rb:32:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/params_wrapper.rb:256:in `process_action'
actionpack (5.2.4.3) lib/abstract_controller/base.rb:134:in `process'
actionview (5.2.4.3) lib/action_view/rendering.rb:32:in `process'
actionpack (5.2.4.3) lib/action_controller/metal.rb:191:in `dispatch'
actionpack (5.2.4.3) lib/action_controller/metal.rb:252:in `dispatch'
actionpack (5.2.4.3) lib/action_dispatch/routing/route_set.rb:52:in `dispatch'
actionpack (5.2.4.3) lib/action_dispatch/routing/route_set.rb:34:in `serve'
actionpack (5.2.4.3) lib/action_dispatch/journey/router.rb:52:in `block in serve'
actionpack (5.2.4.3) lib/action_dispatch/journey/router.rb:35:in `each'
actionpack (5.2.4.3) lib/action_dispatch/journey/router.rb:35:in `serve'
actionpack (5.2.4.3) lib/action_dispatch/routing/route_set.rb:840:in `call'
rack-attack (6.2.1) lib/rack/attack.rb:156:in `call'
rack-attack (6.2.1) lib/rack/attack.rb:170:in `call'
warden (1.2.8) lib/warden/manager.rb:36:in `block in call'
warden (1.2.8) lib/warden/manager.rb:34:in `catch'
warden (1.2.8) lib/warden/manager.rb:34:in `call'
rack (2.2.3) lib/rack/tempfile_reaper.rb:15:in `call'
rack (2.2.3) lib/rack/etag.rb:27:in `call'
rack (2.2.3) lib/rack/conditional_get.rb:27:in `call'
rack (2.2.3) lib/rack/head.rb:12:in `call'
actionpack (5.2.4.3) lib/action_dispatch/http/content_security_policy.rb:18:in `call'
rack (2.2.3) lib/rack/session/abstract/id.rb:266:in `context'
rack (2.2.3) lib/rack/session/abstract/id.rb:260:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/cookies.rb:670:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/callbacks.rb:28:in `block in call'
activesupport (5.2.4.3) lib/active_support/callbacks.rb:98:in `run_callbacks'
actionpack (5.2.4.3) lib/action_dispatch/middleware/callbacks.rb:26:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/executor.rb:14:in `call'
rollbar (2.27.0) lib/rollbar/middleware/rails/rollbar.rb:25:in `block in call'
rollbar (2.27.0) lib/rollbar.rb:145:in `scoped'
rollbar (2.27.0) lib/rollbar/middleware/rails/rollbar.rb:22:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/debug_exceptions.rb:61:in `call'
rollbar (2.27.0) lib/rollbar/middleware/rails/show_exceptions.rb:22:in `call_with_rollbar'
web-console (3.4.0) lib/web_console/middleware.rb:135:in `call_app'
web-console (3.4.0) lib/web_console/middleware.rb:28:in `block in call'
web-console (3.4.0) lib/web_console/middleware.rb:18:in `catch'
web-console (3.4.0) lib/web_console/middleware.rb:18:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
railties (5.2.4.3) lib/rails/rack/logger.rb:38:in `call_app'
railties (5.2.4.3) lib/rails/rack/logger.rb:26:in `block in call'
activesupport (5.2.4.3) lib/active_support/tagged_logging.rb:71:in `block in tagged'
activesupport (5.2.4.3) lib/active_support/tagged_logging.rb:28:in `tagged'
activesupport (5.2.4.3) lib/active_support/tagged_logging.rb:71:in `tagged'
railties (5.2.4.3) lib/rails/rack/logger.rb:26:in `call'
sprockets-rails (3.2.1) lib/sprockets/rails/quiet_assets.rb:13:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/remote_ip.rb:81:in `call'
request_store (1.4.0) lib/request_store/middleware.rb:19:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/request_id.rb:27:in `call'
rack (2.2.3) lib/rack/method_override.rb:24:in `call'
rack (2.2.3) lib/rack/runtime.rb:22:in `call'
activesupport (5.2.4.3) lib/active_support/cache/strategy/local_cache_middleware.rb:29:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/executor.rb:14:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/static.rb:127:in `call'
rack (2.2.3) lib/rack/sendfile.rb:110:in `call'
webpacker (4.0.7) lib/webpacker/dev_server_proxy.rb:29:in `perform_request'
rack-proxy (0.6.5) lib/rack/proxy.rb:57:in `call'
railties (5.2.4.3) lib/rails/engine.rb:524:in `call'
puma (4.3.5) lib/puma/configuration.rb:228:in `call'
puma (4.3.5) lib/puma/server.rb:713:in `handle_request'
puma (4.3.5) lib/puma/server.rb:472:in `process_client'
puma (4.3.5) lib/puma/server.rb:328:in `block in run'
puma (4.3.5) lib/puma/thread_pool.rb:134:in `block in spawn_thread'

Solution

Changed the following code in search.rb:

def self.aggregations
    {
      type: {
        terms: {
          field: :type
        },
      },
      country: {
        terms: {
          field: :country_code
        },
      },
    }
  end

to:

def self.aggregations
    {
      type: {
        terms: {
          field: "type.keyword"
        },
      },
      country: {
        terms: {
          field: "country_code.keyword"
        },
      },
    }
  end

Question

How is it that this worked for you as-is but fails in my local environment? Do you have any pointers for me?
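One likely explanation (an assumption, not verified against the deployed indexes): the environments have different index mappings. If `type` and `country_code` were indexed as `keyword` (or as `text` with a `.keyword` multi-field, which is Elasticsearch's dynamic-mapping default for strings) in one environment, aggregating on them works there; an index created with a plain explicit `text` mapping raises exactly this error. Declaring the multi-field explicitly in the model's mapping would make the `.keyword` fix work everywhere, e.g. with the elasticsearch-model DSL:

```ruby
# Inside the searchable model: make the keyword sub-field explicit so
# aggregations on "type.keyword" / "country_code.keyword" don't depend
# on dynamic-mapping defaults.
mappings do
  indexes :type,         type: 'text', fields: { keyword: { type: 'keyword' } }
  indexes :country_code, type: 'text', fields: { keyword: { type: 'keyword' } }
end
```

Note the index would need to be recreated and reindexed for a mapping change to take effect.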

Add some ways to browse example data

This was a recommendation that came out of a UX review commissioned across all of OO's web properties and it has been mentioned by other users too.

The UX review specifically suggested adding example(s) to the homepage, other users have asked for ways to browse by country.

Update approach to using OpenCorporates identifiers for sub-national jurisdictions

Inspired by this Twitter thread, I found myself searching for a number of Scottish Qualifying Partnerships on the Open Ownership Register. This took me to the following search results page where we realised that the duplicate entities are not being resolved due to an OpenCorporates issue.

@spacesnottabs investigated further and discovered that OpenCorporates has the company under jurisdiction ca_pe for "Prince Edward Island (Canada)", but the Register is parsing the jurisdiction as ca (Canada). If we try to resolve the record with ca as the jurisdiction code, it will find nothing.

Sample PSC record:

{
  "company_number": "SG000612",
  "data": {                                                                     
    "address": {                                                                
      "address_line_1": "Grafton Street",                                       
      "country": "Canada",                                                      
      "locality": "Charlottestown",                                             
      "premises": "65",                                                         
      "region": "Prince Edward Island C1a8b9"                                   
    },                                                                          
    "etag": "510f53dafafaf4acf43a16964418a2cf8ccc9a3e",                         
    "identification": {                                                         
      "country_registered": "Canada",                                           
      "legal_authority": "Canada",                                              
      "legal_form": "Private Company",
      "place_registered": "Pei Business/Corporate Registry",
      "registration_number": "13174"
    },
    "kind": "corporate-entity-person-with-significant-control",
    "links": {
      "self": "/company/SG000612/persons-with-significant-control/corporate-entity/RnA_vTfWVHeC1PJqQqRw8LZuFoU"
    },
    "name": "Integritas (Canada) Trustee Corporation",
    "natures_of_control": [
      "right-to-appoint-and-remove-person"
    ],
    "notified_on": "2017-06-26"
  }
}

Our sample Entity stored in Mongo:
#<Entity _id: 630e81eab19f5888b5a78d34, updated_at: 2022-08-30 21:32:26.818 UTC, type: "legal-entity", name: "Integritas (Canada) Trustee Corporation", address: "65, Grafton Street, Charlottestown, Prince Edward Island C1a8b9", nationality: nil, country_of_residence: nil, dob: nil, jurisdiction_code: "ca", company_number: "13174", incorporation_date: nil, dissolution_date: nil, company_type: nil, restricted_for_marketing: nil, lang_code: nil, identifiers: [{"document_id"=>"GB PSC Snapshot", "link"=>"/company/SG000612/persons-with-significant-control/corporate-entity/RnA_vTfWVHeC1PJqQqRw8LZuFoU", "company_number"=>"13174"}], merged_entities_count: nil, master_entity_id: nil, oc_updated_at: nil, last_resolved_at: nil, self_updated_at: 2022-08-30 21:32:26.818 UTC, _type: "Entity">

Currently "region" is not used in the code at all; only country is used. This is fine for our gb, dk and sk jurisdictions, but doesn't work for overseas jurisdictions such as Canada.

We need to extend our support to use both region and country to get the jurisdiction name/code by upgrading the countries gem we already use to the latest version (and fix the breaking changes): https://github.com/countries/countries

The work involved should also make sure we can find the jurisdiction even if the name isn't an exact match. The working theory is that the jurisdiction code is {country-code}_{region-code}, but this needs to be checked against the gem and the org-id.guide approach: https://org-id.guide/results?structure=all&coverage=CA&sector=all
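The working theory can be sketched like this. The lookup tables below are illustrative stand-ins (the real implementation would use the upgraded countries gem's subdivision data); note the last example, where the messy region string from the PSC record fails an exact match, which is the fuzzy-matching gap the work above needs to close:

```ruby
# Illustrative tables only; the countries gem would supply these in practice.
COUNTRIES    = { 'canada' => 'ca' }.freeze
SUBDIVISIONS = { 'canada' => { 'prince edward island' => 'pe' } }.freeze

# Working theory: jurisdiction code is {country-code}_{region-code},
# falling back to the bare country code when no region matches.
def jurisdiction_code(country_name, region_name = nil)
  country_key = country_name.to_s.downcase.strip
  country = COUNTRIES[country_key]
  return nil unless country

  region = SUBDIVISIONS.dig(country_key, region_name.to_s.downcase.strip)
  region ? "#{country}_#{region}" : country
end
```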

Develop a way to incrementally explore networks that are too large for initial display

At the moment, the graph does not support very large networks, which are the ones that need the most visualisation support. This is partly a backend performance restriction, where getting that data from the DB is too slow, but also a design problem, where we don't know how to present thousands of nodes/edges in a way that is useful.

Examples

Our current thinking is that some kind of click-to-expand clustering is the best way to solve this.

DoubleRenderError in EntitiesController#show

When rendering the JSON of a merged person, we get a DoubleRenderError because we've redirected to their master entity, but we haven't returned, so the subsequent render still gets called.
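The standard fix is an explicit return after the redirect. A minimal self-contained illustration of the bug and fix (no Rails here; the `Responder` class just mimics "responding twice raises"):

```ruby
class Responder
  class DoubleRenderError < StandardError; end

  def initialize
    @responded = false
  end

  def respond!(what)
    raise DoubleRenderError if @responded
    @responded = true
    what
  end

  # Buggy shape: redirects, then falls through and responds again
  def show_buggy(merged:)
    respond!(:redirect) if merged
    respond!(:render)
  end

  # Fixed shape: explicit early return after the redirect
  def show_fixed(merged:)
    return respond!(:redirect) if merged
    respond!(:render)
  end
end
```

In the controller this is the `redirect_to ... and return` (or `return` on the next line) idiom.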

Import every kind of address when importing BODS data

So that:

  • I can better determine who I'm looking at
  • I don't see as many duplicates

Assumptions:

  • Currently we limit the import to 'service' address types only
  • We only have one address field, so we have to choose one address type or change the data model
  • We don't have any real BODS data sources to learn from
  • We need to do more work to mark our output address types correctly

Acceptance criteria:

  • When we load in our own BODS output data, we see addresses for people that have them in the original sources.

Add a content security policy

Implement a Content Security Policy (CSP): an added layer of security that helps to detect and mitigate various types of attack on our web applications, including Cross-Site Scripting (XSS) and data injection attacks.
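Rails 5.2+ ships a CSP DSL, so a first pass could be a sketch like the following (the directive values are placeholders that would need tuning against our actual assets, CDNs and any inline scripts):

```ruby
# config/initializers/content_security_policy.rb
Rails.application.config.content_security_policy do |policy|
  policy.default_src :self
  policy.script_src  :self
  policy.style_src   :self
  policy.img_src     :self, :data
  policy.object_src  :none
end
```

It may be worth starting with a report-only policy to find violations before enforcing.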

Users are requesting transliterated URLs where the querystring is URL-encoded, causing it not to be recognised and the URL to 404

For example:

https://rollbar.com/OpenOwnership/openownership-register/items/337/occurrences/65572472239/

which seems like a genuine user, not a bot. (Ignore the main title, Rollbar has grouped it under other 404 errors).

How did this user get to that URL? Did they click on it from a link we displayed (in which case there's a bug in our code somewhere), or did they do something weird with it (e.g. copy, paste somewhere else, re-copy and paste into the browser)?

This is not a single isolated incident, and there are multiple browsers involved (at least IE and Safari) so I suspect a bug in our code somewhere.

Industry classification codes from OpenCorporates are not presented with the relevant scheme

There are multiple classification schemes for industry classification codes. OpenCorporates presents the original classification data and a mapping between the original and other schemes, e.g. https://opencorporates.com/statements/757088716

I think the register is taking the original industry classification and combining it with the mapped classifications, then presenting these without the mappings to the relevant schemes, e.g. https://register.openownership.org/entities/5e53bf41ac0ca34dfc2e8c97

This is a source of confusion to users and means that we have a lot of noise in the registry data.

Denmark importer is not making the best use of available DK data

Whilst going through our test data to anonymise it, particularly culling out parts of the data that we don't actually use during imports, I've noticed a couple of things the importer could do better, given the data that's available. This isn't important for the code (what we have works) but would be if we were to document the source:

  • We have code to find the 'most recent' address for people, but the data provides this directly in nyesteBeliggenhedsadresse inside deltagerpersonMetadata
  • We delve deep into the interests of every relationship listed to find the beneficial ownership relationships (among directorships and the other types of relationship the data contains), but these relationships are actually flagged at a higher level, in virksomhedSummariskRelation/organisationer/organisationsNavn/navn, by the value "Reelle ejere" (Danish for 'real owners'). This would potentially simplify the code, because we could just find this relationship and then process all the interests (medlemsData) in two discrete steps, rather than the slightly confusing loop-and-break structure we have now.

It's probably also worth noting in any docs that we're ignoring lots of historical data about companies and people (old names, old addresses) and there are a lot of other fields which we don't really understand/use at the moment.
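The two-step shape suggested above might look roughly like this. The hash nesting is inferred from the field path quoted in the bullet and would need verifying against actual CVR records before use:

```ruby
# Sketch: select only the organisations flagged "Reelle ejere"
# (beneficial owners), then process their medlemsData separately.
def beneficial_owner_organisations(record)
  record.fetch('virksomhedSummariskRelation', []).flat_map do |rel|
    rel.fetch('organisationer', []).select do |org|
      (org['organisationsNavn'] || []).any? { |n| n['navn'] == 'Reelle ejere' }
    end
  end
end
```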

Refactor Interests to share the same structure across all sources

SO THAT

  • I don’t have to reformat them and make source-specific judgements when I come to bulk export data
  • Every record in the database is consistent in its storage of interests
  • I can query all interests in a consistent way if needed
  • I can simplify the code to display interests across the site
  • I can translate interests in a more scalable manner

Background:

Currently we have three different structures:

  • UK: text strings which encode ranges of shares (where relevant)
  • DK: objects, which have a type, share_min and share_max
  • SK: we have no interests or interest type for Slovakia
  • BODS: objects which look like DK, but have exclusive_min and exclusive_max properties too. This is the canonical version we’d like to use everywhere.
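The canonical shape could be sketched as follows. The attribute names track the BODS Interest object, but the class itself and the `from_psc_range` helper (including its assumed exclusive-bound semantics for PSC ranges) are illustrative, not existing code:

```ruby
# Illustrative canonical interest, shared by all importers.
Interest = Struct.new(:type, :share_min, :share_max,
                      :exclusive_min, :exclusive_max, keyword_init: true) do
  # UK PSC-style banded ranges (e.g. '25 to 50 percent') would be parsed
  # into this same structure; the exclusive bounds shown are an assumption
  # about how PSC bands should map and need checking against the standard.
  def self.from_psc_range(type, min, max)
    new(type: type, share_min: min, share_max: max,
        exclusive_min: true, exclusive_max: false)
  end
end
```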

A/C

  • All relationship records in the database have an interest object, with all the properties defined in Interests in the BODS schema
  • Interests are represented by a Rails model, with documented fields
  • All importers convert their source’s interest format to the BODS format during import, making the conversion explicit and tested.
  • We have a one-off batch process to update existing imports
  • All interests are displayed through a single decorator object
  • The BODS export does not need to map interests differently by source.
  • The BODS export can export interests which have been imported from BODS sources.
  • We no longer need a large list of PSC interest codes in our translation strings to properly translate interests for display
  • We have system tests of the BODS export which assert that the interests are output correctly, and these don’t change after this refactoring.
  • The interests on the relationship page don’t change either (and we have system tests to assert that).

birthDate in BODS JSON from UK PSC Register assumes everyone's birthdays are the first of the month

Current mapping from UK PSC Register JSON snapshots to BODS v0.1 maps date_of_birth to birthDate in BODS v0.1.

But this assumes a complete date string. The day of an officer's date of birth in the UK PSC Register is suppressed, so only the month and year they were born are provided.

The result in the BODS v0.1 JSON files for individuals is that all officers from the UK PSC Register appear to have birth dates on the first of the month, which is incorrect. Instead we should map to just YYYY-MM, not YYYY-MM-DD, in the birthDate field.

If this change is made, we need to consider what impact it will have on existing deduplication/entity matching, for example in the natural-persons duplicate merger process.
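The proposed mapping is small. A sketch, assuming the PSC `date_of_birth` hash has `"year"` and `"month"` keys as in Companies House data:

```ruby
# Emit 'YYYY-MM' from a PSC date_of_birth rather than inventing a day.
def bods_birth_date(date_of_birth)
  return nil unless date_of_birth && date_of_birth['year'] && date_of_birth['month']

  format('%04d-%02d', date_of_birth['year'], date_of_birth['month'])
end
```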

Output BODS data in V0.2

So that

  • It matches the most obvious documentation
  • It provides a working example of the current guidance and standard
  • We don't have to incur tech debt maintaining multiple versions of parsers, etc

Assumptions

  • We have work to do upgrading our BODS export to match v0.2. In particular, we’re not working to a consistent model of indirect ownership because we’re currently reflecting each individual source’s approach
  • We need lib-cove-bods to be updated to v0.2 and released

A/C

  • It's valid according to lib-cove-bods
  • TBC - it matches all the new things in v0.2

Refactor the various relationship details views into one shared bit of code

So That

  • I have a single unit of code which provides all relationship diagrams, not several competing versions which confuse me.
  • I can ensure I’m presenting a consistent experience to the user whenever I use a tooltip.
  • I can test one piece of code and be confident it handles all of the eventualities of relationship display.

A/C

Register v2: companies owned by companies issue

It looks like the data is fine when people own companies, but when a company has an ownership statement from a registered entity, the owner is coming through as empty. I've tracked down the bug, but we will need to fix the records that have this problem by reimporting them (conversation in Slack)

Register v2: Middle names

Middle names are not appearing because the v2 prototype uses the name from the raw data rather than defaulting to the OpenCorporates name for people.

Register v2: Hundreds of PSC identifiers produced for same entity

When transforming PSC records, some can produce hundreds of unique identifiers using the PSC link URL from the source record.

...
{"document_id"=>"GB PSC Snapshot", "link"=>"/company/12728169/persons-with-significant-control/corporate-entity/gSel9kO8D2MHk5upXoFQLSR6zzA", "company_number"=>"09361466"},
 {"document_id"=>"GB PSC Snapshot", "link"=>"/company/12729601/persons-with-significant-control/corporate-entity/srvxdgCRlgLeC6u1cfgRHqgwFf0", "company_number"=>"9361466"},
 {"document_id"=>"GB PSC Snapshot", "link"=>"/company/12731527/persons-with-significant-control/corporate-entity/4UZxvVwxokDJXFx1TZIDCrvV4lU", "company_number"=>"9361466"},

Each time the entity is seen as part of a new statement from the PSC source, the current flow fetches all of its previous statements, appends a new identifier, and saves a new record.
This means that for companies involved in a few thousand statements, we would fetch thousands of statements and store a new record, with thousands of identifiers, each time.

To resolve this, do not use the PSC statement link as an identifier; instead, use it to populate the source.url of the produced BODS v0.2 record.
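A hedged sketch of that change (the method name, scheme name, and base URL are assumptions for illustration; the real transformer will differ):

```ruby
# Illustrative sketch: keep only the stable company number as an
# identifier, and carry the per-statement PSC link in source.url.
PSC_BASE_URL = 'https://api.company-information.service.gov.uk' # assumed base

def map_psc_entity(record)
  {
    'identifiers' => [
      { 'scheme' => 'GB-COH', 'id' => record['company_number'] },
    ],
    'source' => { 'url' => "#{PSC_BASE_URL}#{record['link']}" },
  }
end
```

Because the identifier no longer varies per statement, re-seeing the same company in thousands of statements no longer accumulates thousands of identifiers.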

Register v2: Unknown/anonymous people records

The unknown/anonymous people records bug needs to be checked. I'd initially thought that was the issue here, as there were unknown persons in the original graph, but I only found yesterday that it was due to something being owned by a registeredEntity instead of a legalEntity (conversation in Slack).

Mark interests as constituting beneficial ownership (or not) in BODS output

So that:

  • I know what data I'm looking at

Assumptions

  • We can only operate on a source-by-source basis. For DK, everything should set this to true. For PSC, it should be true for every relationship directly between a person and a company. The same should hold at a minimum anywhere else (e.g. EITI, UA, SK), but in SK and UA it may be the case that it can be true for everything. OO-XXX

A/C

  • beneficialOwnershipOrControl is set on every interest where it can be - TBC
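A minimal sketch of the source-by-source rule described above (the method name and the person_owner flag are hypothetical; the behaviour for sources other than DK and PSC is left unknown pending confirmation):

```ruby
# Hypothetical sketch: decide beneficialOwnershipOrControl per source.
# DK: always true. PSC: true only for direct person-to-company
# relationships. Other sources: unknown (nil), so the field is omitted.
def beneficial_ownership_or_control(source, person_owner:)
  case source
  when :dk  then true
  when :psc then person_owner
  else nil # e.g. :eiti, :ua, :sk - to be confirmed
  end
end
```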
