openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia

Home Page: https://register.openownership.org

License: GNU Affero General Public License v3.0

Dockerfile 0.56% Ruby 28.91% JavaScript 8.33% CSS 0.66% HTML 0.97% Shell 0.81% SCSS 18.54% Haml 41.20% Procfile 0.01%
beneficial-ownership beneficial-ownership-data elasticsearch open-source

register's People

Contributors

bensymonds, bibianac, brendangatens, dependabot[bot], dominicsayers, james, jits, openownership-bot, philt, spacesnottabs, stephenabbott, stevenday, thomasmarshall, timcraft, tiredpixel


register's Issues

Refactor identifiers so that they have a common, fixed structure

SO THAT

  • I don’t have to reformat them and make source-specific judgements when I come to bulk export data
  • Every record in the database is consistent in its storage of identifiers
  • I can query all identifiers in a consistent way if needed

Background:
Currently identifiers are stored as a list of objects, where each object can have different attributes depending on the source. All the attributes of a particular identifier, taken together, are treated as the 'unique' value. Therefore, the only way to be sure what forms of identifier exist is to look at all of the data in the database. From a review of the current code, though, I can see we have the following different structures:

  • UK: document_id, company_number (child companies) or document_id, company_number, link (parent companies) or document_id, link (people)
  • DK: document_id, company_number (companies) or document_id, beneficial_owner_id (people)
  • SK: document_id, company_number (companies) or document_id, beneficial_owner_id (people)
  • UA: document_id, company_number (companies) or document_id, company_number, name (people, note this is bad as there's no solid guarantee of uniqueness)
  • EITI: document_id, name (both companies and people - again, this is bad for uniqueness)
  • BODS data: document_id, statement_id and any number of identifiers given in the data, which have at least one of scheme, scheme_name and then one or more of id, uri (companies and people)
  • OpenCorporates: jurisdiction_code, company_number

When we come to export these as BODS, we tend to take the document_id and either lookup an Org-Id scheme code, or declare it directly as the schemeName. We then combine the other parts of the identifier as the 'id'. We make special exceptions for OC identifiers and also add the register's internal id as another identifier.

As with #14 , I think we should probably move towards matching BODS' Identifier object.

A/C

  • All identifiers match the structure of a BODS Identifier
  • Identifiers are represented by a Rails model or a Ruby object (see OO-197: Model entity identifiers as classes to improve code quality), with documented fields
  • All importers create identifiers in the BODS format during import, making the conversion explicit and tested.
  • All importers contain and use the Org-Id scheme code(s) which refer to their specific data sources
  • We have a one-off batch process to update existing data
  • We can remove all of the identifier mapping code from the BODS export (except for basic renaming of fields from Ruby norms to JSON norms).
  • We have updated database indexes for querying identifiers and have removed any old indexes
  • Re-importing records still finds and updates the existing data, rather than creating dupes (and we have system tests for each importer to assert that).
  • document_ids are consistent with the new naming scheme we've introduced
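To make the target shape concrete, here is a minimal sketch of how an importer could emit BODS-shaped identifiers directly, rather than leaving the mapping to the export layer. The `ORG_ID_SCHEMES` table and the `document_id` keys below are illustrative assumptions, not the real mapping:

```ruby
# Illustrative only: every importer emits identifiers already matching the
# BODS Identifier object (scheme / schemeName / id), instead of the export
# layer reformatting source-specific hashes later.
ORG_ID_SCHEMES = {
  'GB PSC Snapshot' => 'GB-COH', # hypothetical document_id => Org-Id code
  'DK CVR'          => 'DK-CVR',
}.freeze

def to_bods_identifier(document_id:, id:)
  scheme = ORG_ID_SCHEMES[document_id]
  if scheme
    { 'scheme' => scheme, 'id' => id }
  else
    # No Org-Id code known: fall back to declaring the source as schemeName
    { 'schemeName' => document_id, 'id' => id }
  end
end
```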

Outputting BODS json for large merged people uses too much memory and times out

We recently added a link to our JSON versions of entities, which has resulted in Google crawling them.

With this, we've had a recurrence of the issues of memory consumption and request timeouts, because some of the entities have thousands of merged people and owned companies. This results in a lot of data, and a lot of memory needed to traverse the chains of ownerships.

On the page versions, we resolved this by paginating owned companies and merged people (independently). We could implement something similar within the JSON, but we'd need to:

  • Figure out how to specify the pagination in the response - currently we output a JSON list, we would presumably have to wrap that in an object with some extra parameters.
  • Document the pagination - we're effectively becoming more of an API here, so we need to document how it works.
  • Figure out how to actually paginate the data in the JSON - we do quite custom MongoDB queries for the page versions at the moment, but the equivalents of those queries are embedded in the graph traversal for the JSON. The same code is also used for the graph page and bulk export (and perhaps other things I can't remember).
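The first bullet could be sketched roughly like this (the envelope field names `page`, `per_page`, `total` and `statements` are assumptions, not an agreed API shape):

```ruby
# Sketch: wrap the current flat JSON list in a paginated envelope.
# Field names are placeholders to be agreed when we document the API.
def paginate_statements(statements, page:, per_page: 100)
  slice = statements.slice((page - 1) * per_page, per_page) || []
  {
    'page'       => page,
    'per_page'   => per_page,
    'total'      => statements.size,
    'statements' => slice,
  }
end
```

The harder part, as noted above, is producing the pages lazily from the graph traversal rather than materialising the whole list first.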

Model entity identifiers as classes to improve code quality

An Entity can have multiple identifiers – each is a single unique identifier, from a particular source, that helps us find and dedupe entities.

Currently, these are modelled as an Array of Hash objects. Whilst this works okay, we should consider having first class model classes for each kind. This will allow us to:

  • Transparently control the ordering of the serialisation (and thus avoid bugs like the one in OO-141).
  • Type check for particular kinds of identifiers (e.g. the OC identifier).
  • Control the construction of identifiers at the model level, rather than in importers, etc.
  • Provide utility methods etc. (e.g. like the ones currently in the Entity class for managing OC identifiers).
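A minimal sketch of what such a value class could look like (the class and attribute names are assumptions; attributes here follow the BODS Identifier shape). Fixing the key order in `to_h` gives deterministic serialisation, and a plain value object gives equality and type checks for free:

```ruby
# Illustrative first-class identifier value object (not the current model).
class Identifier
  attr_reader :scheme, :scheme_name, :id, :uri

  def initialize(scheme: nil, scheme_name: nil, id: nil, uri: nil)
    @scheme, @scheme_name, @id, @uri = scheme, scheme_name, id, uri
  end

  # Stable key order, nil values stripped: deterministic serialisation
  def to_h
    { scheme: scheme, schemeName: scheme_name, id: id, uri: uri }.compact
  end

  def ==(other)
    other.is_a?(Identifier) && to_h == other.to_h
  end
end
```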

sample_date is potentially wrong/inaccurate on UK & SK relationships

When doing OO-293 (adding start and end dates to all relationships) we realised that the 'sample_date' on relationships was being set from the 'start date' in PSC and SK data.

For reference: we think that 'sample_date' is intended to be a 'when was the information about this relationship actually declared' kind of date, and we display it under 'Provenance' with the label 'As of:' and the help text 'The date this information was known to be true'.

Being specific: in the UK data, sample_date is being set from the 'notified_on' field of the 'data' record (where 'data' is the info about the owning person or company), while in the SK data it's coming from 'PlatnostDo' (valid until) on the record we're currently processing from 'KonecniUzivateliaVyhod' (which translates roughly as 'ultimate beneficial owners': the list of people associated with a company).

To be perfectly accurate to the name of the field, I think in these cases we shouldn't save anything in it, because neither SK nor PSC actually tell us when the data was declared. However, we should decide if it's better to have a 'Don't Know' there than what we have at the moment, which is effectively a best guess from the later of the start date and end date. Both of these are strictly within the definition of 'The date this information was known to be true' but I don't think they're very helpful to the user.

OpenCorporates resolution deletes data from source

We found this when doing the SK geocoding:

Sprematec GMBH has an address in the source data: https://rpvs.gov.sk/rpvs/Partner/Partner/Detail/2213
OpenCorporates doesn't have an address for them: https://opencorporates.com/companies/de/P3210_P3212_HRB111443
Therefore, when we import the data and look them up with OC, we lose the address: https://register.openownership.org/entities/59c225c267e4ebf34031fb65
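The intended merge behaviour can be sketched in one line (method name and the hash-based shape are illustrative, not the actual resolver code): OpenCorporates data should fill gaps or update fields it actually has, never blank out fields the source already provided.

```ruby
# Sketch: OC values win where present, but nil/missing OC values
# never overwrite data we already imported from the source register.
def merge_oc_attributes(source_attrs, oc_attrs)
  source_attrs.merge(oc_attrs.compact)
end
```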

Acceptance criteria

Given we have fixed the bug

And we've re-run an import of all our raw data

When I look at Sprematec GMBH: https://register.openownership.org/entities/59c225c267e4ebf34031fb65

Then I see an address in the top right metadata

When I look at the download changelog page I see a description of the impact of this change on the register's dataset.

In particular, with stats on how many companies changed after this fix (and, by extension, how many didn't), broken down into:

  • how many changed because they were matched with OC when they weren't previously
  • how many just have source data added (i.e. blank fields with data in now)
  • how many have changed because OC's data has changed since we last looked them up

Hide ended statements in graphs by default

Example of current and former ownership and control relationships coexisting: https://register.openownership.org/entities/5b16a8b89dfc3fae18f62024/graph

In my experience the default behaviour of showing all relationships regardless of whether they have ended:

  • is a source of confusion for users;
  • gives the impression that the data is messier and less useful than it really is.

I would prefer to see only the current situation, with some mechanism to see past ownership positions.

Add and configure the lograge gem to streamline our logs

We recently ran over our 200MB daily limit of logs with Papertrail. They have some suggestions for reducing the size of the logs you produce: https://help.papertrailapp.com/kb/configuration/controlling-verbosity and, as a stopgap, I added a filter via the Papertrail settings to remove any logs about which views and partials were being rendered.

Some of the other suggestions make more sense long term, however, including installing the lograge gem (which collapses Rails' logs down to single lines per request) and perhaps disabling Action View logging altogether.
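For reference, a typical lograge setup looks something like this (a sketch; the exact options and which environment file it lives in are our choice to make):

```ruby
# config/environments/production.rb (or an initializer)
Rails.application.configure do
  config.lograge.enabled = true  # one key=value line per request
  config.lograge.formatter = Lograge::Formatters::KeyValue.new
  # Optionally silence Action View's render logging entirely:
  # config.action_view.logger = nil
end
```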

Replace the Provenance model entirely with RawDataProvenances

So that

  • I can show more detailed provenance (OO-509: As a user, I want a Provenance link to raw data from the relationship page)
  • I can output Source info for every statement in BODS
  • I can remove the code that deals with Provenance sources in the BODS export
  • I can show a statementDate on person and entity statements
  • I can show sources and statementDates on unknownPerson statements

Assumptions

  • UnknownPersonsEntities (or whatever code replaces them) will need to have RawDataProvenances given to them from Statements when they’re created.

Make it possible to store approximate dates in every date field

We currently use a library which provides a special database field for ISO8601 approximate dates (e.g. 2019-05). However, in developing the BODS import, I realised that it doesn’t really work in a way that supports this correctly.

The library allows us to parse dates like 2019-05, but it turns that into 2019-05-01 when it saves it in the database, effectively losing the ‘approximation’ from the source. This has some advantages in that it becomes comparable to other full dates (e.g. for sorting) but it seems important that we don’t lose the original intention of the source.

Relatedly, we only use this special date library on some dates and in some database tables. It seems like we should use them everywhere, or at least be consistent in which kinds of things are approximate and which aren’t (e.g. are statement dates approximate?).
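One possible approach (an assumption, not the current implementation, and the type name is invented): keep the original ISO8601 string so the source's precision survives, alongside a padded Date used only for sorting and comparison.

```ruby
require 'date'

# Sketch: preserve the source's approximate date ('2019' or '2019-05')
# for display/export, while still having a comparable Date for sorting.
ApproxDate = Struct.new(:original, :date) do
  def self.parse(str)
    parts = str.split('-').map(&:to_i)
    # Missing month/day are padded to 1 only for the sortable Date
    new(str, Date.new(parts[0], parts[1] || 1, parts[2] || 1))
  end

  def to_s
    original # display keeps the original precision
  end
end
```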

Update bootstrap-ui library

We are currently using an alpha version of the Bootstrap v5 library, which has some broken things in it (e.g. collapsible containers don't quite work). We should upgrade this to the latest stable v5 release.

Note: I tried this a few weeks back and it looks like the entire header is broken on the latest v5 release due to breaking changes from the alpha (in the nav component). So this will likely be a bigger task than expected.

Register v2: Company IDs starting with a zero

One new issue I've noticed is that we aren't transforming company IDs starting with a zero, so there are occasionally duplicates. I'll fix that for the import next week (end of May).

ElasticSearch: aggregation exception error

The Error

[400] {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"open_ownership_register_entities_development","node":"SLgzHYLrTfidSNSMn3-UeQ","reason":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory.","caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [type] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}},"status":400}

Stack Trace

elasticsearch-transport (6.1.0) lib/elasticsearch/transport/transport/base.rb:205:in `__raise_transport_error'
elasticsearch-transport (6.1.0) lib/elasticsearch/transport/transport/base.rb:323:in `perform_request'
elasticsearch-transport (6.1.0) lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
elasticsearch-transport (6.1.0) lib/elasticsearch/transport/client.rb:131:in `perform_request'
elasticsearch-api (6.1.0) lib/elasticsearch/api/actions/search.rb:187:in `search'
elasticsearch-model (6.0.0) lib/elasticsearch/model/searching.rb:51:in `execute!'
elasticsearch-model (6.0.0) lib/elasticsearch/model/response.rb:29:in `response'
elasticsearch-model (6.0.0) lib/elasticsearch/model/response/base.rb:34:in `total'
app/controllers/searches_controller.rb:12:in `show'
actionpack (5.2.4.3) lib/action_controller/metal/basic_implicit_render.rb:6:in `send_action'
actionpack (5.2.4.3) lib/abstract_controller/base.rb:194:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/rendering.rb:30:in `process_action'
actionpack (5.2.4.3) lib/abstract_controller/callbacks.rb:42:in `block in process_action'
activesupport (5.2.4.3) lib/active_support/callbacks.rb:132:in `run_callbacks'
actionpack (5.2.4.3) lib/abstract_controller/callbacks.rb:41:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/rescue.rb:22:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/instrumentation.rb:34:in `block in process_action'
activesupport (5.2.4.3) lib/active_support/notifications.rb:168:in `block in instrument'
activesupport (5.2.4.3) lib/active_support/notifications/instrumenter.rb:23:in `instrument'
activesupport (5.2.4.3) lib/active_support/notifications.rb:168:in `instrument'
actionpack (5.2.4.3) lib/action_controller/metal/instrumentation.rb:32:in `process_action'
actionpack (5.2.4.3) lib/action_controller/metal/params_wrapper.rb:256:in `process_action'
actionpack (5.2.4.3) lib/abstract_controller/base.rb:134:in `process'
actionview (5.2.4.3) lib/action_view/rendering.rb:32:in `process'
actionpack (5.2.4.3) lib/action_controller/metal.rb:191:in `dispatch'
actionpack (5.2.4.3) lib/action_controller/metal.rb:252:in `dispatch'
actionpack (5.2.4.3) lib/action_dispatch/routing/route_set.rb:52:in `dispatch'
actionpack (5.2.4.3) lib/action_dispatch/routing/route_set.rb:34:in `serve'
actionpack (5.2.4.3) lib/action_dispatch/journey/router.rb:52:in `block in serve'
actionpack (5.2.4.3) lib/action_dispatch/journey/router.rb:35:in `each'
actionpack (5.2.4.3) lib/action_dispatch/journey/router.rb:35:in `serve'
actionpack (5.2.4.3) lib/action_dispatch/routing/route_set.rb:840:in `call'
rack-attack (6.2.1) lib/rack/attack.rb:156:in `call'
rack-attack (6.2.1) lib/rack/attack.rb:170:in `call'
warden (1.2.8) lib/warden/manager.rb:36:in `block in call'
warden (1.2.8) lib/warden/manager.rb:34:in `catch'
warden (1.2.8) lib/warden/manager.rb:34:in `call'
rack (2.2.3) lib/rack/tempfile_reaper.rb:15:in `call'
rack (2.2.3) lib/rack/etag.rb:27:in `call'
rack (2.2.3) lib/rack/conditional_get.rb:27:in `call'
rack (2.2.3) lib/rack/head.rb:12:in `call'
actionpack (5.2.4.3) lib/action_dispatch/http/content_security_policy.rb:18:in `call'
rack (2.2.3) lib/rack/session/abstract/id.rb:266:in `context'
rack (2.2.3) lib/rack/session/abstract/id.rb:260:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/cookies.rb:670:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/callbacks.rb:28:in `block in call'
activesupport (5.2.4.3) lib/active_support/callbacks.rb:98:in `run_callbacks'
actionpack (5.2.4.3) lib/action_dispatch/middleware/callbacks.rb:26:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/executor.rb:14:in `call'
rollbar (2.27.0) lib/rollbar/middleware/rails/rollbar.rb:25:in `block in call'
rollbar (2.27.0) lib/rollbar.rb:145:in `scoped'
rollbar (2.27.0) lib/rollbar/middleware/rails/rollbar.rb:22:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/debug_exceptions.rb:61:in `call'
rollbar (2.27.0) lib/rollbar/middleware/rails/show_exceptions.rb:22:in `call_with_rollbar'
web-console (3.4.0) lib/web_console/middleware.rb:135:in `call_app'
web-console (3.4.0) lib/web_console/middleware.rb:28:in `block in call'
web-console (3.4.0) lib/web_console/middleware.rb:18:in `catch'
web-console (3.4.0) lib/web_console/middleware.rb:18:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
railties (5.2.4.3) lib/rails/rack/logger.rb:38:in `call_app'
railties (5.2.4.3) lib/rails/rack/logger.rb:26:in `block in call'
activesupport (5.2.4.3) lib/active_support/tagged_logging.rb:71:in `block in tagged'
activesupport (5.2.4.3) lib/active_support/tagged_logging.rb:28:in `tagged'
activesupport (5.2.4.3) lib/active_support/tagged_logging.rb:71:in `tagged'
railties (5.2.4.3) lib/rails/rack/logger.rb:26:in `call'
sprockets-rails (3.2.1) lib/sprockets/rails/quiet_assets.rb:13:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/remote_ip.rb:81:in `call'
request_store (1.4.0) lib/request_store/middleware.rb:19:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/request_id.rb:27:in `call'
rack (2.2.3) lib/rack/method_override.rb:24:in `call'
rack (2.2.3) lib/rack/runtime.rb:22:in `call'
activesupport (5.2.4.3) lib/active_support/cache/strategy/local_cache_middleware.rb:29:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/executor.rb:14:in `call'
actionpack (5.2.4.3) lib/action_dispatch/middleware/static.rb:127:in `call'
rack (2.2.3) lib/rack/sendfile.rb:110:in `call'
webpacker (4.0.7) lib/webpacker/dev_server_proxy.rb:29:in `perform_request'
rack-proxy (0.6.5) lib/rack/proxy.rb:57:in `call'
railties (5.2.4.3) lib/rails/engine.rb:524:in `call'
puma (4.3.5) lib/puma/configuration.rb:228:in `call'
puma (4.3.5) lib/puma/server.rb:713:in `handle_request'
puma (4.3.5) lib/puma/server.rb:472:in `process_client'
puma (4.3.5) lib/puma/server.rb:328:in `block in run'
puma (4.3.5) lib/puma/thread_pool.rb:134:in `block in spawn_thread'

Solution

Changed the following code in search.rb:

def self.aggregations
    {
      type: {
        terms: {
          field: :type
        },
      },
      country: {
        terms: {
          field: :country_code
        },
      },
    }
  end

to:

def self.aggregations
    {
      type: {
        terms: {
          field: "type.keyword"
        },
      },
      country: {
        terms: {
          field: "country_code.keyword"
        },
      },
    }
  end

Question

How is it that this worked for you as-is but fails in my local environment? Do you have any pointers for me?
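One likely explanation (an assumption, not verified against the deployed indexes): the environments have different index mappings. If `type` and `country_code` were indexed as `keyword` (or as `text` with a `.keyword` multi-field, which is Elasticsearch's dynamic-mapping default for strings) in one environment, aggregating on them works there; an index created with a plain explicit `text` mapping raises exactly this error. Declaring the multi-field explicitly in the model's mapping would make the `.keyword` fix work everywhere, e.g. with the elasticsearch-model DSL:

```ruby
# Inside the searchable model: make the keyword sub-field explicit so
# aggregations on "type.keyword" / "country_code.keyword" don't depend
# on dynamic-mapping defaults.
mappings do
  indexes :type,         type: 'text', fields: { keyword: { type: 'keyword' } }
  indexes :country_code, type: 'text', fields: { keyword: { type: 'keyword' } }
end
```

Note the index would need to be recreated and reindexed for a mapping change to take effect.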

Add some ways to browse example data

This was a recommendation that came out of a UX review commissioned across all of OO's web properties and it has been mentioned by other users too.

The UX review specifically suggested adding example(s) to the homepage, other users have asked for ways to browse by country.

Update approach to using OpenCorporates identifiers for sub-national jurisdictions

Inspired by this Twitter thread, I found myself searching for a number of Scottish Qualifying Partnerships on the Open Ownership Register. This took me to the following search results page where we realised that the duplicate entities are not being resolved due to an OpenCorporates issue.

@spacesnottabs investigated further and discovered that OpenCorporates has the company under jurisdiction ca_pe for "Prince Edward Island (Canada)", but the Register is parsing the jurisdiction as ca (Canada). If we try to resolve the record with ca as the jurisdiction code, it will find nothing.

Sample PSC record:

{
  "company_number": "SG000612",
  "data": {                                                                     
    "address": {                                                                
      "address_line_1": "Grafton Street",                                       
      "country": "Canada",                                                      
      "locality": "Charlottestown",                                             
      "premises": "65",                                                         
      "region": "Prince Edward Island C1a8b9"                                   
    },                                                                          
    "etag": "510f53dafafaf4acf43a16964418a2cf8ccc9a3e",                         
    "identification": {                                                         
      "country_registered": "Canada",                                           
      "legal_authority": "Canada",                                              
      "legal_form": "Private Company",
      "place_registered": "Pei Business/Corporate Registry",
      "registration_number": "13174"
    },
    "kind": "corporate-entity-person-with-significant-control",
    "links": {
      "self": "/company/SG000612/persons-with-significant-control/corporate-entity/RnA_vTfWVHeC1PJqQqRw8LZuFoU"
    },
    "name": "Integritas (Canada) Trustee Corporation",
    "natures_of_control": [
      "right-to-appoint-and-remove-person"
    ],
    "notified_on": "2017-06-26"
  }
}

Our sample Entity stored in Mongo:
#<Entity _id: 630e81eab19f5888b5a78d34, updated_at: 2022-08-30 21:32:26.818 UTC, type: "legal-entity", name: "Integritas (Canada) Trustee Corporation", address: "65, Grafton Street, Charlottestown, Prince Edward Island C1a8b9", nationality: nil, country_of_residence: nil, dob: nil, jurisdiction_code: "ca", company_number: "13174", incorporation_date: nil, dissolution_date: nil, company_type: nil, restricted_for_marketing: nil, lang_code: nil, identifiers: [{"document_id"=>"GB PSC Snapshot", "link"=>"/company/SG000612/persons-with-significant-control/corporate-entity/RnA_vTfWVHeC1PJqQqRw8LZuFoU", "company_number"=>"13174"}], merged_entities_count: nil, master_entity_id: nil, oc_updated_at: nil, last_resolved_at: nil, self_updated_at: 2022-08-30 21:32:26.818 UTC, _type: "Entity">

Currently "region" is not used in the code at all; only country is used. This is fine for our gb, dk and sk jurisdictions, but doesn't work for overseas jurisdictions such as Canada.

We need to extend our support to use both region and country to get the jurisdiction name/code by upgrading the countries gem we already use to the latest version (and fix the breaking changes): https://github.com/countries/countries

The work involved should also make sure we can find the jurisdiction even if the name isn't an exact match. The working theory is that the jurisdiction code is {country-code}_{region-code}, but this needs to be checked against the gem and the org-id.guide approach: https://org-id.guide/results?structure=all&coverage=CA&sector=all
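The working theory can be sketched like this. The lookup tables below are illustrative stand-ins (the real implementation would use the upgraded countries gem's subdivision data); note the last example, where the messy region string from the PSC record fails an exact match, which is the fuzzy-matching gap the work above needs to close:

```ruby
# Illustrative tables only; the countries gem would supply these in practice.
COUNTRIES    = { 'canada' => 'ca' }.freeze
SUBDIVISIONS = { 'canada' => { 'prince edward island' => 'pe' } }.freeze

# Working theory: jurisdiction code is {country-code}_{region-code},
# falling back to the bare country code when no region matches.
def jurisdiction_code(country_name, region_name = nil)
  country_key = country_name.to_s.downcase.strip
  country = COUNTRIES[country_key]
  return nil unless country

  region = SUBDIVISIONS.dig(country_key, region_name.to_s.downcase.strip)
  region ? "#{country}_#{region}" : country
end
```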

Develop a way to incrementally explore networks that are too large for initial display

At the moment, the graph does not support very large networks, which are the ones that need the most visualisation support. This is partly a backend performance restriction, where getting that data from the DB is too slow, but also a design problem, where we don't know how to present thousands of nodes/edges in a way that is useful.

Examples

Our current thinking is that some kind of click-to-expand clustering is the best way to solve this.

DoubleRenderError in EntitiesController#show

When rendering the JSON of a merged person, we get a DoubleRenderError because we've redirected to their master entity, but we haven't returned, so the subsequent render still gets called.
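The standard fix is an explicit return after the redirect. A minimal self-contained illustration of the bug and fix (no Rails here; the `Responder` class just mimics "responding twice raises"):

```ruby
class Responder
  class DoubleRenderError < StandardError; end

  def initialize
    @responded = false
  end

  def respond!(what)
    raise DoubleRenderError if @responded
    @responded = true
    what
  end

  # Buggy shape: redirects, then falls through and responds again
  def show_buggy(merged:)
    respond!(:redirect) if merged
    respond!(:render)
  end

  # Fixed shape: explicit early return after the redirect
  def show_fixed(merged:)
    return respond!(:redirect) if merged
    respond!(:render)
  end
end
```

In the controller this is the `redirect_to ... and return` (or `return` on the next line) idiom.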

Import every kind of address when importing BODS data

So that:

  • I can better determine who I'm looking at
  • I don't see as many duplicates

Assumptions:

  • Currently we limit the import to 'service' address types only
  • We only have one address field, so we have to choose one address type or change the data model
  • We don't have any real BODS data sources to learn from
  • We need to do more work to mark our output address types correctly

Acceptance criteria:

  • When we load in our own BODS output data, we see addresses for people that have them in the original sources.

Add a content security policy

Implement a Content Security Policy (CSP): an added layer of security that helps to detect and mitigate various types of attack on our web applications, including Cross-Site Scripting (XSS) and data injection attacks.
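Rails 5.2+ ships a CSP DSL, so a first pass could be a sketch like the following (the directive values are placeholders that would need tuning against our actual assets, CDNs and any inline scripts):

```ruby
# config/initializers/content_security_policy.rb
Rails.application.config.content_security_policy do |policy|
  policy.default_src :self
  policy.script_src  :self
  policy.style_src   :self
  policy.img_src     :self, :data
  policy.object_src  :none
end
```

It may be worth starting with a report-only policy to find violations before enforcing.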

Users are requesting transliterated URLs where the querystring is URL-encoded, causing it not to be recognised and the URL to 404

For example:

https://rollbar.com/OpenOwnership/openownership-register/items/337/occurrences/65572472239/

which seems like a genuine user, not a bot. (Ignore the main title, Rollbar has grouped it under other 404 errors).

How did this user get to that URL? Did they click on it from a link we displayed (in which case there's a bug in our code somewhere), or did they do something weird with it (e.g. copy, paste somewhere else, re-copy and paste into the browser)?

This is not a single isolated incident, and there are multiple browsers involved (at least IE and Safari) so I suspect a bug in our code somewhere.

Industry classification codes from OpenCorporates are not presented with the relevant scheme

There are multiple classification schemes for industry classification codes. OpenCorporates presents the original classification data and a mapping between the original and other schemes, e.g. https://opencorporates.com/statements/757088716

I think the register is taking the original industry classification and combining it with the mapped classifications, then presenting these without the mappings to the relevant schemes, e.g. https://register.openownership.org/entities/5e53bf41ac0ca34dfc2e8c97

This is a source of confusion to users and means that we have a lot of noise in the registry data.

Denmark importer is not making the best use of available DK data

Whilst going through our test data to anonymise it, particularly culling out parts of the data that we don't actually use during imports, I've noticed a couple of things the importer could do better, given the data that's available. This isn't important for the code (what we have works) but would be if we were to document the source:

  • We have code to find the 'most recent' address for people, but the data provides this directly in nyesteBeliggenhedsadresse inside deltagerpersonMetadata
  • We delve deep into the interests of every relationship listed to find the beneficial ownership relationships (among directorships and the other types of relationship the data contains), but these relationships are actually flagged at a higher level, in virksomhedSummariskRelation/organisationer/organisationsNavn/navn, by the value "Reelle ejere" (Danish for 'real owners'). This would potentially simplify the code, because we could just find this relationship and then process all the interests (medlemsData) in two discrete steps, rather than the slightly confusing loop-and-break structure we have now.

It's probably also worth noting in any docs that we're ignoring lots of historical data about companies and people (old names, old addresses) and there are a lot of other fields which we don't really understand/use at the moment.
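The two-step shape suggested above might look roughly like this. The hash nesting is inferred from the field path quoted in the bullet and would need verifying against actual CVR records before use:

```ruby
# Sketch: select only the organisations flagged "Reelle ejere"
# (beneficial owners), then process their medlemsData separately.
def beneficial_owner_organisations(record)
  record.fetch('virksomhedSummariskRelation', []).flat_map do |rel|
    rel.fetch('organisationer', []).select do |org|
      (org['organisationsNavn'] || []).any? { |n| n['navn'] == 'Reelle ejere' }
    end
  end
end
```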

Refactor Interests to share the same structure across all sources

SO THAT

  • I don’t have to reformat them and make source-specific judgements when I come to bulk export data
  • Every record in the database is consistent in its storage of interests
  • I can query all interests in a consistent way if needed
  • I can simplify the code to display interests across the site
  • I can translate interests in a more scalable manner

Background:

Currently we have three different structures:

  • UK: text strings which encode ranges of shares (where relevant)
  • DK: objects, which have a type, share_min and share_max
  • SK: we have no interests or interest type for Slovakia
  • BODS: objects which look like DK, but have exclusive_min and exclusive_max properties too. This is the canonical version we’d like to use everywhere.
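The canonical shape could be sketched as follows. The attribute names track the BODS Interest object, but the class itself and the `from_psc_range` helper (including its assumed exclusive-bound semantics for PSC ranges) are illustrative, not existing code:

```ruby
# Illustrative canonical interest, shared by all importers.
Interest = Struct.new(:type, :share_min, :share_max,
                      :exclusive_min, :exclusive_max, keyword_init: true) do
  # UK PSC-style banded ranges (e.g. '25 to 50 percent') would be parsed
  # into this same structure; the exclusive bounds shown are an assumption
  # about how PSC bands should map and need checking against the standard.
  def self.from_psc_range(type, min, max)
    new(type: type, share_min: min, share_max: max,
        exclusive_min: true, exclusive_max: false)
  end
end
```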

A/C

  • All relationship records in the database have an interest object, with all the properties defined in Interests in the BODS schema
  • Interests are represented by a Rails model, with documented fields
  • All importers convert their source’s interest format to the BODS format during import, making the conversion explicit and tested.
  • We have a one-off batch process to update existing imports
  • All interests are displayed through a single decorator object
  • The BODS export does not need to map interests differently by source.
  • The BODS export can export interests which have been imported from BODS sources.
  • We no longer need a large list of PSC interest codes in our translation strings to properly translate interests for display
  • We have system tests of the BODS export which assert that the interests are output correctly, and these don’t change after this refactoring.
  • The interests on the relationship page don’t change either (and we have system tests to assert that).

birthDate in BODS JSON from UK PSC Register assumes everyone's birthdays are the first of the month

Current mapping from UK PSC Register JSON snapshots to BODS v0.1 maps date_of_birth to birthDate in BODS v0.1.

But this assumes a complete date string. The day of an officer's date of birth in the UK PSC Register is suppressed, so only the month and year they were born are provided.

The result in the BODS v0.1 JSON files for individuals is that all officers from the UK PSC Register appear to have birth dates on the first of the month, which is incorrect. Instead we should map to just YYYY-MM, not YYYY-MM-DD, in the birthDate field.

If this change is made, we need to consider what impact it will have on existing deduplication/entity matching, for example in the natural-persons duplicate merger process.
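The proposed mapping is small. A sketch, assuming the PSC `date_of_birth` hash has `"year"` and `"month"` keys as in Companies House data:

```ruby
# Emit 'YYYY-MM' from a PSC date_of_birth rather than inventing a day.
def bods_birth_date(date_of_birth)
  return nil unless date_of_birth && date_of_birth['year'] && date_of_birth['month']

  format('%04d-%02d', date_of_birth['year'], date_of_birth['month'])
end
```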

Output BODS data in V0.2

So that

  • It matches the most obvious documentation
  • It provides a working example of the current guidance and standard
  • We don't have to incur tech debt maintaining multiple versions of parsers, etc

Assumptions

  • We have work to do upgrading our BODS export to match v0.2. In particular, we’re not working to a consistent model of indirect ownership because we’re currently reflecting each individual source’s approach
  • We need lib-cove-bods to be updated to v0.2 and released

A/C

  • It's valid according to lib-cove-bods
  • TBC - it matches all the new things in v0.2

Refactor the various relationship details views into one shared bit of code

So That

  • I have a single unit of code which provides all relationship diagrams, not several competing versions which confuse me.
  • I can ensure I’m presenting a consistent experience to the user whenever I use a tooltip.
  • I can test one piece of code and be confident it handles all of the eventualities of relationship display.

A/C

Register v2: companies owned by companies issue

It looks like the data is fine when people own companies, but when a company has an ownership statement from a registered entity, the owner is coming through as empty. I've tracked down the bug, but we will need to fix the records that have this problem by reimporting them (conversation in Slack)

Register v2: Middle names

Middle names are not appearing because the v2 prototype uses the name from the raw data rather than defaulting to the OpenCorporates name for people.

Register v2: Hundreds of PSC identifiers produced for same entity

When transforming PSC records, some can produce hundreds of unique identifiers using the PSC link URL from the source record.

...
{"document_id"=>"GB PSC Snapshot", "link"=>"/company/12728169/persons-with-significant-control/corporate-entity/gSel9kO8D2MHk5upXoFQLSR6zzA", "company_number"=>"09361466"},
 {"document_id"=>"GB PSC Snapshot", "link"=>"/company/12729601/persons-with-significant-control/corporate-entity/srvxdgCRlgLeC6u1cfgRHqgwFf0", "company_number"=>"9361466"},
 {"document_id"=>"GB PSC Snapshot", "link"=>"/company/12731527/persons-with-significant-control/corporate-entity/4UZxvVwxokDJXFx1TZIDCrvV4lU", "company_number"=>"9361466"},

Each time the entity is seen as part of a new statement from the PSC source, the current flow fetches all of its previous statements, appends a new identifier, and saves a new record.
This means that for companies involved in a few thousand statements, we would fetch thousands of statements and store a new record, with thousands of identifiers, each time.

To resolve this, do not use the PSC statement link as an identifier; instead, use it to populate the source.url of the produced BODS v0.2 record.
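A hedged sketch of that change (the method name, scheme name, and base URL are assumptions for illustration; the real transformer will differ):

```ruby
# Illustrative sketch: keep only the stable company number as an
# identifier, and carry the per-statement PSC link in source.url.
PSC_BASE_URL = 'https://api.company-information.service.gov.uk' # assumed base

def map_psc_entity(record)
  {
    'identifiers' => [
      { 'scheme' => 'GB-COH', 'id' => record['company_number'] },
    ],
    'source' => { 'url' => "#{PSC_BASE_URL}#{record['link']}" },
  }
end
```

Because the identifier no longer varies per statement, re-seeing the same company in thousands of statements no longer accumulates thousands of identifiers.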

Register v2: Unknown/anonymous people records

The unknown/anonymous people records bug needs to be checked. I'd initially thought that was the issue here, as there were unknown persons in the original graph, but I only found yesterday that it was due to something being owned by a registeredEntity instead of a legalEntity (conversation in Slack).

Mark interests as constituting beneficial ownership (or not) in BODS output

So that:

  • I know what data I'm looking at

Assumptions

  • We can only operate on a source-by-source basis. For DK, everything should set this to true. For PSC, it should be true for every relationship directly between a person and a company. The same should hold at a minimum anywhere else (e.g. EITI, UA, SK), but in SK and UA it may be the case that it can be true for everything. OO-XXX

A/C

  • beneficialOwnershipOrControl is set on every interest where it can be - TBC
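A minimal sketch of the source-by-source rule described above (the method name and the person_owner flag are hypothetical; the behaviour for sources other than DK and PSC is left unknown pending confirmation):

```ruby
# Hypothetical sketch: decide beneficialOwnershipOrControl per source.
# DK: always true. PSC: true only for direct person-to-company
# relationships. Other sources: unknown (nil), so the field is omitted.
def beneficial_ownership_or_control(source, person_owner:)
  case source
  when :dk  then true
  when :psc then person_owner
  else nil # e.g. :eiti, :ua, :sk - to be confirmed
  end
end
```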
