openownership / bodspipelines
Shared library intended to support building pipelines to produce beneficial ownership statements (BODS) data.
License: GNU Affero General Public License v3.0
There are about 346 LEI, 1239 RR and 438 REPEX records that don't make it through the pipeline because they are "duplicates" under current code.
For the 346 LEI record duplicates, 275 have one record whose RegistrationStatus is PENDING_ARCHIVAL and one with some other value such as ISSUED or LAPSED. Unfortunately there are numerous other records with the PENDING_ARCHIVAL status, so it is not an identifying criterion. There is one analogous case where the duplicate is a PENDING_TRANSFER record.
There is one case where the RegistrationStatus for both is ISSUED, the LastUpdateDate is a few minutes apart and there are significant changes to the data.
The rest of the LEI duplicates (70) appear to be actual duplicates: the data is identical (including the LastUpdateDate) and they tend to occur sequentially.
In all cases it looks like taking the record with the newest LastUpdateDate is the best bet.
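The "keep the newest LastUpdateDate" rule could be sketched as below. This is only an illustration, not the pipeline's code: it assumes each record is a flat dict with `LEI` and `LastUpdateDate` keys holding ISO 8601 timestamps with a UTC offset, and the function names `parse_timestamp` and `keep_newest` are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keep_newest(records):
    """Return one record per LEI, keeping the one with the newest LastUpdateDate.

    Assumes flat dicts with 'LEI' and 'LastUpdateDate' keys (hypothetical shape).
    Exact duplicates with identical timestamps collapse to the first one seen.
    """
    newest = {}
    for record in records:
        lei = record["LEI"]
        current = newest.get(lei)
        if current is None or (
            parse_timestamp(record["LastUpdateDate"])
            > parse_timestamp(current["LastUpdateDate"])
        ):
            newest[lei] = record
    return list(newest.values())
```

Because the comparison is strict, the 70 fully identical pairs reduce to a single record, and the PENDING_ARCHIVAL cases are resolved purely on timestamp rather than on status.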
We missed out from the original mappings that LEI level 1 data has a RegistrationAuthority object. If possible, it would be good to add this where present to the BODS entity statement identifiers array. Specifically:

| LEI level 1 field | BODS entity statement field |
|---|---|
| RegistrationAuthority/RegistrationAuthorityID | identifiers/1/scheme |
| RegistrationAuthority/RegistrationAuthorityEntityID | identifiers/1/id |

Using /1/ here as shorthand to denote that this should be appended to the identifiers array and the LEI identifier kept. In reality there may be instances where there is just one of either identifier.
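A minimal sketch of that mapping follows. It assumes the LEI record has already been parsed into a dict with a top-level `LEI` key and a `RegistrationAuthority` sub-dict; the function name `build_identifiers` and the `XI-LEI` scheme code for the LEI identifier are assumptions for illustration, not taken from the pipeline.

```python
def build_identifiers(lei_record):
    """Build a BODS identifiers array from a parsed LEI level 1 record.

    The LEI identifier is kept, and the RegistrationAuthority identifier is
    appended when present. Either identifier may be missing.
    """
    identifiers = []
    if lei_record.get("LEI"):
        # "XI-LEI" is an assumed scheme code for the LEI identifier.
        identifiers.append({"scheme": "XI-LEI", "id": lei_record["LEI"]})

    authority = lei_record.get("RegistrationAuthority") or {}
    extra = {}
    if authority.get("RegistrationAuthorityID"):
        extra["scheme"] = authority["RegistrationAuthorityID"]
    if authority.get("RegistrationAuthorityEntityID"):
        extra["id"] = authority["RegistrationAuthorityEntityID"]
    if extra:
        identifiers.append(extra)
    return identifiers
```

Appending rather than replacing keeps the LEI as the first entry, which matches the `/1/` shorthand in the table above.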
See the interest type codelist
There are a number of performance/efficiency improvements to the current pipeline which we may want to consider. These would have general positive effects across the board (faster pipeline execution will help not only in production but also in future development), but would be specifically beneficial to either of the two main options (2 or 3) for improving handling of updates to input data (see #9), since both of these options on their own would likely result in significantly longer processing times. Improvements to consider would be:
Depending on exactly where the bottlenecks are, there are likely to be significant performance gains that could be achieved with a small amount of effort, which would provide a good foundation to move forward from.
As we are mapping to BODS v0.2, we should be using incorporatedInJurisdiction in entity statements.
Apologies for not pointing this out before, but incorporatedInJurisdiction should be a JSON object and not a string. See the schema reference.
Note that name is mandatory. Given that the GLEIF schema doesn't include a name, suggest using the country code to populate both the name and the code fields.
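The suggestion above could be sketched as follows. The function name `jurisdiction_object` is hypothetical, and the input is assumed to be the country code string taken from the GLEIF record.

```python
def jurisdiction_object(country_code):
    """Return a BODS incorporatedInJurisdiction object for a country code.

    GLEIF supplies no country name, so the code is used for both fields,
    as suggested above.
    """
    return {"name": country_code, "code": country_code}


# Example: the string "GB" becomes an object with both fields populated.
entity_statement = {"incorporatedInJurisdiction": jurisdiction_object("GB")}
```

If a lookup table of country names were added later, only the `name` value would need to change.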
Currently, while new GLEIF records will result in new BODS statements being created, there is no mechanism for propagating changes when existing records are modified (e.g. an address change on a LEI record will not result in an updated BODS statement). While there are certainly approaches that could address this, they need work in the context of a pipeline (sequential processing). Given that this is currently targeting BODS v0.2, this also means navigating change-over-time issues (openownership/data-standard#392). While producing an updated statement is certainly possible, the previous statement can be linked to other statements, and if those links need to be updated as well, the problem becomes much more complex.
At the moment, incorporatedInJurisdiction is rendered as a string, e.g.:
"incorporatedInJurisdiction": "GB"
Whereas it should be an object:
"incorporatedInJurisdiction": {
  "code": "GB",
  "name": "United Kingdom"
}
We usually ask BODS publishers to provide dates in the format YYYY-MM-DD. It would be great to adjust the mappings of date fields to adhere to this format rather than datetimes.
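One way the conversion could look, assuming the source values are ISO 8601 datetimes (optionally ending in `Z`); the function name `to_bods_date` is hypothetical. The date is taken in the timestamp's own offset, with no timezone conversion.

```python
from datetime import datetime


def to_bods_date(value):
    """Convert an ISO 8601 datetime string to a YYYY-MM-DD date string."""
    # Older Python versions do not accept a trailing 'Z', so normalise it.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.date().isoformat()
```

For example, `to_bods_date("2023-06-01T12:34:56Z")` gives `"2023-06-01"`.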
Attempting to check some (randomly selected) converted data with the BODS data review tool revealed an unrecognised escape character, which prevented the tool from working. Details:
"statementID": "133a135c-28a4-5d73-2617-9965be7424a4"
"name": "ALEXANDERS\' PHARMACY LTD"
Removing the backslash and single quote from the name enables the tool to run.
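The sequence `\'` is not a valid escape in JSON, which is why a strict parser rejects the statement. The sketch below reproduces the failure with Python's standard `json` module and shows that serialising with `json.dumps` leaves the single quote unescaped. The helper name `is_valid_json` is hypothetical, and whether the pipeline writes its output this way is not something the issue states.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The problematic output: a backslash before the single quote.
bad = r'{"name": "ALEXANDERS\' PHARMACY LTD"}'

# Serialising the same value with json.dumps emits no backslash.
good = json.dumps({"name": "ALEXANDERS' PHARMACY LTD"})
```

Here `is_valid_json(bad)` is `False` and `is_valid_json(good)` is `True`, so building the output with a JSON serialiser rather than by string formatting would avoid this class of error.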
We are currently using 'unknownInterest' to represent unknown interest types, but this is a feature of BODS v0.3 rather than BODS v0.2. We should consider how to best represent this with BODS v0.2.
The GLEIF Relationship Record (RR) CDF Format docs specify that the start node in a record is the child or lower node, and the end node is the parent node.
In BODS terms, this means that the start node is the subject of an ooc statement and the end node is the interestedParty. However, we currently have this the wrong way around.
(I think it's only in rr records that this is happening. Worth checking, though.)
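The intended direction could be sketched as below. It assumes the RR record is parsed into a dict with `Relationship.StartNode.NodeID` and `Relationship.EndNode.NodeID`, and that `statement_ids` is a lookup from LEI to BODS entity statement ID; the function name `ooc_parties` and that lookup are hypothetical.

```python
def ooc_parties(rr_record, statement_ids):
    """Return the subject and interestedParty parts of an ooc statement.

    StartNode is the child (the BODS subject); EndNode is the parent
    (the BODS interestedParty).
    """
    relationship = rr_record["Relationship"]
    child_lei = relationship["StartNode"]["NodeID"]
    parent_lei = relationship["EndNode"]["NodeID"]
    return {
        "subject": {"describedByEntityStatement": statement_ids[child_lei]},
        "interestedParty": {"describedByEntityStatement": statement_ids[parent_lei]},
    }
```

With this direction, a child entity appears as `subject` only in the statements that describe its own parents, not in every statement where it is somebody else's parent.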
Evidence of the problem:
Take a particular example. Under the parents section here (https://search.gleif.org/#/record/5493005S591LT1RDBJ14) you can see that AEGON Investment Management B.V. (LEI code 5493005S591LT1RDBJ14) has a single entity (AEGON N.V.) recorded as both its direct parent and its ultimate parent.
Looking at the BODS GLEIF data, we would expect that there would be one or two ownership-or-control statements which have the AEGON Investment Management B.V. entity statement (BODS statementID 9310e3ea-c6e6-1bd9-def1-cc5e902785da) as the subject. (One if we have consolidated the two interests - direct and ultimate parent - into a single ooc statement. Two, otherwise.) However there are 153 such statements. See this by running on Datasette at https://bods-data-datasette.openownership.org/gleif:
select interestedparty_describedbyentitystatement from ooc_statement
where subject_describedbyentitystatement='9310e3ea-c6e6-1bd9-def1-cc5e902785da'
Sure enough, if I assume we have the mapping of subject <--> interested parties the wrong way around and run:
select subject_describedbyentitystatement from ooc_statement
where interestedparty_describedbyentitystatement='9310e3ea-c6e6-1bd9-def1-cc5e902785da'
... there are two identical results:
c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b
c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b
And c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b is the entity statement for AEGON N.V.
(Thanks to our colleagues at GLEIF for spotting this.)
As of June 2023, there are relationship CDF fields that are not mapped to BODS 0.2. The following mappings could (and should?) be added:
The following can be left unmapped:
RelationshipStatus
RelationshipQualifiers
RelationshipQuantifiers
RelationshipPeriods.RelationshipPeriod.PeriodType=="ACCOUNTING_PERIOD"
RelationshipPeriods.RelationshipPeriod.PeriodType=="DOCUMENT_FILING_PERIOD"