openownership / bodspipelines

Shared library intended to support building pipelines to produce beneficial ownership statements (BODS) data.

License: GNU Affero General Public License v3.0

Python 100.00%

bodspipelines's Issues

"Duplicate" Records In GLEIF Data

About 346 LEI, 1239 RR and 438 REPEX records don't make it through the pipeline because the current code treats them as "duplicates".

Of the 346 LEI duplicates, 275 have one record whose RegistrationStatus is PENDING_ARCHIVAL and one with some other value, such as ISSUED or LAPSED. Unfortunately, there are numerous other records with PENDING_ARCHIVAL status, so it is not an identifying criterion. There is one analogous case where the duplicate is a PENDING_TRANSFER record.

There is one case where the RegistrationStatus for both records is ISSUED, the LastUpdateDate values are a few minutes apart and there are significant changes to the data.

The rest of the LEI duplicates (70) appear to be actual duplicates. The data is identical (including the LastUpdateDate) and they tend to occur sequentially.

In all cases it looks like taking the record with the newest LastUpdateDate is the best bet.
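
A minimal sketch of the "keep the newest LastUpdateDate" rule, assuming each record has already been flattened into a dict with `LEI` and `LastUpdateDate` keys. That shape, and the helper names `parse_last_update` and `keep_newest`, are illustrative assumptions rather than the pipeline's actual structures:

```python
from datetime import datetime


def parse_last_update(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2023-05-01T10:15:00Z'."""
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11 onwards,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keep_newest(records):
    """Return one record per LEI, preferring the newest LastUpdateDate.

    Exact duplicates (same LastUpdateDate) keep the first record seen.
    """
    newest = {}
    for record in records:
        lei = record["LEI"]
        current = newest.get(lei)
        if current is None or (
            parse_last_update(record["LastUpdateDate"])
            > parse_last_update(current["LastUpdateDate"])
        ):
            newest[lei] = record
    return list(newest.values())
```

Comparing parsed timestamps rather than raw strings avoids surprises if the records use different offsets or precisions. The same rule could plausibly be applied to RR and REPEX records with a different key, but that has not been checked here.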

GLEIF - additional entity identifiers

The original mappings missed that LEI level 1 data has a RegistrationAuthority object. If possible, it would be good to add this, where present, to the BODS entity statement identifiers array. Specifically:

LEI level 1 field	BODS entity statement field
`RegistrationAuthority/RegistrationAuthorityID`	`identifiers/1/scheme`
`RegistrationAuthority/RegistrationAuthorityEntityID`	`identifiers/1/id`

`/1/` is used here as shorthand to denote that the new identifier should be appended to the identifiers array and the LEI identifier kept. In reality there may be instances where only one of the two identifiers is present.
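
A rough sketch of the mapping above. The input shape (a dict with `LEI` and `RegistrationAuthority` keys), the function name `build_identifiers`, and the `XI-LEI` scheme code for the LEI identifier are assumptions made for illustration; the registration authority identifier is only appended when both of its parts are present:

```python
def build_identifiers(lei_record: dict) -> list:
    """Build the BODS identifiers array for an entity statement."""
    identifiers = []

    # Keep the LEI identifier where there is one.
    # "XI-LEI" is an assumed scheme code, not taken from the issue.
    lei = lei_record.get("LEI")
    if lei:
        identifiers.append({"scheme": "XI-LEI", "id": lei})

    # Append the registration authority identifier where present.
    authority = lei_record.get("RegistrationAuthority") or {}
    authority_id = authority.get("RegistrationAuthorityID")
    entity_id = authority.get("RegistrationAuthorityEntityID")
    if authority_id and entity_id:
        identifiers.append({"scheme": authority_id, "id": entity_id})

    return identifiers
```

With this approach a record that lacks a RegistrationAuthority object simply yields a one-element array containing the LEI.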

'otherInfluenceOrControl' should be 'other-influence-or-control'

See the interest type codelist

General Performance/Efficiency Improvements

There are a number of performance and efficiency improvements to the current pipeline that we may want to consider. They would have positive effects across the board: faster pipeline execution helps not only in production but also in future development. They would be especially beneficial for either of the two main options (2 or 3) for improving the handling of updates to input data (see #9), since both of these options on their own would likely result in significantly longer processing times. Improvements to consider would be:

Improve XML parser performance
Optimise Elasticsearch usage
Decouple various sub-stages of pipeline (with concurrency or separate processes)
Possibly improve Kinesis usage (though that has seen some work already)

Depending on exactly where the bottlenecks are, significant performance gains could likely be achieved with a small amount of effort, which would provide a good foundation to move forward from.
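
As one illustration of the XML parsing item, the sketch below streams records with the standard library's `xml.etree.ElementTree.iterparse` and releases each element once it has been consumed. It is not a description of how the pipeline currently parses XML, and the function name `iter_records` and the tag name used in the usage example are assumptions:

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag: str):
    """Yield each element whose local name is `tag`, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Strip any '{namespace}' prefix before comparing tag names.
        if elem.tag.rsplit("}", 1)[-1] == tag:
            yield elem
            # The caller has finished with this element once it asks for
            # the next one, so its children and text can be released.
            elem.clear()
```

Usage, with a tiny in-memory document standing in for a real data file:

```python
xml = b"<root><LEIRecord><LEI>A</LEI></LEIRecord></root>"
for record in iter_records(io.BytesIO(xml), "LEIRecord"):
    print(record.find("LEI").text)
```

Whether this helps depends on where the bottleneck actually is, so it would be worth profiling before changing anything.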

GLEIF - `jurisdiction` should be `incorporatedInJurisdiction`

As we are mapping to BODS v0.2, we should be using incorporatedInJurisdiction in entity statements.

`incorporatedInJurisdiction` should be a JSON object

Apologies for not pointing this out before, but incorporatedInJurisdiction should be a JSON object and not a string. See the schema reference.

Note that name is mandatory - given that the GLEIF schema doesn't include a name, suggest using the country code to populate both the name and the code fields.
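
The suggestion can be sketched as a small helper. The function name `jurisdiction_object` is hypothetical, and the input is assumed to be the jurisdiction code already extracted from the GLEIF record:

```python
def jurisdiction_object(country_code: str) -> dict:
    """Build a BODS incorporatedInJurisdiction object from a country code.

    GLEIF does not supply a jurisdiction name, so the code is used for
    both the mandatory `name` field and the `code` field.
    """
    return {"name": country_code, "code": country_code}


entity_statement = {"incorporatedInJurisdiction": jurisdiction_object("GB")}
```

A lookup from code to country name could replace the duplicated value later without changing the shape of the output.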

Handling Updates To Existing GLEIF Records

Currently, new GLEIF records result in new BODS statements being created, but if existing records are modified there is no mechanism for propagating those changes (e.g. an address change on a LEI record will not result in an updated BODS statement). While there are certainly approaches that could address this, they need work to fit the context of a pipeline (sequential processing). Given that this is currently targeting BODS v0.2, this also means navigating change-over-time issues (openownership/data-standard#392). While producing an updated statement is certainly possible, the previous statement can be linked to other statements, and if those links need to be updated as well, the problem becomes much more complex.

Jurisdiction property should give an object (not a string) as a key

At the moment, incorporatedInJurisdiction is rendered as a string, eg:

"incorporatedInJurisdiction": "GB"

Whereas it should be an object:

"incorporatedInJurisdiction": {
  "code": "GB",
  "name": "United Kingdom"
}

date formats

We usually ask BODS publishers to provide dates in the format YYYY-MM-DD. It would be great to adjust the mappings of date fields to adhere to this format rather than using datetimes.
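
A minimal sketch of such a conversion, assuming the source values are ISO 8601 datetimes such as `2023-06-01T12:30:00Z`. The helper name `to_bods_date` is hypothetical:

```python
from datetime import datetime


def to_bods_date(value: str) -> str:
    """Reduce an ISO 8601 datetime string to a YYYY-MM-DD date string."""
    # Normalise a trailing 'Z' so fromisoformat() accepts it before 3.11.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # The date is taken as written in the value's own offset;
    # no timezone conversion is applied.
    return parsed.date().isoformat()
```

For example, `to_bods_date("2023-06-01T12:30:00Z")` gives `"2023-06-01"`.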

unrecognised escape characters

Attempting to check some (randomly selected) converted data with the BODS data review tool revealed an unrecognised escape character, which prevented the tool from working. Details:

"statementID": "133a135c-28a4-5d73-2617-9965be7424a4"
"name": "ALEXANDERS\' PHARMACY LTD"

Removing the backslash and single quote from the name enables the tool to run.
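
For context, `\'` is not one of the escape sequences JSON allows, which is why a strict parser rejects the value. The sketch below reproduces the failure with Python's standard `json` module and shows that serialising the plain string with `json.dumps` yields valid output. Where the backslash is introduced in the pipeline is not known here, and the helper name `is_valid_json` is hypothetical:

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The reported output: a raw string keeps the backslash before the quote.
reported = r'''{"name": "ALEXANDERS\' PHARMACY LTD"}'''

# Serialising the plain Python string lets json.dumps() choose the escapes.
fixed = json.dumps({"name": "ALEXANDERS' PHARMACY LTD"})
```

Here `is_valid_json(reported)` is false and `is_valid_json(fixed)` is true, so the apostrophe itself does not need to be removed, only the stray backslash.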

GLEIF - 'unknownInterest' not a feature of BODS v0.2

We are currently using 'unknownInterest' to represent unknown interest types, but this is a feature of BODS v0.3 rather than BODS v0.2. We should consider how to best represent this with BODS v0.2.

Direction of ownership-or-control relationships for rr records is incorrect

The GLEIF Relationship Record (RR) CDF Format docs specify that the start node in a record is the child or lower node, and the end node is the parent node.

In BODS terms, this means that the start node is the subject of an ooc statement and the end node is the interestedParty. However, we currently have this the wrong way around.

(I think it's only in rr records that this is happening. Worth checking, though.)
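
The intended direction can be sketched as follows. The nested dict shape for the RR record, the `statement_id_for_lei` lookup and the function name are assumptions for illustration; the point is only that the start node maps to the subject and the end node to the interestedParty:

```python
def subject_and_interested_party(rr_record: dict, statement_id_for_lei: dict) -> dict:
    """Map an RR record's nodes onto BODS ownership-or-control fields."""
    relationship = rr_record["Relationship"]
    child_lei = relationship["StartNode"]["NodeID"]   # child / lower node
    parent_lei = relationship["EndNode"]["NodeID"]    # parent node
    return {
        # The child is the entity being owned or controlled.
        "subject": {
            "describedByEntityStatement": statement_id_for_lei[child_lei]
        },
        # The parent is the party holding the interest.
        "interestedParty": {
            "describedByEntityStatement": statement_id_for_lei[parent_lei]
        },
    }
```

Under this mapping, a parent with many subsidiaries appears as the interestedParty of many statements, while each subsidiary is the subject of only a few.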

Evidence of the problem:
Take a particular example. Under the parents section here (https://search.gleif.org/#/record/5493005S591LT1RDBJ14) you can see that AEGON Investment Management B.V. (LEI code 5493005S591LT1RDBJ14) has a single entity (AEGON N.V.) recorded as both its direct parent and its ultimate parent.

Looking at the BODS GLEIF data, we would expect one or two ownership-or-control statements to have the AEGON Investment Management B.V. entity statement (BODS statementID 9310e3ea-c6e6-1bd9-def1-cc5e902785da) as the subject: one if we have consolidated the two interests (direct and ultimate parent) into a single ooc statement, and two otherwise. However, there are 153 such statements. To see this, run the following on Datasette at https://bods-data-datasette.openownership.org/gleif:

select interestedparty_describedbyentitystatement from ooc_statement
where subject_describedbyentitystatement='9310e3ea-c6e6-1bd9-def1-cc5e902785da'

Sure enough, if I assume we have the mapping of subject <--> interested parties the wrong way around and run:

select subject_describedbyentitystatement from ooc_statement
where interestedparty_describedbyentitystatement='9310e3ea-c6e6-1bd9-def1-cc5e902785da'

... there are two identical results:

c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b
c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b

And c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b is the entity statement for AEGON N.V.

(Thanks to our colleagues at GLEIF for spotting this.)

Handle more relationship information

As of June 2023, there are relationship CDF fields that are not mapped to BODS 0.2. The following mappings could (and should?) be added:

RelationshipType -> Interest.Details
RegistrationStatus ->
- Annotations.[0].statementPointerTarget="/";
- Annotations.[0].motivation= "commenting";
- Annotations.[0].description = "GLEIF registration status: [RegistrationStatus]";

The following can be left unmapped:

RelationshipStatus
RelationshipQualifiers
RelationshipQuantifiers
RelationshipPeriods.RelationshipPeriod.PeriodType=="ACCOUNTING_PERIOD"
RelationshipPeriods.RelationshipPeriod.PeriodType=="DOCUMENT_FILING_PERIOD"
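
The two suggested mappings could look roughly like this. The nested dict shape for the RR record and the function name `relationship_extras` are assumptions; the annotation values are the ones listed above:

```python
def relationship_extras(rr_record: dict):
    """Return (interest details, annotations) for an RR record."""
    relationship_type = rr_record["Relationship"]["RelationshipType"]
    registration_status = rr_record["Registration"]["RegistrationStatus"]

    # RelationshipType -> Interest.Details
    interest_details = relationship_type

    # RegistrationStatus -> a single commenting annotation on the statement
    annotations = [
        {
            "statementPointerTarget": "/",
            "motivation": "commenting",
            "description": f"GLEIF registration status: {registration_status}",
        }
    ]
    return interest_details, annotations
```

The caller would then place the first value in the interest's details field and the second in the statement's annotations array.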
