openownership / bodspipelines

Shared library intended to support building pipelines to produce beneficial ownership statements (BODS) data.

License: GNU Affero General Public License v3.0

Python 100.00%

bodspipelines's People

Contributors

radix0000, stephenabbott

bodspipelines's Issues

"Duplicate" Records In GLEIF Data

There are about 346 LEI, 1239 RR and 438 REPEX records that don't make it through the pipeline because they are "duplicates" under current code.

For the 346 LEI record duplicates, 275 have one record whose RegistrationStatus is PENDING_ARCHIVAL and one with some other value such as ISSUED or LAPSED. Unfortunately there are numerous other records with the PENDING_ARCHIVAL status, so it is not an identifying criterion. There is one analogous case where the duplicate is a PENDING_TRANSFER record.

There is one case where the RegistrationStatus for both is ISSUED, the LastUpdateDate values are a few minutes apart and there are significant changes to the data.

The rest of the LEI duplicates (70) appear to be actual duplicates. The data is identical (including the LastUpdateDate) and they tend to occur sequentially.

In all cases it looks like taking the record with the newest LastUpdateDate is the best bet.
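
The "keep the newest LastUpdateDate" rule could be sketched roughly as follows. This is only an illustration, not the pipeline's code: it assumes each record has already been flattened into a dict with hypothetical "LEI" and "LastUpdateDate" keys, and that every timestamp carries a UTC offset or a trailing "Z".

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keep_newest(records):
    """Return one record per LEI, keeping the newest LastUpdateDate.

    Assumes flat dicts with hypothetical "LEI" and "LastUpdateDate" keys.
    When two records share the same timestamp, the first one seen is kept.
    """
    newest = {}
    for record in records:
        lei = record["LEI"]
        current = newest.get(lei)
        if current is None or (
            parse_timestamp(record["LastUpdateDate"])
            > parse_timestamp(current["LastUpdateDate"])
        ):
            newest[lei] = record
    return list(newest.values())
```

Because the comparison is strict, exact duplicates (identical data and LastUpdateDate) collapse to the first occurrence, which covers the 70 "actual duplicate" cases as well.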

GLEIF - additional entity identifiers

We missed out from the original mappings that LEI level 1 data has a RegistrationAuthority object. If possible, it would be good to add this where present to the BODS entity statement identifiers array. Specifically:

LEI level 1 field                                    BODS entity statement field
RegistrationAuthority/RegistrationAuthorityID        identifiers/1/scheme
RegistrationAuthority/RegistrationAuthorityEntityID  identifiers/1/id

Here /1/ is shorthand to denote that this identifier should be appended to the identifiers array and the LEI identifier kept. In reality there may be instances where only one of the two identifiers is present.
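
A minimal sketch of the proposed mapping is below. The input shape (a dict with hypothetical "LEI" and "RegistrationAuthority" keys) and the "XI-LEI" scheme code for the LEI identifier are assumptions made for illustration. The sketch also assumes the registration authority identifier is only added when both its scheme and its id are present.

```python
def build_identifiers(lei_record):
    """Build the BODS identifiers array from a (hypothetically flattened) LEI record."""
    identifiers = []

    lei = lei_record.get("LEI")
    if lei:
        # "XI-LEI" is assumed here as the scheme code for LEIs.
        identifiers.append({"scheme": "XI-LEI", "id": lei})

    authority = lei_record.get("RegistrationAuthority") or {}
    scheme = authority.get("RegistrationAuthorityID")
    entity_id = authority.get("RegistrationAuthorityEntityID")
    if scheme and entity_id:
        # Appended after the LEI identifier, which is kept.
        identifiers.append({"scheme": scheme, "id": entity_id})

    return identifiers
```

Appending rather than indexing means a record with only one of the two identifiers still produces a valid, shorter array.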

General Performance/Efficiency Improvements

There are a number of performance/efficiency improvements to the current pipeline that we may want to consider. They would have positive effects across the board (faster pipeline execution will help not only in production but also in future development), and would be particularly beneficial to either of the two main options (2 or 3) for improving the handling of updates to input data (see #9), since both of these options on their own would likely result in significantly longer processing times. Improvements to consider would be:

  1. Improve XML parser performance
  2. Optimise Elasticsearch usage
  3. Decouple various sub-stages of pipeline (with concurrency or separate processes)
  4. Possibly improve Kinesis usage (though that has seen some work already)

Depending on exactly where the bottlenecks are, significant performance gains could likely be achieved with a small amount of effort, which would provide a good foundation to move forward from.

`incorporatedInJurisdiction` should be a JSON object

Apologies for not pointing this out before, but incorporatedInJurisdiction should be a JSON object and not a string. See the schema reference.

Note that name is mandatory. Given that the GLEIF schema doesn't include a name, I suggest using the country code to populate both the name and the code fields.
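
As a rough illustration of that suggestion, a helper along these lines would produce the object form. The function name is hypothetical, and the sketch assumes the GLEIF jurisdiction value is passed in as a plain code string.

```python
def jurisdiction_object(country_code):
    """Return incorporatedInJurisdiction as an object rather than a string.

    The GLEIF data has no jurisdiction name, so the code is reused for
    both fields, as suggested in this issue.
    """
    return {"name": country_code, "code": country_code}
```

For example, `jurisdiction_object("GB")` gives an object whose name and code are both "GB".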

Handling Updates To Existing GLEIF Records

Currently, while new GLEIF records will result in new BODS statements being created, there is no mechanism for propagating changes when existing records are modified (e.g. an address change on a LEI record will not result in an updated BODS statement). While there are certainly approaches that could address this, they need work in the context of a pipeline (sequential processing). Given that this is currently targeting BODS v0.2, this also means navigating change-over-time issues (openownership/data-standard#392). While producing an updated statement is certainly possible, the previous statement can be linked to other statements, and if those links need to be updated as well, the problem becomes much more complex.

date formats

We usually ask BODS publishers to provide dates in the format YYYY-MM-DD. It would be great to adjust the mappings of date fields to adhere to this format rather than using datetimes.
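
One possible way to do the conversion is sketched below. It assumes the source values are ISO 8601 datetimes, and the function name is hypothetical. The date is taken in the timestamp's own offset, with no conversion to UTC.

```python
from datetime import datetime


def to_bods_date(value):
    """Convert an ISO 8601 datetime string to a YYYY-MM-DD date string."""
    # Older Python versions do not accept a trailing 'Z', so normalise it first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.date().isoformat()
```

Parsing rather than slicing the first ten characters means malformed values raise a ValueError instead of passing through silently.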

unrecognised escape characters

Attempting to check some (randomly selected) converted data with the BODS data review tool revealed an unrecognised escape character, which prevented the tool from working. Details:

"statementID": "133a135c-28a4-5d73-2617-9965be7424a4"
"name": "ALEXANDERS\' PHARMACY LTD"

Removing the backslash and single quote from the name enables the tool to run.
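
JSON defines no `\'` escape, so a strict parser rejects the name above. The sketch below reproduces the failure with Python's standard json module and shows that serialising the plain string with json.dumps leaves the apostrophe unescaped. Where the backslash is introduced in the pipeline is not established here, and the helper name is hypothetical.

```python
import json

NAME = "ALEXANDERS' PHARMACY LTD"

# The problematic output, rebuilt piece by piece: a backslash precedes the quote.
BAD_LINE = '{"name": "ALEXANDERS' + "\\'" + ' PHARMACY LTD"}'


def is_valid_json(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Serialising the plain Python string writes the apostrophe as-is.
good_line = json.dumps({"name": NAME})
```

A check like `is_valid_json` could be run over a sample of output lines before handing data to the review tool.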

Direction of ownership-or-control relationships for rr records is incorrect

The GLEIF Relationship Record (RR) CDF Format docs specify that the start node in a record is the child or lower node, and the end node is the parent node.

In BODS terms, this means that the start node is the subject of an ooc statement and the end node is the interestedParty. However, we currently have this the wrong way around.

(I think it's only in rr records that this is happening. Worth checking, though.)

Evidence of the problem:
Take a particular example. Under the parents section here (https://search.gleif.org/#/record/5493005S591LT1RDBJ14) you can see that AEGON Investment Management B.V. (LEI code 5493005S591LT1RDBJ14) has a single entity (AEGON N.V.) recorded as its direct parent and its ultimate parent.

Looking at the BODS GLEIF data, we would expect there to be one or two ownership-or-control statements which have the AEGON Investment Management B.V. entity statement (BODS statementID 9310e3ea-c6e6-1bd9-def1-cc5e902785da) as the subject. (One if we have consolidated the two interests - direct and ultimate parent - into a single ooc statement. Two, otherwise.) However, there are 153 such statements. To see this, run the following query on Datasette at https://bods-data-datasette.openownership.org/gleif:

select interestedparty_describedbyentitystatement from ooc_statement
where subject_describedbyentitystatement='9310e3ea-c6e6-1bd9-def1-cc5e902785da'

Sure enough, if I assume we have the mapping of subject <--> interested parties the wrong way around and run:

select subject_describedbyentitystatement from ooc_statement
where interestedparty_describedbyentitystatement='9310e3ea-c6e6-1bd9-def1-cc5e902785da'

... there are two identical results:

c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b
c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b

And c0ae9cb2-fb7c-1b72-70c4-d4c950aa954b is the entity statement for AEGON N.V.

(Thanks to our colleagues at GLEIF for spotting this.)
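
The corrected direction could look something like the sketch below. The nested record shape (Relationship, StartNode, EndNode, NodeID) follows the RR field names, but the dict layout, the function name and the `statement_ids` lookup from LEI to entity statementID are assumptions for illustration.

```python
def ooc_parties(rr_record, statement_ids):
    """Map an RR record to the subject and interestedParty of an ooc statement.

    StartNode is the child (the BODS subject); EndNode is the parent
    (the BODS interestedParty). `statement_ids` is a hypothetical lookup
    from LEI to entity statementID.
    """
    relationship = rr_record["Relationship"]
    child_lei = relationship["StartNode"]["NodeID"]
    parent_lei = relationship["EndNode"]["NodeID"]
    return {
        "subject": {"describedByEntityStatement": statement_ids[child_lei]},
        "interestedParty": {"describedByEntityStatement": statement_ids[parent_lei]},
    }
```

With the AEGON example, the statement for 5493005S591LT1RDBJ14 would then appear as the subject and the AEGON N.V. statement as the interested party.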

Handle more relationship information

As of June 2023, there are relationship CDF fields that are not mapped to BODS 0.2. The following mappings could (and should?) be added:

  • RelationshipType -> Interest.Details
  • RegistrationStatus ->
    • Annotations.[0].statementPointerTarget="/";
    • Annotations.[0].motivation= "commenting";
    • Annotations.[0].description = "GLEIF registration status: [RegistrationStatus]";

The following can be left unmapped:

  • RelationshipStatus

  • RelationshipQualifiers

  • RelationshipQuantifiers

  • RelationshipPeriods.RelationshipPeriod.PeriodType=="ACCOUNTING_PERIOD"

  • RelationshipPeriods.RelationshipPeriod.PeriodType=="DOCUMENT_FILING_PERIOD"
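
The two proposed mappings might be sketched as follows. The nested dict layout of the RR record and the function name are assumptions, and only the fields named in this issue are produced.

```python
def map_relationship_fields(rr_record):
    """Map RelationshipType and RegistrationStatus to BODS 0.2 fields.

    Assumes a hypothetical dict layout with "Relationship" and
    "Registration" keys; fields listed as unmapped are ignored.
    """
    relationship_type = rr_record["Relationship"]["RelationshipType"]
    status = rr_record["Registration"]["RegistrationStatus"]
    return {
        "interests": [{"details": relationship_type}],
        "annotations": [
            {
                "statementPointerTarget": "/",
                "motivation": "commenting",
                "description": f"GLEIF registration status: {status}",
            }
        ],
    }
```

The returned fragment would be merged into the ownership-or-control statement alongside the fields the pipeline already sets.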
