openownership / bodsdata

Data analysis tools to help analysts, journalists and anyone wanting to examine and dive into beneficial ownership data published in line with the Beneficial Ownership Data Standard

Home Page: https://bods-data.openownership.org/

License: Other

Python 81.40% JavaScript 6.82% Ruby 0.54% HTML 11.13% SCSS 0.11%
beneficial-ownership open-data open-source

bodsdata's People

Contributors: kindly, lgs85, radix0000, stephenabbott

bodsdata's Issues

New output: Open Ownership Register data mapped to FollowTheMoney

Friedrich from OpenSanctions has created a process for mapping the Open Ownership Register BODS data from the data analysis tools to the FollowTheMoney data model used by OpenSanctions and OCCRP's Aleph tools.

This issue is to record a request for OO and ODS to investigate whether we can reuse this code (https://github.com/opensanctions/bods-ftm) and offer an FTM output of BODS data with each new refresh of the BODS data analysis tools.

Update BODS data analysis tools to offer Denmark, Slovakia and the UK BO registers as separate sources

Currently the first version of the BODS data analysis tools offers data from the Open Ownership Register and Latvia as separate sources.

Development work to update the Open Ownership Register is due to be completed by the end of August 2022, after which the Register will publish data in line with BODS v0.2 rather than BODS v0.1. The Register will also publish a much wider range of data fields than before, including better source field information.

This development work should give us two options for offering the data from individual registers ingested into the Open Ownership Register as separate sources in their own right: either taking a country-specific slice of the data from an early stage of the AWS data pipeline, using files stored on S3, or filtering the OO Register dataset by source.
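As a rough illustration of the second option (filtering the OO Register dataset by source), here is a minimal sketch. It assumes the dataset is available as JSON Lines with one BODS statement per line, and that each relevant statement carries a `source` object with a `description` naming the originating register. Both the `filter_by_source` helper and the choice of `source.description` as the filter key are assumptions for illustration, not the tools' actual implementation.

```python
import json


def filter_by_source(lines, wanted_description):
    """Yield statements whose source description matches the wanted register.

    `lines` is any iterable of JSON Lines strings (for example an open file).
    Filtering on `source.description` is an assumption about how the
    originating register is recorded in the data.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        statement = json.loads(line)
        source = statement.get("source") or {}
        if source.get("description") == wanted_description:
            yield statement
```

Because the function is a generator, it can be run over a large file without loading every statement into memory at once.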

Following more technical conversations to find the best way forward, I would like us to update the BODS data analysis tools to offer the following additional registers as individual sources:

  1. Denmark Central Business Register
  2. Slovakia Public Sector Partners Register
  3. United Kingdom Persons with significant control Register

These would appear in addition to the existing sources for the Open Ownership Register and Register of Enterprises of the Republic of Latvia.

Ingestion of data from Ukraine was halted in mid-2020 so the Ukraine Consolidated State Registry should not be offered as a separate source until a future time when ingestion of Ukraine's data is restarted.

This work should set a template/lay the groundwork by which future registers ingested into the Open Ownership Register at a later stage can be separated out and offered as sources via the data analysis tools. This would be in addition to their consolidated data being offered as part of the main Open Ownership Register dataset.

Once these sources are separated out in the BODS data analysis tools, the Open Ownership team will be spinning up country versions of our data analysis notebook and will want to be able to flag this work from the country data page in the BODS data analysis tools.

New output: Senzing ready JSON

Following on from the new output request in #2, I would like us to investigate whether we can map the BODS FollowTheMoney JSON output to Senzing's specifications.

Here are the notes from Senzing about mapping to their JSON specifications:
https://senzing.zendesk.com/hc/en-us/articles/231925448-Generic-Entity-Specification-JSON-CSV-Mapping

And here is the code from OpenSanctions again which I hope we'd be able to reuse:
https://github.com/opensanctions/mapper-senzing/blob/main/ftm_processor.py

json_zip stage assumes JSON Lines input files but they are actually gzipped

The output json.zip file contains a {source}.json file, which should contain the concatenated JSON Lines from the input files. However, the stage assumes those input files are plain JSON Lines when in fact they have been gzipped, so the output {source}.json file actually contains the concatenated contents of the gzip files.
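One possible shape for a fix is to decompress each input while concatenating. The sketch below is not the project's actual json_zip code; the function name `write_json_zip` and its parameters are hypothetical. It streams each gzipped input line by line into a single {source}.json member of the output zip.

```python
import gzip
import zipfile


def write_json_zip(input_paths, zip_path, source):
    """Concatenate gzipped JSON Lines inputs into {source}.json inside a zip."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # force_zip64 lets the member grow past 2 GiB when the final size is unknown
        with zf.open(f"{source}.json", "w", force_zip64=True) as out:
            for path in input_paths:
                # gzip.open decompresses on the fly, so plain JSON Lines are written
                with gzip.open(path, "rb") as src:
                    for line in src:
                        # make sure a file's last record does not run into the next file
                        out.write(line if line.endswith(b"\n") else line + b"\n")
```

Streaming line by line keeps memory use flat regardless of input size, which matters for the larger sources.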

Use libcovebods for data checks rather than developing more here

This library already has a lot of data checks coded and tested.

While it may not be suitable for use straight away (e.g. some of its current approaches use a lot of memory with large files, which may not be suitable here, and it only supports JSON, not JSON Lines), tackling those issues is in the plan anyway, so being forced to tackle them now is only a good thing!

Add publication guides for each data source

To better explain the details and nuances about the data sources republished via https://bods-data.openownership.org, we should provide a publication guide for each source detailing information about the source register, how the data has been processed and how it is made available via the BODS data analysis tools.

The publication guides should also feature a list of known mapping/transformation issues which users can bear in mind, along with a list of the latest errors/issues OO has experienced when transforming the data. Explore whether these lists should be automatically pulled from the data notebooks which are used to update the BODS data analysis tools.

New output: Neo4J

As part of our partnership with OpenSanctions, their team have created mappers for converting BODS data in line with the FollowTheMoney data model (https://github.com/opensanctions/bods-ftm) and then converting any FTM data source in line with Neo4J (https://github.com/opensanctions/offshore-graph).

This supports the regular conversion of BODS data to FollowTheMoney: https://www.opensanctions.org/datasets/openownership

And also leads to the generation of the graph database which is then loaded into the OpenScreening project:
https://www.opensanctions.org/datasets/graph + https://resources.linkurious.com/openscreening

This issue is for Open Ownership to consider the work required to create our own mapper for converting BODS data in line with Neo4J - and then updating the BODS data analysis tools to offer Neo4J as an export format for each data source.
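To give a sense of the scale of such a mapper, here is a minimal sketch that turns BODS statements into node and relationship rows that could later be written out for a Neo4J import. It assumes BODS v0.2 field names (`statementType`, `statementID`, `subject.describedByEntityStatement`, `interestedParty.describedByEntityStatement` / `describedByPersonStatement`). The `to_graph_rows` function, the `Entity` / `Person` labels and the `HAS_INTEREST_IN` relationship type are hypothetical choices for illustration, not an agreed design and not the approach used in offshore-graph.

```python
def to_graph_rows(statements):
    """Split BODS statements into node rows and relationship rows.

    Labels and the relationship type are illustrative assumptions.
    """
    nodes, edges = [], []
    for s in statements:
        stype = s.get("statementType")
        if stype in ("entityStatement", "personStatement"):
            label = "Entity" if stype == "entityStatement" else "Person"
            nodes.append({"id": s["statementID"], "label": label})
        elif stype == "ownershipOrControlStatement":
            party = s.get("interestedParty") or {}
            start = (party.get("describedByEntityStatement")
                     or party.get("describedByPersonStatement"))
            end = (s.get("subject") or {}).get("describedByEntityStatement")
            # unspecified interested parties have no node to link from, so skip them
            if start and end:
                edges.append({"start": start, "end": end, "type": "HAS_INTEREST_IN"})
    return nodes, edges
```

A real mapper would also need to carry names, identifiers and interest details across, and decide how to represent unspecified interested parties.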

Incorrect BigQuery link

The BigQuery link on the home page and on the source pages of the website points to the wrong project. It needs to be updated to point to the bodsdata project.

New output: BODS JSON by source register

Once further work is carried out on the Open Ownership Register to update the import/export processes to align with BODS v0.2, we will be able to separate out the ingested data by the source register more easily.

This issue is to record that it would be great to make the BODS v0.2 mapped JSON data available for each separate register ingested into the Open Ownership Register as well as making this data available as one consolidated BODS JSON download for the OO Register.

Add config file for links to related resources for sources

At present, the BODS data analysis tools process data from each source and offer links to the different hosted databases produced. Upcoming tasks involve adding further formats such as xlsx (#8) and Parquet (#10).

But going forward, I'd also like to be able to link to additional resources from Open Ownership which either use or power the data available from each source. Ideally these links for each source could simply be added in a config file on Github.

Here is a list of example resources I'd like to be able to link to for each source:

This list is not comprehensive and may be different for each source, so it would probably be easiest to make the link text editable for each resource.
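One way such a config file could look is sketched below: a JSON document keyed by source, where each resource has editable link text and a URL. The file layout, the source keys (`latvia`, `register`), the key names `text` and `url`, the placeholder URL and the `load_resource_links` helper are all hypothetical; nothing here reflects an existing file in the repository.

```python
import json

# Hypothetical config contents: one list of resources per source.
EXAMPLE_CONFIG = """
{
  "latvia": [
    {"text": "Data analysis notebook", "url": "https://example.org/latvia-notebook"}
  ],
  "register": []
}
"""


def load_resource_links(config_text):
    """Parse the config and return {source: [(link_text, url), ...]}."""
    config = json.loads(config_text)
    links = {}
    for source, resources in config.items():
        cleaned = []
        for resource in resources:
            # fail early on incomplete entries rather than rendering a broken link
            if not resource.get("text") or not resource.get("url"):
                raise ValueError(f"{source}: each resource needs 'text' and 'url'")
            cleaned.append((resource["text"], resource["url"]))
        links[source] = cleaned
    return links
```

Keeping the link text in the config alongside the URL means each source can have its own set of resources and wording without code changes.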
