openownership / bodsdata

9.0 9.0 0.0 285 KB

Data analysis tools to help analysts, journalists and anyone wanting to examine and dive into beneficial ownership data published in line with the Beneficial Ownership Data Standard

Home Page: https://bods-data.openownership.org/

License: Other

Python 81.40% JavaScript 6.82% Ruby 0.54% HTML 11.13% SCSS 0.11%

beneficial-ownership open-data open-source

bodsdata's Issues

New output: XLSX in multithreaded mode

https://github.com/kindly/flatterer/releases/tag/v0.13.1

New output: Open Ownership Register data mapped to FollowTheMoney

Friedrich from OpenSanctions has created a process for mapping the Open Ownership Register BODS data from the data analysis tools to the FollowTheMoney data model used by OpenSanctions and OCCRP's Aleph tools.

This issue is to record a request for OO and ODS to investigate whether we can reuse this code and offer an FTM output of BODS data with each new refresh of the BODS data analysis tools: https://github.com/opensanctions/bods-ftm

Update BODS data analysis tools to offer Denmark, Slovakia and the UK BO registers as separate sources

Currently the first version of the BODS data analysis tools offers data from the Open Ownership Register and Latvia as separate sources.

Development work to update the Open Ownership Register is due to be completed by the end of August 2022, after which the Register will publish data in line with BODS v0.2 rather than BODS v0.1. A much wider range of data fields that were not previously published via the Register will also be present, including better source field information.

This development work should give us two options for offering the data from individual registers ingested into the Open Ownership Register as separate sources in their own right: either taking a country-specific slice of the data from an early stage of the AWS data pipeline using files stored on S3, or filtering the OO Register dataset by source.
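The second option, filtering the OO Register dataset by source, could be sketched roughly as follows. This is an illustration only: it assumes the Register export is JSON Lines with one BODS statement per line and that each statement carries the BODS v0.2 `source.description` field. The function name `filter_by_source` and the register description used in the usage note are hypothetical, not values taken from the Register.

```python
import json


def filter_by_source(lines, source_description):
    """Yield the JSON Lines entries whose source.description matches.

    Assumption: each line is one BODS v0.2 statement and the register of
    origin is recorded in statement["source"]["description"].
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        statement = json.loads(line)
        # Statements without a source object are treated as non-matching.
        source = statement.get("source") or {}
        if source.get("description") == source_description:
            yield line
```

With an open file handle `register_file`, `filter_by_source(register_file, "Example Register A")` would stream only the statements from that (hypothetical) register, so the full dataset never has to be held in memory.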

Following more technical conversations to find the best way forward, I would like us to update the BODS data analysis tools to offer the following additional registers as individual sources: the beneficial ownership registers of Denmark, Slovakia and the UK.

These would appear in addition to the existing sources for the Open Ownership Register and Register of Enterprises of the Republic of Latvia.

Ingestion of data from Ukraine was halted in mid-2020 so the Ukraine Consolidated State Registry should not be offered as a separate source until a future time when ingestion of Ukraine's data is restarted.

This work should set a template/lay the groundwork by which future registers ingested into the Open Ownership Register at a later stage can be separated out and offered as sources via the data analysis tools. This would be in addition to their consolidated data being offered as part of the main Open Ownership Register dataset.

Once these sources are separated out in the BODS data analysis tools, the Open Ownership team will be spinning up country versions of our data analysis notebook and will want to be able to flag this work from the country data page in the BODS data analysis tools.

New output: Senzing ready JSON

Following on from the new output request in #2, I would like us to investigate whether we can map the BODS FollowTheMoney JSON output to Senzing's specifications.

Here are the notes from Senzing about mapping to their JSON specifications:
https://senzing.zendesk.com/hc/en-us/articles/231925448-Generic-Entity-Specification-JSON-CSV-Mapping

And here is the code from OpenSanctions again which I hope we'd be able to reuse:
https://github.com/opensanctions/mapper-senzing/blob/main/ftm_processor.py

json_zip stage assumes JSON Lines input files but they are actually gzipped

The output json.zip file contains a {source}.json file, which should contain the concatenated JSON Lines from the input files. However, the stage assumes those input files are plain JSON Lines when in fact they have been gzipped, so the output {source}.json file actually contains the concatenated raw contents of the gzip files.
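One possible shape for a fix, as a minimal sketch rather than the project's actual code: decompress each input with `gzip.open` while streaming it into the zip member. The function name `write_json_zip` and its parameters are hypothetical; only the `{source}.json` member name comes from the description above.

```python
import gzip
import zipfile


def write_json_zip(input_paths, source, zip_path):
    """Concatenate gzipped JSON Lines inputs into {source}.json inside a zip."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # force_zip64 allows the member to grow beyond 2 GiB when streamed.
        with archive.open(f"{source}.json", "w", force_zip64=True) as member:
            for path in input_paths:
                # gzip.open decompresses on the fly, so decoded lines are
                # written instead of the raw gzip bytes.
                with gzip.open(path, "rb") as lines:
                    for line in lines:
                        if not line.strip():
                            continue  # drop blank lines
                        # Normalise line endings so files join cleanly.
                        member.write(line.rstrip(b"\r\n") + b"\n")
```

Streaming line by line keeps memory use flat regardless of input size, and normalising the trailing newline avoids two statements being fused when an input file does not end with one.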

Use libcovebods for data checks rather than developing more here

This library already has a lot of data checks coded and tested.

While it may not be suitable for use straight away (e.g. some of the ways it currently works use a lot of memory with large files, which may not be suitable here, and it only supports JSON, not JSON Lines), it's in the plan to tackle those issues anyway, so being forced to tackle them now is only a good thing!

Feature request: Add download size to data overview

If it were relatively easy to do, then having the file size next to the CSV, SQLite and other download links would be useful.
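A small helper along these lines could format a byte count for display next to each download link. It is a sketch only: the name `human_size` is hypothetical, and how the site would obtain the byte count (for example from the stored file's metadata) is left open.

```python
def human_size(num_bytes):
    """Format a byte count as a short human-readable string (1 KB = 1024 B)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            # Whole bytes need no decimal place; larger units get one.
            return f"{size:.0f} {unit}" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"
```

For a local file, `human_size(os.path.getsize(path))` would give the label text.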

Need for input consistency check pipeline stage?

Might be a good idea to do basic consistency checks (e.g. no duplicate statementIDs) on the input data as the first stage of the pipeline.
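As an illustration of what such a stage might check, the sketch below counts `statementID` values across JSON Lines input and reports any that occur more than once. The function name is hypothetical, and the approach keeps every ID in memory, which may need rethinking for the largest sources.

```python
import json
from collections import Counter


def find_duplicate_statement_ids(lines):
    """Return {statementID: count} for every ID seen more than once."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        statement = json.loads(line)
        # Statements with no statementID are all counted under None.
        counts[statement.get("statementID")] += 1
    return {sid: n for sid, n in counts.items() if n > 1}
```

An empty result means the input passed this particular check; a non-empty one could be logged or used to halt the pipeline before later stages run.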

Add publication guides for each data source

To better explain the details and nuances about the data sources republished via https://bods-data.openownership.org, we should provide a publication guide for each source detailing information about the source register, how the data has been processed and how it is made available via the BODS data analysis tools.

The publication guides should also feature a list of known mapping/transformation issues which users can bear in mind along with a list of the latest errors/issues OO has experienced when transforming the data. Explore whether this should be automatically pulled from the data notebooks which are used to update the BODS data analysis tools.

Update descriptions and field download buttons to make clear that JSON files are in JSON Lines format

502 errors from all Datasette instances linked to BODS data analysis tools

https://bods-data-datasette.openownership.org/register
https://bods-data-datasette.openownership.org/gleif
https://bods-data-datasette.openownership.org/slovakia
https://bods-data-datasette.openownership.org/UK_PSC
https://bods-data-datasette.openownership.org/denmark
https://bods-data-datasette.openownership.org/latvia

Could you please investigate, @radix0000?

New output: Neo4J

As part of our partnership with OpenSanctions, their team have created mappers for converting BODS data in line with the FollowTheMoney data model (https://github.com/opensanctions/bods-ftm) and then converting any FTM data source in line with Neo4J (https://github.com/opensanctions/offshore-graph).

This supports the regular conversion of BODS data to FollowTheMoney: https://www.opensanctions.org/datasets/openownership

And also leads to the generation of the graph database which is then loaded into the OpenScreening project:
https://www.opensanctions.org/datasets/graph + https://resources.linkurious.com/openscreening

This issue is for Open Ownership to consider the work required to create our own mapper for converting BODS data in line with Neo4J - and then updating the BODS data analysis tools to offer Neo4J as an export format for each data source.

Issue warning when no data downloaded

It would be helpful to issue a warning at the download stage whenever the download is empty.
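A minimal sketch of such a check, assuming the download stage writes its files into a directory per source. The function name `warn_if_empty` and both parameters are hypothetical.

```python
import warnings
from pathlib import Path


def warn_if_empty(download_dir, source):
    """Warn and return True if the download directory holds no data."""
    files = [p for p in Path(download_dir).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    # No files at all, or only zero-byte files, both count as empty.
    if total_bytes == 0:
        warnings.warn(
            f"No data downloaded for source {source!r}: {download_dir} is empty",
            RuntimeWarning,
        )
        return True
    return False
```

Returning a boolean as well as warning lets the caller decide whether to stop the pipeline or carry on.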

Incorrect BigQuery link

The BigQuery link on the home page and on the source pages of the website points to the wrong project. It needs to be updated to link to the bodsdata project.

Write wrapper function to run entire bodsdata pipeline for a single source

We could run the entire pipeline of downloading, flattening, creating outputs and updating metadata and the website in a single function. This would make it much easier to quickly add new sources.
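One way such a wrapper could be structured is sketched below. The stage functions themselves are not shown because their real names and signatures in bodsdata are not given here; the sketch assumes each stage is a callable taking the source name, and `run_pipeline` is a hypothetical name.

```python
from typing import Callable, Iterable, List, Tuple

# A stage is a (label, callable) pair; the callable takes the source name.
Stage = Tuple[str, Callable[[str], None]]


def run_pipeline(source: str, stages: Iterable[Stage]) -> List[str]:
    """Run each stage in order for one source and return the labels completed."""
    completed = []
    for name, stage in stages:
        try:
            stage(source)
        except Exception as exc:
            # Say which stage failed so a partial run is easy to resume.
            raise RuntimeError(
                f"Stage {name!r} failed for source {source!r}"
            ) from exc
        completed.append(name)
    return completed
```

A caller would pass the stages in pipeline order, for example download, flatten, outputs, metadata and website, each wrapping whatever function currently performs that step.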

Learn more about Microsoft Fabric

https://www.microsoft.com/en-gb/microsoft-fabric

New output: Parquet

See Flatterer release v0.13.0

New output: convert BODS JSON to RDF using GraphDB

We received an offer from Cos at Blue Anvil for us to extend the BODS data analysis tools by reusing their code (https://github.com/blueanvil/bods-rdf) in order to convert BODS data from the Register or any other source and ingest it into an RDF repository.

@StephenAbbott to reach out to https://github.com/cosmin-marginean about this in May/June after Blue Anvil launches their Truintel tools https://www.blueanvil.com/truintel/

New output: BODS JSON by source register

Once further work is carried out on the Open Ownership Register to update the import/export processes to align with BODS v0.2, we will be able to separate out the ingested data by the source register more easily.

This issue is to record that it would be great to make the BODS v0.2 mapped JSON data available for each separate register ingested into the Open Ownership Register as well as making this data available as one consolidated BODS JSON download for the OO Register.

Add config file for links to related resources for sources

At present, the BODS data analysis tools process data from each source and offer links to the different hosted databases produced. Upcoming tasks involve adding additional formats such as XLSX (#8) and Parquet (#10).

But going forward, I'd also like to be able to link to additional resources from Open Ownership which either use or power the data available from each source. Ideally these links for each source could simply be added in a config file on Github.

Here is a list of example resources I'd like to be able to link to for each source:

Github repository for ingesting source data - e.g. https://github.com/openownership/register-ingester-psc
Data analysis notebook - e.g. https://deepnote.com/@open-ownership/latviademo-4b182b6b-8b8d-4255-97b4-37434c6929c6
Data source page from the Open Ownership Register - e.g. https://register.openownership.org/data_sources/uk-psc-register
Open Ownership research - e.g. https://www.openownership.org/en/news/our-quick-assessment-of-nigerias-first-public-register-a-strong-start-but-more-to-be-done/
Original technical documentation - e.g. https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/persons-with-significant-control

This list is non-comprehensive and may be different for each source so it would probably be easiest to make the link text editable for each resource.
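To make the idea concrete, here is a sketch of what such a config and a loader for it might look like. Everything about the format is an assumption: JSON as the file type, the source name as the top-level key, and the `text` and `url` keys are all hypothetical. Only the two example URLs come from the list above.

```python
import json

# Hypothetical config: source name -> list of links with editable text.
EXAMPLE_CONFIG = """
{
  "UK_PSC": [
    {"text": "Github repository for ingesting source data",
     "url": "https://github.com/openownership/register-ingester-psc"},
    {"text": "Data source page from the Open Ownership Register",
     "url": "https://register.openownership.org/data_sources/uk-psc-register"}
  ]
}
"""


def links_for_source(config_text, source):
    """Return (text, url) pairs for a source; empty list if none configured."""
    config = json.loads(config_text)
    links = []
    for entry in config.get(source, []):
        # Every link needs both a label and a target to be rendered.
        if not entry.get("text") or not entry.get("url"):
            raise ValueError(f"Incomplete link entry for {source!r}: {entry!r}")
        links.append((entry["text"], entry["url"]))
    return links
```

Because the link text lives in the config rather than in the templates, each source can carry a different set of resources without any code change.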
