
openownership / cove-bods


Check that your data complies with the Beneficial Ownership Data Standard (BODS) using our open source data review tool

Home Page: https://datareview.openownership.org/

License: Other

Python 8.46% Shell 0.03% HTML 17.56% Sass 9.07% SCSS 64.63% Dockerfile 0.20% Procfile 0.06%
beneficial-ownership beneficial-ownership-data

cove-bods's Introduction

openownership-cove-bods-alpha

Dev installation

git clone https://github.com/openownership/cove-bods.git openownership-cove-bods
cd openownership-cove-bods
virtualenv .ve --python=/usr/bin/python3
source .ve/bin/activate
pip install -r requirements_dev.txt
python manage.py migrate
python manage.py compilemessages
python manage.py runserver

You may need to pass 0.0.0.0:8000 to runserver in the last step, depending on your development environment.

Note: requires gettext to be installed. This should come by default with Ubuntu, but just in case:

apt-get update && apt-get install gettext

Dev with Docker

Docker is used in production, so sometimes you may want to run locally with Docker to debug issues:

docker compose -f docker-compose.dev.yml down # (if running)
docker compose -f docker-compose.dev.yml build --no-cache
docker compose -f docker-compose.dev.yml up # (to restart)

To run commands, make sure the environment is running (see the up command above), then:

docker compose -f docker-compose.dev.yml run bods-cove-app-dev python manage.py migrate

Translations

We use Django's translation framework to provide this application in different languages. We have used Google Translate to perform initial translations from English, but expect those translations to be worked on by humans over time.

Translations for Translators

Translators can provide translations for this application by becoming a collaborator on Transifex: https://www.transifex.com/OpenDataServices/cove

Translations for Developers

For more information about Django's translation framework, see https://docs.djangoproject.com/en/1.8/topics/i18n/translation/

If you add new text to the interface, be sure to wrap it in the relevant gettext blocks/functions.
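For illustration, here is a minimal sketch of what wrapping looks like. It uses the stdlib gettext module so the example is self-contained; in this app you would import the wrapper from django.utils.translation instead, and the function and message shown are invented for the example:

```python
import gettext

# With no catalog installed, gettext falls back to returning the string as-is.
_ = gettext.gettext

def file_too_large_message(limit_mb):
    # Wrapping the literal makes it extractable by makemessages.
    return _("File exceeds the %(limit)s MB limit") % {"limit": limit_mb}
```

makemessages scans for these `_()` calls when generating the message files pushed to Transifex.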

In order to generate messages and post them on Transifex:

First check the Transifex lock <https://opendataservices.plan.io/projects/co-op/wiki/CoVE_Transifex_lock>, because only one branch can be translated on Transifex at a time.

Then:

python manage.py makemessages -l en
tx push -s

In order to fetch messages from Transifex:

tx pull -a

In order to compile them:

python manage.py compilemessages

Keep the makemessages and pull-messages steps in their own commits, separate from the text changes.

To check that all new text is translatable, you can install and run django-template-i18n-lint:

pip install django-template-i18n-lint
django-template-i18n-lint cove

Adding and updating requirements

Add a new requirement to requirements.in or requirements_dev.in, depending on whether it is a development-only requirement or not.

Then run pip-compile requirements.in && pip-compile requirements_dev.in. This will populate requirements.txt and requirements_dev.txt with pinned versions of the new requirement and its dependencies.

Running pip-compile --upgrade requirements.in && pip-compile --upgrade requirements_dev.in will update all pinned requirements to their latest versions. Generally we don't want to do this at the same time as adding a new dependency, so that any problems are easier to test.

cove-bods's People

Contributors

bjwebb, dependabot[bot], kd-ods, odscjames, radix0000, rhiaro, stephenabbott


cove-bods's Issues

Enable ingesting of JSON lines files

As part of the work to update the data review tool to handle large datasets #69, the tool needs to be able to ingest JSON lines files.

If it's not simple to detect whether a .json file is in lines format, it would be fine to have users select a 'File is in JSON lines format' checkbox.
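One possible detection heuristic (a sketch, not the tool's actual behaviour) is to try parsing the first few non-empty lines as standalone JSON values; the fallback checkbox would still be needed for ambiguous cases such as a whole document on a single line:

```python
import json

def looks_like_json_lines(path, max_lines=5):
    """Heuristic: a file is likely JSON Lines if its first few
    non-empty lines each parse as standalone JSON values."""
    with open(path, encoding="utf-8") as fp:
        checked = 0
        for line in fp:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # e.g. the first line of a pretty-printed file is just "{"
                return False
            checked += 1
            if checked >= max_lines:
                break
    return checked > 0
```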

Add Loading screen and background worker to app

Split out from #69

Motivation: Users uploading large files don't see the app time out and crash.

In #69 (comment), the idea of the app "timing out" after trying to process for a certain time was raised.

A timeout approach would be difficult, as processing a web request is currently single-threaded. To "time out" nicely we'd have to switch to a multi-threaded approach and make sure the main thread could stop the worker thread safely. We'd also have to impose a limit short enough that almost nobody would get impatient and hit refresh, maybe 10 or 20 seconds. That limits the file size that can be uploaded and processed, and the practical maximum would depend on how well the server was performing at the time, which would vary and thus be hard to calculate.

What I'm about to suggest may seem like more work than the timeout suggestion, but I think a timeout is actually harder to do well than it looks at first glance, and it gets us less in terms of functionality and future options.

Instead I'd suggest a process where we add a message queue and a background worker.

Any small files uploaded would be processed at once, using the same mechanism as before.

Any large files would be saved to disk and DB, a message sent through the queue to a worker and the user shown a loading page.

There are several advantages to this:

  • files of larger sizes that would take several minutes to process can be processed by the tool
  • the threading model for processing a web request and the background worker would stay single threaded, which is easier
  • as large files are processed by a worker and the number of workers that can run at once would be fixed, it's harder to DOS the server by submitting many large files at the same time
  • currently, every time someone looks at a data page the results are recalculated; this change would involve saving the results of calculations so that subsequent accesses are faster.
  • lays groundwork for more options later - for instance, when given a URL the background worker could download the file. This would provide faster "loading" feedback to the user and allow larger files to be downloaded. Or while a large file is being processed, the user may be able to see initial results from what has been processed so far.
  • the background worker pattern is a standard and well used one for web apps.

To do this we would:

Add the Celery library to process messages through the queue.

Add a RabbitMQ server.

Add a background worker that processes files and saves results to disk. This is basically a wrapper around libcovebods, so it is not too complex.

When new data is sent to "/", process it as before (save to disk and DB, then redirect the user to the "/data/XXX" page), but if the file is above a certain size, also send a message to the worker.

When user looks at "/data/XXX" page, change process to:

  • look on disk for cached results
  • if there, show results
  • if file is above a certain size, we are waiting for background worker to save results - show loading page
  • if file is below a certain size, calculate results and save them to disk then show results
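The decision flow above could be sketched as follows (function names and the threshold are illustrative, not the app's actual code):

```python
LARGE_FILE_THRESHOLD = 100 * 1024 * 1024  # illustrative limit; exact value TBC

def data_page_state(cached_results_exist, file_size_bytes):
    """Decide what the "/data/XXX" page should show."""
    if cached_results_exist:
        return "show_results"
    if file_size_bytes > LARGE_FILE_THRESHOLD:
        # Large file: the background worker is still processing it.
        return "show_loading_page"
    # Small file: calculate inline, cache the results, then show them.
    return "calculate_and_show"
```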

An issue here is that currently the home page of the tool comes from libcoveweb, which is shared with other tools. We would have to have a way to extend the library from the cove app itself.

Improve test processing and reporting for very large files

As part of the work to update the data review tool to handle large datasets #69, we'll need to develop the test processing and reporting capabilities of the tool.

Hundreds or thousands of validation errors could be reported for a large file, so we will need to specify what useful behaviour is in such instances.
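One candidate behaviour (a sketch under assumed requirements, not a decided design) is to group errors by message and cap the number of examples shown per type, while still reporting total counts:

```python
from collections import defaultdict

def summarise_errors(errors, per_type_cap=10):
    """Group validation errors by message, keeping at most
    per_type_cap examples of each, plus a total count per type."""
    examples = defaultdict(list)
    counts = defaultdict(int)
    for err in errors:
        counts[err["message"]] += 1
        if len(examples[err["message"]]) < per_type_cap:
            examples[err["message"]].append(err)
    return [
        {"message": msg, "count": counts[msg], "examples": examples[msg]}
        for msg in counts
    ]
```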

Update Django to the latest LTS

Currently we use Django 1.11 LTS, which reached EOL in April. We should upgrade to 2.2 LTS. This change needs to be made in this repo and in lib-cove-bods.

Our other cove instances have been updated, so the latest lib-cove-web should work with 2.2. The (360/IATI)cove pull request might be useful to look at for what changes may need to be made: https://github.com/OpenDataServices/cove/pull/1281/files (rename MIDDLEWARE and use latest dealer).

Show filename in UI when presenting results

It would be useful to show the name of the file that has been tested. I know that you can "see" it in the "JSON (Original)" link that is created but I don't find this very intuitive and, when working with a lot of different files, a quick visual reference would be very helpful.

Add information to home page about file size limits

From #69 (comment)

Update documentation on datareview.openownership.org to recommend that the DRT currently only be used for .json files smaller than 100MB. Prompt users to email [email protected] if they want help analysing larger files. Exact language TBC.

Assuming this is new content in the "Using the data review tool" section, this shouldn't take long to sort out - as long as reviewing and approving the new content takes, essentially.

Would we need to provide different guidance for maximum recommended file sizes across the four accepted formats?

In theory yes, but we could start with a low limit that would apply to all.

Display schema description in validation error message

This is something we already do for OCDS CoVE, so we could borrow that code.

For example, in the linked file above, the 'ocid' error message:

"Open Contracting ID
A globally unique identifier for this Open Contracting Process. Composed of an ocid prefix and an identifier for the contracting process. For more information see the Open Contracting Identifier guidance"

i.e. the field description from the schema is pulled in and displayed to help users rectify errors.
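A minimal sketch of pulling a field description out of a JSON Schema (path walking is deliberately simplified here; the real schema also involves arrays, $refs, etc.):

```python
def field_description(schema, field_path):
    """Walk 'properties' down the schema and return the failing
    field's title and description, where present."""
    node = schema
    for part in field_path:
        node = node.get("properties", {}).get(part, {})
    title = node.get("title", "")
    description = node.get("description", "")
    return "\n".join(s for s in (title, description) if s)
```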

UI does not describe how long files are stored for

From the T&Cs:

Information about how long files will be stored is made available to users directly through the interface, but will be for no longer than one year from the date of submission.

I don't think that this is the case.

Make source_url parameters work

From #101

(This direct link for validation from the workbook also does not work:

https://datareview.openownership.org/?source_url=https://docs.google.com/spreadsheets/d/1XT5UvwaUcFS65UH5kyj7hDOAKYG-U30eSBoQxbR7A1A/export?format=xlsx

...but I assume that that is a related issue.)

Excel file from link not being processed by Data Review Tool

It used to be possible to paste the URL in cell B9 of this workbook into the Data Review Tool and have it validate the Excel version of the workbook.

It no longer works.

Note that downloading the Excel version of the workbook and uploading it to the tool does work.

(This direct link for validation from the workbook also does not work:

https://datareview.openownership.org/?source_url=https://docs.google.com/spreadsheets/d/1XT5UvwaUcFS65UH5kyj7hDOAKYG-U30eSBoQxbR7A1A/export?format=xlsx

...but I assume that that is a related issue.)

The issue was found by @kathryn-ods.

Warn about unrecognised schema version / always print version number used

I've been using the BODS data review tool to develop a sample mapping for Indonesian data. I inadvertently set up my code to output a bodsVersion of 0.2.0, rather than 0.2, which seems to have led the tool to default to validating it as 0.1. It took me a while to work out this was the reason for some other validation errors.

It would be nice if
a) an unrecognised version number was a warning (or maybe even an error?)
b) the tool told you what version number it was using/defaulting to, even if you didn't specify one

When someone uploads a spreadsheet what version of the schema do we assume it is?

In OCDS, it looks like we just assume it's 1.1

The problem is this bit in view.py:

schema_bods = SchemaBODS(lib_cove_bods_config=lib_cove_bods_config)
context.update(convert_spreadsheet(upload_dir, upload_url, file_name, file_type, lib_cove_bods_config,
                                   schema_url=schema_bods.release_pkg_schema_url))
with open(context['converted_path'], encoding='utf-8') as fp:
    json_data = json.load(fp, parse_float=Decimal)

We need to pass the json_data to SchemaBODS so it can select the right version, but we don't have that yet ... and we need to pass the schema to convert_spreadsheet, but we don't know which version of the schema to use until we open the spreadsheet!
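One way out of the circularity is a cheap first pass that only reads the version field before full conversion and validation. A sketch of that pass, covering the unrecognised-version warning as well (the location of bodsVersion shown here is an assumption for illustration, as are the function and constant names):

```python
KNOWN_VERSIONS = ("0.1", "0.2")
DEFAULT_VERSION = "0.1"

def detect_bods_version(statements):
    """Return (version, warning). An unrecognised value like '0.2.0'
    produces a warning instead of silently falling back to the default."""
    for statement in statements:
        version = statement.get("publicationDetails", {}).get("bodsVersion")
        if version is None:
            continue
        if version in KNOWN_VERSIONS:
            return version, None
        return DEFAULT_VERSION, "unrecognised bodsVersion: %s" % version
    # No version declared anywhere: assume the default.
    return DEFAULT_VERSION, None
```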

Add Sample Mode

Split out from #69

Motivation: Users who have large files won't be able to process them using the DRT. Instead they can follow some simple instructions to create a sample file locally of a smaller size and upload that instead. This won't be able to check everything, but it will be able to check a lot of things and hopefully that will still help users.

On the home page of the tool, every upload method will have a "sample data" or "full data" switch in the UI. This could be radio buttons or a select box. ("full data" by default)

In sample mode, certain tests are not run: those that need the full file to check properly, and thus can't be checked against a sample file. E.g.:

  • Person or Entity statement not used by an ownership/control statement
  • ownership/control statement refers to a person or entity that does not exist
  • ordering checks
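The skip list could be expressed as a simple filter in libcovebods (the check names here are invented for illustration):

```python
# Checks that need the complete file, so they are meaningless on a sample.
FULL_FILE_ONLY_CHECKS = {
    "unused_person_or_entity_statement",
    "missing_statement_reference",
    "statement_ordering",
}

def checks_to_run(all_checks, sample_mode=False):
    """Sample mode (off by default) drops whole-file checks."""
    if not sample_mode:
        return list(all_checks)
    return [c for c in all_checks if c not in FULL_FILE_ONLY_CHECKS]
```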

To do this work:

Sample mode would be added as a flag to libcovebods (off by default).

Then the tool would be updated with a switch in the UI that is passed to libcovebods.

An issue here is that currently the home page of the tool comes from libcoveweb, which is shared with other tools. It might be possible to add the functionality to the shared library, but behind a switch so that other tools don't get it. Or it might be possible to extend the library from the cove app itself.
