
openownership / cove-bods


Check that your data complies with the Beneficial Ownership Data Standard (BODS) using our open source data review tool

Home Page: https://datareview.openownership.org/

License: Other

Python 8.46% Shell 0.03% HTML 17.56% Sass 9.07% SCSS 64.63% Dockerfile 0.20% Procfile 0.06%
beneficial-ownership beneficial-ownership-data

cove-bods's Introduction

openownership-cove-bods-alpha

Dev installation

git clone https://github.com/openownership/cove-bods.git openownership-cove-bods
cd openownership-cove-bods
virtualenv .ve --python=/usr/bin/python3
source .ve/bin/activate
pip install -r requirements_dev.txt
python manage.py migrate
python manage.py compilemessages
python manage.py runserver

You may need to pass 0.0.0.0:8000 to runserver in the last step, depending on your development environment.

Note: requires gettext to be installed. This should come by default with Ubuntu, but just in case:

apt-get update && apt-get install gettext

Dev with Docker

Docker is used in production, so sometimes you may want to run locally with Docker to debug issues:

docker compose -f docker-compose.dev.yml down # (if running)
docker compose -f docker-compose.dev.yml build --no-cache
docker compose -f docker-compose.dev.yml up # (to restart)

To run commands, make sure the environment is running (see the up command above), then:

docker compose -f docker-compose.dev.yml run bods-cove-app-dev python manage.py migrate

Translations

We use Django's translation framework to provide this application in different languages. We have used Google Translate to perform initial translations from English, but expect those translations to be worked on by humans over time.

Translations for Translators

Translators can provide translations for this application by becoming a collaborator on Transifex: https://www.transifex.com/OpenDataServices/cove

Translations for Developers

For more information about Django's translation framework, see https://docs.djangoproject.com/en/1.8/topics/i18n/translation/

If you add new text to the interface, be sure to wrap it in the relevant gettext blocks/functions.
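For illustration, here is a minimal sketch of what wrapping looks like. It uses the stdlib gettext module so the example is self-contained; in this app you would import the wrapper from django.utils.translation instead, and the function and message shown are invented for the example:

```python
import gettext

# With no catalog installed, gettext falls back to returning the string as-is.
_ = gettext.gettext

def file_too_large_message(limit_mb):
    # Wrapping the literal makes it extractable by makemessages.
    return _("File exceeds the %(limit)s MB limit") % {"limit": limit_mb}
```

makemessages scans for these `_()` calls when generating the message files pushed to Transifex.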

In order to generate messages and post them on Transifex:

First check the Transifex lock <https://opendataservices.plan.io/projects/co-op/wiki/CoVE_Transifex_lock>, because only one branch can be translated on Transifex at a time.

Then:

python manage.py makemessages -l en
tx push -s

In order to fetch messages from Transifex:

tx pull -a

In order to compile them:

python manage.py compilemessages

Keep the makemessages and pull-messages steps in their own commits, separate from the text changes.

To check that all new text is translatable, you can install and run django-template-i18n-lint:

pip install django-template-i18n-lint
django-template-i18n-lint cove

Adding and updating requirements

Add a new requirement to requirements.in or requirements_dev.in, depending on whether it is a development-only requirement or not.

Then run pip-compile requirements.in && pip-compile requirements_dev.in. This will populate requirements.txt and requirements_dev.txt with pinned versions of the new requirement and its dependencies.

Running pip-compile --upgrade requirements.in && pip-compile --upgrade requirements_dev.in will update all pinned requirements to their latest versions. Generally we don't want to do this at the same time as adding a new dependency, so that any problems are easier to test.

cove-bods's People

Contributors

bjwebb, dependabot[bot], kd-ods, odscjames, radix0000, rhiaro, stephenabbott


cove-bods's Issues

Enable ingesting of JSON lines files

As part of the work to update the data review tool to handle large datasets #69, the tool needs to be able to ingest JSON lines files.

If it's not simple to detect whether a .json file is in lines format, it would be fine to have users select a 'File is in JSON lines format' checkbox.
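One possible detection heuristic (a sketch, not the tool's actual behaviour) is to try parsing the first few non-empty lines as standalone JSON values; the fallback checkbox would still be needed for ambiguous cases such as a whole document on a single line:

```python
import json

def looks_like_json_lines(path, max_lines=5):
    """Heuristic: a file is likely JSON Lines if its first few
    non-empty lines each parse as standalone JSON values."""
    with open(path, encoding="utf-8") as fp:
        checked = 0
        for line in fp:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # e.g. the first line of a pretty-printed file is just "{"
                return False
            checked += 1
            if checked >= max_lines:
                break
    return checked > 0
```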

Add Loading screen and background worker to app

Split out from #69

Motivation: Users uploading large files don't see the app time out and crash.

In #69 (comment), the idea of the app "timing out" after trying to process for a certain time was raised.

A timeout approach would be difficult, as processing a web request is currently single-threaded. To "time out" nicely we'd have to switch to a multi-threaded approach and make sure the main thread could stop the worker thread safely. We'd also have to impose a limit short enough that almost nobody would get impatient and hit refresh, maybe 10 or 20 seconds. That limits the file size that can be uploaded and processed, and the practical maximum would depend on how well the server was performing at the time, which would vary and thus be hard to calculate.

What I'm about to suggest may seem like more work than the timeout suggestion, but I think a timeout is actually harder to do well than it looks at first glance, and it gets us less in terms of functionality and future options.

Instead I'd suggest a process where we add a message queue and a background worker.

Any small files uploaded would be processed at once, using the same mechanism as before.

Any large files would be saved to disk and DB, a message sent through the queue to a worker and the user shown a loading page.

There are several advantages to this:

  • files of larger sizes that would take several minutes to process can be processed by the tool
  • the threading model for processing a web request and the background worker would stay single threaded, which is easier
  • as large files are processed by a worker and the number of workers that can run at once would be fixed, it's harder to DOS the server by submitting many large files at the same time
  • currently, every time someone looks at a data page the results are recalculated; this change would involve saving the results of calculations so that subsequent accesses are faster.
  • lays groundwork for more options later - for instance, when given a URL the background worker could download the file. This would provide faster "loading" feedback to the user and allow larger files to be downloaded. Or while a large file is being processed, the user may be able to see initial results from what has been processed so far.
  • the background worker pattern is a standard and well used one for web apps.

To do this we would:

Add the Celery library to process messages through the queue.

Add a RabbitMQ server.

Add a background worker that processes files and saves results to disk. This is basically a wrapper around libcovebods, so it is not too complex.

When new data is sent to "/", process it as before (save to disk and DB, then redirect the user to the "/data/XXX" page), but if the file is above a certain size, also send a message to the worker.

When user looks at "/data/XXX" page, change process to:

  • look on disk for cached results
  • if there, show results
  • if file is above a certain size, we are waiting for background worker to save results - show loading page
  • if file is below a certain size, calculate results and save them to disk then show results
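The decision flow above could be sketched as follows (function names and the threshold are illustrative, not the app's actual code):

```python
LARGE_FILE_THRESHOLD = 100 * 1024 * 1024  # illustrative limit; exact value TBC

def data_page_state(cached_results_exist, file_size_bytes):
    """Decide what the "/data/XXX" page should show."""
    if cached_results_exist:
        return "show_results"
    if file_size_bytes > LARGE_FILE_THRESHOLD:
        # Large file: the background worker is still processing it.
        return "show_loading_page"
    # Small file: calculate inline, cache the results, then show them.
    return "calculate_and_show"
```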

An issue here is that currently the home page of the tool comes from libcoveweb, which is shared with other tools. We would have to have a way to extend the library from the cove app itself.

Improve test processing and reporting for very large files

As part of the work to update the data review tool to handle large datasets #69, we'll need to develop the test processing and reporting capabilities of the tool.

Hundreds or thousands of validation errors could be reported for a large file, so we will need to specify what useful behaviour is in such instances.
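One candidate behaviour (a sketch under assumed requirements, not a decided design) is to group errors by message and cap the number of examples shown per type, while still reporting total counts:

```python
from collections import defaultdict

def summarise_errors(errors, per_type_cap=10):
    """Group validation errors by message, keeping at most
    per_type_cap examples of each, plus a total count per type."""
    examples = defaultdict(list)
    counts = defaultdict(int)
    for err in errors:
        counts[err["message"]] += 1
        if len(examples[err["message"]]) < per_type_cap:
            examples[err["message"]].append(err)
    return [
        {"message": msg, "count": counts[msg], "examples": examples[msg]}
        for msg in counts
    ]
```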

Update Django to the latest LTS

Currently we use Django 1.11 LTS, which reached EOL in April. We should upgrade to 2.2 LTS. This change needs to be made in this repo and in lib-cove-bods.

Our other cove instances have been updated, so the latest lib-cove-web should work with 2.2. The (360/IATI)cove pull request might be useful to look at for what changes may need to be made: https://github.com/OpenDataServices/cove/pull/1281/files (rename MIDDLEWARE and use latest dealer).

Show filename in UI when presenting results

It would be useful to show the name of the file that has been tested. I know that you can "see" it in the "JSON (Original)" link that is created but I don't find this very intuitive and, when working with a lot of different files, a quick visual reference would be very helpful.

Add information to home page about file size limits

From #69 (comment)

Update documentation on datareview.openownership.org to recommend that the DRT currently only be used for .json files smaller than 100MB. Prompt users to email [email protected] if they want help analysing larger files. Exact language TBC.

Assuming this is new content in the "Using the data review tool" section, this shouldn't take long to sort out - as long as reviewing and approving the new content takes, essentially.

Would we need to provide different guidance for maximum recommended file sizes across the four accepted formats?

In theory yes, but we could start with a low limit that would apply to all.

Display schema description in validation error message

This is something we already do for OCDS CoVE, so we could borrow that code.

For example, in the linked file above, the 'ocid' error message:

"Open Contracting ID
A globally unique identifier for this Open Contracting Process. Composed of an ocid prefix and an identifier for the contracting process. For more information see the Open Contracting Identifier guidance"

i.e. the field description from the schema is pulled in and displayed to help users rectify errors.
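A minimal sketch of pulling a field description out of a JSON Schema (path walking is deliberately simplified here; the real schema also involves arrays, $refs, etc.):

```python
def field_description(schema, field_path):
    """Walk 'properties' down the schema and return the failing
    field's title and description, where present."""
    node = schema
    for part in field_path:
        node = node.get("properties", {}).get(part, {})
    title = node.get("title", "")
    description = node.get("description", "")
    return "\n".join(s for s in (title, description) if s)
```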

UI does not describe how long files are stored for

From the T&Cs:

Information about how long files will be stored is made available to users directly through the interface, but will be for no longer than one year from the date of submission.

I don't think that this is the case.

Make source_url parameters work

From #101

(This direct link for validation from the workbook also does not work:

https://datareview.openownership.org/?source_url=https://docs.google.com/spreadsheets/d/1XT5UvwaUcFS65UH5kyj7hDOAKYG-U30eSBoQxbR7A1A/export?format=xlsx

...but I assume that that is a related issue.)

Excel file from link not being processed by Data Review Tool

It used to be possible to paste the URL in cell B9 of this workbook into the Data Review Tool and have it validate the Excel version of the workbook.

It no longer works.

Note that downloading the Excel version of the workbook and uploading it to the tool does work.

(This direct link for validation from the workbook also does not work:

https://datareview.openownership.org/?source_url=https://docs.google.com/spreadsheets/d/1XT5UvwaUcFS65UH5kyj7hDOAKYG-U30eSBoQxbR7A1A/export?format=xlsx

...but I assume that that is a related issue.)

The issue was found by @kathryn-ods.

Warn about unrecognised schema version / always print version number used

I've been using the BODS data review tool to develop a sample mapping for Indonesian data. I inadvertently set up my code to output a bodsVersion of 0.2.0, rather than 0.2, which seems to have led the tool to default to validating it as 0.1. It took me a while to work out this was the reason for some other validation errors.

It would be nice if
a) an unrecognised version number was a warning (or maybe even an error?)
b) the tool told you what version number it was using/defaulting to, even if you didn't specify one

When someone uploads a spreadsheet what version of the schema do we assume it is?

In OCDS, it looks like we just assume it's 1.1

The problem is this bit in view.py:

schema_bods = SchemaBODS(lib_cove_bods_config=lib_cove_bods_config)
context.update(convert_spreadsheet(upload_dir, upload_url, file_name, file_type, lib_cove_bods_config,
                                   schema_url=schema_bods.release_pkg_schema_url))
with open(context['converted_path'], encoding='utf-8') as fp:
    json_data = json.load(fp, parse_float=Decimal)

We need to pass the json_data to SchemaBODS so it can select the right version, but we don't have that yet ... and we need to pass the schema to convert_spreadsheet, but we don't know which version of the schema to use until we open the spreadsheet!
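One way out of the circularity is a cheap first pass that only reads the version field before full conversion and validation. A sketch of that pass, covering the unrecognised-version warning as well (the location of bodsVersion shown here is an assumption for illustration, as are the function and constant names):

```python
KNOWN_VERSIONS = ("0.1", "0.2")
DEFAULT_VERSION = "0.1"

def detect_bods_version(statements):
    """Return (version, warning). An unrecognised value like '0.2.0'
    produces a warning instead of silently falling back to the default."""
    for statement in statements:
        version = statement.get("publicationDetails", {}).get("bodsVersion")
        if version is None:
            continue
        if version in KNOWN_VERSIONS:
            return version, None
        return DEFAULT_VERSION, "unrecognised bodsVersion: %s" % version
    # No version declared anywhere: assume the default.
    return DEFAULT_VERSION, None
```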

Add Sample Mode

Split out from #69

Motivation: Users who have large files won't be able to process them using the DRT. Instead they can follow some simple instructions to create a sample file locally of a smaller size and upload that instead. This won't be able to check everything, but it will be able to check a lot of things and hopefully that will still help users.

On the home page of the tool, every upload method will have a "sample data" or "full data" switch in the UI. This could be radio buttons or a select box. ("full data" by default)

In sample mode, certain tests are not run: those that need the full file to check properly, and thus can't be checked against a sample file. E.g.:

  • Person or Entity statement not used by an ownership/control statement
  • ownership/control statement refers to a person or entity that does not exist
  • ordering checks
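The skip list could be expressed as a simple filter in libcovebods (the check names here are invented for illustration):

```python
# Checks that need the complete file, so they are meaningless on a sample.
FULL_FILE_ONLY_CHECKS = {
    "unused_person_or_entity_statement",
    "missing_statement_reference",
    "statement_ordering",
}

def checks_to_run(all_checks, sample_mode=False):
    """Sample mode (off by default) drops whole-file checks."""
    if not sample_mode:
        return list(all_checks)
    return [c for c in all_checks if c not in FULL_FILE_ONLY_CHECKS]
```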

To do this work:

Sample mode would be added as a flag to libcovebods (off by default).

Then the tool would be updated with a switch in the UI that is passed to libcovebods.

An issue here is that currently the home page of the tool comes from libcoveweb, which is shared with other tools. It might be possible to add the functionality to the shared library, but behind a switch so that other tools don't get it. Or it might be possible to extend the library from the cove app itself.
