Coder Social home page Coder Social logo

opendataservices / flatten-tool Goto Github PK

View Code? Open in Web Editor NEW
101.0 101.0 18.0 1.83 MB

Tools for generating CSV and other flat versions of the structured data

Home Page: http://flatten-tool.readthedocs.io/en/latest/

License: MIT License

Python 100.00%

flatten-tool's People

Contributors

ajparsons avatar andylolz avatar bibianac avatar bjwebb avatar caprenter avatar dplusic avatar duncandewhurst avatar edugomez avatar jpmckinney avatar kd-ods avatar kindly avatar michaelwood avatar odscjames avatar practicalparticipation avatar rhiaro avatar robredpath avatar stevieflow avatar threeaims avatar timgdavies avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

flatten-tool's Issues

Conversions currently happen by loading all files into memory

I decided to do this as it's easier, the size of data people normally have/want in spreadsheets fits into memory, and I didn't want to be guilty of premature optimisation. It might bite us down the line though if get some particularly large files.

Remove OCDS specific aspects of the code

  • OCID ID
  • Top schema dict is assumed to be object (dictionary), but some JSON files will have a list as the root (split out to #47 as it doesn't affect 360 Giving)
  • Outputted JSON has base['releases'] populated, and this is hardcoded.
  • OCDS specific defaults #34
  • Default for root-id, see below

Roll-up / Summary Table option

In the 360 Giving Data Standard we have had the concept of 'roll-up' properties. These are properties which would normally end up in a sub-table, but which, for the purpose of simple use-cases where there are only one-to-one relationships, could be included in the main table.

For example, this schema generates an Excel file with only a few properties in the main tab.

However, the 360 Giving Template is based on the idea of it being possible to express many of the properties from the sub-table in main.

Often not all the properties from a referenced object/sub-table will be wanted in the summary (e.g. in summary mode we might only want the Organisation Name and Postcode to come into the main table, but not all the detailed address information or organisation locations and classifications).

To achieve this, we should:

  • Create a Summary Mode switch for use when flattening data or generating templates
  • Specify in the schema for properties of referenced objects (i.e. properties normally in sub-tables) whether they should be included in the main table when in summary mode.

This would lead to the behaviour that:

  • When generating a template in summary mode, many properties will be possible to record directly in the master sheet.
  • When flattening JSON in summary mode, information would be duplicated between the summary table, and sub-tables, for any property which should be included in summary mode.
    • (We need to think about how this would round-trip. I.e. would master or sub-tables take precedence if conflicts are found. I would suggest allowing sub-tables to overwrite master table values, but reporting error when there is a conflict/difference.)

Dates/Times should be outputted in JSON Schemas date-time format

http://tools.ietf.org/html/rfc3339#section-5.6

Currently converted into the default string representation by this line:
https://github.com/open-contracting/flattening-ocds/blob/1c3f3a9cb7f34cb1395d7559ae0d892eda995ceb/flattening_ocds/input.py#L217
For Excel dates, this looks like: 2015-01-01 00:00:00
(this is missing a T in the middle and a timezone)

Timezones should be added by assuming UTC by default, and adding a a commandline argument to specify a different timezone. (This should probably be location based e.g. 'Europe/London' in order to handle daylight savings.)

Switch separator from . to /

How easy would it be to switch to using / as the separator rather than .?

I've been looking at other tooling, and / fits better with the JSON Path RFC.

Not a priority though if complex.

Catch exceptions in order to give better error message

e.g. if a JSON file is invalid we get a traceback ending:

  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")

We could catch this and give a much better error message, without a traceback, and with useful information such as what file we were trying to open.

python2 bug

output.py needs to be updated to be python2 compatible

Traceback (most recent call last):
  File "output.py", line 77, in <module>
    output_release_schema_all_formats()
  File "output.py", line 73, in output_release_schema_all_formats
    spreadsheet_output.write_sheets()
  File "output.py", line 22, in write_sheets
    self.open()
  File "output.py", line 51, in open
    except FileExistsError:
NameError: global name 'FileExistsError' is not defined

nulls don't roundtrip, because key is ignored

E         -                'tender': {'awardCriteriaDetails': None,
E         -                           'documents': [{'id': 'https://buyandsell.gc.ca/cds/public/2014/04/04/cb21d6397174fbba48c92edb694de76f/tender_document_5p404-13181.pdf',
E         ?                ^^^^^^^^^^^
E         +                'tender': {'documents': [{'id': 'https://buyandsell.gc.ca/cds/public/2014/04/04/cb21d6397174fbba48c92edb694de76f/tender_document_5p404-13181.pdf',
E         ?                ^^^^^^^^^^^

(Diff taken from new roundtrip test:
523c837)

I'm not sure if there's a way of fixing this, or whether it's necessary to do so, so the correct answer might be to add an exception for it in the roundtripping test.

Add more information to the README

  • Description of what the column headers look like in the template (general form and what information is annotated etc.)
  • Add note about CSV encodings and Python 2

Only titles of fields including in RollUp in schema converted to fieldnames in JSON

For example, if the schema for recipient organisation includes:

  "recipientOrganization": {
      "items": {
        "$ref": "#/definitions/Organization"
      },
      "type": "array",
      "description": "-",
      "weight": 5,
      "title": "Recipient Org",
      "rollUp": ["id","name","charityNumber","companyNumber","streetAddress","addressLocality","postalCode","description","url"]      
    },

And a table row includes:

Recipient Org:Street Address Recipient Org:City Recipient Org:County Recipient Org:Postal Code Recipient Org:Country
383-387 High Street Stratford E15 4QZ UK

Then this will be converted to:

"recipientOrganization": [
                {
                    "name": "Discover", 
                    "charityNumber": 1070468, 
                    "addressLocality": "Stratford", 
                    "streetAddress": "383-387 High Street", 
                    "postalCode": "E15 4QZ", 
                    "id": "GB-CHC-1070468"
                }
            ],
           "Recipient Org": {
                "Country": "UK"
            },

Because country is not included in the schema Roll Up, and so it is not interpreted as a title alias for a field name, but instead as an additional field provided by the publisher.

This is not the desired behaviour, as we want to allow that publishers might practice a different set of roll ups in their own files in order to meet local use-cases.

I.e. The roll up specification in the schema should be used to determine:

(a) The creation of templates
(b) The creation of canonical summary spreadsheets

But should not limit the range of ways in which a publisher might express child fields in the main spreadsheet row.

Make error messages more descriptive

Currently there are a lot of ValueErrors, with no further information.

We might also want to replace some of these with warnings, and allow a faulty conversion to take place.

Handle simple arrays differently

Where properties are encountered that either contain:

(a) a simple array of values - e.g: { alternativeIdentifiers: ‘abc123’,’xyz434’ a simple array of simply arrays - e.g. { location: [ [54,3],[45,22] ] }

Then rather than create a sub-table, these values should be concatenated in in a cell, separated by ‘;’ generating cells containing:

(a) abc213; xyz434
(b) 45,3; 45,22

Are warnings too verbose by default for end users?

/home/bjwebb/code/opencontracting/flattening-ocds/flattening_ocds/input.py:199: UserWarning: Non-integer value "notanint" found in integer column, returning as string instead.
  warn('Non-integer value "{}" found in integer column, returning as string instead.'.format(value))
/home/bjwebb/code/opencontracting/flattening-ocds/flattening_ocds/input.py:193: UserWarning: Non-numeric value "notanumber" found in number column, returning as string instead.
  warn('Non-numeric value "{}" found in number column, returning as string instead.'.format(value))

(from https://gist.github.com/Bjwebb/d41b8efc71e24f5fba16)

Perhaps only the message itself should be printed by default, and not the information of what line the warning was raised on (as most users don't care about this). A verbose or debug command line flag could be added for users who do want the full output.

Add validation to spreadsheet templates

When generating an Excel template, basic validation should be included for columns, including:

  • Integer validation
  • Date validation
  • Select from list for enum values

Check whether we also need release.id in sub-tables

A data package of OCDS releases could include multiple sets of statements about a contracting process made at different points in time.

These would each have a unique release.id

In order to compile from a spreadsheet format into JSON structure we would need to match both the ocid and the release.id of a JSON object before merging properties into it.

This seems to require that release.id is also included in the sub-tables alongside ocid, and is used in the conversations between formats.

Allow header row to use titles rather than field names

The 360 Giving Data Standard makes use of user-friendly field titles, as opposed to field names, as column headings.

To allow 360 Giving Data to be converted by the tool we either need:

(a) A process that runs to convert the header row from field titles to field names prior to flattening / unflattening;

(b) The tool to be able to read field titles rather than field names;

Add unit tests

I think it's probably best to use py.test for this, since this is what other code in this organisation is using.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.