The flatten-tool from opendataservices

Conversions currently happen by loading all files into memory

I decided to do this as it's easier, the size of data people normally have/want in spreadsheets fits into memory, and I didn't want to be guilty of premature optimisation. It might bite us down the line though if get some particularly large files.

Remove OCDS specific aspects of the code

OCID ID
Top schema dict is assumed to be object (dictionary), but some JSON files will have a list as the root (split out to #47 as it doesn't affect 360 Giving)
Outputted JSON has base['releases'] populated, and this is hardcoded.
OCDS specific defaults #34
Default for root-id, see below

Add commandline argument support

(Probably using argparse)

Sheet names should be based on the reference, not the context

E.g. "item" instead of "itemsAwarded"

Add unflatten subcommand to flatten-ocds cli

Add field descriptions as notes in Excel output

If an option is set, add the field descriptions as a note on the relevant header cell when generating Excel output.

Unflatten subsheets properly

I've added a basic unflatten function. Currently this handles the main sheet correctly, but sub sheets are not yet added to the correct places.

Structure the code as a python package

Empty number fields currently cause conversion to fail

Relates to #23

Roll-up / Summary Table option

In the 360 Giving Data Standard we have had the concept of 'roll-up' properties. These are properties which would normally end up in a sub-table, but which, for the purpose of simple use-cases where there are only one-to-one relationships, could be included in the main table.

For example, this schema generates an Excel file with only a few properties in the main tab.

However, the 360 Giving Template is based on the idea of it being possible to express many of the properties from the sub-table in main.

Often not all the properties from a referenced object/sub-table will be wanted in the summary (e.g. in summary mode we might only want the Organisation Name and Postcode to come into the main table, but not all the detailed address information or organisation locations and classifications).

To achieve this, we should:

Create a Summary Mode switch for use when flattening data or generating templates
Specify in the schema for properties of referenced objects (i.e. properties normally in sub-tables) whether they should be included in the main table when in summary mode.

This would lead to the behaviour that:

When generating a template in summary mode, many properties will be possible to record directly in the master sheet.
When flattening JSON in summary mode, information would be duplicated between the summary table, and sub-tables, for any property which should be included in summary mode.
- (We need to think about how this would round-trip. I.e. would master or sub-tables take precedence if conflicts are found. I would suggest allowing sub-tables to overwrite master table values, but reporting error when there is a conflict/difference.)

Titles and parent IDs or types don't play nicely together

This is due to https://github.com/open-contracting/flattening-ocds/blob/master/flattening_ocds/input.py#L31 which replaces all colons with slashes.

A temporary fix to allow users to supply a non-title header for parent-ids or typed columns would be to not perform the conversion if we find a slash.

A longer term solution might be to use a different separator for this purpose when titles are used.

Add roundtrip test for 360 Giving

(to https://github.com/open-contracting/flattening-ocds/blob/master/flattening_ocds/tests/test_roundtrip.py )

This should have parametrized tests with/without the rollup option.

Set up tests with travis-ci

Dates/Times should be outputted in JSON Schemas date-time format

http://tools.ietf.org/html/rfc3339#section-5.6

Currently converted into the default string representation by this line:
https://github.com/open-contracting/flattening-ocds/blob/1c3f3a9cb7f34cb1395d7559ae0d892eda995ceb/flattening_ocds/input.py#L217
For Excel dates, this looks like: 2015-01-01 00:00:00
(this is missing a T in the middle and a timezone)

Timezones should be added by assuming UTC by default, and adding a a commandline argument to specify a different timezone. (This should probably be location based e.g. 'Europe/London' in order to handle daylight savings.)

Switch separator from . to /

How easy would it be to switch to using / as the separator rather than .?

I've been looking at other tooling, and / fits better with the JSON Path RFC.

Not a priority though if complex.

Catch exceptions in order to give better error message

e.g. if a JSON file is invalid we get a traceback ending:

  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")

We could catch this and give a much better error message, without a traceback, and with useful information such as what file we were trying to open.

Add requirements.txt

Convert code to use new ID format

This will be an id property instead of a somethingID property.

python2 bug

output.py needs to be updated to be python2 compatible

Traceback (most recent call last):
  File "output.py", line 77, in <module>
    output_release_schema_all_formats()
  File "output.py", line 73, in output_release_schema_all_formats
    spreadsheet_output.write_sheets()
  File "output.py", line 22, in write_sheets
    self.open()
  File "output.py", line 51, in open
    except FileExistsError:
NameError: global name 'FileExistsError' is not defined

Match JSON parsing columns to the schema annotated ones

nulls don't roundtrip, because key is ignored

E         -                'tender': {'awardCriteriaDetails': None,
E         -                           'documents': [{'id': 'https://buyandsell.gc.ca/cds/public/2014/04/04/cb21d6397174fbba48c92edb694de76f/tender_document_5p404-13181.pdf',
E         ?                ^^^^^^^^^^^
E         +                'tender': {'documents': [{'id': 'https://buyandsell.gc.ca/cds/public/2014/04/04/cb21d6397174fbba48c92edb694de76f/tender_document_5p404-13181.pdf',
E         ?                ^^^^^^^^^^^

(Diff taken from new roundtrip test:
523c837)

I'm not sure if there's a way of fixing this, or whether it's necessary to do so, so the correct answer might be to add an exception for it in the roundtripping test.

Add more information to the README

Description of what the column headers look like in the template (general form and what information is annotated etc.)
Add note about CSV encodings and Python 2

Only titles of fields including in RollUp in schema converted to fieldnames in JSON

For example, if the schema for recipient organisation includes:

  "recipientOrganization": {
      "items": {
        "$ref": "#/definitions/Organization"
      },
      "type": "array",
      "description": "-",
      "weight": 5,
      "title": "Recipient Org",
      "rollUp": ["id","name","charityNumber","companyNumber","streetAddress","addressLocality","postalCode","description","url"]      
    },

And a table row includes:

Recipient Org:Street Address	Recipient Org:City	Recipient Org:County	Recipient Org:Postal Code	Recipient Org:Country
383-387 High Street	Stratford		E15 4QZ	UK

Then this will be converted to:

"recipientOrganization": [
                {
                    "name": "Discover", 
                    "charityNumber": 1070468, 
                    "addressLocality": "Stratford", 
                    "streetAddress": "383-387 High Street", 
                    "postalCode": "E15 4QZ", 
                    "id": "GB-CHC-1070468"
                }
            ],
           "Recipient Org": {
                "Country": "UK"
            },

Because country is not included in the schema Roll Up, and so it is not interpreted as a title alias for a field name, but instead as an additional field provided by the publisher.

This is not the desired behaviour, as we want to allow that publishers might practice a different set of roll ups in their own files in order to meet local use-cases.

I.e. The roll up specification in the schema should be used to determine:

(a) The creation of templates
(b) The creation of canonical summary spreadsheets

But should not limit the range of ways in which a publisher might express child fields in the main spreadsheet row.

Do a full diff on roundtripping the sample data

Titles don't display properly in subsheets of generated template

Set up coveralls.io to track test coverage

It's linked to GitHub, and free for public repositories. I've already made the necessary change to the code 29c87e1 but I can't enable it on the https://coveralls.io/ website because I'm not an Owner.

flatten-ocds command doesn't output help information on python2

(but does on Python 3)

Add tests for unicode support

Especially wrt to python 2's non-unicode csv module - we might need to use unicodecsv for python 2

Latest version of openpyxl causes errors

https://travis-ci.org/open-contracting/flattening-ocds/builds/54588070

Make error messages more descriptive

Currently there are a lot of ValueErrors, with no further information.

We might also want to replace some of these with warnings, and allow a faulty conversion to take place.

Naming inconsistencies between Output and Intput

CSVInput vs CSVDirectoryOutput
XLSXInput vs XlsxOutput

Handle simple arrays differently

Where properties are encountered that either contain:

(a) a simple array of values - e.g: { alternativeIdentifiers: ‘abc123’,’xyz434’ a simple array of simply arrays - e.g. { location: [ [54,3],[45,22] ] }

Then rather than create a sub-table, these values should be concatenated in in a cell, separated by ‘;’ generating cells containing:

(a) abc213; xyz434
(b) 45,3; 45,22

Are warnings too verbose by default for end users?

/home/bjwebb/code/opencontracting/flattening-ocds/flattening_ocds/input.py:199: UserWarning: Non-integer value "notanint" found in integer column, returning as string instead.
  warn('Non-integer value "{}" found in integer column, returning as string instead.'.format(value))
/home/bjwebb/code/opencontracting/flattening-ocds/flattening_ocds/input.py:193: UserWarning: Non-numeric value "notanumber" found in number column, returning as string instead.
  warn('Non-numeric value "{}" found in number column, returning as string instead.'.format(value))

(from https://gist.github.com/Bjwebb/d41b8efc71e24f5fba16)

Perhaps only the message itself should be printed by default, and not the information of what line the warning was raised on (as most users don't care about this). A verbose or debug command line flag could be added for users who do want the full output.

JSON to Spreadsheet - Improve array support (when not using a schema)

E         -                'tag': ['tender'],
E         ?                       -        -
E         +                'tag': 'tender',

(Diff taken from new roundtrip test:
523c837)

Use openpyxl for output as well as input to reduce dependencies

(ie. to remove dependency on xlsxwriter).

Tests should write into tmpdir, not current working directory

https://github.com/open-contracting/flattening-ocds/blob/master/flattening_ocds/tests/test_output.py

Top schema dict is assumed to be object (dictionary), but some JSON files will have a list as the root

Moving this from #40 into it's own issue as neither Open Contracting nor 360 Giving are affected.

Add XLSX input

(To https://github.com/open-contracting/flattening-ocds/blob/master/flattening_ocds/input.py)

Add validation to spreadsheet templates

When generating an Excel template, basic validation should be included for columns, including:

Integer validation
Date validation
Select from list for enum values

Show list of formats in commandline help text

These should be read from the keys of the FORMATS dictionaries, and supplied to argparse via https://docs.python.org/3/library/argparse.html#choices

Add support for extra input/output formats

e.g. xls (pre 2007 Excel) or ods (Open Document Spreadshet)

Check whether we also need release.id in sub-tables

A data package of OCDS releases could include multiple sets of statements about a contracting process made at different points in time.

These would each have a unique release.id

In order to compile from a spreadsheet format into JSON structure we would need to match both the ocid and the release.id of a JSON object before merging properties into it.

This seems to require that release.id is also included in the sub-tables alongside ocid, and is used in the conversations between formats.

Subtables should include a column for the ID of their parent

Allow header row to use titles rather than field names

The 360 Giving Data Standard makes use of user-friendly field titles, as opposed to field names, as column headings.

To allow 360 Giving Data to be converted by the tool we either need:

(a) A process that runs to convert the header row from field titles to field names prior to flattening / unflattening;

(b) The tool to be able to read field titles rather than field names;

Add unit tests

I think it's probably best to use py.test for this, since this is what other code in this organisation is using.

Currently no support for using a separate sheet for simple arrays

This may be desirable if the array/items are long, or one of the items contains a ; or , character.

(Split out from #1)

Annotate extra information to column headers to help with conversion

The actual JSON key for this subsheet, on IDs
Types (e.g. number, boolean)

opendataservices / flatten-tool Goto Github PK

flatten-tool's People

Contributors

Stargazers

Watchers

Forkers

flatten-tool's Issues

Recommend Projects

Recommend Topics

Recommend Org