opendataservices / flatten-tool Goto Github PK
View Code? Open in Web Editor NEWTools for generating CSV and other flat versions of the structured data
Home Page: http://flatten-tool.readthedocs.io/en/latest/
License: MIT License
Tools for generating CSV and other flat versions of the structured data
Home Page: http://flatten-tool.readthedocs.io/en/latest/
License: MIT License
I decided to do this as it's easier, the size of data people normally have/want in spreadsheets fits into memory, and I didn't want to be guilty of premature optimisation. It might bite us down the line though if get some particularly large files.
(Probably using argparse)
E.g. "item" instead of "itemsAwarded"
If an option is set, add the field descriptions as a note on the relevant header cell when generating Excel output.
I've added a basic unflatten function. Currently this handles the main sheet correctly, but sub sheets are not yet added to the correct places.
Relates to #23
In the 360 Giving Data Standard we have had the concept of 'roll-up' properties. These are properties which would normally end up in a sub-table, but which, for the purpose of simple use-cases where there are only one-to-one relationships, could be included in the main table.
For example, this schema generates an Excel file with only a few properties in the main tab.
However, the 360 Giving Template is based on the idea of it being possible to express many of the properties from the sub-table in main.
Often not all the properties from a referenced object/sub-table will be wanted in the summary (e.g. in summary mode we might only want the Organisation Name and Postcode to come into the main table, but not all the detailed address information or organisation locations and classifications).
To achieve this, we should:
This would lead to the behaviour that:
This is due to https://github.com/open-contracting/flattening-ocds/blob/master/flattening_ocds/input.py#L31 which replaces all colons with slashes.
A temporary fix to allow users to supply a non-title header for parent-ids or typed columns would be to not perform the conversion if we find a slash.
A longer term solution might be to use a different separator for this purpose when titles are used.
This should have parametrized tests with/without the rollup option.
http://tools.ietf.org/html/rfc3339#section-5.6
Currently converted into the default string representation by this line:
https://github.com/open-contracting/flattening-ocds/blob/1c3f3a9cb7f34cb1395d7559ae0d892eda995ceb/flattening_ocds/input.py#L217
For Excel dates, this looks like: 2015-01-01 00:00:00
(this is missing a T in the middle and a timezone)
Timezones should be added by assuming UTC by default, and adding a a commandline argument to specify a different timezone. (This should probably be location based e.g. 'Europe/London' in order to handle daylight savings.)
How easy would it be to switch to using / as the separator rather than .?
I've been looking at other tooling, and / fits better with the JSON Path RFC.
Not a priority though if complex.
e.g. if a JSON file is invalid we get a traceback ending:
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
We could catch this and give a much better error message, without a traceback, and with useful information such as what file we were trying to open.
This will be an id
property instead of a somethingID
property.
output.py
needs to be updated to be python2 compatible
Traceback (most recent call last):
File "output.py", line 77, in <module>
output_release_schema_all_formats()
File "output.py", line 73, in output_release_schema_all_formats
spreadsheet_output.write_sheets()
File "output.py", line 22, in write_sheets
self.open()
File "output.py", line 51, in open
except FileExistsError:
NameError: global name 'FileExistsError' is not defined
E - 'tender': {'awardCriteriaDetails': None,
E - 'documents': [{'id': 'https://buyandsell.gc.ca/cds/public/2014/04/04/cb21d6397174fbba48c92edb694de76f/tender_document_5p404-13181.pdf',
E ? ^^^^^^^^^^^
E + 'tender': {'documents': [{'id': 'https://buyandsell.gc.ca/cds/public/2014/04/04/cb21d6397174fbba48c92edb694de76f/tender_document_5p404-13181.pdf',
E ? ^^^^^^^^^^^
(Diff taken from new roundtrip test:
523c837)
I'm not sure if there's a way of fixing this, or whether it's necessary to do so, so the correct answer might be to add an exception for it in the roundtripping test.
For example, if the schema for recipient organisation includes:
"recipientOrganization": {
"items": {
"$ref": "#/definitions/Organization"
},
"type": "array",
"description": "-",
"weight": 5,
"title": "Recipient Org",
"rollUp": ["id","name","charityNumber","companyNumber","streetAddress","addressLocality","postalCode","description","url"]
},
And a table row includes:
Recipient Org:Street Address | Recipient Org:City | Recipient Org:County | Recipient Org:Postal Code | Recipient Org:Country |
---|---|---|---|---|
383-387 High Street | Stratford | E15 4QZ | UK |
Then this will be converted to:
"recipientOrganization": [
{
"name": "Discover",
"charityNumber": 1070468,
"addressLocality": "Stratford",
"streetAddress": "383-387 High Street",
"postalCode": "E15 4QZ",
"id": "GB-CHC-1070468"
}
],
"Recipient Org": {
"Country": "UK"
},
Because country
is not included in the schema Roll Up, and so it is not interpreted as a title alias for a field name, but instead as an additional field provided by the publisher.
This is not the desired behaviour, as we want to allow that publishers might practice a different set of roll ups in their own files in order to meet local use-cases.
I.e. The roll up specification in the schema should be used to determine:
(a) The creation of templates
(b) The creation of canonical summary spreadsheets
But should not limit the range of ways in which a publisher might express child fields in the main spreadsheet row.
It's linked to GitHub, and free for public repositories. I've already made the necessary change to the code 29c87e1 but I can't enable it on the https://coveralls.io/ website because I'm not an Owner.
(but does on Python 3)
Especially wrt to python 2's non-unicode csv module - we might need to use unicodecsv for python 2
Currently there are a lot of ValueErrors, with no further information.
We might also want to replace some of these with warnings, and allow a faulty conversion to take place.
CSVInput vs CSVDirectoryOutput
XLSXInput vs XlsxOutput
Where properties are encountered that either contain:
(a) a simple array of values - e.g: { alternativeIdentifiers: ‘abc123’,’xyz434’ a simple array of simply arrays - e.g. { location: [ [54,3],[45,22] ] }
Then rather than create a sub-table, these values should be concatenated in in a cell, separated by ‘;’ generating cells containing:
(a) abc213; xyz434
(b) 45,3; 45,22
/home/bjwebb/code/opencontracting/flattening-ocds/flattening_ocds/input.py:199: UserWarning: Non-integer value "notanint" found in integer column, returning as string instead.
warn('Non-integer value "{}" found in integer column, returning as string instead.'.format(value))
/home/bjwebb/code/opencontracting/flattening-ocds/flattening_ocds/input.py:193: UserWarning: Non-numeric value "notanumber" found in number column, returning as string instead.
warn('Non-numeric value "{}" found in number column, returning as string instead.'.format(value))
(from https://gist.github.com/Bjwebb/d41b8efc71e24f5fba16)
Perhaps only the message itself should be printed by default, and not the information of what line the warning was raised on (as most users don't care about this). A verbose or debug command line flag could be added for users who do want the full output.
E - 'tag': ['tender'],
E ? - -
E + 'tag': 'tender',
(Diff taken from new roundtrip test:
523c837)
(ie. to remove dependency on xlsxwriter).
Moving this from #40 into it's own issue as neither Open Contracting nor 360 Giving are affected.
When generating an Excel template, basic validation should be included for columns, including:
These should be read from the keys of the FORMATS dictionaries, and supplied to argparse via https://docs.python.org/3/library/argparse.html#choices
e.g. xls (pre 2007 Excel) or ods (Open Document Spreadshet)
A data package of OCDS releases could include multiple sets of statements about a contracting process made at different points in time.
These would each have a unique release.id
In order to compile from a spreadsheet format into JSON structure we would need to match both the ocid and the release.id of a JSON object before merging properties into it.
This seems to require that release.id is also included in the sub-tables alongside ocid, and is used in the conversations between formats.
The 360 Giving Data Standard makes use of user-friendly field titles, as opposed to field names, as column headings.
To allow 360 Giving Data to be converted by the tool we either need:
(a) A process that runs to convert the header row from field titles to field names prior to flattening / unflattening;
(b) The tool to be able to read field titles rather than field names;
I think it's probably best to use py.test for this, since this is what other code in this organisation is using.
This may be desirable if the array/items are long, or one of the items contains a ;
or ,
character.
(Split out from #1)
If a cell contains a formula then that is being read into the JSON conversion rather than the calculated value of the cell.
The details at https://stackoverflow.com/questions/22613272/how-to-access-the-real-value-of-a-cell-using-the-openpyxl-module-for-python suggest this should be possible to change using the data_only=True flag when opening the workbook.
If this codebase is destined to support multiple standards, I would suggest that the defaults are made generic: #40
(See attatchments sheet for an example where this is wrong currently)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.