Comments (15)
I can see how this would be useful. It would need some thought as to how to fit it in with the global directives such as @totalColumns; we might also want a new @maxColumns there to cater for this scenario.
As you'll see, I've tagged this for 1.2, but we don't have the resources right now to work on that actively. If you want to have a go at working up an addition to the draft 1.2 spec, we're happy to receive pull requests.
from csv-schema.
implemented in #17
I don't think @optionalColumn can be implemented, as it creates an ambiguity in the parser.
@adamretter The only way it differs from @optional is that it does not require the column header for an otherwise empty column.
Yes, I understand. However, it is still ambiguous.
I have to disagree. Ambiguity is introduced by poorly defined requirements. What is ambiguous about a column being optional? In a schema, if you mark a column as optional, then you accept that your data may not have that column defined; if it is defined, all other validations apply. Since the goal of this project is to define a CSV schema, let's look at how some other schema projects handle this.
JSON Schema
JSON Schema defaults everything to be optional unless it is specifically marked as required.
Example JSON Schema

```json
{
  "properties": {
    "firstName": { "type": "string" },
    "lastName": { "type": "string" },
    "middleName": { "type": "string" }
  },
  "required": ["lastName", "firstName"],
  "type": "object"
}
```
Valid JSON

```json
{
  "firstName": "John",
  "lastName": "Doe"
}
```

```json
{
  "firstName": "Donald",
  "lastName": "Trump",
  "middleName": "John"
}
```
Invalid JSON

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "middleName": null
}
```
Please note how the JSON Schema allows "middleName" to be omitted entirely, but does NOT allow it to be null. This type of validation is not possible in CSV Schema without @optionalColumn support.
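The absent-vs-null distinction can be sketched in plain Python (no JSON Schema library, just the same rule hand-coded): a key may be missing, but if present it must be a string, so None fails.

```python
# Minimal sketch of the rule the JSON Schema above expresses:
# required keys must exist, and any key that IS present must be a
# string -- None (JSON null) is present but not a string, so it fails.
def check(record):
    required = ("firstName", "lastName")
    if any(key not in record for key in required):
        return False
    # absence is fine; presence demands a string value
    return all(isinstance(record[key], str)
               for key in ("firstName", "lastName", "middleName")
               if key in record)

print(check({"firstName": "John", "lastName": "Doe"}))        # True
print(check({"firstName": "John", "lastName": "Doe",
             "middleName": None}))                            # False
```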
XML Schema (XSD)
XSD uses minOccurs="0" to define optional elements.
Example XSD

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstName" type="xs:string"/>
        <xs:element name="lastName" type="xs:string"/>
        <xs:element name="middleName" minOccurs="0" type="NonEmptyString"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="NonEmptyString">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```
Valid XML

```xml
<person>
  <firstName>Donald</firstName>
  <lastName>Trump</lastName> <!-- middleName is not provided, but the document is still valid -->
</person>

<person>
  <firstName>Donald</firstName>
  <lastName>Trump</lastName>
  <middleName>John</middleName>
</person>
```
Invalid XML

```xml
<person>
  <firstName>Donald</firstName>
  <lastName>Trump</lastName>
  <middleName/> <!-- the element is provided but empty, which is invalid -->
</person>
```
SQL
SQL is probably the most apt comparison. All columns are nullable by default, and you are not required to supply every column on every insert:
Example SQL table

```sql
CREATE TABLE people (
  firstName  varchar(255) NOT NULL,
  lastName   varchar(255) NOT NULL,
  middleName varchar(255) CHECK (middleName <> '') -- allow NULL but not empty
);
```
Valid Inserts

```sql
INSERT INTO people (firstName, lastName) VALUES ('Donald', 'Trump');
INSERT INTO people (firstName, lastName, middleName) VALUES ('Donald', 'Trump', NULL);
INSERT INTO people (firstName, lastName, middleName) VALUES ('Donald', 'Trump', 'John');
```
Invalid Inserts

```sql
INSERT INTO people (firstName, lastName, middleName) VALUES ('Donald', 'Trump', '');
```
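The SQL behaviour above can be checked directly with SQLite (a sketch, using an in-memory database): an omitted or NULL middleName passes, while an empty string trips the CHECK constraint, because a CHECK only fails when the expression evaluates to false, not to unknown.

```python
import sqlite3

# In-memory demo: middleName may be omitted or NULL, but an empty
# string violates CHECK (middleName <> '').
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE people (
    firstName  TEXT NOT NULL,
    lastName   TEXT NOT NULL,
    middleName TEXT CHECK (middleName <> '')  -- allow NULL but not empty
)""")

con.execute("INSERT INTO people (firstName, lastName) VALUES ('Donald', 'Trump')")
con.execute("INSERT INTO people (firstName, lastName, middleName) VALUES ('Donald', 'Trump', NULL)")
try:
    con.execute("INSERT INTO people (firstName, lastName, middleName) VALUES ('Donald', 'Trump', '')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("empty middleName rejected:", rejected)  # True
```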
Closing comments
Finally, I recognize this project's goal is to define a CSV as strictly as possible. Providing @optionalColumn does not take away from that goal. The schema language provides some extremely concise directives, which is great, but that doesn't make something as generic as @notEmpty ambiguous: it is perfectly well defined as "not empty". The same can be said for @optionalColumn.
@optionalColumn gives users the ability to describe two functionally equivalent CSVs with the same schema. Why can I not represent these two files using a single schema?
```csv
"firstName","lastName","middleName"
"Donald","Trump",""
```

```csv
"firstName","lastName"
"Donald","Trump"
```
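A hypothetical sketch of what header-based matching could look like (this is the requested behaviour, NOT part of the current CSV Schema spec; the SCHEMA dict and headers_ok function are invented for illustration): columns are matched by header name, and a column marked optional may be absent entirely.

```python
import csv
import io

# Hypothetical: each schema column is matched by header name, and
# optional columns may be missing from the CSV altogether.
SCHEMA = {
    "firstName":  {"optional": False},
    "lastName":   {"optional": False},
    "middleName": {"optional": True},
}

def headers_ok(csv_text):
    header = next(csv.reader(io.StringIO(csv_text)))
    required = {name for name, rule in SCHEMA.items() if not rule["optional"]}
    # every required column present, and no columns the schema doesn't know
    return required <= set(header) and set(header) <= set(SCHEMA)

print(headers_ok('"firstName","lastName","middleName"\n"Donald","Trump",""'))  # True
print(headers_ok('"firstName","lastName"\n"Donald","Trump"'))                  # True
```

Both files above validate against the single schema, which is exactly the point of the request.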
@lightswitch05 The ambiguity comes from matching the columns in the CSV Schema to the columns in the CSV file. If you have one @optionalColumn, that might work (but only in some cases); if you have more than one, it likely becomes impossible.
This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.
I think it might help if I explain how the implementation works, and thereby demonstrate why I think @optionalColumn is unimplementable.
Consider the following schema:

```
version 1.2
1:
2:
3: @optionalColumn
4: @optionalColumn
5:
6: @optionalColumn
7:
```
When the implementation has to match this against a CSV file, columns 5, 6, and 7 are ambiguous. In fact, anything after (3) is ambiguous, but I think this example pushes the point further.
The processor has no way of knowing under what circumstances (5) should match which CSV column: (5) could match CSV column (3), (4), or (5) depending on the format of the CSV. You might even have a CSV where you think CSV Schema column (5) is matching CSV column (4), but due to data errors (which the CSV Validator would not highlight in this case) it is actually matching CSV column (3). You are therefore likely to introduce subtle data validation errors that were not possible before.
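The matching problem can be made concrete with a small enumeration (a sketch; the column numbers come from the schema above): against a 5-column CSV, two of the three optional columns must be absent, but the parser cannot tell which two, so several alignments are equally plausible.

```python
from itertools import combinations

# Schema columns 3, 4 and 6 are @optionalColumn. Against a 5-column
# CSV, exactly two optional columns must be absent -- enumerate every
# choice of which two, i.e. every possible schema-to-CSV alignment.
schema_columns = [1, 2, 3, 4, 5, 6, 7]
optional = (3, 4, 6)
csv_width = 5

absent_count = len(schema_columns) - csv_width  # 2 optional columns missing
alignments = [
    [c for c in schema_columns if c not in omitted]
    for omitted in combinations(optional, absent_count)
]
for alignment in alignments:
    print(alignment)
print(len(alignments), "equally valid alignments")  # 3
```

With only positional information, all three alignments satisfy the schema, which is precisely the ambiguity being described.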
I can see you might argue that you should only use @optionalColumn with named columns in the CSV Schema. However, I would here point out that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match those in the CSV file; and for good reason, as you can in fact use named columns with CSVs that have no header row (using names instead of numbers helps improve the clarity of validation error messages shown to the user).
If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super-data-model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.
Let me know if that makes sense to you. If not I can try and explain further...
> This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.
This is the point of a schema: to validate the structure. That could be particularly useful for CSVs, since the structure has so many unknowns.
> I could see that you might argue that you should only use @optionalColumn with named columns in the CSV Schema. However, I would here point out that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match the CSV file; and for good reason, as you can in fact use named columns with CSVs that have no header row (using names instead of numbers here helps improve the clarity of any validation error messages to the user).
I agree that @optionalColumn would be impossible to support without named headers. However, there are already existing directives that are mutually exclusive; in fact, @noHeader is itself mutually exclusive with the @ignoreColumnNameCase directive. I don't believe it is unreasonable to request an @optionalColumn directive that is mutually exclusive with the @noHeader directive.
> If you have the case where you want to have many similar CSV Schemas, I would simply suggest creating your super-data-model in something more flexible than CSV Schema, e.g. XML (or RDF), you could then easily run some very simple XSLT to generate the CSV Schemas you need on demand.
I would not be requesting this feature if I had any control over the incoming CSVs. If CSV Schema only targets users with complete control over their files, then it is useless to me. I am expecting hundreds of different CSV formats, many of which contain optional columns that may or may not be supplied in the CSV. If your solution is to define a schema for each possible combination, the total number of schemas required to validate the incoming files quickly becomes unmanageable.
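The "one schema per combination" workaround grows exponentially: a format with k independently optional columns needs a schema for every subset of those columns that might be present. A two-line sketch of the growth:

```python
# With k independently optional columns, every subset of them may or
# may not appear, so 2**k distinct header layouts (and thus schemas)
# are needed to cover every file that could arrive.
for k in (1, 3, 5, 10):
    print(f"{k} optional columns -> {2 ** k} schemas")
```

Even ten optional columns already means 1024 schemas, which is what makes the workaround unmanageable.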
> I agree that @optionalColumn would be impossible to support without named headers.
The point is that the named headers in the CSV Schema are just labels; they do not imply a match against the header in the CSV file itself. That is important for the flexibility that CSV Schema currently offers.
If you wanted to change it so that headers in the CSV Schema had to match headers in the CSV file, then you would need to do some more significant work first. After such a change, if it was acceptable to all parties, you could consider introducing an @optionalColumn directive. However, with the current spec and the way parsing and validation against the CSV Schema are executed, @optionalColumn is impossible without introducing ambiguity.
> I would not be requesting this feature if I had any control over the incoming CSVs.
You don't need control over the incoming CSVs. You only need control over generating the CSV Schemas, which I imagine must be within your remit ;-)
So the reference implementation is the blocker here and not the spec itself?
Dynamic CSV schema generation would be very problematic as the schemas themselves would be submitted by the users.
@adamretter As of commit b6dd4a7b0e23d3c9515864e2a2375b6bcd42c9f2 we do check header fields against the rule names in CSV Validator (though agreed it's not explicit in the schema definition at the moment). Not doing so was causing us problems downstream, and seems counter-intuitive.
I just stumbled on digital-preservation/csv-validator/issues/134. It looks like our data is a little too wild for this schema definition, and we'll have to find an alternative solution. Thank you for your time discussing the merits of optional columns.
@DavidUnderdown Yes I did see that in the reference implementation. However the spec does not say that MUST be the case, and for good reason (see above) ;-)
Closed as per the above discussion.