Comments (15)

DavidUnderdown commented on June 13, 2024

I can see how this would be useful. It would need a bit of thinking as to how it fits in with the global directives such as @totalColumns; we might also want a new @maxColumns there to cater for this scenario.

As you'll see, I've tagged this for 1.2, but we don't have the resources right now to work on it actively. If you want to have a go at working up an addition to the draft 1.2 spec, we're happy to receive pull requests.

lightswitch05 commented on June 13, 2024

Implemented in #17.

adamretter commented on June 13, 2024

I don't think @optionalColumn can be implemented, as it creates an ambiguity in the parser.

lightswitch05 commented on June 13, 2024

@adamretter The only way it differs from @optional is in not requiring the column header for an otherwise empty column.
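
A minimal sketch of the distinction, using the existing column-directive syntax. The @optionalColumn line is hypothetical (the directive proposed here), as is @maxColumns (the global directive @DavidUnderdown suggested above), since a fixed @totalColumns cannot describe a varying column count.

With @optional, the header must be present but the value may be empty:

version 1.1
@totalColumns 3
firstName: notEmpty
lastName: notEmpty
middleName: notEmpty @optional

With the proposed @optionalColumn, the column itself may be absent:

version 1.2
@maxColumns 3
firstName: notEmpty
lastName: notEmpty
middleName: notEmpty @optionalColumn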

adamretter commented on June 13, 2024

Yes, I understand. However, it is still ambiguous.

lightswitch05 commented on June 13, 2024

I have to disagree. Ambiguity is introduced by poorly defined requirements. What is ambiguous about a column being optional? In a schema, if you mark a column as optional, then you accept the fact that your output may not have that column defined. If it is defined, all other validations apply. Since the goal of this project is to define a CSV schema, let's take a look at some other schema projects.

JSON schema

JSON Schema defaults everything to be optional unless it is specifically marked as required.

Example JSON Schema

{
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "middleName": {
            "type": "string"
        }
    },
    "required": [
        "lastName",
        "firstName"
    ],
    "type": "object"
}

Valid JSON

{
  "firstName": "John",
  "lastName": "Doe"
}

{
  "firstName": "Donald",
  "lastName": "Trump",
  "middleName": "John"
}

Invalid JSON

{
  "firstName": "John",
  "lastName": "Doe",
  "middleName": null
}

Please note how the JSON Schema allows "middleName" to be omitted entirely, but does NOT allow it to be null. This type of validation is not possible in CSV Schema without @optionalColumn support.

XML schema (XSD)

XSD uses minOccurs="0" to define optional elements.

Example XSD

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="firstName" type="xs:string"/>
                <xs:element name="lastName" type="xs:string"/>
                <xs:element name="middleName" minOccurs="0" type="NonEmptyString"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:simpleType name="NonEmptyString">
        <xs:restriction base="xs:string">
            <xs:minLength value="1"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

Valid XML

<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName> <!-- middleName is not provided, but it is still valid -->
</person>
<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <middleName>John</middleName>
</person>

Invalid XML

<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <middleName /> <!-- invalid: the element is provided but empty -->
</person>

SQL

SQL is probably the most apt comparison. All columns are nullable by default, and you are not required to supply every column in every insert:

Example SQL table

CREATE TABLE people (
    firstName varchar(255) NOT NULL,
    lastName varchar(255) NOT NULL,
    middleName varchar(255) CHECK (middleName <> '') -- allow null but not empty
);

Valid Inserts

INSERT INTO people (firstName, lastName) values ('Donald', 'Trump');
INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', null);
INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', 'John');

Invalid Inserts

INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', '');

Closing comments

Finally, I recognize this project's goal is to define a CSV as strictly as possible. Providing @optionalColumn does not take away from that goal. The schema language provides for some extremely concise definitions, which is great, but that doesn't make something as generic as notEmpty ambiguous: it is perfectly defined as not empty. The same can be said for @optionalColumn.

@optionalColumn gives users the ability to validate two different CSVs that are functionally the same with a single schema. Why can I not represent these two files using a single schema?

"firstName","lastName","middleName"
"Donald","Trump",""
"firstName","lastName"
"Donald","Trump"

adamretter commented on June 13, 2024

@lightswitch05 The ambiguity comes from matching the columns in the CSV Schema to the columns in the CSV file. If you have one @optionalColumn, that might work (but only in some cases); if you have more than one, then it likely becomes impossible.

This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.

I think it might help if I try to explain how the implementation works, and thereby demonstrate why I think @optionalColumn is unimplementable:

Consider the following Schema:

version 1.2
1:
2:
3: @optionalColumn
4: @optionalColumn
5:
6: @optionalColumn
7:

When the implementation has to match this against a CSV file, columns 5, 6, and 7 are now ambiguous. Actually, anything after column (3) is ambiguous, but I think this example pushes the point further.

The processor has no way of knowing under what circumstances (5) should match which columns. (5) could match CSV column (3), (4), or (5) depending on the format of the CSV. You might even have a CSV where you think CSV Schema column (5) is matching CSV column (4), but due to data errors (which the CSV validator would not highlight in this case) it is actually matching CSV column (3). Therefore you are likely to introduce subtle data validation errors that were not possible before.
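
To make that concrete (an illustrative walkthrough of the schema above): suppose a CSV row arrives with five values, a through e. Exactly two of the three optional schema columns must then be absent, which leaves three equally valid assignments:

a b c d e  ->  schema columns 1 2 5 6 7  (3 and 4 absent)
a b c d e  ->  schema columns 1 2 4 5 7  (3 and 6 absent)
a b c d e  ->  schema columns 1 2 3 5 7  (4 and 6 absent)

All three satisfy the column count, so the processor cannot decide which rules apply to the values c and d.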

I can see that you might argue you should only use @optionalColumn with named columns in the CSV Schema. However, I would point out here that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match those in the CSV file, and for good reason: you can in fact use named columns with CSVs that have no header row (using names instead of numbers helps improve the clarity of any validation error messages to the user).

If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super-data-model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.
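
For example (an illustrative sketch of that workflow, not the output of any published tool), the generation step might emit one concrete schema per column combination, using only standard directives:

Generated schema A (middleName column present)

version 1.1
@totalColumns 3
firstName: notEmpty
lastName: notEmpty
middleName: notEmpty

Generated schema B (middleName column absent)

version 1.1
@totalColumns 2
firstName: notEmpty
lastName: notEmpty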

Let me know if that makes sense to you. If not I can try and explain further...

lightswitch05 commented on June 13, 2024

This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.

This is the point of a schema: to validate the structure, which could be particularly useful for CSVs, since the structure has so many unknowns.

I can see that you might argue you should only use @optionalColumn with named columns in the CSV Schema. However, I would point out here that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match those in the CSV file, and for good reason: you can in fact use named columns with CSVs that have no header row (using names instead of numbers helps improve the clarity of any validation error messages to the user).

I agree that @optionalColumn would be impossible to support without named headers. However, there are already existing directives that have mutual exclusivity. In fact, @noHeader is itself mutually exclusive with the @ignoreColumnNameCase directive. I don't believe it is an unreasonable request for an @optionalColumn directive that is mutually exclusive with the @noHeader directive.

If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super-data-model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.

I would not be requesting this feature if I had any control over the incoming CSVs. If csv-schema is only targeting usage by those with complete control over their files, then it is useless to me. I am expecting hundreds of different CSV formats, many of which contain optional columns that may or may not be supplied in the CSV. If your solution is to define a schema for each possible combination, then the total number of schemas required to validate the incoming files quickly becomes unmanageable (with n optional columns, that is potentially 2^n schemas).

adamretter commented on June 13, 2024

I agree that @optionalColumn would be impossible to support without named headers.

The point is that the named headers in the CSV Schema are just labels; they do not imply a match against the header in the CSV file itself. That is important for the flexibility that CSV Schema currently offers.

If you wanted to change it so that headers in the CSV Schema had to match headers in the CSV file, then you would need to do some more significant work first. After such a change, if it were acceptable to all parties, you could consider introducing an @optionalColumn directive. However, with the current spec and how parsing is executed and validated according to the CSV Schema, @optionalColumn is impossible without introducing ambiguity.

I would not be requesting this feature if I had any control over the incoming CSVs

You don't need control over the incoming CSVs. You only need control over generating the CSV Schemas, which I imagine must be within your remit ;-)

lightswitch05 commented on June 13, 2024

So the reference implementation is the blocker here and not the spec itself?

lightswitch05 commented on June 13, 2024

Dynamic CSV schema generation would be very problematic, as the schemas themselves would be submitted by the users.

DavidUnderdown commented on June 13, 2024

@adamretter as of commit b6dd4a7b0e23d3c9515864e2a2375b6bcd42c9f2 we do check header fields against the rule names in CSV Validator (though agreed, it's not explicit in the schema definition at the moment). Not doing so was causing us problems downstream, and seems counter-intuitive.

lightswitch05 commented on June 13, 2024

I just stumbled on digital-preservation/csv-validator/issues/134. It looks like our data is a little too wild for this schema definition, and we'll have to find an alternative solution. Thank you for your time discussing the merits of optional columns.

adamretter commented on June 13, 2024

@DavidUnderdown Yes, I did see that in the reference implementation. However, the spec does not say that MUST be the case, and for good reason (see above) ;-)

DavidUnderdown commented on June 13, 2024

Closed as per the above discussion.
