Comments (15)

DavidUnderdown commented on June 13, 2024

I can see how this would be useful. It would need a bit of thinking as to how it fits in with the global directives such as @totalColumns; we might also want a new @maxColumns there to cater for this scenario.

As you'll see, I've tagged this for 1.2, but we don't have the resources right now to work on it actively. If you want to have a go at working up an addition to the draft 1.2 spec, we're happy to receive pull requests.

lightswitch05 commented on June 13, 2024

Implemented in #17.

adamretter commented on June 13, 2024

I don't think @optionalColumn can be implemented, as it creates an ambiguity in the parser.

lightswitch05 commented on June 13, 2024

@adamretter The only way it differs from @optional is in not requiring the column header for an otherwise empty column.
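
A minimal sketch of the distinction, using the existing column-directive syntax. The @optionalColumn line is hypothetical (the directive proposed here), as is @maxColumns (the global directive @DavidUnderdown suggested above), since a fixed @totalColumns cannot describe a varying column count.

With @optional, the header must be present but the value may be empty:

version 1.1
@totalColumns 3
firstName: notEmpty
lastName: notEmpty
middleName: notEmpty @optional

With the proposed @optionalColumn, the column itself may be absent:

version 1.2
@maxColumns 3
firstName: notEmpty
lastName: notEmpty
middleName: notEmpty @optionalColumn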

adamretter commented on June 13, 2024

Yes, I understand. However, it is still ambiguous.

lightswitch05 commented on June 13, 2024

I have to disagree. Ambiguity is introduced by poorly defined requirements. What is ambiguous about a column being optional? In a schema, if you mark a column as optional, then you accept the fact that your output may not have that column defined. If it is defined, all other validations apply. Since the goal of this project is to define a CSV schema, let's take a look at some other schema projects.

JSON schema

JSON Schema defaults everything to be optional unless it is specifically marked as required.

Example JSON Schema

{
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "middleName": {
            "type": "string"
        }
    },
    "required": [
        "lastName",
        "firstName"
    ],
    "type": "object"
}

Valid JSON

{
  "firstName": "John",
  "lastName": "Doe"
}

{
  "firstName": "Donald",
  "lastName": "Trump",
  "middleName": "John"
}

Invalid JSON

{
  "firstName": "John",
  "lastName": "Doe",
  "middleName": null
}

Please note how the JSON Schema allows "middleName" to be omitted entirely, but does NOT allow it to be null. This type of validation is not possible in CSV Schema without @optionalColumn support.

XML schema (XSD)

XSD uses minOccurs="0" to define optional elements.

Example XSD

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="firstName" type="xs:string"/>
                <xs:element name="lastName" type="xs:string"/>
                <xs:element name="middleName" minOccurs="0" type="NonEmptyString"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:simpleType name="NonEmptyString">
        <xs:restriction base="xs:string">
            <xs:minLength value="1"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

Valid XML

<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName> <!-- middleName is not provided, but it is still valid -->
</person>
<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <middleName>John</middleName>
</person>

Invalid XML

<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <middleName /> <!-- invalid: the element is provided but empty -->
</person>

SQL

SQL is probably the most apt comparison. All columns are nullable by default, and you are not required to supply every column in every insert:

Example SQL table

CREATE TABLE people (
    firstName varchar(255) NOT NULL,
    lastName varchar(255) NOT NULL,
    middleName varchar(255) CHECK (middleName <> '') -- allow null but not empty
);

Valid Inserts

INSERT INTO people (firstName, lastName) values ('Donald', 'Trump');
INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', null);
INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', 'John');

Invalid Inserts

INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', '');

Closing comments

Finally, I recognize this project's goal is to define a CSV as strictly as possible. Providing @optionalColumn does not take away from that goal. The schema language provides for some extremely concise definitions, which is great, but that doesn't make something as generic as notEmpty ambiguous: it is perfectly defined as not empty. The same can be said for @optionalColumn.

@optionalColumn gives users the ability to validate two different CSVs that are functionally the same with a single schema. Why can I not represent these two files using a single schema?

"firstName","lastName","middleName"
"Donald","Trump",""
"firstName","lastName"
"Donald","Trump"

adamretter commented on June 13, 2024

@lightswitch05 The ambiguity comes from matching the columns in the CSV Schema to the columns in the CSV file. If you have one @optionalColumn, that might work (but only in some cases); if you have more than one, then it likely becomes impossible.

This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.

I think it might help if I try to explain how the implementation works, and thereby demonstrate why I think @optionalColumn is unimplementable:

Consider the following Schema:

version 1.2
1:
2:
3: @optionalColumn
4: @optionalColumn
5:
6: @optionalColumn
7:

When the implementation has to match this against a CSV file, columns 5, 6, and 7 are now ambiguous. Actually, anything after column (3) is ambiguous, but I think this example pushes the point further.

The processor has no way of knowing under what circumstances (5) should match which columns. (5) could match CSV column (3), (4), or (5) depending on the format of the CSV. You might even have a CSV where you think CSV Schema column (5) is matching CSV column (4), but due to data errors (which the CSV validator would not highlight in this case) it is actually matching CSV column (3). Therefore you are likely to introduce subtle data validation errors that were not possible before.
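
To make that concrete (an illustrative walkthrough of the schema above): suppose a CSV row arrives with five values, a through e. Exactly two of the three optional schema columns must then be absent, which leaves three equally valid assignments:

a b c d e  ->  schema columns 1 2 5 6 7  (3 and 4 absent)
a b c d e  ->  schema columns 1 2 4 5 7  (3 and 6 absent)
a b c d e  ->  schema columns 1 2 3 5 7  (4 and 6 absent)

All three satisfy the column count, so the processor cannot decide which rules apply to the values c and d.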

I can see that you might argue you should only use @optionalColumn with named columns in the CSV Schema. However, I would point out here that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match those in the CSV file, and for good reason: you can in fact use named columns with CSVs that have no header row (using names instead of numbers helps improve the clarity of any validation error messages to the user).

If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super-data-model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.
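
For example (an illustrative sketch of that workflow, not the output of any published tool), the generation step might emit one concrete schema per column combination, using only standard directives:

Generated schema A (middleName column present)

version 1.1
@totalColumns 3
firstName: notEmpty
lastName: notEmpty
middleName: notEmpty

Generated schema B (middleName column absent)

version 1.1
@totalColumns 2
firstName: notEmpty
lastName: notEmpty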

Let me know if that makes sense to you. If not I can try and explain further...

lightswitch05 commented on June 13, 2024

This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.

This is the point of a schema: to validate the structure, which could be particularly useful for CSVs, since the structure has so many unknowns.

I can see that you might argue you should only use @optionalColumn with named columns in the CSV Schema. However, I would point out here that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match those in the CSV file, and for good reason: you can in fact use named columns with CSVs that have no header row (using names instead of numbers helps improve the clarity of any validation error messages to the user).

I agree that @optionalColumn would be impossible to support without named headers. However, there are already existing directives that have mutual exclusivity. In fact, @noHeader is itself mutually exclusive with the @ignoreColumnNameCase directive. I don't believe it is an unreasonable request for an @optionalColumn directive that is mutually exclusive with the @noHeader directive.

If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super-data-model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.

I would not be requesting this feature if I had any control over the incoming CSVs. If csv-schema is only targeting usage by those with complete control over their files, then it is useless to me. I am expecting hundreds of different CSV formats, many of which contain optional columns that may or may not be supplied in the CSV. If your solution is to define a schema for each possible combination, then the total number of schemas required to validate the incoming files quickly becomes unmanageable (with n optional columns, that is potentially 2^n schemas).

adamretter commented on June 13, 2024

I agree that @optionalColumn would be impossible to support without named headers.

The point is that the named headers in the CSV Schema are just labels; they do not imply a match against the header in the CSV file itself. That is important for the flexibility that CSV Schema currently offers.

If you wanted to change it so that headers in the CSV Schema had to match headers in the CSV file, then you would need to do some more significant work first. After such a change, if it were acceptable to all parties, you could consider introducing an @optionalColumn directive. However, with the current spec and how parsing is executed and validated according to the CSV Schema, @optionalColumn is impossible without introducing ambiguity.

I would not be requesting this feature if I had any control over the incoming CSVs

You don't need control over the incoming CSVs. You only need control over generating the CSV Schemas, which I imagine must be within your remit ;-)

lightswitch05 commented on June 13, 2024

So the reference implementation is the blocker here and not the spec itself?

lightswitch05 commented on June 13, 2024

Dynamic CSV schema generation would be very problematic, as the schemas themselves would be submitted by the users.

DavidUnderdown commented on June 13, 2024

@adamretter as of commit b6dd4a7b0e23d3c9515864e2a2375b6bcd42c9f2 we do check header fields against the rule names in CSV Validator (though agreed, it's not explicit in the schema definition at the moment). Not doing so was causing us problems downstream, and seems counter-intuitive.

lightswitch05 commented on June 13, 2024

I just stumbled on digital-preservation/csv-validator/issues/134. It looks like our data is a little too wild for this schema definition, and we'll have to find an alternative solution. Thank you for your time discussing the merits of optional columns.

adamretter commented on June 13, 2024

@DavidUnderdown Yes, I did see that in the reference implementation. However, the spec does not say that MUST be the case, and for good reason (see above) ;-)

DavidUnderdown commented on June 13, 2024

Closed as per the above discussion.
