digital-preservation / csv-schema

CSV Schema

Home Page: http://digital-preservation.github.io/csv-schema

License: Mozilla Public License 2.0

CSS 0.29% HTML 99.71%

csv-schema schema-language

csv-schema's Introduction

CSV Schema

A Schema Language for CSV (Comma Separated Value) files.

This repository holds the code for creating the CSV Schema specification document, which is then published as HTML. The Schema language is formally expressed in EBNF.

You can find the documentation and latest published specification here: http://digital-preservation.github.io/csv-schema.

Examples of CSV Schemas can be found in the example-schemas folder.

Repository Organisation

master branch holds the source code for producing the specification.
gh-pages holds the documentation and copies of each published version of the specification.
There is one tag from master each time a version of the specification is published. The tag name reflects the specification version number.

Released under the Mozilla Public Licence version 2.0.

Philosophy

A few bullet-points that guide our thinking in the design of the CSV Schema Language:

Simple CSV Schema Language. A DSL (Domain Specific Language) was desired that could be expressed in plain text and should be simple enough that Metadata experts could easily write it without having to know a programming language or data/document modelling language such as XML or RDF. Note, the CSV Schema Language is NOT itself expressed in CSV, it is expressed in a simple text format.
Context is King! Schema rules are written for each column of the CSV file. Each set of column rules is then asserted against each row of the CSV file in turn. Each rule in the CSV Schema operates on the current context (e.g. defined Column and parsed Row), unless otherwise specified. Hopefully this makes the rules short and concise.
Streaming. Often the Metadata files that we receive are very large as they contain many records about a Collection which itself can be huge. The CSV Schema Language was designed with an eye to being able to write a Validation tool which could read the CSV file as a stream. Few steps require memoization of data from the CSV file, and where they do, this is limited and should be easy to optimise to keep memory use to a minimum.
Sane Defaults. We try to do the right thing by default. CSV files and their brethren (Tab Separated Values etc.) can come in many shapes and sizes; by default we parse CSV according to RFC 4180, and of course we allow you to customize this behaviour in the CSV Schema.
CSV Schema is NOT a Programming Language. This is worth stressing as it was something we had to keep sight of ourselves during development; CSV Schema is a simple data definition and validation language for CSV!
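
The "Context is King" and "Streaming" points above can be illustrated with a minimal Python sketch. It is not the CSV Validator implementation and does not use CSV Schema syntax: the RULES table, the rule signature (value, row) and the sample data are all assumptions made for illustration. Each column rule is asserted against each row in turn, and only the current row is held in memory.

```python
import csv
import io

# Hypothetical rule set: one predicate per column name (not CSV Schema syntax).
# Each rule receives the cell value and the whole current row as its context.
RULES = {
    "name": lambda value, row: value != "",
    "age": lambda value, row: value.isdigit() and int(value) > 0,
}

def validate_stream(lines, rules):
    """Yield (line_number, column, value) for every cell that fails its rule."""
    reader = csv.DictReader(lines)
    for row in reader:  # rows are consumed one at a time, never all at once
        for column, rule in rules.items():
            value = row[column]
            if not rule(value, row):
                yield reader.line_num, column, value

sample = io.StringIO("name,age\nAlice,30\n,41\nBob,0\n")
errors = list(validate_stream(sample, RULES))
```

Because validate_stream is a generator over a line iterator, the same function could be fed an open file handle for a very large file.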

csv-schema's Issues

Question about elaboration tolerance and ordering

Maybe I missed it, but I couldn't tell if the columns in a CSV file one is checking must come in the same order as they are listed in the body of a CSV schema.
Assuming that the prolog does not specify the column count, is it acceptable to have additional columns that do not match a column entry in the body, and have them just be unchecked?

I am interested in using the validator for some scientific data where there is a known set of columns that should be checked for reasonable contents, but where I'm not sure that the ordering of columns will be consistent, and where some data providers might have added additional columns of computed values to the raw values that my schema should check.

Thank you

Add URL Decode to StringProvider to facilitate comparisons between text fields and URL fields

We sometimes have both a filename field and an identifier field. The identifier field is a URL containing, for example, the full filepath of the file. As the URL is potentially encoded to avoid illegal characters such as spaces, it is currently not possible to compare it directly with the filename field to ensure that the filename is identical to that in the filepath. The URL decode function should take encodings such as %20 in the URL and convert them back to their unencoded equivalents (e.g. space) to facilitate this comparison. For example, the URL "file:///c:/some/directory/structure/file%20with%20spaces%20in%20its%20name.txt" would be decoded to "file:///c:/some/directory/structure/file with spaces in its name.txt", so that a column validation expression like filename: in(urlDecode($identifier)) would produce a useful result.
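
As a rough illustration of the requested behaviour (a sketch in Python, not the validator's code), the standard library's urllib.parse.unquote performs this kind of percent-decoding. The url_decode helper name and the filename value are assumptions; the identifier is the example from the issue.

```python
from urllib.parse import unquote

def url_decode(value: str) -> str:
    """Percent-decode a URL, e.g. turning %20 back into a space."""
    return unquote(value)

identifier = "file:///c:/some/directory/structure/file%20with%20spaces%20in%20its%20name.txt"
filename = "file with spaces in its name.txt"  # hypothetical filename column value

decoded = url_decode(identifier)
# Mirrors the idea of filename: in(urlDecode($identifier)) as a substring test.
filename_in_identifier = filename in decoded
```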

starts vs startsWith in docs

Hello,
just discovered your tool, planning to use in continuous validation process, thank you. Writing my first csvs ^-^
1.1 and 1.2 docs contain following

[41] StartsWithExpr ::= "starts(" StringProvider ")"

while samples use 'startsWith'. Same for 'ends'.
Is that correct?

Validation expression for floating point numbers?

Is there a way to describe that a column contains plain floating point numbers (except by using a regex pattern)? There is a positiveInteger validation expression so I would expect that there is also something like decimal or double but could not find it.

Thank you

Infinite number of (unnamed) columns

We have CSV files which, for each entry, contain a "header" made up of a fixed number of columns,
followed by a "body" made up of a variable number of columns.

Example below with a fixed number of columns 2 (letters) followed by a variable number of columns (containing numbers).

A,B,1,2,3,4,5,6,7,8
C,D,9,10,11,12

The concept is very much similar to the "varargs" notation in Java or the params keyword in C#.
I'm looking for a way to express this in the schema file.

The schema for this could be expressed as

version 1.1
@noHeader
fixed_column_1: notEmpty
fixed_column_2: notEmpty
variable_column: positiveInteger @infinite
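
To make the intended semantics concrete, here is a small Python sketch of how such a check might behave. The @infinite directive above is only a proposal, and this sketch assumes that zero trailing columns are acceptable and that positiveInteger means a non-empty run of ASCII digits; both are assumptions.

```python
import csv
import io

FIXED_COLUMNS = 2  # number of leading notEmpty columns in the example

def is_positive_integer(value: str) -> bool:
    # Assumption: a non-empty run of ASCII digits counts as a positive integer.
    return value.isascii() and value.isdigit()

def row_is_valid(row):
    """Check the fixed leading columns, then every remaining column."""
    if len(row) < FIXED_COLUMNS:
        return False
    head, tail = row[:FIXED_COLUMNS], row[FIXED_COLUMNS:]
    return all(cell != "" for cell in head) and all(
        is_positive_integer(cell) for cell in tail
    )

data = io.StringIO("A,B,1,2,3,4,5,6,7,8\nC,D,9,10,11,12\nE,F,13,x\n")
results = [row_is_valid(row) for row in csv.reader(data)]
```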

Documentation typing error

At "5.2.1.3.1 Usage" there is probably a typing error.
"fifth_column: is($a_column) or is(concat($another_column,$third_column,"/",noExt(fourth_column),".pdf")"
fourth_column should be preceded by the $.

Example for uuid4 check is incorrect

While the grammar correctly reflects the need to use uuid4 to check that a column contains a version 4 UUID, the example given incorrectly uses just uuid. This exists in at least v1.1 and v1.2 of the schema language documentation.

`XsdDateTimeExpr` mismatch between description and grammar

The specs describe the xsdDateTimeExpr as having an optional timezone:

[...] as shown, the xDateTime values may, or may not, have a component indicating a specific timezone, here Z (Zulu) for UTC (Greenwich Mean Time)

The grammar, on the other hand mandates its use, as XsdDateTimeExpr implies XsdTimeLiteral which in turn implies XsdTimezoneComponent.

XsdDateTimeExpr ::= "xDateTime" ("(" XsdDateTimeLiteral "," XsdDateTimeLiteral ")")?
XsdDateTimeLiteral ::= XsdDateWithoutTimezoneComponent "T" XsdTimeLiteral
XsdTimeLiteral ::= XsdTimeWithoutTimezoneComponent XsdTimezoneComponent
XsdTimezoneComponent ::= ((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) /* xgc:regular-expression */

The fact that XsdDateLiteral does utilize an optional timezone may imply a mistake.

allow is to take multiple comma separated values as alternative to large or statements

One of the digital archivists tried to write the syntax is("value1","value2") rather than is("value1") or is("value2"). Obviously this doesn't work currently, but as it was actually using an ExplicitContextExpr too (i.e. referring to the value of another column) it made for a much more compact expression, so this might be worth considering for later versions of the schema language.

Ability to use current date or year in appropriate tests

With date ranges, or even simple numeric ranges where the field actually holds a year, it would sometimes be useful for the schema to be able to set the upper or lower bound of a range to the current date less a delta, or to the end of the current year, e.g. to check that we don't have any records that should be closed because the record subject's date of birth is less than 100 years ago.
E.g. running validation now we might check:
birth_date_year: range(1850,1916)
as then there would be no issues with opening any records. But if we get more records in the same series next year, we'd have to manually update the schema to increase the year by one. If we could instead write e.g.:
birth_date_year: range(1850,currentYear-101)
the schema would always work in the desired way.
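
A minimal Python sketch of the idea, assuming the bound is evaluated once at validation time. The helper names are hypothetical, and currentYear is the issue's proposed keyword, not existing syntax.

```python
from datetime import date

def closure_upper_bound(today: date, delta_years: int = 101) -> int:
    """Upper bound equivalent to the proposed currentYear-101."""
    return today.year - delta_years

def in_range(value: str, lower: int, upper: int) -> bool:
    """Inclusive numeric range check on a cell value."""
    return value.isdigit() and lower <= int(value) <= upper

# Evaluated at validation time, so the schema never needs a manual update.
upper = closure_upper_bound(date.today())
birth_year_ok = in_range("1900", 1850, upper)
```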

Add prefixedRange to CSV Schema Language 1.2

Service numbers often have a non-numeric prefix, but then the numeric part may be within a well-defined numeric range. It's possible to do a bit of checking that the number is within an expected range using regex, but it's not easy to read, and difficult to be absolutely precise. It would be helpful if you could do something like prefixedRange("JX ",125000,145750).

There's some similarity between this and being able to supply a fixed path to prefix to a file expression (though the implementation would need to be different).
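
One possible reading of prefixedRange, sketched in Python. The function name follows the issue's suggestion, but the semantics (literal prefix, then digits only, inclusive bounds) are assumptions rather than anything the specification defines.

```python
def prefixed_range(value: str, prefix: str, lower: int, upper: int) -> bool:
    """True if value is prefix followed by an integer within [lower, upper]."""
    if not value.startswith(prefix):
        return False
    number = value[len(prefix):]
    # Assumption: the numeric part must consist of ASCII digits only.
    return number.isascii() and number.isdigit() and lower <= int(number) <= upper

service_number_ok = prefixed_range("JX 125000", "JX ", 125000, 145750)
```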

Amend URI Expression to specify whether relative URIs are allowed in 1.2

The current URI Expression allows both absolute and relative URIs, in general we want to ensure it's an absolute URI. To enable backwards compatibility this should be implemented as either a new absoluteUri expression, or by adding an optional flag to the existing uri expression in 1.2

Ability to make column headers optional

It appears the @optional directive allows the values for a column to be empty - but there doesn't appear to be a way to make the entire column header optional.

This feature would be very useful when there are many subsets of CSV's. With the ability to define an entire column as optional, you could then create a single superset schema that would validate each subset of available columns.

For example:
I might initially have a v1 CSV defined as:

version 1.1
"First Name":
"Last Name":

Later, it's determined we would like to receive more information (v2), but to be backwards compatible (v1) CSV's are still accepted:

version 1.1
"First Name":
"Last Name":
"Middle Name": @optional

Now, I have a mixture of CSV's - and not a single schema that can validate them all. v1 CSV's will fail v2 validation since the "Middle Name" column is not defined regardless of it being optional. v2 CSV's will fail the v1 schema since it has an extra unknown column.

Proposed solution:

version 1.2
"First Name":
"Last Name":
"Middle Name": @optionalColumn

Making the entire column definition optional allows a single schema to validate both v1 and v2 of my CSVs
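
The header check this proposal implies can be sketched in Python as follows. @optionalColumn is only proposed here, and the header_is_valid helper and its rules (no missing required columns, no unknown columns) are assumptions about how it might behave.

```python
REQUIRED = ["First Name", "Last Name"]
OPTIONAL = ["Middle Name"]  # columns that would carry the proposed @optionalColumn

def header_is_valid(header, required, optional):
    """Accept a header if all required columns exist and no unknown ones appear."""
    present = set(header)
    missing = [column for column in required if column not in present]
    unknown = [c for c in header if c not in required and c not in optional]
    return not missing and not unknown

v1_ok = header_is_valid(["First Name", "Last Name"], REQUIRED, OPTIONAL)
v2_ok = header_is_valid(["First Name", "Last Name", "Middle Name"], REQUIRED, OPTIONAL)
```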

EBNF definition of Positive Integer Literal allows (infinite) zero padding

The definition is PositiveIntegerLiteral ::= [0-9]+ /* xgc:regular-expression */
which allows integers to have an (infinite) number of leading zeroes. Typically when we say that something should be a PositiveIntegerLiteral we are anticipating that there would not be leading zeroes.
For 1.2 it would be good to revisit this.
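
The difference can be seen with a short Python check. The stricter pattern below is only one possible alternative, and whether a bare "0" should be accepted is an open choice that this sketch makes arbitrarily.

```python
import re

CURRENT = re.compile(r"[0-9]+")                 # pattern quoted in the issue
NO_LEADING_ZEROS = re.compile(r"0|[1-9][0-9]*")  # hypothetical stricter variant

def accepts(pattern, value):
    """True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

padded_current = accepts(CURRENT, "007")
padded_strict = accepts(NO_LEADING_ZEROS, "007")
```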

repeat rules (within a single column)

We specify in the Scanning and Transcription Framework that some fields could potentially be a list of uuids (of "related" images, either where an image of a double page spread is split into two single pages, or where a very large master has to be imaged as a set of "tiles" of the original). At present this can only be done using regex, it would be simpler if there was a uuidList test in CSV Schema Language 1.2, which would test for a comma-separated list of uuids ie uuid1,uuid2,...uuidn
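
A sketch of what a uuidList test might check, written in Python. The UUID pattern here (hyphenated hex groups, either case, no version check) is an assumption and may differ from the pattern the specification uses for uuid4.

```python
import re

# Assumed textual form of a UUID: 8-4-4-4-12 hexadecimal digits.
UUID = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
UUID_LIST = re.compile(rf"{UUID}(,{UUID})*")

def is_uuid_list(value: str) -> bool:
    """True for one or more UUIDs separated by commas, with no spaces."""
    return UUID_LIST.fullmatch(value) is not None

first = "123e4567-e89b-42d3-a456-426614174000"
second = "9b2f0c1a-7d3e-4f5a-8b6c-0123456789ab"
pair_ok = is_uuid_list(first + "," + second)
```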

Conditional warning

The spec allows a validation rule to trigger a warning instead of an error, but it doesn’t allow for a given field to trigger a warning if a condition is met and an error in other cases.

To give a more concrete example of what I’m trying to achieve, a field I’m validating is an IMDb id for movies. They used to be strictly tt+ 7 digits, so anything else was invalid, but they recently introduced tt + 8 digits too. The latter is valid so it should not be an error, but it is fairly uncommon and may be suspicious (we had several cases of accidental extraneous characters in that field). I’d like to have an error for anything that is not tt + 7 or 8 digits, and warnings for tt + 8 digits.
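
The desired three-way outcome can be sketched in Python like this. The check_imdb_id function and its return values are hypothetical; they only illustrate the behaviour being requested.

```python
import re

def check_imdb_id(value: str) -> str:
    """Classify an IMDb id as 'ok', 'warning' or 'error'."""
    if re.fullmatch(r"tt[0-9]{7}", value):
        return "ok"        # the common, long-standing form
    if re.fullmatch(r"tt[0-9]{8}", value):
        return "warning"   # valid but uncommon, so worth a second look
    return "error"         # anything else is invalid

outcomes = [check_imdb_id(v) for v in ("tt0111161", "tt10872600", "tt01111611x")]
```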

Line terminator global directive

Similar to the other global directives, e.g. for the separator. See digital-preservation/csv-validator#164

zeroPad string provider

For 1.2 - see digital-preservation/csv-validator#122

Allow indeterminate values in Range

Allow the use of * in Range statements, as for the Length statement, to indicate that there is either no lower or no upper bound to the range, so range(0,*) would allow all positive numbers, while range(*,0) would allow all negative values. (see digital-preservation/csv-validator#69 for further context)
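
A small Python sketch of an open-ended range check, assuming * marks a missing bound and that the bounds are inclusive (so 0 itself passes both examples). The helper names are hypothetical.

```python
from typing import Optional

def parse_bound(token: str) -> Optional[float]:
    """Turn a bound token into a number, or None when it is the wildcard *."""
    return None if token == "*" else float(token)

def in_open_range(value: float, lower: Optional[float], upper: Optional[float]) -> bool:
    """Inclusive range check where a bound of None means unbounded."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

non_negative = in_open_range(5, parse_bound("0"), parse_bound("*"))
non_positive = in_open_range(-3, parse_bound("*"), parse_bound("0"))
```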

Support for nano seconds

Under section 3.2.7.1 of XMLSCHEMA-2 it reads
"'.' s+ (if present) represents the fractional seconds;"
Implying the RegEx for XsdTimeWithoutTimezoneComponent shouldn't be restrained to milliseconds, reading as
([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]+)?
instead of
([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.[0-9]{3})?

Section 3.2.7.2 also states the fraction may not end with 0. A potential candidate may be
([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(\.([0-9]+[1-9])?)?
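
The three patterns above can be compared with a quick Python check. This is only a sketch: the constant names are made up, and Python's re module stands in for whatever regex engine a validator would use. It also shows two side effects of the last candidate as written: it accepts a bare trailing "." and rejects a single fractional digit.

```python
import re

TIME = r"([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])"
MILLIS_ONLY = re.compile(TIME + r"(\.[0-9]{3})?")              # current pattern
ANY_FRACTION = re.compile(TIME + r"(\.[0-9]+)?")               # first proposal
NO_TRAILING_ZERO = re.compile(TIME + r"(\.([0-9]+[1-9])?)?")   # second candidate

def matches(pattern, value):
    """True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

nanos_current = matches(MILLIS_ONLY, "12:30:45.123456789")
nanos_proposed = matches(ANY_FRACTION, "12:30:45.123456789")
```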

Regex for timezone erroneous

Both the regex for XsdTimezoneComponent as well as for the optional variant are erroneous:

((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)

This does not allow for timezones with two leading zeros, e.g. +00:30.

On a similar note, the minute-part of the regex could be simplified from 0[0-9]|[1-5][0-9] to [0-5][0-9].
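
The problem can be reproduced with a short Python check. The PROPOSED pattern is an assumption: it widens the hour part to 0[0-9] and applies the simplified minute part suggested above, and it is not text from the specification.

```python
import re

CURRENT = re.compile(r"((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)")
PROPOSED = re.compile(r"((\+|-)(0[0-9]|1[0-9]|2[0-4]):[0-5][0-9]|Z)")  # assumed fix

def matches(pattern, value):
    """True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

half_hour_current = matches(CURRENT, "+00:30")
half_hour_proposed = matches(PROPOSED, "+00:30")
```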

Support for Non-CSV Metadata / Front Matter / Comments in CSV Files

Sorry for the long message, I guess I've been thinking a lot about CSV's...

This issue is to suggest support for CSV files which contain non-CSV metadata or front matter at the top of the file, as well to raise the issue of comments within CSV files.

Although CSV files that begin with non-CSV metadata are beyond the type described in RFC 4180, they are quite common. Non-CSV data is typically used to include metadata about the data in the file, such as the equipment and parameters that went into an experiment.

I work with earth science data, where the idea of including multiple-line frontmatter in the file is quite common. I've attached a sample file from NASA as an example.

Supporting these kinds of files fully could entail a number of smaller changes, each of which might be considered independently. However, I've created one issue for the topic to try to unify discussion, at least at the initial stages.

Standards and Common Practices

There does not seem to be a widely-accepted standard for such files. I've run across a few attempts at defining a standard, but they don't seem to have caught on widely:

https://csvy.org/ (looks more mature, though I don't think many libraries for CSV interaction support it)
https://github.com/csvspecs (looks to be work-in-progress)

As for common practices, I can speak to the spaces I'm familiar with, which are (mostly Python-based) tools for data processing used in the sciences and in data science.

The Pandas library supports specifying a comment character (i.e. '#') that denotes either whole lines or end-of-line comments:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#comments-and-empty-lines

Pandas is widely used, so this gives me the idea that at least some people use these types of comments.

The NASA Space Physics Data Facility (https://cdaweb.gsfc.nasa.gov/) uses the '#' comment character and formatting of the file I attached. The website allows you to download any of the measurements in their database in this format. But it also has several other export options, including a "normal" CSV with the metadata in a separate JSON file, as well as the raw data (in netCDF, which isn't a type of CSV at all). So perhaps they expect that people who are going to do lots of analysis will use the "normal" CSV files. This is to say that, while I think CSV Schema should support CSV files with metadata, I imagine some people would argue that real-world data collection should not be done using them.

Support within CSV Schema

As for the schema:

Ignoring Comments / Metadata

@adamretter suggested adding directives to ignore leading lines when validating CSV files (text is modified from his):

@IgnoreLeadingLines '#', which would simply ignore all lines from line 0 that start with a '#' character up until the first line that does not start with that character.
@IgnoreCommentLines '#', which would just ignore any line which starts with a '#' character.
other options, i.e. @IgnoreLeadingLinesMatching "regular expression"

I think it would be useful to be able to ignore the leading lines, and I like these directives. The difference between @IgnoreLeadingLines and @IgnoreCommentLines is helpful, since I could see situations that call for one but not the other.
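
The difference between the two proposed directives can be sketched in Python. The function names mirror the directive names but are hypothetical, as is the sample data; neither directive exists in the specification.

```python
from itertools import dropwhile

def ignore_leading_lines(lines, marker="#"):
    """Drop marked lines only until the first unmarked line is reached."""
    return list(dropwhile(lambda line: line.startswith(marker), lines))

def ignore_comment_lines(lines, marker="#"):
    """Drop every line that starts with the marker, wherever it appears."""
    return [line for line in lines if not line.startswith(marker)]

lines = ["# source: example", "# units: nT", "time,value", "0,1.5", "# gap", "1,1.7"]
leading_only = ignore_leading_lines(lines)
all_comments = ignore_comment_lines(lines)
```

With the first behaviour the "# gap" line is still passed on to CSV parsing, which is exactly the case where one directive might be wanted without the other.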

Validating Comments / Metadata

I think there also should be a way to validate the contents of the non-CSV lines, as well as the CSV data itself. But I'm not sure if this is something the CSV Schema itself should support, or if this would be better handled by a more general system that supports files with multiple parts (and might make use of CSV schema to describe the CSV part). I'm not sure whether such a system exists.

On the other hand, there definitely are CSV files like this out there, so one argument is that the CSV Schema should be able to describe them.

If this is something the CSV Schema might support, it would be helpful to have multiple options:

Directives like those above to ignore commented lines, for files that are allowed to contain comments, but the comments can be anything.
A way to validate comments in some potentially-not CSV format, for files where the comments must meet certain requirements.

What seems ideal for the purpose of validating files with metadata is a way to say "this kind of header isn't CSV, but needs to be validated with X", where X is some external schema / tool. For instance, I might pass the metadata to a JSON validator or compare it with a YAML schema.

I think it would be ideal to be able to specify the type of non-CSV data in a flexible way that does not require the CSV Schema to maintain a list of supported metadata types. This would also be useful for people (such as myself) who have CSV files with metadata that is not in any standard format, but that they nonetheless may wish to use.

It would also be helpful to do what can be done to reduce the work for those implementing the language. Someone who is creating a CSV validator may have to explicitly include support for various metadata types, but hopefully this could be as simple as piping the data to existing JSON/YAML/whatever validators in their language, rather than expecting them to include their own support for each metadata type. I'm not versed enough in this area to give detailed recommendations, but it's a point to consider.

Other thoughts

Another issue to consider is end-of-line comments that occur in the data. I'm not sure how many people have files like this, but as I mentioned above, Pandas includes support for these comments. There's also the possibility of inline comments (between data elements), but that seems really far-fetched (I don't know why someone would try to create a CSV file like that).

Yet another issue is leading lines that are not marked with a comment character at all (the only way to tell is to look where the data starts). I happen to have some unfortunately-formatted files like this. Actually, if people were to adopt the CSVY standard (first link above), this would be a problem. The YAML header in CSVY could be any length, and it isn't marked by a comment character at the beginning of each line. (The end of the YAML block has the standard "---" that denotes the end of a document in YAML.)

Uploading OMNI_HRO_1MIN_27555.csv.txt…

CorrespondingRangeExpr - additional validation expression

For 1.2, see digital-preservation/csv-validator#121

Slash or backslash in case expression?

Hi,

I think there's a typo in the schema documentation:

The examples for the switch case expression (lines 1943-1954 in version 1.1, lines 1964-1975 in version 1.2) use a backslash to separate the column reference and the conditional expression, like switch(($a_column\is("true"), ....

When playing around with the csv-validator tool however, it yelled at me demanding that I use a slash character as separator. When I changed the expression to something like switch(($a_column/is("true"), ... it worked.

Thanks,
Martin

ISO8601 is more than just YYYY-MM-DD

YYYY and YYYY-MM should also be accepted as valid for xDate fields.
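
As a rough Python sketch of what accepting reduced-precision dates could look like. The pattern is an assumption, covers only the three forms mentioned here, and does not check days against month lengths.

```python
import re

# YYYY, YYYY-MM or YYYY-MM-DD; a hypothetical pattern, not the spec's grammar.
REDUCED_DATE = re.compile(r"[0-9]{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?")

def is_reduced_date(value: str) -> bool:
    """True if the whole value is a year, year-month or full date."""
    return REDUCED_DATE.fullmatch(value) is not None

accepted = [is_reduced_date(v) for v in ("2016", "2016-05", "2016-05-17")]
```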

Broken link in http://digital-preservation.github.io/csv-schema/#toc4

I was researching for https://psv-format.github.io/ to try and figure out a possible schema for it, and found that one of your webpages, http://digital-preservation.github.io/csv-schema/#toc4, has a broken link.

digitised_surrogate_tech_acq_metadata_v1_TESTBATCH000.cs

that is pointing to a .csv rather than .csvs (https://github.com/digital-preservation/csv-schema/blob/master/example-schemas/digitised_surrogate_tech_acq_metadata_v1_TESTBATCH000.csvs)

Error in Example 38 - should be positiveInteger, not integer

See digital-preservation/csv-validator#307

Incorrect RegEx for XsdTimezoneComponent

Reading section 3.2.7.3 of XMLSCHEMA-2, I believe that XsdTimezoneComponent should match '+00:00' & '-00:00', reading as
((\+|-)(0[0-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)?
instead of
((\+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)?

`anyExpr` inconsistent definition in specification 1.1

The definition of anyExpr appears to be wrong in version 1.1 of the specification. In contrast to the description above, the version in the appendix takes a single stringProvider instead of a comma-separated list.

A proper logical NOT operator

AFAIK, the only way to express a negative rule currently is via the @matchIsFalse directive.

This means the entire column rule must be expressed either as a positive or a negative condition, and mixing positive and negative conditions (e.g. regex("[A-Z]+") and not(starts($another_column))) is not possible.

It would, therefore, be quite nice if a logical NOT operation was available that could invert the logic of arbitrary column validation expressions.
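
The kind of composition being asked for can be sketched with small rule combinators in Python. None of these helpers are part of CSV Schema; the rule signature (value, row) and the choice of whole-value regex matching are assumptions.

```python
import re

def regex(pattern):
    # Assumption: the pattern must match the whole cell value.
    return lambda value, row: re.fullmatch(pattern, value) is not None

def starts(column):
    """Rule: the value starts with the content of another column in the row."""
    return lambda value, row: value.startswith(row[column])

def not_(rule):
    """Invert an arbitrary rule."""
    return lambda value, row: not rule(value, row)

def and_(*rules):
    """Require every given rule to hold."""
    return lambda value, row: all(rule(value, row) for rule in rules)

# Equivalent in spirit to: regex("[A-Z]+") and not(starts($another_column))
rule = and_(regex("[A-Z]+"), not_(starts("another_column")))
mixed_ok = rule("ABC", {"another_column": "X"})
```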

'@columns' global directive in sample valid?

I'm working on a parser for csv-schemas. I test this on the examples provided in example-schemas. One of the schemas uses @columns 27 as a global directive, which my parser fails on.

As far as I can tell, it is (in this case...) correct to do so, the line is not a valid GlobalDirective according to both the grammars, it should be @totalColumns 27. Am I overlooking something here?

Clarification: @optional means full, empty, or partially full?

Another trivial clarification, I'm afraid 🙁

I am using @optional currently in a csvs to mean a column may hold one of

any of the allowed values defined for that column (in all cells)
any of the allowed values defined for that column (in some of the cells with the rest empty)
no values (totally empty column, just a header and n blank cells)

Is that the intended meaning? It seems to be from the way the validator works?

It's not totally clear to me from the doc if a partially empty column is OK:

http://digital-preservation.github.io/csv-schema/csv-schema-1.1.html#column-definitions-examples

Is there a different directive for this?

Ability to escape characters

A use case has appeared where it would be useful to be able to escape characters so that they can be interpreted differently from default behaviour.

For example -

is("")

Causes an error because it contains ". I propose being able to use a backslash to escape it.

is("<extref href="http://www.nationalarchives.gov.uk\">")

Incorrect `identicalExpr` grammar rule

Version 1.1 and 1.2 use the wrong literal for the identicalExpr, 'positiveInteger' instead of 'identical', as described earlier in the specification.

@optionalHeader directive

I'd like to express in a schema that both CSV files with a header line and CSV files without one can be validated. In order to disambiguate, it could also be required, when this directive is used, that columns be defined only through identifiers and not offsets, and the column identifiers would be used to check whether a header is present.

Regex ignored?

TEST SCHEMA:

version 1.1
@quoted
@totalColumns 2
@permitEmpty
@ignoreColumnNameCase
diacritics-allowed: length(*, 500) and regex("\S+( \S+)*")
ascii-only:  length(*, 500) // and regex("(!|"|%|&|'|\(|\)|\*|\+|,|-|\.|/|0|1|2|3|4|5|6|7|8|9|:|;|<|=|>|\?|A|B|C|D|E|F|G|H|I|J|K|L| |M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+")

TEST CSV:

diacritics-allowed,ascii-only
èûîôà,èûîôà

RESULT:
PASS

NOTE:

This is based on regex that's currently in use in XML schemas, where it appears to work OK. The intention is to filter out strings with diacritics (transliterated versions only)
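
One thing that may be worth checking, offered as a guess rather than a diagnosis: in the test schema the regex for ascii-only follows //, so if that text is treated as a single-line comment, only the length rule is applied to that column. The Python sketch below uses a character class that is assumed to be equivalent to the long alternation, and shows that such a pattern does reject the sample value when it is actually evaluated against the whole cell.

```python
import re

# Assumed equivalent of the alternation in the schema, as one character class.
ASCII_ONLY = re.compile(r"""[!"%&'()*+,\-./0-9:;<=>?A-Z _a-z]+""")

value = "èûîôà"
length_ok = len(value) <= 500                      # the rule before the //
regex_ok = ASCII_ONLY.fullmatch(value) is not None  # the rule after the //
```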

How does integrityCheck know which folders to check?

I'm struggling a bit to understand how to implement integrityCheck, I was looking at the examples but the test cases were too well-formed to really explain it to me.

For instance, if you have a CSV like

filepath,foo
file:///C:/a/content/b.txt,bar
file:///C:/a/content/c.txt,baz
file:///C:/b/content/a.png,boo

Are both C:\a\content and C:\b\content checked? If "content" (or whichever subfolder was supplied) wasn't the last folder in the path, would that cause a schema validation error?

Also, if relative paths are used:

filepath,foo
b.txt,bar
c.txt,baz
a.png,boo

would filepath: integrityCheck("excludeFolder") (or includeFolder I guess?) check in %cd%\content? Is a prefix required?
