Sorry for the long message; I guess I've been thinking a lot about CSVs...
This issue is to suggest support for CSV files which contain non-CSV metadata or front matter at the top of the file, as well as to raise the issue of comments within CSV files.
Although CSV files that begin with non-CSV metadata fall outside what RFC 4180 describes, they are quite common. The non-CSV lines typically carry metadata about the data in the file, such as the equipment and parameters that went into an experiment.
I work with earth science data, where the idea of including multiple-line frontmatter in the file is quite common. I've attached a sample file from NASA as an example.
Supporting these kinds of files fully could entail a number of smaller changes, each of which might be considered independently. However, I've created one issue for the topic to try to unify discussion, at least at the initial stages.
Standards and Common Practices
There does not seem to be a widely accepted standard for such files. I've run across a few attempts at defining a standard, but they don't seem to have caught on widely:
https://csvy.org/ (looks more mature, though I don't think many libraries for CSV interaction support it)
https://github.com/csvspecs (looks to be work-in-progress)
As for common practices, I can speak to the spaces I'm familiar with, which are (mostly Python-based) tools for data processing used in the sciences and in data science.
The Pandas library supports specifying a comment character (e.g. '#') that marks either whole-line or end-of-line comments:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#comments-and-empty-lines
Pandas is widely used, so this gives me the idea that at least some people use these types of comments.
The NASA Space Physics Data Facility (https://cdaweb.gsfc.nasa.gov/) uses the '#' comment character and formatting of the file I attached. The website allows you to download any of the measurements in their database in this format. But it also has several other export options, including a "normal" CSV with the metadata in a separate JSON file, as well as the raw data (in netCDF, which isn't a type of CSV at all). So perhaps they expect that people who are going to do lots of analysis will use the "normal" CSV files. This is to say that, while I think CSV Schema should support CSV files with metadata, I imagine some people would argue that real-world data collection should not be done using them.
Support within CSV Schema
As for the schema:
Ignoring Comments / Metadata
@adamretter suggested adding directives to ignore leading lines when validating CSV files (text is modified from his):
- @IgnoreLeadingLines '#', which would simply ignore all lines from line 0 that start with a '#' character up until the first line that does not start with that character.
- @IgnoreCommentLines '#', which would just ignore any line which starts with a '#' character.
- other options, e.g. @IgnoreLeadingLinesMatching "regular expression"
I think it would be useful to be able to ignore the leading lines, and I like these directives. The difference between @IgnoreLeadingLines and @IgnoreCommentLines is helpful, since I could see situations that call for one but not the other.
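To make the difference between the proposed directives concrete, here is a rough Python sketch of how I read their semantics. The function names are my own invention and the sample lines are made up; none of this is part of CSV Schema or any existing validator.

```python
import re
from typing import Iterable, Iterator


def ignore_leading_lines(lines: Iterable[str], prefix: str = "#") -> Iterator[str]:
    """Hypothetical @IgnoreLeadingLines: drop only the initial run of prefixed lines."""
    in_header = True
    for line in lines:
        if in_header and line.startswith(prefix):
            continue
        in_header = False  # first non-prefixed line ends the header for good
        yield line


def ignore_comment_lines(lines: Iterable[str], prefix: str = "#") -> Iterator[str]:
    """Hypothetical @IgnoreCommentLines: drop every prefixed line, wherever it is."""
    return (line for line in lines if not line.startswith(prefix))


def ignore_leading_lines_matching(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Hypothetical @IgnoreLeadingLinesMatching: like the first, but with a regex."""
    regex = re.compile(pattern)
    in_header = True
    for line in lines:
        if in_header and regex.match(line):
            continue
        in_header = False
        yield line


# Made-up sample: two metadata lines, a header row, a comment in the data, one row.
sample = [
    "# made-up metadata line 1",
    "# made-up metadata line 2",
    "time,value",
    "# comment in the middle of the data",
    "0,1.5",
]

leading_only = list(ignore_leading_lines(sample))
all_comments = list(ignore_comment_lines(sample))
```

With the first function the comment in the middle of the data survives and would reach the validator as a (probably invalid) row; with the second it is dropped. That is the situation where I could see wanting one directive but not the other.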
Validating Comments / Metadata
I think there also should be a way to validate the contents of the non-CSV lines, as well as the CSV data itself. But I'm not sure if this is something the CSV Schema itself should support, or if this would be better handled by a more general system that supports files with multiple parts (and might make use of CSV schema to describe the CSV part). I'm not sure whether such a system exists.
On the other hand, there definitely are CSV files like this out there, so one argument is that the CSV Schema should be able to describe them.
If this is something the CSV Schema might support, it would be helpful to have multiple options:
- Directives like those above to ignore commented lines, for files that may contain comments whose content is unconstrained.
- A way to validate comments in some possibly non-CSV format, for files where the comments must meet certain requirements.
What seems ideal for the purpose of validating files with metadata is a way to say "this kind of header isn't CSV, but needs to be validated with X", where X is some external schema / tool. For instance, I might pass the metadata to a JSON validator or compare it with a YAML schema.
I think it would be ideal to be able to specify the type of non-CSV data in a flexible way that does not require the CSV Schema to maintain a list of supported metadata types. This would also be useful for people (such as myself) who have CSV files with metadata that is not in any standard format, but that they nonetheless may wish to use.
It would also be helpful to do what can be done to reduce the work for those implementing the language. Someone who is creating a CSV validator may have to explicitly include support for various metadata types, but hopefully this could be as simple as piping the data to existing JSON/YAML/whatever validators in their language, rather than expecting them to include their own support for each metadata type. I'm not versed enough in this area to give detailed recommendations, but it's a point to consider.
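As a rough illustration of the "pipe it to an existing validator" idea, here is a minimal Python sketch. It assumes the metadata lives in leading '#' lines and that the validator is any callable that raises on bad input; `split_front_matter`, `validate_file`, and the sample content are all hypothetical, and the CSV part is only parsed here, not checked against a schema.

```python
import csv
import io
import json
from typing import Callable, List, Tuple


def split_front_matter(text: str, prefix: str = "#") -> Tuple[str, str]:
    """Separate the leading prefixed lines (prefix stripped) from the CSV body."""
    lines = text.splitlines()
    n = 0
    while n < len(lines) and lines[n].startswith(prefix):
        n += 1
    metadata = "\n".join(line[len(prefix):].strip() for line in lines[:n])
    body = "\n".join(lines[n:])
    return metadata, body


def validate_file(text: str, metadata_validator: Callable[[str], object]) -> List[List[str]]:
    """Hand the metadata to an external validator, then parse the CSV part."""
    metadata, body = split_front_matter(text)
    metadata_validator(metadata)  # assumed to raise if the metadata is invalid
    return list(csv.reader(io.StringIO(body)))


# Made-up file whose metadata happens to be JSON spread over two comment lines.
sample = '# {"station": "example",\n#  "cadence_s": 60}\ntime,value\n0,1.5\n'

# json.loads stands in for "some existing validator in the implementer's language".
rows = validate_file(sample, json.loads)
```

The point of the callable is that the CSV tooling never needs to know what format the metadata is in; a YAML loader or a custom checker for a non-standard header could be dropped in the same way.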
Other thoughts
Another issue to consider is end-of-line comments that occur in the data. I'm not sure how many people have files like this, but as I mentioned above, Pandas includes support for these comments. There's also the possibility of inline comments (between data elements), but that seems really far-fetched (I don't know why someone would try to create a CSV file like that).
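For what it's worth, handling end-of-line comments is not just a matter of cutting at the first '#', since the character could legitimately appear inside a quoted field. Below is a small Python sketch of one way to handle that. It is my own illustration (the function name and defaults are assumptions) and is not meant to describe how Pandas implements it.

```python
def strip_eol_comment(line: str, comment: str = "#", quote: str = '"') -> str:
    """Remove an end-of-line comment, ignoring comment characters inside quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == quote:
            # A doubled quote ("") toggles twice, so the state stays correct.
            in_quotes = not in_quotes
        elif ch == comment and not in_quotes:
            return line[:i].rstrip()
    return line


cleaned = strip_eol_comment("1,2,3 # sensor glitch")
quoted = strip_eol_comment('1,"a#b",3')
```

A whole-line comment comes back as an empty string, so a validator would still need to decide whether empty lines are skipped or reported.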
Yet another issue is leading lines that are not marked with a comment character at all (the only way to tell is to look where the data starts). I happen to have some unfortunately-formatted files like this. Actually, if people were to adopt the CSVY standard (first link above), this would be a problem. The YAML header in CSVY could be any length, and it isn't marked by a comment character at the beginning of each line. (The end of the YAML block has the standard "---" that denotes the end of a document in YAML.)
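For delimiter-terminated headers like this, prefix-based directives would not help; the validator would have to find the marker line instead. Here is a minimal Python sketch, assuming the header is also opened by a "---" line (an assumption of this sketch, so check the CSVY spec before relying on it). `split_csvy` is a hypothetical name and the sample lines are made up.

```python
from typing import List, Tuple


def split_csvy(lines: List[str], marker: str = "---") -> Tuple[List[str], List[str]]:
    """Return (header_lines, csv_lines) for a header enclosed by two marker lines.

    If the first line is not the marker, the whole input is treated as CSV.
    """
    if not lines or lines[0].strip() != marker:
        return [], list(lines)
    for i in range(1, len(lines)):
        if lines[i].strip() == marker:
            return lines[1:i], lines[i + 1:]
    raise ValueError("header opened with marker but never closed")


# Made-up example: a short YAML-like header followed by two CSV lines.
sample = [
    "---",
    "name: example",
    "fields:",
    "  - name: time",
    "---",
    "time,value",
    "0,1.5",
]

header, data = split_csvy(sample)
```

The header lines could then go to a YAML validator and the remaining lines to the CSV Schema validator, which is the same split-and-delegate pattern as for '#'-prefixed metadata.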
Uploading OMNI_HRO_1MIN_27555.csv.txt…