d3 / d3-dsv
A parser and formatter for delimiter-separated values, such as CSV and TSV.
Home Page: https://d3js.org/d3-dsv
License: ISC License
This will require a major version bump, but I think it’d be good to make d3.autoType the default row accessor for dsv.parse and dsv.parseRows. If you want to disable automatic type inference, you can pass the identity function (d => d) as the row function.
Not sure how to use this from my project anymore.
Here's my tsconfig
{
  "compilerOptions": {
    "target": "es2015",
    "module": "commonjs",
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "allowSyntheticDefaultImports": true,
    "esModuleInterop": true,
    "resolveJsonModule": true,
    "forceConsistentCasingInFileNames": true
  }
}
package.json
{
  "name": "typescript-node",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "scripts": {
    "dev": "tsnd --respawn ./src/index.ts",
    "build": "tsc",
    "start": "node dist/index.js",
    "codegen": "graphql-codegen --config codegen.yaml"
  },
  "dependencies": {
    "@graphql-typed-document-node/core": "^3.1.0",
    "@types/d3-dsv": "^3.0.0",
    "@types/node-fetch": "^2.5.7",
    "change-case": "^4.1.2",
    "d3-dsv": "^3.0.1",
    "dotenv": "^8.2.0",
    "fp-ts": "^2.9.3",
    "graphql": "^15.4.0",
    "graphql-request": "^3.3.0",
    "immer": "^8.0.0",
    "node-fetch": "^2.6.1",
    "query-string": "^6.13.7",
    "serialize-error": "^7.0.1",
    "zod": "^1.11.11"
  },
  "devDependencies": {
    "@graphql-codegen/cli": "^1.17.10",
    "@graphql-codegen/typed-document-node": "^1.17.9",
    "@graphql-codegen/typescript": "^1.17.10",
    "@graphql-codegen/typescript-operations": "^1.17.8",
    "@types/node": "^14.14.6",
    "@types/react": "^16.9.56",
    "ts-node-dev": "^1.0.0-pre.44",
    "typescript": "^4.0.5"
  }
}
If I try to import d3 from "d3-dsv" or import * as d3 from "d3-dsv", I get:
[ERROR] 16:59:52 Error: Must use import to load ES Module: node_modules/d3-dsv/src/index.js
require() of ES modules is not supported.
require() of node_modules/d3-dsv/src/index.js from src/index.ts is an ES module file as it is a .js file whose nearest parent package.json contains "type": "module" which defines all .js files in that package scope as ES modules.
Instead rename index.js to end in .cjs, change the requiring code to use import(), or remove "type": "module" from node_modules/d3-dsv/package.json.
It seems to be asking me to make changes within the d3-dsv module? I've read d3/d3#3469 and can't make sense of what to do as a user of the module in a TypeScript/Node setting.
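One possible workaround (a sketch, not a verified fix for this exact setup) is to have TypeScript emit ES modules so Node can load the ESM-only d3-dsv natively: change the compiler options along these lines and add "type": "module" to your own package.json.

```json
{
  "compilerOptions": {
    "target": "es2020",
    "module": "es2020",
    "moduleResolution": "node"
  }
}
```

Alternatively, code that must remain CommonJS can load an ESM-only package with a dynamic import() expression instead of require().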
yarn test
# csvFormat(array) converts dates to ISO 8601
ok 131 should be equivalent
not ok 132 should be equivalent
---
operator: deepEqual
expected: |-
'date\n2018-01-01T08:00Z'
actual: |-
'date\n2017-12-31T23:00Z'
at: Test.<anonymous> (/Users/fil/Sites/d3/d3-dsv/test/csv-test.js:263:8)
stack: |-
Error: should be equivalent
...
not ok 150 should be equivalent
---
operator: deepEqual
expected: |-
'2018-01-01T08:00Z'
actual: |-
'2017-12-31T23:00Z'
at: Test.<anonymous> (/Users/fil/Sites/d3/d3-dsv/test/csv-test.js:325:8)
stack: |-
Error: should be equivalent
at Test.assert [as _assert] (/Users/fil/Sites/d3/d3-dsv/node_modules/tape/lib/test.js:224:54)
at Test.bound [as _assert] (/Users/fil/Sites/d3/d3-dsv/node_modules/tape/lib/test.js:76:32)
Now that d3.csv is defined in the d3-request module, it feels wrong to have d3.dsv at the same level. (Also, d3-time-format defines d3.timeFormat and d3.timeParse.) One possibility is d3.dsvParse(delimiter) and d3.dsvFormat(delimiter), which return functions; these functions would also have parse.rows and format.rows methods for the row-based equivalents.
In the edge case where the to-be-parsed string passed into the xxxParse(...) methods is empty, the columns property of the returned parsed array is undefined.
Although this is an edge case, it seems preferable to return an empty array of column names for consistency.
(Thanks @azoson for pointing this behaviour out in DefinitelyTyped/DefinitelyTyped#21092 and DefinitelyTyped/DefinitelyTyped#21162 . cc @gustavderdrache)
We should investigate whether we still need to use a dynamic function to create objects efficiently, or if there’s an alternative performant approach now.
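For context, the dynamic-function approach in question looks roughly like this (a simplified sketch of the technique, not d3-dsv's exact source):

```javascript
// Build a specialized row-object constructor with the Function constructor,
// so each parsed row avoids a generic property-assignment loop.
function objectConverter(columns) {
  return new Function("d", "return {" + columns.map(function(name, i) {
    return JSON.stringify(name) + ": d[" + i + "] || \"\"";
  }).join(",") + "}");
}

const convert = objectConverter(["a", "b"]);
convert(["1", "2"]); // → { a: "1", b: "2" }
```

Any alternative would need to match the per-row cost of this generated function, e.g. a plain loop over a captured columns array.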
I ran into an interesting edge case. I was working with a data set emanating from a Windows environment, hence some fields derived from long text areas had CRLF line breaks. For reasons having to do with the destination of this data, I was asked to make a copy of one such column and truncate it to 250 characters.
It turns out that since JS treats the CR and LF separately, in some records it truncated directly between the CR and LF. When I used d3.csvFormat and fs.writeFileSync to save the output to a CSV file, I found an issue: any field terminating in a CR (without an LF) was not surrounded in quotes. As a result, several other programs had difficulty opening this file. When I manually quoted the field, it opened fine in other programs.
So... would it be possible to get the CSV/DSV module to recognize a lone CR as grounds to surround the field in quotes? :)
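The requested rule could look something like this sketch (the regex and function name are illustrative, not d3-dsv's actual implementation):

```javascript
// Quote a CSV field if it contains the delimiter, a quote, or any line
// terminator -- including a lone \r, which some parsers treat as a break.
function formatValue(value, delimiter) {
  delimiter = delimiter || ",";
  return /["\r\n]/.test(value) || value.indexOf(delimiter) >= 0
    ? '"' + value.replace(/"/g, '""') + '"'
    : value;
}

formatValue("ends in CR\r"); // quoted: '"ends in CR\r"'
formatValue("plain");        // unchanged: 'plain'
```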
Sometimes it is useful to explicitly specify a column order for the generated CSV file; other times we may want to exclude one or more columns. Moreover, providing a column list as input avoids the initial scan of the row list required to extract the column names.
I would mainly implement it for the first reason, since a CSV file is often opened with a spreadsheet app like Excel, where the order of the columns should not be arbitrary.
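A hand-rolled illustration of the idea (this is not the d3-dsv API; formatWithColumns is a hypothetical name):

```javascript
// Format rows using an explicit column order, skipping any scan of the
// rows to discover column names, and silently dropping unlisted columns.
function formatWithColumns(rows, columns) {
  const header = columns.join(",");
  const body = rows.map(row =>
    columns.map(c => row[c] == null ? "" : row[c]).join(","));
  return [header].concat(body).join("\n");
}

formatWithColumns([{a: 1, b: 2, c: 3}], ["c", "a"]); // → "c,a\n3,1"
```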
If you are interested, I can work on a PR, since I need the feature anyway.
Thanks for your time and for the lib!
Right now we have
If a row conversion function is not specified, field values are strings. For safety, there is no automatic conversion to numbers, dates, or other types. In some cases, JavaScript may coerce strings to numbers for you automatically (for example, using the + operator), but better is to specify a row conversion function.
I posit we should both link to autoType here, and also add an autoType example at the top. I'll file a PR tomorrow.
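A stripped-down sketch of what such a row conversion does (d3.autoType itself also handles booleans, dates, NaN, and more; this is illustration only):

```javascript
// Coerce string cells to numbers where possible; empty cells become null.
// Simplified stand-in for d3.autoType.
function autoTypeLite(row) {
  for (const key in row) {
    const value = row[key].trim();
    if (!value) row[key] = null;
    else if (!isNaN(+value)) row[key] = +value;
  }
  return row;
}

autoTypeLite({a: "1", b: "x", c: ""}); // → { a: 1, b: "x", c: null }
```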
We should use date.toJSON rather than date.toString when coercing date objects to strings when formatting DSV. ISO 8601 is a much more suitable representation than date.toString, and if we had automatic type inference #6, we can reliably roundtrip dates as well.
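The difference in a nutshell (date.toString output varies with locale and time zone, so only toJSON round-trips reliably):

```javascript
// toJSON yields ISO 8601 in UTC, which round-trips exactly;
// toString yields a locale- and zone-dependent string.
const d = new Date(Date.UTC(2018, 0, 1));
d.toJSON(); // → "2018-01-01T00:00:00.000Z"
new Date(d.toJSON()).getTime() === d.getTime(); // → true
```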
What if a column is named __proto__?
Have you considered complementing parse and parseRows with iterator-based versions? This might be useful e.g. if one only needs to summarise the data.
Hello!
I've found myself more and more using d3-dsv as a sort of Swiss Army knife of CSV manipulation — if I want to ensure that an Array of data gets properly delimited, I'll reach for formatRows and trust it'll do the right thing. But when I'm working with Node.js streams and looping through data line by line (meaning I don't have access to the entire "body" of data at any given time), it'd be neat if I could tap directly into formatRow.
It's not the end of the world to have to use formatRows instead, but it requires me to create a dummy Array to wrap the individual row of data on every pass, which feels a little dirty.
const createStreamWriter = (outputPath, columns) => {
  const writer = fs.createWriteStream(outputPath)
  // with dsv.csvFormatRow, I could just pass in my single Array of values
  writer.write(`${dsv.csvFormatRows([columns])}\n`)
  return data => {
    writer.write(`${dsv.csvFormatBody([data], columns)}\n`)
  }
}
In the case of formatBody, it kind of defeats the point to "singularize" that, but if formatValue was made available I could reproduce the effect by creating my own single-row function built on it for zipping purposes.
Thank you! Looking forward to your thoughts.
In D3 3.x, the request-and-parse methods were renamed:
This was needed in part because this repo, d3-dsv, defines two objects:
These objects then expose methods:
The downside of this renaming is that the commonly used methods were renamed (and longer), while the less commonly used methods stayed the same.
However, another option would be to retain the short names for requests:
And rather than exposing d3.csv and d3.tsv objects, expose the methods directly:
There is a small ambiguity in the way that tsvParse and csvParse address parsing files with columns that have non-unique names. For instance, if you have a TSV like
Example A Example B Example A
1 5 0
2 5 0
3 5 0
4 5 0
And you run that through tsvParse then you get
[
{ 'Example A': '0', 'Example B': '5' },
{ 'Example A': '0', 'Example B': '5' },
{ 'Example A': '0', 'Example B': '5' },
{ 'Example A': '0', 'Example B': '5' },
columns: [ 'Example A', 'Example B', 'Example A' ]
]
The problem, of course, is that the data from the first Example A column is blown away during the parse. I'm not sure what the right solution might be: maybe including some messaging in the docs that column names need to be unique? Or maybe appending an incrementing index to the duplicated columns ('Example A-1' or something). Having recently been bitten by this, it is a real hair-pulling issue to find and resolve, so any help that might be offered to other people in a similar situation would no doubt be welcomed.
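The second suggestion could be sketched like this (dedupeColumns is a hypothetical helper, not part of d3-dsv):

```javascript
// Make duplicate header names unique by appending an incrementing index.
function dedupeColumns(columns) {
  const seen = new Map();
  return columns.map(name => {
    const n = seen.get(name) || 0;
    seen.set(name, n + 1);
    return n === 0 ? name : name + "-" + n;
  });
}

dedupeColumns(["Example A", "Example B", "Example A"]);
// → ["Example A", "Example B", "Example A-1"]
```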
Let's say: put whatever matches /^[+-]?\d*(?:\d*\.\d*)?(?:e-?\d+)?$/i (not tested :) through parseFloat().
Arguably, figuring out what the data is is not the parser's job, but realistically almost every use of this deals with numbers.
Some CSV files begin with comments and, while these can be stripped in 90% of the cases, sometimes the comments can actually be useful, e.g. when holding some metadata.
Do you have any plans to extend the API to handle comments? Currently the parser simply treats them as headers:
> d3.dsvFormat(",").parse("#foo\n#bar\n1,2")
[
  {"#foo": "#bar"},
  {"#foo": "1"}
]
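Until the API grows comment support, a pre-processing step is one workaround (a sketch; stripComments is a hypothetical helper):

```javascript
// Split leading comment lines from the body so the comments can be kept
// as metadata and the remainder handed to the parser.
function stripComments(text, prefix) {
  prefix = prefix || "#";
  const lines = text.split("\n");
  let i = 0;
  while (i < lines.length && lines[i].startsWith(prefix)) i++;
  return { comments: lines.slice(0, i), body: lines.slice(i).join("\n") };
}

stripComments("#foo\n#bar\n1,2");
// → { comments: ["#foo", "#bar"], body: "1,2" }
```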
I read the API documentation for d3-dsv, but I don't know how to use the command line. Do I need another package?
This is an issue to discuss about a feature request.
I have a row like
mc0002,McDonald's Acireale Drive,Via Cristoforo Colombo e Via Cefalu' ,,,37.621493,15.141527,19,087004,mc01
To be able to load it into a database, since some fields contain a single quote, it would be really convenient if d3.csvFormatRows would enclose those fields in double quotes, like this
mc0002,"McDonald's Acireale Drive","Via Cristoforo Colombo e Via Cefalu' ",,,37.621493,15.141527,19,087004,mc01
Greetings!
TLDR: Sections that begin with a quoted item but include other text afterwards, such as:
"Hello" world
Will parse as
"Hello"<tab> world
which is two entries rather than one.
My data includes plain text and is exported to the tsv file correctly (verified with visual white space viewed in Word). However, when it is imported via d3.tsv, it splits an entry such as the one above into two, shoving over all my other data.
I do not have time to make an isolated test right now, but here are some screenshots of the incorrect parse (and one photo including an adjacent correct parse).
Per https://javascript.info/array, adding a non-numeric property to an array is one way of misusing a JS array, which can lead the engine to treat the array as a regular object and turn off its internal optimizations on array operations.
If that's true, adding the non-numeric property columns to the array returned from dsv.parse breaks the rule.
The ways to misuse an array from https://javascript.info/array are pasted as below:
Add a non-numeric property like arr.test = 5.
Make holes, like: add arr[0] and then arr[1000] (and nothing between them).
Fill the array in the reverse order, like arr[1000], arr[999] and so on.
I couldn't find array-misuse descriptions in other JS books or on MDN, only at https://javascript.info, so I'm not quite sure whether it still applies in the latest engines.
The documentation seems to be out of date. It still mentions the d3-request library, which was deprecated in January. It seems that a link to d3-fetch should be provided instead.
Unfortunately browserify isn't clever enough to transform the trick involved with reading dsv.js from index.js, and thus it isn't currently possible to use dsv as a module from there.
In languages where the decimal mark is ",", some spreadsheets expect CSVs to use ";" as the delimiter and "," as the decimal mark.
dsv2* has an --input-delimiter and *2dsv has --output-delimiter. Allow the user to specify a decimal mark different than the default "." (--input-decimal / --output-decimal).
Like the new shapefile.
https://github.com/d3/d3-dsv/blob/master/src/dsv.js#L8
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function
The Function constructor is not allowed in a browser context when a safe CSP is used (without unsafe-eval). For example, it prevents the usage of Plotly with a safe CSP because it uses this package: plotly/plotly.js#897
https://github.com/d3/d3-dsv#content-security-policy
If a content security policy is in place, note that dsv.parse requires unsafe-eval in the script-src directive, due to the (safe) use of dynamic code generation for fast parsing. (See source.) Alternatively, use dsv.parseRows.
Maybe a replacement for dsv.parse (e.g. dsv.parseSafe) should be provided?
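As a workaround today, parseRows plus an ordinary (non-generated) mapping stays CSP-safe; a sketch:

```javascript
// Convert header + row arrays (e.g. the output of dsv.parseRows) into
// objects without the Function constructor, so no unsafe-eval is needed.
function rowsToObjects(rows) {
  const columns = rows[0];
  return rows.slice(1).map(values => {
    const obj = {};
    columns.forEach((c, i) => { obj[c] = values[i] == null ? "" : values[i]; });
    return obj;
  });
}

rowsToObjects([["a", "b"], ["1", "2"]]); // → [{ a: "1", b: "2" }]
```

This trades the speed of the generated converter for CSP compatibility.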
ERROR in node_modules/@types/d3-fetch/index.d.ts(9,10): error TS2305: Module '"F:/AprotechSolutions/ems-beam-webapp/node_modules/@types/d3-dsv/index"' has no exported member
'DSVParsedArray'.
node_modules/@types/d3-fetch/index.d.ts(9,26): error TS2305: Module '"node_modules/@types/d3-dsv/index"' has no exported member 'DSVRowString'.
node_modules/@types/d3-fetch/index.d.ts(9,40): error TS2305: Module '"/node_modules/@types/d3-dsv/index"' has no exported member 'DSVRowAny'.
node_modules/@types/d3-fetch/index.d.ts(9,57): error TS2497: Module '"/node_modules/@types/d3-dsv/index"' resolves to a non-module entity
and cannot be imported using this construct.
node_modules/@types/d3/index.d.ts(24,15): error TS2498: Module '"/node_modules/@types/d3-dsv/index"' uses 'export =' and cannot be used with 'export *'.
It might happen that CSV files will omit redundant commas, in which case d3.autoType breaks with
TypeError: Cannot read property 'trim' of undefined
This might be a desirable feature, in which case it would be good to trap the error and make it explicit.
However, my personal preference would be to have it a bit more fault tolerant and avoid crashing on d3.autoType({ foo: undefined }).
Especially if d3.autoType is made the default for d3.csv (#43).
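A defensive pre-pass along these lines would avoid the crash (a sketch, not a proposal for d3's actual fix):

```javascript
// Replace undefined cells (e.g. from omitted trailing commas) with empty
// strings before type inference, so .trim() is never called on undefined.
function fillMissing(row) {
  for (const key in row) if (row[key] === undefined) row[key] = "";
  return row;
}

fillMissing({foo: undefined, bar: "1"}); // → { foo: "", bar: "1" }
```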
For example:
var data = csv.parse("a,b,c\n1,2,3\n");
data[0]; // {"a": "1", "b": "2", "c": "3"}
data.columns; // ["a", "b", "c"]
Unlike Object.keys(data[0]), the columns field will be guaranteed to be in the same order as the source file. Related: d3/d3#2653, d3/d3#858.
I have an issue with headers that contain only numbers. Such a header is not parsed correctly by d3.tsvParse: the column with the numeric header ends up with the values of the first column. The problem is with the last column, 1234.
E.g.
data string:
Distance_from_anode[nm] TAPC mCBP 4CzIPN-Me T2T 1234
70 0 0 0 0 0.000304539
71 0 0 0 0 0.000234767
72 0 0 0 0 0.00016237
it returns the following incorrect result
index 70
4CzIPN-Me: 1e-99
1234: 70
Distance_from_anode[nm]: 1e-99
T2T: 0.000304539
TAPC: 1e-99
mCBP: 1e-99
index 71
4CzIPN-Me: 1e-99
1234: 71
Distance_from_anode[nm]: 1e-99
T2T: 0.000234767
TAPC: 1e-99
mCBP: 1e-99
index 72
4CzIPN-Me: 1e-99
1234: 72
Distance_from_anode[nm]: 1e-99
T2T: 0.00016237
TAPC: 1e-99
mCBP: 1e-99
Expected
index 70
4CzIPN-Me: 1e-99
1234: 0.000304539
Distance_from_anode[nm]: 70
T2T: 1e-99
TAPC: 1e-99
mCBP: 1e-99
index 71
4CzIPN-Me: 1e-99
1234: 0.000234767
Distance_from_anode[nm]: 71
T2T: 1e-99
TAPC: 1e-99
mCBP: 1e-99
index 72
4CzIPN-Me: 1e-99
1234: 0.00016237
Distance_from_anode[nm]: 72
T2T: 1e-99
TAPC: 1e-99
mCBP: 1e-99
I'm fixing it for now by checking whether the header contains only numbers and adding an underscore if that's true.
I'm trying to use the module from node v9.3.0 as shown here, but I get
The requested module does not provide an export named 'csvFormat'
$ cat index.mjs
import {csvFormat} from "d3-dsv";
$ node --experimental-modules index.mjs
(node:9652) ExperimentalWarning: The ESM module loader is experimental.
SyntaxError: The requested module does not provide an export named 'csvFormat'
at ModuleJob._instantiate (internal/loader/ModuleJob.js:84:17)
at <anonymous>
d3.csv.parse('a,b,c\n1,2,3', function (row) { return {x: row.a, y: +row.b * +row.c} }) returns [ { x: '1', y: 6 }, columns: [ 'a', 'b', 'c' ] ].
Should it return [ { x: '1', y: 6 }, columns: [ 'x', 'y' ] ]?
I guess the question is: is the columns property meant to reflect the input or the output?
If the latter, and given there is no guarantee that the row conversion function produces objects with identical properties for every row, perhaps columns could reflect those of the first row?
As of release 1.0.6, d3-dsv has this at the top:
(function (global, factory) {
  typeof exports === 'object' && typeof module !== 'undefined' ? factory(exports) :
  typeof define === 'function' && define.amd ? define(['exports'], factory) :
  (factory((global.d3 = {})));
}(this, (function (exports) { 'use strict';
The effect is that global.d3 is reset, removing anything that previously loaded modules put there. All the other modules have global.d3 = global.d3 || {}
The problem is present both in d3-dsv.zip for version 1.0.6 on github, and in https://d3js.org/d3-dsv.v1.js
For brevity and consistency with how we parse dates in d3.autoType (and the ECMAScript specification), it would be nice to use the [±YY]YYYY-MM-DD format for dates on UTC midnight.
I’ve noticed that Excel now saves UTF-8 CSV files with a BOM. (I’m using Microsoft Excel for Mac version 15.33, saving in “CSV UTF-8” format.)
When such files are parsed with csvParse, the key corresponding to the first column has a zero-width non-breaking space as its first character, which leads to a situation where d["keyName"] is undefined even though keyName appears when you print out d!
I’m not sure whether you think this should be addressed in the parser – if not it should at least be documented I think.
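If it ends up documented rather than fixed in the parser, the workaround is a one-line strip (sketch):

```javascript
// Remove a leading BOM (U+FEFF) that Excel writes to UTF-8 CSV files.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

stripBom("\uFEFFname,value"); // → "name,value"
```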
Besides fetching data by tsv() and csv(), could we support ssv() or wssv() for whitespace-separated values?
For example, if the data is
x y
1 3
2.2 3.4
-1 -2
the data is quite well formatted and human-readable, and it can also be considered proper data to be processed by a program. For files such as
x y
1 3
2.2 -1.8
3 4
the data is not nicely formatted, but if a human can understand it, it could arguably be processed by a program as well.
The following can preprocess the second form:
const data = `x y
1 3
2.2 -1.8
3 4`;
console.log(data.split("\n").map(line => line.trim().split(/\s+/).join("\t")).join("\n"));
Is there a reason, besides performance, for d3.autoType dropping the leading 0?
d3.autoType({id: "06075"})
=> 6075
d3.autoType is quite convenient.
Could a test for a leading 0 be added to avoid the string-to-number conversion here?
This is clearly a browser bug (and we don’t generally workaround browser bugs), but it’d be pretty easy to patch this I suspect.
I want to read a CSV in Node, not in the browser. How would I do that?
Related d3/d3#2429.
Sometimes it's useful to format without a header, for example when appending to an existing CSV file.
There is an issue when parsing large files. I tested with a 1.4G JSON file and it throws:
buffer.js:490
throw new Error('toString failed');
^
Error: toString failed
at Buffer.toString (buffer.js:490:11)
at StringDecoder.write (string_decoder.js:130:21)
at StripBOMWrapper.write (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/bom-handling.js:35:28)
at Object.decode (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/index.js:38:23)
at /home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/bin/dsv2json:27:35
at ReadStream.<anonymous> (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/rw/lib/rw/read-file.js:22:33)
at emitNone (events.js:85:20)
at ReadStream.emit (events.js:179:7)
at endReadableNT (_stream_readable.js:913:12)
at _combinedTickCallback (internal/process/next_tick.js:74:11)
at process._tickCallback (internal/process/next_tick.js:98:9)
I've found this link which illustrates the same issue with big files
You can test it with
wget http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip
sed -i '1s/^/geonameid\tname\tasciiname\talternatenames\tlatitude\tlongitude\tfeature_class\tfeature_code\tcountry_code\tcc2\tadmin1_code\tadmin2_code\tadmin3_code\tadmin4_code\tpopulation\televation\tdem\ttimezone\tmodification_date\n/' allCountries.txt
time tsv2json < allCountries.txt > allCountries-pre.json
Do you have a recommended way to parse big files, using either the command line or the API?
Note that it's working well with csv-parser :
cat allCountries.txt | csv-parser -s $'\t' > allCountries-pre.json
Currently, this d3-dsv package depends on commander, rw and iconv-lite. But these dependencies are actually only used by the CLI tools being exposed, not when using the library directly (as done in several chart libraries).
I think it might make sense to split the CLI tools into a separate package (depending on d3-dsv), so that projects only needing the library don't need these additional dependencies.
Right now there's no way to distinguish between empty strings and missing data in formatted output.
csvFormatRows([['value', 'null', 'undefined', 'string'], [0,null,undefined,'']]);
I'd like this to return:
value,null,undefined,string
0,,,""
But instead I get:
value,null,undefined,string
0,,,
It's important to highlight methods not compatible with a strong Content-Security-Policy (i.e. without unsafe-eval).
In the same way it's documented in https://github.com/d3/d3-dsv#content-security-policy
It should be documented in https://github.com/d3/d3-dsv/blob/master/README.md#csvParse : #65