d3 / d3-dsv
A parser and formatter for delimiter-separated values, such as CSV and TSV.
Home Page: https://d3js.org/d3-dsv
License: ISC License
This will require a major version bump, but I think it’d be good to make d3.autoType the default row accessor for dsv.parse and dsv.parseRows. If you want to disable automatic type inference, you can pass the identity function (d => d) as the row function.
Not sure how to use this from my project anymore.
Here's my tsconfig
{
  "compilerOptions": {
    "target": "es2015",
    "module": "commonjs",
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "allowSyntheticDefaultImports": true,
    "esModuleInterop": true,
    "resolveJsonModule": true,
    "forceConsistentCasingInFileNames": true
  }
}
package.json
{
  "name": "typescript-node",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "scripts": {
    "dev": "tsnd --respawn ./src/index.ts",
    "build": "tsc",
    "start": "node dist/index.js",
    "codegen": "graphql-codegen --config codegen.yaml"
  },
  "dependencies": {
    "@graphql-typed-document-node/core": "^3.1.0",
    "@types/d3-dsv": "^3.0.0",
    "@types/node-fetch": "^2.5.7",
    "change-case": "^4.1.2",
    "d3-dsv": "^3.0.1",
    "dotenv": "^8.2.0",
    "fp-ts": "^2.9.3",
    "graphql": "^15.4.0",
    "graphql-request": "^3.3.0",
    "immer": "^8.0.0",
    "node-fetch": "^2.6.1",
    "query-string": "^6.13.7",
    "serialize-error": "^7.0.1",
    "zod": "^1.11.11"
  },
  "devDependencies": {
    "@graphql-codegen/cli": "^1.17.10",
    "@graphql-codegen/typed-document-node": "^1.17.9",
    "@graphql-codegen/typescript": "^1.17.10",
    "@graphql-codegen/typescript-operations": "^1.17.8",
    "@types/node": "^14.14.6",
    "@types/react": "^16.9.56",
    "ts-node-dev": "^1.0.0-pre.44",
    "typescript": "^4.0.5"
  }
}
If I try to import d3 from "d3-dsv" or import * as d3 from "d3-dsv", I get:
[ERROR] 16:59:52 Error: Must use import to load ES Module: node_modules/d3-dsv/src/index.js
require() of ES modules is not supported.
require() of node_modules/d3-dsv/src/index.js from src/index.ts is an ES module file as it is a .js file whose nearest parent package.json contains "type": "module" which defines all .js files in that package scope as ES modules.
Instead rename index.js to end in .cjs, change the requiring code to use import(), or remove "type": "module" from node_modules/d3-dsv/package.json.
It seems to be asking me to make changes within the d3-dsv module? I've read d3/d3#3469 and can't make sense of what to do as a user of the module in a TypeScript/Node setting.
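One possible workaround (a sketch, not a verified fix for this exact setup) is to have TypeScript emit ES modules so Node can load the ESM-only d3-dsv natively: change the compiler options along these lines and add "type": "module" to your own package.json.

```json
{
  "compilerOptions": {
    "target": "es2020",
    "module": "es2020",
    "moduleResolution": "node"
  }
}
```

Alternatively, code that must remain CommonJS can load an ESM-only package with a dynamic import() expression instead of require().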
yarn test
# csvFormat(array) converts dates to ISO 8601
ok 131 should be equivalent
not ok 132 should be equivalent
---
operator: deepEqual
expected: |-
'date\n2018-01-01T08:00Z'
actual: |-
'date\n2017-12-31T23:00Z'
at: Test.<anonymous> (/Users/fil/Sites/d3/d3-dsv/test/csv-test.js:263:8)
stack: |-
Error: should be equivalent
...
not ok 150 should be equivalent
---
operator: deepEqual
expected: |-
'2018-01-01T08:00Z'
actual: |-
'2017-12-31T23:00Z'
at: Test.<anonymous> (/Users/fil/Sites/d3/d3-dsv/test/csv-test.js:325:8)
stack: |-
Error: should be equivalent
at Test.assert [as _assert] (/Users/fil/Sites/d3/d3-dsv/node_modules/tape/lib/test.js:224:54)
at Test.bound [as _assert] (/Users/fil/Sites/d3/d3-dsv/node_modules/tape/lib/test.js:76:32)
Now that d3.csv is defined in the d3-request module, it feels wrong to have d3.dsv at the same level. (Also, d3-time-format defines d3.timeFormat and d3.timeParse.) One possibility is d3.dsvParse(delimiter) and d3.dsvFormat(delimiter), which return functions; these functions would also have parse.rows and format.rows methods for the row-based equivalents.
In the edge case where the to-be-parsed string passed into the xxxParse(...) methods is empty, the columns property of the returned parsed array is undefined.
Although this is an edge case, it seems preferable to return an empty array of column names for consistency.
(Thanks @azoson for pointing this behaviour out in DefinitelyTyped/DefinitelyTyped#21092 and DefinitelyTyped/DefinitelyTyped#21162 . cc @gustavderdrache)
We should investigate whether we still need to use a dynamic function to create objects efficiently, or if there’s an alternative performant approach now.
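For context, the dynamic-function approach in question looks roughly like this (a simplified sketch of the technique, not d3-dsv's exact source):

```javascript
// Build a specialized row-object constructor with the Function constructor,
// so each parsed row avoids a generic property-assignment loop.
function objectConverter(columns) {
  return new Function("d", "return {" + columns.map(function(name, i) {
    return JSON.stringify(name) + ": d[" + i + "] || \"\"";
  }).join(",") + "}");
}

const convert = objectConverter(["a", "b"]);
convert(["1", "2"]); // → { a: "1", b: "2" }
```

Any alternative would need to match the per-row cost of this generated function, e.g. a plain loop over a captured columns array.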
I ran into an interesting edge case. I was working with a data set emanating from a Windows environment, hence some fields derived from long text areas had CRLF line breaks. For reasons having to do with the destination of this data, I was asked to make a copy of one such column and truncate it to 250 characters.
It turns out that since JS treats the CR and LF separately, in some records it truncated directly between the CR and LF. When I used d3.csvFormat and fs.writeFileSync to save the output to a CSV file, I found an issue: any field terminating in a CR (without an LF) was not surrounded in quotes. As a result, several other programs had difficulty opening this file. When I manually quoted the field, it opened fine in other programs.
So... would it be possible to get the CSV/DSV module to recognize a lone CR as grounds to surround the field in quotes? :)
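The requested rule could look something like this sketch (the regex and function name are illustrative, not d3-dsv's actual implementation):

```javascript
// Quote a CSV field if it contains the delimiter, a quote, or any line
// terminator -- including a lone \r, which some parsers treat as a break.
function formatValue(value, delimiter) {
  delimiter = delimiter || ",";
  return /["\r\n]/.test(value) || value.indexOf(delimiter) >= 0
    ? '"' + value.replace(/"/g, '""') + '"'
    : value;
}

formatValue("ends in CR\r"); // quoted: '"ends in CR\r"'
formatValue("plain");        // unchanged: 'plain'
```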
Sometimes it is useful to explicitly specify a column order for the generated CSV file; other times we may want to exclude one or more columns. Moreover, providing a column list as input avoids the initial scan of the row list required to extract the column names.
I would mainly implement it for the first reason, since a CSV file is often opened with a spreadsheet app like Excel, where the order of the columns should not be arbitrary.
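A hand-rolled illustration of the idea (this is not the d3-dsv API; formatWithColumns is a hypothetical name):

```javascript
// Format rows using an explicit column order, skipping any scan of the
// rows to discover column names, and silently dropping unlisted columns.
function formatWithColumns(rows, columns) {
  const header = columns.join(",");
  const body = rows.map(row =>
    columns.map(c => row[c] == null ? "" : row[c]).join(","));
  return [header].concat(body).join("\n");
}

formatWithColumns([{a: 1, b: 2, c: 3}], ["c", "a"]); // → "c,a\n3,1"
```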
If you are interested, I can work on a PR, since I need the feature anyway.
Thanks for your time and for the lib!
Right now we have
If a row conversion function is not specified, field values are strings. For safety, there is no automatic conversion to numbers, dates, or other types. In some cases, JavaScript may coerce strings to numbers for you automatically (for example, using the + operator), but better is to specify a row conversion function.
I posit we should both link to autoType here, and also add an autoType example at the top. I'll file a PR tomorrow.
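A stripped-down sketch of what such a row conversion does (d3.autoType itself also handles booleans, dates, NaN, and more; this is illustration only):

```javascript
// Coerce string cells to numbers where possible; empty cells become null.
// Simplified stand-in for d3.autoType.
function autoTypeLite(row) {
  for (const key in row) {
    const value = row[key].trim();
    if (!value) row[key] = null;
    else if (!isNaN(+value)) row[key] = +value;
  }
  return row;
}

autoTypeLite({a: "1", b: "x", c: ""}); // → { a: 1, b: "x", c: null }
```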
We should use date.toJSON rather than date.toString when coercing date objects to strings when formatting DSV. ISO 8601 is a much more suitable representation than date.toString, and if we had automatic type inference #6, we can reliably roundtrip dates as well.
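The difference in a nutshell (date.toString output varies with locale and time zone, so only toJSON round-trips reliably):

```javascript
// toJSON yields ISO 8601 in UTC, which round-trips exactly;
// toString yields a locale- and zone-dependent string.
const d = new Date(Date.UTC(2018, 0, 1));
d.toJSON(); // → "2018-01-01T00:00:00.000Z"
new Date(d.toJSON()).getTime() === d.getTime(); // → true
```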
What if a column is named __proto__?
Have you considered complementing parse and parseRows with iterator-based versions? This might be useful e.g. if one only needs to summarise the data.
Hello!
I've found myself more and more using d3-dsv as a sort of Swiss Army knife of CSV manipulation — if I want to ensure that an Array of data gets properly delimited, I'll reach for formatRows and trust it'll do the right thing. But when I'm working with Node.js streams and looping through data line by line (meaning I don't have access to the entire "body" of data at any given time), it'd be neat if I could tap directly into formatRow.
It's not the end of the world to have to use formatRows instead, but it requires me to create a dummy Array to wrap the individual row of data on every pass, which feels a little dirty.
const createStreamWriter = (outputPath, columns) => {
  const writer = fs.createWriteStream(outputPath)
  // with dsv.csvFormatRow, I could just pass in my single Array of values
  writer.write(`${dsv.csvFormatRows([columns])}\n`)
  return data => {
    writer.write(`${dsv.csvFormatBody([data], columns)}\n`)
  }
}
In the case of formatBody, it kind of defeats the point to "singularize" that, but if formatValue was made available I could reproduce the effect by creating my own single-row function built on it for zipping purposes.
Thank you! Looking forward to your thoughts.
In D3 3.x, the request-and-parse methods were renamed:
This was needed in part because this repo, d3-dsv, defines two objects:
These objects then expose methods:
The downside of this renaming is that the commonly used methods were renamed (and longer), while the less commonly used methods stayed the same.
However, another option would be to retain the short names for requests:
And rather than exposing d3.csv and d3.tsv objects, expose the methods directly:
There is a small ambiguity in the way that tsvParse and csvParse address parsing files with columns that have non-unique names. For instance, if you have a TSV like
Example A Example B Example A
1 5 0
2 5 0
3 5 0
4 5 0
And you run that through tsvParse then you get
[
{ 'Example A': '0', 'Example B': '5' },
{ 'Example A': '0', 'Example B': '5' },
{ 'Example A': '0', 'Example B': '5' },
{ 'Example A': '0', 'Example B': '5' },
columns: [ 'Example A', 'Example B', 'Example A' ]
]
The problem, of course, is that the data from the first Example A column is blown away during the parse. I'm not sure what the right solution might be: maybe including some messaging in the docs that column names need to be unique? Or maybe appending an incrementing index to the duplicated columns ('Example A-1' or something). Having recently been bitten by this, it is a real hair-pulling issue to find and resolve, so any help that might be offered to other people in a similar situation would no doubt be welcomed.
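The second suggestion could be sketched like this (dedupeColumns is a hypothetical helper, not part of d3-dsv):

```javascript
// Make duplicate header names unique by appending an incrementing index.
function dedupeColumns(columns) {
  const seen = new Map();
  return columns.map(name => {
    const n = seen.get(name) || 0;
    seen.set(name, n + 1);
    return n === 0 ? name : name + "-" + n;
  });
}

dedupeColumns(["Example A", "Example B", "Example A"]);
// → ["Example A", "Example B", "Example A-1"]
```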
Let's say: put whatever matches /^[+-]?\d*(?:\d*\.\d*)?(?:e-?\d+)?$/i (not tested :) through parseFloat().
Arguably, figuring out what the data is is not the parser's job, but realistically almost every use of this deals with numbers.
Some CSV files begin with comments and, while these can be stripped in 90% of the cases, sometimes the comments can actually be useful, e.g. when holding some metadata.
Do you have any plans to extend the API to handle comments? Currently the parser simply treats them as headers:
> d3.dsvFormat(",").parse("#foo\n#bar\n1,2")
[
  {"#foo": "#bar"},
  {"#foo": "1"}
]
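Until the API grows comment support, a pre-processing step is one workaround (a sketch; stripComments is a hypothetical helper):

```javascript
// Split leading comment lines from the body so the comments can be kept
// as metadata and the remainder handed to the parser.
function stripComments(text, prefix) {
  prefix = prefix || "#";
  const lines = text.split("\n");
  let i = 0;
  while (i < lines.length && lines[i].startsWith(prefix)) i++;
  return { comments: lines.slice(0, i), body: lines.slice(i).join("\n") };
}

stripComments("#foo\n#bar\n1,2");
// → { comments: ["#foo", "#bar"], body: "1,2" }
```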
I read the API documentation for d3-dsv, but I don't know how to use the command line. Do I need another package?
This is an issue to discuss about a feature request.
I have a row like
mc0002,McDonald's Acireale Drive,Via Cristoforo Colombo e Via Cefalu' ,,,37.621493,15.141527,19,087004,mc01
To be able to load it into a database, since some fields contain a single quote, it would be really convenient if d3.csvFormatRows would enclose those fields in double quotes, like this
mc0002,"McDonald's Acireale Drive","Via Cristoforo Colombo e Via Cefalu' ",,,37.621493,15.141527,19,087004,mc01
Greetings!
TLDR: Sections that begin with a quoted item but include other text afterwards, such as:
"Hello" world
Will parse as
"Hello"<tab> world
which is two entries rather than one.
My data includes plain text and is exported to the tsv file correctly (verified with visual white space viewed in Word). However, when it is imported via d3.tsv, it splits an entry such as the one above into two, shoving over all my other data.
I do not have time to make an isolated test right now, but here are some screenshots of the incorrect parse (and one photo including an adjacent correct parse).
Per https://javascript.info/array, adding a non-numeric property to an array is one way of misusing a JS array, which can lead the engine to treat the array as a regular object and turn off its internal optimizations on array operations.
If that's true, adding the non-numeric property columns to the array returned from dsv.parse breaks the rule.
The ways to misuse an array from https://javascript.info/array are pasted as below:
Add a non-numeric property like arr.test = 5.
Make holes, like: add arr[0] and then arr[1000] (and nothing between them).
Fill the array in the reverse order, like arr[1000], arr[999] and so on.
I couldn't find array-misuse descriptions in other JS books or on MDN, only at https://javascript.info, so I'm not quite sure whether it still applies in the latest engines.
The documentation seems to be out of date. It still mentions the d3-request library, which was deprecated in January. It seems that a link to d3-fetch should be provided instead.
Unfortunately browserify isn't clever enough to transform the trick involved with reading dsv.js from index.js, and thus it isn't currently possible to use dsv as a module from there.
In languages where the decimal mark is ",", some spreadsheets expect CSVs to use ";" as the delimiter and "," as the decimal mark.
dsv2* has an --input-delimiter and *2dsv has --output-delimiter. Allow the user to specify a decimal mark different than the default "." (--input-decimal / --output-decimal).
Like the new shapefile.
https://github.com/d3/d3-dsv/blob/master/src/dsv.js#L8
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function
The Function constructor is not allowed in a browser context when a safe CSP is used (without unsafe-eval). For example, it prevents the usage of Plotly with a safe CSP because it uses this package: plotly/plotly.js#897
https://github.com/d3/d3-dsv#content-security-policy
If a content security policy is in place, note that dsv.parse requires unsafe-eval in the script-src directive, due to the (safe) use of dynamic code generation for fast parsing. (See source.) Alternatively, use dsv.parseRows.
Maybe a replacement for dsv.parse (e.g. dsv.parseSafe) should be provided?
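As a workaround today, parseRows plus an ordinary (non-generated) mapping stays CSP-safe; a sketch:

```javascript
// Convert header + row arrays (e.g. the output of dsv.parseRows) into
// objects without the Function constructor, so no unsafe-eval is needed.
function rowsToObjects(rows) {
  const columns = rows[0];
  return rows.slice(1).map(values => {
    const obj = {};
    columns.forEach((c, i) => { obj[c] = values[i] == null ? "" : values[i]; });
    return obj;
  });
}

rowsToObjects([["a", "b"], ["1", "2"]]); // → [{ a: "1", b: "2" }]
```

This trades the speed of the generated converter for CSP compatibility.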
ERROR in node_modules/@types/d3-fetch/index.d.ts(9,10): error TS2305: Module '"F:/AprotechSolutions/ems-beam-webapp/node_modules/@types/d3-dsv/index"' has no exported member
'DSVParsedArray'.
node_modules/@types/d3-fetch/index.d.ts(9,26): error TS2305: Module '"node_modules/@types/d3-dsv/index"' has no exported member 'DSVRowString'.
node_modules/@types/d3-fetch/index.d.ts(9,40): error TS2305: Module '"/node_modules/@types/d3-dsv/index"' has no exported member 'DSVRowAny'.
node_modules/@types/d3-fetch/index.d.ts(9,57): error TS2497: Module '"/node_modules/@types/d3-dsv/index"' resolves to a non-module entity
and cannot be imported using this construct.
node_modules/@types/d3/index.d.ts(24,15): error TS2498: Module '"/node_modules/@types/d3-dsv/index"' uses 'export =' and cannot be used with 'export *'.
It might happen that CSV files will omit redundant commas, in which case d3.autoType breaks with
TypeError: Cannot read property 'trim' of undefined
This might be a desirable feature, in which case it would be good to trap the error and make it explicit.
However, my personal preference would be to have it a bit more fault tolerant and avoid crashing on d3.autoType({ foo: undefined }).
Especially if d3.autoType is made the default for d3.csv (#43).
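A defensive pre-pass along these lines would avoid the crash (a sketch, not a proposal for d3's actual fix):

```javascript
// Replace undefined cells (e.g. from omitted trailing commas) with empty
// strings before type inference, so .trim() is never called on undefined.
function fillMissing(row) {
  for (const key in row) if (row[key] === undefined) row[key] = "";
  return row;
}

fillMissing({foo: undefined, bar: "1"}); // → { foo: "", bar: "1" }
```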
For example:
var data = csv.parse("a,b,c\n1,2,3\n");
data[0]; // {"a": "1", "b": "2", "c": "3"}
data.columns; // ["a", "b", "c"]
Unlike Object.keys(data[0]), the columns field will be guaranteed to be in the same order as the source file. Related: d3/d3#2653, d3/d3#858.
I have an issue with headers that contain only numbers. Such a header is not parsed correctly by d3.tsvParse: the column with the numeric header ends up with the values of the first column. The problem is with the last column, 1234.
E.g.
data string:
Distance_from_anode[nm] TAPC mCBP 4CzIPN-Me T2T 1234
70 0 0 0 0 0.000304539
71 0 0 0 0 0.000234767
72 0 0 0 0 0.00016237
it returns the following incorrect result
index 70
4CzIPN-Me: 1e-99
1234: 70
Distance_from_anode[nm]: 1e-99
T2T: 0.000304539
TAPC: 1e-99
mCBP: 1e-99
index 71
4CzIPN-Me: 1e-99
1234: 71
Distance_from_anode[nm]: 1e-99
T2T: 0.000234767
TAPC: 1e-99
mCBP: 1e-99
index 72
4CzIPN-Me: 1e-99
1234: 72
Distance_from_anode[nm]: 1e-99
T2T: 0.00016237
TAPC: 1e-99
mCBP: 1e-99
Expected
index 70
4CzIPN-Me: 1e-99
1234: 0.000304539
Distance_from_anode[nm]: 70
T2T: 1e-99
TAPC: 1e-99
mCBP: 1e-99
index 71
4CzIPN-Me: 1e-99
1234: 0.000234767
Distance_from_anode[nm]: 71
T2T: 1e-99
TAPC: 1e-99
mCBP: 1e-99
index 72
4CzIPN-Me: 1e-99
1234: 0.00016237
Distance_from_anode[nm]: 72
T2T: 1e-99
TAPC: 1e-99
mCBP: 1e-99
I'm fixing it for now by checking whether the header contains only numbers and adding an underscore if that's true.
I'm trying to use the module from node v9.3.0 as shown here, but I get
The requested module does not provide an export named 'csvFormat'
$ cat index.mjs
import {csvFormat} from "d3-dsv";
$ node --experimental-modules index.mjs
(node:9652) ExperimentalWarning: The ESM module loader is experimental.
SyntaxError: The requested module does not provide an export named 'csvFormat'
at ModuleJob._instantiate (internal/loader/ModuleJob.js:84:17)
at <anonymous>
d3.csv.parse('a,b,c\n1,2,3', function (row) { return {x: row.a, y: +row.b * +row.c} }) returns [ { x: '1', y: 6 }, columns: [ 'a', 'b', 'c' ] ].
Should it return [ { x: '1', y: 6 }, columns: [ 'x', 'y' ] ]?
I guess the question is: is the columns property meant to reflect the input or the output?
If the latter, and given there is no guarantee that the row conversion function produces objects with identical properties for every row, perhaps columns could reflect those of the first row?
As of release 1.0.6, d3-dsv has this at the top:
(function (global, factory) {
  typeof exports === 'object' && typeof module !== 'undefined' ? factory(exports) :
  typeof define === 'function' && define.amd ? define(['exports'], factory) :
  (factory((global.d3 = {})));
}(this, (function (exports) { 'use strict';
The effect is that global.d3 is reset, removing anything that previously loaded modules put there. All the other modules have global.d3 = global.d3 || {}
The problem is present both in d3-dsv.zip for version 1.0.6 on github, and in https://d3js.org/d3-dsv.v1.js
For brevity and consistency with how we parse dates in d3.autoType (and the ECMAScript specification), it would be nice to use the [±YY]YYYY-MM-DD format for dates on UTC midnight.
I’ve noticed that Excel now saves UTF-8 CSV files with a BOM. (I’m using Microsoft Excel for Mac version 15.33, saving in “CSV UTF-8” format.)
When such files are parsed with csvParse, the key corresponding to the first column has a zero-width non-breaking space as its first character, which leads to a situation where d["keyName"] is undefined even though keyName appears when you print out d!
I’m not sure whether you think this should be addressed in the parser – if not it should at least be documented I think.
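If it ends up documented rather than fixed in the parser, the workaround is a one-line strip (sketch):

```javascript
// Remove a leading BOM (U+FEFF) that Excel writes to UTF-8 CSV files.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

stripBom("\uFEFFname,value"); // → "name,value"
```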
Besides fetching data by tsv() and csv(), could we support ssv() or wssv() for whitespace-separated values?
For example, if the data is
x y
1 3
2.2 3.4
-1 -2
the data is quite well formatted and human-readable, and it can also be considered proper data to be processed by a program. For files such as
x y
1 3
2.2 -1.8
3 4
the data is not nicely formatted, but if a human can understand it, it could arguably be processed by a program as well.
The following can preprocess the second form:
const data = `x y
1 3
2.2 -1.8
3 4`;
console.log(data.split("\n").map(line => line.trim().split(/\s+/).join("\t")).join("\n"));
Is there a reason, besides performance, for d3.autoType dropping the leading 0?
d3.autoType({id: "06075"})
=> 6075
d3.autoType is quite convenient.
Could a test for a leading 0 be added to avoid the string-to-number conversion here?
This is clearly a browser bug (and we don’t generally workaround browser bugs), but it’d be pretty easy to patch this I suspect.
I want to read a CSV in Node, not in the browser. How would I do that?
Related d3/d3#2429.
Sometimes it's useful to format without a header, for example when appending to an existing CSV file.
There is an issue when parsing large files. I tested with a 1.4G JSON file and it throws:
buffer.js:490
throw new Error('toString failed');
^
Error: toString failed
at Buffer.toString (buffer.js:490:11)
at StringDecoder.write (string_decoder.js:130:21)
at StripBOMWrapper.write (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/bom-handling.js:35:28)
at Object.decode (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/index.js:38:23)
at /home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/bin/dsv2json:27:35
at ReadStream.<anonymous> (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/rw/lib/rw/read-file.js:22:33)
at emitNone (events.js:85:20)
at ReadStream.emit (events.js:179:7)
at endReadableNT (_stream_readable.js:913:12)
at _combinedTickCallback (internal/process/next_tick.js:74:11)
at process._tickCallback (internal/process/next_tick.js:98:9)
I've found this link which illustrates the same issue with big files
You can test it with
wget http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip
sed -i '1s/^/geonameid\tname\tasciiname\talternatenames\tlatitude\tlongitude\tfeature_class\tfeature_code\tcountry_code\tcc2\tadmin1_code\tadmin2_code\tadmin3_code\tadmin4_code\tpopulation\televation\tdem\ttimezone\tmodification_date\n/' allCountries.txt
time tsv2json < allCountries.txt > allCountries-pre.json
Do you have a recommended way to parse big files, using either the command line or the API?
Note that it's working well with csv-parser :
cat allCountries.txt | csv-parser -s $'\t' > allCountries-pre.json
Currently, this d3-dsv package depends on commander, rw and iconv-lite. But these dependencies are actually only used by the CLI tools being exposed, not when using the library directly (as done in several chart libraries).
I think it might make sense to split the CLI tools into a separate package (depending on d3-dsv), so that projects only needing the library don't need these additional dependencies.
Right now there's no way to distinguish between empty strings and missing data in formatted output.
csvFormatRows([['value', 'null', 'undefined', 'string'], [0,null,undefined,'']]);
I'd like this to return:
value,null,undefined,string
0,,,""
But instead I get:
value,null,undefined,string
0,,,
It's important to highlight methods not compatible with a strong Content-Security-Policy (i.e. without unsafe-eval).
In the same way it's documented in https://github.com/d3/d3-dsv#content-security-policy
It should be documented in https://github.com/d3/d3-dsv/blob/master/README.md#csvParse : #65