Maybe parse numbers? (d3-dsv, closed)

d3 commented on May 2, 2024
Maybe parse numbers?

from d3-dsv.

Comments (17)

makc commented on May 2, 2024

These guys suggest

/^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/
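To make it easier to see what this pattern accepts and rejects, here is a minimal sketch. The helper name `looksNumeric` is hypothetical and is not part of d3-dsv.

```javascript
// The pattern quoted above; the anchors require the whole string to match.
const NUMERIC = /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/;

// Hypothetical helper: true if the string looks like a decimal number.
function looksNumeric(s) {
  return NUMERIC.test(s);
}

console.log(looksNumeric("42"));     // true
console.log(looksNumeric("-3.14"));  // true
console.log(looksNumeric("1e5"));    // true
console.log(looksNumeric(".5"));     // true
console.log(looksNumeric("5."));     // false: digits are required after the dot
console.log(looksNumeric(""));       // false
console.log(looksNumeric("00320"));  // true: leading zeros are accepted
```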

mbostock commented on May 2, 2024

Related d3/d3#387. Generally speaking I think it’s a bad thing to convert to non-string types implicitly. Even though it’s what you want in 99% of cases, the cases where it does something unexpected to your data can be very harmful. Hence we’ve been favoring the use of type-conversion functions where you can explicitly coerce the data to the type you want (typically on a per-column basis).

mbostock commented on May 2, 2024

I think it’s probably worth doing this automatically, if we can. As long as there’s a way to disable it.

curran commented on May 2, 2024

Related projects with similar intent:

  • datalib - dl.csv supports arguments that specify types that each column should be coerced to. Not sure if it does number parsing automatically, but it seems like it does from the documentation.
  • dsv-dataset - parses numbers and dates according to a specification of column types. No automatic type detection.

mbostock commented on May 2, 2024

Thanks for the references. I was aware of those projects, but it was a useful reminder.

This issue was only intended to cover detecting numbers. It appears datalib also detects booleans and dates. I wonder whether that’s possible to do safely.

Detecting booleans as “true” or “false” is simple enough, but many datasets do not use these exact strings to represent booleans; “Y” and “N”, for example, are probably more common. Also, many datasets use the empty string to indicate missing data. You wouldn’t want to inadvertently coerce the empty string to false (undefined would be more appropriate), and it would likewise be weird to mix the empty string in with boolean true and false.

Similarly, what would you do with a mix of “true”, “false” and other non-empty strings? The same issue applies to detecting numbers. Do you use strings, NaN or undefined for non-numeric values if the column contains a mix of numeric and non-numeric values? Undefined for the empty string and NaN for non-numeric values is arguably more type-safe than including strings, but it loses information.

Detecting dates generically is much more difficult. Using Date.parse is dangerous because its behavior varies widely across browsers: you might test your code on one browser that understands the given date format, but users on a different browser might see invalid dates! It’s safer to strictly define the set of supported date formats, but that implies d3-dsv depending on d3-time-format and d3-time, which is a fairly significant addition!

Maybe I’ve convinced myself to stick with the status quo and coerce types explicitly.

mbostock commented on May 2, 2024

Also, JavaScript already provides type coercion if you don’t mind being sloppy: for example, putting a string value into an arithmetic expression automatically coerces that string to a number. And the nice thing about leaving things as strings is that it doesn’t lose information, like greedily coercing to a number does. Though of course there are times, such as when sorting, that leaving number-like values as strings will behave unexpectedly.
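A small sketch of both halves of that point, using only built-in JavaScript behavior (the variable names are just for illustration):

```javascript
// Most arithmetic operators coerce string operands to numbers...
console.log("3" * "4");  // 12
console.log("10" - 1);   // 9
// ...but + concatenates when either operand is a string.
console.log("10" + 1);   // "101"

// Sorting number-like strings compares them lexicographically.
const values = ["10", "9", "1"];
const sortedAsStrings = values.slice().sort();
console.log(sortedAsStrings);  // "1", "10", "9"

// Coercing first and sorting with a numeric comparator gives numeric order.
const sortedAsNumbers = values.map(Number).sort((a, b) => a - b);
console.log(sortedAsNumbers);  // 1, 9, 10
```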

Another approach to this problem might be to improve how types are coerced. Do we think the current approach is too verbose, or too tedious to type? Or perhaps we think it’s also unsafe, because it silently coerces the empty string to zero and non-numeric strings to NaN?

d3_dsv.csv(text, function(d) {
  return {
    foo: +d.foo,
    bar: !!d.bar,
    baz: parseDate(d.baz)
  };
}, function(error, data) {
  if (error) throw error;
  console.log(data);
});

I could imagine an API for constructing the above type-coercion function more declaratively (but not that much more declarative, since the above is pretty clean):

d3_dsv.csv(text, d3_dsv.type()
    .field("foo", d3_dsv.typeNumber)
    .field("bar", d3_dsv.typeBoolean)
    .field("baz", parseDate), function(error, data) {
  if (error) throw error;
  console.log(data);
});

So, the field foo is coerced to a number, but maybe it throws an error if the number is invalid rather than silently coercing to NaN?
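As a rough sketch of what such a throwing coercion could look like: `typeNumberStrict` is a hypothetical name, it is not part of d3-dsv, and treating blank strings as undefined is an assumption based on the earlier discussion of missing data.

```javascript
// Hypothetical strict number coercion; not an existing d3-dsv function.
function typeNumberStrict(value) {
  // Assumption: an empty or whitespace-only string means "missing", not zero.
  if (value.trim() === "") return undefined;
  const number = +value;
  // Throw instead of silently returning NaN for non-numeric input.
  if (Number.isNaN(number)) {
    throw new Error(`invalid number: ${JSON.stringify(value)}`);
  }
  return number;
}

console.log(typeNumberStrict("42"));  // 42
console.log(typeNumberStrict(""));    // undefined
try {
  typeNumberStrict("n/a");
} catch (error) {
  console.log(error.message);         // invalid number: "n/a"
}
```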

The hypothetical API doesn’t seem like a big win, though, since the current approach is shorter (or about the same) to type and more transparent.

I suppose you could have d3_dsv.typeAuto() if you wanted to opt-in to unsafe conversion, though. :)

mbostock commented on May 2, 2024

Another variation:

d3_dsv.csv(text, d3_dsv.type({
      foo: d3_dsv.typeNumber,
      bar: d3_dsv.typeBoolean,
      baz: parseDate
    }), function(error, data) {
  if (error) throw error;
  console.log(data);
});

It’s also interesting to consider whether this would be useful for renaming columns (#10) and restructuring. But, that’s also something the current approach does relatively well, perhaps even better if you use ES6 destructuring.

d3_dsv.csv(text, function(d) {
  return {
    foo: +d.Foo,
    bar: {
      confirmed: !!d.barConfirmed,
      date: parseDate(d.barDate)
    }
  };
}, function(error, data) {
  if (error) throw error;
  console.log(data);
});

makc commented on May 2, 2024

Ha ha,

mbostock self-assigned this 15 hours ago
mbostock removed their assignment 2 hours ago

I'm late to the party and yet did not miss anything.

curran commented on May 2, 2024

All very interesting ideas. It seems like automatic parsing might be best suited to leave out of D3, as other tools like Datalib will come along and evolve. Lots of open questions, like how to detect date format automatically. Also there's the classic case of id fields that are strings, like "00320", that should not be parsed as numbers. I've heard so many stories of Excel automatically parsing these kinds of identifiers (e.g. FIPS codes) and causing problems.
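The identifier case is easy to reproduce with plain JavaScript coercion; the variable names below are only illustrative:

```javascript
// A zero-padded identifier such as a FIPS-style code.
const code = "00320";

// Numeric coercion succeeds, so a naive "is it a number?" check would convert it...
const coerced = +code;
console.log(coerced);                   // 320

// ...and the leading zeros cannot be recovered from the number.
console.log(String(coerced) === code);  // false
```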

makc commented on May 2, 2024

@curran I trust that Microsoft's Excel team has performed extensive use case studies and determined that the number of cases where this caused problems was far less than the number of cases where it was helpful.

vogievetsky commented on May 2, 2024

And yet, in a data file with a large number of columns, Excel will always find some column to mess up :-p

jstcki commented on May 2, 2024

@makc Whoops!

makc commented on May 2, 2024

@herrstucki way to blame computer program for human error. this is how robot revolution will start.

mbostock commented on May 2, 2024

I realize this issue is four years old and I haven’t found a solution I’m happy with yet, but I’d like to make some progress here. At the very least, there should be some explicit option to coerce values to numbers if the value would not be NaN, even if it’s not the default behavior. For example:

function autoType(d) {
  for (const key in d) {
    if (!isNaN(d[key])) {
      d[key] = +d[key];
    }
  }
  return d;
}

This “autoType” function could then be passed as the row accessor function to dsv.parse.
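One caveat with the sketch above: `isNaN("")` is false because the empty string coerces to zero, so empty fields would become 0. A possible variant that leaves blank strings untouched might look like this; `autoTypeSkipEmpty` is a hypothetical name, and the sketch assumes every field value is a string, as it is for parsed DSV rows.

```javascript
// Hypothetical variant of autoType that does not turn blank fields into 0.
function autoTypeSkipEmpty(d) {
  for (const key in d) {
    const value = d[key];
    // Skip empty or whitespace-only strings, which would otherwise coerce to 0.
    if (value.trim() !== "" && !isNaN(value)) {
      d[key] = +value;
    }
  }
  return d;
}

const row = autoTypeSkipEmpty({id: "7", price: "1.50", note: "", name: "abc"});
console.log(row);  // id: 7, price: 1.5, note: "", name: "abc"
```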

If we change the default string conversion for dates to use ISO 8601 format, we could likewise add parsing for dates in ISO 8601 format to autoType, and thus have a clean way to roundtrip dates as well. (While avoiding the problem of trying to parse arbitrary date formats, which is a minefield, and should be avoided anyway in most cases by encouraging people to use the standard ISO 8601 representation.)

mbostock commented on May 2, 2024

We should also guarantee that NaN is roundtripped as the number NaN, rather than coming back as the string "NaN". It might also be sensible to parse Python’s "nan" and R’s "NA" as NaN, too.

mbostock commented on May 2, 2024

We could roundtrip "true" and "false" (exact strings, but case-insensitive?) to booleans, too.
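A minimal sketch of how these roundtrip rules might combine for a single value. This is an illustration only, not the implementation referenced later in the thread; `autoTypeValue` is a hypothetical name, and matching booleans case-insensitively is one of the open options mentioned above rather than a settled choice.

```javascript
// Hypothetical per-value coercion; assumes the input is always a string.
function autoTypeValue(value) {
  // Roundtrip NaN, and accept the Python and R spellings suggested above.
  if (value === "NaN" || value === "nan" || value === "NA") return NaN;
  // Booleans, matched case-insensitively (an assumption).
  const lower = value.toLowerCase();
  if (lower === "true") return true;
  if (lower === "false") return false;
  // Numbers, skipping blank strings so they do not become 0.
  if (value.trim() !== "" && !isNaN(value)) return +value;
  // Everything else stays a string.
  return value;
}

console.log(autoTypeValue("NA"));     // NaN
console.log(autoTypeValue("TRUE"));   // true
console.log(autoTypeValue("12.5"));   // 12.5
console.log(autoTypeValue("hello"));  // "hello"
```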

mbostock commented on May 2, 2024

I’ve fleshed out a solution that I’m pretty happy with in #42.

https://github.com/d3/d3-dsv/blob/auto-type/README.md#autoType
