
Comments (6)

oschonrock commented on August 23, 2024

Refined code to reproduce:

#include "csv.hpp"  // single header version

int main() {

  using namespace csv;

  const int cols_n = 5000;
  const int rows_n = 2;
  std::string filename = "WideDataset.txt";
  std::ofstream ofstream(filename);
  if (!ofstream.is_open()) {
    std::cerr << "failed to open " << filename << '\n';
    exit(1);
  }

  std::random_device rd;
  std::mt19937 gen{1}; // {rd()};
  std::uniform_real_distribution<double> dist{0, 1};

  ofstream << std::setprecision(16);
  for (int r = 0; r < rows_n; r++) {
    for (int c = 0; c < cols_n; c++) {
      double num = dist(gen);
      ofstream << num;
      if (c != cols_n -1) ofstream << ',';
    }
    ofstream << "\n";

  }
  ofstream.close();

  Matrix m;
  CSVReader reader(filename);

  {
    std::vector<double> r;
    for (CSVRow& row: reader) { // Input iterator
      for (CSVField& field: row) {
        std::cout << field.get<double>() << "\n";
        // std::cout << field << "\n";
      }
    }
  }
}

I did see some segfaults as well, but mostly it throws "Not a number". Here is the valgrind output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Not a number.
==708== 
==708== Process terminating with default action of signal 6 (SIGABRT)
==708==    at 0x4C1FED7: raise (raise.c:51)
==708==    by 0x4C01534: abort (abort.c:79)
==708==    by 0x492B671: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x49372B5: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4937300: std::terminate() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4937534: __cxa_throw (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4082E7: double csv::CSVField::get<double>() (csv.hpp:3801)
==708==    by 0x406451: main (corr.cpp:122)

But if I use std::cout << field << "\n"; (i.e. print the CSVField instance without converting to double), then it runs without errors.
Inspecting the output of the CSVField printing at the point where it goes wrong:

...
<CSVField> 0.3118567593128918
<CSVField> 0.24509591655933560.8756073010782790.9602685581329761...

There are no commas, and all the values are run together. At this point I began to doubt my sample-CSV-generating code. But I checked, and here is the equivalent place in the CSV file (I changed to a constant random seed for consistency):

...0.3118567593128918,0.2450959165593356,0.875607301078279,0.9602685581329761,...

which looks all fine...

So: the error is being thrown (correctly!) at the get<double>() stage, because something has gone wrong at the parsing stage: the separators are "lost" and all the fields are run together into extremely long "rows".

Hope that helps. Will investigate further if I get time.


vincentlaucsb commented on August 23, 2024

This is really interesting. I think I'm 99% sure why this is happening.

Basically, this CSV parser stores CSV data as a giant std::string with an accompanying std::vector<unsigned short> of indices where each field starts. If you have a CSV row like

a,b,c,de,fg

then it'll be stored (roughly) as

data = "abcdefg"
fields = {0, 1, 2, 3, 5}

When you use CSVFields, you're basically creating a string_view over the larger string using the indices stored in the vector of unsigned shorts. I naively used to store data in std::vector<std::string>s, but this was not very fast. If you want to get into the nitty gritty, look at the parser source.
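
A minimal sketch of that storage scheme (the names RowData, splits, and get_field are illustrative, not the library's actual API):

#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

// Illustrative model (hypothetical names, not csv-parser's real types):
// one row's field text concatenated into a single buffer, plus the
// start offset of each field.
struct RowData {
  std::string data;                   // "abcdefg"
  std::vector<unsigned short> splits; // {0, 1, 2, 3, 5}
};

// Field i is a string_view over the shared buffer, spanning from its
// start offset to the next field's start offset (or the buffer's end).
std::string_view get_field(const RowData& row, std::size_t i) {
  std::size_t start = row.splits[i];
  std::size_t end =
      (i + 1 < row.splits.size()) ? row.splits[i + 1] : row.data.size();
  return std::string_view{row.data}.substr(start, end - start);
}

int main() {
  RowData row{"abcdefg", {0, 1, 2, 3, 5}};
  for (std::size_t i = 0; i < row.splits.size(); i++)
    std::cout << get_field(row, i) << '\n'; // prints a, b, c, de, fg
}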

I think what is happening is that the length of your rows (in terms of characters) is beyond the range of unsigned short which is 2^16. I believe this will be fixed if you replace all instances of unsigned short with unsigned int in the parser.
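
Assuming the stored offsets simply wrap modulo 2^16 once a row gets long enough (a plausible reading of the symptom, not verified against the parser internals), the split points collapse back toward the start of the buffer and adjacent fields appear to run together:

#include <iostream>

int main() {
  unsigned short offset = 65530; // near the top of unsigned short's range
  for (int i = 0; i < 4; i++) {
    offset += 4; // pretend each new field starts 4 characters later
    std::cout << offset << '\n'; // prints 65534, then wraps: 2, 6, 10
  }
}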

I don't really want to do this because it might hurt performance for general use cases. On the other hand, I didn't actually expect somebody to craft rows with more than 60,000 characters, so I might have to find a compromise.


oschonrock commented on August 23, 2024

I think what is happening is that the length of your rows (in terms of characters) is beyond the range of unsigned short, which is 2^16. I believe this will be fixed if you replace all instances of unsigned short with unsigned int in the parser.

I don't really want to do this because it might hurt performance for general use cases.

Yup, that makes a lot of sense. I agree 60,000-character rows are extreme, but at the least we should throw if the row is too long, rather than corrupt memory, return bad data, and potentially segfault... basically undefined behaviour? A sketch of such a guard follows.
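
A hedged sketch of that guard (checked_offset is a hypothetical helper, not the parser's actual code): validate before narrowing and throw instead of silently wrapping.

#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical guard: refuse to store a field offset that doesn't fit
// in unsigned short, instead of letting it wrap around silently.
unsigned short checked_offset(std::size_t pos) {
  if (pos > std::numeric_limits<unsigned short>::max())
    throw std::length_error("CSV row too long: field offset " +
                            std::to_string(pos) + " exceeds 65535");
  return static_cast<unsigned short>(pos);
}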

I can clone a copy down and experiment.

Are you sure that using unsigned short is beneficial from a performance POV? I have tried this many times (i.e. changed int to short), and measurement usually shows it is a slowdown, because modern CPUs are optimised to deal with 32-bit ints, not 16-bit ones. The only time it is a performance benefit is when you are storing literally millions of such ints/shorts, so that the total memory involved is significant and you hit cache or memory-bus bandwidth bottlenecks. Is that the case here? Surely we are dealing with one row at a time? So we are dealing with thousands of ints or shorts (one for each field), not millions. The general consensus in discussions on this topic seems to be: "int is the same or faster, unless memory/cache usage or bandwidth is the issue".
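
Back-of-the-envelope numbers for that argument, assuming roughly 10,000 fields per row and typical 64-bit type sizes:

#include <cstddef>
#include <iostream>

int main() {
  const std::size_t fields_per_row = 10'000; // one split index per field
  std::cout << fields_per_row * sizeof(unsigned short)
            << " bytes as shorts\n"; // 20000
  std::cout << fields_per_row * sizeof(unsigned int)
            << " bytes as ints\n"; // 40000
  // Both fit comfortably in L2 cache, so widening the index type is
  // unlikely to create a memory-bandwidth bottleneck for per-row storage.
}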


oschonrock commented on August 23, 2024

progress report:

  1. I grep'd for and (quite blindly) replaced all instances of unsigned short with unsigned.
  2. I had to back these 3 out because they broke the tests:
tests/test_csv_field.cpp:64:    unsigned char, unsigned short, unsigned int, unsigned long long,
tests/test_csv_field.cpp:80:    unsigned char, unsigned short, unsigned int, unsigned long long int,
tests/test_csv_field.cpp:97:    unsigned char, unsigned short, unsigned int, unsigned long long int) {
  3. Result? The behaviour "changed": it worked for 5000 columns and 2 rows, but as soon as I changed to 10,000 cols and 1000 rows (my target test case), it broke again.

So "close" .. but not quite...

What do you think about permanently making such a change from 16-bit unsigned shorts to 32-bit unsigned ints, if we can prove it doesn't harm performance?


vincentlaucsb commented on August 23, 2024

I would be fine with that. I just don't want to sacrifice performance in general for extreme cases.

There is also another minor tweak that might help expand the maximum row size, which I can implement when I have time. As for the failing tests, I know why they're failing (overzealous grepping), and it's not a big deal to fix them.


vincentlaucsb commented on August 23, 2024

In #80, I replaced unsigned short with size_t, which maps to unsigned __int64 on MSVC. No noticeable performance impact.
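
For reference, size_t is the platform's pointer-width unsigned type, so on 64-bit targets the offsets can index any buffer that fits in memory:

#include <cstddef>
#include <iostream>

int main() {
  // On LP64 Linux and 64-bit MSVC alike, size_t is 8 bytes wide.
  std::cout << sizeof(std::size_t) << '\n'; // typically prints 8 on 64-bit targets
}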

