
Comments (6)

oschonrock commented on August 23, 2024

Refined code to reproduce:

#include "csv.hpp"  // single header version

int main() {

  using namespace csv;

  const int cols_n = 5000;
  const int rows_n = 2;
  std::string filename = "WideDataset.txt";
  std::ofstream ofstream(filename);
  if (!ofstream.is_open()) {
    std::cerr << "failed to open " << filename << '\n';
    exit(1);
  }

  std::random_device rd;
  std::mt19937 gen{1}; // {rd()};
  std::uniform_real_distribution<double> dist{0, 1};

  ofstream << std::setprecision(16);
  for (int r = 0; r < rows_n; r++) {
    for (int c = 0; c < cols_n; c++) {
      double num = dist(gen);
      ofstream << num;
      if (c != cols_n -1) ofstream << ',';
    }
    ofstream << "\n";

  }
  ofstream.close();

  Matrix m;
  CSVReader reader(filename);

  {
    std::vector<double> r;
    for (CSVRow& row: reader) { // Input iterator
      for (CSVField& field: row) {
        std::cout << field.get<double>() << "\n";
        // std::cout << field << "\n";
      }
    }
  }
}

I did see some segfaults as well, but mostly it throws "Not a number". Here is the valgrind output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Not a number.
==708== 
==708== Process terminating with default action of signal 6 (SIGABRT)
==708==    at 0x4C1FED7: raise (raise.c:51)
==708==    by 0x4C01534: abort (abort.c:79)
==708==    by 0x492B671: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x49372B5: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4937300: std::terminate() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4937534: __cxa_throw (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4082E7: double csv::CSVField::get<double>() (csv.hpp:3801)
==708==    by 0x406451: main (corr.cpp:122)

But if I use std::cout << field << "\n"; (i.e. print the CSVField instance without converting to double), then it runs without errors.
Inspecting the output of the CSVField printing at the point where it goes wrong:

...
<CSVField> 0.3118567593128918
<CSVField> 0.24509591655933560.8756073010782790.9602685581329761...

There are no commas, and all the values are run together. At this point I began to doubt my sample-CSV-generating code. But I checked, and here is the equivalent place in the CSV file (I changed to a constant random seed for consistency):

...0.3118567593128918,0.2450959165593356,0.875607301078279,0.9602685581329761,...

which looks all fine...

So: the error is being thrown (correctly!) at the get<double>() stage, because something has gone wrong at the parsing stage: the separators are "lost" and all the fields are run together into extremely long "rows".

Hope that helps. Will investigate further if I get time.


vincentlaucsb commented on August 23, 2024

This is really interesting. I think I'm 99% sure why this is happening.

Basically, this CSV parser stores CSV data as a giant std::string with an accompanying std::vector<unsigned short> of indices where each field starts. If you have a CSV row like

a,b,c,de,fg

then it'll be stored (roughly) as

data = "abcdefg"
fields = {0, 1, 2, 3, 5}

When you use CSVFields, you're basically creating a string_view over the larger string using the indices stored in the vector of unsigned shorts. I naively used to store data in std::vector<std::string>s, but this was not very fast. If you want to get into the nitty gritty, look at the parser source.
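
A minimal sketch of that storage scheme (the names RowData, splits, and get_field are illustrative, not the library's actual API):

#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

// Illustrative model (hypothetical names, not csv-parser's real types):
// one row's field text concatenated into a single buffer, plus the
// start offset of each field.
struct RowData {
  std::string data;                   // "abcdefg"
  std::vector<unsigned short> splits; // {0, 1, 2, 3, 5}
};

// Field i is a string_view over the shared buffer, spanning from its
// start offset to the next field's start offset (or the buffer's end).
std::string_view get_field(const RowData& row, std::size_t i) {
  std::size_t start = row.splits[i];
  std::size_t end =
      (i + 1 < row.splits.size()) ? row.splits[i + 1] : row.data.size();
  return std::string_view{row.data}.substr(start, end - start);
}

int main() {
  RowData row{"abcdefg", {0, 1, 2, 3, 5}};
  for (std::size_t i = 0; i < row.splits.size(); i++)
    std::cout << get_field(row, i) << '\n'; // prints a, b, c, de, fg
}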

I think what is happening is that the length of your rows (in terms of characters) is beyond the range of unsigned short which is 2^16. I believe this will be fixed if you replace all instances of unsigned short with unsigned int in the parser.
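
Assuming the stored offsets simply wrap modulo 2^16 once a row gets long enough (a plausible reading of the symptom, not verified against the parser internals), the split points collapse back toward the start of the buffer and adjacent fields appear to run together:

#include <iostream>

int main() {
  unsigned short offset = 65530; // near the top of unsigned short's range
  for (int i = 0; i < 4; i++) {
    offset += 4; // pretend each new field starts 4 characters later
    std::cout << offset << '\n'; // prints 65534, then wraps: 2, 6, 10
  }
}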

I don't really want to do this because it might hurt performance for general use cases. On the other hand, I didn't actually expect somebody to craft rows with more than 60,000 characters, so I might have to find a compromise.


oschonrock commented on August 23, 2024

I think what is happening is that the length of your rows (in terms of characters) is beyond the range of unsigned short, which is 2^16. I believe this will be fixed if you replace all instances of unsigned short with unsigned int in the parser.

I don't really want to do this because it might hurt performance for general use cases.

Yup, that makes a lot of sense. I agree 60,000-character rows are extreme, but at the least we should throw if the row is too long, rather than corrupt memory, return bad data, and potentially segfault... basically undefined behaviour? A sketch of such a guard follows.
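
A hedged sketch of that guard (checked_offset is a hypothetical helper, not the parser's actual code): validate before narrowing and throw instead of silently wrapping.

#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical guard: refuse to store a field offset that doesn't fit
// in unsigned short, instead of letting it wrap around silently.
unsigned short checked_offset(std::size_t pos) {
  if (pos > std::numeric_limits<unsigned short>::max())
    throw std::length_error("CSV row too long: field offset " +
                            std::to_string(pos) + " exceeds 65535");
  return static_cast<unsigned short>(pos);
}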

I can clone a copy down and experiment.

Are you sure that using unsigned short is beneficial from a performance POV? I have tried this many times (i.e. changed int to short), and measurement usually shows it is a slowdown, because modern CPUs are optimised to deal with 32-bit ints, not 16-bit ones. The only time it is a performance benefit is when you are storing literally millions of such ints/shorts, so that the total memory involved is significant and you hit cache or memory-bus bandwidth bottlenecks. Is that the case here? Surely we are dealing with one row at a time? So we are dealing with thousands of ints or shorts (one for each field), not millions. The general consensus in discussions on this topic seems to be: "int is the same or faster, unless memory/cache usage or bandwidth is the issue".
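
Back-of-the-envelope numbers for that argument, assuming roughly 10,000 fields per row and typical 64-bit type sizes:

#include <cstddef>
#include <iostream>

int main() {
  const std::size_t fields_per_row = 10'000; // one split index per field
  std::cout << fields_per_row * sizeof(unsigned short)
            << " bytes as shorts\n"; // 20000
  std::cout << fields_per_row * sizeof(unsigned int)
            << " bytes as ints\n"; // 40000
  // Both fit comfortably in L2 cache, so widening the index type is
  // unlikely to create a memory-bandwidth bottleneck for per-row storage.
}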


oschonrock commented on August 23, 2024

progress report:

  1. I grep'd for and (quite blindly) replaced all instances of unsigned short with unsigned.
  2. I had to back these 3 out because they broke the tests:
tests/test_csv_field.cpp:64:    unsigned char, unsigned short, unsigned int, unsigned long long,
tests/test_csv_field.cpp:80:    unsigned char, unsigned short, unsigned int, unsigned long long int,
tests/test_csv_field.cpp:97:    unsigned char, unsigned short, unsigned int, unsigned long long int) {
  3. Result? The behaviour "changed": it worked for 5000 columns and 2 rows, but as soon as I changed to 10,000 cols and 1000 rows (my target test case), it broke again.

So "close" .. but not quite...

What do you think about permanently making such a change from 16-bit unsigned shorts to 32-bit unsigned ints, if we can prove it doesn't harm performance?


vincentlaucsb commented on August 23, 2024

I would be fine with that. I just don't want to sacrifice performance in general for extreme cases.

There is also another minor tweak that might help expand the maximum row size, which I can implement when I have time. As for the failing tests, I know why they're failing (overzealous grepping), and it's not a big deal to fix them.


vincentlaucsb commented on August 23, 2024

In #80, I replaced unsigned short with size_t, which maps to unsigned __int64 on MSVC. No noticeable performance impact.
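
For reference, size_t is the platform's pointer-width unsigned type, so on 64-bit targets the offsets can index any buffer that fits in memory:

#include <cstddef>
#include <iostream>

int main() {
  // On LP64 Linux and 64-bit MSVC alike, size_t is 8 bytes wide.
  std::cout << sizeof(std::size_t) << '\n'; // typically prints 8 on 64-bit targets
}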

