
vincentlaucsb / csv-parser

825 stars · 27 watchers · 137 forks · 10.32 MB

A modern C++ library for reading, writing, and analyzing CSV (and similar) files.

License: MIT License

C++ 96.55% Makefile 0.14% CMake 2.57% Python 0.74%
csv parser json tab-separated csv-parser statistics c-plus-plus c-plus-plus-17 csv-reader c-plus-plus-14

csv-parser's People

Contributors

7s9n, artpaul, bigerl, bryceschober, f3rm4rf3r, genshen, gexclaude, kimwalisch, lukaskerk, nhanders, nlohmann, onurtemizkan, peza8, promgamer, rpadrela, rpavlik, ryanmarcus, stlsoft, tamaskenez, tobyealden, vincentlaucsb, xgdgsc, yosefl20


csv-parser's Issues

Add ability to convert CSV Row objects to JSON

Requirements

  • Add a method to CSVRow that converts it to a valid JSON string.
    • Strings should contain minimal whitespace
    • Characters that have special meaning in JSON need to be escaped properly
    • Method should throw an error if non-Unicode characters are found
  • Add unit tests for above method

For example, if there's a row with column names "Artist" and "Album" with entries "Florida Georgia Line" and "Here's To The Good Times" respectively, then the output should be

{"Artist":"Florida Georgia Line","Album":"Here's To The Good Times"}
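A minimal sketch of the escaping such a method would need (not the library's actual implementation; `row_to_json` and its name/value-pair input are illustrative stand-ins for `CSVRow`):

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Escape characters that have special meaning in JSON strings.
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:
            if (static_cast<unsigned char>(c) < 0x20) {
                // Other control characters become \u00XX escapes
                char buf[8];
                std::snprintf(buf, sizeof(buf), "\\u%04x",
                              static_cast<unsigned int>(static_cast<unsigned char>(c)));
                out += buf;
            } else {
                out += c;
            }
        }
    }
    return out;
}

// Serialize (column name, value) pairs into a minimal-whitespace JSON object.
std::string row_to_json(const std::vector<std::pair<std::string, std::string>>& row) {
    std::string out = "{";
    for (size_t i = 0; i < row.size(); i++) {
        if (i) out += ",";
        out += "\"" + json_escape(row[i].first) + "\":\"" + json_escape(row[i].second) + "\"";
    }
    return out + "}";
}
```

A real implementation would also need to validate UTF-8 and throw on invalid byte sequences, per the requirements above.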

Suggestions

We may consider using another C++ library to handle JSON serialization instead of implementing it ourselves. That library could then be added as an optional dependency, i.e. users would only need it if they want JSON serialization. For unit tests, the library can be included under the /tests/ directory.

Suggested library: https://github.com/nlohmann/json

References

Please consider making "CSVField::get()" const.

I think "CSVField::get()" is almost a const member function.
Please consider making it const.

I want to write the following code:

    for (const auto& row : rows) {
        for (const auto& field: row) {
            field.get<std::string_view>();
        }
    }

Crash when encountering comments / column numbers change

I often encounter CSV files that use comments (lines starting with a specific character, usually ; in my experience) to encode metadata about the file in the first few lines.

Reading these files causes a crash. I suspect this is because the file looks like a single-column CSV at first, and then the data starts with many columns.

example:

;Instrument ABCDEF
;Collected 14JUN2018 field site Alpha
time,temperature,humidity,pressure
12312351,23.3,120,234
12312352,24.0,122,233
...
...

I know comments are not mentioned in RFC 4180, but even if this library does not handle them, it should ignore them or throw gracefully.
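Until the parser handles comments natively, one workaround is to filter comment lines out before feeding the text to the reader. A minimal sketch (`strip_comment_lines` is a hypothetical helper, not part of the library; it assumes comments only ever span whole lines):

```cpp
#include <sstream>
#include <string>

// Remove lines that begin with the comment character before parsing.
std::string strip_comment_lines(const std::string& text, char comment_char = ';') {
    std::istringstream in(text);
    std::string line, out;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == comment_char)
            continue;  // drop metadata lines like ";Instrument ABCDEF"
        out += line;
        out += '\n';
    }
    return out;
}
```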

Add performance enhancements for column subsetting

Currently, the CSV parser stores every column. However, many use cases only require certain columns to be parsed, and there may be optimizations that can be performed.

For example, if a user only wants columns A, B, C out of A, B, … X, Y, Z, then we can speed up parsing by skipping to the next newline once we have parsed C.

Notes

We want to implement this enhancement without sacrificing performance for more general use cases, and without complicating this library's public API. If additional classes are necessary, then CSVReader should be refactored as a wrapper around these helper classes.
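The skip-ahead idea can be sketched with a standalone function (this is not the library's parser; `parse_prefix` and `keep` are illustrative, with `keep` standing for the index one past the last wanted column):

```cpp
#include <string>
#include <vector>

// Extract only the first `keep` fields of a row, then abandon the rest of the
// line without ever splitting it into fields.
std::vector<std::string> parse_prefix(const std::string& line, size_t keep, char delim = ',') {
    std::vector<std::string> fields;
    size_t start = 0;
    while (fields.size() < keep) {
        size_t end = line.find(delim, start);
        if (end == std::string::npos) {       // last field on the line
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;                      // fields past `keep` are never materialized
    }
    return fields;
}
```

A real implementation would still have to honor quoted fields (a delimiter inside quotes is not a field boundary), which this sketch ignores.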

which branch should I use?

Master branch or single-header branch?
I prefer the single-header branch, but is it stable, and what's the difference between master and single-header?

Build errors if #include "csv.hpp" placed after #include "windows.h" under MSYS2

Steps to reproduce:
Change the main CMake file to include:

if(MSYS OR MINGW)
add_definitions(-DUNICODE -D_UNICODE)
endif()

Change tests/CMakeLists.txt to include:

if(MSYS OR MINGW)
target_compile_options(
csv_test PRIVATE -municode )
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -municode")
endif()

Change single_include_test/my_header.hpp:
add #include <windows.h> before #include "csv.hpp".
Run cmake ../ -G "MSYS Makefiles" from the build directory (already created),
then run make.

Without the Windows header before the csv header, everything compiles successfully.
gcc --version
gcc.exe (Rev2, Built by MSYS2 project) 9.2.0

Getting Column Data Types with get_dtypes()

I tried the example code for file statistics from the csv::CSVStat class reference. Here is a minimal example:

csv::CSVStat stats(csvFilePath);
auto colDataTypes = stats.get_dtypes();
auto colNames = stats.get_col_names();

// Doesn't work for colDataTypes, but works for other statistics like max, min and so on
for (size_t it = 0; it < colNames.size(); it++){
    std::cout << colDataTypes[it] << std::endl;
}

// Doesn't work either
for (auto &type : colDataTypes){
    std::cout << type << std::endl;
}

How can I get the colDataTypes of each column printed out? If I understood correctly, that is what the get_dtypes() function is supposed to do.

Add basic support for arbitrarily large numbers (bignum)

Background

The csv library currently stores large integers as long double if they exceed the limits of 64-bit integers. However, this can lead to loss of precision for very large integers. Furthermore, very large floating point values, or floating point values with many significant digits, cannot be stored this way without losing information.

Task

Investigate different methods of storing large numbers and pick one to implement.

Goal

The goal of this task is to at least provide support for parsing arbitrarily large numbers without losing information (and without affecting performance). Implementing arithmetic between big numbers is entirely optional, and does not need to be fast. End users who desire performance should combine this library with a dedicated bignum library such as GMP (https://gmplib.org/).

Add option to strip leading and trailing whitespace from fields

Requirements

  • Give the user the option to strip leading and trailing whitespace characters from CSV fields
  • Should not have any performance impact if option is disabled
  • Should leverage additional threads for best performance
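The stripping step itself is straightforward; a minimal sketch (`trim_ws` is a hypothetical helper, assuming only spaces and tabs count as whitespace):

```cpp
#include <string>

// Strip leading and trailing whitespace; an all-whitespace field becomes
// empty, which the reader would then treat as a null field.
std::string trim_ws(const std::string& s) {
    const char* ws = " \t";
    size_t begin = s.find_first_not_of(ws);
    if (begin == std::string::npos) return "";   // field was only whitespace
    size_t end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}
```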

Suggestions

Currently CSVReader moves parsed rows into std::deque<CSVRow> record_buffer. When the user uses a CSVReader iterator or calls read_row, records are pulled from this deque.

Perhaps the implementer can implement a subclass of CSVReader (call it CSVProcessor?) that has two deques: one for rows that haven't been processed by the whitespace stripper, and one for those that have. This new design should maximize code reuse so that we don't have to reimplement iterators for CSVProcessor.

Future Goals

If we decide to implement CSVProcessor, we should consider allowing users to add their own custom processing logic.

Misparsing csv

Hi,
First, I want to thank you for developing this library. Its usage is really elegant and easy.

Unfortunately, I ran into an issue where csv-parser parses fields incorrectly. Am I missing something, or is it a bug? (I generated the .csv file from GNU Octave; I think its format is correct.)

Scenario to reproduce the issue:
I have a csv file which contains 16384 columns and two rows. The first row is a header; the second row contains floating point values. The file is delimited by ','. (The .csv file is attached as a .zip.)

#include <csv.hpp>

int main(int argc, char *argv[])
{
    using namespace csv;
    CSVReader reader("time-result.csv");

    for (CSVRow& row: reader) { // Input iterator
        auto i = 0;
        for (CSVField& field: row) {
            if ( i == 7003 ) // the 7003rd field is one of the wrongly parsed fields; there are more
                std::cout << field.get<>() << std::endl;
            ++i;
        }
    }
}

Output :
e-069.99

But it should be a valid floating point numerical value string.

Library Version : 1.3.0
Compiler : gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

time-result.zip

More elegant way to handle error?

#include "csv.hpp"

int main() {
  csv::CSVFormat format;
  format.delimiter('|').column_names({"A", "B", "C"});

  csv::CSVReader csv("foo.dat", format);

  for (auto& row : csv) {
    std::cout << row.to_json() << std::endl;
  }

  return 0;
}

with foo.dat:

1|2|3|
2|3|4|

causes segmentation fault.

Changing to column_names({"A", "B", "C", "dummy"}); will do the job.

Is there any elegant way to catch such error than a segment fault? Also, Is there any option to handle such "dummy" column?

Add option to ignore quote character

It appears the quote character defaults to '"', which works in many cases, but I have run into situations where the file has no quote character. In such cases, when '"' is encountered the parser produces incorrect results, and it can even crash the program.

Add install and libcsv targets into Makefile

This is a Makefile enhancement:

  • Add targets: install, libcsv.

  • Add variables: STD, PREFIX.

Target libcsv builds static (libcsv.a) and shared (libcsv.so) libraries.

Target install installs the libraries into the $(PREFIX)/lib directory, the single header into $(PREFIX)/single_include, and all other headers into $(PREFIX)/include.

Variable STD is the argument for g++'s -std option; the default is c++11.

Variable PREFIX gives the base installation directory; the default is /usr.

Makefile.gz

Incorrect parsing of subunitary double values

The following fails to parse correctly and causes the unit test to fail.

std::string s("0.15");
long double out;    
REQUIRE(data_type(s, &out) == CSV_DOUBLE);
REQUIRE(is_equal(out, 0.15));

The parsed value is actually 1.5 instead of 0.15.
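A correct hand-rolled conversion must scale each fractional digit by a decreasing power of ten; a minimal sketch illustrating the expected behavior (`parse_double` here is illustrative, not the library's `data_type` function):

```cpp
#include <cstddef>
#include <string>

// Parse a plain decimal string; returns false on malformed input.
bool parse_double(const std::string& s, long double* out) {
    long double value = 0, place = 0.1L;
    bool seen_dot = false, seen_digit = false, negative = false;
    size_t i = 0;
    if (i < s.size() && (s[i] == '+' || s[i] == '-'))
        negative = (s[i++] == '-');
    for (; i < s.size(); i++) {
        char c = s[i];
        if (c == '.') {
            if (seen_dot) return false;       // second decimal point
            seen_dot = true;
        } else if (c >= '0' && c <= '9') {
            seen_digit = true;
            if (!seen_dot) {
                value = value * 10 + (c - '0');
            } else {
                value += (c - '0') * place;   // forgetting this scaling turns 0.15 into 1.5
                place /= 10;
            }
        } else {
            return false;
        }
    }
    if (!seen_digit) return false;
    *out = negative ? -value : value;
    return true;
}
```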

Data corruption and crash with very wide columns?

A large file with 2000 double columns, each ~20 characters wide, generated like this:

  const int cols_n = 2000;
  std::string big_filename = "MedDataset.txt";
  std::ofstream ofstream(big_filename);
  if (!ofstream.is_open()) {
    std::cerr << "failed to open " << big_filename << '\n';
    exit(1);
  }

  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_real_distribution<double> dist{0, 1};

  ofstream << std::setprecision(16);
  for (int r = 0; r < 1000; r++) {
    for (int c = 0; c < cols_n; c++) {
      double num = dist(gen);
      ofstream << num;
      if (c != cols_n -1) ofstream << ',';
    }
    ofstream << "\n";

  }
  ofstream.close();

parsing like this:

  CSVReader reader("MedDataset.txt");

  {
    std::vector<double> r;
    for (CSVRow& row: reader) { // Input iterator
      for (CSVField& field: row) {
        r.push_back(field.get<double>());
      }
      // use vector...
      r.clear();
    }
  }

I get this error during parsing:

terminate called after throwing an instance of 'std::runtime_error' 
  what():  Not a number.

If I reduce cols_n from 2000 to 1800, it runs just fine.

I have visually inspected the file and see no weird characters. It is all programmatically produced.

It feels like there is some sort of "buffer overflow" due to the very large row (roughly 32 KB?). It is 100% reproducible for me even though the field values are random.

clang++ -O2  -std=c++17   ...    -lpthread
clang++ --version
clang version 8.0.0-3 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu

Add ability to convert CSV to Markdown

  • Add classes and/or functions that would allow library users to convert CSV data to Markdown tables.
  • Add unit tests for said functionality

References

https://help.github.com/en/articles/organizing-information-with-tables

Requirements

  • Users should be able to convert a CSV file to Markdown with just one function call
  • Users should be able to compose their own Markdown documents with arbitrary column names and rows
  • CSV fields containing pipes should be escaped (see reference above)

Suggestions

  • Implement a MarkdownWriter class with the following:
    • Overloaded operator<< that accepts CSVRow as an input
    • Alternatively, operator<< that accepts any random-access iterator (over strings?) as input
    • set_column_names(std::vector<std::string>) method
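A minimal sketch of what such a writer might emit (`md_escape` and `to_markdown` are illustrative names, not a proposed API):

```cpp
#include <string>
#include <vector>

// Escape pipe characters, which delimit cells in Markdown tables.
std::string md_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '|') out += "\\|";
        else out += c;
    }
    return out;
}

// Emit a header row, a separator row, and one row per record.
std::string to_markdown(const std::vector<std::string>& names,
                        const std::vector<std::vector<std::string>>& rows) {
    std::string out = "|";
    for (const auto& n : names) out += " " + md_escape(n) + " |";
    out += "\n|";
    for (size_t i = 0; i < names.size(); i++) out += " --- |";
    out += "\n";
    for (const auto& row : rows) {
        out += "|";
        for (const auto& f : row) out += " " + md_escape(f) + " |";
        out += "\n";
    }
    return out;
}
```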

Field containing only whitespace, with trimming

I expected that when a field contains only whitespace and "trim" is set on the format, it would behave like an empty field, i.e. as in "1,,2".

However in this case parsing of the row seems incorrect. See attached test case for an example of failure.

TEST_CASE("Test trim empty field") {
  CSVFormat format;
  format.column_names({ "A", "B", "C" })
        .trim({' '});

  std::stringstream csv_string;
  csv_string << "1, two,3" << std::endl
      << "4, ,5" << std::endl
      << "6,7,8 " << std::endl;

  auto rows = parse(csv_string.str(), format);
  CSVRow row;
  rows.read_row(row);

  // First Row
  REQUIRE(row[0].get<uint32_t>() == 1);
  REQUIRE(row[1].get<std::string>() == "two");
  REQUIRE(row[2].get<uint32_t>() == 3);

  // Second Row
  rows.read_row(row);

  REQUIRE(row[0].get<uint32_t>() == 4);
  REQUIRE(row[1].is_null());
  REQUIRE(row[2].get<uint32_t>() == 5);

  // Third Row
  rows.read_row(row);
  REQUIRE(row[0].get<uint32_t>() == 6);
  REQUIRE(row[1].get<uint32_t>() == 7);
  REQUIRE(row[2].get<uint32_t>() == 8);
}

CMake Build Errors from v1.1.2 to v1.3.0

I'm coming from version 1.1.2 and want to use 1.3.0. When I run cmake to build my project, I get these errors:

/CLionProjects/student_research/dev/hmmenc_client/main.cpp: In function ‘int main(int, char**)’:
/CLionProjects/student_research/dev/hmmenc_client/main.cpp:536:136: error: ‘type_name’ is not a member of ‘csv::internals’; did you mean ‘type_num’?
  536 |                                         cout << colNames[i] << "  has:  " << item.second << "  entries of type:  "  << csv::internals::type_name(item.first) << endl;
      |                                                                                                                                        ^~~~~~~~~
      |                                                                                                                                        type_num
/CLionProjects/student_research/dev/hmmenc_client/main.cpp:820:80: error: ‘class csv::CSVStat’ has no member named ‘correct_rows’
  820 |                                             auto idManualLim = (uint64_t)stats.correct_rows; // All Rows
      |                                                                                ^~~~~~~~~~~~
/CLionProjects/student_research/dev/hmmenc_client/main.cpp:821:79: error: ‘class csv::CSVStat’ has no member named ‘correct_rows’
  821 |                                             if (idManualLim > (uint64_t)stats.correct_rows)
      |                                                                               ^~~~~~~~~~~~
/CLionProjects/student_research/dev/hmmenc_client/main.cpp:823:79: error: ‘class csv::CSVStat’ has no member named ‘correct_rows’
  823 |                                                 idManualLim = (uint64_t)stats.correct_rows;
      |                                                                               ^~~~~~~~~~~~
/CLionProjects/student_research/dev/hmmenc_client/main.cpp:848:82: error: ‘type_name’ is not a member of ‘csv::internals’; did you mean ‘type_num’?
  848 |                                                 columnNameType = csv::internals::type_name(item.first);
      |                                                                                  ^~~~~~~~~
      |                                                                                  type_num

I guess that stats.correct_rows from csv::CSVStat was changed to stats.num_rows, correct?

Next thing I use is:

// Get the type of the values in the column
auto columnNameIndex = (uint64_t)readerInfo.index_of(columnName);
string columnNameType;
for (auto item : colDataTypes[columnNameIndex])
{
    columnNameType = csv::internals::type_name(item.first);
    cout << columnName << " has " << item.second << " elements of type: " <<  columnNameType << endl;
}

Could you please tell me what I should use instead of type_name in csv::internals::type_name(item.first) to get the same effect? v1.3.0 doesn't have a member by that name in csv::internals.

Reduce overhead of creating CSV Rows

Currently, CSVRow objects store their data in a contiguous string. However, a separate vector of index positions (size_t) is also maintained so we know where every individual field starts.

Creating this vector is responsible for the majority of calls to new and is a significant source of CPU overhead. Whoever is responsible for this task should either

  • Reduce the overhead of creating these vectors, e.g. by using a memory pool allocator (easier)
  • Find a more efficient parsing mechanism that avoids using this vector (harder)

full range of uint16_t not handled by get()

A uint16_t should handle values from 0 to 65535; however, when a string containing the value 65535 is parsed using CSVRow.get<uint16_t>(), an overflow exception is reported.

Add support for unsigned integer conversions

Add the ability to use unsigned integer types with CSVField::get<>(). Currently, attempting to do so will fail a static_assert.

Requirements

  • If the value is negative, an error is thrown
  • Bounds checking should be performed to prevent overflow
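The two requirements can be sketched as follows (`get_unsigned` is illustrative, not the library's implementation; it converts through a wide unsigned type, then bounds-checks against the target type — note that the widest value, e.g. 65535 for uint16_t, must still be accepted):

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Convert a numeric field to an unsigned type with explicit checks.
template <typename T>
T get_unsigned(const std::string& field) {
    if (!field.empty() && field[0] == '-')
        throw std::runtime_error("negative value for unsigned type");
    unsigned long long value = std::stoull(field);   // throws on non-numeric input
    if (value > std::numeric_limits<T>::max())
        throw std::out_of_range("value too large for target type");
    return static_cast<T>(value);
}
```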

Why is the first column ignored?

Windows 10, Visual Studio 2017
My data

EMPLOYEEKEY	FIRSTNAME	HIREDATE	LASTNAME	TITLE	
2	Kevin	2006-08-26	Brown	Marketing Assistant	
3	Roberto	2007-06-11	Tamburello	Engineering Manager	
4	Rob	2007-07-05	Walters	Senior Tool Designer	
5	Rob	2007-07-05	Walters	Senior Tool Designer	
6	Thierry	2007-07-11	D'Hers	Tool Designer	
7	David	2007-07-20	Bradley	Marketing Manager	
8	David	2007-07-20	Bradley	Marketing Manager	
9	JoLynn	2007-07-26	Dobney	Production Supervisor - WC60	
10	Ruth	2007-08-06	Ellerbrock	Production Technician - WC10

My Code

void f0()
{
	csv::Reader foo;

	foo.configure_dialect("my_dialect")
		.delimiter("\t")
		.quote_character('"')
		.double_quote(true)
		.skip_initial_space(false)
		.trim_characters(' ', '\t')
		//	.ignore_columns("foo", "bar")
		.header(true)
		.skip_empty_rows(true);

	foo.read("sample.csv");
	auto rows = foo.rows();
	for (auto& row : rows)
	{
		auto key = row["EMPLOYEEKEY"];
		auto fname = row["FIRSTNAME"];
		auto hdate = row["HIREDATE"];
		auto lname = row["LASTNAME"];
		auto title = row["TITLE"];
		std::cout << key << " " << fname << " " << hdate << " " << lname << " " << title << "\n";
	}

}

Output

Kevin 2006-08-26 Brown Marketing Assistant
Roberto 2007-06-11 Tamburello Engineering Manager
Rob 2007-07-05 Walters Senior Tool Designer
Rob 2007-07-05 Walters Senior Tool Designer
Thierry 2007-07-11 D'Hers Tool Designer
David 2007-07-20 Bradley Marketing Manager
David 2007-07-20 Bradley Marketing Manager
JoLynn 2007-07-26 Dobney Production Supervisor - WC60
Ruth 2007-08-06 Ellerbrock Production Technician - WC10

Incorrect values in empty fields starting a row

Hello. It looks like if a row starts with an empty field, that field and all subsequent empty fields get filled with the first non-empty field in the row.

E. g. parsing

category,subcategory,project name
,,foo-project
bar-category,,bar-project

gives

row 0
0 foo-project
1 foo-project
2 foo-project
row 1
0 bar-category
1 
2 bar-project

Example code used is

	std::string csvString(R"(category,subcategory,project name
,,foo-project
bar-category,,bar-project
	)");
	auto format = csv::CSVFormat();
	csv::CSVReader reader(format);
	reader.feed(csvString);
	reader.end_feed();
	auto rowNum = 0;
	for (auto row: reader) {
		qDebug() << "row" << rowNum;
		auto colNum = 0;
		for (auto col: row)	{
			qDebug() << colNum << col.get<>().c_str();
			colNum += 1;
		}
		rowNum += 1;
	}

	return 0;

try/catch not working on CSVReader object when strict parsing is enabled

The CSVFormat object is below

 CSVFormat csvFileFormat;
 csvFileFormat.column_names("name,age");
 csvFileFormat.strict_parsing();

The data in csv file is like

A,10
B,5000
C,30
D,100
E,100,5000

Note that the fifth row has three columns.

I am creating a CSVReader object, but the std::runtime_error is not caught during construction:

try {
        CSVReader reader(filePath,csvFileFormat);
}catch(std::runtime_error& e){
   std::cout << "Error" << std::endl;
}

My main motive is to get all the malformed records in the csv file so that I can write them to a separate file.

Unused-but-set variable

With -Werror, I get the following compiler error:

csv-parser/src/csv_reader.cpp: In member function ‘void csv::CSVGuesser::second_guess()’:
csv-parser/src/csv_reader.cpp:82:14: error: variable ‘current_delim’ set but not used [-Werror=unused-but-set-variable]
         char current_delim;
              ^

Bug: Delimiter guessing overwrites column names.

I observed that the col_names vector is overwritten when I pass in multiple delimiters.
This is because the CSVReader::CSVReader(csv::string_view filename, CSVFormat format) constructor overrides the format object that's passed in.

minor irritation -Wdefaulted-function-deleted warning...

clang++ --version
clang version 8.0.0-3 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu

gives these warnings with the latest master.

clang++ -O2  -Wall -Wshadow -std=c++17 -o build/corr corr.cpp -lpthread
In file included from corr.cpp:3:
/home/oliver/c/leet/include/csv.hpp:4151:9: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
        CSVReader(CSVReader&&) = default;     // Move constructor
        ^
/home/oliver/c/leet/include/csv.hpp:4302:20: note: move constructor of 'CSVReader' is implicitly deleted because field 'feed_lock' has a deleted move constructor
        std::mutex feed_lock;                /**< Allow only one worker to write */
                   ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/bits/std_mutex.h:97:5: note: 'mutex' has been explicitly marked deleted here
    mutex(const mutex&) = delete;
    ^
In file included from corr.cpp:3:
/home/oliver/c/leet/include/csv.hpp:4153:20: warning: explicitly defaulted move assignment operator is implicitly deleted [-Wdefaulted-function-deleted]
        CSVReader& operator=(CSVReader&& other) = default;
                   ^
/home/oliver/c/leet/include/csv.hpp:4302:20: note: move assignment operator of 'CSVReader' is implicitly deleted because field 'feed_lock' has a deleted move assignment operator
        std::mutex feed_lock;                /**< Allow only one worker to write */
                   ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/bits/std_mutex.h:98:12: note: 'operator=' has been explicitly marked deleted here
    mutex& operator=(const mutex&) = delete;

-Wno-defaulted-function-deleted suppresses them obviously.

This csv file crashes the parser

This file crashes the parser. The file is parsed correctly in Excel and OpenOffice/LibreOffice
My code:

void f0()
{
	csv::CSVFormat format;
	format.delimiter(',').quote('"').header_row(0);
	csv::CSVReader reader("problem.txt",format);
	auto column_names = reader.get_col_names();
	std::cout << column_names.size() << std::endl;
	for (auto& cv : column_names)
	{
		std::cout << cv << "\t";
	}
	std::cout << "\n";

	for (csv::CSVRow& row : reader)
	{
		std::cout << row.size() << std::endl;
		for (auto& rv : row)
		{
			std::cout << rv << "\t";
		}
		std::cout << "\n";
	}
}
The data:

ACCOUNT_TYPE,ACCOUNT_NUMBER,TRANSACTION_DATE,CHEQUE_NUMBER,DESCRIPTION1,DESCRIPTION2,CAD,USD
Chequing,07451-1007186,1/2/1987,,"Bill Payment","Purchase Order",-4.00,,
Saving,07451-1007186,1/29/1987,,"Account Payable Pmt","Mac 6000 INCO",210424.25,,
Chequing,07451-1007186,2/1/1987,,"Misc Payment","Purchase Order",-200.00,,
Chequing,07451-1007186,2/5/1987,,"Membership fees","VAT-Y 4007633",-917.33,,
Chequing,07451-1007186,2/5/1987,,"Membership fees","TXINS 4007659",-950.69,,
Saving,07451-1007186,2/26/1987,,"Account Payable Pmt","Mac 6000 INCO",79034.35,,
Chequing,07451-1007186,2/28/1987,,"Membership fees","VAT-Y 7453902",-7905.02,,
Chequing,07451-1007186,2/28/1987,,"Membership fees","TXINS 7454013",-823.93,,
Chequing,07451-1007186,3/1/1987,,"Bill Payment","Purchase Order",-8.00,,
Saving,07451-1007186,3/4/1987,,"Online transfer sent - 1872","Great Outdoors",-17000.00,,

Segfault(SIGSEV) from CSVStat::get_mins

Hi, I found that CSVStat::get_mins in include/internal/csv_stat.cpp (commit 6323ff8) crashes with the attached .csv file (test.csv). I think this may be related to lines 41-42 of include/internal/csv_stat.cpp. The crash was observed on Ubuntu 18.04.3 with kernel 4.15.0-72-generic on x86_64.

The crash can be reproduced by the following command:
$./csv_stat test.csv

Here's the crash stack trace, taken with GDB:

#0 0x00005555555da731 in std::__1::allocator::construct<long double, long double const&> (this=,
__p=0x555555a2e520, __args=) at /home/cockatiel01/LLVM/bin/../include/c++/v1/memory:1811
#1 std::__1::allocator_traits<std::__1::allocator >::__construct<long double, long double const&> (__a=...,
__p=0x555555a2e520, __args=) at /home/cockatiel01/LLVM/bin/../include/c++/v1/memory:1716
#2 std::__1::allocator_traits<std::__1::allocator >::construct<long double, long double const&> (__a=...,
__p=0x555555a2e520, __args=) at /home/cockatiel01/LLVM/bin/../include/c++/v1/memory:1562
#3 std::__1::vector<long double, std::__1::allocator >::__push_back_slow_path<long double const&> (
this=0x7fffffffdbc0, __x=) at /home/cockatiel01/LLVM/bin/../include/c++/v1/vector:1613
#4 0x00005555555c5821 in std::__1::vector<long double, std::__1::allocator >::push_back (this=,
__x=) at /home/cockatiel01/LLVM/bin/../include/c++/v1/vector:1632
#5 csv::CSVStat::get_mins (this=0x7fffffffdc50)
at /home/jihyunee/ang-csv-parser/csv-parser-fast/include/internal/csv_stat.cpp:52
#6 0x0000555555570604 in main (argc=, argv=)
at /home/jihyunee/ang-csv-parser/csv-parser-fast/programs/csv_stats.cpp:15

This crash was found with the Angora fuzzer; test.csv originated from ints_join.csv in the tests/data/fake_data directory.

Hope this helps.
test.csv.zip

Implement new API for handling malformed rows

Background

Currently, CSVReader rejects all rows that are not the same size as the predetermined header row. This causes issues when parsing CSV files which are not quite up to spec.

Although it is possible to handle weird rows by creating a subclass of CSVReader and overriding CSVReader::bad_row_handler, that's kind of annoying.

Solution

CSVFormat will get a new method called allow_variable_lengths(false). CSVReader will then simply not perform row length checking until read_row() is called. This may even lead to performance improvements as the nested if/else branches in CSVReader::write_record will no longer be necessary.

For the default case (reject different length rows), CSVReader will behave as it has before, i.e. bad rows are tossed out and ignored with no user intervention.

Behavior for Variable Length Rows

If a user wants to keep rows of different length but still use CSVReader's format-guessing ability, then when iterating over the read rows, the library will provide a size() method (and potentially others such as is_weird_length(), is_shorter(), etc.) so that the user can tell which rows are malformed.

Indexing Operator

If "foobar" is the name of the 16th column, and some malformed row has <16 columns, then row["foobar"] shall result in an error being thrown.

If a CSV mostly has 16 columns but some row has >16 columns, then the extra columns should only be retrieved using operator[](size_t) and not operator[](string). The CSVRow iterator should iterate through all entries of shorter and longer rows without crashing.
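From the user's side, the proposed size()-based checks could look like this standalone sketch, which uses a plain vector of strings in place of CSVRow (`RowCheck` and its method names are illustrative, not the proposed API):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Classify rows against the header width, as the proposed API might let users do.
struct RowCheck {
    size_t n_cols;  // number of columns in the header row

    bool is_short(const std::vector<std::string>& row) const { return row.size() < n_cols; }
    bool is_long(const std::vector<std::string>& row) const { return row.size() > n_cols; }
    bool is_malformed(const std::vector<std::string>& row) const { return row.size() != n_cols; }
};
```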

charconv isn't checked by cmake step

cmake complains about Doxygen being missing but goes on without it OK. During the make, I get this:

[ 45%] Building CXX object programs/CMakeFiles/csv_generator.dir/csv_generator.cpp.o
/home/chris/Apps/CsvParser/programs/csv_generator.cpp:2:10: fatal error: charconv: No such file or directory
#include <charconv>
^~~~~~~~~~
compilation terminated.
programs/CMakeFiles/csv_generator.dir/build.make:62: recipe for target 'programs/CMakeFiles/csv_generator.dir/csv_generator.cpp.o' failed

This is with clang++ 6.0.0-1ubuntu2 . That little ^ pointer is actually under "<" in the header reference.

BTW, what package includes that header?

Thanks
madGambol

single_include compilation error

First of all, thanks for doing this library.

When compiling files in single_include_test directory, the following compilation errors occurred:

$ g++ -pthread --std=c++14 -o file1 file1.cpp
In file included from my_header.hpp:2:0,
                 from file1.cpp:1:
csv.hpp:3975:28: error: enclosing class of constexpr non-static member function ‘bool csv::CSVRow::iterator::operator==(const csv::CSVRow::iterator&) const’ is not a literal type
             constexpr bool operator==(const iterator& other) const {
                            ^~~~~~~~
csv.hpp:3945:15: note: ‘csv::CSVRow::iterator’ is not literal because:
         class iterator {
               ^~~~~~~~
csv.hpp:3945:15: note:   ‘csv::CSVRow::iterator’ has a non-trivial destructor
csv.hpp:3979:28: error: enclosing class of constexpr non-static member function ‘bool csv::CSVRow::iterator::operator!=(const csv::CSVRow::iterator&) const’ is not a literal type
             constexpr bool operator!=(const iterator& other) const { return !operator==(other); }
                            ^~~~~~~~

To fix the errors above I've just modified the following lines:

(the lines commented are the original ones)

csv.hpp

        class iterator {
   ......
            /** Two iterators are equal if they point to the same field */
//|            constexpr bool operator==(const iterator& other) const {
            inline bool operator==(const iterator& other) const {
                return this->i == other.i;
            };

//|            constexpr bool operator!=(const iterator& other) const { return !operator==(other); }
            inline bool operator!=(const iterator& other) const { return !operator==(other); }

file1.hpp

//|int foobar(int argc, char** argv) {
int main(int argc, char** argv) {
    using namespace csv;

The file2.hpp is ok.

To-Do List

  • Fix memory issues detected by Valgrind
  • Add a reading test using a small CSV file (<50 lines)
  • Rewrite parser using string_views so it doesn't allocate a bunch of vectors
  • Add conversion to Markdown
  • Create a single-header version

Add functions for generating SQL CREATE TABLE commands

Add functions that take a filename as input, parse the data types of every column, and generate a CREATE TABLE command.

Suggested databases to support:

  • SQLite
  • PostgreSQL
  • MySQL

Suggestions

CSVStats can be used to determine the proper data types.
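A minimal sketch of the generation step, assuming per-column type names have already been determined (the dtype strings and the SQLite type mapping here are illustrative, not the library's actual enum):

```cpp
#include <string>
#include <utility>
#include <vector>

// Build a CREATE TABLE statement from (column name, dtype name) pairs.
std::string create_table_sql(const std::string& table,
                             const std::vector<std::pair<std::string, std::string>>& cols) {
    std::string sql = "CREATE TABLE " + table + " (";
    for (size_t i = 0; i < cols.size(); i++) {
        if (i) sql += ", ";
        const std::string& dtype = cols[i].second;
        // Map illustrative dtype names onto SQLite storage classes.
        std::string sql_type = (dtype == "int")    ? "INTEGER"
                             : (dtype == "double") ? "REAL"
                                                   : "TEXT";
        sql += cols[i].first + " " + sql_type;
    }
    return sql + ");";
}
```

PostgreSQL and MySQL variants would differ mainly in the type names (e.g. BIGINT, DOUBLE PRECISION) and identifier quoting.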

How can I get the list of records that don't contain the exact number of columns as provided in the header

Let's say the total number of header columns is 5:

  • First record has 5 comma separated values
  • Second record has 8 comma separated values
  • Third record has 2 comma separated values

From the above scenario, the second and third records are invalid: the second has more values than there are header columns, and the third has fewer.

I want to get these error records so that I can dump them into a new file.

Please help.

Cannot build master

Something is wrong with the master branch. I am unable to build (using cmake) in a clean environment.

docker run -it ubuntu bash
> apt-get update -y && apt-get upgrade -y
> apt-get install -y build-essential doxygen git cmake make python3-dev python3
> git clone https://github.com/vincentlaucsb/csv-parser.git
> cd csv-parser
> mkdir build
> cd build
> cmake -DCSV_CXX_STANDARD=11 ../
> make

This is the initial error message that appears:

/csv-parser/programs/data_type_bench.cpp:1:10: fatal error: charconv: No such file or directory
 #include <charconv>
          ^~~~~~~~~~
compilation terminated.
programs/CMakeFiles/data_type_bench.dir/build.make:62: recipe for target 'programs/CMakeFiles/data_type_bench.dir/data_type_bench.cpp.o' failed
make[2]: *** [programs/CMakeFiles/data_type_bench.dir/data_type_bench.cpp.o] Error 1
CMakeFiles/Makefile2:184: recipe for target 'programs/CMakeFiles/data_type_bench.dir/all' failed
make[1]: *** [programs/CMakeFiles/data_type_bench.dir/all] Error 2
Makefile:94: recipe for target 'all' failed
make: *** [all] Error 2

I see the same error when trying to build for C++17.
