ariafallah / csv-parser

Fast, header-only, extensively tested, C++11 CSV parser

License: MIT License

C++ 99.58% Makefile 0.05% JavaScript 0.04% Rust 0.12% CMake 0.22%
cpp csv parser csv-parser cpp11

csv-parser's Introduction

CSV Parser

Fast, header-only, C++11 CSV parser.

Usage

Configuration

You initialize the parser by passing it any input stream of characters. For example, you can read from a file:

std::ifstream f("some_file.csv");
CsvParser parser(f);

or you can read from stdin:

CsvParser parser(std::cin);

Moreover, you can configure the parser by chaining configuration methods:

CsvParser parser = CsvParser(std::cin)
  .delimiter(';')    // delimited by ; instead of ,
  .quote('\'')       // quoted fields use ' instead of "
  .terminator('\0'); // terminated by \0 instead of by \r\n, \n, or \r

Parsing

You can read from the CSV using a range-based for loop. Each row of the CSV is represented as a std::vector<std::string>.

#include <fstream>   // std::ifstream
#include <iostream>
#include "../parser.hpp"

using namespace aria::csv;

int main() {
  std::ifstream f("some_file.csv");
  CsvParser parser(f);

  for (auto& row : parser) {
    for (auto& field : row) {
      std::cout << field << " | ";
    }
    std::cout << std::endl;
  }
}

Behind the scenes, the range-based for loop only ever allocates as much memory as is needed to represent a single row of your CSV. If that's too much, you can step down to a lower level and read the CSV one field at a time, which only allocates the memory needed for a single field.

#include <iostream>
#include "./parser.hpp"

using namespace aria::csv;

int main() {
  CsvParser parser(std::cin);

  for (;;) {
    auto field = parser.next_field();
    switch (field.type) {
      case FieldType::DATA:
        std::cout << *field.data << " | ";
        break;
      case FieldType::ROW_END:
        std::cout << std::endl;
        break;
      case FieldType::CSV_END:
        std::cout << std::endl;
        return 0;
    }
  }
}

You can inspect the current cursor position with parser.position(), which returns the position of the last parsed token. This is useful for reporting things like progress through a file. To get the file size, call file.seekg(0, std::ios::end) and then read the resulting offset with file.tellg().

csv-parser's People

Contributors

ariafallah, benreinhold-nm, feliks-montez, fwuehr95, sebkajeka, sydmontague

csv-parser's Issues

field_object.type is wrong

Sorry, my English is not good.

I have a CSV file with two empty lines.

When I run the code below:

auto field = csv.next_field();
std::cout << (field.type == FieldType::DATA) << std::endl;

The result is 1 (true).
I think field.type should be FieldType::ROW_END instead of FieldType::DATA.

Parser can fail for large files (>128 KiB) due to stale reference

Description

When a parser is constructed, a reference to the std::istream is stored. However, the lifetime of the std::istream can be shorter than that of the parser, leaving the parser with a stale reference.

This won't affect small files (<= INPUTBUF_CAP), since those are buffered on construction. With larger files, however, the parser keeps reading from the stream as parsing proceeds. If the reference has become stale by that point, it will result in an error.

For example, this can happen when a parser is stored as a class member but constructed from a locally scoped std::ifstream.

Sample Code

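struct Test {
  std::unique_ptr<aria::csv::CsvParser> parser;

  Test(std::string path) {
    // BUG: `stream` is destroyed when this constructor returns,
    // but the parser keeps a reference to it.
    std::ifstream stream(path, std::ios::in);
    parser = std::make_unique<aria::csv::CsvParser>(stream);
  }
};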

Solutions/Workarounds

The problem can be avoided by making sure the std::istream doesn't go out of scope before the parser does, which makes this mostly a user error.

However, I think for a future revision it might be worthwhile to have the parser take ownership of the passed stream, for example in the form of a smart pointer.

Until then this issue might serve as help for the next poor fellow who falls into that trap. ;)

Empty fields at end of row are omitted

Using this example CSV:

F1,F2,F3,F4,F5,F6,F7
a,b,c,d,e,f,
A,B,C,D,E,F,G

Each row is supposed to contain 7 columns. However, the parser returns only 6 columns for the first data row. This seems to be caused by the last field being empty, i.e. the comma is directly followed by a CR LF.

This should not happen: according to RFC 4180, the last field in a record must not be followed by a comma. A trailing comma therefore starts one more field, and the last, empty field should be recognized as such instead of being skipped.

It only happens with unquoted CSV. a,b,c,d,e,f,"" would parse properly for example.


I was able to fix this issue locally by changing the finite state machine from

case State::START_OF_FIELD:
	m_cursor++;
	if (c == m_terminator) {
		handle_crlf(c);
		return Field(FieldType::ROW_END);
	}

to

case State::START_OF_FIELD:
	m_cursor++;
	if (c == m_terminator) {
		handle_crlf(c);
		m_state = State::END_OF_ROW;
		return Field(m_fieldbuf);
	}

As far as I understand, this change only affects the special case of a CR LF directly after a separator character, i.e. the case that seems to be handled improperly.
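To make the expected behavior concrete, here is a deliberately simplified splitter for a single unquoted record. It is not the library's state machine, and split_unquoted_record is a hypothetical name; it only illustrates the rule that a trailing delimiter implies one more, empty field.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified splitter for ONE unquoted record
// (no quoting, no line terminators); not csv-parser's implementation.
inline std::vector<std::string> split_unquoted_record(const std::string& line,
                                                      char delimiter = ',') {
  std::vector<std::string> fields;
  std::string current;
  for (char c : line) {
    if (c == delimiter) {
      fields.push_back(current);  // delimiter closes the current field
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  // Always emit the final field, even when it is empty: a trailing
  // delimiter means one more (empty) field follows it.
  fields.push_back(current);
  return fields;
}
```

Applied to the example above, both "a,b,c,d,e,f," and "A,B,C,D,E,F,G" split into 7 fields, the last field of the first record being empty.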

Thank You! Fastest CSV parser

I have tested about 25 different CSV parsers and this one was the best performing of them all. Thank you for writing a parser that is efficient.

Parsing UTF8 BOM files

Hi

I am trying:

	std::ifstream f(pathInfoDB);
	aria::csv::CsvParser parser(f);

	for (auto& row : parser) {
		for (auto& field : row) {
			CString a(field.c_str());
			::OutputDebugString(a);
		}
	}

But my files are UTF-8 with a BOM, and as a result I am getting gibberish at the beginning. How do I handle this?

'./catch.hpp' file not found

parser_test.cpp:4:10: fatal error: './catch.hpp' file not found
#include "./catch.hpp"
^~~~~~~~~~~~~
1 error generated.

Something is wrong with input stream

        explicit CsvParser(std::istream& input) : m_input(input) {
            // Reserve space upfront to improve performance
            m_fieldbuf.reserve(FIELDBUF_CAP);
            if (!m_input.good()) {
                throw std::runtime_error("Something is wrong with input stream");
            }
        }

What license is the code under?

Hello Aria,
you have produced nice code.
What is the license of your code?
Public Domain, GPL, BSD, LGPL, MIT, ...?

I couldn't get any information on it.

Thanks

Crash in assignment of `std::unique_ptr<char[]>`

Hi, thanks for the nice work. I have been using csv-parser for a while and it works well. However, with the latest update there is a crash with clang-11 on Ubuntu 20.04. I traced it in the debugger to the following line:

std::unique_ptr<char[]> m_inputbuf = std::unique_ptr<char[]>(new char[INPUTBUF_CAP]{});

The code crashes with SIGSEGV. I could not immediately see what's wrong.
