CSV Split

A fast command line program to split CSV files.

Usage

$ csv-split -h
Split CSV files by lines

VERSION: v0.0.2

USAGE:
        csv-split [OPTIONS] <FILE_TO_SPLIT>

OPTIONS:
        -n, --new-file-name <NEW_FILE>                          Name of the new files. This will be appended with an incremented number (default: "split")
        -e, --exclude-headers                                   Exclude headers in new files (default: false)
        -l, --line-count <COUNT>                                Number of lines per file (default: 1)
        -d, --delimiter <DELIMITER>                             Character used for column separation (default: ',')
        -r, --remove-columns <COL1>[<DELIM><COL2><DELIM>...]    Specify column names to be removed during processing.
        -i, --include-remainders                                Include remainder rows in the split files (default: false).
        -h, --help                                              Display this message

Run the executable with the desired inputs and flags. It creates a set of files named <NEW_FILE><INDEX>.csv that contain the rows of the original file in the range [INDEX*COUNT+2, (INDEX+1)*COUNT+2), where line 1 is the input header; for the default COUNT of 1 this is simply line INDEX+2.
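To make the line arithmetic concrete for a COUNT greater than 1, here is a small sketch. It assumes that INDEX starts at 0 and that every split file holds exactly COUNT rows; `row_range` and `rows_in_split` are hypothetical names for illustration and are not part of csv-split.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [first, end) of 1-based line numbers in the input file. */
typedef struct {
    size_t first;
    size_t end;
} row_range;

/* Hypothetical helper (not csv-split's code): which input lines end up in
 * split file `index` when each file holds `count` rows. Line 1 is the
 * header, so the first data row sits on line 2. Assumes index starts at 0. */
row_range rows_in_split(size_t index, size_t count)
{
    assert(count > 0);
    row_range r = { index * count + 2, (index + 1) * count + 2 };
    return r;
}
```

Under these assumptions, `rows_in_split(1, 100)` gives `first = 102` and `end = 202`, i.e. the second file covers input lines 102 through 201.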

The program processes the input line by line using getline, a POSIX function provided by the C library (for example glibc) rather than by the compiler itself. Therefore

  1. the memory usage depends only on the line length and the desired split length, and
  2. the time complexity is approximately O(L*I*W), where L is the number of lines in the input, I is the number of columns in the input and W is the number of characters in the longest word in the input. This is an upper bound; taking W to be the average word length in the input gives a more accurate estimate of the runtime.
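The line-by-line approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's actual source: `split_by_lines` is a hypothetical function, it ignores the -e, -d, -r and -i options, it numbers output files from 0, and it always writes the trailing partial file.

```c
#define _POSIX_C_SOURCE 200809L /* expose getline() on strict compilers */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical sketch (not csv-split's code): copy the data rows of `in`
 * into files named <prefix><index>.csv, `count` rows per file, repeating the
 * header line in each. Returns the number of files written, -1 on error. */
long split_by_lines(FILE *in, const char *prefix, size_t count)
{
    assert(in != NULL && prefix != NULL && count > 0);

    char *header = NULL, *line = NULL;
    size_t header_cap = 0, line_cap = 0;
    size_t rows_in_file = 0;
    long files = 0;
    int failed = 0;
    FILE *out = NULL;

    /* Line 1 of the input is the header. */
    if (getline(&header, &header_cap, in) == -1) {
        free(header);
        return 0; /* empty input: nothing to split */
    }

    /* Only one line is held in memory at a time; getline grows the
     * buffer as needed and reuses it on the next call. */
    while (getline(&line, &line_cap, in) != -1) {
        if (out == NULL) {
            char name[4096];
            int n = snprintf(name, sizeof name, "%s%ld.csv", prefix, files);
            if (n < 0 || (size_t)n >= sizeof name) { failed = 1; break; }
            out = fopen(name, "w");
            if (out == NULL) { failed = 1; break; }
            fputs(header, out);
            files++;
            rows_in_file = 0;
        }
        fputs(line, out);
        if (++rows_in_file == count) { /* file is full: start a new one */
            fclose(out);
            out = NULL;
        }
    }

    if (out != NULL)
        fclose(out);
    free(header);
    free(line);
    return failed ? -1 : files;
}
```

Called as `split_by_lines(fp, "split", 2)` on an input with a header and three data rows, this sketch would write `split0.csv` with two rows and `split1.csv` with one, each starting with the header.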

Installation

If you are using Arch Linux, you can use the AUR package.

Otherwise, it is recommended to use the latest release. You can either download the binary and run it directly, or download the archive, extract it and then follow the build instructions below.

If you are absolutely sure you want the newest (yet unstable) version, you can also clone the repository and change into its directory:

$ git clone https://github.com/miltfra/csv-split
$ cd csv-split

To create the binary in bin/:

$ make

To install the binary to /usr/local/bin/ (requires root privileges):

# make install

To install the binary to a local directory (e.g. $HOME/.local/bin):

$ make DESTDIR=$HOME/.local/bin install-local

Installations can be undone with the corresponding uninstall and uninstall-local commands.

To force a complete rebuild, use

$ make clean
$ make build

Testing

You can verify that everything compiled successfully by running

$ make check

after make.

If you want to verify that the software behaves as intended, run

$ make test

and check the output. If you find a bug, please file a bug report on GitHub.

About

This program is heavily inspired by imartingraham/csv-split. It makes use of the already existing interface and aims to improve on speed by switching from Ruby to C.

This software has not yet been fully tested and should therefore be used with care.

csv-split's Issues

Fix `remove_single_column` test on GitHub CI.

For some reason, the test is failing on GitHub but passing on my machine. The only difference I can see is that GitHub uses GCC 9 and I have GCC 10. Still, this should not be a problem as the issue seems to be a segfault which, as far as I understand, should not be fixed by using a newer GCC version.

I can fix the bug

Actually, everything is okay, but the code has a small problem.

Migrate to Lua for the CLI

It could make the code a lot easier and much more extensible if we used Lua to handle the command line parsing. We could then just pass the necessary configuration to the C API and still maintain the relevant performance.

Add option to read more than one line at a time

Currently, the program is bottlenecked by the fact that we are only reading one line at a time.

It might lead to serious performance gains if we limited the number of I/O operations (especially writes) by reading a variable number of lines. This number could be

  1. specified by the user,
  2. inferred from a given memory target and a line length heuristic, or
  3. inferred from a memory limit and actual line lengths and therefore dynamically change per batch.
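As a rough illustration of the third option, the sketch below gathers whole lines into one buffer until a soft byte limit is reached, so the number of lines per batch follows the actual line lengths. It is only an assumption of how such a feature might look: `line_batch` and `read_batch` are hypothetical names, and nothing like this exists in csv-split today.

```c
#define _POSIX_C_SOURCE 200809L /* expose getline() on strict compilers */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical batch buffer; zero-initialise before first use and free
 * `data` and `scratch` when done. */
typedef struct {
    char  *data;        /* concatenated lines, not NUL-terminated */
    size_t len;         /* bytes currently stored in data */
    size_t cap;         /* allocated size of data */
    size_t lines;       /* number of lines in the current batch */
    char  *scratch;     /* getline() buffer, reused across calls */
    size_t scratch_cap;
} line_batch;

/* Read whole lines from `in` until the batch holds at least `soft_limit`
 * bytes or the input ends. A batch can exceed the limit by at most one
 * line. Returns the number of lines read (0 at end of input), or
 * (size_t)-1 if memory runs out. */
size_t read_batch(FILE *in, line_batch *b, size_t soft_limit)
{
    assert(in != NULL && b != NULL && soft_limit > 0);
    ssize_t n;

    b->len = 0;
    b->lines = 0;
    while (b->len < soft_limit &&
           (n = getline(&b->scratch, &b->scratch_cap, in)) != -1) {
        size_t need = b->len + (size_t)n;
        if (need > b->cap) {
            /* Grow geometrically to keep the number of reallocations low. */
            char *p = realloc(b->data, need * 2);
            if (p == NULL)
                return (size_t)-1;
            b->data = p;
            b->cap = need * 2;
        }
        memcpy(b->data + b->len, b->scratch, (size_t)n);
        b->len = need;
        b->lines++;
    }
    return b->lines;
}
```

A caller could then emit each batch with a single `fwrite(b.data, 1, b.len, out)` instead of one write per line. Whether this actually improves throughput would need to be measured, since stdio already buffers reads and writes internally.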
