Comments (8)
@BurntSushi Very good, I appreciate your being understanding. I'll add some clarifying text. A fuller analysis is hard, as you say.
from tsv-utils.
@BurntSushi The "join benchmark" now describes the steps to reproduce the data should you wish to run this test.
Ah, sorry about the noise, I found them at the bottom of the post. Thanks!
No problem. It'd be fantastic to have someone else take a shot at reproducing these. You have some nice tools, by the way!
All righty, so I got around to trying a couple of the benchmarks. I only tried out your tools, csvtk and xsv. I skipped the join benchmark because I wasn't sure how to recreate the data. To produce `TREE_GRM_ESTN_14mil.csv`, I did:

```
$ xsv slice -e 14000000 TREE_GRM_ESTN.csv > TREE_GRM_ESTN_14mil.csv
```
I got my tsv-utils binaries from your releases, specifically `tsv-utils-dlang-v1.1.11_linux-x86_64_ldc2`. I got `csvtk` from their releases too, specifically 0.7.1. And I also grabbed `xsv` from my releases as well, specifically xsv 0.12.1. (Note that your benchmark uses xsv 0.10.x. There have been a number of performance improvements since then, some of which were motivated by your benchmark. :-))
Overall, I really liked your benchmark. It identified a few weak spots in `xsv` where it really should have been faster. With that said, I do have one very strong criticism to lodge against your benchmarks: you never actually point out that tools like `xsv` and, to a lesser extent, `csvtk`, are designed to handle CSV, while your tools require a much stricter format. (You do mention that the tools aren't exact equivalents, but I think a benchmark should point out where they differ and how those differences might impact the interpretation of the results.) From the docs of your `csv2tsv` tool:

> The key difference is that CSV uses escape sequences to represent newlines and field separators in the data, whereas TSV disallows these characters in the data.
I am lodging this criticism because this single design decision has wide reaching implications on the performance of the tools you're benchmarking. To be clear, I think the comparison itself is still interesting, because it shows that folks might benefit from shoving their data into a stricter TSV format if they can.
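To make the design difference concrete, here is a small illustration (with made-up data, not from the benchmark files) of what a conforming CSV parser must handle and what strict TSV simply forbids:

```python
import csv
import io

# A CSV record whose second field contains a quoted, embedded newline.
data = 'id,comment\n1,"line one\nline two"\n'

rows = list(csv.reader(io.StringIO(data)))
# A conforming CSV parser yields two records (header + one data row),
# even though the raw text spans three physical lines.
print(len(rows))       # 2
print(rows[1][1])      # 'line one\nline two'

# Strict TSV has no escaping mechanism, so a record is always exactly one
# physical line; splitting on '\n' and '\t' is sufficient (and fast).
tsv = "id\tcomment\n1\tline one line two\n"
records = [line.split("\t") for line in tsv.rstrip("\n").split("\n")]
print(len(records))    # 2
```

The CSV parser has to track quoting state byte by byte; the TSV path is two splits. That gap is where much of the performance difference below comes from.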
Note that I said "to a lesser extent, `csvtk`" above because it too does not actually handle arbitrary CSV data. From the `multicorecsv` parser's README, which is used inside `csvtk`:

> muticorecsv does not support reading CSV files with properly quoted/escaped newlines! If you have \n in your source data fields, multicorecsv Read() will not work for you.

This is also a critical assumption that `csvtk` makes that `xsv` does not make, because it impacts performance. This one assumption permits parallelizing the parsing of CSV data. You can't parallelize arbitrary CSV data without some kind of index, which is why the `xsv index` command exists. (It also calls into question the correctness of tools like ParaText, although I haven't really dug into that one yet.)
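The parallelization problem is easy to see with a tiny example (hypothetical data): a worker that starts parsing at an arbitrary newline has no way to know whether it is inside a quoted field.

```python
import csv
import io

# CSV with a quoted, embedded newline in the second record.
data = 'a,b\n1,"x\ny"\n2,z\n'

# A naive "split at newlines and hand chunks to workers" scheme sees 4 lines...
naive_lines = data.rstrip("\n").split("\n")
print(len(naive_lines))   # 4

# ...but a real CSV parse yields only 3 records. A worker handed the
# physical line 'y"' would begin mid-field with no way to detect it.
records = list(csv.reader(io.StringIO(data)))
print(len(records))       # 3
```

An index of record offsets (what `xsv index` builds) sidesteps this by recording boundaries during one sequential pass, after which chunks can be parsed independently.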
One other thing: the results in your blog post are somewhat difficult to read, since your tables use labels like "Toolkit 1", but as far as I can see, you never actually say what "Toolkit 1" is?
OK, with that out of the way, I figured I'd share my data. Sorry if I sound bitter! (I've spent a lot of time making CSV parsing fast while handling all the corner cases that, say, Python's CSV parser handles.) Nevertheless, nice work on the tsv utilities, they are quite fast! :-) And they certainly make me wonder whether `xsv` might benefit from supporting a stricter format! :P
For regular expression filtering:
| command | time |
|---|---|
| `csvtk grep -t -l -f 10 -r -p '[RD].*(ION[0-2])' < TREE_GRM_ESTN_14mil.tsv > /dev/null` | 21.7s |
| `xsv search -s COMPONENT '[RD].*(ION[0-2])' TREE_GRM_ESTN_14mil.tsv > /dev/null` | 6.2s |
| `tsv-filter -H --regex 10:'[RD].*(ION[0-2])' TREE_GRM_ESTN_14mil.tsv > /dev/null` | 7.5s |
`csvtk`'s performance isn't too surprising, since Go's regexp engine isn't that fast. From looking at a profile of `tsv-filter`, it looks like D's regex engine is also holding you back, because you really ought to be faster than `xsv` here.
For column selection:
| command | time |
|---|---|
| `csvtk cut -t -l -f 1,8,19 all_train.tsv > /dev/null` | 24.1s |
| `xsv select 1,8,19 all_train.tsv > /dev/null` | 6.5s |
| `tsv-select -f 1,8,19 all_train.tsv > /dev/null` | 3.5s |
This one isn't too interesting. `tsv-select` is faster because it makes stronger assumptions about the format of the data, so it can do a lot less work.
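As a rough sketch of why the stricter format does less work (this is illustrative pseudocode in Python, not tsv-utils's actual D implementation): strict TSV column selection reduces to split, index, join, with no quoting state machine and no unescaping.

```python
def tsv_select(lines, field_indices):
    """Select 1-based columns from strict TSV records.

    Valid only because strict TSV guarantees that '\\t' and '\\n'
    never appear inside a field, so split() is a correct parser.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[i - 1] for i in field_indices)

lines = ["a\tb\tc\td\n", "1\t2\t3\t4\n"]
print(list(tsv_select(lines, [1, 3])))   # ['a\tc', '1\t3']
```

A CSV equivalent must instead scan every byte tracking quote state before it can even find the field boundaries.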
For summary statistics:
| command | time |
|---|---|
| `csvtk stats2 -t -l -f 3,5,20 all_train.tsv > /dev/null` | 36.1s |
| `xsv stats -s 3,5,20 all_train.tsv > /dev/null` | 32.244s |
| `xsv stats -s 3,5,20 all_train.tsv > /dev/null` (with index) | 3.9s |
| `tsv-summarize -H --count --sum 3,5,20 --min 3,5,20 --max 3,5,20 --mean 3,5,20 --stdev 3,5,20 all_train.tsv > /dev/null` | 15.4s |
This is an interesting one. Your single-threaded performance is quite impressive, and from profiling, it looks like there might be room for improvement in Rust's parsing of floating point numbers. `xsv` works really well with indexing because it enables parallelism, and I'm somewhat disappointed you didn't mention it in your article. :-( Indexing is fast and cheap, and it makes commands like `stats`, `join`, `frequency`, and `slice` much faster.
Given that you only handle strict TSV formats, it seems like you could probably benefit from parallelism on this one without any sort of indexing.
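A minimal sketch of that idea (a hypothetical helper, not tsv-utils code): because strict TSV guarantees one record per physical line, a file can be chunked for parallel workers without an index, just by nudging each approximate split point forward to the next newline.

```python
def chunk_offsets(data: bytes, n_chunks: int):
    """Split TSV bytes into up to n_chunks ranges aligned on record
    boundaries. Correct only because strict TSV forbids newlines
    inside fields -- every '\\n' really is a record boundary."""
    approx = len(data) // n_chunks
    offsets = [0]
    pos = approx
    while pos < len(data) and len(offsets) < n_chunks:
        nl = data.find(b"\n", pos)
        if nl == -1:
            break
        offsets.append(nl + 1)        # next chunk starts after the newline
        pos = nl + 1 + approx
    offsets.append(len(data))
    return list(zip(offsets, offsets[1:]))

data = b"1\taa\n2\tbb\n3\tcc\n4\tdd\n"
ranges = chunk_offsets(data, 2)
# Every chunk starts at a record boundary, so each range can be
# parsed by an independent worker with no shared state.
print(all(s == 0 or data[s - 1:s] == b"\n" for s, _ in ranges))  # True
```

For CSV, as noted above, the same trick is unsound without an index, because a newline found this way may be inside a quoted field.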
For CSV to TSV conversion:
| command | time |
|---|---|
| `csvtk csv2tab TREE_GRM_ESTN_14mil.csv > /dev/null` | 34.6s |
| `xsv fmt -t'\t' TREE_GRM_ESTN_14mil.csv > /dev/null` | 18.9s |
| `csv2tsv TREE_GRM_ESTN_14mil.csv > /dev/null` | 24.9s |
This benchmark was interesting, because up until very recently, `xsv` performed really terribly, and it was mostly because I neglected to ever try to optimize CSV writing. As you can see, that's fixed now. :-)
Excellent! I'm glad someone tried to reproduce these. And, seems like your tools have gotten faster since I ran my benchmarks. Very good!
A big picture thing I'd like to communicate: the purpose behind my benchmarks was not to identify which toolkit is the fastest. And my reason for anonymizing the timings of what I called the "specialty toolkits" was to avoid a shootout, and especially language flame wars. Also, it appears most of the specialty toolkits are written by one or two people who, frankly, are doing a service by open-sourcing their software. There is simply no reason to bash people who are doing this.
Then what was the purpose of the benchmarks? I was doing an evaluation of D. I wanted to see what the performance would be like writing in a somewhat obvious style, using standard libraries, etc. Think of a large software team at a company. To get an idea of what might be expected, I needed some baselines. I took every natively compiled thing I could find that had equivalent functionality. And I tested these tools, including yours, after completing mine. That is, I didn't study other tools and figure out what was needed to beat them.
So, from my perspective, what was significant was that the D programs did well against so many different implementations, written by a number of people. I was shocked they finished first in every metric I tried, except for csv-to-tsv conversion, which is far slower than it should be for reasons I haven't identified yet. And some tool should be faster on the "join" metric; what I wrote could be made much faster.
Now, as you point out, the different tools have made different functionality choices that affect performance. The CSV tools need to handle escape characters, and even with a "TSV" mode, it's asking an awful lot to support both and optimize the performance of both. And the Awk family of tools handles an arbitrary expression stack. My tools (tsv-filter) do not. Handling an arbitrary expression stack will be slower, despite very serious attempts to optimize it (e.g. mawk). (And yes, both using TSV and not supporting arbitrary expressions are deliberate design decisions for these reasons.)
As to why the performance benchmarks page doesn't go into more detail about both issues: mainly, the page is already too long. The page does draw the conclusion that D shows up very favorably on the performance front. And one should infer that processing TSV should be faster than CSV, but that's hardly a new observation. However, the benchmarks certainly do not conclude that D is "faster" than another language (in this case, C, Rust, or Go). The same goes for the individual tools, especially those that handle CSV and arbitrary expression trees.
Perhaps, though, trying to keep the page from growing longer was a mistake; I'll take a look and see if I can add a bit more of an explanation.
By the way, to me, an interesting comparison is GNU DataMash. It should be faster, at least when data is in sorted order. It's not. This partly says that you can't draw conclusions from a single comparison point.
I'm happy to have further conversations about these topics. For the next several weeks I'm going to have trouble responding quickly, so don't take silence the wrong way.
@jondegenhardt Thanks for the response! To clarify: I care less about "which language is faster" and a lot more about "readers should be given enough information to interpret what the benchmark results mean." My tool happened to be involved, so I'm bleating about it, that's all. :-) I think a few clarifying sentences is really all you need, although I think the best case is providing an analysis of the results. But, having done that myself for other tools, that's not a reasonable request because of how much time it takes. It's hard.
@jondegenhardt Thanks! :-)