keenerd / gz-sort
A utility for sorting really big files. http://kmkeen.com/gz-sort/
License: GNU General Public License v3.0
Fedora:
$ git rev-parse HEAD; uname -a; cc --version; make test
c596bcb2921430d476a61bf0f6852cd84ad46e84
Linux sijmens-pc.home 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
cc (GCC) 8.0.1 20180324 (Red Hat 8.0.1-0.20)
Copyright © 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
cc -O3 -std=gnu99 -Wall -Werror -pedantic -Wextra -pthread -c -o gz-sort.o gz-sort.c
cc gz-sort.o -lz -lpthread -o gz-sort
rm -f tests/*.gz
./tests/_setup.sh
./tests/_run-all.sh
./tests/small_unique.sh
removed 2 non-unique lines
./tests/small_sort.sh
./tests/simple.sh
./tests/pass-through.sh
./tests/small_buffer.sh
removed 25218 non-unique lines
ERROR - ./tests/small_buffer.sh (unique)
./tests/unique.sh
removed 25218 non-unique lines
ERROR - ./tests/unique.sh
./tests/threaded.sh
removed 25218 non-unique lines
ERROR - ./tests/threaded.sh (unique)
rm -f tests/*.gz
OpenBSD (on my wip/openbsd branch):
$ git rev-parse HEAD; uname -a; cc --version; make test
ce2982fd81eef3278cc8ae6f2b3031c69c8410a8
OpenBSD openbsd.[REDACTED] 6.3 GENERIC#100 amd64
OpenBSD clang version 5.0.1 (tags/RELEASE_501/final) (based on LLVM 5.0.1)
Target: amd64-unknown-openbsd6.3
Thread model: posix
InstalledDir: /usr/bin
rm -f tests/*.gz
./tests/_setup.sh
./tests/_run-all.sh
./tests/pass-through.sh
./tests/simple.sh
./tests/small_buffer.sh
removed 1520 non-unique lines
./tests/small_sort.sh
tput: not enough arguments (3) for capability `setaf'
ERROR - ./tests/small_sort.sh
./tests/small_unique.sh
removed 2 non-unique lines
./tests/threaded.sh
removed 1520 non-unique lines
./tests/unique.sh
removed 1520 non-unique lines
rm -f tests/*.gz
(Interestingly, this only happened on the second run on OpenBSD. The first run went fine.)
Could your script sort on part of the uncompressed row instead of operating on the whole line/record? For instance, on a TSV file, could I parse out columns 2 and 5, use those as my distinct sorted/unique record, and output only that portion to a destination?
Also, it would be fantastic if this script could run across multiple machines in parallel (in a clustered environment) to coordinate sorts/uniques. That might speed things up tremendously. I work in a nearly "all Windows, all the time" environment, so I'd love to port this thing too (but maybe Cygwin will be my savior).
Another odd thing to mention (it could be of interest to you or to future readers obsessed with compression/decompression matters): a while back I fell down the rabbit hole of compression libraries and stumbled on a number of alternatives to the ever-popular zlib/gzip. Check out "pigz" (http://zlib.net/pigz/); it's incredibly fast on multi-core machines (source #1: http://vbtechsupport.com/2094/ source #2: http://vbtechsupport.com/1614/).
Sorry to drop this in the "issues" area. GitHub really needs a community notes/features/tips/comments area or something; I'm not sure this is the best forum for this input.
Also, I picked up on your script via "onethingwell.org". This looks very interesting!
(here's the link: http://onethingwell.org/post/139856185833/gz-sort )
Thanks again,
-Doug Jr.
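For readers wondering what the column-based request above might look like, here is a minimal sketch in C. It is not gz-sort code: `extract_key` and `sort_unique` are hypothetical helpers made up for illustration, everything is held in memory (so it does not scale the way gz-sort is meant to), and reading lines out of the .gz stream is left out entirely.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of gz-sort): copy the 1-based TSV columns
 * col_a and col_b of `line` into `out` as "a<TAB>b".
 * Returns 0 on success, -1 if a column is missing or `out` is too small. */
int extract_key(const char *line, int col_a, int col_b,
                char *out, size_t out_size)
{
    const char *start[2] = {NULL, NULL};
    size_t len[2] = {0, 0};
    int wanted[2] = {col_a, col_b};
    const char *p = line;
    int col = 1;

    for (;;) {
        size_t n = strcspn(p, "\t\n");   /* length of the current field */
        for (int k = 0; k < 2; k++) {
            if (col == wanted[k]) {
                start[k] = p;
                len[k] = n;
            }
        }
        if (p[n] != '\t')                /* newline or end of string */
            break;
        p += n + 1;
        col++;
    }
    if (!start[0] || !start[1] || len[0] + len[1] + 2 > out_size)
        return -1;
    memcpy(out, start[0], len[0]);
    out[len[0]] = '\t';
    memcpy(out + len[0] + 1, start[1], len[1]);
    out[len[0] + 1 + len[1]] = '\0';
    return 0;
}

/* qsort comparator: plain byte order. */
static int cmp_key(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the keys and drop adjacent duplicates in place.
 * Returns the number of unique keys left at the front of the array.
 * Dropped pointers are not freed; the caller still owns them. */
size_t sort_unique(const char **keys, size_t n)
{
    if (n == 0)
        return 0;
    qsort(keys, n, sizeof *keys, cmp_key);
    size_t kept = 1;
    for (size_t i = 1; i < n; i++) {
        if (strcmp(keys[i], keys[kept - 1]) != 0)
            keys[kept++] = keys[i];
    }
    return kept;
}
```

A caller would run `extract_key(line, 2, 5, buf, sizeof buf)` on each input line, collect copies of the resulting keys, and pass them to `sort_unique` before writing them out.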
Hi, would it be possible for you to publish gz-sort as a standard Linux package that can be installed system-wide with a standard apt-get install? Thanks!
Any chance you'd be willing to switch to the less-restrictive MIT license?
I am noticing that UTF input files sort much differently in gz-sort than in GNU sort. Is this a known issue, and is there a plan to make gz-sort a drop-in replacement for GNU sort with respect to UTF input? Thanks!
Note: I am attaching a sample input file.
$ cat utf1000.txt | sort | md5sum
e48750df42a4b31030f63d7b61ab2bc7 -
$ cat utf1000.txt | gz-sort | md5sum
eb25fdf69e602183470f7377e0864b62 -
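One possible explanation for the differing checksums is collation. GNU sort orders lines according to the current locale, whereas a tool that compares raw bytes will not. That gz-sort compares raw bytes is an assumption here; nothing in this thread confirms it. Running `LC_ALL=C sort` on the same file would be a quick way to check. The sketch below shows the two comparison styles side by side; `cmp_bytes`, `cmp_locale` and `sort_lines` are hypothetical names, not functions from gz-sort.

```c
#include <stdlib.h>
#include <string.h>

/* Byte-order comparison: strcmp compares bytes as unsigned char,
 * so every multi-byte UTF-8 sequence sorts after all ASCII characters. */
int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Locale-aware comparison: the order depends on LC_COLLATE.
 * Unless the program has called setlocale(LC_COLLATE, "") first,
 * the "C" locale is in effect and this behaves like strcmp. */
int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Sort `lines` in place; a non-zero use_locale selects locale collation. */
void sort_lines(const char **lines, size_t n, int use_locale)
{
    qsort(lines, n, sizeof *lines, use_locale ? cmp_locale : cmp_bytes);
}
```

With byte order, the lines "z", "é", "a", "Z" come out as "Z", "a", "z", "é". The locale-aware result depends on which locale is installed and active, which is why the two md5sums can differ on the same input.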
$ make
cc -O3 -std=gnu99 -Wall -Werror -pedantic -Wextra -pthread -c -o gz-sort.o gz-sort.c
gz-sort.c: In function ‘nway_chop_and_presort’:
gz-sort.c:409:5: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&report, "%s line count: %ld\n%s %s", misc->label, out.line_counter, misc->label, "chop/presort");
^
gz-sort.c: In function ‘first_pass’:
gz-sort.c:433:5: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&report, "%s line count: %ld\n%s %s", misc->label, in1.line_counter, misc->label, label2);
^
gz-sort.c: In function ‘middle_passes’:
gz-sort.c:558:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&report, "%s merge %ld", misc->label, average);
^
gz-sort.c: In function ‘nway_merge_pass’:
gz-sort.c:679:5: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&report, "%i-way merge", misc->nway);
^
gz-sort.c: In function ‘main’:
gz-sort.c:791:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&temp_path, "%s.temp", output_path);
^
gz-sort.c:810:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&(nway_table[i].label), "T%i", i+1);
^
gz-sort.c:811:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&(nway_table[i].in_path), "%s.T%i.temp", output_path, i+1);
^
gz-sort.c:812:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
asprintf(&(nway_table[i].out_path), "%s.T%i.gz", output_path, i+1);
^
cc1: all warnings being treated as errors
<builtin>: recipe for target 'gz-sort.o' failed
make: *** [gz-sort.o] Error 1
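The build fails because `-Werror` promotes the unused-result warning on `asprintf` to an error. One way to satisfy the compiler is to actually check the return value. The sketch below does that in a small wrapper; `xasprintf` is a hypothetical helper, not something that exists in gz-sort.c, and exiting on allocation failure is an assumption about what the program would want to do.

```c
#define _GNU_SOURCE   /* vasprintf is a GNU/BSD extension */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly malloc'd string.
 * The return value of vasprintf is checked, so the compiler has
 * nothing to warn about. Exits if the allocation fails.
 * The caller owns the result and should free() it. */
char *xasprintf(const char *fmt, ...)
{
    va_list ap;
    char *out = NULL;

    va_start(ap, fmt);
    int n = vasprintf(&out, fmt, ap);
    va_end(ap);

    if (n < 0) {
        fputs("xasprintf: allocation failed\n", stderr);
        exit(EXIT_FAILURE);
    }
    return out;
}
```

Under that assumption, a call such as the one flagged at gz-sort.c:679 could be written as `report = xasprintf("%i-way merge", misc->nway);`.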
-u (uniq) option doesn't even work