gz-sort's People

Contributors

keenerd, sjmulder

gz-sort's Issues

Test failures on Fedora and OpenBSD

Fedora:

$ git rev-parse HEAD; uname -a; cc --version; make test
c596bcb2921430d476a61bf0f6852cd84ad46e84
Linux sijmens-pc.home 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
cc (GCC) 8.0.1 20180324 (Red Hat 8.0.1-0.20)
Copyright © 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

cc -O3 -std=gnu99 -Wall -Werror -pedantic -Wextra -pthread    -c -o gz-sort.o gz-sort.c
cc   gz-sort.o  -lz -lpthread -o gz-sort
rm -f tests/*.gz
./tests/_setup.sh
./tests/_run-all.sh
./tests/small_unique.sh
removed 2 non-unique lines
./tests/small_sort.sh
./tests/simple.sh
./tests/pass-through.sh
./tests/small_buffer.sh
removed 25218 non-unique lines
ERROR - ./tests/small_buffer.sh (unique)
./tests/unique.sh
removed 25218 non-unique lines
ERROR - ./tests/unique.sh
./tests/threaded.sh
removed 25218 non-unique lines
ERROR - ./tests/threaded.sh (unique)
rm -f tests/*.gz

OpenBSD (on my wip/openbsd branch):

$ git rev-parse HEAD; uname -a; cc --version; make test
ce2982fd81eef3278cc8ae6f2b3031c69c8410a8
OpenBSD openbsd.[REDACTED] 6.3 GENERIC#100 amd64
OpenBSD clang version 5.0.1 (tags/RELEASE_501/final) (based on LLVM 5.0.1)
Target: amd64-unknown-openbsd6.3
Thread model: posix
InstalledDir: /usr/bin
rm -f tests/*.gz
./tests/_setup.sh
./tests/_run-all.sh
./tests/pass-through.sh
./tests/simple.sh
./tests/small_buffer.sh
removed 1520 non-unique lines
./tests/small_sort.sh
tput: not enough arguments (3) for capability `setaf'
ERROR - ./tests/small_sort.sh
./tests/small_unique.sh
removed 2 non-unique lines
./tests/threaded.sh
removed 1520 non-unique lines
./tests/unique.sh
removed 1520 non-unique lines
rm -f tests/*.gz

(Interestingly, this only happened on the second run on OpenBSD. The first run went fine.)

Some feature requests -- not an issue

Could gz-sort operate on selected fields of each uncompressed record rather than on the whole line? For instance, given a TSV file, could I extract columns 2 and 5, use them as the record to sort and deduplicate, and write only that portion to the destination?
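To make the request concrete, here is a minimal sketch of the key extraction being asked for. It is not taken from gz-sort; `find_field` and `make_key` are hypothetical names, and the sketch assumes plain tab-separated input with no quoting or escaping.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the 1-based tab-separated
   column `col` of `line` and store its length in *len.
   Returns NULL if the line has fewer columns than requested. */
static const char *find_field(const char *line, int col, size_t *len)
{
    const char *p = line;

    if (col < 1)
        return NULL;
    for (int i = 1; i < col; i++) {
        p = strchr(p, '\t');
        if (p == NULL)
            return NULL;
        p++;                          /* step past the tab */
    }
    *len = strcspn(p, "\t\n");        /* field ends at a tab, newline or NUL */
    return p;
}

/* Hypothetical helper: build "colA<TAB>colB" in `key`.
   Returns 0 on success, -1 if a column is missing or `key` is too small. */
static int make_key(const char *line, int col_a, int col_b,
                    char *key, size_t key_size)
{
    size_t la, lb;
    const char *a = find_field(line, col_a, &la);
    const char *b = find_field(line, col_b, &lb);

    if (a == NULL || b == NULL || la + lb + 2 > key_size)
        return -1;
    memcpy(key, a, la);
    key[la] = '\t';
    memcpy(key + la + 1, b, lb);
    key[la + lb + 1] = '\0';
    return 0;
}
```

Sorting and deduplicating would then run on the extracted keys instead of the full lines.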

Also, it would be fantastic if this tool could run in parallel across multiple machines in a cluster, coordinating the sorts and uniques. That might speed things up tremendously. I work in a nearly "all Windows, all the time" environment, so I'd love to port it too (maybe Cygwin will be my savior).

Another thing that may interest you or future readers who care about compression: a while back I fell down the rabbit hole of compression libraries and found a number of alternatives to the ever-popular zlib/gzip. Check out "pigz" (http://zlib.net/pigz/); it's incredibly fast on multi-core machines (source #1: http://vbtechsupport.com/2094/ source #2: http://vbtechsupport.com/1614/ )

Sorry to drop this in the "issues" area. GitHub really needs a community notes/features/tips/comments area; I'm not sure this is the best forum for this input.

Also, I found gz-sort via "onethingwell.org"; it looks very interesting!
(Here's the link: http://onethingwell.org/post/139856185833/gz-sort )

Thanks again,
-Doug Jr.

apt-get install gz-sort?

Hi, would it be possible for you to publish gz-sort as a standard Linux package that can be installed system-wide with apt-get install? Thanks!

MIT License?

Any chance you'd be willing to switch to the less-restrictive MIT license?

UTF sorting works differently from GNU sort?

I am noticing that UTF input files sort very differently in gz-sort than in GNU sort. Is this a known issue, and is there a plan to make gz-sort a drop-in replacement for GNU sort with respect to UTF input? Thanks!

Note: I am attaching a sample input file.

$ cat utf1000.txt | sort | md5sum
e48750df42a4b31030f63d7b61ab2bc7 -

$ cat utf1000.txt | gz-sort | md5sum
eb25fdf69e602183470f7377e0864b62 -

utf1000.txt
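One possible explanation, not confirmed anywhere in this thread, is that the two tools use different comparison rules. GNU sort collates according to the current locale unless `LC_ALL=C` is set, whereas a tool that compares raw bytes (an assumption about gz-sort here) would behave like `strcmp`. The sketch below shows the two comparators side by side; `cmp_bytes` and `cmp_locale` are hypothetical names, not gz-sort functions.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: raw byte order, i.e. what `LC_ALL=C sort` produces. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort comparator: order defined by the current LC_COLLATE locale.
   The caller selects the locale first, e.g. setlocale(LC_COLLATE, ""). */
static int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}
```

If this is the cause, running `cat utf1000.txt | LC_ALL=C sort | md5sum` should give a checksum that matches the gz-sort one; if it does not, the difference lies elsewhere.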

Unable to compile

$ make
cc -O3 -std=gnu99 -Wall -Werror -pedantic -Wextra -pthread    -c -o gz-sort.o gz-sort.c
gz-sort.c: In function ‘nway_chop_and_presort’:
gz-sort.c:409:5: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
     asprintf(&report, "%s line count: %ld\n%s %s", misc->label, out.line_counter, misc->label, "chop/presort");
     ^
gz-sort.c: In function ‘first_pass’:
gz-sort.c:433:5: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
     asprintf(&report, "%s line count: %ld\n%s %s", misc->label, in1.line_counter, misc->label, label2);
     ^
gz-sort.c: In function ‘middle_passes’:
gz-sort.c:558:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
         asprintf(&report, "%s merge %ld", misc->label, average);
         ^
gz-sort.c: In function ‘nway_merge_pass’:
gz-sort.c:679:5: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
     asprintf(&report, "%i-way merge", misc->nway);
     ^
gz-sort.c: In function ‘main’:
gz-sort.c:791:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
         asprintf(&temp_path, "%s.temp", output_path);
         ^
gz-sort.c:810:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
         asprintf(&(nway_table[i].label), "T%i", i+1);
         ^
gz-sort.c:811:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
         asprintf(&(nway_table[i].in_path), "%s.T%i.temp", output_path, i+1);
         ^
gz-sort.c:812:9: error: ignoring return value of ‘asprintf’, declared with attribute warn_unused_result [-Werror=unused-result]
         asprintf(&(nway_table[i].out_path), "%s.T%i.gz",  output_path, i+1);
         ^
cc1: all warnings being treated as errors
<builtin>: recipe for target 'gz-sort.o' failed
make: *** [gz-sort.o] Error 1
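The build fails because `-Werror` promotes the unused-result warning to an error, so each `asprintf` return value has to be checked. A minimal sketch of one way to do that is a small wrapper that exits on failure; `xasprintf` is a hypothetical helper, not something that exists in gz-sort.c, and it assumes a libc that provides `vasprintf` (glibc and the BSDs do).

```c
#define _GNU_SOURCE   /* asprintf/vasprintf are GNU/BSD extensions */
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated string.
   Never returns NULL; exits if the allocation or formatting fails.
   The caller owns the result and must free() it. */
static char *xasprintf(const char *fmt, ...)
{
    char *out = NULL;
    va_list ap;
    int n;

    assert(fmt != NULL);
    va_start(ap, fmt);
    n = vasprintf(&out, fmt, ap);   /* returns -1 on failure */
    va_end(ap);
    if (n < 0) {
        fputs("out of memory while formatting a string\n", stderr);
        exit(EXIT_FAILURE);
    }
    return out;
}
```

A call such as `asprintf(&temp_path, "%s.temp", output_path);` would then become `temp_path = xasprintf("%s.temp", output_path);`. Checking `if (asprintf(...) < 0)` at each call site works equally well; the wrapper only avoids repeating the error handling eight times.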
