mayank-02 / msort
Sort lines of text files
License: MIT License
Input file attached...
$ /tmp/msort.new uuid.10k.txt > /dev/null
Segmentation fault (core dumped)
Backtrace,
$ gdb /tmp/msort.new core
...
Reading symbols from /tmp/msort.new...
[New LWP 2773816]
Core was generated by `/tmp/msort.new uuid.10k.txt'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000056143a4bedf6 in merge (f1=<error reading variable: Cannot access memory at address 0x7ffe1f355de8>,
f2=<error reading variable: Cannot access memory at address 0x7ffe1f355de0>,
f3=<error reading variable: Cannot access memory at address 0x7ffe1f355dd8>) at msort.c:751
751 void merge(FILE *f1, FILE *f2, FILE *f3) {
...
(gdb) bt
#0 0x000056143a4bedf6 in merge (f1=<error reading variable: Cannot access memory at address 0x7ffe1f355de8>,
f2=<error reading variable: Cannot access memory at address 0x7ffe1f355de0>,
f3=<error reading variable: Cannot access memory at address 0x7ffe1f355dd8>) at msort.c:751
#1 0x000056143a4bf789 in handleMerges (numFiles=3, x=114 'r') at msort.c:957
#2 0x000056143a4bffef in main (argc=2, argv=0x7ffe20356568) at msort.c:1182
See the commentary for -n here,
https://www.unix.com/man-page/posix/1p/sort/
$ cat numeric-format-decimals.txt
.6
.19
0.10
.11
.5
0.55
.17
.12
With GNU sort,
$ sort -n numeric-format-decimals.txt
0.10
.11
.12
.17
.19
.5
0.55
.6
With Msort,
$ /tmp/msort -n numeric-format-decimals.txt
.6
.19
0.10
.11
.5
0.55
.17
.12
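For comparison, here is a minimal sketch of a numeric comparator that accepts values with no leading zero. The name compare_numeric is hypothetical and not taken from msort.c, and the sketch assumes the lines are held as an array of char pointers and sorted with qsort. strtod only approximates -n: it also accepts exponents, hexadecimal and inf/nan, and it follows the current locale's radix character.

```c
#include <stdlib.h>

/* Hypothetical comparator (not from msort.c): orders two lines by their
 * leading numeric value. strtod parses ".5" the same as "0.5"; a line
 * with no leading number converts to 0.0. */
int compare_numeric(const void *a, const void *b)
{
    const char *lhs = *(const char *const *)a;
    const char *rhs = *(const char *const *)b;
    double x = strtod(lhs, NULL);
    double y = strtod(rhs, NULL);

    /* Yields -1, 0 or 1 without the overflow risk of subtraction. */
    return (x > y) - (x < y);
}
```

Passing this to qsort over the eight lines above should give the same order as the GNU sort output.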
Sorting the attached 3-line file under valgrind gives the attached errors on my system. I don't think that there's anything special about the input file as it's trivial.
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
$ valgrind --version
valgrind-3.15.0
Msort compiled with "gcc -ggdb" or "gcc -g" as /tmp/msort-g,
$ valgrind --log-file=abc.valgrind.txt /tmp/msort-g abc.txt
a
b
c
This produces the correct output but please see the valgrind log.
Can you reproduce this? Is this real?
The first error is,
==3408920== Invalid read of size 4
==3408920== at 0x4A337D7: fgets (iofgets.c:47)
==3408920== by 0x10AB49: fillBuffer (msort.c:666)
==3408920== by 0x10B32D: handleMerges (msort.c:846)
==3408920== by 0x10BE19: main (msort.c:1131)
==3408920== Address 0x4ba5b70 is 0 bytes inside a block of size 472 free'd
==3408920== at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3408920== by 0x4A33042: _IO_deallocate_file (libioP.h:863)
==3408920== by 0x4A33042: fclose@@GLIBC_2.2.5 (iofclose.c:74)
==3408920== by 0x10B391: handleMerges (msort.c:856)
==3408920== by 0x10BE19: main (msort.c:1131)
==3408920== Block was alloc'd at
==3408920== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3408920== by 0x4A33AAD: __fopen_internal (iofopen.c:65)
==3408920== by 0x4A33AAD: fopen@@GLIBC_2.2.5 (iofopen.c:86)
==3408920== by 0x10B28E: handleMerges (msort.c:827)
==3408920== by 0x10BE19: main (msort.c:1131)
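The log shows fgets reading a FILE that fclose had already freed. One common guard, sketched here with a hypothetical helper name (close_stream is not in msort.c), is to reset the pointer when closing so later code can test it before reading. This is only an illustration of the pattern, not a diagnosis of the actual control flow in handleMerges.

```c
#include <stdio.h>

/* Hypothetical helper: closes *fp if it is open and resets it to NULL,
 * so a later read can check the pointer instead of touching a freed
 * FILE. Returns the fclose result, or 0 if nothing was open. */
int close_stream(FILE **fp)
{
    int rc = 0;

    if (*fp != NULL) {
        rc = fclose(*fp);
        *fp = NULL;
    }
    return rc;
}
```

A caller would then write "if (fp != NULL && fgets(buf, sizeof buf, fp) != NULL)" before each read.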
@mayank-02 Are you getting notified of these?
See the INPUT FILES section of this document,
https://www.unix.com/man-page/posix/1p/sort/
INPUT FILES
The input files shall be text files, except that the sort utility shall add a newline to the end of a file ending with an incomplete last line.
GNU sort,
$ echo -ne "a\nb\nc" | sort
a
b
c
$
msort,
$ echo -ne "a\nb\nc" | msort
a
b
c$
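A minimal sketch of one way to get the POSIX behaviour on output, assuming lines are written one at a time. write_line is a hypothetical helper name, not a function from msort.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes one line to out and appends '\n' when the
 * line (typically the last line of the input) does not already end
 * with one. Returns 0 on success, -1 on a write error. */
int write_line(const char *line, FILE *out)
{
    size_t len = strlen(line);

    if (fputs(line, out) == EOF)
        return -1;
    if (len > 0 && line[len - 1] != '\n') {
        if (fputc('\n', out) == EOF)
            return -1;
    }
    return 0;
}
```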
See the OPERANDS section of this document,
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html
Msort,
$ echo "hello" | msort -
Can't open file : No such file or directory
GNU sort,
$ echo "hello" | sort -
hello
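POSIX says a file operand of "-" means standard input. A minimal sketch of how the open step could handle that; open_input is a hypothetical name and not a function from msort.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: maps the operand "-" to standard input and opens
 * any other operand as a file for reading. Returns NULL if fopen fails. */
FILE *open_input(const char *name)
{
    if (strcmp(name, "-") == 0)
        return stdin;
    return fopen(name, "r");
}
```

The caller should skip fclose when the returned stream is stdin if it may be read again later.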
Msort appears to create its temporary merge files in the current working directory. It should probably use the system's temporary-file configuration instead, which typically resolves to /tmp/.
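A minimal sketch of creating a scratch file under the system temporary directory on a POSIX system. open_scratch_file and the "msort.XXXXXX" template are hypothetical names chosen for illustration. It assumes $TMPDIR should be honoured, with /tmp as the fallback.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: creates a scratch file under $TMPDIR (falling
 * back to /tmp) and returns it opened for reading and writing. The
 * name is unlinked at once, so the file vanishes when it is closed. */
FILE *open_scratch_file(void)
{
    const char *dir = getenv("TMPDIR");
    char path[4096];
    int n, fd;
    FILE *fp;

    if (dir == NULL || dir[0] == '\0')
        dir = "/tmp";
    n = snprintf(path, sizeof path, "%s/msort.XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return NULL;            /* directory name too long */

    fd = mkstemp(path);         /* replaces XXXXXX with a unique suffix */
    if (fd == -1)
        return NULL;
    unlink(path);

    fp = fdopen(fd, "w+");
    if (fp == NULL)
        close(fd);
    return fp;
}
```

The standard tmpfile() function is a simpler alternative when the choice of directory does not matter.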
Using scan-build,
https://clang-analyzer.llvm.org/scan-build.html
I didn't investigate, but look at the third warning,
$ scan-build clang msort.c -lm
scan-build: Using '/usr/lib/llvm-10/bin/clang' for static analysis
msort.c:38:5: warning: Value stored to 'endpos1' is never read
endpos1 = length1;
^ ~~~~~~~
msort.c:39:5: warning: Value stored to 'endpos2' is never read
endpos2 = length2;
^ ~~~~~~~
msort.c:189:15: warning: The left operand of '==' is a garbage value
} while(x == y && i < endpos1 && j < endpos2);
~ ^
msort.c:791:9: warning: Value stored to 'k' is never read
k = 0;
^ ~
4 warnings generated.
scan-build: 4 bugs found.
If you run this yourself, you can use "scan-view" to examine the report in detail.
Please add a -c option to test if a file is already sorted. From the GNU sort man page,
-c, --check, --check=diagnose-first
check for sorted input; do not sort
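A minimal sketch of what such a check could look like for plain byte-wise ordering. first_disorder and MAX_LINE are hypothetical names, not taken from msort.c, and the sketch assumes no line is longer than MAX_LINE - 1 bytes (longer lines would be split and compared in pieces).

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 4096   /* assumed upper bound on line length */

/* Hypothetical helper: returns 0 if the lines of in are in
 * non-decreasing strcmp order, otherwise the 1-based number of the
 * first line that sorts before its predecessor. */
long first_disorder(FILE *in)
{
    char prev[MAX_LINE], cur[MAX_LINE];
    long lineno = 1;

    if (fgets(prev, sizeof prev, in) == NULL)
        return 0;                       /* empty input counts as sorted */
    prev[strcspn(prev, "\n")] = '\0';   /* drop the newline, if any */

    while (fgets(cur, sizeof cur, in) != NULL) {
        lineno++;
        cur[strcspn(cur, "\n")] = '\0';
        if (strcmp(prev, cur) > 0)
            return lineno;
        strcpy(prev, cur);
    }
    return 0;
}
```

A -c option could exit with status 0 when this returns 0 and print the offending line number otherwise.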
Might be a duplicate bug but the other one was for decimals without a leading zero. These have leading zeros.
Please confirm that you can reproduce these test results.
$ cat test.txt
0.6095308
6.754819
0.1447246
GNU sort,
$ LC_ALL=C sort -n test.txt
0.1447246
0.6095308
6.754819
Msort,
$ LC_ALL=C /tmp/msort -n test.txt
0.6095308
0.1447246
6.754819
And this example is even more disturbing,
$ cat test.txt
1.738268
1.148224
Msort (without LC_ALL=C),
$ /tmp/msort -n test.txt
1.738268
1.148224
GNU sort,
$ LC_ALL=C sort -n test.txt
1.148224
1.738268
Msort,
$ LC_ALL=C /tmp/msort -n test.txt
1.738268
1.148224
Should this be STRSIZE or something else? A constant doesn't make sense.
$ grep qsort msort.c
Using the attached 383 line file containing UUIDs, the output of msort doesn't match that of GNU sort or Busybox sort (which return the same result).
$ md5sum uuid.1234
27e48a1f7971314287a3d70c3a1f9923 uuid.1234
GNU sort,
$ LC_ALL=C sort -t- -k3,3 -k4,4 uuid.1234 | md5sum
3882c4e3465944a98eb35f7d0e9cea73 -
Busybox sort,
$ LC_ALL=C /tmp/sort -t- -k3,3 -k4,4 uuid.1234 | md5sum
3882c4e3465944a98eb35f7d0e9cea73 -
Msort,
$ LC_ALL=C /tmp/msort -t- -k3,3 -k4,4 uuid.1234 | md5sum
473e9fe8912e001ab79a2b41d3efca4f -
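For reference, a minimal sketch of what -t- -k3,3 -k4,4 is expected to compare under LC_ALL=C: field 3, then field 4, then the whole line as a last resort. All function names here are hypothetical and not taken from msort.c, and the sketch assumes lines are held as an array of char pointers sorted with qsort.

```c
#include <string.h>

/* Hypothetical helper: locates the 1-based field n of line, split on
 * sep. Stores the field length in *len and returns its start; a
 * missing field is reported as an empty field at the end of the line. */
static const char *field_at(const char *line, char sep, int n, size_t *len)
{
    const char *start = line;
    const char *end;

    while (--n > 0) {
        const char *next = strchr(start, sep);
        if (next == NULL) {
            start += strlen(start);
            break;
        }
        start = next + 1;
    }
    end = start;
    while (*end != '\0' && *end != sep && *end != '\n')
        end++;
    *len = (size_t)(end - start);
    return start;
}

/* Compares field n of two lines byte by byte; a shorter field that is
 * a prefix of the longer one sorts first. */
static int compare_field(const char *a, const char *b, char sep, int n)
{
    size_t la, lb;
    const char *fa = field_at(a, sep, n, &la);
    const char *fb = field_at(b, sep, n, &lb);
    int rc = memcmp(fa, fb, la < lb ? la : lb);

    if (rc != 0)
        return rc;
    return (la > lb) - (la < lb);
}

/* qsort comparator: key 3, then key 4, then the entire line. */
int compare_uuid_keys(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    int rc = compare_field(a, b, '-', 3);

    if (rc == 0)
        rc = compare_field(a, b, '-', 4);
    if (rc == 0)
        rc = strcmp(a, b);      /* last-resort whole-line comparison */
    return rc;
}
```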
Also probably the same as issue #7
Sorting multiple files also appears broken,
$ cat a.txt
a
$ cat b.txt
b
$ /tmp/msort a.txt
a
$ /tmp/msort b.txt
b
$ /tmp/msort a.txt b.txt
Segmentation fault (core dumped)
Can you reproduce this?