freznicek / awk-crashcourse
AWK language course
awk-crashcourse/examples/word-count.awk
Line 12 in 77070d2
Is it accounting for the trailing \n ?
hi,
just a very minor comment:
summing up length($0) only gives a byte count if the input is guaranteed to be ASCII-only.
I accidentally discovered that, even with gawk in Unicode (UTF-8) mode, you can get an exact byte count for UTF-8 input, or even for purely binary files like a .gz or a .mp4, with a simple
match($0, /$/) - 1
does the trick. The minus 1 is needed because /$/ matches at the first available position, which is immediately after the end of the input.
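A minimal sketch of the comparison, not taken from the repository: the helper name `len_vs_match` and its locale argument are made up for illustration. It feeds awk the 6-byte, 5-character string "héllo" (written with octal escapes so the script itself stays ASCII) and prints length($0) next to match($0, /$/) - 1.

```shell
# Hypothetical helper: print "<length> <match-1>" for the string "héllo".
# $1 is the locale awk should run under (an assumption of this sketch).
len_vs_match() {
    printf 'h\303\251llo\n' |
        LC_ALL="$1" awk '{ print length($0), match($0, /$/) - 1 }'
}

# In the byte-oriented C locale both numbers are byte counts.
len_vs_match C   # prints "6 6"
```

Calling it with a UTF-8 locale that is installed on your system (for example `len_vs_match en_US.UTF-8`) under gawk is where the two numbers should start to differ, if the behaviour described above holds for your gawk version.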
Alternatively, if you know for certain that RT is always exactly 1 byte wide (e.g. only \n),
then a byte count is even simpler:
at each row, add up
byte_cnt += match($0, /$/)
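then in the END { } section, byte_cnt will be accurate, because the extra 1 that match( ) reports for each row stands in for that row's 1-byte terminator. In byte/POSIX/C mode, match( ) doesn't offer any speed-up (in the gawk -b timings below it is far slower), so for those, use length( ) instead, adding NR for the newlines when summing row by row.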
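A rough sketch of both row-by-row variants, assuming every row (including the last) ends in a single "\n". The function names `bytes_rowwise` and `bytes_rowwise_c` are hypothetical, and plain `awk` is used so that it runs with whichever awk is installed.

```shell
# Hypothetical helper: sum match($0, /$/) over all rows.
# Each call returns (row length + 1); the +1 covers the 1-byte "\n".
bytes_rowwise() {
    awk '{ byte_cnt += match($0, /$/) } END { print byte_cnt + 0 }'
}

# Hypothetical helper for byte mode: force the C locale so that length()
# counts bytes, then add NR for the newlines.
bytes_rowwise_c() {
    LC_ALL=C awk '{ byte_cnt += length($0) } END { print byte_cnt + NR }'
}

printf 'ab\ncde\n' | bytes_rowwise        # prints 7
printf 'h\303\251\n' | bytes_rowwise_c    # prints 4 (3 bytes of text + "\n")
```

The `+ 0` in the first END block only makes empty input print 0 instead of an empty line.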
% time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS=RS="^$" } END { print match($0,/$/) - 1 }' | ecp); echo
in0: 408MiB 0:00:00 [1011MiB/s] [1011MiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -e | mawk ; ) 13.25s user 0.71s system 100% cpu 13.865 total
% time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print match($0,/$/) - 1 }' | ecp); echo
in0: 408MiB 0:00:00 [1.13GiB/s] [1.13GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -b -e | mawk ; ) 13.47s user 0.66s system 100% cpu 14.042 total
time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print length }' | ecp); echo
in0: 408MiB 0:00:00 [1.15GiB/s] [1.15GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -b -e | mawk ; ) 0.28s user 0.67s system 115% cpu 0.825 total
With gawk, one can obtain a tiny speed-up by summing row by row instead of matching the whole input at once. mawk2, on the other hand, implements match( ) in such a way that using it instead of length( ) causes hardly any slowdown on small inputs:
time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print byte_cnt }' | ecp); echo
in0: 408MiB 0:00:13 [30.3MiB/s] [30.3MiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -e | mawk ; ) 13.49s user 0.28s system 101% cpu 13.553 total
time ( pvE0 < "${m3r}" | mawk2 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print byte_cnt }' | ecp); echo
in0: 408MiB 0:00:00 [1.47GiB/s] [1.47GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | mawk2 | mawk ; ) 0.11s user 0.28s system 124% cpu 0.310 total
time ( pvE0 < "${m3r}" | mawk2 'BEGIN { FS="^$" } { byte_cnt += length($0) } END { print byte_cnt+NR }' | ecp); echo
in0: 408MiB 0:00:00 [1.50GiB/s] [1.50GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | mawk2 | mawk ; ) 0.10s user 0.27s system 124% cpu 0.300 total
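here, I've thrown in a 224 MiB .7z binary file, and gawk handles it just fine without any error messages. I've also added the gnu-wc output (lines, characters, bytes) for reference; the - (RT=="") term takes the extra 1 back off when the last row has no terminator: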
f='./MV82_ConsolidatedDesktop/new_m3t_need_append.txt.7z'; gwc -lcm "${f}" | lgp3; time ( pvE0 < "${f}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print byte_cnt - (RT=="") }' | ecp); echo
920308 125659415 235672582 ./MV82_ConsolidatedDesktop/new_m3t_need_append.txt.7z
in0: 224MiB 0:00:07 [28.6MiB/s] [28.6MiB/s] [===================================>] 100%
235672582
( pvE 0.1 in0 < "${f}" | gawk -e | mawk ; ) 7.83s user 0.22s system 101% cpu 7.892 total
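As a small-scale sketch of the same END correction: this assumes gawk is installed (RT is a gawk extension, so other awks would get the wrong answer here), and the function name `bytes_any_ending` is made up for illustration.

```shell
# Hypothetical helper, gawk only: sum match($0, /$/) per row, then drop the
# final +1 when the last row had no terminator (RT is empty in that case).
bytes_any_ending() {
    gawk '{ byte_cnt += match($0, /$/) } END { print byte_cnt - (RT == "") }'
}

# Only run the demo when gawk is actually available.
if command -v gawk >/dev/null 2>&1; then
    printf 'ab\ncd'   | bytes_any_ending   # no trailing newline
    printf 'ab\ncd\n' | bytes_any_ending   # with trailing newline
fi
```

The first call should report 5 and the second 6, matching what `wc -c` gives for the same inputs.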