onetrueawk / awk

One true awk

License: Other

awk's Introduction

The One True Awk

This is the version of awk described in The AWK Programming Language, Second Edition, by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 2024, ISBN-13 978-0138269722, ISBN-10 0138269726).

What's New?

This version of Awk handles UTF-8 and comma-separated values (CSV) input.

Strings

Functions that process strings now count Unicode code points, not bytes; this affects length, substr, index, match, split, sub, gsub, and others. Note that code points are not necessarily characters.

UTF-8 sequences may appear in literal strings and regular expressions. Arbitrary characters may be included with \u followed by 1 to 8 hexadecimal digits.

Regular expressions

Regular expressions may include UTF-8 code points, including \u.

CSV

The option --csv turns on CSV processing of input: fields are separated by commas, fields may be quoted with double-quote (") characters, quoted fields may contain embedded newlines. Double-quotes in fields have to be doubled and enclosed in quoted fields. In CSV mode, FS is ignored.

If no explicit separator argument is provided, field-splitting in split is determined by CSV mode.
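
For awks that lack --csv, the quoting rules above can be approximated in plain awk. The following is a minimal sketch under stated assumptions, not the code this awk uses: csv_fields and csv_split are hypothetical names, and only records that fit on one line are handled (no embedded newlines).

```sh
# Hypothetical helper: print each CSV field of every input line, numbered.
csv_fields() {
    awk '
    # Split one CSV line into array f and return the field count.
    # Assumption: no embedded newlines inside quoted fields.
    function csv_split(line, f,    n, i, c, q, cur) {
        n = 0; cur = ""; q = 0
        for (i = 1; i <= length(line); i++) {
            c = substr(line, i, 1)
            if (q) {                                  # inside a quoted field
                if (c != "\"")
                    cur = cur c
                else if (substr(line, i + 1, 1) == "\"") {
                    cur = cur "\""; i++               # doubled quote -> one quote
                } else
                    q = 0                             # closing quote
            } else if (c == "\"")
                q = 1                                 # opening quote
            else if (c == ",") {
                f[++n] = cur; cur = ""                # field separator
            } else
                cur = cur c
        }
        f[++n] = cur
        return n
    }
    { n = csv_split($0, f); for (i = 1; i <= n; i++) print i ": " f[i] }'
}

printf '%s\n' '"Poole, Frank",74,"say ""hi"""' | csv_fields
```

The sample line yields three fields: the quoted comma stays inside field 1, and the doubled quotes in field 3 collapse to single quotes.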

Copyright

Copyright (C) Lucent Technologies 1997
All Rights Reserved

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that the name Lucent Technologies or any of its entities not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

LUCENT DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL LUCENT OR ANY OF ITS ENTITIES BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Distribution and Reporting Problems

Changes, mostly bug fixes and occasional enhancements, are listed in FIXES. If you distribute this code further, please please please distribute FIXES with it.

If you find errors, please report them to the current maintainer, [email protected]. Please also open an issue in the GitHub issue tracker, to make it easy to track issues. Thanks.

Submitting Pull Requests

Pull requests are welcome. Some guidelines:

  • Please do not use functions or facilities that are not standard (e.g., strlcpy(), fpurge()).

  • Please run the test suite and make sure that your changes pass before posting the pull request. To do so:

    1. Save the previous version of awk somewhere in your path. Call it nawk (for example).
    2. Run oldawk=nawk make check > check.out 2>&1.
    3. Search for BAD or error in the result. In general, look over it manually to make sure there are no errors.
  • Please create the pull request with a request to merge into the staging branch instead of into the master branch. This allows us to do testing, and to make any additional edits or changes after the merge but before merging to master.
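
The test-suite steps above can be scripted roughly as follows. This is only a sketch: scan_check_output is a hypothetical helper name, check.out is the log file from step 2, and the grep automates step 3 only, so reading the log by eye is still worthwhile.

```sh
# Hypothetical helper: list suspicious lines in a saved "make check" log.
# Returns 0 when nothing matches BAD or error, 1 otherwise.
scan_check_output() {
    log=$1
    if grep -n -E 'BAD|error' "$log"; then
        return 1
    fi
    return 0
}

# Typical use, with the previous awk saved in PATH as nawk (step 1):
#   oldawk=nawk make check > check.out 2>&1
#   scan_check_output check.out && echo "no BAD or error lines found"
```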

Building

The program itself is created by

make

which should produce a sequence of messages roughly like this:

bison -d  awkgram.y
awkgram.y: warning: 44 shift/reduce conflicts [-Wconflicts-sr]
awkgram.y: warning: 85 reduce/reduce conflicts [-Wconflicts-rr]
awkgram.y: note: rerun with option '-Wcounterexamples' to generate conflict counterexamples
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o awkgram.tab.o awkgram.tab.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o b.o b.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o main.o main.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o parse.o parse.c
gcc -g -Wall -pedantic -Wcast-qual -O2 maketab.c -o maketab
./maketab awkgram.tab.h >proctab.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o proctab.o proctab.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o tran.o tran.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o lib.o lib.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o run.o run.c
gcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o lex.o lex.c
gcc -g -Wall -pedantic -Wcast-qual   -O2 awkgram.tab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o   -lm

This produces an executable a.out; you will eventually want to move this to some place like /usr/bin/awk.

If your system does not have yacc or bison (the GNU equivalent), you need to install one of them first. The default in the makefile is bison; you will have to edit the makefile to use yacc.

NOTE: This version uses ISO/IEC C99, as you should also. We have compiled this without any changes using gcc -Wall and/or local C compilers on a variety of systems, but new systems or compilers may raise some new complaint; reports of difficulties are welcome.

This compiles without change on Macintosh OS X using gcc and the standard developer tools.

You can also use make CC=g++ to build with the GNU C++ compiler, should you choose to do so.

A Note About Releases

We don't usually do releases.

A Note About Maintenance

NOTICE! Maintenance of this program is on a ''best effort'' basis. We try to get to issues and pull requests as quickly as we can. Unfortunately, however, keeping this program going is not at the top of our priority list.

Last Updated

Mon 05 Feb 2024 08:46:55 IST

awk's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

awk's Issues

(g)sub: backslash behaviour

$ gawk 'BEGIN { v="ab\\cd"; gsub(/\\/, "\\\\\\\\", v); print v; }'
ab\\cd
$ mawk 'BEGIN { v="ab\\cd"; gsub(/\\/, "\\\\\\\\", v); print v; }'
ab\\cd
$ awk 'BEGIN { v="ab\\cd"; gsub(/\\/, "\\\\\\\\", v); print v; }'
ab\\\\cd

The one true AWK behaves differently from both gawk and mawk when it comes to backslashes in the replacement string argument of (g)sub.

POSIX says for gsub/sub: "An <ampersand> ( '&' ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character."

This suggests that gawk and mawk have the conforming behaviour (bearing in mind that the double-quoted string literal already removes one level of backslashes).
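
Until the implementations agree, a script can double backslashes without going through the replacement string of (g)sub at all. A minimal sketch, assuming only index() and substr(); double_backslashes and AWK_INPUT are hypothetical names:

```sh
# Hypothetical helper: print $1 with every backslash doubled.
# The text travels through the environment so that no shell or awk
# escape processing touches it on the way in.
double_backslashes() {
    AWK_INPUT="$1" awk 'BEGIN {
        s = ENVIRON["AWK_INPUT"]; out = ""
        # "\\" is a single backslash once awk has read the string literal.
        while ((i = index(s, "\\")) > 0) {
            out = out substr(s, 1, i - 1) "\\\\"   # copy prefix, emit two backslashes
            s = substr(s, i + 1)
        }
        print out s
    }'
}

double_backslashes 'ab\cd'
```

For the input ab\cd this prints the same text that gawk and mawk produce in the transcript above.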

feature request: -f like option, but the last one processed

When writing an awk script that parses its own arguments, it is sometimes useful to pass the script an option-like argument (such as -n). But awk processes those arguments itself and does not pass them to the script.

For example, consider this script echo.awk.

#!/usr/bin/awk -f

BEGIN {
	if (ARGV[1] == "-n")
		nl = 1
	for (i = nl + 1; i < ARGC; i++)
		printf (i == nl + 1 ? "" : " ") ARGV[i]
	if (!nl)
		print ""
}

Now call it with the following arguments:

$ echo.awk -n Hello World
awk: unknown option -n ignored
Hello World
$

The argument -n is not passed to the script; it is processed by awk(1) and ignored. If it were an option that awk recognizes, the behavior could be totally different from what was expected. To obtain the expected behavior, we need to call the script with the option after a --.

$ ./echo.awk -- -n Hello World
Hello World$

I'm asking for an option for onetrueawk that, like -f, reads a program from a file but, unlike -f, is the last option to be parsed and passes all following arguments (including option-like arguments) to the script.

This new option would be extremely useful in shebangs on awk scripts.

warnings in manual page

$ man --warnings -E UTF-8 -l -Tutf8 -Z ./awk.1 >/dev/null
<standard input>:11: warning: macro `CT' not defined
<standard input>:201: warning: macro `TF' not defined

I'm smaller. No you're not.

$ nawk 'BEGIN { A="000100"; B="0010"; C=010;  if(B<C) print "\nB: I\47m smaller than C."; if(A<B) print "A: But I\47m smaller than you, B.\nAWK: Oh, really???" }'

B: I'm smaller than C.
A: But I'm smaller than you, B.
AWK: Oh, really???
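
The output above is what string comparison would produce: A and B are assigned string constants, so they are compared character by character rather than by numeric value. A small sketch of the difference, assuming nothing beyond standard awk (compare_demo is a hypothetical name):

```sh
# Hypothetical demo: the same two values compared as strings and as numbers.
compare_demo() {
    awk 'BEGIN {
        A = "000100"; B = "0010"
        # String constants compare as text: "000..." sorts before "001...".
        s = (A < B) ? "A < B" : "A >= B"
        # Adding 0 forces a numeric comparison: 100 is not less than 10.
        n = (A + 0 < B + 0) ? "A < B" : "A >= B"
        print "as strings: " s
        print "as numbers: " n
    }'
}

compare_demo
```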

Possible undefined behaviour of `ctype(3)` in `b.c:relex()` when using multiple `[:...:]` character classes in bracket expressions

When using multiple [:...:] character classes in a bracket expression (and with
a non-C locale (e.g. en_US.UTF-8)) only the first one seems honored, e.g.:

% echo "UPPER" | ./a.out '/[[:lower:][:upper:]]/ { print }'

should print UPPER but does not print it.

Inspecting that via the -d option we can indeed see that the [:upper:] is not
expanded in cclenter():

% ./a.out -d '/[[:lower:][:upper:]]/ { print }'
[...]
lex token 47
lex REGEXPR
lex token 47
reparse <[[:lower:][:upper:]]>
cclenter: in = |abcdefghijklmnopqrstuvwxyz|, out = |abcdefghijklmnopqrstuvwxyz|
lex token 123
[...]

After some debugging it seems it is actually triggered by the following code
in b.c:

for (i = 0; i < NCHARS; i++) {
	if (!adjbuf((char **) &buf, &bufsz, bp-buf+1, 100, (char **) &bp, "relex2"))
		FATAL("out of space for reg expr %.10s...", lastre);
	if (cc->cc_func(i)) {
		*bp++ = i;
		n++;
	}
}

cc_func() is a ctype(3) function and NCHARS is (256+3), so
cc->cc_func(256), cc->cc_func(257) and cc->cc_func(258) can also
be invoked, which may lead to undefined behaviour.

(I will share a patch that changes < NCHARS to <= UCHAR_MAX in the
next few minutes, but of course any other suggestions for avoiding
this are welcome! Thank you very much!)

Regression in hex conversions

Prior to cc9e9b6 something like

echo abc | awk '{printf("%d\n", "0x" $1);}'

would print 2748. This is the number you also get with 'gawk -Wposix'.
Now '0' is printed. It is my opinion that the 'new' behavior existed for too long to revert back to the old, pre C99 behavior.

This change breaks awk scripts in the FreeBSD kernel build. All hex numbers are now converted to 0.

The latest standard allows either behavior, alas, so I can't point to it for a definitive answer.

Historical implementations of awk did not parse hexadecimal integer or floating constants like "0xa" and "0xap0". Due to an oversight, the 2001 through 2004 editions of this standard required support for hexadecimal floating constants. This was due to the reference to atof(). This version of the standard allows but does not require implementations to use atof() and includes a description of how floating-point numbers are recognized as an alternative to match historic behavior. The intent of this change is to allow implementations to recognize floating-point constants according to either the ISO/IEC 9899:1990 standard or ISO/IEC 9899:1999 standard, and to allow (but not require) implementations to recognize hexadecimal integer constants.

This is somewhat surprising to our users as well.
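
Since the standard permits either behavior, a script that must work on both kinds of awk can convert hexadecimal text explicitly instead of relying on string-to-number conversion. A minimal sketch; hex_to_dec and hex2dec are hypothetical names, and returning -1 for a non-hex digit is an arbitrary choice made here:

```sh
# Hypothetical helper: print the decimal value of a hex string such as 0xabc.
hex_to_dec() {
    awk -v h="$1" '
    function hex2dec(s,    i, d, n) {
        s = tolower(s)
        sub(/^0x/, "", s)             # optional 0x prefix
        n = 0
        for (i = 1; i <= length(s); i++) {
            d = index("0123456789abcdef", substr(s, i, 1))
            if (d == 0)
                return -1             # not a hex digit
            n = n * 16 + d - 1
        }
        return n
    }
    BEGIN { print hex2dec(h) }'
}

hex_to_dec 0xabc
```

For 0xabc this prints 2748, the value the issue reports for the older behavior.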

I need awk for armv7

Please don't recommend BusyBox; the awk it provides is a cut-down version. I need the full version of awk for Android.

`noreturn` check currently broken by `__STDC__` typo for `__STDC_VERSION__`

visible via this new false positive warning:

run.c:1667:2: warning: variable 'u' is used uninitialized whenever switch
      default is taken [-Wsometimes-uninitialized]
        default:        /* can't happen */
        ^~~~~~~

fix:

diff --git a/awk.h b/awk.h
index 36a4286..5a55301 100644
--- a/awk.h
+++ b/awk.h
@@ -25,7 +25,7 @@ THIS SOFTWARE.
 #include <assert.h>
 #include <stdint.h>
 #include <stdbool.h>
-#if __STDC__ <= 199901L
+#if __STDC_VERSION__ <= 199901L
 #define noreturn
 #else
 #include <stdnoreturn.h>

(__STDC__ is only ever 0 or 1. __STDC_VERSION__ tells you which standard.)

awk fails to replace "/ere/" with "$0 ~ /ere/" according to POSIX

FreeBSD has a bug filed against it: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235887

Tim Chase wrote in that bug:

I've hit a case in which /ere/ doesn't expand the same as
"$0 ~ /ere/" which it should do according to the POSIX spec[0].

The goal was to meet the criterion "one and only one of multiple
regex matches", so I used

jot 20 | awk '/1/ + /5/ == 1'

(this can be expanded for any number of expressions, e.g.
"/1/ + /5/ + /7/ == 1", but the example using jot 20 makes it
easier to demonstrate the problem, looking for lines containing 1 or 5
but not 15)

This gives a parse error:

$ jot 20 | awk '/1/ + /5/ == 1'
awk: syntax error at source line 1
context is
/1/ + >>> / <<<
awk: bailing out at source line 1

Strangely, wrapping the expressions in parens works as expected:

$ jot 20 | awk '(/1/) + (/5/) == 1'

However manually performing the replacement documented above
according to the POSIX spec:

$ jot 20 | awk '$0 ~ /1/ + $0 ~ /5/ == 1'

parses fine (instead of giving the syntax error), so awk isn't doing the
"/ere/ -> $0 ~ /ere/" replacement POSIXly. However, this also doesn't
give results I'd consider correct (it returns "5" and "15"). Again,
wrapping those expansions in parens gives the expected/correct results:

$ jot 20 | awk '($0 ~ /1/) + ($0 ~ /5/) == 1'

As a side note, gawk parses the original notation ('/1/ + /5/ == 1')
fine and it does the same as the parenthesized versions above.

-tkc

[0] """

When an ERE token appears as an expression in any context other than
as the right-hand of the '~' or "!~" operator or as one of the
built-in function arguments described below, the value of the
resulting expression shall be the equivalent of:

$0 ~ /ere/

"""
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
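
The parenthesized form from the report can be wrapped up as below. This is a sketch of the workaround only; exactly_one_match is a hypothetical name, and the numbers are generated with awk itself because jot(1) is not available everywhere.

```sh
# Hypothetical helper: print input lines that match exactly one of /1/ and /5/.
# Each regex is parenthesized so that it is evaluated as a match against $0
# before the addition and the comparison.
exactly_one_match() {
    awk '(/1/) + (/5/) == 1'
}

# Same input as "jot 20": the integers 1 through 20, one per line.
awk 'BEGIN { for (i = 1; i <= 20; i++) print i }' | exactly_one_match
```

Lines such as 15 (both patterns match) and 20 (neither matches) are filtered out.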

Bug in parsing 100

Bug.zip

I recently encountered a bug in both the 20070501 (standard on macOS 10.15 Catalina) and 20180827 (installed using Homebrew) versions when using an AWK script to convert percentage marks into grades for students: a 100% mark generates a G- grade. The source code, sample input (sample.txt) and sample output (*.tsv) are in the attached archive, Bug.zip. The problem also manifests with mawk and bioawk; gawk is the only AWK implementation that performs correctly.

$ cd Bug
$ ./converttogrades-2-awk-20070501.awk < sample.txt > sample-awk-20070501.tsv
$ ./converttogrades-2.awk < sample.txt > sample-awk.tsv # awk 20180827 installed using Homebrew
$ ./converttogrades-2-gawk.awk < sample.txt > sample-gawk.tsv # installed using Homebrew; conflicts with current awk, so only one can be active at a time
$ ./converttogrades-2-mawk.awk < sample.txt > sample-mawk.tsv
$ ./converttogrades-2-bioawk.awk < sample.txt > sample-bioawk.tsv

sample.txt:

Student ID Student Name Registration Status LAB_REP:377237 EXAM:377238 Final Grade
12345678 "Poole, Frank" Web Registered 74 74
12345679 "Bowman, Dave" Web Registered 100 100
12345670 "9000, HAL" Registered 69 69

Incorrect .tsv output from all of the above awks except gawk:

Student ID Student Name Registration Status LAB_REP:377237 EXAM:377238 Final Grade
12345678 Poole, Frank Web Registered 74 A-
12345679 Bowman, Dave Web Registered 100 G-
12345670 9000, HAL Registered 69 B+

Correct .tsv output from gawk:

Student ID Student Name Registration Status LAB_REP:377237 EXAM:377238 Final Grade
12345678 "Poole, Frank" Web Registered 74 A-
12345679 "Bowman, Dave" Web Registered 100 A+
12345670 "9000, HAL" Registered 69 B+

Should allow unmatched parenthesis in regular expressions.

From FreeBSD bug 198552
awk disallows the use of an unmatched right parenthesis in regular
expressions:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ awk 'BEGIN { print match("...)", ")") }'
awk: illegal primary in regular expression ) at
 source line number 1
 context is
        BEGIN { print match("...)", >>>  ")") <<<
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The unmatched right parenthesis should be allowed in this context, as POSIX.1-2008
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_03)
states the following in section 9.4.3 (ERE Special Characters):

"The <right-parenthesis> shall be special when matched with a
preceding <left-parenthesis>, both outside a bracket expression."

and in section 9.5.1 (BRE/ERE Grammar Lexical Conventions)
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_05_01),
the description of the "SPEC_CHAR" token contains the following
exception:

"The close-parenthesis shall be considered special in this context
only if matched with a preceding open-parenthesis."

Note: the fix in the FreeBSD bug database no longer applies, and even doing the obvious things to make it apply causes dozens of tests to fail, so it is obviously a non-starter. It has also been rejected in the past by bwk.
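
In the meantime, a script can match a literal right parenthesis by putting it in a bracket expression, where it has no special meaning. A minimal sketch; find_close_paren is a hypothetical name:

```sh
# Hypothetical helper: print the position of the first ")" in $1, or 0 if none.
# Inside [...] the parenthesis is an ordinary character.
find_close_paren() {
    awk -v s="$1" 'BEGIN { print match(s, "[)]") }'
}

find_close_paren '...)'
```

For the string from the report this prints 4.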

Add str_pad function to the AWK

I would like to submit a PR that adds a new built-in function to AWK, one that pads a string to a certain length with another string:

str_pad(str, pad, len, type)

  • str(string): input string.
  • pad(string): pad string.
  • len(int): If len is negative, returns the input string. Otherwise, the function adds a pad to the string until it reaches the size of len.
  • type(int): type can be -1, 0, or 1. By default it is 1, which means the pad is added on the right side (default mode). 0 means both sides. -1 means the left side.
IN: awk 'BEGIN {printf str_pad("init", "_", 6, -1)"\n"}'
OUT: __init
IN: awk 'BEGIN {printf str_pad("init", "_", 8, 0)"\n"}'
OUT: __init__
IN: awk 'BEGIN {printf str_pad("init", "_", 6, 1)"\n"}'
OUT: init__

Can I do that?
Thank you.
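
For comparison, the same behavior can be had today with a user-defined function. This is a sketch, not the proposed built-in: str_pad_demo is a hypothetical wrapper, and putting the extra character on the right when the padding for type 0 is odd is an assumption made here.

```sh
# Hypothetical demo: str_pad written as an ordinary awk function.
str_pad_demo() {
    awk '
    # type: -1 = pad left, 0 = pad both sides, 1 = pad right.
    # Returns str unchanged when it is already len or longer, or pad is empty.
    function str_pad(str, pad, len, type,    need, left, right, l, r) {
        need = len - length(str)
        if (need <= 0 || pad == "")
            return str
        if (type < 0)      { left = need; right = 0 }
        else if (type > 0) { left = 0; right = need }
        else               { left = int(need / 2); right = need - left }
        l = ""; while (length(l) < left)  l = l pad
        r = ""; while (length(r) < right) r = r pad
        # Trim in case pad is longer than one character.
        return substr(l, 1, left) str substr(r, 1, right)
    }
    BEGIN {
        print str_pad("init", "_", 6, -1)
        print str_pad("init", "_", 8, 0)
        print str_pad("init", "_", 6, 1)
    }'
}

str_pad_demo
```

The three calls reproduce the IN/OUT examples above: __init, __init__ and init__.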

awk only gets first record when RS="\0"

The following programs, which list the files in the current directory, work on GNU awk, but on One True Awk™ they only list the first file found (i.e., the first record). (Run them in a directory with more than one file.)

#!/usr/bin/awk -f

function ls(dir,    i, n, s, a, cmd) {
	cmd = "find \"" dir "\" -mindepth 1 -maxdepth 1 -print0"
	while ((cmd | getline s) > 0) {
		a[++n] = s
	}
	close(cmd)
	printf "%d file%s\n", n, ((n > 1) ? "s" : "")
	for (i = 1; i <= n; i++) {
		printf "%s\n", a[i]
	}
}

BEGIN {
	RS="\0"
	ls(".")
}

find . -mindepth 1 -maxdepth 1 -print0 | awk 'BEGIN{RS="\0"} 1'

/ere/ not the same as $0 ~ /ere/

As reported by Tim Chase in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235887

% jot 20 | nawk '/1/ + /5/ == 1'
nawk: syntax error at source line 1
context is
/1/ + >>> / <<<
nawk: bailing out at source line 1
% jot 20 | gawk '/1/ + /5/ == 1'
1
5
10
11
12
13
14
16
17
18
19
%

Null O Null

It seems One True awk behaves weirdly when dealing with a null character "\0".

One True awk

$ nawk 'BEGIN { null="\x0"; printf("|%c|\0|%c|", "\0", null) }' | od -v -An -t x1
 7c 7c
$ nawk 'BEGIN { null="\x0"; printf("|%c|%c|\0|", "\0", null) }' | od -v -An -t x1
 7c 7c 7c

Busybox awk

$ bawk 'BEGIN { null="\x0"; printf("|%c|\0|%c|", "\0", null) }' | od -v -An -t x1
 7c
$ bawk 'BEGIN { null="\x0"; printf("|%c|%c|\0|", "\0", null) }' | od -v -An -t x1
 7c

GNU awk

$ gawk 'BEGIN { null="\x0"; printf("|%c|\0|%c|", "\0", null) }' | od -v -An -t x1
 7c 00 7c 00 7c 00 7c
$ gawk 'BEGIN { null="\x0"; printf("|%c|%c|\0|", "\0", null) }' | od -v -An -t x1
 7c 00 7c 00 7c 00 7c

test p.50 uses obsolete sort syntax

specifically this line:

                print cc ":" pop[cc] | "sort -t: +0 -1 +2nr" }

fails on Android with "Unknown option 1".

i'd no idea what those options meant, but POSIX (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html) says "Earlier versions of this standard also allowed the - number and + number options. These options are no longer specified by POSIX.1-2017 but may be present in some implementations."

GNU sort can actually explain what they should be translated to:

~$ sort --debug -t: +0 -1 +2nr
sort: using ‘en_US.UTF-8’ sorting rules
sort: obsolescent key ‘+0 -1’ used; consider ‘-k 1,1’ instead
sort: obsolescent key ‘+2’ used; consider ‘-k 3’ instead
sort: key 2 is numeric and spans multiple fields
~$ 

(i could send a pull request, but it seems weird and unhelpful to send a pull request that's a binary diff on a .tar file... should the .tar file actually be extracted in the git repository?)
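
Following GNU sort's suggestions above, the obsolescent keys translate to -k options. A small sketch of the translated command; sort_by_key is a hypothetical name and the three input lines are made-up sample data, not the contents of test p.50:

```sh
# Hypothetical helper: the POSIX spelling of "sort -t: +0 -1 +2nr".
#   +0 -1  ->  -k 1,1   (first field only)
#   +2nr   ->  -k 3nr   (third field onward, numeric, reversed)
sort_by_key() {
    sort -t: -k 1,1 -k 3nr
}

printf '%s\n' 'Asia:India:746' 'Europe:Germany:61' 'Asia:China:1032' | sort_by_key
```

The two Asia lines come first, larger third field first, followed by the Europe line.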

Heap buffer overflow from the fix to issue 83

The fix for #83 changed the code to insert 2 chars, but the call to adjbuf just above it only allows for 1 char. This can cause a heap buffer overflow.

E.g. in git master on CentOS 6.10:
$ echo \\ | env LANG=C MALLOC_CHECK_=1 ./a.out '/[[:cntrl:]01234567[:graph:]]/'
*** glibc detected *** a.out: realloc(): invalid pointer: 0x00000000008c5be0 ***
\
$

gawk strtonum?

if i sent a patch adding gawk's strtonum (described here), would there be any interest in that? it seems like parsing hex is otherwise quite awkward.

i'm interested in the more general question of what would be accepted, but strtonum is the only gawk extension that i've seen Android OEMs using in practice so far, to pull constants out of .h files. (we've switched them over to one-true-awk even on the host when building Android for the Q release.) my suggested workaround is to just use awk to cut out the constant, and then use $(( )) shell arithmetic to convert hex to decimal.
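
The workaround described above might look roughly like this. It is a sketch only: define_to_decimal is a hypothetical name, and the macro in the usage line is made up for illustration.

```sh
# Hypothetical helper: take a "#define NAME 0x..." line, cut out the constant
# with awk, then let shell arithmetic convert it from hex to decimal.
define_to_decimal() {
    hex=$(printf '%s\n' "$1" | awk '$1 == "#define" { print $3 }')
    [ -n "$hex" ] || return 1        # not a #define line
    echo "$(( $hex ))"
}

define_to_decimal '#define BUF_SIZE 0x1f'
```

For the made-up line this prints 31.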

i don't know how you'd want to measure whether something was generally useful enough/in widespread enough use. as perhaps an argument against strtonum, it looks like busybox's awk only implements the gawk bitwise functions and gawk time functions, so they at least haven't felt the need for strtonum yet.

(aside: mawk has the time functions too, so it looks like one-true-awk is the holdout there. i can find several users of systime in awk scripts quite easily, but not any of the other functions.)

anyway, happy to do the work, but was curious whether there's a policy and what that is... thanks!

awk doesn't accept "-v" assigns when the variable value starts with '='

I have a program that passes filenames into an awk script, and recently stumbled across a problem with a filename whose first character was '='.

This caused awk to fail:

awk: invalid -v option argument: filename==main

No amount of alternate shell escaping helped, so I fixed this (and all my other awk scripts) the messy way, by prefixing the value with another, "safe", character and then stripping it off within the awk BEGIN section.

Looking into this, the problem is caused in lib.c, function "isclvar", where it ends as follows:

return *s == '=' && s > os && *(s+1) != '=';

Fixing this involved changing it to

return *s == '=' && s > os;

I'm unsure why the additional check for '=' is present - my only guess is that it was to stop people accidentally doing -v "varname==value"

I checked the manpages and POSIX, and there's nothing mentioning this restriction. I also tested it with GNU awk, and the problem didn't occur.

Thoughts?

Cheers, Jamie
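
Another way around the -v parsing, sketched here under the assumption that the value can be passed through the environment, is to read it from ENVIRON. show_filename and FILENAME_ARG are hypothetical names:

```sh
# Hypothetical helper: hand a value to awk without -v, so a leading "="
# (or a backslash) in the value is never interpreted by awk.
show_filename() {
    FILENAME_ARG="$1" awk 'BEGIN { print ENVIRON["FILENAME_ARG"] }'
}

show_filename '=main'
```

This avoids both the prefix-and-strip trick and the escape processing that -v applies to its argument.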

Clarification about UTF-8 support

Given that the main focus of awk is processing text, I miss some clarification in the Readme and/or in the manpage, about Unicode support. From what I have read, awk is independent from the text encoding, but I'm not really sure if that's completely true, or if some side effects or unexpected behavior could happen if awk is used in a 100% UTF-8 scenario, where everything is encoded in UTF-8, and strings with many international characters are to be expected.

Thanks a lot in advance.

Switch 0FF?

Busybox awk

$ bawk 'BEGIN { print "\x0FF" }' | od -v -An -t x1
 0f 46 0a
$ bawk 'BEGIN { print "\x0FFFF" }' | od -v -An -t x1
 0f 46 46 46 0a

GNU awk

$ gawk 'BEGIN { print "\x0FF" }' | od -v -An -t x1
 0f 46 0a
$ gawk 'BEGIN { print "\x0FFFF" }' | od -v -An -t x1
 0f 46 46 46 0a

One True awk

$ nawk 'BEGIN { print "\x0FF" }' | od -v -An -t x1
 ff 0a
$ nawk 'BEGIN { print "\x0FFFF" }' | od -v -An -t x1
 ff 0a

Add tags

so they can show up here

http://github.com/onetrueawk/awk/tags

Here is an example:

git tag 20180827 1d9d86418a8e77a0270b5cff4ba97c9c4106b750
git tag 20180823 0f4e1ba922ccd45f7bd47701e923d98566ceb7c5
git tag 20180822 32093f5bbf567525d88566a449a89c72d2845e7e
git tag 20130105 3ed9e245db3f306c2898b5ad7b541b230d121070
git tag 20121220 87b94932e6f12ad16e9fc1af1a7b5b66ae9381e2
git push --tags

Positional specifier causes segfault

Gawk allows printf ordering:

$ awk 'BEGIN {printf "%2$s %1$s\n", "a", "b"}'
b a

and "gracefully" fails under POSIX mode:

$ awk --posix 'BEGIN {printf "%2$s %1$s\n", "a", "b"}'
awk: cmd. line:1: fatal: `$' is not permitted in awk formats

however BWK awk just segfaults:

$ awk 'BEGIN {printf "%2$s %1$s\n", "a", "b"}'
Segmentation fault (core dumped)

tested under FreeBSD

No help? Man O Man...

One True awk

$ nawk --help
nawk: no program given

$ nawk -help
nawk: unknown option -help ignored

nawk: no program given

$ nawk --h
nawk: no program given

$ nawk -h
nawk: unknown option -h ignored

nawk: no program given

$ nawk --?
nawk: no program given

$ nawk -?
nawk: unknown option -? ignored

nawk: no program given

$ nawk help
^C
$ nawk ?
nawk: syntax error at source line 1
 context is
         >>> ? <<<
nawk: bailing out at source line 1
$ nawk
usage: nawk [-F fs] [-v var=value] [-f progfile | 'prog'] [file ...]

What could fs mean? ForeShadow? FairytaleS? ForbiddenSacred?

Busybox awk

$ busybox awk --help
BusyBox v1.31.1 (2019-12-04 03:47:16 UTC) multi-call binary.

Usage: awk [OPTIONS] [AWK_PROGRAM] [FILE]...

        -v VAR=VAL      Set variable
        -F SEP          Use SEP as field separator
        -f FILE         Read program from FILE
        -e AWK_PROGRAM

`split`, LOCALE, and POSIX.

Hey @arnoldrobbins! Thank you for your tireless work maintaining nawk and gawk!

I was recently working on an awk script for converting fixed-width text to delimited text (https://github.com/mavenraven/fixedwidth2csv/blob/master/fixedwidth2csv). I was testing the script against the common awk implementations, and I noticed that split in nawk doesn't respect LC_CTYPE, e.g.

git clone https://github.com/onetrueawk/awk && cd awk && make && echo "🐈ello"| LC_CTYPE="en_US.UTF-8" ./a.out  '{ split($0, chars, ""); for (c in chars) { print chars[c] }}'

prints




e
l
l
o

On the other hand, for gawk:

echo "🐈ello"| LC_CTYPE="en_US.UTF-8" gawk '{ split($0, chars, ""); for (c in chars) { print chars[c] }}'

prints

🐈
e
l
l
o

and

echo "🐈ello"| LC_CTYPE="C" gawk  '{ split($0, chars, ""); for (c in chars) { print chars[c] }}'

prints





e
l
l
o

as expected.

The standard says:

LC_CTYPE
Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files), the behavior of character classes within regular expressions, the identification of characters as letters, and the mapping of uppercase and lowercase characters for the toupper and tolower functions.

I personally read this as split should support LOCALE, but it's not explicit.

I'm sure you're already aware of all this since you maintain both projects. I was wondering whether an explicit decision was made for nawk not to support this behavior, or whether it's a feature where a PR would be welcome?

Again, thanks for all of your hard work!

Unexpected behavior with operators

Well, first of all, I have no idea which one is right and which one is wrong. To me, busybox awk's implementation seems right and the others wrong. But here are some tests.

Busybox awk

$ bawk 'BEGIN { print !"a" }'
0
$ bawk 'BEGIN { print !!"a" }'
1
$ bawk 'BEGIN { print !!!"a" }'
0
$ bawk 'BEGIN { print +"a" }'
0
$ bawk 'BEGIN { print ++"a" }'
1
$ bawk 'BEGIN { print +++"a" }'
1
$ bawk 'BEGIN { print -"a" }'
0
$ bawk 'BEGIN { print --"a" }'
-1
$ bawk 'BEGIN { print ---"a" }'
-1

GNU awk

$ gawk 'BEGIN { print !"a" }'
0
$ gawk 'BEGIN { print !!"a" }'
1
$ gawk 'BEGIN { print !!!"a" }'
0
$ gawk 'BEGIN { print +"a" }'
0
$ gawk 'BEGIN { print ++"a" }'
gawk: cmd. line:1: BEGIN { print ++"a" }
gawk: cmd. line:1:                 ^ syntax error
$ gawk 'BEGIN { print +++"a" }'
gawk: cmd. line:1: BEGIN { print +++"a" }
gawk: cmd. line:1:                 ^ syntax error
$ gawk 'BEGIN { print -"a" }'
0
$ gawk 'BEGIN { print --"a" }'
gawk: cmd. line:1: BEGIN { print --"a" }
gawk: cmd. line:1:                 ^ syntax error
$ gawk 'BEGIN { print ---"a" }'
gawk: cmd. line:1: BEGIN { print ---"a" }
gawk: cmd. line:1:                 ^ syntax error

One True awk

$ nawk 'BEGIN { print !"a" }'
0
$ nawk 'BEGIN { print !!"a" }'
1
$ nawk 'BEGIN { print !!!"a" }'
0
$ nawk 'BEGIN { print +"a" }'
0
$ nawk 'BEGIN { print ++"a" }'
nawk: syntax error at source line 1
 context is
        BEGIN { print >>>  ++"a" <<<
nawk: illegal statement at source line 1
$ nawk 'BEGIN { print +++"a" }'
nawk: syntax error at source line 1
 context is
        BEGIN { print >>>  +++ <<< "a" }
nawk: illegal statement at source line 1
$ nawk 'BEGIN { print -"a" }'
0
$ nawk 'BEGIN { print --"a" }'
nawk: syntax error at source line 1
 context is
        BEGIN { print >>>  --"a" <<<
nawk: illegal statement at source line 1
$ nawk 'BEGIN { print ---"a" }'
nawk: syntax error at source line 1
 context is
        BEGIN { print >>>  --- <<< "a" }
nawk: illegal statement at source line 1
$ nawk 'BEGIN { print --"a" }'

I can understand a double operator (++"string") throwing an error, but rejecting a triple operator like this (+++"string") does not make any sense.
Shouldn't it convert "string" to a number first, e.g. ("string"+0), and then do the arithmetic operation?
This behavior is inconsistent with the following example:

 $ bawk 'BEGIN { print "999" + 1 }'
1000
 $ gawk 'BEGIN { print "999" + 1 }'
1000
 $ nawk 'BEGIN { print "999" + 1 }'
1000

THE END.

syntax error with "less than"

Example problem:

# awk 'BEGIN {print 1 < 2}'
awk: syntax error at source line 1

Workaround:

# awk 'BEGIN {print (1 < 2)}'
1

Note that Gawk and Mawk work as expected.

Include definition of the TF troff macro in awk.1

This is a cosmetic issue. The troff macro .TF is used in awk.1, but not provided by groff. Please consider including its definition in awk.1 (basically taken from Plan 9 man macro package):

.de TF
.IP "" "\w'\fB\\$1\ \ \fP'u"
.PD 0
..

is the code useful here?

run.c, line 1545:

		(void) mbtowc(NULL, NULL, 0);	/* reset internal state */
		/*
		 * Reset internal state here too.
		 * Assign result to avoid a compiler warning. (Casting to void
		 * doesn't work.)
		 * Increment said variable to avoid a different warning.
		 */
		int unused = wctomb(NULL, L'\0');
		unused++;

unused and wctomb(NULL, L'\0') seem useless.

nan inf issues

Version 20201208 has issues with nan and inf.
Examples
echo "-inf 123 -inf" | ./nawk '{print $1+0,$2+0,$3+0}'
gives 2.12199e-314 123 123

echo "-inf " | ./nawk '{print $0+0}'
gives 0

Applying the fix in diff_lib.c.txt to lib.c solved the above issues, and I did not encounter any others.

Interval expressions incorrectly handle x{0} case

The One True Awk incorrectly handles interval expressions of the form a{0}. For example:

$ echo abc | gawk '/ab{0}c/'
$ echo abc | nawk '/ab{0}c/'
abc

See the results from the new T.int-expr test case.
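
For reference, in POSIX extended regular expressions b{0} means zero occurrences of b, so ab{0}c should behave like ac. A minimal sketch of the expected behavior, using grep -E as a stand-in ERE engine (an assumption made for illustration; it is not part of the report, and the variable names are hypothetical):

```shell
# ab{0}c is equivalent to ac, so "abc" should not match and "ac" should.
miss=$(echo abc | grep -E 'ab{0}c' || true)   # no match expected: empty
hit=$(echo ac | grep -E 'ab{0}c' || true)     # match expected: ac
printf 'abc: [%s]\nac: [%s]\n' "$miss" "$hit"
```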

awk can't build against tcc

Changing HOSTCC to HOSTCC = tcc -g -Wall -pedantic -Wcast-qual in the makefile results in make errors with tcc at lib.c

tcc -g -Wall -pedantic -Wcast-qual   -O2   -c -o lib.o lib.c
lib.c:636: error: incompatible types for redefinition of 'cmdname'
make: *** [<builtin>: lib.o] Error 1

It does, however, build against gcc. Tested against the current staging branch, master branch, and the one release tag.

system:
musl libc
tcc
bison (instead of yacc)
Also, the -pedantic flag should probably be -Wpedantic to be more portable.

awk leaks memory

After applying the patch in Issue #112 and running valgrind --leak-check=full with the same program, there is a leak:

==5226== Memcheck, a memory error detector
==5226== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5226== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==5226== Command: ./a.out /^#!\ ?\/.\/[a-z]{0,2}awk/\ {sub(/^#!\ ?\/.\/[a-z]{0,2}awk/,"#!\ awk");\ print}
==5226==
#! awk
==5226==
==5226== HEAP SUMMARY:
==5226== in use at exit: 76,019 bytes in 438 blocks
==5226== total heap usage: 541 allocs, 103 frees, 96,472 bytes allocated
==5226==
==5226== 25 bytes in 1 blocks are definitely lost in loss record 79 of 151
==5226== at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5226== by 0x5279AF9: strdup (strdup.c:42)
==5226== by 0x1118D8: tostring (tran.c:535)
==5226== by 0x11BA43: regexpr (lex.c:548)
==5226== by 0x11BF9B: yylex (lex.c:184)
==5226== by 0x10C53E: yyparse (awkgram.tab.c:2257)
==5226== by 0x10BB5E: main (main.c:215)
==5226==
==5226== LEAK SUMMARY:
==5226== definitely lost: 25 bytes in 1 blocks
==5226== indirectly lost: 0 bytes in 0 blocks
==5226== possibly lost: 0 bytes in 0 blocks
==5226== still reachable: 75,994 bytes in 437 blocks
==5226== suppressed: 0 bytes in 0 blocks
==5226== Reachable blocks (those to which a pointer was found) are not shown.
==5226== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==5226==
==5226== For counts of detected and suppressed errors, rerun with: -v
==5226== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Similarly:

$ valgrind --leak-check=full ./a.out \
'BEGIN { x = "a"; sub(/a/, "b", x); print x }'
==5774== Memcheck, a memory error detector
==5774== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5774== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==5774== Command: ./a.out BEGIN\ {\ x\ =\ "a";\ sub(/a/,\ "b",\ x);\ print\ x\ }
==5774==
b
==5774==
==5774== HEAP SUMMARY:
==5774== in use at exit: 40,518 bytes in 348 blocks
==5774== total heap usage: 382 allocs, 34 frees, 54,076 bytes allocated
==5774==
==5774== 1 bytes in 1 blocks are definitely lost in loss record 9 of 110
==5774== at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5774== by 0x5279AF9: strdup (strdup.c:42)
==5774== by 0x1118D8: tostring (tran.c:535)
==5774== by 0x1119B3: setsymtab (tran.c:248)
==5774== by 0x11B2D1: word (lex.c:506)
==5774== by 0x11C533: yylex (lex.c:191)
==5774== by 0x10C53E: yyparse (awkgram.tab.c:2257)
==5774== by 0x10BB5E: main (main.c:215)
==5774==
==5774== LEAK SUMMARY:
==5774== definitely lost: 1 bytes in 1 blocks
==5774== indirectly lost: 0 bytes in 0 blocks
==5774== possibly lost: 0 bytes in 0 blocks
==5774== still reachable: 40,517 bytes in 347 blocks
==5774== suppressed: 0 bytes in 0 blocks
==5774== Reachable blocks (those to which a pointer was found) are not shown.
==5774== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==5774==
==5774== For counts of detected and suppressed errors, rerun with: -v
==5774== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

multiple threads in sort within awk to accelerate the speed

I am trying to use sort within awk with multiple threads to speed it up. However, the system doesn't appear to use multiple threads. Any suggestions? Thanks.

zcat $1 | awk '{print $2,$3,$4,$5,$2"-"$3"-"$4"-"$5}'|awk 'NR == 1;NR>1{print $0|"sort --parallel=50 -k1,1 -k2,2n"}' OFS="\t"| bgzip > $1.jnj.gz


Key for array is decimal or octal?

Busybox awk

$ bawk 'BEGIN { arr[010]="A"; print arr[010]; print arr[10]; print arr[0010]; print arr[8] }'
A

A
A
$

GNU awk

$ gawk 'BEGIN { arr[010]="A"; print arr[010]; print arr[10]; print arr[0010]; print arr[8] }'
A

A
A
$

One True awk

$ nawk 'BEGIN { arr[010]="A"; print arr[010]; print arr[10]; print arr[0010]; print arr[8] }'
A
A
A

$
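
Array subscripts in awk are strings, so the differences above appear to come from how each implementation reads the numeric constant 010 (as decimal 10 or as octal 8) before it is converted to a key. One way to sidestep the question, sketched here with a hypothetical variable name out, is to write the subscript as a string constant:

```shell
out=$(awk 'BEGIN {
    arr["010"] = "A"        # string subscript: the key is exactly "010"
    print ("010" in arr)    # same string, so the key exists
    print ("10" in arr)     # a different string, so no such key
    print ("8" in arr)      # likewise a different string
}')
echo "$out"
```

With string subscripts the program prints 1, 0 and 0 on separate lines, independent of how numeric constants with leading zeros are parsed.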

No t.Foo.ok for all the tests

While I can test awk against the latest awk with regress, I get a few unexpected differences when I test against gawk. It's unclear to me which one is right in these cases, nawk or gawk. Is there a way to know, or is nawk right by definition?

Array creation

AWK should have a variadic function for array creation. Currently if you wish to
create an array, you must use this:

dd[1] = "aa"
dd[2] = "bb"
dd[3] = "cc"

Or:

split("aa bb cc", dd)

The split syntax is problematic if your elements contain spaces. That can be
worked around by using a custom separator:

split("aa bb\1cc", dd, "\1")

but then it will fail again if your separator happens to be part of one of the
elements. Many other languages have syntax for array literals, for example C:

char *dd[] = {"aa", "bb", "cc"};

Python:

dd = ['aa', 'bb', 'cc']

JavaScript:

var dd = ['aa', 'bb', 'cc'];

Ruby:

dd = ['aa', 'bb', 'cc']

Go:

dd := []string {"aa", "bb", "cc"}

With AWK, it could look like one of these:

anew(dd, "aa", "bb", "cc")
dd[2] == "bb"
aset(dd, "aa", "bb", "cc")
dd[2] == "bb"
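
Until something like this exists, the idea can be approximated with a user-defined function that takes a fixed maximum number of values plus a count. The following is only a sketch: anew here is a hypothetical helper limited to four values, not a built-in and not a true variadic function.

```shell
second=$(awk '
# anew is a hypothetical helper: n is the number of values supplied.
# Trailing parameters that are not passed are simply left uninitialized.
function anew(arr, n, a, b, c, d) {
    split("", arr)          # clear any previous contents
    if (n >= 1) arr[1] = a
    if (n >= 2) arr[2] = b
    if (n >= 3) arr[3] = c
    if (n >= 4) arr[4] = d
}
BEGIN {
    anew(dd, 3, "aa", "b b", "cc")
    print dd[2]
}')
echo "$second"
```

Because each value is passed as its own argument, elements may contain spaces or any separator character, which is the case the split workaround cannot handle. The fixed parameter list is the obvious limitation.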

UTF-8: multiple bracket expressions containing character classes aren't matched

In UTF-8 locales, multiple bracket expressions containing character classes aren't matched correctly.

Symptoms:

$ echo 'é' | awk '/[[:alpha:]]/'               # ok
é
$ echo 'éé' | awk '/[[:alpha:]][[:alpha:]]/'   # NOT ok
(no output)

System: macOS 10.14.6. Using onetrueawk version 20190717.

My locale:

$ locale
LANG="nl_NL.UTF-8"
LC_COLLATE="nl_NL.UTF-8"
LC_CTYPE="nl_NL.UTF-8"
LC_MESSAGES="nl_NL.UTF-8"
LC_MONETARY="nl_NL.UTF-8"
LC_NUMERIC="nl_NL.UTF-8"
LC_TIME="nl_NL.UTF-8"
LC_ALL=

In an ISO-8859-1 locale (LANG="nl_NL.ISO8859-1") this works fine.

Add match groups?

Hi,

I was going to look at how hard it might be to add match groups to OpenBSD's awk, which turns out to be onetrueawk!

In GNU awk you can capture regex groups into an array like this:

match($0, pattern, ary) {print ary[1]}

The closest we can get in regular awk is:

awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

This is quite cumbersome, and I always have to look it up in this Stack Overflow answer:
https://stackoverflow.com/a/4673336/109414

Would such a feature be accepted in onetrueawk?

Thanks
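
As a worked instance of the RSTART/RLENGTH technique mentioned above, the sketch below extracts the digits that follow a fixed prefix. The input line and the prefix id= are made up for illustration.

```shell
id=$(echo 'user=alice id=42' | awk 'match($0, /id=[0-9]+/) {
    # The match covers "id=42"; skip the 3-character prefix "id="
    # so that only the digits remain.
    print substr($0, RSTART + 3, RLENGTH - 3)
}')
echo "$id"
```

This works only because the prefix length is known in advance, which is part of what makes the approach cumbersome compared with real capture groups.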

New release

Hi there,
The last release is from 2018. Would you mind creating a new release? It would be really useful for packagers.
Thanks
