idank / bashlex


Python parser for bash

License: GNU General Public License v3.0

Python 99.94% Makefile 0.06%

bashlex's Issues

case statement parsing?

It looks like bashlex has problems parsing case statements. Try passing the following to parser.parse:
case "$1" in
    start)
        start
        ;;

    stop)
        stop
        ;;

    *)
        echo $"Usage: $0 {start|stop}"
        exit 1

esac

I get the following error message:
Traceback (most recent call last):
  File "ttt.py", line 12, in
    trees = parser.parse(s)
  File "/home/joe/.local/lib/python3.10/site-packages/bashlex/parser.py", line 610, in parse
    parts = [p.parse()]
  File "/home/joe/.local/lib/python3.10/site-packages/bashlex/parser.py", line 691, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/home/joe/.local/lib/python3.10/site-packages/bashlex/yacc.py", line 439, in parse
    p.callable(pslice)
  File "/home/joe/.local/lib/python3.10/site-packages/bashlex/parser.py", line 401, in p_pattern
    handleNotImplemented(p, 'pattern')
  File "/home/joe/.local/lib/python3.10/site-packages/bashlex/parser.py", line 17, in handleNotImplemented
    raise NotImplementedError('type = {%s}, token = {%s}' % (type, p[1]))
NotImplementedError: type = {pattern}, token = {start}

I hope you can help here.
Best regards

Parsing fails for if [[ -f "../build/tmp/dklm/klm_exports.h" ]]

Parsing fails for
if [[ -f "../build/tmp/dklm/klm_exports.h" ]]

Cannot parse command substitution

(I plan to keep working on this, but I wanted to make everyone aware of it first.)

The README indicates that the following should work:

>>> bashlex.split('cat <(echo "a $(echo b)") | tee')
['cat', '<(echo "a $(echo b)")', '|', 'tee']

However, when I run it, I get a generator (see #13) that cannot be converted to a list without raising an error. For example:

>>> list(bashlex.split('cat <(echo "a $(echo b)") | tee'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bashlex/tokenizer.py", line 1176, in split
    doublequoted, 0, 0)
  File "bashlex/subst.py", line 225, in _expandwordinternal
    node, sindex[0] = _extractprocesssubst(parserobj, string, tindex)
  File "bashlex/subst.py", line 61, in _extractprocesssubst
    node, si = _parsedolparen(parserobj, string, sindex)
  File "bashlex/subst.py", line 31, in _parsedolparen
    copiedps = copy.copy(parserobj.parserstate)
AttributeError: 'tokenizer' object has no attribute 'parserstate'

Here's an online demo. Note that the command can be simplified to $(echo) or `echo` (the latter raises a slightly different error).

Cannot parse line continuations in let assignment

The following bash code throws a parsing error:
let

X=1

ParsingError: unexpected token '\n' (position 3)

importing the ABCs from 'collections' instead of from 'collections.abc' error

Any plans to fix this?

bashlex/utils.py:3: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    class typedset(collections.MutableSet):

bashlex/utils.py:51
bashlex/utils.py:51: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    class frozendict(collections.Mapping):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
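
The warnings above come from looking up the abstract base classes on `collections` instead of `collections.abc`. A minimal sketch of a compatible import follows, assuming Python 2 support is still wanted. The `frozendict` body is an illustrative stand-in, not bashlex's actual implementation.

```python
try:
    # Python 3.3+: the abstract base classes live in collections.abc
    from collections.abc import Mapping, MutableSet
except ImportError:
    # Python 2 fallback
    from collections import Mapping, MutableSet


class frozendict(Mapping):
    """Illustrative read-only mapping (stand-in for the class in utils.py)."""

    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


d = frozendict(a=1)
print(d["a"], len(d), isinstance(d, Mapping))
```

The same pattern would apply to `typedset`, which would derive from the imported `MutableSet`.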

Thank you!

I just wanted to say thanks for making this library. It's been super-useful while I've been implementing a feature in cibuildwheel. Parsing is super hard, and this seems to nail it! :)

By the way, did you ever try (or do you know of any attempt) to build something that executes the AST, or even just evaluates CommandNodes, CommandsubstitutionNodes and ParameterNodes? I'm working on something that does that at the moment :)

init fails in Python 3.5

/usr/lib/python3.5/site-packages/bashlex/__init__.py in ()
----> 1 import parser, tokenizer
ImportError: No module named 'tokenizer'

Parsing scripts with arrays

Attempting to parse a script with an array declaration fails upon encountering the opening parenthesis of the array (i.e. `(`).

The following bashlex information was provided by pip:

$ pip show bashlex
Name: bashlex
Version: 0.18
Summary: Python parser for bash
Home-page: https://github.com/idank/bashlex.git
Author: Idan Kamara
Author-email: [email protected]
License: GPLv3+
Location: /home/user/.local/lib/python3.10/site-packages
Requires:
Required-by:

In a Python interactive session with the following setup:

Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bashlex

Running the bashlex.parse function with the string declare -a CMDS=() produces the following output:

>>> bashlex.parse('declare -a CMDS=()')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.10/site-packages/bashlex/parser.py", line 610, in parse
    parts = [p.parse()]
  File "/home/user/.local/lib/python3.10/site-packages/bashlex/parser.py", line 691, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/home/user/.local/lib/python3.10/site-packages/bashlex/yacc.py", line 537, in parse
    tok = self.errorfunc(errtoken)
  File "/home/user/.local/lib/python3.10/site-packages/bashlex/parser.py", line 548, in p_error
    raise errors.ParsingError('unexpected token %r' % p.value,
bashlex.errors.ParsingError: unexpected token '(' (position 16)

When removing the round brackets it succeeds:

>>> bashlex.parse('declare -a CMDS')
[CommandNode(parts=[WordNode(parts=[] pos=(0, 7) word='declare'), WordNode(parts=[] pos=(8, 10) word='-a'), WordNode(parts=[] pos=(11, 15) word='CMDS')] pos=(0, 15))]

It's independent of the declare keyword:

>>> bashlex.parse('CMDS=()')
bashlex.errors.ParsingError: unexpected token '(' (position 5)

The error occurs when appending to the array as well:

>>> bashlex.parse('CMDS+=("init")')
bashlex.errors.ParsingError: unexpected token '(' (position 6)

Parsing parentheses is not in itself the issue:

>>> bashlex.parse('(env)')
[CompoundNode(list=[ReservedwordNode(pos=(0, 1) word='('), CommandNode(parts=[WordNode(parts=[] pos=(1, 4) word='env')] pos=(1, 4)), ReservedwordNode(pos=(4, 5) word=')')] pos=(0, 5) redirects=[])]

The lexer seems to recognize arrays as WordNodes:

>>> bashlex.parse('ARRAY[1]=init')
[CommandNode(parts=[WordNode(parts=[] pos=(0, 13) word='ARRAY[1]=init')] pos=(0, 13))]
>>> bashlex.parse('echo ${ARRAY[*]}')
[CommandNode(parts=[WordNode(parts=[] pos=(0, 4) word='echo'), WordNode(parts=[ParameterNode(pos=(5, 16) value='ARRAY[*]')] pos=(5, 16) word='${ARRAY[*]}')] pos=(0, 16))]
>>> bashlex.parse('unset ARRAY[1]')
[CommandNode(parts=[WordNode(parts=[] pos=(0, 5) word='unset'), WordNode(parts=[] pos=(6, 14) word='ARRAY[1]')] pos=(0, 14))]

It just seems to have issues recognizing array sets when performing assignments.

Parsing fails for some list nodes inside command substitution nodes.

To reproduce do:

import bashlex
bashlex.parser.parse('echo $(pwd && pwd)')

This results in ParsingError: unexpected token ')' (position 10)

Parsing also fails for 'echo $(pwd || pwd)' and 'echo $(pwd & pwd)', but 'echo $(pwd ; pwd)' parses just fine.

backslash newline fails

The tokenizer fails to properly parse files that use backslashes to continue lines.

For example:

for hook in \
	/etc/* \
	/lib/* \
	/etc/*
do
	echo hook
done

Results in the following exception:

Exception has occurred: ParsingError       (note: full exception trace is shown but execution is paused at: <module>)
unexpected token '/etc/*' (position 15)
  File "/bashlex/bashlex/parser.py", line 589, in p_error
    raise errors.ParsingError('unexpected token %r' % p.value,
  File "/bashlex/bashlex/yacc.py", line 1107, in parseopt_notrack
    tok = self.errorfunc(errtoken)
  File "/bashlex/bashlex/yacc.py", line 277, in parse
    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
  File "/bashlex/bashlex/parser.py", line 733, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/bashlex/bashlex/parser.py", line 652, in parse
    parts = [p.parse()]
  File "/bashlex/example.py", line 4, in <module> (Current frame)
    parts = bashlex.parse(script)

New lines between statements are not supported

Parsing a file with new lines between statements is not supported. For the following script:

echo "Line 1"

echo "Line 3"

The sample program (the one in the README) generates the following error:

Traceback (most recent call last):
  File "sp.py", line 4, in <module>
    parts = bashlex.parse(open(sys.argv[1]).read())
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 595, in parse
    part = _parser(s[index:], strictmode=strictmode).parse()
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 641, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 277, in parse
    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 1079, in parseopt_notrack
    tok = self.errorfunc(errtoken)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 539, in p_error
    p.lexer.source, p.lexpos)
bashlex.errors.ParsingError: unexpected token 'echo' (position 1)

Traversing the BashLex AST Node.

Hey,

I am trying to traverse the bashlex.ast.node tree to find specific things, such as whether a bash command writes to a temp directory or deletes a file.

So far I have been checking manually:
if tree[0].tree[i].word == 'rm':
    return command
But I assume that is not the right way to traverse the bashlex AST. What if I need to find the files that are written to the temp directory?

Can you shed some light on how I can traverse the AST efficiently to meet the requirement above?

Thank You.
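
One possible approach to the traversal question is a generic recursive walk that duck-types on node attributes. The sketch below is an assumption-laden illustration: `walk`, `commands_named`, `redirect_targets` and `fake` are hypothetical helpers, and the attribute and kind names (`kind`, `parts`, `list`, `output`, `heredoc`, `type`, `word`) are taken from the node dumps and tracebacks quoted elsewhere in this tracker. The stand-in tree is built by hand so the example runs without bashlex; with the library installed you would pass the result of `bashlex.parse(...)` instead.

```python
from types import SimpleNamespace


def walk(node):
    """Yield `node` and every descendant reachable through common child attributes."""
    yield node
    for attr in ("parts", "list"):  # list-valued children
        for child in getattr(node, attr, None) or []:
            yield from walk(child)
    for attr in ("command", "output", "heredoc"):  # single-node children
        child = getattr(node, attr, None)
        if hasattr(child, "kind"):
            yield from walk(child)


def commands_named(trees, name):
    """Return command nodes whose first word equals `name`."""
    found = []
    for tree in trees:
        for node in walk(tree):
            if getattr(node, "kind", None) != "command":
                continue
            words = [p for p in node.parts if getattr(p, "kind", None) == "word"]
            if words and words[0].word == name:
                found.append(node)
    return found


def redirect_targets(trees, operators=(">", ">>")):
    """Return the target words of output redirections."""
    targets = []
    for tree in trees:
        for node in walk(tree):
            if getattr(node, "kind", None) == "redirect" and node.type in operators:
                word = getattr(node.output, "word", None)
                if word is not None:
                    targets.append(word)
    return targets


def fake(kind, **attrs):
    """Hand-built stand-in for a parsed node (hypothetical, for the demo only)."""
    return SimpleNamespace(kind=kind, **attrs)


# Stand-in for the tree of: ls > /tmp/out ; rm /tmp/x
redirect = fake("redirect", type=">", input=None, heredoc=None,
                output=fake("word", word="/tmp/out", parts=[]))
ls = fake("command", parts=[fake("word", word="ls", parts=[]), redirect])
rm = fake("command", parts=[fake("word", word="rm", parts=[]),
                            fake("word", word="/tmp/x", parts=[])])
tree = fake("list", parts=[ls, fake("operator", op=";"), rm])

print([p.word for p in commands_named([tree], "rm")[0].parts])
print(redirect_targets([tree]))
```

Checking whether a target lies under a temp directory is then an ordinary string or path test on the returned words. bashlex also has its own visitor in `bashlex/ast.py`, which may be a better fit than a hand-rolled walk.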

As a noob to Linux, I hated seeing online tutorials that showed which commands to use, with arguments, but didn't explain what the options meant. I use Explainshell every day to learn more about commands, and I now feel a lot more comfortable running tutorial commands knowing exactly what they do.

I had actually wondered whether something like this existed, though only for a few weeks before finding it. Just wanted to say thank you :)

Include license file in PyPI

Hello,

I am packaging bashlex as a conda package but the license file is not available in the PyPI tarball.

Could you please include it in the next release?

xref: conda-forge/staged-recipes#4401

Best regards,
Sebastian

add unimplemented nodes to AST instead of raising exceptions

To facilitate broader coverage of the analyzer, it would be good for the parser to add "unimplemented nodes" to the AST rather than raising an error. This can be done as follows:

$ git-diff bashlex/parser.py
...
+from mezcla import system
+
+ADD_UNIMPLEMENTED_NODE = system.getenv_bool("ADD_UNIMPLEMENTED_NODE", False,
+                                            "Add unimplemented nodes to parse tree")
+
 from bashlex import yacc, tokenizer, state, ast, subst, flags, errors, heredoc
 
 def _partsspan(parts):
@@ -13,14 +19,21 @@ precedence = (
 )
 
 def handleNotImplemented(p, type):
-    if len(p) == 2:
+    if ADD_UNIMPLEMENTED_NODE:
+        parts = _makeparts(p)
+        p[0] = ast.node(kind='unimplemented', parts=parts, pos=_partsspan(parts))
+    elif len(p) == 2:
         raise NotImplementedError('type = {%s}, token = {%s}' % (type, p[1]))
     else:
         raise NotImplementedError('type = {%s}, token = {%s}, parts = {%s}' % (type, p[1], p[2]))

This way, a parse tree can still be recovered even though a particular construct is not supported:

$ ADD_UNIMPLEMENTED_NODE=1 python -c 'import bashlex; print(bashlex.parse("case fu in esac")[0].dump())'
UnimplementedNode(pos=(0, 15), parts=[
  ReservedwordNode(pos=(0, 4), word='case'),
  WordNode(pos=(5, 7), word='fu'),
  ReservedwordNode(pos=(8, 10), word='in'),
  ReservedwordNode(pos=(11, 15), word='esac'),
])

I can add a pull request for this if you want.

Arithmetic Command Implementation

Currently the p_arith_command and _extractcommandsubst functions raise a NotImplementedError when an arithmetic expression is found. In bash-master/make_cmd (line 430), the make_arith_command function simply sets the .value attribute to the string, sets the flags to zero, gives the node the type cm_arith, and sets the redirects to null. Would adding this implementation to p_arith_command be an acceptable fix? subst.py would also need to change to implement these functions. If you just call _parsedolparen on the arithmetic expression, the parsing seems to work just fine. The parens shouldn't be parsed as nodes and the node type should be 'arith_cmd', but those are easy fixes. Is there something I am missing as to why these aren't implemented?

Unexpected token \n when parsing bash script

I am trying to parse the following bash script (simplified example) and print the produced AST as JSON using bashlex 0.12:

function a {
    a;
}

# Comment

But it fails:

  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 614, in parse
    part = _parser(s[index:], strictmode=strictmode).parse()
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 682, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 277, in parse
    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 1079, in parseopt_notrack
    tok = self.errorfunc(errtoken)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 539, in p_error
    p.lexer.source, p.lexpos)
bashlex.errors.ParsingError: unexpected token '\n' (position 10)

A trivial workaround is to wrap the code in any other construct, the simplest being a set of curly braces. Then everything works just fine:

{
function a {
    a;
}

# Comment
}

Of course I can live with the workaround but I think it would be great if you took a look at it.

Thanks a lot for the great job you've done!
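
The brace-wrapping workaround described in this report can be sketched as a small pre-processing helper. `wrap_in_braces` and `BRACE_OFFSET` are hypothetical names, not part of bashlex. Wrapping changes the tree, since the top-level result becomes a single compound node, and it shifts every reported position by the length of the added prefix.

```python
def wrap_in_braces(script):
    """Wrap a script in a brace group so the parser sees one compound command."""
    # A brace group needs a newline (or ';') before the closing brace.
    if not script.endswith("\n"):
        script += "\n"
    return "{\n" + script + "}"


# len("{\n"): subtract this from node positions to map them back to the original text.
BRACE_OFFSET = 2

original = "function a {\n    a;\n}\n\n# Comment"
print(wrap_in_braces(original))
```

The wrapped string can then be handed to the parser in place of the original script.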

ANSI-C quoted strings $'' aren't supported

https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html

Expected result:

>>> list(bashlex.split("echo $'hello'"))
['echo', 'hello']
>>> list(bashlex.split("echo $'hello\\nworld'"))
['echo', 'hello\nworld']  # notice \\n becomes a real newline character \n

Actual result (bashlex 0.15):

>>> list(bashlex.split("echo $'hello'"))
['echo', '$hello']
>>> list(bashlex.split("echo $'hello\\nworld'"))
['echo', '$hellonworld']
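
To make the expected behaviour concrete, here is a rough sketch of decoding the text between `$'` and `'`. It is not bashlex code, `decode_ansi_c` is a hypothetical helper, and it covers only part of what the Bash manual lists: the single-character escapes, `\xHH` and octal `\nnn`. It leaves out `\uHHHH`, `\UHHHHHHHH` and `\cx`.

```python
import re

# Single-character escapes from the ANSI-C quoting rules.
_SIMPLE = {
    "a": "\a", "b": "\b", "e": "\x1b", "E": "\x1b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
    "\\": "\\", "'": "'", '"': '"', "?": "?",
}

# A backslash followed by a hex escape, an octal escape, or any single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{1,2}|[0-7]{1,3}|.)", re.DOTALL)


def decode_ansi_c(body):
    """Decode the text found between $' and ' (subset of bash's escapes)."""
    def repl(match):
        esc = match.group(1)
        if esc[0] == "x" and len(esc) > 1:
            return chr(int(esc[1:], 16))
        if esc[0] in "01234567":
            return chr(int(esc, 8) & 0xFF)  # octal values are eight-bit
        # Unknown escapes are left untouched, backslash included.
        return _SIMPLE.get(esc, "\\" + esc)

    return _ESCAPE.sub(repl, body)


print(repr(decode_ansi_c("hello\\nworld")))
```

A tokenizer fix would still have to recognise the `$'` opener and find the closing quote. This sketch covers only the decoding step.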

Cannot parse array initialization statement

Quoting Bash Guide for Beginners :: 10.2.1. Creating arrays:

Array variables may also be created using compound assignments in this format:

ARRAY=(value1 value2 ... valueN)

I have lots of scripts with such statements:

ARRAY=('value1' 'value2')

However they raise a ParsingError:

  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 682, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 277, in parse
    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 1079, in parseopt_notrack
    tok = self.errorfunc(errtoken)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 539, in p_error
    p.lexer.source, p.lexpos)
bashlex.errors.ParsingError: unexpected token '(' (position 6)

Command substitution

Hi,
I was about to make a pull request for a command substitution that returned a wrong position when the command inside it contains spaces, as in $(foo ). But then I discovered another problem that I couldn't figure out how to fix: when there is a semicolon in the list of commands, parsing fails.

Example: parsing $(foo;) fails with the error below. Can you tell me where the problem lies? It is a list, but I couldn't find where to fix it. It does not happen with the other form of command substitution, `command;`. I know the two forms are treated differently in the parser.

File "python3.6/site-packages/bashlex/parser.py", line 605, in parse
	    parts = [p.parse()]
	  File "python3.6/site-packages/bashlex/parser.py", line 686, in parse
	    tree = theparser.parse(lexer=self.tok, context=self)
	  File "python3.6/site-packages/bashlex/yacc.py", line 277, in parse
	    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
	  File "python3.6/site-packages/bashlex/yacc.py", line 998, in parseopt_notrack
	    p.callable(pslice)
	  File "python3.6/site-packages/bashlex/parser.py", line 157, in p_simple_command_element
	    p[0] = [_expandword(parserobj, p.slice[1])]
	  File "python3.6/site-packages/bashlex/parser.py", line 137, in _expandword
	    doublequoted, 0, 0)
	  File "python3.6/site-packages/bashlex/subst.py", line 271, in _expandwordinternal
	    node, sindex[0] = _paramexpand(parserobj, string, sindex[0])
	  File "python3.6/site-packages/bashlex/subst.py", line 165, in _paramexpand
	    return _extractcommandsubst(parserobj, string, zindex + 1)
	  File "python3.6/site-packages/bashlex/subst.py", line 55, in _extractcommandsubst
	    node, si = _parsedolparen(parserobj, string, sindex)
	  File "python3.6/site-packages/bashlex/subst.py", line 42, in _parsedolparen
	    node, endp = _recursiveparse(parserobj, base, sindex, tokenizerargs)
	  File "python3.6/site-packages/bashlex/subst.py", line 23, in _recursiveparse
	    node = p.parse()
	  File "python3.6/site-packages/bashlex/parser.py", line 686, in parse
	    tree = theparser.parse(lexer=self.tok, context=self)
	  File "python3.6/site-packages/bashlex/yacc.py", line 277, in parse
	    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
	  File "python3.6/site-packages/bashlex/yacc.py", line 1079, in parseopt_notrack
	    tok = self.errorfunc(errtoken)
	  File "python3.6/site-packages/bashlex/parser.py", line 543, in p_error
	    p.lexer.source, p.lexpos)
	bashlex.errors.ParsingError: unexpected token ')' (position 4)

[request] Ship wheels too

Could you please also include a wheel when releasing bashlex? Only the classic ".egg" is provided, so this triggers a SDist install (I don't think .egg's are used anymore, setuptools-scm is the only other package I know still also providing an egg). See reasons listed here: https://pythonwheels.com for why wheels are nice, even for pure-python packages (faster, better security, pre-generates .pyc's, etc). Thank you!

pip wheel . will make one, or use pip install build && python -m build (may be best with a pyproject.toml too, which is also a good idea, but I think it works in legacy mode for simple packages).

Is it possible to run the parser in a generative fashion?

I wonder if it is possible to convert the parsed AST back into a valid bash script? Since the grammar is already there, in theory nothing stops it from doing so, right?

`for f in $(a; b); do :; done` not supported

This happens on explainshell.com.

heredoc parsing in function causes strange extra results

For the following code:

code = '''cat << EOF
abc
def
EOF'''
ret = bashlex.parse(code)
print(ret[0].dump())

we got:

CommandNode(pos=(0, 9), parts=[
  WordNode(pos=(0, 3), word='cat'),
  RedirectNode(heredoc=
    HeredocNode(pos=(10, 21), value='abc\ndef\nEOF'), output=
    WordNode(pos=(6, 9), word='EOF'), pos=(4, 21), type='<<'),
])

That's fine so far.

But for this code:

code = '''function foo () {
cat << EOF
abc
def
EOF
}'''
ret = bashlex.parse(code)
print(ret[0].dump())

we got:

FunctionNode(pos=(0, 40), parts=[
  ReservedwordNode(pos=(0, 8), word='function'),
  WordNode(pos=(9, 12), word='foo'),
  ReservedwordNode(pos=(12, 13), word='('),
  ReservedwordNode(pos=(13, 14), word=')'),
  CompoundNode(list=[
    ReservedwordNode(pos=(15, 16), word='{'),
    ListNode(pos=(17, 39), parts=[
        CommandNode(pos=(17, 26), parts=[
          WordNode(pos=(17, 20), word='cat'),
          RedirectNode(heredoc=
            HeredocNode(pos=(31, 38), value='def\nEOF'), output=
            WordNode(pos=(23, 26), word='EOF'), pos=(21, 26), type='<<'),
        ]),
        OperatorNode(op='\n', pos=(26, 27)),
        CommandNode(pos=(27, 30), parts=[
          WordNode(pos=(27, 30), word='abc'),
        ]),
        OperatorNode(op='\n', pos=(30, 39)),
      ]),
    ReservedwordNode(pos=(39, 40), word='}'),
  ], pos=(15, 40)),
])

In this case, abc is no longer part of the heredoc; it comes out as a standalone CommandNode.

"a\\ \n" is not treated the same as "a\\\n"

The space in between \ and \n causes the tokenizer to not treat the \ as an independent and removable character, like it would if there were no space. Bash treats these as the same so it makes sense for the parser to do so as well

bashlex.split() - Strange quoting behaviour with variable assignments

I'm seeing a strange bug with variable assignments

>>> list(bashlex.split("PATH=\"$PATH:/usr/local/bin/\""))
['PATH="$PATH:/usr/local/bin/"']
       ^                     ^
#      note the quote marks /

>>> list(bashlex.split("PATH2=\"$PATH:/usr/local/bin/\""))
['PATH2=$PATH:/usr/local/bin/']

#     the quote marks are gone!

In the above example, it seems to be the number in the env var name that triggers the removal of quotes.

The following example shows that a preceding variable assignment with a number in its name triggers the different quoting behaviour.

>>> list(bashlex.split("VAR_ABC=1 PATH=\"$PATH:/usr/local/bin/\""))
['VAR_ABC=1', 'PATH="$PATH:/usr/local/bin/"']
                    ^                     ^
#                   note the quote marks /

>>> list(bashlex.split("VAR_123=1 PATH=\"$PATH:/usr/local/bin/\""))
['VAR_123=1', 'PATH=$PATH:/usr/local/bin/']

#     the quote marks are gone!

Retaining the quotes is desirable for my use case. I can work around it, so I'm just wondering whether this is a bug in bashlex or some strange bash behaviour.

Comments are not supported

Parsing a file with comments is not supported. For the following script:

# A comment
echo "A script with a comment"

The sample program (the one in the README) generates the following error:

Traceback (most recent call last):
  File "sp.py", line 4, in <module>
    parts = bashlex.parse(open(sys.argv[1]).read())
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 582, in parse
    parts = [p.parse()]
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 641, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 277, in parse
    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/yacc.py", line 1079, in parseopt_notrack
    tok = self.errorfunc(errtoken)
  File "/usr/local/lib/python2.7/dist-packages/bashlex/parser.py", line 539, in p_error
    p.lexer.source, p.lexpos)
bashlex.errors.ParsingError: unexpected token 'echo' (position 12)

Refactoring class names

I was just browsing through the code and found that many class names in ast.py are lowercase. PEP 8 says class names should use CapWords. I just wanted to ask whether this was intentional. If not, I can submit a PR with the fix.

Fail to parse single line string with comment

I'm using bashlex to parse build log files to extract compilation commands. I've just realized that when single-line strings with comments are passed to the parser, it fails, raising the exception below:

Traceback (most recent call last):
  File "/bin/compiledb", line 11, in <module>
    load_entry_point('compiledb', 'console_scripts', 'compiledb')()
  File "/usr/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 1043, in invoke
    return Command.invoke(self, ctx)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/nick/projects/compiledb/compiledb-generator/compiledb/cli.py", line 74, in cli
    done = generate(infile, outfile, build_dir, exclude_files, verbose, overwrite, not no_strict)
  File "/home/nick/projects/compiledb/compiledb-generator/compiledb/__init__.py", line 78, in generate
    r = generate_json_compdb(infile, proj_dir=build_dir, verbose=verbose, exclude_files=exclude_files)
  File "/home/nick/projects/compiledb/compiledb-generator/compiledb/__init__.py", line 34, in generate_json_compdb
    result = parse_build_log(instream, proj_dir, exclude_files, verbose)
  File "/home/nick/projects/compiledb/compiledb-generator/compiledb/parser.py", line 103, in parse_build_log
    commands = CommandProcessor.process(line, working_dir)
  File "/home/nick/projects/compiledb/compiledb-generator/compiledb/parser.py", line 163, in process
    trees = bashlex.parser.parse(line)
  File "/home/nick/sandbox/bashlex/bashlex/parser.py", line 611, in parse
    ef.visit(parts[-1])
  File "/home/nick/sandbox/bashlex/bashlex/ast.py", line 35, in visit
    k = n.kind
AttributeError: 'NoneType' object has no attribute 'kind'

Patch coming..
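
Until a fix lands, a caller that feeds log lines to the parser one at a time could skip blank and whole-line-comment input first. This is a caller-side sketch, not the patch mentioned above, and `parseable_lines` is a hypothetical helper. It only looks at whole-line comments and does nothing about trailing comments after a command.

```python
def parseable_lines(lines):
    """Yield only lines that hold something other than whitespace or a whole-line comment."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        yield line


log = ["# foo\n", "\n", "gcc -c a.c  # build\n"]
print(list(parseable_lines(log)))
```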

bashlex is creating python files at import time

Hi,

I am using bashlex, installed system wide (in a container, but that's not the issue), and when I try to import it as an unprivileged user, it shows some errors:

# sudo -u user python -c "import bashlex"
Unable to create '/usr/lib/python3.7/site-packages/bashlex/parsetab.py'
[Errno 13] Permission denied: '/usr/lib/python3.7/site-packages/bashlex/parsetab.py'

Tracking the issue down, it seems that https://github.com/idank/bashlex/blob/master/bashlex/yacc.py#L3291 is the call that writes this file.

add support for using unimplemented nodes for array assignment

It would be good for array assignments to be flagged as unimplemented when the new proceedonerror flag is enabled. This way, a complete AST can still be generated.

Currently, array assignment leads to a parsing error:

$ snippet='num=2 arr=(1 2 3)'

$ python -c "import bashlex; print(''.join(p.dump() for p in bashlex.parse('$snippet', proceedonerror=0)))"
Traceback (most recent call last):
...
  File "/usr/local/misc/programs/python/bashlex/bashlex/parser.py", line 587, in p_error
    raise errors.ParsingError('unexpected token %r' % p.value,
bashlex.errors.ParsingError: unexpected token '(' (position 10)

It would be better to add an unimplemented node to the AST:

$ python -c "import bashlex; print(''.join(p.dump() for p in bashlex.parse('$snippet', proceedonerror=1)))"
CommandNode(pos=(0, 17), parts=[
  AssignmentNode(pos=(0, 5), word='num=2'),
  UnimplementedNode(pos=(6, 17), word='arr=(1 2 3)'),
])

This can be implemented as follows (see attachment for complete diff):

--- a/bashlex/flags.py
+++ b/bashlex/flags.py
@@ -52,4 +52,5 @@ word = enum.Enum('wordflags', [
+    'UNIMPLEMENTED', # word uses unimplemented feature (e.g., array)

--- a/bashlex/parser.py
+++ b/bashlex/parser.py
@@ -173,6 +173,8 @@ def p_simple_command_element(p):
+        if (p.slice[1].flags & flags.word.UNIMPLEMENTED):
+            p[0][0].kind = 'unimplemented'
@@ -720,6 +722,7 @@ class _parser(object):
+                                       proceedonerror=proceedonerror,

--- a/bashlex/tokenizer.py
+++ b/bashlex/tokenizer.py
@@ -199,7 +199,8 @@ eoftoken = token(tokentype.EOF, None)
-                 lastreadtoken=None, tokenbeforethat=None, twotokensago=None):
+                 lastreadtoken=None, tokenbeforethat=None, twotokensago=None,
+                 proceedonerror=None):
@@ -232,6 +233,7 @@ class tokenizer(object):
+        self._proceedonerror = proceedonerror
@@ -391,7 +393,7 @@ class tokenizer(object):
-        d['dollar_present'] = d['quoted'] = d['pass_next_character'] = d['compound_assignment'] = False
+        d['dollar_present'] = d['quoted'] = d['pass_next_character'] = d['compound_assignment'] = d['unimplemented'] = False
@@ -467,6 +469,19 @@ class tokenizer(object):
+        def handlecompoundassignment():
+            # note: only finds matching parenthesis, so parsing can proceed
+            handled = False
+            if self._proceedonerror:
+                ttok = self._parse_matched_pair(None, '(', ')')
+                if ttok:
+                    tokenword.append(c)
+                    tokenword.extend(ttok)            
+                    d['compound_assignment'] = True
+                    d['unimplemented'] = True
+                    handled = True
+            return handled
+
@@ -512,6 +527,8 @@ class tokenizer(object):
+                elif c == '(' and handlecompoundassignment():
+                    gotonext = True
@@ -573,7 +590,7 @@ class tokenizer(object):
-        if d['compound_assignment'] and tokenword[-1] == ')':
+        if d['compound_assignment'] and tokenword.value[-1] == ')':
@@ -581,6 +598,10 @@ class tokenizer(object):
+        if d['compound_assignment']:
+            tokenword.flags.add(wordflags.ASSIGNARRAY)
+        if d['unimplemented']:
+            tokenword.flags.add(wordflags.UNIMPLEMENTED)

unimplemented-array-node-diff.txt

I can work this into a pull request if desired. I wasn't quite sure of the best way to handle the flags, so suggestions would be welcome. For example, I was going to use parser flags, but they seemed more related to internal state than final attribute.

New release - When ?

Hi,

The current release, 0.16, came out in September 2021. Since then you made a change to fix the blank-line issue, which never got released.

Could you cut a new release, 0.17?

LICENSE file location too aggressive

bashlex/setup.py

Line 42 in 9017528

data_files = [('', ['LICENSE'])]

(root) # pip install bashlex

LICENSE put at /usr/local/LICENSE

(root) # pip install --user bashlex

LICENSE put at /root/.local/LICENSE

(venv) $ pip install bashlex

LICENSE put at venv/LICENSE

None of these locations seems related to bashlex at first glance.

I think this is too aggressive. Considering LICENSE file is included in tarball, and after installing bashlex ~~LICENSE file can be found in site-packages/bashlex-0.13.dist-info/LICENSE~~, is this still necessary?

comment parsing #

I checked the tests and there is a test case for comments, but in my application, parsing a comment (e.g. # foo) fails:

File "lib/python3.6/site-packages/bashlex/parser.py", line 611, in parse
	    ef.visit(parts[-1])
File "lib/python3.6/site-packages/bashlex/ast.py", line 35, in visit
	    k = n.kind
builtins.AttributeError: 'NoneType' object has no attribute 'kind'

I used the latest version, 0.14.

Multiple new lines at end of file

I am working off the base of PR #71, so this will be relevant once that PR is merged.

Multiple newlines at the end of the input trigger an "Unexpected EOF" error at line 546, in p_error.

Minimal example:

from bashlex import parse

parts = parse('cmd1\n\n')

This is not the case for a single newline at the end of the file (as of PR #71).
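Until the parser handles this, one possible workaround is to collapse the run of trailing newlines to a single one before calling parse. This is only a sketch of a caller-side preprocessing step, not a fix in bashlex; collapse_trailing_newlines is a hypothetical helper name:

```python
def collapse_trailing_newlines(source):
    """Reduce any run of newlines at the very end of the input to one newline."""
    stripped = source.rstrip("\n")
    if stripped == source:
        # No trailing newline at all: leave the input untouched.
        return source
    return stripped + "\n"


print(repr(collapse_trailing_newlines("cmd1\n\n")))
```

Because only the tail of the string is changed, the positions of everything before it stay the same.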

0.12 tag missing

There's no tag for 0.12, which is the version listed on PyPI.

What precise version of the bash parser was this transliterated from?

Hey idank, what precise version of bash did you use to build this? How difficult is it to redo or update?

space at the end of line - parse error

>>> bashlex.parse('cmd1\ncmd2 \ncmd3\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bashlex/parser.py", line 614, in parse
    part = _parser(s[index:], strictmode=strictmode).parse()
  File "bashlex/parser.py", line 682, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
  File "bashlex/yacc.py", line 277, in parse
    return self.parseopt_notrack(input,lexer,debug,tracking,tokenfunc,context)
  File "bashlex/yacc.py", line 1079, in parseopt_notrack
    tok = self.errorfunc(errtoken)
  File "bashlex/parser.py", line 539, in p_error
    p.lexer.source, p.lexpos)
bashlex.errors.ParsingError: unexpected token 'cmd3' (position 1)

Warnings on first usage

When bashlex is used for the first time, it prints:

WARNING: Token 'COND_ERROR' defined, but not used
WARNING: There is 1 unused token

[request] Add support for local, global, and export

If an assignment is used after local, global, or export, it is treated as a word node, not an assignment node.

Parsing awk command with quotes

Hi,

I am using bashlex to parse some shell commands, and I ran into a problem with command arguments enclosed in single quotes ('...'): the word node does not include the surrounding quotes.

Example:

$ awk '{print $0};' /tmp/test

The tree node dump shows only {print $0}; whereas the correct token should be '{print $0};'.

CommandNode(pos=(0, 25), parts=[
   WordNode(pos=(0, 3), word='awk'),
   WordNode(pos=(4, 15), word='print $0;'),
   WordNode(pos=(16, 25), word='/tmp/test'),
 ])

I would like to change the parser/tokenizer myself; if you can point me to where the change belongs, I would be glad to do it.
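In the meantime, the quoted text can be recovered from the node's pos, assuming pos holds (start, end) offsets into the original string, as the dump above suggests. A minimal sketch; source_text is a hypothetical helper, and the command string is the one that matches the offsets shown in the dump:

```python
def source_text(source, pos):
    """Return the slice of the original input covered by a node's pos."""
    start, end = pos
    return source[start:end]


command = "awk 'print $0;' /tmp/test"

# Offsets copied from the WordNode entries in the dump above.
for pos in [(0, 3), (4, 15), (16, 25)]:
    print(source_text(command, pos))
```

Slicing the input this way keeps the surrounding quotes, since they are part of the span even though they are stripped from the word attribute.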

Problems with $10 and bigger numbers

I found a problem with multi-digit numbers after the $ sign: bashlex returns only the first digit, as you can see in this example:
[ParameterNode(pos=(879, 881) value='1')] pos=(879, 883) word='$124')

It should be value='124', not just '1'.
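For comparison, bash itself reads only a single digit after an unbraced $, and multi-digit positional parameters must be written with braces. Under that rule $124 expands as ${1} followed by the literal 24, so the reported value may reflect bash semantics rather than a truncation. A small sketch of the rule; split_positional is a hypothetical helper and not part of bashlex:

```python
import re

# ${N} with any number of digits, or an unbraced $ followed by exactly one digit.
_POSITIONAL = re.compile(r"\$(?:\{(\d+)\}|(\d))")


def split_positional(word):
    """Return (parameter name, trailing literal) for a word that starts
    with a positional parameter, or None if it does not."""
    match = _POSITIONAL.match(word)
    if match is None:
        return None
    name = match.group(1) or match.group(2)
    return name, word[match.end():]


for word in ("$124", "${124}", "$1"):
    print(word, split_positional(word))
```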

Issues with parsing && and ||

The following parses just fine:

foo && bar

However, when I try to parse this:

foobar=$(foo && bar)

I get the following error:

bashlex.errors.ParsingError: unexpected token ')' (position 10)

The same goes for ||.

Full Traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/parser.py", line 610, in parse
    parts = [p.parse()]
             ^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/parser.py", line 691, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/yacc.py", line 439, in parse
    p.callable(pslice)
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/parser.py", line 167, in p_simple_command_element
    p[0] = [_expandword(parserobj, p.slice[1])]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/parser.py", line 145, in _expandword
    parts, expandedword = subst._expandwordinternal(parser,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/subst.py", line 271, in _expandwordinternal
    node, sindex[0] = _paramexpand(parserobj, string, sindex[0])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/subst.py", line 165, in _paramexpand
    return _extractcommandsubst(parserobj, string, zindex + 1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/subst.py", line 55, in _extractcommandsubst
    node, si = _parsedolparen(parserobj, string, sindex)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/subst.py", line 42, in _parsedolparen
    node, endp = _recursiveparse(parserobj, base, sindex, tokenizerargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/subst.py", line 23, in _recursiveparse
    node = p.parse()
           ^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/parser.py", line 691, in parse
    tree = theparser.parse(lexer=self.tok, context=self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/yacc.py", line 537, in parse
    tok = self.errorfunc(errtoken)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/___/.local/lib/python3.11/site-packages/bashlex/parser.py", line 548, in p_error
    raise errors.ParsingError('unexpected token %r' % p.value,
bashlex.errors.ParsingError: unexpected token ')' (position 10)

Hope this helps, thanks!

won't parse 'rm -f !(file.sh)'

bashlex.errors.ParsingError: unexpected token '(' (position 7)

This may be caused by '!' being interpreted as a WORD instead of BANG.

Fix bashlex to parse whole test file

Hi,
I am having trouble parsing a whole test file. The problem is that bashlex fails on empty lines and comments. Could you please fix it? I would really like to use it.
Thanks

Variable names cannot have numbers in them

If a variable name in an assignment contains a digit, the parser treats the whole thing as a single word. In bash this is correct only when the first character is a digit: 2all=something is not an assignment, but a2ll=something is. The parser currently treats both as not being assignment statements.
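The naming rule described above can be sketched as a regular expression: a bash name starts with a letter or underscore and continues with letters, digits, or underscores. The helper below is hypothetical and deliberately minimal; it ignores += and array subscripts such as a[0]=x:

```python
import re

# A bash name: a letter or underscore, then letters, digits or underscores, then '='.
_ASSIGNMENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*=")


def looks_like_assignment(word):
    """Return True if the word starts with NAME= for a valid bash NAME."""
    return _ASSIGNMENT.match(word) is not None


for word in ("a2ll=something", "2all=something", "_x1=1", "plain"):
    print(word, looks_like_assignment(word))
```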

Cut a new release

The last one was in 2016. Without the additions in the most recent commit, bashlex fails to install for me because it tries to install enum34, which then gets used in place of the standard enum module.

Collecting compiledb
  Downloading https://files.pythonhosted.org/packages/20/b8/b0912c8198baf67ebba62c46d21bbb16f03ff072eee782ee659dd11520ee/compiledb-0.9.8.tar.gz
Collecting click (from compiledb)
  Downloading https://files.pythonhosted.org/packages/f8/5c/f60e9d8a1e77005f664b76ff8aeaee5bc05d0a91798afd7f53fc998dbc47/Click-7.0.tar.gz (286kB)
    100% |████████████████████████████████| 286kB 5.8MB/s
Collecting bashlex (from compiledb)
  Using cached https://files.pythonhosted.org/packages/e6/83/8f35a0a430908e5c964fbf31a8e46fbac125d1bbf066a1e26110c618a3ff/bashlex-0.12.tar.gz
Collecting enum34 (from bashlex->compiledb)
  Downloading https://files.pythonhosted.org/packages/bf/3e/31d502c25302814a7c2f1d3959d2a3b3f78e509002ba91aea64993936876/enum34-1.1.6.tar.gz (40kB)
    100% |████████████████████████████████| 40kB 8.5MB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/a3/.local/share/pythons/c/lib/python3.7/site-packages/setuptools/__init__.py", line 6, in <module>
        import distutils.core
      File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/core.py", line 16, in <module>
        from distutils.dist import Distribution
      File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 9, in <module>
        import re
      File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 143, in <module>
        class RegexFlag(enum.IntFlag):
    AttributeError: module 'enum' has no attribute 'IntFlag'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/lz/tm467dx170g12t9bg6mg9h8w0000gn/T/pip-install-vbmmlqvd/enum34/
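One common way to keep enum34 off modern interpreters is to make the dependency conditional on the Python version, since enum has been in the standard library since 3.4. The sketch below shows that idea only; it does not claim to reproduce what the most recent commit does, and the names ENUM34_REQUIREMENT and needs_enum34 are hypothetical:

```python
import sys

# A requirement string with an environment marker, usable in install_requires.
ENUM34_REQUIREMENT = 'enum34; python_version < "3.4"'


def needs_enum34(version_info=sys.version_info):
    """Return True only for interpreters that lack the standard enum module."""
    return tuple(version_info[:2]) < (3, 4)


print(needs_enum34((3, 7, 2)))
print(needs_enum34((2, 7, 18)))
```

With a marker like this, pip skips enum34 on Python 3.7, so the backport never shadows the standard enum that re depends on.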

If statements only have support for conditional commands (COND_CMD)

Parsing an if statement crashes if you try anything along the lines of [[ 0 -eq 0 ]]. The bug arises because state 41 has no transition to states for parsing words in the test