engelberg / instaparse

License: Eclipse Public License 1.0

Clojure 100.00%

instaparse's Issues

Information about hidden content and tags gets lost when pretty-printing

This somewhat complicates things like serialization. Do you accept patches or is this an intentional decision?

'.' is not treated as an end of rule character

In the syntax notation table, both ; and . are listed as end-of-rule characters. It is easy to see that ; is treated as end of rule:

(insta/parser
  "a = b c ;
   b = 'b' ;
   c = 'c' ;")
; => (legal parser)

whereas . is not treated as an end of rule character:

(insta/parser
  "a = b c .
   b = 'b' .
   c = 'c' .")
; RuntimeException Error parsing grammar specification:
; Parse error at line 1, column 9:
; a = b c .
;         ^
; Expected one of:
; (list of different alternatives)

Either . should be left out of the syntax notation table, or it should be treated like ;.

Edit: This was tested on the 2013-06-06 build of 1.2.0-SNAPSHOT and on 1.1.0.

Line and Column metadata

I noticed that the parse trees have begin and end position metadata, but not line and column data. Failures seem to re-create line and column metadata via re-walking the input string. It would be great if I could optionally enable line and column tracking during parsing, such that all parse results were fully decorated after the parse completes. Bonus points for begin/end line/column pairs.

Top-level epsilon parser doesn't work

instaparse.core> ((parser "S = epsilon") "")
Parse error at line 1, column 1:
nil
^

This is surprising, if of little practical consequence.

Allow creating custom terminal parse rules

Imagine I wanted to parse an indentation-sensitive language like Python.
The standard technique is to have a lexer that analyzes whitespace and generates WS or INDENT and DEDENT special tokens.
I'd like to be able to write a WS rule in Clojure and plug it into an existing grammar.

Support for the ABNF #rule

There seems to be no support for the #rule, as specified in RFC 2616, section 2.1. Is it in the works?

For now I guess I could just make the conversion from, say:

1#element

to:

( *LWS element *( *LWS "," *LWS element ))
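Until the #rule is supported, the rewrite above could be applied mechanically before handing the grammar text to the parser. The following is only a sketch: `expand-hash-rule` is a hypothetical helper name, it covers just the `1#element` form shown above, and it works purely on text.

```clojure
;; Hypothetical helper (not part of instaparse): expands the RFC 2616
;; `1#element` list rule into the plain ABNF text shown above.
(defn expand-hash-rule
  [element]
  (str "( *LWS " element " *( *LWS \",\" *LWS " element " ))"))

(expand-hash-rule "element")
;; => "( *LWS element *( *LWS \",\" *LWS element ))"
```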

Unsupported operation exception

I'm getting the following exception when calling (insta/parses), even though (insta/parse) with the same arguments works as expected. Note that both the grammar and source text that's being parsed are coming from string literals. This is occurring on 1.1.0-SNAPSHOT.

java.lang.UnsupportedOperationException: count not supported on this type: File
              instaparse.gll$total_success_QMARK_.invoke(gll.clj:105)
              instaparse.gll$push_result.invoke(gll.clj:203)
              instaparse.gll$eval6843$star_parse__6856.invoke(gll.clj:545)
              instaparse.gll$_parse.invoke(gll.clj:52)
              instaparse.gll$push_listener$fn__6744.invoke(gll.clj:229)
              instaparse.gll$step.invoke(gll.clj:278)
              instaparse.gll$run.invoke(gll.clj:294)
              instaparse.gll$run.invoke(gll.clj:282)
              instaparse.gll$parses.invoke(gll.clj:675)
              instaparse.core$parses.doInvoke(core.clj:62)
              <stack trace truncated for brevity>

Thanks for your hard work on the library - it's a gem!

Handling of \r

On Windows at least, it seems that carriage-returns get normalized to newlines:

user=> (parser "S = '\r\n'")
S = "\n"

Is this intended?

CRLF is used in lots of internet protocol grammars.

Surprising behavior of regex rule when parsing the empty string

I found an inconsistency between a grammar rule specified as a string literal and one specified as a regular expression when trying to parse an empty string. An example:

user=> (ns test
  #_=> (:require [instaparse.core :as insta]))
nil
test=> ((insta/parser "f = \"asdf\"" ) "")
Parse error at line 1, column 1:
nil
^
Expected:
"asdf" (followed by end-of-string)

test=> ((insta/parser "f = #\"asdf\"" ) "")
[:f]
test=>

I would have expected an error in both cases. This is a simplified example. In my real grammar the regex behavior leads to an ambiguous grammar because it can somehow match 0 characters of input (I haven't debugged this further).

I observed this behavior both with version 1.3.1 and 1.2.14.
Is this a bug or am I missing something obvious? Is there a work-around?

How to handle angle brackets in ABNF?

I’m trying to create a parser for IMAP using the ABNF from RFC 3501 but I’m a little stuck because the grammar contains angle brackets, and while the Instaparse docs on ABNF are quite impressive, and do include a section on angle brackets, that section is vague about what I should actually do about them.

I’m not very familiar with formal grammars and parsers, so I don’t know if this is crazy, but if in the grammar <something like this> is supposed to represent, basically, anything, then could I maybe replace it with 1*OCTET? Or would that be terrible?

Appreciate any help!

Weird tree behavior, can anyone explain?

I'm writing a little parser and I came across this strange behavior (strange at least to my limited knowledge):

(def testa-parser
  (insta/parser
"start := {test} | {num}
num := #'[0-9]+'
test := {A} <{spaces}> {B}
A := #'[a-z]+'
spaces := #'\\s+'
B := #'[a-z]+'
"))

(def testb-parser
  (insta/parser
"start := {A} <{spaces}> {B}
A := #'[a-z]+'
spaces := #'\\s+'
B := #'[a-z]+'
"))

(testa-parser "a b")
[:start [:test [:B "a"]] [:test] [:test [:A "b"]]]

(testb-parser "a b")
[:start [:A "a"] [:B "b"]]

The testa parser is problematic: I do not see why [:test ...] would not be properly nested, like [:test [:A "a"] [:B "b"]], as is the case when the test rule is the start rule.

Any ideas?

Thanks a lot

Add positional metadata to parse tree

Request from David Powell to add either character number or line/column info as metadata.

Inconsistent :red when using auto whitespace

(def whitespace (insta/parser "whitespace = #'\\s+'"))

(:grammar (insta/parser "S = A B <A> = 'foo' <B> = #'\\d+'"))

{:S
 {:red {:reduction-type :hiccup, :key :S},
  :tag :cat,
  :parsers ({:tag :nt, :keyword :A} {:tag :nt, :keyword :B})},
 :A {:red {:reduction-type :raw}, :tag :string, :string "foo"},
 :B {:red {:reduction-type :raw}, :tag :regexp, :regexp #"\d+"}}

(as expected) but

(:grammar (insta/parser "S = A B <A> = 'foo' <B> = #'\\d+'" :auto-whitespace whitespace))

{:whitespace
 {:red {:reduction-type :raw}, :tag :regexp, :regexp #"\s+"},
 :S
 {:red {:reduction-type :hiccup, :key :S},
  :tag :cat,
  :parsers
  ({:tag :cat,
    :parsers ({:tag :nt, :keyword :A} {:tag :nt, :keyword :B})}
   {:hide true, :tag :opt, :parser {:tag :nt, :keyword :whitespace}})},
 :A
 {:tag :cat,
  :parsers
  ({:hide true, :tag :opt, :parser {:tag :nt, :keyword :whitespace}}
   {:red {:reduction-type :raw}, :tag :string, :string "foo"})},
 :B
 {:tag :cat,
  :parsers
  ({:hide true, :tag :opt, :parser {:tag :nt, :keyword :whitespace}}
   {:red {:reduction-type :raw}, :tag :regexp, :regexp #"\d+"})}}

Notice how :A and :B have the :red field buried in :parsers. I'm using the grammar tree to infer some automatic transformations on the output, and accounting for such inconsistencies needlessly bloats the code (another one is how :parsers sometimes becomes :parser, but that's a minor annoyance).

Rule restatement/extending

Right now instaparse doesn't handle grammars like this well:

A='a';
A='b';

It will create grammar A='a' or A='b'. IMHO it should either throw an error, or better yet, just merge those two rules into A='a'|'b'.

I've seen that the ABNF notation documentation specifies this feature, so I guess this is a bug, not a missing feature :).

Improve error message when CFG is malformed due to a missing closing quote or paren.

Adaptation to ClojureScript

I have discussed it in this thread

Cheers,

Jeremys.

When a rule is hidden, the transformation is not applied.

Is this intentional?

Have to run just now, but if this isn't clear I can supply a test case.

Add token streams

It would be lovely to have ANTLR style token streams. Proposed syntax:

(def paren-ab-hide-parens
  (insta/parser
    "paren-wrapped = <=:paren '('> seq-of-A-or-B <:paren ')'>
     seq-of-A-or-B = ('a' | 'b')*"))

this would hide parentheses, but add them to a second version of the parse tree that could be revealed with

(paren-ab-hide-parens "(aba(abba)aba)" :show #{:paren})

regular expression hangs on bad input

This rule, which is malformed, causes instaparse to wander off and not come back:

evil-regex = #'([a-z]|A-Z| |:|"|/|.|*|-|_|#|)+';

running 1.2.2

remove the last | and this works fine:

okay = #'([a-z]|A-Z| |:|"|/|.|*|-|_|#)+';

Allow parsing stream (sequence) of arbitrary tokens, not a string

This may be a duplicate of #9, but maybe not.
If I already have a lexer that generates a seq of tokens I'd like to be able to parse it with instaparse.

Fully-qualify the :failure keyword of the :total parsing mode

This is a minor point, but it can be significant for certain use cases: I think that the :failure keyword that is inserted at the point of parsing failure when the :total option is true, should be fully-qualified (something like :instaparse/failure) to ensure that it cannot be confused with a genuine grammar rule by the same name.

I'm thinking that this could be useful when instaparse is used for grammar-assisted content generation (a form of auto-complete maybe?)

Thanks! 👍

Reflection warning with new ABNF 1.1.0-SNAPSHOT version

I'm seeing the following warnings from the new version of the library, even when not using the new ABNF features:

Reflection warning, instaparse/gll.clj:474:9 - call to equalsIgnoreCase can't be resolved.
Reflection warning, instaparse/gll.clj:485:35 - call to equalsIgnoreCase can't be resolved.

right associativity

Hi,

I'm converting an antlr grammar and I'm pleased with instaparse so far. I've hit a bit of a bump.

Most rules are left associative, so that 2 + 3 + 4, for instance, groups as (2 + 3) + 4. Not that it matters for addition.

But for powers, it does matter, because they must be right associative: 2^2^3 must parse as 2^(2^3) not (2^2)^3.

In antlr the rule looks like:

    |   expression (POW<assoc=right>) expression

which may not fit instaparse's syntax.

I vaguely recall there being a way to rearrange right associative rules to get them to work, but it's clearer to mark the token as right associative if your parsing algorithm allows this.

I perused the readme and didn't see a reference to right associativity, am I missing it?
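The rearrangement alluded to above is usually right recursion: instead of `expression POW expression`, the rule refers to itself only on the right-hand side, for example `pow = atom | atom '^' pow`. The sketch below illustrates that grouping with a hand-written function over a token sequence rather than with instaparse itself; `parse-pow` and the `[:pow ...]` node shape are illustrative names, not library API.

```clojure
;; Right recursion: pow := base ['^' pow]
;; Because the recursive call sits on the right of '^', the innermost
;; (rightmost) power is built first, giving right-associative grouping.
(defn parse-pow
  "Parses tokens such as [2 \"^\" 2 \"^\" 3].
  Returns [tree remaining-tokens]."
  [[base op & more :as tokens]]
  (if (= op "^")
    (let [[rhs remaining] (parse-pow more)]
      [[:pow base rhs] remaining])
    [base (rest tokens)]))

(first (parse-pow [2 "^" 2 "^" 3]))
;; => [:pow 2 [:pow 2 3]]
```

A left-recursive formulation (`pow = pow '^' atom | atom`) would instead group as (2^2)^3, which is the shape wanted for addition and subtraction.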

:enlive output produces vector on failure with :total true

I tried to file this yesterday but don't see it. Apologies if this is posted twice.

Here's the breaking input:

(def eat-a (insta/parser "Aeater = #'[a]'+" :output-format :enlive))

(eat-a "aaaaaaa" :total true)
{:tag :Aeater, :content ("a" "a" "a" "a" "a" "a" "a")}

(eat-a "aaaaaaaabbbbbb" :total true)

{:tag :Aeater, :content ("a" "a" "a" "a" "a" "a" "a" "a" 
      {:tag :instaparse/failure, :content ["bbbbbb"]})}

I'd say ["bbbbbb"] should be a list, not a vector.

Also, it would be just fantastic if :total true could be added to the definition of a parser. In the code I'm working on, I need a contract with the parsers that flat-tree (which does what you'd expect) will return the whole string, even on failure to parse.

auto_flatten_seq reverses tree order

Background

I'm using Instaparse to parse a simple assembly-like language used by 1980s test equipment. It has support for including files and binding registers to symbols. These are part of the language spec, and I added them into my grammar, then post-process the AST to insert the included files and resolve symbols. I chose clojure.walk and clojure.core.match for the post-processing, and it's all pretty straightforward.

The Problem

I found an input which behaves in an extremely unusual way. If I (pprint ast), I get exactly what I would expect, a tree representing the source. But, if I use either prewalk or postwalk across the AST, some (but not all) branches come back as seqs in reverse order. This is rather bad behavior for an assembler.

I systematically ripped out everything I'm doing around the parsing and transforming process, then started bisecting the input to find the minimal case which triggers the bug. It seems that for smaller trees, Instaparse uses normal clojure.lang.PersistentVectors, but when the tree size hits some threshold, it switches to instaparse.auto_flatten_seq.FlattenOnDemandVector. I can comment out any line of my source input and get the behavior I expect. Once the input exceeds 34 lines, the bad behavior is triggered.

Test case

Because this is kind of a hairy (and I have to assume non-obvious) issue, I put together a project which exhibits the behavior: https://github.com/ieure/instaparse-reverse-test

It has the grammar I defined, the minimal input that triggers the bug, and a testcase which exercises it. A clone and lein test should be sufficient to exhibit the behavior. I am using the latest Clojure and Instaparse, and there are no other dependencies.

The test case will print the type of the :PROGRAM_BODY node, and you can see that the test only fails when it is an instaparse.auto_flatten_seq.FlattenOnDemandVector. You can substitute the walkfn which does this with a fn that returns its input unaltered, or identity, and get the same behavior. In short, it isn't what the function is doing that causes the issue, it's the process of walking over the AST. Both prewalk and postwalk cause the problem.

It's possible that I'm doing something wrong here, but I cannot think what. It's unusual that pprint doesn't trigger the bug, but prewalk does, and this may be a clue to what's wrong.
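One possible mechanism, offered as an assumption rather than a confirmed diagnosis: as far as I know, clojure.walk rebuilds a generic collection with `(into (empty form) ...)`. If `empty` on the custom vector type returned something that conjoins at the front, like a list, the children would come back reversed, while pprint, which only reads the collection, would be unaffected. The snippet below shows the effect with core types only.

```clojure
(require '[clojure.walk :as walk])

;; conj on a vector appends, so order is preserved
(into [] [1 2 3])
;; => [1 2 3]

;; conj on a list prepends, so order is reversed
(into () [1 2 3])
;; => (3 2 1)

;; With ordinary vectors, walking with identity is a no-op
(walk/postwalk identity [:a [:b "x"] [:c "y"]])
;; => [:a [:b "x"] [:c "y"]]
```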

IndexOutOfBoundsException in failure.clj

With 1.1.0-SNAPSHOT I get an IndexOutOfBoundsException at line 18 of failure.clj with the following snippet of code:

(def test-grammar "
  ImportStatement       = 'CAN HAZ' Whitespace Identifier '?' EndOfLine
  Whitespace            = #'\\s+'
  OptionalWhitespace    = #'\\s*'
  EndOfLine             = OptionalWhitespace ('\\n' | '\\r' | '\\r\\n')
  Identifier            = #'[_\\p{Alpha}]\\w*'
  ")
(def test-parser (insta/parser test-grammar))
(test-parser "CAN HAZ STDIO?\n" :start :ImportStatement)

ABNF syntax

Support for numeric ranges, comments and possibly other aspects of ABNF syntax.

Tag c38b1d with tag v1.2.4

Sorry to nitpick, finding the relevant version of the code for 1.2.4 involved finding the pom.properties file in the Maven repo - would be nice if it were tagged in git.

passing nil to a parser gives NullPointerException

Passing nil to a parser gives a NullPointerException. For example:

((insta/parser "S='a'") nil)
NullPointerException   clojure.core/subs (core.clj:4517)

Might it be better to generate a parse failure object instead?

Redefined rule in language should be a warning?

While playing with Instaparse for the first time, I was implementing a 50+ rule grammar for reading an arbitrary file. In the process, I accidentally re-used a rule name and was thoroughly confused for 'some time' until I finally realised that the last definition of it was being used instead of the one I was expecting.

Now that this has happened to me, it will probably be trivial to work out next time, but it would be nice to have some warning when a rule name is re-used.

Windows Line Endings

This is a super cool library and I feel like a jerk for being the guy who nitpicks over silly things like whitespace, but Windows line endings are really quite annoying for me when I'm trying to read the code in Vim.

Luckily, there is a trivial fix: find . -name '*.clj' | xargs dos2unix

Regex matching bug

Hey, I've just discovered a bug for the following grammar:

(insta/parser
    "ws = #'\\s+';
    Int = #'[0-9]+';
    Double = #'[0-9]+\\.[0-9]*|\\.[0-9]+';
    <ConstExpr> = Int | Double;
    Input = ConstExpr <ws> ConstExpr;"
    :start :Input)

and input: 30 .2

the resulting output is: [:Input [:Double ".2"] [:Double ".2"]]
or with meta attached:
^{:instaparse.gll/start-index 0, :instaparse.gll/end-index 5} [:Input
^{:instaparse.gll/start-index 0, :instaparse.gll/end-index 2} [:Double ".2"]
^{:instaparse.gll/start-index 3, :instaparse.gll/end-index 5} [:Double ".2"]]

Adding ^ (beginning of line marker) in front of all regexes works, so does switching <ConstExpr> = Int | Double; into <ConstExpr> = Double | Int;.
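The reported output is consistent with the regex being applied as an unanchored search over the remaining text, which would also explain why adding ^ helps. That cause is an assumption on my part, not something confirmed in the report. The plain java.util.regex behavior can be checked without instaparse (`double-re` and `anchored-double-re` are illustrative names):

```clojure
(def double-re #"[0-9]+\.[0-9]*|\.[0-9]+")
(def anchored-double-re #"^(?:[0-9]+\.[0-9]*|\.[0-9]+)")

;; An unanchored search skips over "30 " and finds ".2" later in the input
(re-find double-re "30 .2")
;; => ".2"

;; Anchored at the start, "30 .2" is not a Double
(re-find anchored-double-re "30 .2")
;; => nil

(re-find anchored-double-re ".2")
;; => ".2"
```

Note that the non-capturing group matters: in `^a|b` the anchor binds only to the first alternative.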

EBNF constructor in combinator library

For example, (ebnf "(A | B)+") would create something composable with other combinators.

Open question: Should this constructor only work on right-hand side fragments, or should it also operate on whole rules, for example, (ebnf "S = (A | B)+").

((parser "S = ('a'?)+") "")

Instaparse currently makes the assumption that in a plus parser, you don't care about any nullable interpretation and the above example will fail.

It's hard to imagine anyone actually caring about this behavior, but I've marked this down to investigate whether it is worth considering this a bug and fixing it.

Allow designation of tokens as garbage

At present, one can emulate the behaviour of a lex/yacc grammar using instaparse only by suitably modifying the source grammar to explicitly account for traditionally ignored whitespace characters. Over the course of the C language grammar or the Pascal language grammar this can easily amount to hundreds of rule changes, as one must explicitly provide for the possibility of whitespace in every nonterminal concatenation where the standard defined source grammars state simply "discard whitespace", assuming a separate lexer.

It would be awesome if the top level parser took an :ignored "ignored-forms-rule", which would be implicitly used as a token sink throughout the grammar.

EBNF-style comments

Traditional EBNF comment delimiters are (* and *).

Question: Meta Knowledge in Transform

When my mapped "transformation" is invoked during insta/transform, is there information available? Basically, I would like to access the particular branch of the parse-tree that I am currently transforming.

Use case: A parse tree may contain one or more 'select' rules as stated by the grammar. During transformation, I would like to know which select instance we are on (a segment of its particular parse-tree vector would suffice) so that we can do further reasoning on the current state.

I can provide examples if required.

Dependency issue: clojure.tools.trace in 1.2.0

Moving from 1.1.0 to 1.2.0, I encountered the following error, which appears to be from (:use clojure.tools.trace) in repeat.clj.

Adding the dependency for org.clojure/tools.trace to my own project.clj removes the error.

Exception in thread "main" java.io.FileNotFoundException: Could not locate clojure/tools/trace__init.class or clojure/tools/trace.clj on classpath: 
    at clojure.lang.RT.load(RT.java:443)
    at clojure.lang.RT.load(RT.java:411)
    at clojure.core$load$fn__5018.invoke(core.clj:5530)
    at clojure.core$load.doInvoke(core.clj:5529)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at clojure.core$load_one.invoke(core.clj:5336)
    at clojure.core$load_lib$fn__4967.invoke(core.clj:5375)
    at clojure.core$load_lib.doInvoke(core.clj:5374)
    at clojure.lang.RestFn.applyTo(RestFn.java:142)
    at clojure.core$apply.invoke(core.clj:619)
    at clojure.core$load_libs.doInvoke(core.clj:5413)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at clojure.core$apply.invoke(core.clj:621)
    at clojure.core$use.doInvoke(core.clj:5507)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at instaparse.repeat$eval3869$loading__4910__auto____3870.invoke(repeat.clj:1)
    at instaparse.repeat$eval3869.invoke(repeat.clj:1)
    at clojure.lang.Compiler.eval(Compiler.java:6619)
    at clojure.lang.Compiler.eval(Compiler.java:6608)
    at clojure.lang.Compiler.load(Compiler.java:7064)
    at clojure.lang.RT.loadResourceScript(RT.java:370)
    at clojure.lang.RT.loadResourceScript(RT.java:361)

Seemingly trivial terminal renamings can change behavior

I have found cases where changing my rule names (e.g. the terminals) can change the behavior of the parser. For example, visible = #"\p{Graph}+" worked fine; however, renaming 'visible' to 'symbol' changed the behavior. I have not constructed a small reproducible example yet -- I'll check back in after I do.

Representing '\' with string literal notation is buggy

Representing the string literal \ seems to not work as intended, and returns an error in certain cases:

(insta/parser "a = b c
               b = 'a'
               c = '\\\\'")
;; spits out
a = b c
b = "a"
c = "\\"

But the equivalent variant

(insta/parser "a = b c
               c = '\\\\'
               b = 'a'")

Throws the following exception:

RuntimeException Error parsing grammar specification:
Parse error at line 3, column 22:
               b = 'a'
                     ^
Expected one of:
=
::=
:=
:
?
*
+
#"\s*[.;]\s*"
<
ε
eps
EPSILON
epsilon
Epsilon
|
/
!
&
(
{
[
#"#\"(?:[^\"]|(?<=\\)\")*\""
#"#'(?:[^']|(?<=\\)')*'"
#"\"(?:[^\"]|(?<=\\)\")*\""
#"'(?:[^']|(?<=\\)')*'"
#"[^, \r\t\n<>(){}\[\]+*?:=|'"#&!;./]+"

  instaparse.cfg/build-parser (cfg.clj:222)

The same issue appears when using \" to denote string literals as well. #'\\\\' seems to suffer from the same problem as well.

Suspected bug with negative lookahead

I think I found a bug. I don't know if it is linked to Java regex character classes (I use \p{Graph} below). Can you take a look?

I created a file simple1.bnf that I hoped would work. It contains:

s = {x sp} [x]
x = word | !word visible
sp = #"\s+"
word = #"\w+"
visible = #"\p{Graph}+"

As you will see, the above did not work for me. So I created this file, simple2.bnf. It is the same as the above except for one line (where lookahead is removed):

x = word | visible

Here is a helper function:

(defn parses
  [bnf s]
  (insta/parses (insta/parser (clojure.java.io/resource bnf)) s))

The next examples show what happens with simple1.bnf:

user=> (parses "simple1.bnf" "h_ello w_orld")
([:s [:x [:word "h_ello"]] [:sp " "] [:x [:word "w_orld"]]]) ; good

user=> (parses "simple1.bnf" "h_ello w$orld")
() ; unexpected

The following examples show that simple2.bnf works, but I don't like the ambiguity:

(pprint (parses "simple2.bnf" "h_ello w$orld"))
([:s [:x [:visible "h_ello"]] [:sp " "] [:x [:visible "w$orld"]]]
 [:s [:x [:word "h_ello"]] [:sp " "] [:x [:visible "w$orld"]]])

Clojure 1.7.0 Alpha 2 compatibility - instaparse.combinators-source/cat

When using instaparse with 1.7.0 Alpha 2, this warning appears:

WARNING: cat already refers to: #'clojure.core/cat in namespace: instaparse.combinators-source, being replaced by: #'instaparse.combinators-source/cat

Trying to parse BibTeX, no output

Disclaimer: I'm completely new to Clojure (coming from Ruby), and am just learning about parsers, so I might be totally off - any gentle guidance appreciated! I work a lot on publication metadata, and wanted to play around with parsing BibTeX (fun, I know!)...

I found an EBNF specification for BibTeX here: bibstuff, however it gives me tons of parsing errors. Is there something simple I need to understand to transform this into a format that instaparse can understand, or do I just need to study the entire specification?

(I load the string from a file to avoid problems with " etc)

grammars as strings

Why are grammars strings rather than a Clojure macro? It seems strange to me, when Lisps are known for DSLs, to make the grammar a string...

OutOfMemoryError Java heap space [trace missing] when parsing small file

Hi,

I'm trying to parse a 7k line text file with some rules here:

https://gist.github.com/lewang/5900166#file-subs-parser-clj-L27

It works with half at a time (~3.5k), i.e. parsing the top and bottom halves separately, but trying to parse the whole file results in

OutOfMemoryError Java heap space [trace missing]

The highlighted line,

subscription = #'.*?(?=\\s+-)' < separator > date

seems to cause the problem. It does not happen if I modify the rule slightly.

Support for EBNF comments

It would be nice to support EBNF comments:

(* comment text goes here *)

Raw strings regex escaping

I noticed your comment on the regex double backslash escaping one has to do in Clojure and decided to look into this for a bit. There seems to be a way to avoid this, perhaps this can be used with instaparse, I'm not sure but in case it does:

(java.util.regex.Pattern/quote "\r\n?|\n")
;"\\Q\r\n?|\n\\E"
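One caveat worth checking before relying on this: Pattern/quote wraps its argument in \Q...\E, which makes every character literal, so metacharacters such as ? and | stop acting as operators. A small check with core functions only (`quoted` is just a local name for the example):

```clojure
(def quoted (java.util.regex.Pattern/quote "a.b"))

quoted
;; => "\\Qa.b\\E"

;; The quoted pattern matches only the literal text "a.b"
(re-matches (re-pattern quoted) "a.b")
;; => "a.b"
(re-matches (re-pattern quoted) "axb")
;; => nil

;; The unquoted pattern still treats . as "any character"
(re-matches #"a.b" "axb")
;; => "axb"
```

So quoting looks suited to literal terminals rather than to patterns like \r\n?|\n where the operators need to stay active.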

tutorial edits

I so enjoyed reading the extremely well-written instaparse tutorial that I thought I should take the time to suggest a few small edits, in case they're helpful:

"The string specification allows the parser to rebuilt with a different output format..."

"to rebuilt" -> "to be rebuilt"

"...implemented a wrapper around Clojure's vectors that use..."

"that use" -> "that uses", assuming that it is the wrapper (singular) that uses, not the vectors that use

"Many of the libraries use a recursive-descent strategy that fail for left-recursive grammars..."

"that fail" -> "that fails", if it's the strategy (singular) that fails; or "and fail", if it's the libraries (plural) that fail.

"...showing you the furthest point it got in parsing your text..."

"got" -> "reached" or maybe "got to", as in "it reached a point" or "it got to a point", not "it got a point"

"Nevertheless, there may be times where..."

"where" -> "when" (times are when), or maybe "places" or "contexts" or "situations" where

"Supports both of Clojure's most popular tree formats (hiccup and enlive) as an output target."

"an output target" -> "output targets" ("formats" and "targets" both plural)

"So, as is often the case in Clojure, use recursion judiciously..."

"the case" -> (e.g.) "advisable", as in "it is often advisable to use recursion judiciously", not "it is often the case to use recursion judiciously"

Apologies if I've misread something.

lein jar does not work with :auto-whitespace option

The new :auto-whitespace option (which, by the way, is very cool, thanks!) does not seem to work with lein jar. A project that works fine with lein run and friends gives Assert failed: (let [ws-parser (get options :auto-whitespace)] (or (nil? ws-parser) (instance? Parser ws-parser))) when I try to generate a jar.

I've created a small test case to demo the problem at https://github.com/deg/test-ws-jar

Strange regexp fail

I have a date-related parsing task and have a rule for day parsing. I came up with the following regular expression to match it: #"0?[1-9]|[12][0-9]|3[01]". However this does not work well with instaparse:

user=> ((insta/parser "Day = #'0?[1-9]|[12][0-9]|3[01]'") "01")
[:Day "01"]
user=> ((insta/parser "Day = #'0?[1-9]|[12][0-9]|3[01]'") "09")
[:Day "09"]
user=> ((insta/parser "Day = #'0?[1-9]|[12][0-9]|3[01]'") "9")
[:Day "9"]
user=> ((insta/parser "Day = #'0?[1-9]|[12][0-9]|3[01]'") "10")
Parse error at line 1, column 1:
10
^
Expected:
#"0?[1-9]|[12][0-9]|3[01]" (followed by end-of-string)

user=> (re-matches #"0?[1-9]|[12][0-9]|3[01]" "10")
"10"

That is, only 1..9 or 01..09 values work. Everything else (10..31) does not work, however the regexp itself matches these values.
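A likely explanation, assuming the parser matches the regex as a prefix at the current position and then requires the rest of the input to be consumed (an assumption about the internals, not something stated in this report): Java alternation is ordered, so on "10" the first alternative 0?[1-9] succeeds with just "1", leaving "0" unparsed. re-matches hides this because it insists on a full match and keeps trying alternatives. `prefix-match` below is a hypothetical helper built on Matcher.lookingAt to show the difference.

```clojure
;; Hypothetical helper: returns the text matched at the start of s, or nil.
(defn prefix-match [re s]
  (let [m (re-matcher re s)]
    (when (.lookingAt m)
      (.group m))))

;; Full match: tries alternatives until the whole string matches
(re-matches #"0?[1-9]|[12][0-9]|3[01]" "10")
;; => "10"

;; Prefix match: the first alternative wins with a single character
(prefix-match #"0?[1-9]|[12][0-9]|3[01]" "10")
;; => "1"

;; Putting the two-digit alternatives first avoids the short match
(prefix-match #"3[01]|[12][0-9]|0?[1-9]" "10")
;; => "10"
(prefix-match #"3[01]|[12][0-9]|0?[1-9]" "9")
;; => "9"
```

Under that assumption, reordering the alternatives so the longer ones come first would be a workaround for the Day rule.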