
npeg's People

Contributors

clyybber, dxxb, ehmry, github-actions[bot], liquidev, menduist, timotheecour, varriount, zevv


npeg's Issues

Feature request: Parse from seq[Token]

Peg is great. One of its advantages is that it's suitable for both lexing and parsing. That doesn't mean it's always preferable to do both in one step, so we should provide an interface where the user can choose either to write one peg that does both lexing and parsing, or to write two pegs, one for lexing and one (fed from the previous) for parsing.

To achieve this we need a match() that accepts seq[Token], where Token is user-defined (probably some codegen is involved here to generate the match() proc).

Nim parser generator story and npeg's role

Intro:
I realize npeg's original intention was (still is?) to be a nifty tool for parsing relatively simple grammars with pure Nim code. Therefore it's justifiably unfit for larger projects with more complex grammars.

But why is it unfit?
Npeg has come a looooong way! Many of its shortcomings are gone. VM problems are gone and recently you can even write grammars that parse from seq of tokens instead of string (#20)!

But well... the syntax is a rather ugly ad-hoc modification of PEG. This is of course necessary for it to be accepted by Nim's parser. Nim-compatible syntax comes with several advantages: firstly it allows you to leverage Nim's parser instead of implementing one yourself, and secondly it allows the user to write the DSL code in a macro block argument (which is cool and convenient). For large projects, though, you don't want to use such a syntax just so that you can write directly in a macro block argument; you would much prefer to give this up and use dedicated files (e.g. .npeg instead of .nim) if it meant better syntax.

Another limitation (less important imo, yet worth mentioning) is the lack of a lexer/parser pipeline. Yes, you can now parse from tokens, but the two steps being separate hurts performance. Ideally the lexer and parser should work concurrently.

What should we do?
The thing is... there is really no tool to use with Nim for this purpose. Projects like Bison or ANTLR don't support Nim. I think this situation is unacceptable, or at least there is work that must be done in this direction. Below I list some options to improve the situation:

  • contribute to Bison or ANTLR (add Nim support)
  • extend npeg
    • add support for reading standard PEG syntax mixed with Nim code from .npeg
    • implement lexer/parser pipeline
    • implement at least a linter for .npeg files
  • keep npeg nifty as it is and begin a new project oriented towards larger scale parsing

Please comment on which one you think is most logical, most preferable, most feasible, easiest.

More/larger examples of error reporting

Hi

I am really not able to wrap my head around error reporting. I understand how to use it in small fragments. But I am not sure how to structure the grammar so that the error reporting is effective. Any longer example code here or general tips would be helpful.

thanks
Rishav
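One idiom that tends to make error reporting effective (a hedged sketch of my own, not taken from the npeg docs): place an E"..." error pattern as the last alternative of an ordered choice at points where the grammar is committed, so a failure there raises a labeled error instead of silently backtracking.

```nim
import npeg

# Hedged sketch: once we have seen "name=", the grammar is committed,
# so a missing number should be a hard error rather than a backtrack.
let p = peg "assignment":
  name <- +Alpha
  value <- +Digit | E"expected a number after '='"
  assignment <- name * '=' * value

doAssert p.match("x=42").ok
doAssertRaises(NPegException):
  discard p.match("x=oops")
```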

When defining parameterized rules in a grammar block, is the block name needed?

I have a grammar block:

grammar "ib":
    EOL <- "\r\n" | "\n" | "\r"
    whitespace <- *Blank
    Static <- >whitespace
    

And with normal rules it works fine. But as soon as I add parameterized rules it requires me to add ib. in front of everything.

Erroring code:

grammar "ib":
    EOL <- "\r\n" | "\n" | "\r"
    whitespace <- *Blank
    Static <- >whitespace
    Line(content) <- Static * content * EOL

Working code:

grammar "ib":
    EOL <- "\r\n" | "\n" | "\r"
    whitespace <- *Blank
    Static <- >whitespace
    Line(content) <- ib.Static * content * ib.EOL 

I would assume that this is either a bug or I am missing something; either way, how do I fix this?

Large grammars break NPeg

For very large grammars (such as SQL) NPeg's code generation mechanism will fail:

..\npeg\src\npeg\codegen.nim(340, 18) Error: case statement has too many cases for computed goto

(another problem here is that the error message doesn't indicate what part of the grammar is the actual cause)

Just wanted to say npeg is great

Hi, dear developers

Some months ago I started making a concept of a markup language which would be useful for UI design description.

And a week ago I started development.
I know that parsing is actually pain, so I decided to take this library for usage.

It's so amazing! It only took 3 days to find out how to use it, and now I have clean code with several rules which can actually parse the code!

The parser is still in development, but I'm sure I will be able to make first release soon.

Thank you very much! You are great!

lazy matching

hey,

I remember in regex I can write (.+) bag and it matches the "faded blue bag" (it's a very common regex pattern),

but I can't do something like this in npeg.

const parser = peg("main", data: ...):
  main  <-   >*{ 'a' .. 'z', ' ' }  * " bags":
    echo "matched: ", $1
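For what it's worth, the usual PEG substitute for a lazy regex repetition is to forbid the terminator inside the loop and then match the terminator itself. A hedged sketch (my own example, not from the npeg docs):

```nim
import npeg

# Hedged sketch: +(1 - " bags") repeats "any character, as long as the
# terminator does not start here", which emulates a lazy (.+?) match.
let p = peg "main":
  main <- >+(1 - " bags") * " bags"

let r = p.match("faded blue bags")
doAssert r.ok
doAssert r.captures[0] == "faded blue"
```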

What does !1 do? (in the quickstart) [Question]

In the quickstart, there is a line:

pairs <- pair * *(',' * pair) * !1

I get that this means that pairs is a list of a pair object, followed by 0 or more pairs. But what is the !1 doing there? The only thing I can think of is that it is negating the literal 1.
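For context, a hedged sketch of my own (not from the quickstart): in npeg, 1 matches any single character (like regex .) and ! is the not-predicate, so !1 succeeds only where no character is left, anchoring the pattern at the end of the input.

```nim
import npeg

# Hedged sketch: !1 asserts "no character may follow here",
# i.e. end of input.
let p = peg "whole":
  whole <- "abc" * !1

doAssert p.match("abc").ok
doAssert not p.match("abcd").ok  # trailing input makes !1 fail
```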

Investigate precedence climbing

See if Npeg could be extended with precedence levels for rules, which could be used for implementing precedence climbing as part of a parser.

Performance and Complexity

Hello!
I was wondering what the complexity of npeg is. Does it depend on the grammar? It would be nice to have a section about it in the documentation.
For example, consider the example in the README.md that parses arithmetic expressions.

let parser = peg "line":
  exp      <- term   * *( ('+'|'-') * term)
  term     <- factor * *( ('*'|'/') * factor)
  factor   <- +{'0'..'9'} | ('(' * exp * ')')
  line     <- exp * !1

doAssert parser.match("3*(4+15)+2").ok

Is the match performed in linear time?

I'm considering building a simple C compiler and I need a parsing library because I'm too lazy to read parser theory to build a parser by hand. I have https://github.com/vanyle/lrparser, but it's an SLR parser and I'm not sure it's powerful enough to parse C (also I haven't implemented any error handling.) Obviously, linear parsing time is a requirement when dealing with tens of thousands of lines of code to parse.

Another concatenation symbol?

I like this module very much! Thanks for your effort.

However, when I tried to use it in my projects, I found it hard to read the PEG grammar, because concatenation and zero-or-more repetition use the same symbol *.

Is it possible to use another symbol for concatenation, for example: ~, or others?

Composability, imports, reuse

Npeg would greatly benefit from a mechanism providing reuse of rules using some kind of import mechanism to pull in pre-defined patterns from libraries.

With this mechanism Npeg could provide a library of predefined common patterns to be shipped with Npeg to simplify common operations like parsing of IPv4/IPv6 addresses, URIs, e-mail addresses.

I'm looking at an implementation but I seem to run into the limits of the Nim VM with what I'm trying to do. I'd appreciate any help on this from someone with more experience in macro land.

Question about code blocks and backtracking

You write:
"Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked."

I have been modeling a semi-complicated Prolog like language with lots of committed choices (and hence lots of places for backtracking). My global state is a stack which during code-block execution is popped and pushed eventually rolling up leaf-nodes into higher grains (I'm trying to model my nodes as variant objects like the JSON parser does).

My question is multi-fold: 1) given your statement above that code gets executed even in dead-end parsing is there any API for knowing this and possibly allowing code blocks to commit/rollback the global state? 2) do you have any suggestions of a good technique for this (the best I can think of is a stack of stacks where the inner stacks represent speculative parses that can be dropped/committed) 3) am I overthinking this?

thanks,
Daniel

Enhancing parser flexibility: improving type parsing with the push template

I have developed a lexer/parser, similar to the example "tests/lexparse.nim", that offers the ability to parse types beyond just strings.

Using the push template within the parser has proven to be incredibly useful for achieving this functionality. I made a modification to the codegen.nim file:

template push(`s`: string) {.used.}

I made the following change to expand its capabilities:

template push(`s`: string|`sType`) {.used.}

By implementing this modification, the parser now performs smoothly and accurately.

Use with threads and gcsafe code ?

Hi,

I am wondering how to use npeg with gcsafe code or threads. I declared my grammar as a constant, but I get errors like this when using the grammar's match procedure:

Warning: 'decode_encoded_words' is not GC-safe as it calls 'match' [GcUnsafe2]

Is it possible to use npeg with gcsafe code ?

[Question] Why is my npeg parser done, then starts backtracking and fails?

I have this line in my parser: tinput <- tblock * !1. After EOL it should check again for the !1 and end, but from the trace below you can see it backs up and starts again, then fails. Why is this happening and how can I fix it?

 75|  0| 94|                        |EOL            |chr "\r"                                |***
 79|  0| 94|                        |EOL            |chr "\n"                                |***
 82|  0| 94|                        |EOL            |chr "\r"                                |***
 85|  0| 94|                        |EOL            |any                                     |***
 89|  0| 94|                        |tcontent       |return                                  |***
 93|  0| 94|                        |tblock         |commit 91                               |***
 91|  0| 94|                        |tblock         |choice 94                               |**
 92|  0| 94|                        |tblock         | call tcontent:9                        |***
  9|  0| 94|                        |tcontent       |choice 60                               |***
 10|  0| 94|                        |Static         | capopen ckAction -> 94                 |****
 11|  0| 94|                        |Static         |  capopen ckVal -> 94                   |****
 12|  0| 94|                        |whitespace     |   span {'\t',' '}                      |****
 13|  0| 94|                        |Static         |  capclose ckVal -> 94                  |****
 14|  0| 94|                        |Static         | capclose ckAction -> 94                |****
 15|  0| 94|                        |Alpha          | set {'A'..'Z','a'..'z'}                |****
 95|  0| 94|                        |               |opFail (backtrack)                      |****
 60|  0| 94|                        |Static         |capopen ckAction -> 94                  |***
 61|  0| 94|                        |Static         | capopen ckVal -> 94                    |***
 62|  0| 94|                        |whitespace     |  span {'\t',' '}                       |***
 63|  0| 94|                        |Static         | capclose ckVal -> 94                   |***
 64|  0| 94|                        |Static         |capclose ckAction -> 94                 |***
 65|  0| 94|                        |Alnum          |set {'0'..'9','A'..'Z','a'..'z'}        |***
 95|  0| 94|                        |               |opFail (backtrack)                      |***
 94|  0| 94|                        |tblock         |return                                  |**
 48|  0| 94|                        |tcontent       | choice 59                              |**
 49|  0| 94|                        |tcontent       |  choice 57                             |***
 50|  0| 94|                        |Dedent         |   capopen ckAction -> 94               |****
 51|  0| 94|                        |Dedent         |    capopen ckVal -> 94                 |****
 52|  0| 94|                        |whitespace     |     span {'\t',' '}                    |****
 53|  0| 94|                        |Dedent         |    capclose ckVal -> 94                |****
 54|  0| 94|                        |Dedent         |   capclose ckAction -> 94              |****
 95|  0| 94|                        |               |opFail (backtrack)                      |****
 57|  0| 94|                        |tcontent       | commit 58                              |***
 58|  0| 94|                        |               |opFail (backtrack)                      |**
 60|  0| 51|Fooo:\n    Barr\n    Fo |Static         |capopen ckAction -> 51                  |*
 61|  0| 51|Fooo:\n    Barr\n    Fo |Static         | capopen ckVal -> 51                    |*
 62|  0| 51|Fooo:\n    Barr\n    Fo |whitespace     |  span {'\t',' '}                       |*
 63|  0| 51|Fooo:\n    Barr\n    Fo |Static         | capclose ckVal -> 51                   |*
 64|  0| 51|Fooo:\n    Barr\n    Fo |Static         |capclose ckAction -> 51                 |*
 65|  0| 51|Fooo:\n    Barr\n    Fo |Alnum          |set {'0'..'9','A'..'Z','a'..'z'}        |*
 66|  0| 52|ooo:\n    Barr\n    Foo |value          |capopen ckAction -> 52                  |*
 67|  0| 52|ooo:\n    Barr\n    Foo |value          | capopen ckVal -> 52                    |*
 68|  0| 52|ooo:\n    Barr\n    Foo |value          |  span {'0'..'9','A'..'Z','a'..'z'}     |*
 69|  0| 55|:\n    Barr\n    Foooo: |value          | capclose ckVal -> 55                   |*
 70|  0| 55|:\n    Barr\n    Foooo: |value          | chr ":"                                |*
 71|  0| 56|\n    Barr\n    Foooo:  |               | jump :73                               |*
 73|  0| 56|\n    Barr\n    Foooo:  |               |opFail (backtrack)                      |*
  4|  0| 51|Fooo:\n    Barr\n    Fo |tinput         |any                                     |
  5|  0| 52|ooo:\n    Barr\n    Foo |               |jump :7                                 |
  7|  0| 52|ooo:\n    Barr\n    Foo |               |opFail (error)                          |

[Question] Parsing comments with two characters limits

I am trying to parse comments delimited by (* and *).

I am trying the following without success:

import npeg

type
  Comment = string

let tmp = "(*This is a comment*)"  # Nested comments are not allowed

let parser = peg("comment", d: Comment):
  commentStart <- "(*"
  commentEnd <- "*)"
  #content <- *(1-commentEnd) 
  #comment <- commentStart * >content * commentEnd:
  comment <- "(*" * >(1-"*)") * "*)":
    d = $1


var comm:Comment
#assert parser.match(tmp, comm).ok
echo parser.match(tmp, comm).captures

#echo comm

The sort of comments I could expect are like:

(* This is an exmaple *)
(* This is 
 * an example
 *)
(********)
(* Ok   *)
(********)
(*
  enum_v1 : enum_type00;
*)
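For what it's worth, a hedged sketch of a working variant (my own, assuming nested comments are not needed): the rule above captures only a single character because >(1-"*)") has no repetition; wrapping the difference in * (zero or more) captures the whole comment body.

```nim
import npeg

# Hedged sketch: *(1 - "*)") consumes any character as long as the
# closing delimiter does not start at the current position.
let parser = peg("comment", d: string):
  comment <- "(*" * >*(1 - "*)") * "*)":
    d = $1

var body: string
doAssert parser.match("(*This is a comment*)", body).ok
doAssert body == "This is a comment"
```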

Captures not working inside of proc with generic arguments

The > symbol used in an expression inside a proc causes the following error: Error: wrong number of arguments. It seems like it is being recognized as the system operator rather than as part of the pattern.

Two examples of it not working:

import npeg

proc test(T: typedesc, str: string) = 
  let p = peg "start":
    start <- >"Test"
  var m = p.match(str)
  echo m.captures
test(string, "Test")

import npeg

proc test[T](obj: T, str: string) = 
  let p = peg "start":
    start <- >"Test"
  var m = p.match(str)
  echo m.captures
test(1, "Test")

If the '>' is removed, things work. But of course I cannot use the matched element. This is the only symbol that is causing me problems and the parser I've been trying to write is using almost all of them.

import npeg

proc test(T: typedesc, str: string) = 
  let p = peg "start":
    start <- "Test"
  var m = p.match(str)
  echo m.captures
test(string, "Test")

import npeg

proc test[T](obj: T, str: string) = 
  let p = peg "start":
    start <- "Test"
  var m = p.match(str)
  echo m.captures
test(1, "Test")

Input wanted: design and implement a proper API for code blocks

Here's a request to all NPeg users: I'm looking for some ideas and feedback for the proper design of the API for grammar code blocks. At this time the available functions feel ad hoc and scattered, and I don't dare to call this v1.0 before a better interface is in place.

Here is a list of things a code block capture can do, and how this works now:

  • Access captures: when a code block executes, all closed captures that were made inside the rule, and the code block capture itself are available to the code block as the variable capture which is of type seq[Capture]. The Capture object has a few public members that can be used from the code block: s is the captured string and si is the index into the original subject where the capture was matched. The capture strings s are also accessible through a sugar macro that rewrites the code block as $n:

    • capture[0] or $0 is the total capture of the rule
    • capture[1..] or $1 are the explicit string captures inside the rule
  • Force a rule to fail:

    • code blocks have access to a proc fail() which will cause the match to fail - this can be used to validate matches with Nim code, for example to check integer ranges, enum values, or symbols that were captured earlier.
code blocks have access to a proc validate(b: bool), which is nothing more than if not b: fail()

    For example:

uint8 <- >Digit[1..3]:
  let v = parseInt($1)
  validate v >= 0 and v <= 255
  • Perform matching in Nim code: the internal match state is implicitly available to the code block capture, which can (ab)use that to do matching in Nim code: the code block can see and adjust the match index, and indicate success or failure, for example:
  let p = peg foo:
    foo <- "ba" * die * "da"
    die <- 0:
      if si + 3 <= s.len and s[si..si+2] == "die":
        si += 3
      else:
        fail()

  echo p.match("badieda")

All of the above features were introduced in NPeg one at a time, growing the API without a proper design.

My question: does the above feel as kludgy as I think it is, or is it fine as it is now? Any ideas for improvement or what this should look like?

FR: Deferred code block captures

Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked.

It'd be cool to have some sort of API that defers code block execution until after parsing or at least beyond where backtracking is possible. Right now I'm using npeg for validation and to construct a tree, but I have to basically dispatch magic strings (using push) from the code block, and then re-parse those later to perform the checks I need. A simplified example would be: count how many times this rule has fired the code block. This count will currently include partial paths.

Streams / resuming

NPeg can at this time only match a complete subject string, which does not allow streaming. Find a way to make NPeg resumable without losing performance, which will then allow easy interfacing with Nim streams.

javascript support

First, thanks for this project is great. Trying to compile my code to javascript (just to check if that was possible) I got the following error:

$ nim js test_ex_01.nim
...
asciidoc/src/asciidoc/docheader/docheader.nim(18, 32) template/generic instantiation of `peg` from here
/home/jose/.nimble/pkgs/npeg-0.27.0/npeg/codegen.nim(429, 33) template/generic instantiation of `fn`gensym167` from here
/home/jose/.nimble/pkgs/npeg-0.27.0/npeg/codegen.nim(355, 22) Error: type mismatch
Expression: pop(e`gensym166.trace)
  [1] e`gensym166.trace: string

Expected one of (first mismatch at [position]):
[1] proc pop[A, B](t: OrderedTableRef[A, B]; key: A; val: var B): bool
[1] proc pop[A, B](t: TableRef[A, B]; key: A; val: var B): bool
[1] proc pop[A, B](t: var OrderedTable[A, B]; key: A; val: var B): bool
[1] proc pop[A, B](t: var Table[A, B]; key: A; val: var B): bool
[1] proc pop[A](t: CountTableRef[A]; key: A; val: var int): bool
[1] proc pop[A](t: var CountTable[A]; key: A; val: var int): bool
[1] proc pop[T](s: var seq[T]): T
[1] template pop[T](s: var Stack[T]): T

Is this something that could be fixed or simply npeg won't support javascript? (or maybe is something that I am doing wrong).

Failure to consume in particular case

I'm working on a SQL parser that converts a "WHEN" expression into reverse-polish notation. I'm coming across a problem where the system is ... possibly .. failing to consume the end-point numbers.

I must admit I'm new to PEG in general. So, if I'm making a newbie mistake, I do apologize.

Rather than go into great detail, I wrote a tiny program that demonstrates it:

import npeg
import strutils

let parser = peg("equation", p: seq[string]):

  equation <- allExpr
  
  plusMinus <- '+' | '-'

  number <- (integer)

  integer <- >+Digit:
    p.add "literal:" & $1

  whitespace <- +(Blank | Space | Cntrl)

  addExpr <- number * ?whitespace * >plusMinus * ?whitespace * (addExpr | number):
    p.add "op:" & $1

  allExpr <- (addExpr | number)

let testStr = "1 + 2 - 3 + 4 - 5"

var answer: seq[string] = @[]

let successFlag = parser.match(testStr, answer).ok

echo "test = [", testStr, "]"

echo "success = ", $successFlag

echo "answer = "
for item in answer:
  echo "  ", $item

The result I'm getting in answer is:

answer = 
  literal:1
  literal:2
  literal:3
  literal:4
  literal:5
  literal:5
  op:-
  op:+
  op:-
  op:+

but what I'd expect is:

answer = 
  literal:1
  literal:2
  literal:3
  literal:4
  literal:5
  op:-
  op:+
  op:-
  op:+

So, the section that is parsing a number is running twice on the same "5" text. Am I missing something or is there a bug?

Consolidate error methods

Npeg currently can fail a parse in two ways:

  • Normal failure when the subject does not match the grammar. In this case the ok field of the MatchResult will be false, and the offset of the failure is available in the MatchResult.

  • A user-defined failure: the E() pattern, or fail() in a code block, throws an NPegException with a user-defined message.

The above requires the caller to check for errors in two ways, which is not ideal.

I propose deprecating the exceptions and always returning errors through the MatchResult object instead.

Memoization & Bryan Ford

Huge thanks for nPEG, love it a lot, even its quirky syntax, as it gets highlighted, unlike common PEG. Love your other work as well.

PEG's creator, Bryan Ford, has a page which (besides other things) lists PEG implementations in various languages. I suggest asking him to add Nim and both of its PEG implementations. His email and twitter are at the index.

In my opinion it would also be a good opportunity to pick his brain on implementing some memoization for your amazing nPEG. I know that contacting people is not everyone's favorite thing, but here are some reasons for it:

  • He is a relatively young professor: approachable and curious.
  • He is the author of packrat, a practical memoization technique. His Master's thesis describes it and contains a Haskell implementation. The thesis predates PEG by 2 years, so it uses TDPL instead, but it's still helpful for getting the practical gist of his idea. The mentioned info page has 6 more PDFs on packrat alone, plus more on related topics. It also has links to packrat implementations in Java, Python, Haskell, Scheme, and Ruby, for inspiration and borrowing.
  • Even just exchanging hellos with the inventor would be cool, and could potentially be inspiring and helpful.

You might be busy or have other plans, but if there is time to improve nPEG in terms of memoization, contacting him can do no harm.

In any case, thank you, and best wishes.

push() captures broken?

From EyeCon on IRC:

import std/[strformat, strutils]
import npeg

let tmp = "Z1Z + A1A / B1B + -C1C"

let parser = peg "exp":
  code <- >(?'-' * Alpha * Alnum[2..4]):
    push("Cod " & $1)
                                       
  parenExp <- ( "(" * exp * ")" ) ^ 0          
                                                 
  prefix <- code | parenExp                    
                                               
  infix <- >{'/'} * exp ^ 1 | >{'+'} * exp ^ 2: 
    push("Opr " & $1)                           
                                                
  exp <- prefix * *infix                        
                                                
echo parser.match(tmp.replace(" ", "")).captures

Prevent code blocks captures from executing

I want to make a pattern that uses another pattern but doesn't execute that pattern's code block capture.

import npeg

# This parser parses words and no words
# Words are one or more alpha characters
# And no words are words with a dash before them
const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- noWord | word
  noWord <- '-' * word:
    data.add($0)
  word <- +Alpha:
    data.add($0)

var data: seq[string]
assert parser.match("a -b", data).ok
assert data == @["a", "b", "-b"] # I want it to be @["a", "-b"]

There could be, perhaps, an operator that allowed it:

%P # matches P without executing its code block capture
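In the meantime, a hedged workaround that works with npeg as it is: factor the bare pattern into a rule without a code block, and attach code blocks only to the rules where the capture should actually fire (wordChars is a name I made up for this sketch).

```nim
import npeg

# Hedged sketch: wordChars has no code block, so noWord can reuse it
# without triggering word's capture action.
const parser = peg("nodes", data: seq[string]):
  wordChars <- +Alpha
  nodes <- *(node * (' ' | !1))
  node <- noWord | word
  noWord <- '-' * wordChars:
    data.add($0)
  word <- wordChars:
    data.add($0)

var data: seq[string]
doAssert parser.match("a -b", data).ok
doAssert data == @["a", "-b"]
```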


AssertionDefect when matching simple example

Hi,

I found npeg just now and tried it out with the following example.

import npeg, strutils, tables

type Dict = Table[string, int]

let parser = peg("program", d: Dict):
  program <- *statement
  statement <- var_decl
  word <- +Alpha
  varC <- "var"
  semi <- ';'
  var_decl <- varC * >word * semi

var words: Table[string, int]
discard parser.match("var hello;", words).ok
echo words

But it doesn't match - what am I doing wrong?
Thanks for any hint!

Pass local state to parser code blocks

Any variables used in code blocks need to be declared before the parser definition, which is not ideal. It would be nice to have a way to pass local state around which can be used in the code blocks.
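For what it's worth, npeg's userdata parameter (used elsewhere in these issues, e.g. peg("program", d: Dict)) appears to cover part of this. A hedged sketch:

```nim
import npeg

# Hedged sketch: the second argument to peg() declares a variable that
# is passed to match() and is visible inside every code block, avoiding
# module-level state.
let p = peg("words", found: seq[string]):
  word <- >+Alpha:
    found.add $1
  words <- word * *(' ' * word)

var found: seq[string]
doAssert p.match("one two three", found).ok
doAssert found == @["one", "two", "three"]
```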

Implement slicing operators for captures

Someone commented on an old YouTube video of mine showing Advent of Code solutions using NPeg. After 0e297ac the Captures type is no longer just an alias for a sequence, but rather an object. The object has [] defined for it, but not for slices which I used in my solution. The fix is simple enough, but for backwards compatibility and future usability it would be nice to have.

Investigate parallelization

Would it be possible to implement some kind of parallelization when parsing ordered choices or optionals?

NPeg's IR language should not be too hard to parallelize: when encountering an opChoice, both branches of the program can be executed in parallel. When the primary branch matches, the secondary branch is aborted, but when the primary branch fails, the secondary can continue parsing from there.

Challenges:

  • Keeping track of captures: each thread needs to keep track of its own capture stack.
  • Code block captures?

PEG section in the Nim book

Dear Sir,

I have just shipped the first draft of the PEG section for the Nim book, see

http://ssalewski.de/nimprogramming.html#_parsing_expression_grammars

That section has not yet been checked by native English proofreaders, so there may still be some grammar or spelling errors. Please let me know if my explanations are wrong or still unclear.

I have the hope that this section will make it a bit easier for kids and beginners to start with PEGs.

Some notes to your NPeg README text:

data up to the match; if his is

Plain typo, should be "this".

NPeg provides various prefix, infix and suffix operators.

I still wonder if there are really suffix operators.

even if the match is part of a pattern that fails and is later backtracked.

That full sentence is displayed in italics -- is that intended?

The syntax for defining a generic grammar is as follows:

peg(name, identifier: Type)

The phrase "generic grammar" sounds strange here, as you are not talking about grammar, but about passing an additional variable to the peg() macro.

Best regards, Stefan Salewski

Attach line information to generated functions

NPeg doesn't appear to attach line information to its generated code. For instance, when the backtrace depth has been exceeded, a stack trace like the following will occur:

.\sample.nim(17) sample
..\npeg\src\npeg.nim(134) match
..\npeg\src\npeg\stack.nim(31) fn`gensym15613
..\npeg\src\npeg\stack.nim(27) grow
Error: unhandled exception: backtrace stack overflow, depth>1024 [NPegException]

Ideally, the traceback would indicate (either directly, or via the generated function name) what rule caused the error.

Can You Add Some Example Binary Structures?

I realize that you specifically indicate that npeg was extended over time and that parsing binary data structures is a bit experimental at this point. However, it would be wonderful to have some examples of npeg-parsed binary structures, just to help potential users of the library wrap their heads around how it should work for that use case.

I could be wrong, but I think a heavily commented working parser for Ethernet frames and IPv4 packets would probably be enough of a reference for a large number of binary use cases to help people get the ball rolling.

This is a very cool project. Writing parsers by hand is indeed quite a pain.

A way to let every peg rule to produce its own object...

Here is some code to demonstrate my problem.

import npeg, strutils

const
  testData = "10:10;12:00;22:40" # last item has no ';'

proc test1() =
  var list: seq[string]
  let peg = peg "start":
    start <- +statement
    statement <- >time * ';':
      list.add $1
    time <- Digit[1..2] * ':' * Digit[1..2]

  discard peg.match(testData)
  echo list

Output: @["10:10", "12:00"]

The output is what we want. However, we'd like objects instead of strings. So...

type
  Time = object
    hour: int
    min: int

proc test2() =
  var list: seq[Time]
  let peg = peg "start":
    start <- +statement
    statement <- time * ';'
    time <- >Digit[1..2] * ':' * >Digit[1..2]:
      let (hour, min) = (parseInt($1), parseInt($2))
      if hour in 0..23 and min in 0..59:
        list.add Time(hour: hour, min: min)

  discard peg.match(testData)
  echo list

Output: @[(hour: 10, min: 10), (hour: 12, min: 0), (hour: 22, min: 40)]
We get seq[Time] as output; however, code block captures are always executed, even when the parser state is rolled back afterwards. The result is wrong.

proc test3() =
  var list: seq[Time]
  let peg = peg "start":
    start <- +statement
    statement <- time * ';':
      for i in countup(1, capture.len-1, step=2):
        let (hour, min) = (capture[i].s.parseInt, capture[i+1].s.parseInt)
        if hour in 0..23 and min in 0..59:
          list.add Time(hour: hour, min: min)

    time <- >Digit[1..2] * ':' * >Digit[1..2]

  discard peg.match(testData)
  echo list 

Output: @[(hour: 10, min: 10), (hour: 12, min: 0)]

Finally, we get what we want. However, I think this code is bad because we produce the Time object outside of the time rule.
If the statement rule were statement <- (time | date | something) * ';', the code would get really ugly.

The way I resolve the problem for now is:

import marshal

proc test4() =
  let peg = peg "start":
    start <- +statement:
      var list: seq[Time]
      for i in 1..<capture.len:
        list.add to[Time](capture[i].s)
      push($$list)

    statement <- time * ';'

    time <- >Digit[1..2] * ':' * >Digit[1..2]:
      let (hour, min) = (parseInt($1), parseInt($2))
      if hour in 0..23 and min in 0..59:
        push($$Time(hour: hour, min: min))

  var list = to[seq[Time]](peg.match(testData).captures[0])
  echo list

Output: @[(hour: 10, min: 10), (hour: 12, min: 0)]

Ok, it works fine, and the code is clear: each rule produces its own object. But it will be very slow due to serialization/deserialization, and marshal does not work at compile time (https://github.com/treeform/jsony does).

In the end, is there a better/smarter way to do this?
Sorry for my bad English.
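For what it's worth, one common pattern (a sketch, assuming no backtracking across an already-committed statement in this grammar) is to keep a plain Nim stack of objects next to the parser: inner rules push, outer rules pop and commit. No serialization is involved:

```nim
import npeg, strutils

type Time = object
  hour, min: int

var stack: seq[Time]   # scratch stack shared by the rules
var list: seq[Time]    # committed results

let p = peg "start":
  start <- +statement
  statement <- time * ';':
    # commit only when the whole statement (incl. ';') matched
    list.add stack.pop()
  time <- >Digit[1..2] * ':' * >Digit[1..2]:
    stack.add Time(hour: parseInt($1), min: parseInt($2))

discard p.match("10:10;12:00;22:40")
echo list   # the trailing "22:40" is pushed but never committed
```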

AST building, another take

Currently, AST building with NPeg is quite difficult and bug-prone. You have to maintain a stack of nodes and use a bunch of relatively inefficient workarounds to build a nice, inspectable, traversable abstract syntax tree.

My proposal is to allow the user to specify their own Node type, and add primitives for building abstract syntax trees. Here's my idea for the API:

import std/strutils

type
  NodeKind = enum
    nkNumber, nkIdent, nkCall

  Node = ref object
    case kind: NodeKind
    of nkNumber: number: float
    of nkIdent: ident: string
    of nkCall:
      fn: Node
      args: seq[Node]

let p = peg(expr, ast = Node):
  ident <- {'a'..'z', '_'} * *{'a'..'z', '_', '0'..'9'}:
    # perhaps we could assign to result but i don't know
    # whether NPeg reserves it or not
    build Node(kind: nkIdent, ident: $0)
  number <- +Digit:
    build Node(kind: nkNumber, number: parseFloat($0))
  # AST captures would be done with A(), referencing
  # a capture inside of the code block would be done with node(n)
  # NPeg could detect whether the capture would result in
  # multiple nodes and expose nodes(n) which returns a
  # seq[Node] instead of a single Node
  call <- A(ident) * '(' * A(expr * *(',' * expr)) * ')':
    build Node(kind: nkCall, fn: node(1), args: nodes(2))
  expr <- call | ident | number

Usage of node(n) on a sequence capture would result in either an error or the first node being returned. Usage of nodes(n) on a singular capture would result in an error or a sequence with only that node being returned.

Prettier graphs

I'd love to be able to draw graphs like this from a grammar:

Graph

Or ascii:

                ╭─(back)◂─╮
  ──▸(term)─╭───╰─────────╯──╮──▸
            ╰─(one)◂─╮─(+)◂──╯
                     ╰─(-)◂──╯

Backreferences

Being able to use backreferences in match patterns would be nice, for cases where something like bash heredocs is desired.

Captures only work when input is consumed

I think this behavior is somewhat surprising.

let p = peg o:
  o <- >&1:
    echo len($1)

doAssert p.match("a").ok
0

Possible solutions:

  • make it work
  • produce a warning/error when > is used along with &
  • mention it in the documentation

Help - how to handle captures and optionals

I am having two difficulties, and I am probably not using npeg properly here.

First issue: capturing optional elements

title <- >( whatever)
attributes <- >(whatever2)
delimitedBlocks <- ?title * >?attributes * >body:
   mycode

Given the above situation, $1 could be title, attributes, or body. How can I check which one it is? If I do echo capture in the code block, the field name is empty.

Second issue: creating objects

I want to capture the above data in a ref object; where should I create it?

var blk:Block
new(blk)

I'd like to create the object in one place in order to do something like:

let parserBlocks* = peg("delimitedBlocks", myBlk):
  title <- >( whatever):
    blk.title = $1
  attributes <- >(whatever2):
    blk.attributes = $2
  delimitedBlocks <- ?title * >?attributes * >body:
    blk.body 
    myBlk.blocks &= blk

Get line/position number of capture

If possible, it would be nice to get some meta-information about a capture, such as the position it is located at in the passed-in input.
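If I read the sources right, the Capture objects handed to a code block already carry a start offset (an si field); treat that as an unverified assumption. Given any byte offset, line and column can be recovered from the input in plain Nim:

```nim
# Given a byte offset into the input (however obtained), recover
# 1-based line and column by scanning the prefix.
proc lineCol(input: string; offset: int): (int, int) =
  var line, col = 1
  for i in 0 ..< min(offset, input.len):
    if input[i] == '\n':
      inc line
      col = 1
    else:
      inc col
  (line, col)

doAssert lineCol("ab\ncd", 4) == (2, 2)
```

Exposing the offset directly on captures would avoid users having to thread it through themselves.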

Print return stack on overflow

When NPeg throws a depth error due to an overflow in the return stack, it would be nice to have the exception contain a printout of the stack.

Something like

Error: unhandled exception: return stack overflow, depth>1024
Return stack:
sample.nim(10)     rule_name_one
keywords.nim(100)  rule_name_two
...

Rather than just

Error: unhandled exception: return stack overflow, depth>1024

Parsing indentation-based grammars

Many languages, including Nim, use indentation to represent a tree structure. Would it be possible to add a mechanism for parsing such languages into NPEG? Indentation-based grammars are not strictly context-free, but it's quite easy to convert them into a context-free grammar by introducing special โ€œindentโ€ and โ€œdedentโ€ tokens, representing a change in the indentation level, which can then work like brackets.
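The indent/dedent preprocessing step can be sketched outside the grammar. Here is a hypothetical pre-lexer (the names are made up) that turns indentation changes into bracket-like tokens which a context-free PEG could then consume:

```nim
import strutils

# Turn indentation changes into "(indent)" / "(dedent)" markers
# that a grammar can treat like opening and closing brackets.
proc indentToTokens(src: string): seq[string] =
  var levels = @[0]                    # stack of open indent widths
  for line in src.splitLines:
    if line.strip.len == 0: continue   # skip blank lines
    let indent = line.len - line.strip(trailing = false).len
    if indent > levels[^1]:
      levels.add indent
      result.add "(indent)"
    while indent < levels[^1]:
      discard levels.pop()
      result.add "(dedent)"
    result.add line.strip
  while levels.len > 1:                # close any open blocks at EOF
    discard levels.pop()
    result.add "(dedent)"

doAssert indentToTokens("a\n  b\nc") ==
  @["a", "(indent)", "b", "(dedent)", "c"]
```

With such a token stream, NPeg's ability to match over sequences of user-defined tokens could then handle the now context-free structure.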
