
teo-tsirpanis / farkle


LALR parser combinators for C# and F#.

Home Page: https://teo-tsirpanis.github.io/Farkle/

License: MIT License

F# 48.05% PowerShell 0.14% C# 51.69% CSS 0.12%
fsharp parser lalr parser-combinators hacktoberfest


farkle's People

Contributors

dependabot[bot], forki, teo-tsirpanis


farkle's Issues

Remodel the grammar domain

The new domain model would be much more type-safe, mainly with regard to the way symbols are stored.

Currently, there is a single type for all kinds of symbols. However, not every kind of symbol can actually be used everywhere. As an example, under the current model a production can have a Noise symbol, which is absurd.
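The remodeling described above can be illustrated with a small, hypothetical sketch (in Python, not Farkle's actual API): giving each symbol kind its own type lets a production declare exactly which kinds it may contain, so "a Noise symbol in a production" becomes a type error instead of a runtime possibility.

```python
# Hypothetical sketch (not Farkle's data model): distinct types per symbol
# kind, so only terminals and nonterminals can appear in a production body.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Terminal:
    name: str

@dataclass(frozen=True)
class Nonterminal:
    name: str

@dataclass(frozen=True)
class Noise:  # e.g. whitespace or comments; never valid inside a production
    name: str

# Only these two kinds are allowed in a production body.
ProductionSymbol = Union[Terminal, Nonterminal]

@dataclass(frozen=True)
class Production:
    head: Nonterminal
    body: tuple[ProductionSymbol, ...]

expr = Nonterminal("Expr")
plus = Terminal("+")
prod = Production(expr, (expr, plus, expr))  # fine
# Production(expr, (Noise("Comment"),))     # rejected by a type checker
```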

Move the project's website generator away from FSharp.Formatting

FSharp.Formatting

  • is cluttering the project's Paket dependencies
  • has a bug that cripples Farkle's code samples
  • is inactive. In fact, the last opened issue is the one I mentioned previously, issue #478. I opened it in August and have still received no reply.

The tutorials and quick start guide will be hosted on a site made by a different static site generator. Some options considered are:

  • Docusaurus
  • Docz
  • Whatever FAKE is using (unless it is FSharp.Formatting of course)
  • Stick with FSharp.Formatting, but try to fix issue #478
  • Something else?

These options do not offer tooltips inside the code snippets, which is admittedly quite a killer feature. However, Farkle's API does not make it that hard to realize what an object's type will be.

It was also decided that the more detailed API documentation will not be hosted on the project's site. It already exists in the source code comments and in the IntelliSense offered by IDEs, so everyone who uses the library will stumble upon it when they need it. Documentation of every code element is something developers usually look at after they have installed the package in question, or at least looked at the source code.

BNF grammar output

Having a BNF grammar output would allow grammars written with Farkle to be used in other tools.
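As an illustrative sketch of the feature (in Python, not Farkle code), emitting BNF from an in-memory grammar representation is essentially a pretty-printing pass over the productions:

```python
# Illustrative sketch (not part of Farkle): render a simple in-memory grammar
# as BNF text so it can be consumed by other tools.
def to_bnf(productions):
    """productions: list of (head, [bodies]); each body is a list of symbol
    strings, where nonterminals are already wrapped in <>."""
    lines = []
    for head, bodies in productions:
        alts = [" ".join(body) if body else "<empty>" for body in bodies]
        lines.append(f"<{head}> ::= " + " | ".join(alts))
    return "\n".join(lines)

grammar = [
    ("Expr", [["<Expr>", "'+'", "<Term>"], ["<Term>"]]),
    ("Term", [["number"]]),
]
print(to_bnf(grammar))
# <Expr> ::= <Expr> '+' <Term> | <Term>
# <Term> ::= number
```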

Lexer Ambiguity Resolution Documentation

As we discussed on Discord, the lexer generator resolves ambiguity in favor of fixed-size regexes when two regexes conflict. This resolution is independent of declaration order. While this is not inherently problematic, the behavior is undocumented; it would be good to document it.
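The documented-to-be behavior can be modeled with a tiny sketch (Python, not Farkle's implementation; the rule set is made up): when two token patterns match the same lexeme, the fixed-size one wins regardless of the order it was declared in.

```python
import re

# Toy model of the resolution rule (not Farkle's lexer generator): among the
# rules that match a lexeme, fixed-size rules beat variable-size ones,
# independent of declaration order.
def pick_token(text, rules):
    """rules: list of (name, pattern, is_fixed_size) in declaration order."""
    matches = [(name, fixed) for name, pat, fixed in rules
               if re.fullmatch(pat, text)]
    if not matches:
        return None
    # Sort fixed-size matches first; declaration order is otherwise preserved.
    matches.sort(key=lambda m: not m[1])
    return matches[0][0]

rules = [
    ("Identifier", r"[a-z]+", False),  # variable size, declared first
    ("KeywordIf",  r"if",     True),   # fixed size, declared second
]
print(pick_token("if", rules))   # KeywordIf: fixed size wins despite order
print(pick_token("ifx", rules))  # Identifier
```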

Farkle 7 parser API work

(well, tasklists are in private beta and only for organizations πŸ₯²)

  • #89
  • Tokenizer API design
  • Implementation
  • Benchmarks

Precise syntax error reporting

Context: https://discord.com/channels/944706562948231188/944706563401211946/1226641177554456626 – thanks @notYuriy for bringing this up!

Due to the state merging performed by the LALR algorithm, the expected symbols list may contain errors. To present an accurate list to the user, we can attempt to reduce on each symbol and include only the symbols for which the reduction succeeded.

This resembles GNU Bison's LAC (Lookahead Correction) feature, which I would like to eventually add to Farkle as well. As described in Bison's documentation, LAC's exploratory reductions have no memory overhead, and I am not concerned with any time overhead, since the work scoped in this issue will happen only in error cases.

There will be some complications with default reductions, once that gets added in Farkle, but I believe it will be manageable.

One final question is whether it should be enabled by default and, if so, whether we want an option to switch back to the previous behavior. At the moment I am inclined to say yes and no respectively.
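The exploratory-reduction idea described above can be sketched as follows (Python, not Farkle's code; the parsing tables are made up for illustration): each nominally-expected symbol is probed against a copy of the state stack, and only the symbols the automaton can actually accept are reported.

```python
# Toy model of LAC-style filtering (hypothetical tables, not Farkle's code):
# probe each candidate symbol on a copy of the state stack; keep only the
# symbols for which the automaton reaches a shift.
ACTION = {  # (state, symbol) -> ("shift", next_state); this toy has no reduces
    (0, "num"): ("shift", 1),
    (1, "+"):   ("shift", 2),
    (2, "num"): ("shift", 1),
}
GOTO = {}   # (state, nonterminal) -> state; empty because the toy has no reduces

def can_accept(stack, symbol):
    stack = list(stack)  # exploratory copy, so probing has no side effects
    while True:
        action = ACTION.get((stack[-1], symbol))
        if action is None:
            return False
        if action[0] == "shift":
            return True
        # A reduce pops states and follows GOTO, then the loop tries again.
        _, pop_count, goto_symbol = action
        del stack[len(stack) - pop_count:]
        stack.append(GOTO[(stack[-1], goto_symbol)])

def expected_symbols(stack, candidates):
    return [s for s in candidates if can_accept(stack, s)]

# After shifting a number (states [0, 1]), only "+" is truly acceptable.
print(expected_symbols([0, 1], ["num", "+"]))  # ['+']
```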

Optimize tokenizing character groups.

Character groups are the most common kind of groups in Farkle (the other kind, token groups, originates from GOLD Parser; they can be represented in Farkle's grammars but cannot be created in the builder as of Farkle 6). When inside a character group, the tokenizer tries to match a token with the DFA, and if it does not find anything of interest (something that ends the group or starts a nested group), it consumes one character from the input before trying again. This is very inefficient because we invoke the DFA once per character.

Farkle 6 tried to improve this by leveraging MemoryExtensions.IndexOf(Any) to quickly jump to the start of a possibly interesting token. While it avoids repeated invocations of the DFA, it is unfortunately incorrect because it assumes that groups end with either a new line, or a literal of the name of their ending symbol. While both GOLD Parser's and Farkle 6's groups only end with these, their grammar files can encode groups that don't. This is an assumption we don't want to carry over to Farkle 7.

Here's what we will do instead. For each group, we will use a special DFA that matches only its end symbol and the start symbols of its nested groups, if any. If we have, say, a group that starts with {, ends with } and supports nesting with itself, we would emit a DFA that matches the regex .*[{}] and invoke that DFA, consuming all the characters it reads. This approach is definitely faster than the correct naive way.

One open question is how to encode these extra DFAs. The way that requires the least changes to the grammar format is to append them to the grammar's main DFA and introduce a new structure that maps groups to the DFA state to start from when inside them. Token groups will start from state 0, using the main DFA. For some additional speed, we can specify that if a group uses a non-zero initial DFA state, the tokenizer does not have to check whether a group nesting is actually valid.

This is too compelling to not land in Farkle 7.0, but needs the builder first.
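The proposed scanning strategy can be sketched in a few lines (Python, not Farkle's implementation): instead of invoking the full DFA once per character, skip to the next character of interest in one linear scan, which is conceptually what the .*[{}] DFA does for a self-nesting { } group.

```python
# Illustrative sketch of the optimization (not Farkle's code): inside a group
# that starts with '{', ends with '}' and nests with itself, jump over
# uninteresting characters in one scan per interesting character.
def read_group(text, pos):
    """pos points just past the opening '{'; returns the position past the
    matching '}'."""
    depth = 1
    interesting = {"{", "}"}
    while depth > 0:
        # One linear scan to the next interesting character.
        while pos < len(text) and text[pos] not in interesting:
            pos += 1
        if pos == len(text):
            raise ValueError("unterminated group")
        depth += 1 if text[pos] == "{" else -1
        pos += 1
    return pos

s = "{a {nested} b} rest"
end = read_group(s, 1)
print(s[:end])  # {a {nested} b}
```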

Move from AppVeyor to GitHub Actions.

Since its beginning, Farkle has used AppVeyor for CI, and it has more than once been a source of frustration (it's broken as of writing this because they did not update the .NET SDK).

We should move to GitHub Actions; it seems to be one of the best CI solutions available at the moment, and it would be a useful skill to learn.

The 6.x branch can stay on AppVeyor.

Create an example of how to parse an indentation sensitive grammar.

I just stumbled across Farkle while looking for resources on parsing indentation-sensitive grammars using FParsec. The library/tooling looks intriguing, but it would be great to have some guidance/examples on how to support offside rules like those of F#, Elm, or Python.

For reference the language I am trying to create a parser for is Elm.

How to match EOF?

I'm trying to write a parser for the data format described in Day 4 of the Advent of Code 2020 problems. The following code works OK on the contents of a file which ends in 2 newline characters.

namespace AdventOfCode.Core

module Day4 =

    open Farkle
    open Farkle.Builder
    open Farkle.Builder.Regex

    type Field =
        | BirthYear
        | IssueYear
        | ExpirationYear
        | Height
        | HairColor
        | EyeColor
        | PassportID
        | CountryID

    type KeyValuePair = { Key : Field; Value : string; }

    type Parser () = 

        static let field = 
            regexString "byr|iyr|eyr|hgt|hcl|ecl|pid|cid" 
            |> terminal "Field" (T(fun _ data -> match data.ToString() with
                                                 | "byr" -> BirthYear
                                                 | "iyr" -> IssueYear
                                                 | "eyr" -> ExpirationYear
                                                 | "hgt" -> Height
                                                 | "hcl" -> HairColor
                                                 | "ecl" -> EyeColor
                                                 | "pid" -> PassportID
                                                 | "cid" -> CountryID
                                                 | _ -> failwith "Invalid field code"))

        static let value = regexString "[abcdefghijklmnopqrstuvwxyz0123456789#]+" 
                                        |> terminal "Value" (T(fun _ data -> data.ToString()))

        static let keyValuePair = "KeyValuePair" ||= [
            !@ field .>> ":" .>>. value => (fun k v -> { Key = k; Value = v })
        ]

        static let keyValuePairs = "KeyValuePairs" ||= [
            !@ (many1 keyValuePair) .>> "\r\n\r\n" => id
        ]

        static member internal keyValuePairParser = RuntimeFarkle.build keyValuePair

        static member internal keyValuePairsParser = RuntimeFarkle.build keyValuePairs

        static member internal parse input parser = 
            match RuntimeFarkle.parseString parser input with
            | Ok result -> result
            | Error err -> failwith (err.ToString())

However, I also want to recognise EOF as a separator, so that I don't have to put 2 newline characters at the end of the file for Farkle to parse it. FParsec seems to have a built-in parser for EOF, but I can't figure out how to do it with Farkle. How can I write a parser that captures EOF?

The tokenizer erroneously reports an EOF in certain cases

Description

The tokenizer has a problem. As the DFA reads the input stream, if it reaches the end without ever having reached an accepting state, it reports an EOF instead of a lexical error.

Repro steps

  1. Write the following grammar:
"Start Symbol" = <S>
Number = {Digit}{Digit}
<S> ::= Number | <>

As you can see, the language of the grammar is just two digits, or the empty string.

  2. Using Farkle, parse a string containing a single digit, like 3.

Expected behavior

It is obvious from the grammar that a single-digit string is not part of the grammar's language. 3 does not match the regular expression of the Number terminal. As a consequence, a lexical error should be raised.

Actual behavior

When the tokenizer encounters 3, it advances to the next DFA state. After the digit it reaches the end, but because it never reached an accepting state (a single digit does not suffice to complete the Number terminal), it reports an EOF, even though the input had clearly not ended. Because at the beginning the parser accepts either a Number terminal or an EOF, it accepts the latter when it should have accepted the former, and parsing succeeds when it should have failed.
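A minimal model of the fix (Python, not Farkle's implementation) makes the distinction explicit: a genuine EOF can only be reported when no characters were consumed; running out of input mid-token without an accepting state is a lexical error.

```python
# Toy tokenizer loop illustrating the fix (not Farkle's code): only a clean
# start-of-token end of input is an EOF; otherwise it is a lexical error.
def next_token(dfa, accept, text, pos):
    state, consumed, last_accept = 0, 0, None
    while pos + consumed < len(text):
        nxt = dfa.get((state, text[pos + consumed]))
        if nxt is None:
            break
        state = nxt
        consumed += 1
        if state in accept:
            last_accept = (accept[state], consumed)
    if last_accept is not None:
        return last_accept
    if consumed == 0 and pos == len(text):
        return ("EOF", 0)
    # The buggy version reported EOF here as well.
    raise ValueError(f"lexical error at {pos}")

# Number = two digits; '3' alone must be a lexical error, not EOF.
dfa = {(0, d): 1 for d in "0123456789"} | {(1, d): 2 for d in "0123456789"}
accept = {2: "Number"}
print(next_token(dfa, accept, "34", 0))  # ('Number', 2)
```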

Known workarounds

None.

Additional information

This bug is deemed serious enough for its fix to be backported to version 4.0.0.

Add proper logging support.

The event callbacks accepted by RuntimeFarkle are not an option; I'm talking about integration with a real logging framework. This will allow gathering much deeper insights into Farkle's internal workings. Furthermore, it might even pave the way for localized log messages! This framework has to be fast (i.e. minimal overhead when logging is disabled) and have an easy-to-use API.

I am leaning towards either logary, or Serilog, or maybe something else.

Announcing Farkle's Discord server.

Yesterday Farkle reached 50 stars. I would like to thank each of these fifty people who showed interest in Farkle. It might look like a small number but it means a lot to me.

One significant problem of Farkle is its lack of direct user feedback and community. Because I don't know how Farkle is used, I cannot meet developers' needs (or even know what they are) and the impact of any breaking changes I make is unknown.

That's why I had the idea to create a Discord server for Farkle. You will be able to engage more directly with Farkle's maintainers (at the moment only me) and other users, show off the cool things you have built with Farkle, and be notified of any announcements (such as a new release or imminent breaking changes). As I am starting work on Farkle 7.0.0, a significant reimagining of how Farkle internally works (albeit non-breaking for the majority of use cases), if you are using Farkle, your presence will be very important.

The link is https://discord.gg/mYzXu5Zt8J. I can't wait to see you there!

New CLI command: farkle describe

Similar to GOLD Parser's website generator, farkle describe will generate a documentation file for a grammar.

Features

  • It must support Markdown (it's the lingua franca of documentation) and be very readable by humans.

    • The most architecturally beautiful way to implement it is using Scriban. Of course, we would allow users to provide their own templates.
      • We have to extend the Scriban ScriptObjects to support LALR and DFA states. It might involve some boilerplate, because of Scriban's lack of immutable collection support.
  • PDF is immediately more readable and portable than Markdown.

    • However Scriban cannot create a PDF, so we have to rely on a code-based solution.
    • Maybe PDF is not actually needed; people using Farkle are supposed to be knowledgeable enough to know what Markdown is.
  • A stretch goal could have been to visualize a DFA/LALR state graph.

    • We can use something like Graphviz.
    • Or create interactive HTML files with a JS library (needs research)!

Invite you to try FslexFsyacc

Compiler/Parser programming enthusiasts, I am very pleased to invite you to try a new series of lex/yacc tools.

Fslex is a code generator that uses regular expression syntax as rules to generate a function which divides the input token sequence into groups at a higher level. Fslex is often used to remove redundant delimiters, add omitted delimiters or other syntax components, and so on. It is also used to determine context somewhere in the stream.

Fsyacc is a code generator that uses BNF productions and precedences as rules to generate a function which resolves the input token sequence into an abstract syntax tree.

Why use the package?

  • You can use your existing handwritten tokenizer.

  • This package uses standard lex/yacc syntax to minimize your learning costs.

  • fslex/fsyacc each generate an independent, side-effect-free function that can be called flexibly.

  • The method of generating code is simple, with no command lines and no need to configure projects.

  • The resulting code is data-driven and highly readable.

  • Flexibly compose tokenizers, regular expressions, and BNF techniques.

The project is hosted on GitHub where you can report issues, fork the project and submit pull requests.

Create terminals from string regular expressions

One of Farkle's lacking features in comparison with FsLexYacc and GOLD Parser is how terminals are created. Terminals in Farkle consist of a regular expression, which is hand-written in code. Regular expressions are composed from smaller ones, using operations like concatenation, alternation and the Kleene star, plus others (like repetition) that are derived from the previous three.

Composing regexes is a powerful model, akin to composable designtime Farkles, but it requires lots of code even for the simplest of terminals. FsLexYacc and GOLD Parser let the user define regexes with a simple language they already know.

Therefore, we will propose the following two functions:

val terminalRegex: name: string -> regex: string -> transformer: T<'T> -> DesigntimeFarkle<'T>

val terminalRegexU: name: string -> regex: string -> DesigntimeFarkle

Requirements

  • The language will incorporate the best of GOLD's and canonical regex language.
    • For example, it will support both {Number} and \d to specify the set of decimal digits.
  • It will have feature parity with composable regexes.
    • Anything that can be done with terminal can also be done with terminalRegex.
    • It means that features like group capturing and lazy matches are not supported and never will be.
  • It will not be whitespace-sensitive. Readability matters, and you don't usually want to explicitly tokenize whitespace.
    • If needed, representing whitespace characters can happen in regions delimited by '.
  • What about macros? FsLexYacc supports them, GOLD Parser doesn't, and Farkle's composable regex variables essentially support them through composition.
    • We will follow GOLD Parser's approach. {Set} will match a predefined set (as GOLD already does).
      • We can, surprisingly simply, get Farkle's declared predefined sets using reflection. Should we allow declaring custom ones? Maybe not; it needs more functions and pollutes the API.
  • The language of regexes is context-free, and we need a sufficiently powerful parser.
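To make the last point concrete, here is a tiny recursive-descent parser for a regex core (alternation, concatenation, Kleene star, grouping) in Python. It is a hypothetical sketch, not Farkle's parser, and covers only a small subset; the nested parentheses are what make the language context-free and rule out simple pattern matching.

```python
# Hypothetical sketch: recursive-descent parser for a minimal regex core,
# producing a nested-tuple AST. Not Farkle code; Farkle's string-regex
# language would be richer ({Set}, \d, quoted whitespace, etc.).
def parse_regex(src):
    pos = 0
    def peek():
        return src[pos] if pos < len(src) else None
    def alt():  # alternation has the lowest precedence
        nonlocal pos
        node = concat()
        while peek() == "|":
            pos += 1
            node = ("alt", node, concat())
        return node
    def concat():
        node = None
        while peek() not in (None, "|", ")"):
            nxt = star()
            node = nxt if node is None else ("concat", node, nxt)
        return node if node is not None else ("empty",)
    def star():
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node
    def atom():
        nonlocal pos
        if peek() == "(":
            pos += 1
            node = alt()
            assert peek() == ")", "unbalanced parenthesis"
            pos += 1
            return node
        ch = peek()
        pos += 1
        return ("char", ch)
    tree = alt()
    assert pos == len(src), "trailing input"
    return tree

print(parse_regex("a(b|c)*"))
# ('concat', ('char', 'a'), ('star', ('alt', ('char', 'b'), ('char', 'c'))))
```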

Generate precompiler build error reports

As shown in #9, GOLD Parser displays LALR conflicts in a not-so-helpful way. While Farkle 6.0.0 has somewhat improved the error messages for them, there is still lots of room for improvement.

We can use the new HTML generator to also create error reports for failed builds, displaying the full LALR state table instead of just the conflicts without any context.

This feature would require changes to the way LALR conflicts are represented but that's an acceptable break. It would also require some changes in the LALR builder and conflict resolver.

Build designtime Farkles on compile-time.

In contrast with FsLexYacc and other traditional parser generators, Farkle creates the parsing tables of a grammar at runtime, instead of at compile-time. This poses two problems:

  • Compared to parsing, building a designtime Farkle is a slower and more computationally intensive procedure. This inherent limitation is mostly ignored because building has a cost that is supposed to be paid only once in the execution of the average app that uses a static grammar.

The likes of FParsec largely evade this problem because they do not use precomputed parsing tables. Creating an FParsec Parser is merely creating some objects; no special algorithm involved.

When we say "slow", we mean that building a designtime Farkle for GOLD Meta-Language takes about ten milliseconds in the benchmarks. Compare this with reading an EGT file which takes some hundreds of microseconds. Both times might seem very fast, but Farkle is a library that is meant to be fast.

  • Errors with the built grammar (such as LALR conflicts) are not reported until it's too late: when text is parsed with a defective grammar. FsLexYacc brings these errors to the user's attention earlier, when the .fsl/.fsy file is being compiled (it actually tries to resolve them itself instead of failing, but that's a practice Farkle should not follow). FParsec does not bother with such kinds of errors; it will (try to) parse whatever the user gives it.

A solution for these two problems is precompiled grammars.

How it works

We will create a new function with the following signature:

module RuntimeFarkle =

val markForPrecompile: DesigntimeFarkle<'T> -> DesigntimeFarkle<'T>

It is supposed to be called from user code like this:

let designtime =
  Terminals.int "My Number"
  |> RuntimeFarkle.markForPrecompile

let runtime = RuntimeFarkle.build designtime

After compilation, Farkle.Tools.MSBuild (or the CLI tools) will probe the compiled assembly for all properties that hold a designtime Farkle and are marked for precompilation, build the grammars, and embed the compiled tables in the assembly.

When the designtime Farkle is built, Farkle will check whether the assembly has a precompiled grammar in it and use it instead of building it again. If no such grammar is found, nothing changes. The whole process must be totally transparent from the user's perspective.
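The transparent fallback described above amounts to a simple lookup-or-build, which can be sketched as follows (Python with hypothetical names, not Farkle's API):

```python
# Conceptual sketch of the transparent fallback (hypothetical names, not
# Farkle's API): prefer an embedded precompiled grammar; build only if none
# is found, so behavior without precompilation is unchanged.
def build(designtime, embedded_grammars):
    precompiled = embedded_grammars.get(designtime)
    if precompiled is not None:
        return precompiled               # fast path: use embedded tables
    return expensive_build(designtime)   # nothing changes without them

def expensive_build(designtime):
    # Stand-in for the real (slow) table-building algorithm.
    return f"tables({designtime})"

print(build("My Number", {"My Number": "embedded-tables"}))  # embedded-tables
print(build("My Number", {}))                                # tables(My Number)
```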

If building a grammar fails, an appropriate error message must be displayed and, in MSBuild's case, the build must fail (maybe add a flag to ignore the errors; they will be caught anyway by the time the runtime Farkle gets used).

Like with other metadata setters, markForPrecompile is supposed to be called only on the topmost designtime Farkle and be ignored in other cases.

Improve error messages

Farkle needs to give better error messages in case of lexical errors and LALR conflicts. Let's talk about them separately.

The biggest problem with lexical errors is that they report the first character the tokenizer read as erroneous. For example, if we have a string like "<100 characters><EOF> with no ending quote, the error would be reported at the opening quote.

The proposed solution is to include some context around the erroneous token, such as the entire 101-character-long string in the example above. Very long strings can be truncated in the middle.
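A small sketch of that presentation (Python with hypothetical names, not Farkle's API): show the whole erroneous lexeme, truncating very long ones in the middle so both ends stay visible.

```python
# Hypothetical helper (not Farkle's API): middle-truncate long lexemes for
# error messages, keeping the start and end visible.
def format_lexeme(lexeme, max_len=40):
    if len(lexeme) <= max_len:
        return lexeme
    half = (max_len - 3) // 2  # leave room for the "..." in the middle
    return lexeme[:half] + "..." + lexeme[-half:]

bad = '"' + "x" * 100  # an unterminated string literal
print(f"Lexical error: cannot recognize token {format_lexeme(bad)}")
```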

Another idea would be to display the expected characters, like syntax errors do, but only when the expected characters are few. Still, I am not sure how helpful it would be.


LALR conflicts are reported in a quite detailed way, but it could be better. Compare Farkle's error report for a naΓ―ve calculator grammar:

[image]

with GOLD Parser's:

[image]

Farkle does not give any clue about what, for example, LALR state 2 is, and there is no way to see it apart from debugging deep into the builder. We can surely do better!


Some other notes:

  • We could add methods on runtime Farkles to more easily check whether building succeeded, as well as get friendly error messages as a string.

  • Error messages for syntax errors are IMO descriptive enough and do not need any improvement.

Include the current parser state number in the syntax error.

Suggested by @Kuinox on Discord, it would make debugging easier by cross-referencing the state number with the HTML page to see where the parser was during the error.

My first thought is to not show it on the error's ToString() representation; this is something the end user will not care about, but it would be accessible from code and from the debugger.

Unless we allow for two string format types, one simple and one more detailed. This one needs more thought.

Cannot happen in 6.x due to DUs and the breaking changes around changing them.

Make `Production<T>` a covariant interface.

Background and motivation

It has been asked on Discord to support "upcast[ing] productions in a nonterminal", avoiding boilerplate fusers that upcast to a common ancestor shared by the nonterminal containing the production.

Here is a real-world example of this boilerplate. Each production builder is finished with an essentially dummy fuser1:

https://github.com/Kuinox/KuiLang/blob/bf6d6283df1b8fec0c397ab0f9f22c52acd96558/KuiLang/KuiLang.cs#L180-L183

This has been recognized as a real usability issue of Farkle, and can be solved pretty easily.

API Proposal

namespace Farkle.Builder;

- public sealed class Production<T> {}
+ public interface Production<out T> {}

Production<T> will continue to have no members, reflecting the builder's philosophy that its objects are black boxes. The I prefix will not be added, to signify that implementing this interface from user code does not make sense, just like with designtime Farkles. 2

API Usage

With the proposed change the code snippet above could be simplified to this (methodDeclaration and fieldDeclaration return different descendants of Statement.Definition):

var definition = Nonterminal.Create<Statement.Definition>("Method or Field Declaration List",
    methodDeclaration.AsIs(),
    fieldDeclaration.AsIs()
);

Much terser and more intuitive (and a tiny bit more performant because AsIs is used). A dummy prototype of this API shape compiles as expected.

Alternative Designs

We could go further and implement Production<T> on DesigntimeFarkle<T>, making the example above to:

var definition = Nonterminal.Create<Statement.Definition>("Method or Field Declaration List",
    methodDeclaration,
    fieldDeclaration
);

It would save even more characters, but it's not orthogonal to the distinction between designtime Farkles and productions; it's just a syntax convenience that needs a strong justification to puncture the architecture. And anyone interested can create a method that emulates the call above.

Risks

  • It's a breaking change.
    • We don't care, a new major version is underway, everything is subject to change, and this is not source-breaking.
  • Users might implement it by mistake.
    • It will be documented that implementing it from user code is not allowed, just like designtime Farkles.

Footnotes

  1. The fuser was subsequently eliminated by calling AsIs with an explicit type parameter (that was smart!), but that only works on the AsIs extension method of designtime Farkles, not the AsIs instance method of production builders. Even then, some boilerplate on each production remains. ↩

  2. The Farkle.PostProcessor<T> interface is an exception to this rule, and it makes sense to implement it from user code. It has been renamed to IPostProcessor<T> in Farkle 7. ↩

Farkle.Tools can generate duplicate production names.

Take this grammar for example:

<Value> ::= A
        |   B
        |   <C>
        |   <D>

Farkle.Tools will generate a source like this one:

type Production =
    | ValueA = 0
    | ValueB = 1
    | Value = 2
    | Value = 3

The last two Values have the same name!
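A fix can be sketched as a simple deduplication pass over the generated names (Python, not Farkle.Tools code): append an index to every repeated name.

```python
# Illustrative fix sketch (not Farkle.Tools code): make generated production
# names unique by appending an index to repeats.
def dedupe(names):
    seen, result = {}, []
    for name in names:
        if name in seen:
            seen[name] += 1
            result.append(f"{name}{seen[name]}")
        else:
            seen[name] = 1
            result.append(name)
    return result

print(dedupe(["ValueA", "ValueB", "Value", "Value"]))
# ['ValueA', 'ValueB', 'Value', 'Value2']
```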

Support nonterminals participating in P&A.

Consider this grammar represented in Bison syntax:

%left '|' "or"
%left '&' "and"
%precedence '!' "not"

%start expr

or_op: '|' | "or";
and_op: '&' | "and";

expr:
  expr or_op expr
  | expr and_op expr
  | "x";

Farkle fails to build this grammar because the expr <op> expr productions don't have any P&A1, and that's because the operator terminals are indirectly specified. The current workaround is to inline or_op and and_op. There are two ways to solve this:

  • Recognize nonterminals in associativity groups. The constructors of associativity groups accept an array of objects, which can be DesigntimeFarkles for terminals, strings for literals, or arbitrary objects for contextual precedence tokens, ignoring everything else. We should also recognize DesigntimeFarkles for nonterminals.

    With this change the P&A of productions without contextual precedence would become the P&A of the last terminal or nonterminal of the production that has P&A. This would be the equivalent of writing %left <op>.

  • Infer the P&A of nonterminals all of whose productions have the same P&A. This would make the above grammar work with no modifications to the P&A rules. If we go down this road we have to watch out for recursive nonterminals.

Judging from Bison's documentation, it doesn't do the things proposed above; I wonder why. Should we even make such comparisons? In any case I don't see any downsides; it won't affect those who don't use these features.
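The second option above can be sketched in a few lines (Python with a hypothetical data model, not Farkle's builder): a nonterminal inherits a precedence level only when every one of its productions agrees on one.

```python
# Hedged sketch of P&A inference (hypothetical data model, not Farkle code):
# a nonterminal gets a precedence level only if all its productions share it.
def infer_precedence(nonterminal, productions, prec_of_prod):
    """productions: nonterminal -> list of production ids;
    prec_of_prod: production id -> precedence level (or missing)."""
    levels = {prec_of_prod.get(p) for p in productions[nonterminal]}
    # A real implementation would also guard against recursive nonterminals.
    if len(levels) == 1 and None not in levels:
        return levels.pop()
    return None

productions = {"or_op": ["p1", "p2"], "and_op": ["p3", "p4"]}
prec_of_prod = {"p1": 1, "p2": 1, "p3": 2, "p4": 2}
print(infer_precedence("or_op", productions, prec_of_prod))  # 1
```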

Footnotes

  1. Precedence & Associativity ↩

The builder produces invalid DFA states on grammars with long literals.

I'm trying to use Farkle to parse the input for Advent of Code 2020 Day 7. You can find my current WIP here. I have a few questions about the implementation of my parser that I hope you can help me with.

Firstly, my parser is failing on the input text light red bags contain 1 bright white bag, 2 muted yellow bags. with the following error:

(1, 9) Cannot recognize character 'd'.

The error seems to be indicating that the text fragment 'light red' can't be parsed. However, I have a unit test which tests parsing the bag colour and this is passing. It's not clear to me why the parser I have written can't parse the entire line of text, when the tests I have written for parsing the individual elements of the line all pass.

Secondly, I have written the following parser to parse the word 'bag', which can either be singular or plural:

static let bagWord = regexString "bag(s)?" |> terminal "BagWord" (T(fun _ data -> data.ToString()))

I use it when parsing the name of a bag, e.g. 'light red bags':

static let bagType = "BagType" ||= [ !@ colour .>>. bagWord => (fun c _ -> c) ]

It works, but I'm wondering whether there is a better way of doing this. I'm not really interested in the result of parsing this word. You can see from the above code fragment that I'm discarding the result. However it doesn't look as if I can use literal text here, because the word can either be singular or plural. Do you have any suggestions?

Thirdly, I don't know how to extend the parser so that it can handle the following input, where a bag has no contents:

faded blue bags contain no other bags.

It looks like I need to use some kind of alternation construct here which tries one type of Farkle and then falls back to using another type if the first one fails. I see that there is something called choice in the library, but this is a low-level construct for Regexes. I think I would need something that can choose between different Farkles, first trying the one which matches against bags with contents, and then trying the one which matches against bags with no contents if the first one fails. What can I do here?

Many thanks

Paul

Add MSBuild integration.

Description

Farkle strives for the best development experience for a parsing library. We already have CLI tools, ready to be released, that generate text templates based on the grammar.

The next step is to integrate these tools with MSBuild.

What it will look like

As I am imagining it, it will look like this:

<ItemGroup>
    <Farkle Include="grammar.egt">
        <CustomNamespace>MyProject.MyGrammar</CustomNamespace>
        <Internal>true</Internal>
    </Farkle>
    <Compile Include="MyGrammar.fs" />
    <!-- The rest of the files -->
</ItemGroup>

Features to have

  • Support for both C# and F# (preferably within the same package).

    • Visual Basic will not be supported. I originally wrote that it does not support unsigned integers; I just learned it does, but they have terrible performance characteristics anyway. It is not a priority.
  • Proper logging support. Related to #6 .

  • Maybe pack the templating code as an MSBuild task, instead of just invoking the CLI tool? I have to investigate what strategy other projects follow.

Investigate using EventSource for logging builder events.

We might be able to replace BuilderLogger and assorted infrastructure with an EventSource for the builder. This will be an improvement in many ways (less code to maintain, logging events decoupled from their collection, more powerful options for collecting events).

Question about multiple Regexs and IndistinguishableSymbols2

Hi,
Thanks for sharing Farkle.
Apologies if this is a stupid question.
I am trying to parse a schema file to an AST.
I have a problem with declaring multiple regexes in a single df.
If this is not allowed, can you suggest a different way of doing it?
Below is my fsx file.

#r "nuget: Farkle, 6.4.0"

open Farkle
open Farkle.Builder
open Farkle.Builder.Regex


let makeDesigntime () =
    let t' = Terminal.Literal("type")

    let typeName =
        Regex.regexString "[A-Z][a-z0-9]+"
        |> terminal "typeName" (T(fun _ data -> data.ToString()))
        
    let enumVal =
        concat [
            Regex.regexString "[A-Z]+"
            concat [
                Regex.Literal "="
                Regex.chars PredefinedSets.Number |> plus
            ] |> optional
        ]
        |> star
        |> terminal "enumVal" (T(fun _ data -> data.ToString()))
    
    
    let enumDefT = Nonterminal.Create<string>("EnumDef")
    enumDefT.SetProductions(
        !& "enum" .>> "{" .>>. (enumVal) .>> "}" |> asIs
    )
    
    let bareT = Nonterminal.Create<string>("Bare")
    bareT.SetProductions(
        !%t' .>>. typeName .>>. enumDefT => (fun k v -> k + v)
    )
        
    bareT
    |> DesigntimeFarkle.caseSensitive true
    |> DesigntimeFarkleOperators.many

let convert: RuntimeFarkle<list<string>> =
    makeDesigntime () |> RuntimeFarkle.build

"""
type UserType enum { 
    USR
    RUS = 99
}
"""
|> RuntimeFarkle.parseString convert 
|> printfn "%A"

I get a build error.

  (BuildError
     [IndistinguishableSymbols2
        (set
           [Choice1Of4 (Terminal (1u, "typeName"));
            Choice1Of4 (Terminal (4u, "enumVal"))], "AA");

How to do error recovery

I've started using Farkle to build a parser for a programming language. I'm happy with the library and the results I'm getting, so thanks for working on it!

One thing that's not very clear to me is how to do error recovery with Farkle – with parser generators à la yacc/bison/menhir I have access to 'error' productions; by inserting them in strategic places I can make the parser parse as much as possible while accumulating parse errors. Is there a similar mechanism in Farkle?

The plan for Farkle 7

The plan for Farkle 7

Farkle's goal is to provide the best LALR parsing experience for .NET. To achieve this, version 7, the next major version of Farkle, is going to be a significant reimagining of the library's internals, aiming for improved performance, extensibility and compatibility going forward. I've been thinking about it for more than a year and it's finally time to put my thoughts on paper. Here's an outline of what will come (links to dedicated issues will be provided in the future):

  • Farkle will be rewritten in C#: F# is a good language, however for a project like Farkle I have often found myself working not with it but against it. There are some important features C# has and F# doesn't, forcing me to devise suboptimal workarounds. For some examples, I had to write a weaver because F# does not support covariant and contravariant generics, and I had to use reflection in one place because in F# cyclical dependencies among files are not supported. With the drawback of Farkle's codebase size getting increased, rewriting it to C# will bring benefits such as more control over the produced code, decoupling from the FSharp.Core library, better tooling support and improved compilation times. The main Farkle library will be the first to be rewritten and other components such as the precompiler and the CLI tool will eventually follow. The existing tests will stay in F#.

    But, providing an idiomatic F# API is still a goal. I'm not sure how it will be done; the options are either by publishing a separate package with a name like Farkle.FSharp, by providing an F# source file inside the main package, or by providing the F# API in the library binary itself. In the latter case Farkle would still depend on FSharp.Core but a large portion of it would be able to be trimmed away.

  • Improve the grammar format: Owing to Farkle's origins as an F# engine for the GOLD Parser project, it has been able to read and parse grammars from GOLD's Enhanced Grammar Table (EGT) files since its first versions. After gaining the ability to build grammar tables on its own and to support the precompiler feature, Farkle needed to also write grammars to files. For this reason the EGTneo format was born, which shares some semantics with EGT files but is more compact and easier to read from code, because it better fits Farkle's grammar domain model.

    This however proved to be EGTneo's weak point. Because it so tightly matches Farkle's grammar domain model and because it was designed very hastily without any form of extensibility, it makes adding new grammar features almost impossible. Add to that some unfortunate design choices that limit performance (such as the use of ImmutableDictionary<TKey,TValue>, which at the time I did not know was implemented as a tree). That's why Farkle 7 will come with a new grammar format that is carefully designed with extensibility in mind and will enable significant innovations in the grammars' features. These are some of the new features in my mind that will come with Farkle 7:

    • Hidden Terminals: A terminal can be marked as hidden, which means that it won't be suggested in syntax errors. A use case for it is in Farkle's string regex grammar, which defines a syntax for Unicode categories which are not actually supported and displays an error if they are encountered.

    • Noisy Terminals: A terminal can be marked as noisy, which means that it will be ignored if encountered in an unexpected place, instead of raising a syntax error. A use case for it is Farkle's newLine, which, if used in a grammar, prohibits new lines from appearing in any place not explicitly marked. By allowing newLine to be noisy, new lines in unexpected places will be allowed and ignored.

    • Support for partial grammars: The new grammar format will support encoding partial grammars, which are grammars that had a build error and cannot be used for parsing. This would allow increased integration with Farkle's templating subsystem and improved and more complete HTML error reports.

    The new grammar format will be properly documented and versioned, making it easy to write grammar readers in other languages. Farkle 7 will continue supporting reading GOLD Parser's EGT files by converting them to the new grammar format and reading that under the hood. The Farkle 6 EGTneo format will not be supported.

  • Improve the parser APIs: There are many changes planned in this area, but let's talk about the most fundamental. Currently the architecture of Farkle's parser is comprised of these three components from bottom to top: the CharStream API, the tokenizer and the LALR parser. When you tell Farkle to parse a piece of text, it invokes the parser, which invokes the tokenizer many times to get tokens, which uses the CharStream to get the input characters, potentially performing I/O to populate its internal buffer. This is called a pull-based model because characters are pulled by the tokenizer when needed.

    Farkle 7 will switch to a push-based model, where I/O happens outside the parser call chain. Instead of the parser reading characters by itself, we will give the characters to it, and if it runs out of characters, the parser will stop and wait for us to give it more. If input ends, we will tell the parser so that next time it will not ask for more, but appropriately pass EOF to the LALR state machine, and return a result. Parsing contiguous characters in memory will be like a single big push. The possibilities this new design will bring are enormous. In Farkle 6 if you wanted to control the input to the parser you had to inherit TextReader which is cumbersome, and you couldn't pause a parser once it started. One of these possibilities is asynchronous parsing, which becomes trivial with this new architecture. These changes will be mostly transparent to the average user of Farkle, but those who use advanced features like custom tokenizers will have to adjust to the push-based API.

    Besides that, the new parser architecture will be more flexible both in its implementation (support for custom tokenizers will be more streamlined) and its support for inputs (Farkle 6 supports parsing from ReadOnlyMemory<char> while Farkle 7 will support ReadOnlySpan<char> as well) and as for performance, parsing contiguous characters in memory will be amortized allocation-free.1
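The push-based flow described above could look roughly like the following sketch; the IPushParser interface and its members are invented for illustration and are not an actual Farkle 7 API:

```fsharp
open System
open System.IO

// Illustrative interface only, not an actual Farkle 7 type.
type IPushParser<'T> =
    /// Consumes as many characters as possible; returns a result if parsing completed.
    abstract Push: chars: ReadOnlySpan<char> -> 'T voption
    /// Signals the end of input, letting the parser pass EOF to the LALR state machine.
    abstract CompleteInput: unit -> 'T

let parseReader (parser: IPushParser<'T>) (reader: TextReader) =
    let buffer = Array.zeroCreate<char> 4096
    let mutable result = ValueNone
    while result.IsNone do
        // I/O happens here, outside the parser call chain.
        let read = reader.Read(buffer, 0, buffer.Length)
        if read = 0 then
            // Input ended; tell the parser so it can produce a final result.
            result <- ValueSome(parser.CompleteInput())
        else
            // Push the characters; the parser pauses when it runs out of them.
            result <- parser.Push(ReadOnlySpan(buffer, 0, read))
    result.Value
```

Parsing contiguous characters in memory then becomes a single big push followed by CompleteInput, and an asynchronous variant only needs to make the I/O call asynchronous; the parser itself is unaffected.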

  • Improve the builder APIs: The builder will receive less radical changes than the parser. Its object model of small and (almost) immutable grammars that get composed into bigger ones has worked well, and the main essence of it will carry on. These are the areas that will be improved:

    • Distinguishing between global and local options: There are some extension methods to configure grammars, and most of them take effect only if applied to the starting symbol of the grammar that will get built. We will call these global options; examples are adding comments, setting case sensitivity, and whether to ignore whitespace. Other options can be set on any symbol of the grammar, like the symbol's name and operator precedence and associativity info; these are called local options. Setting a global option anywhere other than the grammar's starting symbol is silently ignored, and issues have been opened about that. In Farkle 7 global configuration methods will return a special grammar type that cannot be composed, only built and configured with other global options. This way such mistakes will result in compiler errors.

    • Improving string regexes: The language of string regexes will be redesigned, with the goal of making it as close to the standard regex syntax as possible, and removing GOLD Parser influences.

    • Compatibility levels: To lessen the impact of future breaking behavior changes in the builder, in Farkle 7 you can set a compatibility level on a builder object. If for example you write a grammar in Farkle 7.0, you can set its compatibility level to Farkle 7.0, so that if a future Farkle version gets released with an observable behavior change, you won't be immediately affected.
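One possible shape for the global/local distinction, with entirely hypothetical type and member names:

```fsharp
// Hypothetical sketch; none of these types or methods exist in Farkle.
// A grammar symbol supports local options and composition with other symbols:
type IGrammarSymbol =
    abstract Rename: name: string -> IGrammarSymbol

// Applying a global option returns a builder-only type that accepts further
// global options and can be built, but can no longer be composed:
type IGrammarBuilder =
    abstract CaseSensitive: value: bool -> IGrammarBuilder
    abstract AddBlockComment: commentStart: string * commentEnd: string -> IGrammarBuilder
```

Composing a globally-configured grammar into a larger one would then fail at compile time instead of being silently ignored.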

  • Improve the precompiler: The precompiler is Farkle's killer feature and sets it apart from other parser generators and combinator libraries, allowing it to compete with both. Its current design's main problems are that it requires reflection, does not allow the precompiled grammar to be trimmed, did not work with AOT (until Farkle 6.5.0), and relies on static constructors which during precompiling might execute more code than necessary. Farkle 7 will come with a new and slightly different way to mark grammars to precompile, that will solve all of the above problems.

  • Improve diagnostics: Farkle's existing diagnostic facilities leave a lot to be desired. Farkle 7 will improve on this area by adding support for logging events other than errors in the builder and parser, giving every error and warning a unique code, and localizing its error messages (apart from English I will localize them to my native Greek; the community is free to contribute additional languages).

Plans beyond 7.0

The Farkle 7.0 release will be mostly a refactoring, with few exciting features. Once that gets done and we have a solid foundation under our feet, here are some things that could get done in the 7.x timeframe:

  • Better Unicode support: Farkle's tokenizer matches characters in text with the help of a DFA, which gets created by the builder from the terminals' regexes. These regexes match individual 16-bit characters, constrained to the Basic Multilingual Plane (BMP). You can match individual characters outside of it -the string will contain the surrogate characters and the DFA will just match them without problem- but you can't match a range of characters outside the BMP. We can make Farkle's regexes match Unicode code points, and at the time of constructing the DFA lower them to match UTF-16 characters. Thus a pattern like [πŸ§„-πŸ§…] which corresponds to [\U0001F9C4-\U0001F9C5] will get lowered to \uD83E[\uDDC4-\uDDC5].

    • Once we do that we can finally start talking about UTF-8 parsers that would directly parse the UTF-8 bytes without converting them to UTF-16.

    • And we can also support matching Unicode Categories and get rid of GOLD Parser's problematic predefined sets like All Letters that were brought over to Farkle.
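The lowering described above follows the standard UTF-16 surrogate encoding formulas; a small helper (hypothetical, not a Farkle API) shows the arithmetic:

```fsharp
// Compute the UTF-16 surrogate pair of a code point outside the BMP,
// using the standard encoding formulas.
let toSurrogatePair (codePoint: int) =
    let v = codePoint - 0x10000
    let high = 0xD800 + (v >>> 10)   // high (lead) surrogate
    let low = 0xDC00 + (v &&& 0x3FF) // low (trail) surrogate
    struct (high, low)

// U+1F9C4 (πŸ§„) yields (0xD83E, 0xDDC4) and U+1F9C5 (πŸ§…) yields (0xD83E, 0xDDC5),
// which is how [πŸ§„-πŸ§…] lowers to \uD83E[\uDDC4-\uDDC5].
```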

  • Support for the IELR algorithm: To construct the parsing tables, Farkle uses the LALR algorithm like many other bottom-up parsing tools such as FsLexYacc. The problem with LALR is that it's not powerful enough and the grammar author often has to adjust the grammar to avoid conflicts. IELR is a more powerful algorithm that can resolve more conflicts than LALR and would increase productivity. Besides that, it would be a significant innovation for Farkle; other than GNU Bison there is no other known IELR implementation.

  • Code generation: The way Farkle's parser works today is by reading the grammar file and interpreting its parsing tables. We could instead generate specialized code that parses a specific grammar. Farkle 6 had this capability, but only in the post-processor, and it was eventually removed in Farkle 6.5.0 due to stability concerns. It will be interesting to investigate, after Farkle 7.0, creating an improved code generator that covers the entirety of the parsing process and does not use the grammar file. This would bring Farkle's performance to a whole new level, while also taking advantage of technologies like ReadyToRun, Native AOT and Dynamic PGO. It will have to support both dynamic code generation with System.Reflection.Emit and static code generation at the precompiler.

Frequently Asked Questions

Sounds great. Is there an ETA?

Unfortunately not. There are a lot of features to be implemented, and I don't want to release even a preview version until it reaches near feature parity with the F# Farkle library (some features will be dropped).

What about compatibility?

Binary compatibility with older Farkle versions will not be guaranteed; you will need to at least recompile your applications to use the new version of Farkle. Regarding source compatibility, the expectation is that if you are only building a grammar and parsing it you would just have to do some renames and slightly adjust code for things like the precompiler or the string regexes. Other scenarios will need more extensive changes. It is not yet decided how smooth the update process will be but there will be a detailed guide, and open-source projects that use Farkle will be assisted with the upgrade.

Footnotes

  1. Only the parser itself will not allocate. The objects produced by transformers and fusers, and any custom tokenizers if they exist, may still cause allocations, but that's something user code can control. ↩

GitHub Actions next steps.

Now that Farkle is on GitHub Actions we can take advantage of its unique features to optimize Farkle's CI automation.

The objective is to do the most necessary stuff in the main CI workflow (tests) and delegate the others to separate workflows. Here's some of the things we can do:

  • Automate the release workflow and condition it on a tag.
  • Automate the site publishing workflow.
  • Automate the benchmarks workflow.
    • They are incredibly fast and precise in GitHub Actions, I am very impressed; we might be able to run them on each commit.
  • Create separate workflows for the legacy F# projects.

My original thought with GitHub Actions was that we would get rid of the entire FAKE script, but that's not a good idea; it's still needed to execute complex actions both in CI and locally.

Include tokenizer symbol type in DFA conflict messages.

Suggested by @rikashore on Discord, using the same character as a literal and as a line comment start produces a potentially confusing error message:

Building the grammar failed: Cannot distinguish between symbols: ';', (;).

While this behavior is mostly expected when using symbols with the same name, it would help to explain in the error message that the first ; is a literal, and the second ; is a group start for Comment.

Add support for operator precedence and associativity

In contrast with FsLexYacc, neither Farkle nor GOLD Parser provide a facility to define operator precedence or associativity, as a means to resolve conflicts.

While GOLD Parser always resolves Shift-Reduce conflicts in favor of Shift (with a warning), Farkle is more explicit and raises an error on every type of LALR conflict. Developers have to define operator precedence and associativity in a more explicit way, by creating extra rules.

First-class support for operator precedence and associativity would require changes to the structure of the designtime Farkles and to the LALR parser builder. The latter already provides some facilities for custom conflict resolution (which currently always raise an error).
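As an illustration of the current workaround, precedence and associativity can be encoded by stratifying the grammar into one nonterminal per precedence level. The following untested sketch follows the style of Farkle 6's F# API shown in the issue above (the exact operators may differ):

```fsharp
open Farkle
open Farkle.Builder
open Farkle.Builder.Regex

// "*" and "/" bind tighter than "+" and "-" because Term sits below Expr;
// left recursion (Expr on the left of itself) makes the operators left-associative.
let number =
    chars PredefinedSets.Number |> plus
    |> terminal "Number" (T(fun _ data -> int (data.ToString())))

let expr = Nonterminal.Create<int> "Expr"
let term = Nonterminal.Create<int> "Term"
let factor = Nonterminal.Create<int> "Factor"

expr.SetProductions(
    !@ expr .>> "+" .>>. term => (fun x y -> x + y),
    !@ expr .>> "-" .>>. term => (fun x y -> x - y),
    !@ term |> asIs)
term.SetProductions(
    !@ term .>> "*" .>>. factor => (fun x y -> x * y),
    !@ term .>> "/" .>>. factor => (fun x y -> x / y),
    !@ factor |> asIs)
factor.SetProductions(
    !& "(" .>>. expr .>> ")" |> asIs,
    !@ number |> asIs)
```

First-class precedence declarations would make the Term and Factor layers unnecessary, letting a single Expr nonterminal carry all the operators.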
