Coder Social home page Coder Social logo

generic-recognizer's Introduction

Input Grammar

The program reads grammars in EBNF. Here is an example of a grammar describing the JSON format:

! I'm a comment.
object* = "{" [ pairs ] "}" ;
pairs = pair { "," pair } ;
pair = #STR2 ":" value ;
value = #STR2 | #NUM | "true" | "false" | "null" | object | array ;
array = "[" [ elements ] "]" ;
elements = value { "," value } ;
.

From this example we can see several things:

  • = is used to define nonterminals.
  • Each production rule is terminated by a ;.
  • The whole grammar is terminated by a single ..
  • The start symbol (there must be only one) is denoted with a *.
  • | is used to denote alternatives.
  • {} are used to enclose parts that can repeat zero or more times.
  • [] are used to enclose parts that are optional.
  • ! starts a comment that extends until the end of the line.

() can be used for grouping. For example:

expr = term { ( "+" | "-" ) term } ;

Terminals symbols (tokens) can be specified in two ways:

  • With a # followed by the name of a token (see tokens.def).
  • Enclosed in quotes. If what is between the quotes is an identifier, then it is treated as a keyword token. Otherwise, what is between the quotes must be something that can be lexed by lex_get_token() (see lex.c).

There is no need for an explicit symbol to denote the empty string (epsilon, ε).

Usage

You can use the input grammar to recognize an input string right away or to generate a recognizer program.

Recognizing an input string

The input string is specified as a second argument:

$ ./genrec examples/grammar1.ebnf examples/string1

The -v option can be used to trace out the leftmost derivation that is performed. The program will exit silently if the string does not contain any syntax error.

Generating a recognizer

The -g option can be used to generate a recursive descent recognizer program in the C programming language. The recognizer is emitted to stdout by default (this can be changed with the -o option).

For example, the following generates a recognizer for the above JSON grammar:

$ ./genrec examples/grammar8.ebnf -g -o json_rec.c
$ cc json_rec.c lex.c util.c -o json_rec
$ ./json_rec examples/string8

The generated program takes as single argument the input string. It also uses lex.c for the lexing part (and lex.c uses functions defined in util.c, that is why it appears there in the linking).

The recognizers generated by -g are somewhat limited. In particular, they do not support the following features (described later):

  • Selective backtracking ([[]]).
  • Input/Output controlling actions ($).
  • Named buffers (>$).

Debugging the grammar

The program can also be used to get more information about the input grammar and to detect possible conflicts.

The -f options prints the First sets of all nonterminals, and the -l option prints the Follow sets.

The -c option checks the input grammar for LL(1) Conflicts. These conflicts can be left recursion (immediate or indirect), First/First conflicts, and First/Follow conflicts.

Sometimes, a grammar containing conflicts can still be used and will behave in a deterministic manner. The following points are relevant in determining the behavior:

  • All the operators consult uniquely the First sets of their operands to decide where to direct the flow of control of the recognition.

  • The | operator tests its operands from left to right. For example, given the rule

    S = α | β ;
    

    if there is a First/First conflict between α and β, then β will be unreachable on those symbols that cause the conflict (see grammar7.ebnf).

  • [] (and {}) tests the current token of lookahead against the First set of its operand to decide if it applies or not. For example, given the rule

    S = α [ β ] γ ;
    

    [] will apply β when the current token is in First(β) and will not care if it is also in First(γ). This is how the dangling else problem is resolved (see grammar6.ebnf).

Left recursion almost certainly will cause the death of the recognizer by infinite recursion, so under the -c option left recursion is considered a fatal error.

Generating output

You can cause the generation of output by embedding outputting constructs directly into the grammar. The basic idea is borrowed from META II. The following example translates simple arithmetic expressions into assembly code for a stack machine:

$ cat expr.ebnf
program* = { expr "." } ;
expr = term { "+" term {{ "ADD" ; }}
            | "-" term {{ "SUB" ; }} } ;
term = factor { "*" factor {{ "MUL" ; }}
              | "/" factor {{ "DIV" ; }} } ;
factor = #ID {{ "LD " * ; }} | #NUM {{ "LDL " * ; }} | "(" expr ")" ;
.
$ cat expr_string
a*b+c/2.
$ ./genrec expr.ebnf expr_string
LD a
LD b
MUL
LD c
LDL 2
DIV
ADD

Output constructs are introduced with {{}}. One or more of the following things can appear inside:

  • A double-quoted string (e.g. "ADD") to output text verbatim.
  • A ; to output a new-line character.
  • A * to output the last matched token.
  • A # to output a unique number. Each instance of a rule gets a new unique number.
  • + to increase the indentation level.
  • - to decrease the indentation level.
  • A named buffer as $buffer_name (explained below).

The default indentation level is zero. An indentation level of zero or below means the lines are outputted with no indentation. Each indentation level represents 4 space characters.

You can save the output that results from a rule invocation and then reference it later within {{...}}. This is useful, for example, when one wants to have a finer control of the order in which things are outputted.

Suppose for example that you are writing an assembly converter from Intel to AT&T syntax. Intel syntax uses instr dest, src while AT&T uses instr src, dest. One can accomplish the conversion using something like the following (assume mnemonic and operand are defined elsewhere):

instr = mnemonic>$mne operand>$op1 "," operand>$op2 {{ $mne" "$op2", "$op1 }} ;

To save the output from a rule A into a buffer named buf we write A>$buf. We can use $buf within {{...}} to output what it contains.

There are a few things to watch out when using named buffers:

  • Buffer names have rule scope and so one cannot reference names defined in other rules.
  • Each rule has a single set of buffers and so recursive rules that use named buffers will likely cause trouble.
  • Each instance of rule>$buf causes the truncation of buf (the writing position is set to zero) instead of appending to what it already contains.

The following example uses # to generate unique labels:

if_stmt = "if" expr "then" {{ "BF A"# ; }}
          stmt "else" {{ "B B"# ; "A"#":" ; }}
          stmt {{ "B"#":" ; }} ;

Backtracking

Full backtracking is generally too inefficient to be used in practice. The recognizer supports instead a form of backtracking (selective backtracking) where the user explicity specifies those places where backtracking should be used. Backtracking is introduced with [[]]. See grammar11.ebnf for an example of usage.

Limitations

  • Sets are represented internally with uint64_t bit vectors. This limits the number of rules to 64, and the number of terminals (tokens) to 63 (the most significant bit is reserved to represent ε).

generic-recognizer's People

Contributors

lucianognz avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.