daniel-boll / pile Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 0.0 131 KB

The oficial repo for the pile lang

C++ 95.02% Lua 0.53% Makefile 4.45%

compiler compiler-principles concatenative-programming-language lexer-analyzer programming-language stack-based

pile's Introduction

Daniel Boll

pile's People

Contributors

Stargazers

Watchers

pile's Issues

parser: add the grammar parser

Grammar Parser

A code able to parse any grammar that respects the following specification:

Each line may have
- A comment (delimited with leading //)
- An empty line
- A production in the format LHS -> RHS
  - LHS MUST always be a Production entity
  - RHS MUST always be any combination of the entities
    - Production
    - Terminal
    - Empty
  - RHS can have multiple entities of the same group separating by spaces or different productions separating by |

Examples

<S> -> <S> s | s

Should create something along the lines of

GrammarContent{{
    {Production{"S"}, {Production{"S"}, Terminal{"s"}}},
    {Production{"S"}, {Terminal{"s"}}},
}}

And

<S> -> <S> s | s | <S> <B> ε

GrammarContent{{
    {Production{"S"}, {Production{"S"}, Terminal{"s"}}},
    {Production{"S"}, {Terminal{"s"}}},
    {Production{"S"}, {Production{"S"}, Production{"B"}, Empty{}}},
}}

Must notice that ε creates an Empty{} entity and not a Terminal{"ε"}

lexer: rewrite spec

Rewrite Finite Automatum Specification

Include cleaver signal dealer

Change the finite automatum to no more deal with the number signal. The GLC can check by
- integer-literal | - float-literal

Rewrite the specification

Rewrite the finite-automatum.spec.json to compute tokens of the following GLC:

parser: add a LL(1) parser for syntax analysis

LL(1)

The LL(1) parser will receive a LL(1) grammar expecting it to be already LL(1) compliant. The purpose of this top-down analysis is to create a AST (Abstract Syntax Tree) deriving left-to-right.

Structures

First-Follow

We have to establish what grammar rule the parser should choose if it sees a nonterminal $\Delta$ on the top of its stack and a symbol $\alpha$ on its input stream. Wiki

First set

The first set represents the possible characters found from that nonterminal $\Delta$, written as $Fi(w)$.

Given a grammar with the rules $A_1 \rightarrow w_1$, …, $A_n\rightarrow w_n$, we can compute the $Fi(w_i)$ and $Fi(A_i)$ for every rule as follows:

initialize every $Fi(A_i)$ with the empty set { $\varnothing$ }
add $Fi(w_i)$ to $Fi(A_i)$ for every rule $A_i \rightarrow w_i$, where Fi is defined as follows:
- $Fi(aw')$ = { $a$ } for every terminal $a$
- $Fi(Aw')$ = $Fi(A)$ for every nonterminal $A$ with ε not in $Fi(A)$
- $Fi(Aw' )$ = ( $Fi(A)$ \ { $\varepsilon$ } ) $\cup \ \ Fi(w' )$ for every nonterminal $A$ with $\varepsilon$ in $Fi(A)$
- $Fi(\varepsilon)$ = { $\varepsilon$ }
add $Fi(w_i)$ to $Fi(A_i)$ for every rule $A_i \rightarrow w_i$
do steps 2 and 3 until all $Fi$ sets stay the same.

Follow set

Unfortunately, the First-sets are not sufficient to compute the parsing table. This is because a right-hand side $w$ of a rule might ultimately be rewritten to the empty string. So the parser should also use the rule $A \rightarrow w$ if $\varepsilon$ is in $Fi(w)$ and it sees on the input stream a symbol that could follow $A$. Therefore, we also need the Follow-set of $A$, written as $Fo(A)$ here, which is defined as the set of terminals a such that there is a string of symbols $\alpha Aa\beta$ that can be derived from the start symbol. We use $ as a special terminal indicating end of input stream, and S as start symbol.

Computing the Follow-sets for the nonterminals in a grammar can be done as follows:

initialize $Fo(S)$ with { $ } and every other $Fo(A_i)$ with the empty set
if there is a rule of the form $A_j → wAiw'$ , then
- if the terminal $a$ is in $Fi(w')$, then add $a$ to $Fo(Ai)$
- if $\varepsilon$ is in $Fi(w')$, then add $Fo(A_j)$ to $Fo(A_i)$
- if $w'$ has length 0, then add $Fo(A_j)$ to $Fo(A_i)$
repeat step 2 until all Fo sets stay the same.

Parsing Table

It is a bidimensional matrix in the format $M[\Delta, \alpha]$ where $\Delta$ is a nonterminal and $\alpha$ is a terminal.

Now we can define exactly which rules will appear where in the parsing table. If $T[A, a]$ denotes the entry in the table for nonterminal $A$ and terminal $a$, then

$T[A,a]$ contains the rule $A → w$ if and only if
$a$ is in $Fi(w)$ or
$\varepsilon$ is in $Fi(w)$ and $a$ is in $Fo(A)$.

Equivalently: $T[A, a]$ contains the rule $A \rightarrow w$ for each $a \in Fi(w)\cdot Fo(A)$.

If the table contains at most one rule in every one of its cells, then the parser will always know which rule it has to use and can therefore parse strings without backtracking. It is in precisely this case that the grammar is called an LL(1) grammar.

parser: implement the SLR bottom-up parser

SLR Parser

Items

1. LR Items

A LR item is a production with a dot in it.
Example:
- For <A> -> <C> <D> <E>
- LR items are:
  - <A> -> · <C> <D> <E>
  - <A> -> <C> · <D> <E>
  - <A> -> <C> <D> · <E>
  - <A> -> <C> <D> <E> · (reduction)
For the empty production we have: <A> -> ε => <A> -> ·

2. Extended grammar

We add a new start symbol <S'> to the grammar.
We add a new production <S'> -> <S> to the grammar that indicates to the analyzer when to stop.

3. Closure

Considering I as the set of LR items, the closure of I is the set of LR items that can be reached from I by applying the following rules:

Each item of I is added to the closure of I
If A -> α · B β is in I and B -> γ is a production in the grammar, then B -> · γ is added to the closure of I
- Repeat until there's no possibilities for new items

3.1 Example

Given the grammar:

<E'> -> <E>
<E> -> <E> + <T> | <T>
<T> -> <T> * <F> | <F>
<F> -> ( <E> ) | id

If I is the set of the item {[<E'> -> · <E>]} then:

closure(I) = {
  <E'> -> · <E>
  <E> -> · <E> + <T>
  <E> -> · <T> // brings all of T
  <T> -> · <T> * <F>
  <T> -> · <F> // brings all of F
  <F> -> · ( <E> )
  <F> -> · id
}

4. Goto

The operatio Goto(I, X) returns the closure of the set of LR items that can be reached from I by shifting the dot over the symbol X.

Move the dot right for every item of I where the symbol after the dot is X.
- For all rule <A> -> <α> · <X> <β> in I and <x> = X add <A> -> <α> <X> · <β> to the closure of I
- Calculate the closure for the set of items

4.1 Example

<S'> -> <T>
<T> -> <F> | <T> * <F>
<F> -> id | ( <T> )

closure(I) = {
  <S'> -> · <T>
  <T> -> · <F>
  <T> -> · <T> * <F>
  <F> -> · id
  <F> -> · ( <T> )
}

shift(I, () = {
  <F> -> ( · <T> )
  <T> -> · <F>
  <T> -> · <T> * <F>
  <F> -> · id
  <F> -> · ( <T> )
}

shift(I, id) = {
  <F> -> id ·
}

shift(I, <F>) = {
  <T> -> <F> ·
}

shift(I, <T>) = {
  <S'> -> <T> ·
  <T> -> <T> · * <F>
}

5. SLR(1) Parsing Table

The SLR(1) parsing table is a table that contains the actions that the parser should take for each state and each symbol. It consists in calculate
the sets parting from the initial state and then calculate the goto for each symbol of the grammar.

The I_0 is the initial state and the I_n is the final state. I_0=[S' -> · <S>]

5.1 Example

Pseudo

// @input: extended grammar G'
// @output: SLR syntatic actions and gotos for G'
function table(G') {
  1. Build C={I_0..I_n}
  2. The ith state is built from I_i. The syntatic actions for the state i are determined by:
    2.1 If [A -> α · a β] is in I_i and a is a terminal, then shift[I_i, a] = shift j, where j is the state that contains [A -> α a · β].
    2.2 If [A -> α ·] is in I_i, then reduce[A -> α] = reduce A for every `α` in Follow(A).
    2.3 If [S' -> S ·] is in I_i, then accept = accept.
  3. The syntatic gotos for the state i are determined by:
    3.1 If shift[I_i, a] = shift j, then shift[i, a] = j.
}