I stumbled upon some small potential issues for the lexical analysis. In the current v

What is your preference? I was planning to use <code class="notransl

Lexical analysis / tokenizer about hoaf HOT 7 CLOSED

kleinj commented on July 17, 2024

Lexical analysis / tokenizer

from hoaf.

Comments (7)

adl commented on July 17, 2024

What is your preference?

I was planning to use flex's start conditions to handle this, but I can understand why context-sensitive stuff could be inconvenient.

I would like to reorganize and augment your list of solutions with some other ideas:

solutions that work without changing the current grammar (I see all of those choices as implementation details left to the choice of the developer, although I agree parsing would be easer if it did not require advanced features or ugly hacks.)
1. context-sensitive scanner (e.g. flex's start conditions)
2. scan everything between Acceptance: and the next HEADER: or --BODY-- as a single token, then call a separate parser. This is tricky and makes error reporting a pain.
3. scanning [IF] as well as [iF][0-9]+ as some specific token types for use in the Acceptance: grammar rule, and adjusting the implementation of the other rules to allow these token in addition to identifiers and decoding [IF][0-9]+ only when appropriate in the parser. (I'm not calling those token keywords, because this solution is just one way to implement the given grammar.)
solutions that change the grammar
1. require space between [FI] and [0-9]+, making F and I keywords, and fixing the specification to also allow F and I everywhere IDENTIFIER is used. This makes the implementation details discussed in the previous solution explicit in our document, somehow forcing the implementation. The only win is that the number can decoded in the scanner.
2. replace F and I by characters that cannot be mistaken as identifiers (what about + and - ?)
3. Force acceptance set numbers to be enclosed in braces, as in Acceptance: 3 F{0} & (F {1} | F!{2})& I{2}). This way the scanner can match [IF][[:space:]]*!?{[0-9]+} as a token, and this wont conflict with an identifier. As braces are also used to specify acceptance sets in automata, there is some homogeneity.

I like the last two solutions better.

For t and f I would either:

leave the grammar as is, maybe with a warning that t and f are also valid identifiers (dealing with this doesn't really seem very difficult)
declare that t and f are BOOLEAN constants, cannot be used as identifiers, but can appear in header lines as in HEADERNAME (INT|STRING|IDENTIFIER|BOOLEAN)*.
state that t and f can only be used in guard as (f) or (t), and have the scanner recognize those tokens instead. This looks clumsy if we consider aliases as well. (I don't like this.)

from hoaf.

kleinj commented on July 17, 2024

For t and f, my preference would be solution 2 (BOOLEAN).

For the acceptance condition, I have a preference for 2.ii (+0, -!2), even though it doesn't look that pretty. I'm not a fan of multiple tokenization (such as F{!0}with separate tokenization afterwards), as this restricts the places where comments or whitespaces can be placed. But I wouldn't object to needing a context-sensitive lexer, if there is a preference for allowing F0 and I0 for aesthetic reasons.

There is another question that occured to me just now: What should be the interpretation of
acc-name: Ra--ABORT--?
Is this just an identifier? I would be fine with that, as one can prepend a blank before --ABORT-- to abort an arbitry stream without risking to be inside an identifier. Another option would be to switch to ==BODY==, ==END== and ==ABORT==, then there is no ambiguity with identifiers. Either way, I don't think there is a way to insert a --ABORT-- into a non-parsed stream at an arbitrary location, as the stream could be in the middle of a comment or "quoted string". But that's probably not a big deal.

from hoaf.

adl commented on July 17, 2024

Branch issue/32 contains two patches marking t/f as BOOLEAN and stating that --ABORT-- should be a separate token.

So we are left with the F and I stuff. I agree + and - aren't pretty. We should probably also aim for a solution that allow future extensions. For instance what if in the future we want to add support for "occurrence acceptance" where a transition or state has to be visited at least once?

How about this: use F and I as functions with parentheses F(0) & (F (1) | F(!2))& I(2)). The grammar would use identifiers instead of F or I:

acceptance-cond ::= IDENTIFIER "(" "!"? INT ")"
                  | (acceptance-cond)
                  | acceptance-cond & acceptance-cond
                  | acceptance-cond | acceptance-cond

Our document would only specify the meaning of the identifiers F and I.

from hoaf.

kleinj commented on July 17, 2024

I like the F(0) & (F (1) | F(!2))& I(2)) syntax, that's a good, extensible solution.

In regard to IDENTIFIER: ([a-eg-su-zA-Z_][0-9a-zA-Z_-]*|[tf][0-9a-zA-Z_-]+), I would prefer keeping the old regexp and just stating that t and f are not identifiers. That's usually easy to ensure by the order of the lexical rules for the lexer.

from hoaf.

adl commented on July 17, 2024

OK, I've reworked these patches not to change the IDENTIFIER regexp, and also changed all F(x) and I(x). Can you check those changes and see if you agree?

from hoaf.

kleinj commented on July 17, 2024

Great, I fixed a small typo, otherwise this looks find. I would then as a next step extract the example automata in an examples subdirectory, so we can use them for testing parser implementations. Ok?

from hoaf.

adl commented on July 17, 2024

Thanks for the proof reading. I've put those patches on master and deleted the branch.

from hoaf.

Lexical analysis / tokenizer about hoaf HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent