Coder Social home page Coder Social logo

Lexical analysis / tokenizer about hoaf HOT 7 CLOSED

kleinj avatar kleinj commented on July 17, 2024
Lexical analysis / tokenizer

from hoaf.

Comments (7)

adl avatar adl commented on July 17, 2024

What is your preference?

I was planning to use flex's start conditions to handle this, but I can understand why context-sensitive stuff could be inconvenient.

I would like to reorganize and augment your list of solutions with some other ideas:

  1. solutions that work without changing the current grammar (I see all of those choices as implementation details left to the choice of the developer, although I agree parsing would be easer if it did not require advanced features or ugly hacks.)
    1. context-sensitive scanner (e.g. flex's start conditions)
    2. scan everything between Acceptance: and the next HEADER: or --BODY-- as a single token, then call a separate parser. This is tricky and makes error reporting a pain.
    3. scanning [IF] as well as [iF][0-9]+ as some specific token types for use in the Acceptance: grammar rule, and adjusting the implementation of the other rules to allow these token in addition to identifiers and decoding [IF][0-9]+ only when appropriate in the parser. (I'm not calling those token keywords, because this solution is just one way to implement the given grammar.)
  2. solutions that change the grammar
    1. require space between [FI] and [0-9]+, making F and I keywords, and fixing the specification to also allow F and I everywhere IDENTIFIER is used. This makes the implementation details discussed in the previous solution explicit in our document, somehow forcing the implementation. The only win is that the number can decoded in the scanner.
    2. replace F and I by characters that cannot be mistaken as identifiers (what about + and - ?)
    3. Force acceptance set numbers to be enclosed in braces, as in Acceptance: 3 F{0} & (F {1} | F!{2})& I{2}). This way the scanner can match [IF][[:space:]]*!?{[0-9]+} as a token, and this wont conflict with an identifier. As braces are also used to specify acceptance sets in automata, there is some homogeneity.

I like the last two solutions better.

For t and f I would either:

  1. leave the grammar as is, maybe with a warning that t and f are also valid identifiers (dealing with this doesn't really seem very difficult)
  2. declare that t and f are BOOLEAN constants, cannot be used as identifiers, but can appear in header lines as in HEADERNAME (INT|STRING|IDENTIFIER|BOOLEAN)*.
  3. state that t and f can only be used in guard as (f) or (t), and have the scanner recognize those tokens instead. This looks clumsy if we consider aliases as well. (I don't like this.)

from hoaf.

kleinj avatar kleinj commented on July 17, 2024

For t and f, my preference would be solution 2 (BOOLEAN).

For the acceptance condition, I have a preference for 2.ii (+0, -!2), even though it doesn't look that pretty. I'm not a fan of multiple tokenization (such as F{!0}with separate tokenization afterwards), as this restricts the places where comments or whitespaces can be placed. But I wouldn't object to needing a context-sensitive lexer, if there is a preference for allowing F0 and I0 for aesthetic reasons.

There is another question that occured to me just now: What should be the interpretation of
acc-name: Ra--ABORT--?
Is this just an identifier? I would be fine with that, as one can prepend a blank before --ABORT-- to abort an arbitry stream without risking to be inside an identifier. Another option would be to switch to ==BODY==, ==END== and ==ABORT==, then there is no ambiguity with identifiers. Either way, I don't think there is a way to insert a --ABORT-- into a non-parsed stream at an arbitrary location, as the stream could be in the middle of a comment or "quoted string". But that's probably not a big deal.

from hoaf.

adl avatar adl commented on July 17, 2024

Branch issue/32 contains two patches marking t/f as BOOLEAN and stating that --ABORT-- should be a separate token.

So we are left with the F and I stuff. I agree + and - aren't pretty. We should probably also aim for a solution that allow future extensions. For instance what if in the future we want to add support for "occurrence acceptance" where a transition or state has to be visited at least once?

How about this: use F and I as functions with parentheses F(0) & (F (1) | F(!2))& I(2)). The grammar would use identifiers instead of F or I:

acceptance-cond ::= IDENTIFIER "(" "!"? INT ")"
                  | (acceptance-cond)
                  | acceptance-cond & acceptance-cond
                  | acceptance-cond | acceptance-cond

Our document would only specify the meaning of the identifiers F and I.

from hoaf.

kleinj avatar kleinj commented on July 17, 2024

I like the F(0) & (F (1) | F(!2))& I(2)) syntax, that's a good, extensible solution.

In regard to IDENTIFIER: ([a-eg-su-zA-Z_][0-9a-zA-Z_-]*|[tf][0-9a-zA-Z_-]+), I would prefer keeping the old regexp and just stating that t and f are not identifiers. That's usually easy to ensure by the order of the lexical rules for the lexer.

from hoaf.

adl avatar adl commented on July 17, 2024

OK, I've reworked these patches not to change the IDENTIFIER regexp, and also changed all F(x) and I(x). Can you check those changes and see if you agree?

from hoaf.

kleinj avatar kleinj commented on July 17, 2024

Great, I fixed a small typo, otherwise this looks find. I would then as a next step extract the example automata in an examples subdirectory, so we can use them for testing parser implementations. Ok?

from hoaf.

adl avatar adl commented on July 17, 2024

Thanks for the proof reading. I've put those patches on master and deleted the branch.

from hoaf.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.