The tinypg from ksp-kos

Change the boilerplate comment to make it clearer not to edit Parser.cs, ParseTree.cs, Scanner.cs

TinyPG spits out Parser.cs, ParseTree.cs, and Scanner.cs with this stock comment on top:

// Generated by TinyPG v1.3 available at www.codeproject.com

We should make our version of TinyPG spit out a more verbose, clearer comment that tells people looking at our project:
1 - DO NOT EDIT THIS FILE - IT IS AUTOGENERATED BY A PROGRAM CALLED TINYPG.
2 - Where to get TinyPG (our version of it, in our GitHub home).
3 - How to run TinyPG to regenerate these files.
4 - That the real change is to edit the kRISC.tpg file.

This is because we've repeatedly gotten PRs from people trying to change the parser by editing these files directly. We could make it clearer what's happening.
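As a rough illustration of the four points listed above, here is a minimal Python sketch that assembles such a header as C# line comments. The function name, the wording of each line, and the two angle-bracket placeholders are all assumptions made for illustration; the actual fork location and regeneration steps are not stated here and would need to be filled in by the maintainers.

```python
# Hypothetical sketch: builds the C# header comment that a patched TinyPG
# could write at the top of Parser.cs, ParseTree.cs, and Scanner.cs.
# The wording and the placeholders are assumptions, not TinyPG's real output.

def build_header(grammar_file, tool_location, regenerate_steps):
    """Return the proposed header as a block of C# line comments."""
    lines = [
        "DO NOT EDIT THIS FILE. IT IS AUTOGENERATED BY A PROGRAM CALLED TINYPG.",
        "Get TinyPG (our version of it) from: " + tool_location,
        "To regenerate this file: " + regenerate_steps,
        "To change the parser, edit " + grammar_file + " instead.",
    ]
    return "\n".join("// " + line for line in lines) + "\n"


header = build_header(
    grammar_file="kRISC.tpg",
    tool_location="<location of our TinyPG fork>",        # placeholder
    regenerate_steps="<how to run TinyPG on the .tpg>",   # placeholder
)
print(header)
```

Keeping the text in one function means all three generated files would carry the same warning.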

We could also perhaps change the folder tree to put them under a folder called "Autogenerated", but that's more for the kOS project than the TinyPG project. I mention it here for reference.

[performance] Regex matches inefficiently find irrelevant hits that get culled out right away.

This issue in the kOS project, KSP-KOS/KOS#2135, seems to imply that TinyPG itself can be edited to improve its regex performance in the scanner.

Example Text:

set   ident  to 1234 * sqrt(5432.1).[EOF]
             ^
             |
             |
    Imagine the Scanner's startpos is currently here
    because the scanner has already tokenized this much
    so far:
        set[whitespace skipped]ident[whitespace skipped]

That means the substring of the input file the scanner hasn't consumed yet is this:

to 1234 * sqrt(5432.1).[EOF]
^

And the zeroth position of that substring is where the caret is.

The Scanner currently does this in a for loop, inside LookAhead():

For each scantoken rule (regex pattern) defined in the grammar file:
- Try to find a match within the remaining substring (to 1234 * sqrt(5432.1).[EOF] in the above example).
- If a match is found AND that match started at index 0 and it is longer than the longest match so far:
  - Then this becomes the new match so far.
If no matches were found in the above loop, issue an error message - unexpected character.
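
The loop described above can be sketched roughly as follows. This is a Python approximation, not the generated C# code: `re.search` stands in for .NET's `Regex.Match()`, and the rule names and patterns are hypothetical stand-ins for whatever the grammar file really defines.

```python
import re

# Hypothetical rule set; the real patterns live in the grammar file.
RULES = [
    ("TO", re.compile(r"to\b")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("MULTIPLY", re.compile(r"\*")),
    ("IDENTIFIER", re.compile(r"[a-z_][a-z0-9_]*")),
]


def look_ahead_unanchored(remaining):
    """Return (rule_name, text) for the longest rule that matches at index 0."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = pattern.search(remaining)  # may find a hit anywhere in the text
        # Hits that start later than index 0 are found, then ignored here.
        if m and m.start() == 0 and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    if best_name is None:
        raise SyntaxError("unexpected character: " + repr(remaining[:1]))
    return best_name, remaining[:best_len]


print(look_ahead_unanchored("to 1234 * sqrt(5432.1)."))
```

The `m.start() == 0` test is the only thing that rejects hits further along the input, which is why the regex engine still does the work of finding them.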

But notice the "started at index 0" condition above. Only matches that start at index 0 count, but the way it implements this is to find matches at higher indices and then immediately throw them away. This is very inefficient, as discovered by @tsholmes. For example, if the scanner were looking at the above example, the rule to match INTEGER would find a hit at index 3 on 1234, but since that's not at index 0, it would be thrown out. The rule to match MULTIPLY would find a hit on the substring * at index 8, and the rule to match IDENTIFIER would find a hit on the substring sqrt at index 10; neither is at index 0, so both get thrown out. And so on. The only match that doesn't get thrown away is the one for the keyword TO, which is kept because it is at index zero.
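
To make the discarded work visible, here is a small Python sketch that records where each rule's first hit lands in the example text. The four patterns are hypothetical stand-ins, and `re.search` again plays the role of an unanchored `Regex.Match()`.

```python
import re

# Hypothetical patterns standing in for the grammar's scantoken rules.
RULES = {
    "TO": r"to\b",
    "INTEGER": r"[0-9]+",
    "MULTIPLY": r"\*",
    "BRACKETOPEN": r"\(",
}


def first_hits(remaining):
    """Map each rule name to (index, text) of its first hit, or None."""
    hits = {}
    for name, pattern in RULES.items():
        m = re.search(pattern, remaining)
        hits[name] = (m.start(), m.group()) if m else None
    return hits


hits = first_hits("to 1234 * sqrt(5432.1).")
# Everything that did not start at index 0 is work the scanner throws away.
discarded = {name: hit for name, hit in hits.items() if hit and hit[0] != 0}
print(discarded)
```

With these patterns, only the TO rule produces a hit the scanner would keep; every entry in `discarded` cost a search through part of the input for nothing.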

In a large file, that adds up to a lot of matching work that is discarded immediately.

By inserting an implied caret ("^") into the regex before running Regex.Match(), the Match routine itself can be told not to bother with any matches that don't start at index zero. Then instead of getting the match and immediately throwing it away, it just won't find the match in the first place.
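
A minimal Python sketch of the implied-caret idea follows, under the same caveats: the rule names and patterns are hypothetical, and Python's `re` is only an analogue of the .NET regex engine. One detail worth checking in a real change is alternation: prefixing a bare "^" to a pattern like `a|b` would anchor only the first alternative, so this sketch wraps the original pattern in a non-capturing group.

```python
import re


def anchor(pattern):
    """Prefix an implied caret; the group keeps alternations like a|b intact."""
    return re.compile("^(?:" + pattern + ")")


# Hypothetical patterns standing in for the grammar's scantoken rules.
RULES = [
    ("TO", anchor(r"to\b")),
    ("INTEGER", anchor(r"[0-9]+")),
    ("MULTIPLY", anchor(r"\*")),
    ("IDENTIFIER", anchor(r"[a-z_][a-z0-9_]*")),
]


def look_ahead_anchored(remaining):
    """Return (rule_name, text) for the longest rule that matches at index 0."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = pattern.search(remaining)  # can only succeed at index 0
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    if best_name is None:
        raise SyntaxError("unexpected character: " + repr(remaining[:1]))
    return best_name, remaining[:best_len]


print(look_ahead_anchored("to 1234 * sqrt(5432.1)."))
```

Because the anchor lives in the pattern, the explicit "did it start at index 0" check is no longer needed in the loop.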
