Coder Social home page Coder Social logo

saladdais / tailslide Goto Github PK

View Code? Open in Web Editor NEW
4.0 1.0 1.0 2.17 MB

Embeddable parser, AST, tree walker and compiler for the Linden Scripting Language based on LSLint

License: MIT License

CMake 0.59% Python 0.07% C++ 34.14% Lex 0.38% Yacc 1.93% LSL 62.67% Shell 0.23%
secondlife second-life lsl ast

tailslide's Introduction

Intro

Build Status codecov

Tailslide provides an embeddable parser, AST representation, and tree walker library for Second Life's Linden Scripting Language. These can be used as the foundation for an LSL compiler or developing a superset of LSL.

A reference byte-perfect LSO compiler and semantically equivalent CIL compiler are provided. Semantic conformity with the output of LL's compilers is proven through extensive testcases and a fuzzer.

Also provided is a CLI utility to quickly lint or optimize LSL scripts, as well as visualize the AST of the supplied script.

If you're mainly interested in a command-line linting or optimization utility, please see lslint or LSL-PyOptimizer, respectively.

Credits

Tailslide is heavily based on the lslint LSL linter written by pclewis, see the NOTICE.txt file for the original README and credits.

Build

Linux & OSX

cmake, flex and bison must be installed through your system's package manager. On OS X you must install the Homebrew versions of flex and bison because the versions provided with XCode are extremely old.

git clone https://github.com/SaladDais/tailslide.git
cd tailslide
mkdir build
cd build
cmake ..
make

Windows

Not well-supported, but possible:

inside the cloned repo:

mkdir build
cd build
cmake -G"Visual Studio 17 2022" ..
cmake --build .

Tech Overview

Parsing

The parser behaves more or less like the one on SL's servers, quirks and all.

For example:

default { $$$$
    '' state_entry() {
        llOwnerSay(L"Hello
world!");
    }"
}/*{

is completely handled:

$ tailslide --show-tree tests/scripts/parser_abuse.lsl
default
{
    state_entry()
    {
        llOwnerSay("\"Hello\nworld!");
    }
}

TOTAL:: Errors: 0  Warnings: 0
script [none] (cv=) (1,1)
  ast node list [none] (cv=) (1,1)
  ast node list [none] (cv=) (1,1)
    state [none] (cv=) (1,1)
      identifier "default" [none] (cv=) (1,1)
      event handler [none] (cv=) (2,8)
        identifier "state_entry" [none] (cv=) (2,8)
        null [none] (cv=) (2,8)
        compound statement [none] (cv=) (2,22)
          statement [none] (cv=) (3,9)
            function call [none] (cv=) (3,9)
              identifier "llOwnerSay" [none] (cv=) (3,9)
              constant expression [string] (cv=string constant: "\"Hello\nworld!") (3,20)
                string constant: "\"Hello\nworld!" [string] (cv=string constant: "\"Hello\nworld!") (3,20)

Special attention has been paid to weird, undocumented corner cases.

Optimizations

Constant Folding

The constant folding implementation is currently very simple. Tailslide makes multiple walks down the tree checking if an expression is constant, and replaces it with its value.

integer foo = 2;
foo = foo + 2 + 4 + 3 + 5;

won't currently be simplified at all due to how expressions are represented and will require term rewriting to move non-constant parts of expressions as far to the right (left) as possible.

There are still some cases where constants may be folded but currently aren't, and some differences in how floating point operations are folded.

Unused Variable / Function Pruning

Any variable or function that is unused after constant folding may be pruned to save bytecode space.

Symbol Mangling

Names of user defined functions, their parameters and globals contribute to bytecode size. To reduce bytecode size, Tailslide can mangle user-defined symbols into more compact ones like _a, _b, etc. This may also help with disambiguation when converting the AST to another language with different scoping / variable shadowing rules.

This isn't the most efficient naming scheme if targeting SL in particular, since it's better to take advantage of strings already in the constant pool.

License

MIT, scripts used for test data (barring those added by me and those in bugs/) are property of their respective owners

Serialization code based on Kai Mast's BitStream library is included in BitStream.hh, and is licensed under the BSD 3-clause license.

tailslide's People

Contributors

awozniak avatar makopo avatar pclewis avatar saladdais avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Forkers

beyonddiana

tailslide's Issues

Make typedness of print() expression match LL's compiler

The print() expression is a very weird and very broken case in LL's compiler. It's sort of meant to take on the type of its argument, but it doesn't fully. print("foo") + "foo" and string foo() {return print("bar");} are valid according to LL's type checker, llOwnerSay(print("foo")) is not.

Maybe it's just weird in function expressions?

Marked as enhancement because nobody cares about print() (doesn't seem like it has ever compiled correctly under Mono.)

Extract operation() / cast() methods out of constant classes

Value propagation behavior should be pluggable since the results of various operations can vary depending on what runtime is being used. For example, any operation that results in inf, -inf, or nan will crash an LSO script, so they may not be folded. Extracting the logic into an OperationSemantics class that could be passed into the ConstantDeterminingVisitor would be a lot cleaner, and allow things like doing the folding within an actual VM where appropriate.

LL's compiler allows duplicating identifiers over 255 bytes

Found by the lscript<->tailslide conformance fuzzer.

This compiles with LL's compiler but doesn't in tailslide:

default {state_entry() {
string asdadsadafwqafqwfafwadwawadasasdsadsadasdsadsadsadasdasdasdsadsadadwqdqd2qdq2fqfqfqdqdqasssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa = "foobar";
string asdadsadafwqafqwfafwadwawadasasdsadsadasdsadsadsadasdasdasdsadsadadwqdqd2qdq2fqfqfqdqdqasssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa = "foo";
string asdadsadafwqafqwfafwadwawadasasdsadsadasdsadsadsadasdasdasdsadsadadwqdqd2qdq2fqfqfqdqdqasssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa = "foo";
llOwnerSay((string)llGetFreeMemory());
}}

Seems to be due to LL relying on string literals being interned by the LLStringTable::addStringEntry, but it it treats every string over 256 chars as a distinct string due to

if (!strncmp(ret_val, str, MAX_STRINGS_LENGTH))
{
    entry->incCount();
    return entry;
}
// adds string to table

You can have colliding globals in mono as well, but it fails to start up due to the name collisions.

Since the string interning fails, all later references to those identifiers are a compile error under LL's compiler. Need to either error on identifiers over 255 bytes or allow them to be declared and shadows but don't allow later references.

Rewrite Logger API

Having multiple points where assertions might need to be checked has made the Logger API more painful than it needs to be. It could do with being rethought entirely.

Lexer's tailslide_tokentype enum leaks into the public API

Arbitrary constants defined by the lexer like MUL_ASSIGN are effectively part of the public API, even though they shouldn't be and may change in value across versions.

  • Create explicit enums for operation types and use those instead.
  • Try and namespace tailslide_tokentype to prevent likely collisions on constants like IF.

Allocator doesn't guard against quadratic blowup in constant folding

Since we do constant expression evaluation at compile-time, we need to be able to guard against quadratic blowup in evaluation of scripts like

list foo = ["foo", "foo", "foo", "foo", "foo", "foo"];
list foo2 = foo + foo + foo + foo + foo + foo + foo;
list foo3 = foo2 + foo2 + foo2 + foo2 + foo2 + foo2 + foo2;
// ...
integer only_use = foo9 != [];

Since most dynamic allocs pass through the ScriptAllocator we should have it add the size of each alloc to a counter, then throw (std::bad_alloc?) when it passes some reasonable upper limit. It might also make sense to have a slightly lower "soft" limit after which we refuse to do value propagation / folding, so we can keep certain cases as runtime crashes vs hard compile-time crashes.

\xFF should be treated as a line comment terminator

Found by the lscript<->tailslide conformance fuzzer

The following compiles in LL's compiler, but not tailslide:

default{state_entry(){if(1)//\xFF;
else{;}llOwnerSay("why.");}}

There is a line comment // followed by a literal \xFF byte followed by a ;.

Appears to be due to a probable misunderstanding in LL's comment parser:

void line_comment()
{
	char c;

	while ((c = yyinput()) != '\n' && c != 0 && c != EOF)
		;
}

where EOF == -1 == \xFF. EOF is actually a valid byte that flex may return, and you only need to check for \0. We should match that behavior anyways.

Write stubs to allow executing CIL-compiled scripts as EXEs

We can test LSO scripts with the lscript harness, but we don't have a way to run CIL scripts at the moment. Should be able to crib a handful of classes from libremetaverse and elsewhere, enough so we can run the LSL language conformance test (it only requires a handful of library functions).

Examine possibility of merging passes

Right now the separation of passes is very close to how lslint and LL's compiler worked, but walking the AST isn't free. Right now the most cumulatively expensive part of the code is the tree walking logic, merging things into fewer passes == fewer tree walks == faster.

At least symbol resolution and type checking could be merged into one visitor (maybe with a flag field to determine whether to actually check types / resolve symbols.)

List-in-list should be allowed by the compiler

List-in-list is a runtime error rather than a compile-time error in LSL. As such, List-in-list in a local should be a compiler warning rather than an error.

The following is a perfectly valid LSL script:

default {
  state_entry() {
    return;
    list l = [[]];
  }
}

A list-in-list in a global, however, is a compile-time error.

Write libFuzzer fuzz harnesses

Can use Luau's fuzzing harnesses as a guideline here. Stuff in tests/scripts/ should be fine as a seed corpus but we'll have to be smart about excluding some of the overly-complex scripts.

Improve CIL codegen

  • Keep track of actual maxsize for method eval stacks, probably need a structured way to emit opcodes so we know what's on the stack at any given time.
  • Look at tonybaloney's Pyjion fork's CIL emission API for optimization opportunities (ldc.i4.2... and friends would be a good first one, as well as the other "short" versions of opcodes)
  • Mark expressions where we don't care about the result so we can eliminate worthless push / pop pairs (function calls or assigns at top level of an expressionstatement, for initializers, that sort of thing.)
  • Allow re-using dead local slots in functions with only structured control flow
  • Use char, short, int32 or f32 literals instead of double literals where there will be no precision loss and do conv.r8.

Support for Firestorm Preprocessor

I would love to update my extension for VSCode to support your plugin! LSLint For VSCode

I did some testing last night and it seems to output the same as lslint command line, which is awesome! I would just need to add an option to use tailslide vs lslint.

All of that is just a side note of why I am posting this... I use a lot of the features from Firestorm's Preprocessor, such as define and include [EXAMPLES], but it is not currently supported by your linting and spits out an error:

❯ tailslide --lint example.lsl
ERROR:: (  1,  9): [E10020] syntax error, unexpected IDENTIFIER, expecting '('
TOTAL:: Errors: 1  Warnings: 0

which is just from trying to use a #define SOMETHING 123 in the script.

It is "tolerated" so to speak by lslint since it has a -i option to ignore/skip those directives, which would be helpful! [PR from lslint where it was implemented]

It would be even better if you were able to handle it, or at the least in this category handle #include so it could trace through the other included files for that script.

I know it is a big ask, but your project is close enough to lslint in its output (which means I can use it for the VSCode GUI) and the only one that seems to be in active development. If this is something you would consider supporting, it would help a LOT of people since handling includes is the biggest downfall from lslint that people ask me about for the extension.

Thanks!

Also feel free to reach out to me in SL or email (info on my account page).

Make exported headers C++14 (or at least C++17)-friendly across compilers

Right now C++20 has to be used under MSVC because of some struct initialization patterns that aren't all that important. Ideally you should be able to build the artifacts as C++17 and include them in a C++14 build, or at least a C++17 build. Would require breaking up bitstream.hh at least.

Use class members instead of linked list of children for fixed-width nodes

Having to refer to nodes by ordinal index and then casting is not so nice compared to just having a properly typed and named field we can reference. The linked list structure doesn't really gain us much so long as we're still able to visit any given node's children.

  • Change class definitions and constructors to place children in named members of correct type if node isn't logically list-like
  • Nodes should either use _mChildren or class members, not both
  • Give each fixed-width node type its own version of visitChildren()
  • Change LSLASTNode::replaceNode() to walk the children of the parent and get the address of the field containing the node to replace, only using the current node replacement strategy if the parent node has "list-like" children.

Add support for safe key constant value propagation / folding that preserves "key-ness"

Since there's no such thing as a key literal, you have to be careful about value propagation / constant folding for keys

list foo = ["<some_uuid>"];

is not the same thing as

key some_uuid = "<some_uuid>";
list foo = [some_uuid];

The former places a string value in a list whereas the latter coerces the string value into a key first, then places the key value in the list. Incautious folding can break uses of llListFindList() and llGetListEntryType() since both care about the type of the list element.

Folding in a key as a string constant is fine in places where the string expression will be automatically coerced to a key (rvalue of a key var assignment, function arg that requires a key) but not ok in a list expression. In other cases you can get away with a runtime cast to key, but obviously that's not possible in global initializers, the latter form is the only way of getting a real key into a global list initializer.

This only really matters if you're transpiling back to LSL, so if this is added it might make sense to have a separate transpilation-safe constant folding mode.

Improve AST / linked list performance

This is probably an issue in lslint as well

There's a common pattern of doing pushChild() which internally calls addNextSibling() on the head of the list. That walks the list until it finds the tail and tacks on the provided node. This leads to terrible perf for scripts that have long lists of children like stack_heap_collide.lsl. It's especially pronounced when walking a list while mutating it.

Might make sense to keep track of the tail of the child list or swap to a vector.

Separate notions of expression constness and side-effectfulness

Right now a node returning a null ->get_constant_value() can mean that either a constant value for the node can't be determined, or that something in the expression might have side-effects.

Basically, we should be able to tell that a if ((non_const = 3) + 1) branch will always be taken because the expression will always result in 4, but that the expression can't be folded due to the side-effects of the assignment.

We should also be able to prune a declaration like string foo = non_const + "bar"; because even though we can't determine a constant value for the expression, we know that it doesn't have any side-effects.

Clean up unnecessary use of (thread-local) globals

Right now most dynamic allocations rely on the gAllocationManager thread-local global having been set, and all allocs are only cleaned up when the ScriptAllocationManager's destructor is called. This was an okay stop-gap to deal with the fact that lslint (as a CLI util) never really needed to manually release resources, but I think we can do better. Similar story with the thread-local Logger singleton that visitors use to report errors.

Luau's AST library uses a similar allocation strategy, but it manually passes down the allocator rather than globals, and I believe a pointer to it is stored on (or at least reachable somehow from) the AST nodes. Might make sense to have a ParserContext struct or something like that that gets passed to all AST nodes so they can always reference their own allocator without a global.

With the current abuse of globals, the only way of having two distinct ASTs live at once is to parse and process them in separate threads.

Missing constant value inference rules for declarations with no rvalue

Need some sensible way of giving these symbols constant values. The semantics around lvalue references to uninitialized heap types within globals are kind of weird under LSO (they tend to cause bounds check failures when used.) Uninitialized vars also aren't allowed in SALists.

Erroring outright on weird globals like that would be wrong since the script will work fine so long as the value isn't read before being written within a function. For ex in LSO:

list baz;
list quux = baz;
default {
    state_entry() {
        llOwnerSay((string)quux); // bad. `quux` is uninitialized, runtime bounds error
    }
}

will break but

list baz;
list quux = baz;
default {
    state_entry() {
        quux = [1];
        llOwnerSay((string)quux); // fine. `quux` was initialized before it was read.
    }
}

is totally fine.

Might make sense to have a "default" constant value for each type that's tagged as such so it "poisons" all references to it, in case the distinction is important to a consumer.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.