
som-st / pysom


This project forked from smarr/pysom


PySOM - The Simple Object Machine Smalltalk implemented in Python

Home Page: http://som-st.github.io/

License: MIT License

Languages: Python 99.68%, Makefile 0.25%, Shell 0.08%

pysom's People

Contributors

3tty0n avatar cfbolz avatar john5f35 avatar krono avatar octavelarose avatar smarr avatar wks avatar


pysom's Issues

Frames: Design of Method Activation State

Current Situation

Currently, the AST and the bytecode interpreter use two different designs for frames. [AST, BC]

Both have in common that they use a Frame class, which stores various bits of state. The AST version splits the state into a number of different parts: a receiver, an arguments array, an array for arguments accessed by inner scopes, an array for local variables, an array for local variables accessed by inner scopes, and an on-stack marker. Allocating this frame is rather slow: there is a lot of work to be done for a single method activation.

The design of the bytecode frame is different and still uses the classic SOM stack-based encoding (combining arguments, local variables, and a stack). It also holds on to a few different bits of state, though, which requires the use of the Frame object. The one positive aspect here is that the construction of a frame is much more straightforward. The major drawback is that the virtualization of a frame doesn't work as well, because the whole frame object is passed to inner scopes.

New Design

In the new design, the state for a method activation is constructed from one or two plain lists/arrays (i.e. fixed-sized lists) that represent the Frame and Inner, avoiding the overhead of constructing a whole set of different lists, and objects.

The Frame holds the receiver, any additional arguments to the method, and local variables, except for arguments and local variables that are accessed by inner lexical scopes.

The Inner holds a copy of the receiver, as well as the arguments and local variables accessed by the inner lexical scopes.

The goal of this design is that the frame is most likely going to be "virtualized" by the compiler. This means that in the native code resulting from just-in-time compilation, the frame is not actually allocated. The Inner, however, may need to be allocated when it escapes the compilation unit, which is often the case for nested loops in which an outer scope, perhaps a counter variable, is accessed. The first slot of the Inner is also used to store whether the whole frame/activation record is still on the stack.

While the Inner might not be allocated for every method, for simplicity, its slot in the Frame is always reserved. This also neatly aligns the indexes for receiver access in Frame and Inner.

Frame

+-----------+
| Inner     | (Optional: for args/vars accessed from inner scopes, and non-local returns)
+-----------+
| Receiver  |
+-----------+
| Arg 1     |
| ...       |
| Arg n     |
+-----------+
| Local 1   |
| ...       |
| Local n   |
+-----------+

Inner

+-----------------+
| OnStack         |
+-----------------+
| Receiver        |
+-----------------+
| ArgForInner 1   |
| ...             |
| ArgForInner n   |
+-----------------+
| LocalForInner 1 |
| ...             |
| LocalForInner n |
+-----------------+
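The layout above can be sketched in plain Python. This is a minimal illustration only, assuming ordinary Python lists in place of RPython fixed-size lists; the function and index names are made up for this sketch and are not taken from the PySOM sources.

```python
# Fixed slot indexes shared by Frame and Inner: slot 0 of the Frame holds
# the (optional) Inner, slot 0 of the Inner holds the on-stack marker, and
# slot 1 of both holds the receiver, so receiver access is uniform.
FRAME_INNER_IDX = 0
INNER_ONSTACK_IDX = 0
RECEIVER_IDX = 1

def make_frame(receiver, args, n_locals, n_inner_args, n_inner_locals, needs_inner):
    # Frame: [Inner?, receiver, args..., locals...]
    frame = [None, receiver] + list(args) + [None] * n_locals
    if needs_inner:
        # Inner: [on-stack marker, receiver copy, inner args..., inner locals...]
        inner = [True, receiver] + [None] * (n_inner_args + n_inner_locals)
        frame[FRAME_INNER_IDX] = inner
    return frame

frame = make_frame("rcvr", ["arg1"], n_locals=2,
                   n_inner_args=1, n_inner_locals=1, needs_inner=True)
inner = frame[FRAME_INNER_IDX]
assert frame[RECEIVER_IDX] == inner[RECEIVER_IDX] == "rcvr"
assert inner[INNER_ONSTACK_IDX] is True
```

Note how reserving slot 0 of the Frame for the Inner, and slot 0 of the Inner for the on-stack marker, makes the receiver index identical in both.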

Running hello world fails

jnanjeky@simplex:~/cascon1/PySOM$ ./som.sh -cp Smalltalk TestSuite/TestHarness.som
Object class could not be loaded. It is likely that the class path has not been initialized properly. Please make sure that the '-cp' parameter is given on the command-line.
jnanjeky@simplex:~/cascon1/PySOM$ 

Am I missing something?

Design Considerations for Loop Inlining

With #32, we have support for inlining #if** messages on the bytecode level.

The next step is to support inlining #while*, and later #to:do:* messages.

The problem arises from the fact that, during inlining, some bytecodes may change their length.

With the addition of #25, we have special bytecodes that are shorter than their general version.
During inlining as introduced with #32, the bytecode generator may choose shorter or longer versions compared to what was in the method that is currently being inlined, i.e., the source method.

This means that jump offsets may need to be adjusted (increased/decreased) accordingly.

#32 introduces a sorted queue, which keeps track of the forward jumps and patches the offsets when the target bytecode is inlined. Thus, we simply check the offset in the source method, and use the current length of the target method as the new offset.
Unfortunately, implementing support for backward branches under these conditions seems more involved.
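The forward-jump patching can be sketched roughly as follows. This is an illustrative toy, not PySOM's actual code: bytecodes are modeled as (opcode, operand) pairs, jump operands are absolute offsets, and the function name is made up.

```python
def inline(source, target):
    # Pending forward jumps, as (source_target_offset, index_of_jump_in_target),
    # kept sorted by the offset they target in the source method.
    pending = []
    for src_idx, (op, operand) in enumerate(source):
        # Patch every jump whose source-method target we have now reached:
        # its new offset is simply the current length of the target method.
        while pending and pending[0][0] <= src_idx:
            _, jump_idx = pending.pop(0)
            jump_op, _ = target[jump_idx]
            target[jump_idx] = (jump_op, len(target))
        if op == "JUMP":
            pending.append((operand, len(target)))
            pending.sort()
            target.append((op, -1))  # placeholder offset, patched above
        else:
            target.append((op, operand))
    return target
```

For example, inlining a method whose jump targets source offset 2 into a target that already holds one bytecode shifts the patched offset to 3, since the target method is one bytecode longer at that point.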

I can imagine the following approaches:

  1. Know where loops start (this isn't easy without encoding it in a bytecode or a side data structure). If we know where loops start, we can simply track the number of bytecodes produced since then: we remember the number of bytecodes in the target method at the start of the loop, and then compute the distance in the target method when emitting the back jump. Though, this requires knowing already at the start of the loop that there will be a backward jump. It also becomes more involved when tracking nested loops: the information about a loop start needs to carry the detail about the loop end, so that the two can be matched up.

  2. We could just avoid changing the length of bytecodes: we can still use bytecodes specialized for specific constants, but instead of getting a run-time memory saving, we use a uniform length, so that jump distances do not differ between source and target method.

  3. We could retrace steps once we find a backward branch: we do a second pass over the source method from the start of the loop (we know the offset, and therefore the loop start, in the source method at that point) to determine how many bytes were produced. The drawback here is that we need another pass over the bytecodes.
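Approach 3 can be sketched as follows. Everything here is a hypothetical illustration: the length rule for specialized bytecodes is invented for the example, and the function names are not PySOM's.

```python
def target_len(op, operand):
    # Hypothetical length rule: assume 1-byte specialized forms exist for
    # pushing the constants 0, 1, and 2; everything else takes 2 bytes.
    if op == "PUSH_CONST" and operand in (0, 1, 2):
        return 1
    return 2

def back_jump_distance(source, loop_start, jump_idx):
    # When the backward jump at source index jump_idx is reached, re-walk
    # the source method from the loop start and sum the lengths the
    # corresponding *target* bytecodes will have.
    return sum(target_len(op, arg) for op, arg in source[loop_start:jump_idx])

src = [("PUSH_CONST", 1), ("SEND", 5), ("JUMP_BACK", 0)]
assert back_jump_distance(src, 0, 2) == 3  # 1 byte + 2 bytes
```

The cost of the second walk is proportional to the loop body, which is the drawback mentioned above.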

Interpreter Adaptation for New Frame Design

The new design for frames in #16 has a number of consequences for the interpreters.

The main issue for the interpreter is that we only know where an argument or local variable is going to be stored in the new frame once a method has been parsed completely.
If there are any blocks further down the method, we may end up storing an argument/local in the Inner, which means we need to access it differently. Though, we can't really know that before reaching that part of the method.

This means, we need to either patch up the access indexes after completing the parse, or, look them up on first execution.

AST Interpreter Design

The AST interpreter already used a set of nodes that specialize themselves at run time on first execution.
Thus, the easiest change was to adapt these nodes to the new frame design.
In the process, this simplified the set of nodes needed, since accessing the receiver, arguments, and local variables now all works the same way. For the receiver, we also know the fixed offset in the frame already at parse time, which means we can specialize it eagerly.
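The node self-specialization can be illustrated with a toy sketch. The class and field names below are assumptions for the example, not the actual PySOM node classes; the real interpreter would replace the node in its parent rather than delegate.

```python
class Var:
    """Compile-time info: index, and whether the variable ended up in the Inner."""
    def __init__(self, index, in_inner):
        self.index = index
        self.in_inner = in_inner

class UninitializedRead:
    def __init__(self, var):
        self.var = var

    def execute(self, frame):
        # On first execution we know where the variable lives, so we pick
        # the specialized node with a fixed index and delegate to it.
        if self.var.in_inner:
            node = InnerRead(self.var.index)
        else:
            node = FrameRead(self.var.index)
        return node.execute(frame)

class FrameRead:
    def __init__(self, index):
        self.index = index
    def execute(self, frame):
        return frame[self.index]

class InnerRead:
    def __init__(self, index):
        self.index = index
    def execute(self, frame):
        return frame[0][self.index]  # the Inner is stored in slot 0 of the Frame
```

Because receiver, argument, and local reads all reduce to an indexed load from either Frame or Inner, a single pair of specialized nodes covers them all.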

To Align PySOM and TruffleSOM, or not? Probably not completely.

The situation for the bytecode interpreter is a little more difficult.
There's the overarching concern of whether TruffleSOM and PySOM should be as closely aligned as possible. (Except for the frame design, the AST interpreters are rather similar already.)
My initial experiments in aligning the calling convention of PySOM to the one of TruffleSOM indicated a huge performance impact. So, moving to a Truffle-style calling convention of a single object array does not seem good from the performance perspective.

Though, on the other hand, any difference between PySOM and TruffleSOM likely means more complexity, more work implementing optimizations, because I need to adapt them to the two slightly different interpreters. For the moment, I think, I should push the RPython interpreters as hard as possible for speed, since I can't do that as easily with the Truffle ones, given the platform constraints.
I might regret it later, and change things around again, but for the moment, the added complexity will hopefully pay off in terms of performance.

Bytecode Interpreter Design

So, with the extra freedom of not aligning the implementations, the big question now is how to realize the bytecodes for the new frame design.

The two options are bytecode quickening, and a second pass over the bytecode.

Option 1: Determining Access Details By Quickening

Since we now also use Argument and Local objects in the bytecode parser, we could simply store those in the method, and look up the access index and whether an argument/variable is in the Frame or Inner on first execution.
This would be very similar to the node rewriting done in the interpreter, and avoid a second pass over the bytecode.
Though, on the other hand, this means we need POP/PUSH_VAR bytecodes accessing the Variable objects, and then Q_PUSH/POP_FRAME/INNER bytecodes as quickened versions.

  • the benefit would be to avoid a costly second pass over the bytecode
  • only code that's executed needs to be updated
  • we could simply keep all Argument/Local objects in the method
  • drawback is an extra run time overhead for quickening
  • it also bloats the bytecode set
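Option 1 can be sketched as follows. Bytecode names, the method representation, and the dispatch function are illustrative assumptions, not PySOM's actual implementation.

```python
PUSH_VAR, Q_PUSH_FRAME, Q_PUSH_INNER = 0, 1, 2

class Var:
    def __init__(self, index, in_inner):
        self.index = index
        self.in_inner = in_inner

def execute_push(method, pc, frame, stack):
    op, arg = method["code"][pc]
    if op == PUSH_VAR:
        # First execution: consult the Variable object kept in the method,
        # then rewrite the bytecode in place to its quickened form.
        var = method["vars"][arg]
        quickened = Q_PUSH_INNER if var.in_inner else Q_PUSH_FRAME
        method["code"][pc] = (quickened, var.index)
        op, arg = method["code"][pc]
    if op == Q_PUSH_FRAME:
        stack.append(frame[arg])
    elif op == Q_PUSH_INNER:
        stack.append(frame[0][arg])  # the Inner lives in slot 0 of the Frame
```

After the first execution, the PUSH_VAR check is dead weight on the hot path, which is exactly the run-time overhead and bytecode-set bloat noted above.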
Option 2: Second Pass

Turns out, we already do a second pass over the bytecode to determine the maximum stack height needed for a method.
So, we could use the same pass to simply patch up the bytecodes for variable accesses, too.
At this point in the process, we have finished parsing the method and indeed know the indexes.

This likely adds a small overhead to the parsing of all methods, but it means we don't need to keep the Variable objects around in the method (the options would have been to have arg/local arrays, or to put them into the literal array; both would mean more potential memory use).

Though, we still need to extend the bytecode set.
But, we only need the Q_PUSH/POP_FRAME/INNER bytecodes as active bytecodes.
We could introduce POP/PUSH_VAR as bytecodes outside the valid range, only used during parsing/compilation.
This would possibly avoid a performance impact in the interpreter.
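Option 2 can be sketched like this; it is an illustrative toy, assuming invented bytecode names and stack effects rather than PySOM's actual ones.

```python
PUSH_VAR, Q_PUSH_FRAME, Q_PUSH_INNER, POP = 0, 1, 2, 3
STACK_EFFECT = {Q_PUSH_FRAME: 1, Q_PUSH_INNER: 1, POP: -1}

class Var:
    def __init__(self, index, in_inner):
        self.index = index
        self.in_inner = in_inner

def finish_method(code, variables):
    # The existing max-stack-height pass, extended to also patch the
    # compile-time-only PUSH_VAR bytecodes into their quickened forms,
    # so the interpreter never sees PUSH_VAR at run time.
    max_height = height = 0
    for pc, (op, arg) in enumerate(code):
        if op == PUSH_VAR:
            var = variables[arg]
            op = Q_PUSH_INNER if var.in_inner else Q_PUSH_FRAME
            code[pc] = (op, var.index)
        height += STACK_EFFECT[op]
        max_height = max(max_height, height)
    return max_height
```

Since the Variable objects are only consulted here, they can be discarded once the pass completes, which is the memory advantage mentioned above.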

Generally interesting with the Q_PUSH/POP_FRAME/INNER bytecodes is what the frequencies of specific occurrences are going to be. In TruffleSOM, things like push_self, push_arg_1, etc. are added, and work well.
Here, I'd need to look, and profile what ends up being used frequently. That's an example of the problem of diverging interpreter designs.

I think I'll go with Option 2, fixing up accesses in the max-stack-height pass.

Avoid class guard and object-layout guard

In TruffleSOM, we have a relatively complex mechanism to avoid redundant guards.

An ObjectLayout implies a specific class. So, we should not need to check both in the same compilation unit.

To minimize interpreter complexity, it might be worthwhile to give every object an object layout instead of a class.
Then, we can hopefully avoid multiple guards on the same object.

TruffleSOM has different guards for different types of objects.
Though, I think that's bad for interpreter speed.
Perhaps we get better results by putting the class into the object layout.
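The idea can be sketched as follows; the class names and the deoptimization placeholder are assumptions for this illustration, not the TruffleSOM or PySOM machinery.

```python
class ObjectLayout:
    def __init__(self, cls, fields):
        self.cls = cls        # the layout carries the class, so a layout
        self.fields = fields  # guard implies a class guard; field -> index

class Obj:
    def __init__(self, layout):
        self.layout = layout
        self.storage = [None] * len(layout.fields)

def guarded_read(obj, expected_layout, field):
    # A single identity check on the layout: since the layout implies the
    # class, no separate class guard is needed in the same compilation unit.
    if obj.layout is not expected_layout:
        raise ValueError("deoptimize")  # placeholder for deopt/respecialization
    return obj.storage[expected_layout.fields[field]]
```

With the class stored in the layout, every object carries exactly one pointer to check, which is the hoped-for win for interpreter speed.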
