
signaldust / bunny-jit

Tiny optimizing JIT compiler backend.

Languages: C++ 96.99%, C 2.02%, Makefile 0.60%, Shell 0.36%, Batchfile 0.03%
Topics: compiler, jit, x64, optimizing-compiler, ssa, compiler-backend, library, small, tiny, arm64

bunny-jit's Issues

Stack-slot sizing

Right now arch-x64 assumes all stack slots are 64 bits wide, which wastes half a slot for every single-precision spill, but otherwise works fine as both doubles and integers fit. However, this needs to be fixed before we can add SIMD types.

There are basically two possible ways to fix this:

Option 1 is to forbid findSCC from merging SCCs when the size (or even just the type?) is different, and teach arch-x64 to do a pre-pass over slots that allocates actual stack locations of the correct sizes, then indirect the actual spills/reloads through those.

Option 2 is to make all slots 32 bits and then teach findSCC to allocate multiple SCCs for larger types. This would theoretically allow better sharing of stack locations (i.e. we could dynamically change what's stored where) and would save the code-gen from having to worry about it, but it also seems a little more complicated to implement.

I sort of like option 1 better, since it should be around five lines of code, but can option 2 be implemented without significantly more complexity?
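For concreteness, here's a minimal sketch of what the option 1 pre-pass might look like. Slot, assignSlotOffsets and the fields used here are hypothetical stand-ins, not bunny-jit's actual structures:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-slot descriptor; only meant to illustrate the
// shape of the pre-pass, not bunny-jit's real data structures.
struct Slot { uint32_t size; int32_t offset = -1; };

// Assign each stack slot a location of its actual size, so a 32-bit
// spill no longer burns a full 64-bit slot. Returns the total frame
// space used (real code would also round up for alignment).
static int32_t assignSlotOffsets(std::vector<Slot> & slots)
{
    int32_t frame = 0;
    // place larger slots first so smaller ones never force padding
    for(uint32_t want : { 16u, 8u, 4u })
    {
        for(auto & s : slots)
        {
            if(s.size != want) continue;
            s.offset = frame;   // spills/reloads indirect through this
            frame += (int32_t) s.size;
        }
    }
    return frame;
}
```

Spill/reload emission would then presumably read slots[scc].offset instead of assuming a fixed 64-bit stride per slot.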

Find a better CSE strategy

Currently CSE keeps one candidate per (opcode, operands) pattern in a hash table, effectively matching ops against the most recent identical operation only (since we replace the table entry with the op we last looked at). This works great for simple examples, but it's probably going to miss valid opportunities in more complex code (and arguably updates the hash table too much), so it needs to be replaced with something more reliable.

What we don't want to do is keep any per-bucket arrays or linked lists, but we could keep the lowest-indexed op as a "representative example" in the hash table, then collect matches against that as index pairs. Once we've found all the pairs, sort them by the representative index to get all the matches in bulk? Then we could just try every possible set of pairs in O(n^2) fashion?
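Here's a sketch of that pair-collection idea, assuming each op's opcode+operands can be reduced to a hash key; opKeys, Candidate and collectPairs below are hypothetical names, not the real cseTable interface:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

using OpIndex = uint32_t;
struct Candidate { OpIndex rep, match; };

// Keep the lowest-indexed op per key as the "representative example"
// (never replacing it), collect (rep, match) index pairs, then sort
// so all matches against one representative come out adjacent.
std::vector<Candidate> collectPairs(const std::vector<uint64_t> & opKeys)
{
    std::unordered_map<uint64_t, OpIndex> table;
    std::vector<Candidate> pairs;
    for(OpIndex i = 0; i < opKeys.size(); ++i)
    {
        auto it = table.find(opKeys[i]);
        if(it == table.end()) { table[opKeys[i]] = i; continue; }
        pairs.push_back({ it->second, i });
    }
    std::sort(pairs.begin(), pairs.end(),
        [](const Candidate & a, const Candidate & b)
        { return a.rep < b.rep; });
    return pairs;
}
```

The actual matching would then walk each contiguous run of pairs sharing the same representative.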

CSE should probably be split out of Fold, because it shouldn't need nearly as much iteration as Fold itself.

vectors of vectors: replace with something more cache friendly?

Block arguments are stored as an std::vector (per Block) and each of them keeps an std::vector of (source, value) pairs. This is "less than ideal" for the cache, with all the small allocations in the heap, especially given the phi-spam we get initially. I don't want to give up the easy-to-use interface, but I'd like to reduce the allocations.

We don't currently ever add new phis, we just optimize the existing ones, although we do often add new alternatives (when DCE does jump-threading). We could theoretically collect (phi, source, value) triples when building IR and then sort them before optimisation, although we would need some sort of 'next_index' field to enable cheap inserts. We might add new phis some day (e.g. the PRE phi cases).
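A sketch of what the flat triple storage might look like; PhiAlt and these helpers are hypothetical, not the current Block interface:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

static const uint32_t noIndex = ~0u;

// One flat array of (phi, source, value) triples for the whole
// procedure, instead of a vector of vectors per block.
struct PhiAlt
{
    uint32_t phi;            // index of the phi this alternative feeds
    uint32_t source;         // index of the source block
    uint32_t value;          // value flowing in from that block
    uint32_t next = noIndex; // chain for cheap out-of-order inserts
};

// Sort once before optimisation so each phi's alternatives are
// contiguous in memory.
void sortAlts(std::vector<PhiAlt> & alts)
{
    std::sort(alts.begin(), alts.end(),
        [](const PhiAlt & a, const PhiAlt & b)
        { return a.phi != b.phi ? a.phi < b.phi : a.source < b.source; });
}

// Cheap insert after sorting (e.g. new alternatives from
// jump-threading): append and link from an existing triple of the
// same phi rather than shifting the whole array.
void addAlt(std::vector<PhiAlt> & alts, uint32_t at, PhiAlt alt)
{
    alt.next = alts[at].next;
    alts[at].next = (uint32_t) alts.size();
    alts.push_back(alt);
}
```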

I don't want to change the current system right now, but this might be worth investigating some day.

reassoc is not perfect

So the ad-hoc approach to reassoc sort of works reasonably well, but it's ... not safe. :)

Writing a fuzzer to generate random expressions to fold (which turned up a few other bugs) exposed the obvious problem with the current approach: we either need to check that all the source values dominate their uses (also locally within blocks), or we need to move things around to make it so.

If we limit reassocs purely to those that move values further up the dominator chain, then we might be fine? ... but then we really want another solution for reordering stuff locally.
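For reference, the legality check itself is cheap if each op records its block and position and each block records its immediate dominator; all names below are hypothetical stand-ins for the real IR:

```cpp
#include <cstdint>
#include <vector>

struct OpInfo { uint32_t block; uint32_t pos; }; // pos = index within block
struct Block  { uint32_t idom; };                // immediate dominator

// Walk b's dominator chain upward; the entry block has idom == self.
bool dominates(const std::vector<Block> & blocks, uint32_t a, uint32_t b)
{
    while(b != a && blocks[b].idom != b) b = blocks[b].idom;
    return b == a;
}

// A reassociated op is safe only if every source value dominates it:
// strictly earlier in the same block, or in a dominating block.
bool reassocLegal(const std::vector<Block> & blocks,
                  const std::vector<OpInfo> & ops,
                  uint32_t op, uint32_t src)
{
    if(ops[src].block == ops[op].block)
        return ops[src].pos < ops[op].pos;
    return dominates(blocks, ops[src].block, ops[op].block);
}
```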

Optimizations for loads?

Currently the compiler treats all loads and stores as operations with side-effects and does not optimize them at all.

We probably don't want to go down the rabbit hole of aliasing analysis and I don't see dead-store elimination as high-value, but we could certainly optimize loads somewhat. Specifically, if we have multiple identical loads without stores between them, then in a single-threaded model these would always be safe for CSE and in a multi-threaded model they would be safe for CSE outside of loops.

If we choose the single-threaded model, then in order to still allow looping on a memory variable, we would either need "volatile" versions of loads or a compiler fence instruction that acts essentially as a "fake store" to intentionally break CSE/hoisting.

If we choose the multi-threaded model (i.e. keep treating loads as having side-effects), then we could add "restricted load" variants that obey the rule that the compiler is allowed to treat the memory as constant for the duration of the function (i.e. basically equivalent to a "const restrict" pointer in C). One advantage of this approach is that it would make it trivial to allow such memory to always be rematerialized directly (whereas in the single-threaded model we'd have to push the information about stores down to the register allocator). It would also be easier to implement (no need to analyse stores) and would give control (but also push responsibility) to the front-end writer.

I feel like the multi-threaded "explicitly mark stuff as const-restrict" model would probably fit Bunny-JIT's goal of simplicity better, but if someone reads this before I make up my mind, let me know what you think.
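If the const-restrict model wins, the CSE side could stay very simple. A hypothetical sketch (LoadKey and friends are stand-ins, not bunny-jit's cseTable): restricted loads get value-numbered on (address, offset) like pure ops, with no store tracking at all:

```cpp
#include <cstdint>
#include <unordered_map>

// Key for a "restricted load": since the memory is promised constant
// for the duration of the function, no intervening stores need to be
// considered and the load can be treated like any pure op.
struct LoadKey
{
    uint32_t addr;   // value index of the address operand
    int32_t  offset;
    bool operator==(const LoadKey & o) const
    { return addr == o.addr && offset == o.offset; }
};

struct LoadKeyHash
{
    size_t operator()(const LoadKey & k) const
    { return size_t(k.addr) * 2654435761u ^ size_t(uint32_t(k.offset)); }
};

// Return the previous identical restricted load if we've seen one,
// otherwise record this one and return it unchanged.
uint32_t cseRestrictedLoad(
    std::unordered_map<LoadKey, uint32_t, LoadKeyHash> & table,
    LoadKey key, uint32_t newIndex)
{
    auto it = table.find(key);
    if(it != table.end()) return it->second;
    table[key] = newIndex;
    return newIndex;
}
```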

extend to full PRE?

It would be relatively easy to extend CSE to do proper PRE by matching ops with phi inputs against all the alternatives and, if any one path matches, pushing new ops to all the other paths with a new phi to collect the results.
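To make the shape of that transform concrete, here's a deliberately simplified sketch on a toy IR (MiniOp and Pred are stand-ins, not bunny-jit's types): an op computes something from a phi, and if any incoming path already has the computation we copy it to the remaining paths, after which the op can be rewritten as a phi over the per-path results:

```cpp
#include <cstdint>
#include <vector>

struct MiniOp { uint32_t opcode; uint32_t input; };

struct Pred
{
    std::vector<MiniOp> ops; // ops at the end of this predecessor
    uint32_t phiValue;       // value this path feeds into the phi
};

static bool hasOp(const Pred & p, uint32_t opcode)
{
    for(auto & o : p.ops)
        if(o.opcode == opcode && o.input == p.phiValue) return true;
    return false;
}

// If at least one path already computes op(phiValue), materialize the
// op on every path that lacks it; the caller then builds a new phi
// over the per-path results and replaces the original op with it.
bool preOverPhi(std::vector<Pred> & preds, uint32_t opcode)
{
    bool found = false;
    for(auto & p : preds) if(hasOp(p, opcode)) found = true;
    if(!found) return false;   // fully non-redundant, nothing to do
    for(auto & p : preds)
        if(!hasOp(p, opcode)) p.ops.push_back({ opcode, p.phiValue });
    return true;
}
```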

The complication here is that we only store the first candidate in cseTable as we encounter them, so we'd probably want to match against partial redundancies after the regular CSE pass (we could do this by iterating cseTable directly). However, in some cases the possible partial redundancy might not be the one stored in cseTable, so we'd also have to check the list of candidate pairs for other alternatives.

The question then is how to best combine this into the CSE pass: do we perform regular CSE first (with or without cleaning up the list of candidates?), or do we try to match PRE first and then check for CSE afterwards?
