lineiform's People

Contributors

chc4

lineiform's Issues

Recursive example incomplete?

Hey! Shouldn't your README.md example for the recursive evaluation state

fn eval(ast: Ast) -> usize {
    match ast {
        Literal(l) => l,
        Add(left, right) => eval(left) + eval(right),  // <-- recursive call to eval missing!
    }
}

let val = eval(Add(Literal(1), Literal(2)));
val // 3

?
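For completeness, a self-contained, compilable version of the fixed example might look like this (the Ast enum and the Box indirection are assumed, since the README sketch elides them):

```rust
// Hypothetical enum backing the README example; Box is needed because
// Rust enums can't directly contain themselves.
enum Ast {
    Literal(usize),
    Add(Box<Ast>, Box<Ast>),
}

fn eval(ast: Ast) -> usize {
    match ast {
        Ast::Literal(l) => l,
        // the recursive calls the issue points out:
        Ast::Add(left, right) => eval(*left) + eval(*right),
    }
}

fn main() {
    let val = eval(Ast::Add(Box::new(Ast::Literal(1)), Box::new(Ast::Literal(2))));
    assert_eq!(val, 3); // 1 + 2
    println!("{}", val); // prints 3
}
```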

Do JIT compilation based on perf counters

Currently we block on JIT compilation as soon as we call Jit::speedup. We can instead return a FastFn that delays JIT compilation until it's called, and, even better, use performance counters like every other JIT on earth to only trigger compilation if it's called more than N times, where N is a const generic parameter on the Lineiform trait for compile-time tweaking.

Probably just look up how LuaJIT does fast performance counters to minimize overhead. We might also want to store the performance counters out of band, since we optimally want FastFn to implement Freeze from #4, so we can inline calls to JIT'd closures without having to register them in some map and special-case them.
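A minimal sketch of the counter idea, assuming a simplified FastFn that wraps a single usize -> usize closure (not Lineiform's actual types):

```rust
// Hypothetical FastFn: runs the slow closure until it has been called
// more than N times, then flips to "compiled". N is a const generic,
// so the threshold is fixed at compile time as the issue suggests.
struct FastFn<F, const N: usize> {
    slow: F,
    calls: usize,
    compiled: bool, // stand-in for the JIT'd code pointer
}

impl<F: Fn(usize) -> usize, const N: usize> FastFn<F, N> {
    fn new(slow: F) -> Self {
        FastFn { slow, calls: 0, compiled: false }
    }

    fn call(&mut self, a: usize) -> usize {
        self.calls += 1;
        if !self.compiled && self.calls > N {
            self.compiled = true; // here a real version would trigger the JIT
        }
        (self.slow)(a) // would dispatch to the compiled body if present
    }
}

fn main() {
    let mut f: FastFn<_, 2> = FastFn::new(|a: usize| a + 1);
    f.call(0);
    f.call(0);
    assert!(!f.compiled); // not hot yet
    f.call(0);
    assert!(f.compiled); // crossed the N = 2 threshold
}
```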

Async JIT compilation

Related to #11, but we can also do the JIT compilation in another thread, and run the slow code while it's compiling in the background. This means that, in the worst case where some hot loop is called exactly as many times as our perf count threshold, we don't make it wait a very long time for compilation only for the result to never be called again.

We should use rayon for this, since we basically just want to post "compile me!" to a workqueue, then check the progress on each call and use the result if it's finished. Slightly complex since we have to make sure Drop doesn't leak the JIT code memory if we drop the FastFn before the JIT compilation is done.
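A rough sketch of the check-progress-on-call flow, using std::thread and a channel instead of rayon, with a String standing in for the compiled code (AsyncFastFn is a hypothetical name, not Lineiform's API):

```rust
use std::sync::mpsc;
use std::thread;

// A worker thread "compiles" in the background while call() keeps
// running the slow closure, polling the channel on each call.
struct AsyncFastFn<F> {
    slow: F,
    pending: Option<mpsc::Receiver<String>>,
    compiled: Option<String>,
}

impl<F: Fn(usize) -> usize> AsyncFastFn<F> {
    fn new(slow: F) -> Self {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // stand-in for the expensive JIT compilation
            let _ = tx.send("jit body".to_string());
        });
        AsyncFastFn { slow, pending: Some(rx), compiled: None }
    }

    fn call(&mut self, a: usize) -> usize {
        if self.compiled.is_none() {
            if let Some(rx) = self.pending.take() {
                match rx.try_recv() {
                    // finished: a real version would swap in the code
                    // pointer, and its Drop must still free that memory
                    Ok(code) => self.compiled = Some(code),
                    // still compiling: keep polling on later calls
                    Err(_) => self.pending = Some(rx),
                }
            }
        }
        (self.slow)(a) // would dispatch to self.compiled if present
    }
}

fn main() {
    let mut f = AsyncFastFn::new(|a| a * 2);
    assert_eq!(f.call(21), 42);
}
```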

Handle pointer arguments in lifting

Related to #2, but on the lifting side: if we have a closure like so:

struct MyStruct {
    a: usize,
    b: usize,
    c: usize
}

jit.speedup(|mut env: MyStruct| { env.a += 1; env.a })

it's implemented as something like fn(clos: c_void, env: &MyStruct) { env->a++; env->a }: that is, the structure is behind a pointer (we optimize away all loads from clos, which is the closed environment).

We want to tell the lifter that the argument is a JitValue::Ref(JitValue::Struct(vec![JitValue::Value, JitValue::Value, JitValue::Value]), mem::size_of::<MyStruct>()) - that is, that it's a pointer to a struct made up of some symbolic value members. This would allow us to do partial application and optimization with the struct's members instead of having to emit Cranelift loads/stores for everything.

We might want to annotate the struct MyStruct with a #[derive(JitInput)] or something? Dunno.
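One possible shape for this, with a hypothetical JitValue that only illustrates the Ref-to-Struct idea (the real lifter's type will differ):

```rust
// Illustrative symbolic-value shape: a pointer (Ref) to a Struct of
// three unknown word-sized members, as described above.
#[derive(Debug, PartialEq, Clone)]
enum JitValue {
    Value,                     // unknown symbolic word
    Struct(Vec<JitValue>),     // aggregate of members
    Ref(Box<JitValue>, usize), // pointer to a value of known byte size
}

struct MyStruct {
    a: usize,
    b: usize,
    c: usize,
}

fn my_struct_shape() -> JitValue {
    JitValue::Ref(
        Box::new(JitValue::Struct(vec![
            JitValue::Value,
            JitValue::Value,
            JitValue::Value,
        ])),
        std::mem::size_of::<MyStruct>(),
    )
}

fn main() {
    if let JitValue::Ref(inner, size) = my_struct_shape() {
        assert_eq!(size, 3 * std::mem::size_of::<usize>());
        assert_eq!(*inner, JitValue::Struct(vec![JitValue::Value; 3]));
    } else {
        unreachable!()
    }
}
```

With this shape the lifter can resolve a load of `env.a` to the first Struct member symbolically instead of emitting a real load.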

Persistent cache

We should ideally cache a lot of the JIT metadata we collect, for things like loop entry points from #8, which control flows merge from #7, or DWARF ABI information from #3.

All of this data should be identical across runs with the same binary. So hypothetically we should be able to not only cache the data for one run, but all future runs of the JIT.

We should look into dumping the data to disk, mmapping it back in when we start up again so we begin with the best information we have, and atomically updating the database if we get new information during a run.

A slight problem is that we have to handle cache invalidation if the binary is updated, or if any cached data depended on a symbol from a dynamic library that was updated.
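A tiny sketch of the invalidation side, assuming the cache is keyed by some fingerprint of the binary (a real version would hash /proc/self/exe or use its mtime; here the key is just a caller-supplied string):

```rust
use std::fs;
use std::path::Path;

// On-disk format: first line is the binary fingerprint, rest is the
// payload. A mismatched fingerprint invalidates the whole cache.
fn load_cache(path: &Path, binary_key: &str) -> Option<String> {
    let data = fs::read_to_string(path).ok()?;
    let (key, payload) = data.split_once('\n')?;
    if key == binary_key {
        Some(payload.to_string())
    } else {
        None // binary changed: discard stale metadata
    }
}

fn store_cache(path: &Path, binary_key: &str, payload: &str) -> std::io::Result<()> {
    // write to a temp file and rename, for an atomic-ish update
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, format!("{}\n{}", binary_key, payload))?;
    fs::rename(&tmp, path)
}

fn main() {
    let path = std::env::temp_dir().join("lineiform_cache_demo");
    store_cache(&path, "build-abc", "loop entry points...").unwrap();
    assert_eq!(load_cache(&path, "build-abc").as_deref(), Some("loop entry points..."));
    // a different binary fingerprint invalidates the cache
    assert_eq!(load_cache(&path, "build-def"), None);
    let _ = fs::remove_file(&path);
}
```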

Use argument and output info for correct Cranelift function prototype declaration

Currently, we just hardcode the input and output registers for closures we do partial evaluation on: we tell Cranelift (in lift.rs Jit::new) that all functions we compile have the prototype ptr, ptr, int -> int. This means we would miscompile e.g. jit.speedup(|| (1u64, 2u64)), since it wouldn't know that it has to feed the 2 immediate into the output register sink.

We should instead be using mem::size_of::<A>() and mem::size_of::<O>() to fill out the proper SystemV ABI calling convention for a function that takes and returns arguments of those sizes. I think this is fairly straightforward - too-big arguments/outputs are turned into a pointer to a stack buffer to write to, but we should be able to just tell Cranelift that's a ptr afaik, since it never optimizes away stores.
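A rough sketch of the size rule described above. This is only the size-based part; full SystemV classification also depends on field types, so treat the thresholds as assumptions:

```rust
// Return-value convention by size: up to 8 bytes in rax, up to 16
// in rax:rdx, anything bigger via a hidden pointer to caller stack.
#[derive(Debug, PartialEq)]
enum RetAbi {
    Registers { count: usize },
    StackPointer,
}

fn classify_ret(size: usize) -> RetAbi {
    match size {
        0..=8 => RetAbi::Registers { count: 1 },  // rax
        9..=16 => RetAbi::Registers { count: 2 }, // rax:rdx
        _ => RetAbi::StackPointer,                // caller passes a ptr
    }
}

fn main() {
    assert_eq!(classify_ret(std::mem::size_of::<u64>()), RetAbi::Registers { count: 1 });
    // (1u64, 2u64) is 16 bytes: the 2 goes in the second output register
    assert_eq!(classify_ret(std::mem::size_of::<(u64, u64)>()), RetAbi::Registers { count: 2 });
    assert_eq!(classify_ret(24), RetAbi::StackPointer);
}
```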

Handle dynamically linked functions

Right now we just use /proc/self/exe as the ELF to load to look for symbols. This is OK for normal Rust code, since it's statically linked by default, but not for things like "any call to malloc" or "any external library".

We should instead open /proc/self/maps to find which ELF the address we're trying to resolve is inside, and check that for the symbol instead. Once we're getting the section mappings from proc maps, we also need to use them for finding the base address for symbol vaddr calculation. Currently we just use a .needle symbol, which is hacky af and doesn't work for external ELFs.

We should cache the ELFs (and what addresses their sections are mapped at) with a RangeMap<usize, Rc<Elf>> or something similar, marking which ELF each range belongs to so we don't have to keep checking proc maps or re-reading the ELF from disk.
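A sketch of the lookup side, using std's BTreeMap as a poor man's RangeMap and a hand-rolled parser for one /proc/self/maps line (the maps line and paths are made-up examples):

```rust
use std::collections::BTreeMap;

// Parse "start-end perms offset dev inode path" into (start, end, path).
fn parse_maps_line(line: &str) -> Option<(usize, usize, String)> {
    let mut parts = line.split_whitespace();
    let range = parts.next()?;
    let path = parts.nth(4)?.to_string(); // skip perms/offset/dev/inode
    let (start, end) = range.split_once('-')?;
    Some((
        usize::from_str_radix(start, 16).ok()?,
        usize::from_str_radix(end, 16).ok()?,
        path,
    ))
}

// Range query: find the mapping whose start is <= addr, then check
// that addr falls before its end.
fn elf_for_addr(map: &BTreeMap<usize, (usize, String)>, addr: usize) -> Option<&str> {
    let (_, (end, path)) = map.range(..=addr).next_back()?;
    if addr < *end { Some(path.as_str()) } else { None }
}

fn main() {
    let mut map = BTreeMap::new();
    let line = "7f0000000000-7f0000021000 r-xp 00000000 08:01 123 /usr/lib/libc.so";
    let (start, end, path) = parse_maps_line(line).unwrap();
    map.insert(start, (end, path));
    assert_eq!(elf_for_addr(&map, 0x7f0000001234), Some("/usr/lib/libc.so"));
    assert_eq!(elf_for_addr(&map, 0x7f0000021000), None); // end is exclusive
}
```

A real version would store Rc<Elf> in the value instead of the path, so repeated lookups don't re-parse the ELF.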

Use a different IR than Cranelift

move |a: usize| {
    let val: usize;
    if Wrapping(a) + Wrapping(0x5) < Wrapping(a) {
        val = 1;
    } else {
        val = 2;
    }
    val
}
---- cmp rsi, -0x5
---- mov eax, 0x2
---- adc rax, -0x1
---- ret
function u0:0(i64, i64, i64) -> i64 system_v {
block0(v0: i64, v1: i64, v2: i64):
    jump block1

block1:
    v3 = iconst.i64 -5
    v4, v5 = isub_ifbout.i64 v2, v3
    v6 = iconst.i64 2
    v7 = iconst.i64 -1
    v8, v9 = iadd_ifcarry v6, v7, v5
    return v8
}

(Ignore the weird block0)
This errors when compiling with
thread 'test::test_passthrough_usize' panicked at 'ALU+imm and ALU+carry ops should not appear here!', /home/charlie/.cargo/registry/src/github.com-1ecc6299db9ec823/cranelift-codegen-0.75.0/src/isa/x64/lower.rs:5977:13

Technically for this specific case I could turn the cmp into a Cranelift icmp_imm instruction, but later code could depend on the rsi (v4) value. I'd have to re-emit an isub instruction right before its use, which I can do in this specific case, but really this is just a sign that Cranelift's IFLAGS mechanism plain doesn't work for me.
The bigger issue is that Cranelift assumes that all math ops clobber all IFLAGS, so you can't have, for example

---- cmp rsi, -0x5
---- lea eax, [rsi+5]
---- mov eax, 0x2
---- adc rax, -0x1
---- ret

since it would say that the flag produced by the cmp is invalidated by the iadd from the LEA. I'd have to do my own code movement and instruction selection to make it happy, and even then Cranelift's flag rules are hard enough to use that it's probably impossible.

I'm not entirely sure where to go from here: I probably should've just been using LLVM this entire time, since that's what a lot of other x86_64 lifters target for a backend. Alternatively, I could try to use dynasm-rs: it doesn't have any optimizations, but then Cranelift barely does either (and its codegen is often worse than just re-emitting the same instructions!). By lifting and lowering to the exact same instructions, nothing would clobber flags - at worst I'd have to emit stc; adc if I constant fold the add part of an add; adc, and could probably do something smarter instead.

Abort compilation properly

Right now we just have unwrap() in a bunch of places during JIT compilation. If we fail to e.g. resolve a symbol we should instead abort compilation and continue using the slow one, and if we fail to lift an instruction we should concretize the context and jump to the original closure body instruction.
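A sketch of the fallback shape for the first case, with a toy CompileError and a fn pointer standing in for JIT'd code (none of these names are Lineiform's real API):

```rust
// Compilation failures become Err instead of panics, and the caller
// falls back to boxing the original slow closure.
#[allow(dead_code)]
enum CompileError {
    UnknownSymbol,
    Unsupported,
}

fn try_compile(supported: bool) -> Result<fn(usize) -> usize, CompileError> {
    if supported {
        Ok((|a: usize| a + 1) as fn(usize) -> usize)
    } else {
        Err(CompileError::Unsupported)
    }
}

fn speedup<F: Fn(usize) -> usize + 'static>(
    slow: F,
    supported: bool,
) -> Box<dyn Fn(usize) -> usize> {
    match try_compile(supported) {
        Ok(fast) => Box::new(fast),
        // abort compilation, keep using the slow path
        Err(_) => Box::new(slow),
    }
}

fn main() {
    let f = speedup(|a| a + 1, false); // compilation "fails"
    assert_eq!(f(41), 42); // still correct via the slow closure
}
```

The second case (bailing mid-lift) is harder, since it needs to concretize the symbolic context and jump into the original closure body, which a Result alone can't express.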

Deallocate JIT code on FastFn drop

What it says on the tin: usually JITs can't ever deallocate JIT bodies without explicit safepoints, since another thread could be currently executing the code and would segfault.
Since we JIT closures, and are using Rust where everything has explicit lifetimes, we can actually always deallocate the JIT code when the Fn goes out of scope, since by definition it is unreachable from any code. Also we don't have any multi-threading support yet lol.

Get rid of the mem::forget(f) in JIT compilation and store the original function in the FastFn instead, too, so that we properly run destructors of the members of the closed environment.
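A sketch of the ownership story, with a Vec<u8> standing in for the mmap'd code region and a drop-counted Rc playing the closed environment:

```rust
use std::rc::Rc;

// FastFn owns both the JIT memory and the original closure, so
// dropping it frees the code AND runs the environment's destructors,
// with no mem::forget needed.
struct FastFn<F> {
    code: Vec<u8>, // stand-in for the mmap'd executable region
    #[allow(dead_code)]
    original: F, // the original closure, kept so its Drop runs
}

impl<F> Drop for FastFn<F> {
    fn drop(&mut self) {
        // a real implementation would munmap the region here;
        // self.original is dropped automatically afterwards
        self.code.clear();
    }
}

fn main() {
    let env = Rc::new(1usize);
    let env2 = Rc::clone(&env);
    let f = FastFn { code: vec![0xc3 /* ret */], original: move || *env2 };
    assert_eq!(Rc::strong_count(&env), 2); // closure environment is alive
    drop(f);
    assert_eq!(Rc::strong_count(&env), 1); // environment destructor ran
}
```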

Merge flow control back together after a branch

Currently we're a zero-pass JIT compiler: we just JIT operations as we see instructions, without any initial pass over the function. This is nice because it's (probably) good for compile performance, and because any CFG we built while aggressively inlining with partial application would have to be updated as we elide branches and inline functions anyway.

Unfortunately, it also means that we don't have any CFG info as we JIT a function. So if we have above; if cond { true } else { false }; fallthrough, what we do is just split the execution into two paths, above -> true -> fallthrough and above -> false -> fallthrough. This is fine for simple branches, and means we have full partial application for either version of fallthrough, but means we have 2^n paths through a function for n branches, which is hilariously bad.

We'd instead like to be able to turn it into above -> [true, false] -> fallthrough, where fallthrough has an intersection of the symbolic state from the results of true and false. The fact that we don't have a CFG means we don't know if a control flow path merges back again, though! And in practice the two sides of a branch aren't the same number of basic blocks, so we can't just advance the JIT equally on either side of a branch and check if their jump targets are the same.

There's two options:

  1. give up on being a zero-pass compiler, and do an initial CFG building pass to find where to merge branches back together (which also would help us when doing loop compilation...)
  2. just use some heuristic and miss some merges - we could instead try to advance whichever side of a branch is at the lowest instruction offset into the function at each step, on the basis that even if the true path is a different length than false, they're both laid out in assembly before the fallthrough block. This is easier, and maybe faster, but means we would miss merges for branches that jump to some basic block at the end of a function and back up again. Maybe that's good, because those are mostly "cold" branches, and compiling both sides of a cold branch gives us better partial application info on the hot path, since we wouldn't have to throw away information when unioning states to account for unlikely-path behavior.
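Whichever option we pick, the merge itself is a state intersection. A toy version over a register -> known-constant map (the real symbolic state is richer than i64 constants):

```rust
use std::collections::HashMap;

// A register survives the merge as a known constant only if both
// sides of the branch agree on its value; otherwise it goes symbolic.
fn merge_states(
    a: &HashMap<&'static str, i64>,
    b: &HashMap<&'static str, i64>,
) -> HashMap<&'static str, i64> {
    a.iter()
        .filter(|(reg, val)| b.get(*reg) == Some(*val))
        .map(|(reg, val)| (*reg, *val))
        .collect()
}

fn main() {
    let true_side: HashMap<&'static str, i64> = HashMap::from([("rax", 1), ("rcx", 7)]);
    let false_side: HashMap<&'static str, i64> = HashMap::from([("rax", 2), ("rcx", 7)]);
    let merged = merge_states(&true_side, &false_side);
    assert_eq!(merged.get("rcx"), Some(&7)); // both sides agree
    assert_eq!(merged.get("rax"), None);     // differs: now symbolic
}
```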

Ahead-of-time compilation mode

Ok, maybe this is a dumb idea, but it'd be nice to be able to do a JIT compilation of a script and then have Lineiform dump a snapshot to disk. Then running the snapshot executes the "JIT" compiled closures immediately, without having to parse + emit closures + JIT compile, giving you ahead-of-time compilation of scripts (while still having the full range of language features like eval).

Lisp and Smalltalk have "images" that do this. I'm not entirely sure how you handle rebasing the JIT compiled closures, tho, if we're inlining pointers from the closed environment and stuff. Maybe you just don't have ASLR in the snapshotted executable lol.

Better tracing

Right now we just println a bunch, which sucks.

  • Switch to tracing so we have introspection into how long specific parts of the program are taking
  • Save cranelift_codegen::cfg_printer::CFGPrinter to disk so we can visualize the JIT'd program CFG better (and maybe build our own graphviz printout for JitValues?)

Lineiform metadata interface

Lineiform should probably keep a big map of metadata which it uses when lifting, so you can e.g. tell Lineiform to never try to inline some function.

Implement looping

Similar to #7, but for loops instead.

The problem with looping is that we don't know where a loop starts until we hit a jump to an instruction we have already marked in our Context.seen bitmap. Which means we always "unroll" one iteration of a loop no matter what, and only then see the condition for looping and the jump back to the beginning.

We can either

  1. build a CFG, and use that to collect all the entry points of a loop
  2. Unroll one execution, and stop advancing any paths that are backwards jumps when we are emitting. Then, once we have JIT compiled all the non-loop paths we can, we have a set of loop entry points and their entry contexts, and can union them all together to create symbolic values for the body of the loop, which we then compile.

The fallthrough branches of the loop then get a context where we clobber whatever values differ between the start and the end of the loop, allowing us to optimize slightly through loops by giving us the set of registers/stack slots that are invariant across the entire loop body.

The problem with option 2 is that we optimally want to recompile functions multiple times in the final version of Lineiform, and if we're unrolling one loop iteration each time it's Bad. We can probably just keep around function compilation metadata after we JIT, and feed it into the JIT of any consumers of the function so that it knows a loop starts at X without having to unroll it once and find back edges though? In the CFG building case we'd probably want to keep that metadata around too so we don't have to re-build a CFG from our JIT output, so I'm not sure if it's any faster or not.
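The back-edge detection in option 2 can be sketched over a list of (offset, optional jump target) pairs and a seen set:

```rust
use std::collections::HashSet;

// Walk instructions in emission order; any jump whose target we have
// already seen is a backwards jump, so its target is a loop entry.
fn loop_entries(trace: &[(usize, Option<usize>)]) -> Vec<usize> {
    let mut seen = HashSet::new();
    let mut entries = Vec::new();
    for &(offset, target) in trace {
        seen.insert(offset);
        if let Some(t) = target {
            if seen.contains(&t) {
                entries.push(t); // t starts a loop
            }
        }
    }
    entries
}

fn main() {
    // straight-line code at offsets 0 and 4, then a jump at 8 back to 4
    let trace = [(0, None), (4, None), (8, Some(4))];
    assert_eq!(loop_entries(&trace), vec![4]);
}
```

Persisting exactly this entry-point list is the metadata the issue suggests feeding into later recompilations, so the loop doesn't have to be unrolled again to rediscover it.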

Compile-time closure freeze checking

There's a section in the README about the hypothetical freezing allocator we can use to make sure closure environments are safe to inline by forcing them to be immutable (at the cost of runtime segfaults...)

We can actually do slightly better, and put off implementing that for a bit: Rust has automatic marker traits like Send and Sync that are implemented for any type that only contains fields already implementing that trait. The Rust compiler has an internal marker trait Freeze that's used for marking interior immutability - which is exactly the property we care about. Magically, marker traits are also implemented for Fn trait objects automatically, so we can 1) verbatim copy Freeze out of the stdlib (or maybe use https://crates.io/crates/freeze-macros which seems to do exactly that??) and 2) change Jit::speedup<F> to Jit::speedup<F: Freeze> to only accept closures that are interior-immutable. This gives us compile-time errors instead of runtime segfaults, which is a bonus.
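Since auto traits are nightly-only (#![feature(auto_traits)]), a stable-Rust approximation has to implement the marker by hand and take the environment as an explicit parameter instead of a closure capture. FreezeLike and this speedup are hypothetical stand-ins, not the real Freeze or Lineiform's API:

```rust
// Marker: "no interior mutability (std::cell::Cell, RefCell, ...)
// reachable from this type", so its contents are safe to inline.
// unsafe because the implementor vouches for that property by hand.
unsafe trait FreezeLike {}
unsafe impl FreezeLike for usize {}
unsafe impl FreezeLike for (usize, usize) {}
// deliberately NOT implemented for Cell<usize> and friends

fn speedup<E: FreezeLike>(env: E, f: fn(&E) -> usize) -> usize {
    // a real Jit::speedup would partially evaluate f against env here
    f(&env)
}

fn main() {
    assert_eq!(speedup(41usize, |e| e + 1), 42);
    // speedup(std::cell::Cell::new(41), |e| e.get()) fails to compile:
    // Cell<usize> has no FreezeLike impl, turning the would-be runtime
    // segfault into a compile-time error, as the issue wants.
}
```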

Implement inlining heuristics

Right now Lineiform just tries to inline every call unconditionally. This is pretty stupid. Most JITs use some combination of current function size, CFG complexity, and target function size to decide if a call should be inlined or not, and we'd like to do the same. This is slightly hampered in that we don't have a CFG building pass, but we do have the target function size from the symbol.
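A toy version of such a heuristic, using only the two sizes we actually have (the thresholds are made up, not tuned values):

```rust
// Inline only if the target is small and the inflated current
// function stays under an overall budget.
fn should_inline(current_size: usize, target_size: usize) -> bool {
    const TARGET_LIMIT: usize = 128; // bytes; made-up threshold
    const FUNCTION_BUDGET: usize = 4096; // bytes; made-up threshold
    target_size <= TARGET_LIMIT && current_size + target_size <= FUNCTION_BUDGET
}

fn main() {
    assert!(should_inline(1000, 64));   // small helper: inline it
    assert!(!should_inline(1000, 512)); // big target: emit a call
    assert!(!should_inline(4090, 64));  // budget exhausted
}
```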

Use DWARF info to get the calling convention of external functions

When we call a function, we'd like to know the minimum set of values we actually have to materialize into registers, to avoid having to place all of them. DWARF has some information on what prototype a function has, which afaik is literally what registers it needs as inputs. Not sure if it also includes clobbered registers (SystemV has a list of registers that you can't assume are valid after a function call).

This should probably be lazily cached in the Function struct the first time we request it, since parsing DWARF is fairly expensive and we don't want to do it for functions we don't have to, and only once for the ones we do. gimli is the crate we'd want to use, feeding it the goblin::Elf structure we already have if we can.

DWARF isn't what I'd call "well documented". This will probably need some reading of dwarfdump's source code and obscure mailing list threads to figure out how to get the register inputs, and maybe I'm just wrong and it doesn't actually give us registers and we have to try to infer them.
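The lazy-caching part is easy to sketch with std::cell::OnceCell, with the gimli parse replaced by a stub (Prototype, Function, and the register list here are all hypothetical):

```rust
use std::cell::OnceCell;

// The parsed prototype is computed at most once per Function, on
// first request, and cached inside the struct itself.
#[derive(Clone, Debug, PartialEq)]
struct Prototype {
    input_regs: Vec<&'static str>,
}

struct Function {
    #[allow(dead_code)]
    name: &'static str,
    proto: OnceCell<Prototype>,
}

impl Function {
    fn prototype(&self) -> &Prototype {
        self.proto.get_or_init(|| {
            // stand-in for the expensive gimli DWARF parse
            Prototype { input_regs: vec!["rdi", "rsi"] }
        })
    }
}

fn main() {
    let f = Function { name: "memcpy", proto: OnceCell::new() };
    assert_eq!(f.prototype().input_regs, vec!["rdi", "rsi"]);
    // second call hits the cache: same allocation, no re-parse
    assert_eq!(f.prototype() as *const _, f.prototype() as *const _);
}
```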
