
Comments (5)

stephenrkell commented on August 19, 2024

Of course, generating trampolines at run time compromises debuggability, or more generally meta-completeness, unless we can generate debug info / metadata for our trampolines. See #16. Ideally we would create the trampolines in a libdlbind-allocated object, and write into it some DWARF info for them. Generating this DWARF may prove a bit tedious, but we could probably hard-code a template and generate from that. Or arguably, our trampolines are just like PLT entries and don't really need their own DWARF.

from liballocs.

stephenrkell commented on August 19, 2024

Perhaps an appealing way to do the hooking would be by return-address hooking. But then we need to worry about breaking stack walkers, both from debugging and from our own code. In our own code we can cope with it (make introspective clients see the "illusion" of the real stack). But it would be annoying to break the debugger. I wonder if DynInst has a solution to this.
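The "illusion" for our own introspective clients could be backed by a side table mapping frame to displaced return address, which a cooperating walker consults. A minimal sketch, with all names invented here:

```c
#include <assert.h>
#include <stddef.h>

/* Side table: frame base -> the return address our hook displaced. */
struct displaced { const void *frame_base; const void *real_retaddr; };

#define MAX_DISPLACED 64
static struct displaced displaced_tbl[MAX_DISPLACED];
static size_t ndisplaced;

/* Record, at hook time, where the displaced return address really went. */
void remember_displaced(const void *frame, const void *real_ret)
{
    assert(ndisplaced < MAX_DISPLACED);
    displaced_tbl[ndisplaced].frame_base = frame;
    displaced_tbl[ndisplaced].real_retaddr = real_ret;
    ndisplaced++;
}

/* A cooperating walker calls this on each return address it decodes:
   if it sees our hook's address, it substitutes the saved real one,
   so introspective clients see the unhooked stack. */
const void *walker_fixup(const void *frame, const void *seen_ret,
                         const void *hook_addr)
{
    if (seen_ret != hook_addr) return seen_ret;
    for (size_t i = 0; i < ndisplaced; i++)
        if (displaced_tbl[i].frame_base == frame)
            return displaced_tbl[i].real_retaddr;
    return seen_ret; /* unknown frame: leave it alone */
}
```

This helps only walkers we control; an unmodified debugger still sees the hooked addresses, which is the unresolved part of the problem.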


stephenrkell commented on August 19, 2024

I am leaning towards thinking that we should get rid of caller-side instrumentation altogether. That might work as follows.

  • only do callee-side instrumentation

  • insert type info by walking up the stack and finding a call to a classified alloc func, i.e. classified according to both source- (dumpallocs) and binary-level (objdumpallocs) analyses. That supposes we are still requiring users to identify the allocation functions; getting away without that is more for #20 although some kind of bootstrapping based on instrumented built-in allocators (brk, mmap, stack) might just about work.

  • In general we want the outermost hit, to catch wrapper cases. Or do we? In a wrapper, we expect the in-wrapper site not to be classified. And if there might be allocations during allocation, the innermost classified call seems to be the right one. E.g. SPEC CPU2006's sphinx3 has

```c
void **__ckd_calloc_2d__ (int32 d1, int32 d2, int32 elemsize,
                          const char *caller_file, int32 caller_line)
{
    char **ref, *mem;
    int32 i, offset;

    mem = (char *) __ckd_calloc__(d1*d2, elemsize, caller_file, caller_line);
    ref = (char **) __ckd_malloc__(d1 * sizeof(void *), caller_file, caller_line);

    for (i = 0, offset = 0; i < d1; i++, offset += d2*elemsize)
        ref[i] = mem + offset;

    return ((void **) ref);
}
```

But if we do this, caller-side instrumentation may need to come back for performance reasons. It basically short-circuits the stack walk: by maintaining __current_allocsite we don't need to walk the stack at all, and can zoom straight to the data we want. Maybe some clever caching can get us something almost as good.

Or maybe we can heuristically insert the caller-side instrumentation, at the binary level, after the first time we discover an allocation function by stack walking. We could patch the PLT, or perhaps the call instruction itself (for not-via-PLT cases). Indirect calls would still need to go through the PLT; indirect and not-via-PLT is a problem, and might have to fall back to always walking the stack. Or we could instead do our binary instrumentation at the called entry point. This is less clean because we will CoW the text pages every time.
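The short-circuit can be sketched as follows. This is a hedged illustration only (GNU C; the hook names are invented, and a unique static per macro expansion stands in for the real call-site address):

```c
#include <stdlib.h>

/* Caller-side wrapper publishes its call site here, so the callee-side
   hook reads it instead of walking the stack. */
static __thread const void *__current_allocsite;

const void *last_seen_site; /* what the callee-side hook observed */

void *hooked_malloc(size_t sz)
{
    last_seen_site = __current_allocsite; /* no stack walk needed */
    return malloc(sz);
}

/* Caller-side instrumentation: stash a per-site token, call, clear. */
#define SITE_MALLOC(sz) ({                      \
    static const char __site_id;                \
    __current_allocsite = &__site_id;           \
    void *__p = hooked_malloc(sz);              \
    __current_allocsite = NULL;                 \
    __p; })
```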


stephenrkell commented on August 19, 2024

Perhaps we should first rename __current_allocsite to __outermost_allocsite and use it to terminate the stack walk...

... then always choose the innermost (first) classified call site that we find on the walk?

This means we are sticking with caller-side instrumentation, which we had wanted to eliminate. I guess the first step is to do the above (fixes sphinx3) and then measure the performance drop of skipping the __outermost_allocsite early termination.
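The proposed walk policy, over return addresses innermost-first, might look like this (a sketch; the names and the classification predicate are hypothetical, standing in for the dumpallocs/objdumpallocs tables):

```c
#include <stdbool.h>
#include <stddef.h>

/* pcs[] holds return addresses innermost-first, as a stack walk yields
   them. Take the first (innermost) classified site; stop early once we
   reach __outermost_allocsite. */
const void *pick_allocsite(const void *const *pcs, size_t n,
                           const void *outermost_allocsite,
                           bool (*is_classified)(const void *))
{
    for (size_t i = 0; i < n; i++) {
        if (is_classified(pcs[i])) return pcs[i];
        if (pcs[i] == outermost_allocsite) break; /* terminate the walk */
    }
    return NULL;
}

/* Demo predicate: pretend these two addresses are classified sites. */
bool demo_classified(const void *pc)
{
    return pc == (const void *)0x2000 || pc == (const void *)0x3000;
}
```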


stephenrkell commented on August 19, 2024

... then, if the performance hit is a problem, replace caller-side wrapping with binary instrumentation of such callers, performed lazily (on the first stack walk that traverses a frame of the named allocation function). Otherwise just scrap it.

That's still assuming we know the names of allocation functions, or at least of 'functions receiving a value with sizeofness'. If we do a more extensive sizeofness analysis, maybe at link time, we can perhaps eliminate that. It's a pretty standard call graph analysis, but has the usual problem of imprecision around indirect calls (for which we'd probably want to assume 'calls any type-compatible callee').

For the call graph analysis, we'd want a local summary of sizeofness sourced, sizeofness sunk, but also 'passed through', i.e. for any integer argument (or integer value read!), supposing it has sizeofness, which callee arg (or value written!) would receive that sizeofness?
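One possible shape for that local summary, sketched with argument bitmasks (all names hypothetical; a real analysis would also need to cover values read and written, not just arguments):

```c
/* Per-function summary: which args source or sink sizeofness locally,
   and which callee arg each arg is passed through to. */
#define MAXARGS 8
struct sizeofness_summary {
    unsigned sources;       /* bit i: arg i acquires sizeofness locally */
    unsigned sinks;         /* bit i: arg i's sizeofness is consumed here */
    int passthru[MAXARGS];  /* arg i flows to callee arg passthru[i], or -1 */
};

/* For one call edge: given the mask of caller args known to carry
   sizeofness, compute the mask of callee args that receive it. */
unsigned propagate_edge(unsigned caller_mask,
                        const struct sizeofness_summary *caller)
{
    unsigned have = caller_mask | caller->sources;
    unsigned callee_mask = 0;
    for (int i = 0; i < MAXARGS; i++)
        if ((have & (1u << i)) && caller->passthru[i] >= 0)
            callee_mask |= 1u << (unsigned)caller->passthru[i];
    return callee_mask;
}
```

A fixed-point iteration of propagate_edge over the call graph would then flow sizeofness from sources to sinks, with indirect calls conservatively fanning out to all type-compatible callees as suggested above.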

A tricky case would be a cross-DSO call into a malloc wrapper. Though at link time we would detect that a given UND symbol receives some sizeofness. Maybe that is enough. It seems to need a further run-time step, accounting for the link map....

