binary_type_inference's Issues

Add Tests that Test Properties of Sketches

Maintaining tests that use fixed type identifiers and expect concrete constraints has become impossible for larger test cases (i.e. mooosl). These tests are currently disabled; instead, we should check properties that we expect these constraints and sketches to satisfy.

Add support for inferring globals.

Inferring global types requires two major features. The VSA does not track pointers into the global region. Even if we had pointers into the global region, we would still need some sort of notion of a-locs (potentially, unless maybe we wanted to treat offsets universally). One solution would be to just consume Ghidra's globals.

Additionally, we would have to actually collect constraints about globals and solve them at some point. We could maybe do this during function aggregation, adding globals as additional interesting variables.

Polymorphic Parameter Binds in /bin/ls Create a Reflexive Edge

After binding a parameter in sub_00109050 in /bin/ls, the next parameter bind fails because of an assertion that type sketches should not create reflexive edges. This probably indicates a bug in binding the parameter previous to the one we fail on.

Bug in Retypd Proofs Related to Contravariant Rules Over Interesting Variables

x <= a
x.store <= y
y <= b

The retypd GitHub implementation prints the empty set when the interesting variables are [a, b].

But consider that an elementary proof exists for the subtyping constraint a.store <= b:

1. From x <= a and the existence of x.store, the proof rule InheritL (fields are inherited across a subtyping constraint) gives that a.store also exists.
2. From x <= a and the existence of a.store, since .store is contravariant, the contravariant field rule gives a.store <= x.store.
3. From a.store <= x.store and x.store <= y, transitivity gives a.store <= y.
4. From a.store <= y and y <= b, transitivity gives a.store <= b. QED

[Screenshot: proof graph]

The proof graph demonstrates that the proof connections exist; however, the initial variable is contravariant, and we only explore covariant paths (i.e. initial covariant, end covariant). Observationally, the paths a.store <= b and b <= a.store both exist, but only the first is a valid proof.

The first thing to check is whether the above is provable in normalized form. If it is, what is the graph representation of the normal-form proof, and why is it not represented? The issue appears to be with proofs that begin with a contravariant field rule.
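The derivation above can be mechanized as a tiny fixpoint over string-labeled constraints. This is only an illustrative sketch, not the BTI solver: the rule names and the string representation of derived type variables are simplified here.

```rust
// A minimal sketch mechanizing the elementary proof of a.store <= b
// from {x <= a, x.store <= y, y <= b}. All names are illustrative.
use std::collections::BTreeSet;

type Cons = (String, String); // (lhs, rhs) meaning lhs <= rhs

fn close(mut cons: BTreeSet<Cons>) -> BTreeSet<Cons> {
    loop {
        let mut new = BTreeSet::new();
        // Contravariant field rule: from x <= a derive a.store <= x.store,
        // provided x.store already appears somewhere (InheritL then gives a.store).
        for (x, a) in &cons {
            let xs = format!("{x}.store");
            if cons.iter().any(|(l, r)| l == &xs || r == &xs) {
                new.insert((format!("{a}.store"), xs));
            }
        }
        // Transitivity: from p <= q and q <= r derive p <= r.
        for (p, q) in &cons {
            for (q2, r) in &cons {
                if q == q2 {
                    new.insert((p.clone(), r.clone()));
                }
            }
        }
        let before = cons.len();
        cons.extend(new);
        if cons.len() == before {
            return cons;
        }
    }
}

fn main() {
    let cons: BTreeSet<Cons> = [("x", "a"), ("x.store", "y"), ("y", "b")]
        .iter()
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect();
    let closed = close(cons);
    // The covariant path a.store <= b is derivable ...
    assert!(closed.contains(&("a.store".to_string(), "b".to_string())));
    // ... but the reverse (the invalid proof direction) is not.
    assert!(!closed.contains(&("b".to_string(), "a.store".to_string())));
}
```

This makes the bug report checkable: any candidate fix to the proof graph should admit the first constraint and reject the second.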

Refactor Datalog to Allow for Unions.

Currently the datalog selects a single type for a node by ruling out other types. E.g. the pointer constraints consist of a store and a load suggesting the node is a pointer, then make sure there aren't field rules, in rules, out rules, or add rules. Instead, we should collect every possible type for a node: if there is exactly one candidate we get a single type; otherwise we introduce a union.
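A sketch of the proposed shape, in illustrative Rust rather than datalog (`NodeTy` and the candidate list are hypothetical, not the BTI representation):

```rust
// Instead of ruling candidate types out, collect them all and
// introduce a union when a node is ambiguous.
#[derive(Debug, PartialEq)]
enum NodeTy {
    Single(&'static str),
    Union(Vec<&'static str>),
    Unknown,
}

fn resolve(candidates: Vec<&'static str>) -> NodeTy {
    match candidates.as_slice() {
        [] => NodeTy::Unknown,
        [only] => NodeTy::Single(only),
        _ => NodeTy::Union(candidates),
    }
}

fn main() {
    assert_eq!(resolve(vec!["pointer"]), NodeTy::Single("pointer"));
    assert_eq!(resolve(vec!["pointer", "int"]), NodeTy::Union(vec!["pointer", "int"]));
    assert_eq!(resolve(vec![]), NodeTy::Unknown);
}
```

The point of the refactor is that the ambiguous case becomes a first-class `Union` value rather than a node with no selected type.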

Transitivity not propagating across S-Pointer rules.

a <= x.store
x.load <= c

Pretty obviously we should get a <= c here.

Formally S-Pointer gives us x.store <= x.load
Trans of (a <= x.store , x.store <= x.load) gives us a <= x.load
Trans of (a <= x.load, x.load <= c) gives us a <= c

This proof breaks down in our proof graph. This is related to the variance of stacks. We have to be careful to not break #6 while fixing this portion of the proof graph.

[Attachment: graphviz rendering of the proof graph]

The trick is that a should have an edge to x's covariant store node, which would then get everything to work out. But this edge was removed in #6.
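The missing derivation can be sketched as seeding the constraint set with the S-Pointer consequence v.store <= v.load for each pointer variable, then taking the transitive closure. This is an illustrative model only, not the proof-graph implementation:

```rust
// Derive a <= c from {a <= x.store, x.load <= c} via S-Pointer plus
// transitivity. Constraints are modeled as string pairs for illustration.
use std::collections::BTreeSet;

fn derive(
    mut cons: BTreeSet<(String, String)>,
    pointers: &[&str],
) -> BTreeSet<(String, String)> {
    for v in pointers {
        // S-Pointer: what a pointer stores must be usable at every load.
        cons.insert((format!("{v}.store"), format!("{v}.load")));
    }
    loop {
        let mut new = BTreeSet::new();
        // Transitivity: from p <= q and q <= r derive p <= r.
        for (p, q) in &cons {
            for (q2, r) in &cons {
                if q == q2 && p != r {
                    new.insert((p.clone(), r.clone()));
                }
            }
        }
        let before = cons.len();
        cons.extend(new);
        if cons.len() == before {
            return cons;
        }
    }
}

fn main() {
    let cons = [("a", "x.store"), ("x.load", "c")]
        .iter()
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect();
    let closed = derive(cons, &["x"]);
    assert!(closed.contains(&("a".to_string(), "c".to_string())));
}
```

Any fix to the proof graph should reproduce this closure without reintroducing the variance problems that #6 fixed.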

Formal params and actual returns do not preserve the appropriate subtyping relation in the case of multiple definitions for the returned or passed value

When accessing registers and possibly points to information for formal parameters the subtyping relationship goes in the wrong direction.

Ie. In assignment x=y
Where y is defined by a and b we get
a <= t
b <= t
t <= x

This works to get a passed up to x, since both must be subtypes (i.e. a <= t <= x). We use this same method for formals, but there it doesn't work:

We have sub.in_0 <= t, i.e. stating that the formal must be a subtype of its usages. So if the defines are a and b, we want to get sub.in_0 <= a and sub.in_0 <= b; that is, we want sub.in_0 <= t, t <= a, and t <= b. But because we use the same register-access method, we instead get sub.in_0 <= t, a <= t, and b <= t, and this subtyping relationship causes a and b to not constrain in_0 at all.

This can probably happen for the memaccess handling as well. E.g. imagine a parameter on the stack; we get
stack_obj <= x.load.stack_loc

sub.in_0 <= x.load.stack_loc. Once again, information on stack_obj does not constrain sub.in_0. Somewhat fortunately, x.load.stack_loc will still constrain it in this case.

We should write test cases for both of these. Is it the case that what should actually happen is x.load.stack_loc <= stack_obj?
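One possible shape for the fix is to make the definition-binding helper direction-aware. The sketch below is hypothetical (`bind` and `Dir` are not BTI APIs); it just makes the two orientations described above concrete:

```rust
// Relating a set of definitions to an access site needs a direction:
// for ordinary uses, definitions flow up into the use; for formal
// in-parameters, the formal flows down into each definition.
enum Dir {
    DefsIntoUse,
    FormalIntoDefs,
}

fn bind(defs: &[&str], site: &str, t: &str, dir: Dir) -> Vec<String> {
    match dir {
        // x = y with defs {a, b}:  a <= t, b <= t, t <= x
        Dir::DefsIntoUse => defs
            .iter()
            .map(|d| format!("{d} <= {t}"))
            .chain(std::iter::once(format!("{t} <= {site}")))
            .collect(),
        // formal in_0 with defs {a, b}:  in_0 <= t, t <= a, t <= b
        Dir::FormalIntoDefs => std::iter::once(format!("{site} <= {t}"))
            .chain(defs.iter().map(|d| format!("{t} <= {d}")))
            .collect(),
    }
}

fn main() {
    assert_eq!(
        bind(&["a", "b"], "x", "t", Dir::DefsIntoUse),
        vec!["a <= t", "b <= t", "t <= x"]
    );
    assert_eq!(
        bind(&["a", "b"], "sub.in_0", "t", Dir::FormalIntoDefs),
        vec!["sub.in_0 <= t", "t <= a", "t <= b"]
    );
}
```

The second case is the orientation the issue argues we currently fail to emit for formals.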

Property Tests for Polymorphic formals/actuals

Property tests are fairly easy to construct for the properties we want to hold for polymorphic parameter bindings. The idea is to generate sketch graphs that use other sketches. Then we check that every word in the language of the formal in-parameter is in the language of the actual in-parameter; for outs, we check that all words in the actual out are in the formal out.
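With sampled word sets standing in for the sketch languages, the two properties reduce to plain inclusion checks. This is a sketch only; the helper names and the word representation are hypothetical:

```rust
// The two inclusion properties for polymorphic parameter bindings,
// over finite word samples drawn from the sketch languages.
use std::collections::BTreeSet;

// Every word of the formal in-parameter must appear in the actual in.
fn in_param_ok(formal_in: &BTreeSet<&str>, actual_in: &BTreeSet<&str>) -> bool {
    formal_in.is_subset(actual_in)
}

// Every word of the actual out must appear in the formal out.
fn out_param_ok(formal_out: &BTreeSet<&str>, actual_out: &BTreeSet<&str>) -> bool {
    actual_out.is_subset(formal_out)
}

fn main() {
    let formal: BTreeSet<&str> = ["in.load"].into_iter().collect();
    let actual: BTreeSet<&str> = ["in.load", "in.load.@0"].into_iter().collect();
    // The actual accepts at least everything the formal does.
    assert!(in_param_ok(&formal, &actual));
    // Treating the same sets as outs fails: the actual has extra words.
    assert!(!out_param_ok(&formal, &actual));
}
```

A generator for random sketch graphs would then feed sampled languages into these checks.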

Handle Callsite Specific Extern Argument Constraints

extern int additional_compute(int val);

void compute(int* ptr) {
    *ptr = additional_compute(*ptr);
}

Ghidra's definition for the stub additional_compute has no arguments. This means the type inference constraint generation does not generate any actual -> formal argument constraints for the call to additional_compute. The decompiler does correctly guess that a single argument is passed to additional_compute in compute, so there should be some way to extract these constraints. The high function, however, does not contain info about where the argument is located (top of the stack in the case of x86), which would be required. Backlogging this because this only causes missed info if we have a type signature for the extern; if we have a type signature, then the parameter will be available in Ghidra.

Move subprocedure locators within BTI

So currently we use the decompiler's guesses of call parameters and returns for actual calls, and then guessed params and returns for actual rets. We should analyze the locators ourselves with more clever analyses. For actual calls, the rets would be reaching definitions from the callee that are used.

Generate Simplified Constraint Sets for Functions

Currently everything is solved in one pass. In the retypd paper, they first create a transducer over the type variables of functions to produce function summary signatures, then use those to perform an interprocedural analysis efficiently.

Maintain Knowledge of Shared Subgraphs for Lowering

Currently, after sketches are labeled there is no record of which nodes in a sketch correspond to nodes in the original constraint graph. Ideally, we would maintain some way to identify shared subgraphs between sketches; the idea being that we should build a dependency graph for lowering shared subgraphs, where a shared sketch subgraph is represented by a single C type.

Use addcons to prove relationship between input and output params

These constraints represent a minimized set of constraints needed to prove (approximately) that sub_001014fb.in_0.load <= sub_001014fb.out

The constraints are organized so that the transitive property just flows

Relating τ320 to the out parameter:

τ320.load.σ64@0 ⊑ instr_00101544_0_$Uc000
instr_00101544_0_$Uc000 ⊑ instr_00101547_1_$Uc000
instr_00101547_1_$Uc000 ⊑ instr_00101434_0_RBP.store.σ64@-16
instr_00101434_0_RBP.store.σ64@-16 ⊑ sub_001014fb@RSP.σ64@-24
sub_001014fb@RSP.σ64@-24 ⊑ instr_00101434_0_RBP.load.σ64@-16
instr_00101434_0_RBP.load.σ64@-16 ⊑ instr_00101587_1_$Uc000
instr_00101587_1_$Uc000 ⊑ instr_00101587_2_RAX
instr_00101587_2_RAX ⊑ sub_001014fb.out




sub_001014fb.in_0 ⊑ instr_001016bd_0_RDI
sub_001014fb.in_0 ⊑ instr_0010176e_0_RDI
instr_001016bd_0_RDI ⊑ τ313
instr_0010176e_0_RDI ⊑ τ313
τ313 ⊑ instr_00101500_0_RBP.store.σ64@-24
instr_00101500_0_RBP.store.σ64@-24 ⊑ sub_001014fb@RSP.σ64@-32
sub_001014fb@RSP.σ64@-32 ⊑ instr_00101434_0_RBP.load.σ64@-24
instr_00101434_0_RBP.load.σ64@-24 ⊑ instr_0010153d_1_$Uc000


Add ties it together
Add(instr_0010153d_1_$Uc000,instr_00101535_1_RDX,τ320)

The point is we aren't currently handling add constraints, but the add constraint is the only thing currently tying the in parameter to the out parameter. We need to apply these inference rules when inspecting the sketch graph:
[Screenshot: inference rules for add constraints]

It's tricky because this has to happen before simplification, since simplification removes the relation; hopefully sketch building on unsimplified constraints just works™.

Examine implications of not considering the original type when using callsite types for concrete refinement

In dca9410 I removed intersecting/unioning callsite types with the original representation. This decision was made because the sketch union of out parameters would result in a loss of information.

I.e. a type like:

sub_id.in <= sub_id.out
sub_id:callsite_0.out.field_at_0 <= int
sub_id:callsite_1.out.field_at_4 <= float

Would discard the structure because the identity function has no language intersection with the original type. We should more carefully examine the implications of this decision.

Property Tests for Transducer Simplification

Need to look into the complexity theory of this, but I think it should be possible to generate a weighted pushdown system whose stack configurations are recognized by some regular language L; i.e. given an input language L, generate a PDS with stack configurations equal to L. If we can do that, then we can generate the set of constraints represented by the PDS and compute the FSA. Then we can check whether the FSA recognizes things outside the language L, or that it recognizes things inside the language L.

If we can't get equality on the stack configurations, we could also get weaker checking by constructing a PDS whose stack configurations are at least guaranteed to be recognized by L, or guaranteed to contain all words in L.

When creating polymorphic sketches, parent node labels should not be copied to the child.

let in_params = orig_repr
    .quotient_graph
    .get_node_mapping()
    .iter()
    .map(|(dtv, _idx)| dtv.clone())
    .filter(|dtv| dtv.get_base_variable().get_cs_tag().is_none() && dtv.is_in_parameter());

for dtv in in_params.collect::<Vec<DerivedTypeVar>>() {
    self.refine_formal_in(condensed, &mut orig_repr, dtv, target_idx);
}

let out_params = orig_repr
    .quotient_graph
    .get_node_mapping()
    .iter()
    .map(|(dtv, _idx)| dtv.clone())
    .filter(|dtv| dtv.get_base_variable().get_cs_tag().is_none() && dtv.is_out_parameter());

for dtv in out_params.collect::<Vec<DerivedTypeVar>>() {
    self.refine_formal_out(condensed, &mut orig_repr, dtv, target_idx);
}

So the code above has a bug. When we refine the formal in-parameters right now, that will copy in information from the parents, including nodes labeled by the parents' dtvs. Then the parent dtvs (which aren't callsite-tagged and so are in the namespace for this sketch) and the out params from the parent will be included in the child's out params, which is not correct.

This could be solved by collecting all dtvs before beginning refinement, but that solves a symptom, not the cause. We shouldn't copy dtvs from the refinement at all; the labels should come from the type being refined.

Potential unsound inference of store capabilities due to directionality

In 517dd34 I added directionality to stores.

The idea here was that if we observe a store of some type variable t to a pointer p, and p points to the abstract object a, then p.store ⊑ a, because we could have stored to p and therefore must be compatible. This somewhat falls over because really we would want a must-analysis here to guarantee that along some path we point to a; otherwise p.store may not have to be a subtype. Ignoring that issue for the moment: directionality made the following claim hold: if pointer z reaches p, then we should also conclude z.store ⊑ a.

This reasoning is fairly reasonable but has the undesirable consequence of the overly strong conclusion p ⊑ z. Given contravariance on store capabilities, we can then derive z.store ⊑ p.store. This says that any pointer defining a stored-to pointer must always store a subtype. That's undesirable: imagine a situation like

void* i;
void* z;
void* p = z;
*p = open();

z = i;
*z = 1;

In this case we get z.store ⊑ fd even though z stores both an int and an fd, so in an ideal world our type here would be int ⊑ z.store. I need to think more about this. I'm assuming the reason this was initially done was to handle how locs are generated (i.e. through address computations). Something like:

u=INT_ADD rsp, 8
STORE x, u

and without flipping the direction we get

rsp_defsite.+8 ⊑ unique
unique  ⊑ t
t.store ⊑ stack.at_8_8

with that we get to conclude:

t.store ⊑ unique.store
unique.store ⊑ rsp_defsite.+8.store

but that seems fine to me

Handle Compound Variables Directly During Constraint Generation

Currently, compound out-parameters are handled by creating a fake structure with a field for each storage location during lowering. This obviously isn't ideal. We should output compound variables as such in the CWE checker IR, then in constraint generation produce constraints that reflect the compound out or in parameter. Changing these as-is would break CWE checker abstract interpretation in a lot of ways. We can perhaps work around this by not touching their argument definition, which is primarily used for externals, and instead emitting our own argument definition specifically for use in type constraint generation.

Deduplicate type variable strings

There's probably a lot of string duplication due to holding owned strings in type variables; type vars should probably hold Rcs or something for the string.
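A minimal sketch of one approach, assuming type variables currently own their `String`s: intern each distinct name once behind an `Rc<str>` so that clones only bump a reference count (the `Interner` type here is illustrative, not existing BTI code):

```rust
// Intern type-variable names so duplicates share one allocation.
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Default)]
struct Interner(HashMap<String, Rc<str>>);

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(rc) = self.0.get(s) {
            return Rc::clone(rc);
        }
        let rc: Rc<str> = Rc::from(s);
        self.0.insert(s.to_string(), Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("sub_001014fb.in_0");
    let b = interner.intern("sub_001014fb.in_0");
    // Both handles point at the same allocation.
    assert!(Rc::ptr_eq(&a, &b));
}
```

If type variables cross threads, `Arc<str>` is the drop-in alternative.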

Handle Constant Adds Better

Constant addition prior to a load, i.e. a.load.+x, can be viewed as an access to a field. Currently we handle this by pushing adds through loads in the sketch graph, but this doesn't allow the property to be used between uninteresting variables in proofs.

Ideally we'd have some sort of proof rule in the FSA:

x \in Z /\ Var a.load.+x /\  Var b /\  a.load.+x <= b
-------------------------------------------------------
a.load <= b.field_at.+x
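As a sketch of the rewrite side of this rule, treating derived type variables as plain strings (`rewrite_const_add` is a hypothetical helper, and `field_at` follows the notation in the rule above):

```rust
// Rewrite a constraint a.load.+x <= b (with x a constant) into
// a.load <= b.field_at.+x, per the proposed proof rule.
fn rewrite_const_add(lhs: &str, rhs: &str) -> Option<(String, String)> {
    // Split off a trailing ".+<offset>" capability, if any.
    let (base, off) = lhs.rsplit_once(".+")?;
    // Only constant offsets qualify (x ∈ Z in the rule's premise).
    off.parse::<i64>().ok()?;
    base.ends_with(".load")
        .then(|| (base.to_string(), format!("{rhs}.field_at.+{off}")))
}

fn main() {
    assert_eq!(
        rewrite_const_add("a.load.+8", "b"),
        Some(("a.load".to_string(), "b.field_at.+8".to_string()))
    );
    // No constant add present: nothing to rewrite.
    assert_eq!(rewrite_const_add("a.load", "b"), None);
}
```

In the FSA itself this would presumably be an edge transformation rather than a string rewrite, but the premise and conclusion match the rule above.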

Handle unbounded points to information more soundly

The question here is how to handle writes to an abstract location that are unbounded. Currently, for an abstract object named RSI we emit RSI <= thing.load.@0sigma_size. This approach doesn't work well if some writes are bounded: then we will get RSI@0sigma_size <= thing.load.@0sigma_size. If the program is type-safe with respect to our model, these two types must be equal, but that's going to result in a reflexive edge with a field to the abstract object.

Ideas of how to handle this:

  • If we have an unbounded write/read to an abstract object, every reference to that object should be unified.

Document Public API

#![warn(missing_docs)] was added to the library aspirationally. To do #2 we need to fix these lint warnings.

Handle variadic functions

Currently, we don't bind up information to the variadic in-parameters of a function, because we look at the function's formals to bind up the actuals. We should query Ghidra or use our own analysis to figure out how to bind these.

Implement Function Cloning for Polymorphism

In mooosl several pointers are getting mis-unified; i.e. byte buffers are unified with pointers to nodes. This is because both of these pointers reach free's param 1 and are returned from calloc. These pointer types are then unified, causing mistyping of byte buffers. Solving this requires callsite tagging of formal parameters in order to support polymorphic types for functions such as free and calloc.

Long processing time of /bin/ls

Here are the collected files, including the ls binary:

$ unzip -d /tmp/test-ls-bti ls-bti.zip
Archive:  ls-bti.zip
   creating: /tmp/test-ls-bti/1655908655320-0/
  inflating: /tmp/test-ls-bti/1655908655320-0/interesting_tids.pb
  inflating: /tmp/test-ls-bti/1655908655320-0/ir.json
  inflating: /tmp/test-ls-bti/1655908655320-0/lattice.json
  inflating: /tmp/test-ls-bti/1655908655320-0/additional_constraints.pb
  inflating: /tmp/test-ls-bti/ls

ls-bti.zip

Zexts/Sexts should be included in type constraints.

Currently we ignore sign-extended and zero-extended copies and do not emit type constraints for them. We should infer from this behavior that the target is integral and match up the extended portion in the constraint. This issue is backlogged until the solver is implemented.

Allow sharing type structure within global variables.

Currently global type inference works in the following way:

After binding polymorphic types, we visit all sketch graphs looking for global variables referenced in SCCs.

For each variable, we collect the intersection of the global variable's subsketch each time it appears. This has the effect of concretizing the global type into one unified sketch. These unified sketches are then placed back into the SCC sketches so that parameter aliasing can use this updated information.

Later we create a graph for each global in the global type graph (bad name; it's really the type graph of all SCCs). Each node that used to point to the global gets pointed to this new subgraph. This strategy doesn't allow sharing structure between globals that share a structure type. To do that, we need to alias uses of a global in a parent global to the child global's representation.

Don't lose observed parameter information

If during constraint generation we get something like

sub_x.in <= tau, and we then lose track of that variable (it isn't related to any other variables), we will get no constraints on that variable, and we won't observe that the constraint exists when building sketches. This means parameter types will be missing. We could transparently add sub_x.in <= top or bot <= sub_x.in to force the var constraints to be observed in the sketch, giving .in as a node with [bottom, top] as the type. There may be some subtlety here with contravariant types.

I'm not sure I love adding these fake constraints on behalf of the user. On the other hand, maybe the user who is doing constraint generation can make a decision about how they are going to track recovered param types.

Use path from representing dtv in sketches as identifier for nodes.

The most reliable way to find a node after graph manipulations is by following the path of edges in the original sketch. Since sketches are prefix-closed, if the node still exists it will always still be at that path. Currently this is leveraged in a bunch of places, but the mapping graph still identifies nodes by indices, etc. Maybe we can share some infrastructure related to node identification utilizing graph paths.
