trailofbits / binary_type_inference
License: GNU General Public License v3.0
Maintaining tests that use fixed type identifiers and expect concrete constraints has become impossible for larger test cases (e.g. mooosl). These tests are currently disabled; instead, we should check properties that we expect of these constraints and sketches.
Inferring global types requires two major features. The VSA does not track pointers into the global region. Even if we had pointers into the global region, we would still need some notion of a-locs (potentially, unless we wanted to treat offsets universally). One solution would be to just consume Ghidra's globals.
Additionally, we would have to actually collect constraints about globals and solve them at some point. We could maybe do this during function aggregation and just add globals as additional interesting variables.
Currently the get_intermediate_representation_for_reader utility function does not support loading a bare-metal binary. This capability will be required for embedded device analysis.
After binding a parameter in sub_00109050 in /bin/ls, the next parameter bind fails because of an assertion that type sketches should not create reflexive edges. This probably indicates a bug in binding the parameter prior to the one we fail on.
x <= a
x.store <= y
y <= b
The retypd GitHub implementation prints the empty set when the interesting variables are [a, b].
But consider that an elementary proof exists for the subtyping constraint a.store <= b:
From x <= a and the existence of x.store, the rule InheritL (subtypes inherit fields to the left of a subtyping constraint) gives that a.store exists.
From x <= a and the existence of a.store: .store is contravariant, so by the contravariant field rule a.store <= x.store.
From a.store <= x.store and x.store <= y, transitivity gives a.store <= y.
From a.store <= y and y <= b, transitivity gives a.store <= b. QED
The proof graph demonstrates that the proof connections exist; however, the initial variable is contravariant. We only explore covariant paths (i.e. initial covariant, end covariant). Observationally, the paths a.store <= b and b <= a.store both exist, but only the first path is a valid proof.
The first thing to check is whether the above is provable in normalized form. If it is, what is the graph representation of the normal-form proof, and why is it not represented? The issue appears to be with proofs that begin with a contravariant field rule.
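The elementary proof above can be checked mechanically. Below is a minimal saturation sketch (this is illustrative only, not the retypd implementation; variable names follow the issue) that applies the InheritL-guarded contravariant .store rule plus transitivity, and confirms that a.store <= b is derivable while the reverse b <= a.store is not:

```rust
use std::collections::HashSet;

type Cons = HashSet<(String, String)>;

// Is a derived type variable mentioned anywhere in the constraint set?
fn mentioned(cons: &Cons, v: &str) -> bool {
    cons.iter().any(|(l, r)| l == v || r == v)
}

// Saturate a toy constraint set under transitivity plus the contravariant
// .store field rule: x <= a and the existence of x.store imply
// a.store <= x.store (the guard on x.store keeps this terminating).
fn saturate(mut cons: Cons) -> Cons {
    loop {
        let mut new = Vec::new();
        for (x, a) in &cons {
            // InheritL + contravariance: only fire when x.store exists.
            if mentioned(&cons, &format!("{x}.store")) {
                new.push((format!("{a}.store"), format!("{x}.store")));
            }
        }
        // Transitivity.
        for (p, q) in &cons {
            for (q2, r) in &cons {
                if q == q2 {
                    new.push((p.clone(), r.clone()));
                }
            }
        }
        let before = cons.len();
        cons.extend(new);
        if cons.len() == before {
            return cons;
        }
    }
}

fn main() {
    let cons: Cons = [("x", "a"), ("x.store", "y"), ("y", "b")]
        .into_iter()
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect();
    let sat = saturate(cons);
    // The elementary proof: a.store <= x.store <= y <= b.
    assert!(sat.contains(&("a.store".to_string(), "b".to_string())));
    // The reverse direction is not derivable: only one path is a proof.
    assert!(!sat.contains(&("b".to_string(), "a.store".to_string())));
}
```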
Currently we infer nothing, but in one of our tests the only way to infer that variable x is an integer is to observe that it is tested for equality against an integer... should we unify across compares?
Currently the datalog selects a single type for a node by ruling out other types. I.e. the pointer constraints consist of a store and a load, suggesting the node is a pointer, and then make sure there aren't field rules, in rules, out rules, or add rules. Instead, we should collect every possible type for a node. If there is a single candidate we get a single type; otherwise we introduce a union.
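A minimal sketch of the proposed behavior (the type names and evidence predicates here are illustrative stand-ins, not the actual datalog relations): collect every candidate the evidence supports and only union when there is more than one, instead of ruling the others out:

```rust
// Illustrative candidate C types; the real lattice is richer.
#[derive(Debug, Clone, PartialEq)]
enum CType {
    Pointer,
    Integer,
    Union(Vec<CType>),
}

// Collect every type the evidence supports rather than selecting one by
// ruling the others out. Evidence flags are hypothetical simplifications of
// the load/store and add rules mentioned above.
fn infer(has_load_or_store: bool, has_add: bool) -> Option<CType> {
    let mut candidates = Vec::new();
    if has_load_or_store {
        candidates.push(CType::Pointer);
    }
    if has_add {
        candidates.push(CType::Integer);
    }
    match candidates.len() {
        0 => None,
        1 => candidates.pop(),
        _ => Some(CType::Union(candidates)),
    }
}

fn main() {
    assert_eq!(infer(true, false), Some(CType::Pointer));
    // A node with both pointer and integer evidence becomes a union rather
    // than having one candidate ruled out.
    assert_eq!(
        infer(true, true),
        Some(CType::Union(vec![CType::Pointer, CType::Integer]))
    );
}
```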
The current terminology is confusing. The GlobalGraph represents all types, while the graph of globals only represents the types of globals. The supergraph should represent all types to avoid confusion.
This would allow for more precise inference and attack trailofbits/BTIGhidra#19 by including the structure of the callee into the caller FSA inference.
a <= x.store
x.load <= c
Pretty obviously we should get a <= c here.
Formally S-Pointer gives us x.store <= x.load
Trans of (a <= x.store , x.store <= x.load) gives us a <= x.load
Trans of (a <= x.load, x.load <= c) gives us a <= c
This proof breaks down in our proof graph. This is related to the variance of stacks. We have to be careful to not break #6 while fixing this portion of the proof graph.
The trick is that a should have an edge to the covariant x store node, which would then get everything to work out. But this edge was removed in #6.
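As a sanity check on the derivation above, here is a small saturation sketch (illustrative only, not the proof-graph machinery) that adds S-Pointer edges v.store <= v.load whenever both capabilities appear, closes under transitivity, and derives a <= c:

```rust
use std::collections::HashSet;

type Cons = HashSet<(String, String)>;

// Saturate a toy constraint set under transitivity plus S-Pointer
// (v.store <= v.load whenever both capabilities are mentioned).
fn saturate(mut cons: Cons) -> Cons {
    loop {
        let mut new = Vec::new();
        // S-Pointer: collect every v such that v.store appears, then add
        // v.store <= v.load if v.load also appears.
        let vars: Vec<String> = cons
            .iter()
            .flat_map(|(l, r)| [l.clone(), r.clone()])
            .filter_map(|t| t.strip_suffix(".store").map(str::to_string))
            .collect();
        for v in vars {
            let load = format!("{v}.load");
            if cons.iter().any(|(l, r)| *l == load || *r == load) {
                new.push((format!("{v}.store"), load));
            }
        }
        // Transitivity.
        for (p, q) in &cons {
            for (q2, r) in &cons {
                if q == q2 {
                    new.push((p.clone(), r.clone()));
                }
            }
        }
        let before = cons.len();
        cons.extend(new);
        if cons.len() == before {
            return cons;
        }
    }
}

fn main() {
    let cons: Cons = [("a", "x.store"), ("x.load", "c")]
        .into_iter()
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect();
    let sat = saturate(cons);
    // S-Pointer gives x.store <= x.load; two transitivity steps give a <= c.
    assert!(sat.contains(&("a".to_string(), "c".to_string())));
}
```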
When accessing registers, and possibly points-to information, for formal parameters, the subtyping relationship goes in the wrong direction.
I.e. in the assignment x = y, where y is defined by a and b, we get:
a <= t
b <= t
t <= x
This works to get a passed up to x, since both must be subtypes (i.e. a <= t <= x). We are using this same method for formals, but it doesn't work:
We have sub.in_0 <= t, i.e. stating that the formal must be a subtype of its usages, so if the defines are a and b we want to get sub.in_0 <= a and sub.in_0 <= b.
So we want sub.in_0 <= t, t <= a, and t <= b, but we use the same register-access method. So we get (sub.in_0 <= t, a <= t, b <= t), and the subtyping relationship causes a and b to not constrain in_0 at all.
This can probably happen for the memaccess constraints as well. I.e. imagine a parameter on the stack; we get:
stack_obj <= x.load.stack_loc
sub.in_0 <= x.load.stack_loc
Once again, information on stack_obj does not constrain sub.in_0. Somewhat fortunately, x.load.stack_loc will still constrain it in this case.
We should write test cases for both of these. Is it the case that what should actually happen is x.load.stack_loc <= stack_obj?
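A transitive-closure toy model (names taken from the example above; this is not the real constraint solver) makes the directionality problem concrete: with the register-access orientation, a and b never relate to sub.in_0, while the flipped orientation gives the bindings we want:

```rust
use std::collections::HashSet;

// Transitive closure over a toy subtyping relation.
fn closure(
    mut c: HashSet<(&'static str, &'static str)>,
) -> HashSet<(&'static str, &'static str)> {
    loop {
        let mut new = Vec::new();
        for (p, q) in &c {
            for (q2, r) in &c {
                if q == q2 {
                    new.push((*p, *r));
                }
            }
        }
        let before = c.len();
        c.extend(new);
        if c.len() == before {
            return c;
        }
    }
}

fn main() {
    // Current (register-access) orientation: everything flows *into* t, so
    // nothing ever relates sub.in_0 to the defining variables a and b.
    let wrong = closure([("a", "t"), ("b", "t"), ("sub.in_0", "t")].into());
    assert!(!wrong.contains(&("sub.in_0", "a")));
    // Desired orientation for formals: t flows out to each definition, so
    // the formal constrains both uses.
    let right = closure([("t", "a"), ("t", "b"), ("sub.in_0", "t")].into());
    assert!(right.contains(&("sub.in_0", "a")));
    assert!(right.contains(&("sub.in_0", "b")));
}
```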
Property tests are fairly easy to construct for testing properties we want to hold for polymorphic parameter bindings. The idea is to generate sketch graphs that use other sketches. Then we check that every word in the language of the formal in-parameter is in the language of the actual in-parameter. For outs, we check that all the words in the actual out are in the formal out.
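A sketch of the property itself (the Sketch representation and labels here are illustrative stand-ins for the crate's real types): approximate a sketch as an edge-labeled graph, enumerate bounded-length words of one language, and check membership in the other:

```rust
use std::collections::HashMap;

// A sketch approximated as an edge-labeled graph rooted at "root". Its
// language is the set of label paths from the root.
type Sketch = HashMap<&'static str, Vec<(&'static str, &'static str)>>;

// Does the sketch accept this word of field labels?
fn accepts(s: &Sketch, word: &[&str]) -> bool {
    let mut node = "root";
    for lbl in word {
        match s.get(node).and_then(|es| es.iter().find(|(l, _)| l == lbl)) {
            Some((_, next)) => node = *next,
            None => return false,
        }
    }
    true
}

// All words of length <= depth in the language of s.
fn words(s: &Sketch, depth: usize) -> Vec<Vec<&'static str>> {
    let mut out = vec![Vec::new()];
    let mut frontier = vec![("root", Vec::new())];
    for _ in 0..depth {
        let mut next_frontier = Vec::new();
        for (node, path) in frontier {
            for (l, n) in s.get(node).into_iter().flatten() {
                let mut p = path.clone();
                p.push(*l);
                out.push(p.clone());
                next_frontier.push((*n, p));
            }
        }
        frontier = next_frontier;
    }
    out
}

fn main() {
    let formal_in: Sketch = HashMap::from([
        ("root", vec![("load", "p")]),
        ("p", vec![("σ64@0", "leaf")]),
    ]);
    let actual_in: Sketch = HashMap::from([
        ("root", vec![("load", "p")]),
        ("p", vec![("σ64@0", "l0"), ("σ64@8", "l1")]),
    ]);
    // Property for in-parameters: every word of the formal's language is a
    // word of the actual's language. A proptest-style generator would
    // replace the fixed sketches here.
    assert!(words(&formal_in, 4).iter().all(|w| accepts(&actual_in, w)));
    // The converse fails: the actual has σ64@8, which the formal lacks.
    assert!(!words(&actual_in, 4).iter().all(|w| accepts(&formal_in, w)));
}
```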
extern int additional_compute(int val);
void compute(int* ptr) {
*ptr = additional_compute(*ptr);
}
Ghidra's definition for the stub additional_compute has no arguments. This means the type inference constraint generation does not generate any actual -> formal argument constraints for the call to additional_compute. The decompiler does correctly guess that a single argument is passed to additional_compute in compute, so there should be some way to extract these constraints. The high function, however, does not contain info about where the argument is located (top of the stack in the case of x86), which would be required. Backlogging this because it only causes missed info if we do not have a type signature for the extern; if we have a type signature, the parameter will be available in Ghidra.
So currently we use the decompiler's guesses of call parameters and returns for actual calls, and guessed params and returns for actual rets. We should analyze the locators ourselves with more clever analyses. For actual calls, the rets would be the reaching definitions from the callee that are used.
Currently everything is solved in one pass. In the retypd paper, they first create a transducer over the type variables of functions to create function summary signatures, then use those to perform an interprocedural analysis efficiently.
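The summary-first phase ordering can be sketched as a reverse-topological walk of the call-graph condensation (the graph shape and names below are hypothetical; the real transducer construction is far more involved than the ordering itself):

```rust
use std::collections::HashMap;

// Summarize callees before callers by walking call-graph SCCs in reverse
// topological order. scc_deps maps an SCC to the SCCs it calls; the
// condensation is assumed to be precomputed, so the dependency graph is
// acyclic.
fn summary_order(
    scc_deps: &HashMap<&'static str, Vec<&'static str>>,
    roots: &[&'static str],
) -> Vec<&'static str> {
    fn visit(
        n: &'static str,
        deps: &HashMap<&'static str, Vec<&'static str>>,
        done: &mut Vec<&'static str>,
    ) {
        if done.contains(&n) {
            return;
        }
        for callee in deps.get(n).into_iter().flatten() {
            visit(callee, deps, done);
        }
        done.push(n);
    }
    let mut done = Vec::new();
    for r in roots {
        visit(r, scc_deps, &mut done);
    }
    done
}

fn main() {
    // main calls helper; helper calls leaf. Summaries are built leaf-first,
    // so each caller can instantiate its callees' signatures.
    let deps = HashMap::from([("main", vec!["helper"]), ("helper", vec!["leaf"])]);
    assert_eq!(summary_order(&deps, &["main"]), vec!["leaf", "helper", "main"]);
}
```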
Currently, after sketches are labeled, there is no knowledge of which nodes in a sketch correspond to nodes in the original constraint graph. Ideally, we would maintain some way to identify shared subgraphs between sketches, the idea being that we should build a dependency graph for lowering shared subgraphs, where a shared sketch subgraph is represented by a single C type.
These constraints represent a minimized set of constraints needed to prove (approximately) that sub_001014fb.in_0.load <= sub_001014fb.out
The constraints are organized so that the transitive property just flows
Relating τ320 to the out parameter:
τ320.load.σ64@0 ⊑ instr_00101544_0_$Uc000
instr_00101544_0_$Uc000 ⊑ instr_00101547_1_$Uc000
instr_00101547_1_$Uc000 ⊑ instr_00101434_0_RBP.store.σ64@-16
instr_00101434_0_RBP.store.σ64@-16 ⊑ sub_001014fb@RSP.σ64@-24
sub_001014fb@RSP.σ64@-24 ⊑ instr_00101434_0_RBP.load.σ64@-16
instr_00101434_0_RBP.load.σ64@-16 ⊑ instr_00101587_1_$Uc000
instr_00101587_1_$Uc000 ⊑ instr_00101587_2_RAX
instr_00101587_2_RAX ⊑ sub_001014fb.out
sub_001014fb.in_0 ⊑ instr_001016bd_0_RDI
sub_001014fb.in_0 ⊑ instr_0010176e_0_RDI
instr_001016bd_0_RDI ⊑ τ313
instr_0010176e_0_RDI ⊑ τ313
τ313 ⊑ instr_00101500_0_RBP.store.σ64@-24
instr_00101500_0_RBP.store.σ64@-24 ⊑ sub_001014fb@RSP.σ64@-32
sub_001014fb@RSP.σ64@-32 ⊑ instr_00101434_0_RBP.load.σ64@-24
instr_00101434_0_RBP.load.σ64@-24 ⊑ instr_0010153d_1_$Uc000
Add ties it together
Add(instr_0010153d_1_$Uc000,instr_00101535_1_RDX,τ320)
The point is we aren't currently handling add constraints, but the add constraint is the only thing currently tying the in parameter to the out parameter. We need to apply these inference rules when inspecting the sketch graph:
It's tricky because this has to happen before simplification, since we simplify out the relation; hopefully sketch building on unsimplified cons just works tm :)
In dca9410 I removed intersecting/unioning callsite types with the original representation. This decision was made because the sketch union of out parameters would result in a loss of information.
I.e. a type like:
sub_id.in <= sub_id.out
sub_id:callsite_0.out.field_at_0 <= int
sub_id:callsite_1.out.field_at_4 <= float
Would discard the structure because the identity function has no language intersection with the original type. We should more carefully examine the implications of this decision.
Need to look into the complexity theory of this, but I think it should be possible to generate a weighted pushdown system whose stack configurations are recognized by some regular language L; i.e. given an input language L, generate a PDS with stack configurations equal to L. If we can do that, then we can generate the set of constraints represented by the PDS and compute the FSA. Then we can check whether the FSA recognizes things outside the language L, or that it recognizes everything inside the language L.
If we can't get equality on the stack configurations, we could also get weaker checking by constructing a PDS whose configurations are at least guaranteed to be recognized by L, or guaranteed to contain all words in L.
let in_params = orig_repr
.quotient_graph
.get_node_mapping()
.iter()
.map(|(dtv, _idx)| dtv.clone())
.filter(|dtv| dtv.get_base_variable().get_cs_tag().is_none() && dtv.is_in_parameter());
for dtv in in_params.collect::<Vec<DerivedTypeVar>>() {
self.refine_formal_in(condensed, &mut orig_repr, dtv, target_idx);
}
let out_params = orig_repr
.quotient_graph
.get_node_mapping()
.iter()
.map(|(dtv, _idx)| dtv.clone())
.filter(|dtv| dtv.get_base_variable().get_cs_tag().is_none() && dtv.is_out_parameter());
for dtv in out_params.collect::<Vec<DerivedTypeVar>>() {
self.refine_formal_out(condensed, &mut orig_repr, dtv, target_idx);
}
So the code above has a bug. When we refine the formal in-parameters, that will copy in information from the parents, including nodes labeled by the parents' dtvs. Then the parent dtvs (which aren't callsite tagged) are in the namespace for this sketch, and the out-params from the parent will be included in the child's out-params, which is not correct.
This could be solved by collecting all dtvs before beginning refinement, but that addresses a symptom and not the cause. We shouldn't copy dtvs from the refinement at all; the labels should come from the type being refined.
In 517dd34 I added directionality to stores. The idea here was that if we observe a store of some type variable t to a pointer p, and p points to the abstract object a, then p.store ⊑ a, because we could have stored to p and therefore must be compatible. This kinda falls over because really we would want a must analysis here to guarantee that along some path we point to a; otherwise p.store may not have to be a subtype. Ignoring that issue for the moment: directionality made the following claim hold: if pointer z reaches p, then we should also conclude z.store ⊑ a.
This reasoning is fairly reasonable, but it has the undesirable consequence of also yielding the overly strong p ⊑ z. Given contravariance on store capabilities, we can then derive z.store ⊑ p.store. This says that any pointer defining a stored-to pointer must always store a subtype. That's undesirable, because imagine some sort of situation like:
void example(void) {
    void* i;
    void* z;
    void* p = z;
    *p = open();
    z = i;
    *z = 1;
}
In this case we get z.store ⊑ fd when z stores both int and fd, so in an ideal world our type here is int ⊑ z.store. I need to think more about this. I'm assuming the reason this was initially done was to handle how a-locs are generated (i.e. through address computations). Something like:
u=INT_ADD rsp, 8
STORE x, u
and without flipping the direction we get
rsp_defsite.+8 ⊑ unique
unique ⊑ t
t.store ⊑ stack.at_8_8
with that we get to conclude:
t.store ⊑ unique.store
unique.store ⊑ rsp_defsite.+8.store
but that seems fine to me
Lots of shared structure; I think Patricia sets would be good here. Might be part of the perf issue.
Currently, compound out-parameters are handled by creating a fake structure with a field for each storage location during lowering. This obviously isn't ideal. We should output compound variables as such in the CWE checker IR, then in constraint generation produce constraints that reflect the compound out or in parameter. Changing these as-is would break CWE checker abstract interpretation in a lot of ways. We can perhaps work around this by not touching their argument definition, which is primarily used for externals, and instead emitting our own argument definition specifically for use in type constraint generation.
There's probably a lot of string duplication due to holding owned strings in type variables; type vars should probably hold Rcs or something for the string.
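A minimal sketch of the Rc idea (the struct and field names are illustrative, not the crate's real TypeVariable): clones share one allocation instead of duplicating the string bytes:

```rust
use std::rc::Rc;

// A type variable holding Rc<str>, so cloning shares the allocation.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct TypeVariable {
    name: Rc<str>,
}

fn main() {
    let a = TypeVariable { name: Rc::from("sub_001014fb") };
    let b = a.clone(); // bumps a refcount; no string bytes are copied
    assert!(Rc::ptr_eq(&a.name, &b.name));
    assert_eq!(&*b.name, "sub_001014fb");
}
```

If type variables ever cross threads, Arc<str> would be the drop-in alternative.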
Constant addition prior to a load, i.e. a.load.+x, can be viewed as an access to a field. Currently we handle this by pushing adds through loads in the sketch graph, but this doesn't allow the property to be used between uninteresting variables in proofs.
Ideally we'd have some sort of proof rule in the FSA:
x \in Z /\ Var a.load.+x /\ Var b /\ a.load.+x <= b
-------------------------------------------------------
a.load <= b.field_at.+x
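The rule could also be phrased as a syntactic constraint rewrite; this sketch (the field_at naming and string encoding of derived type variables are hypothetical) shows the intended transformation:

```rust
// Rewrite a constraint (lhs, rhs) whose lhs has the shape a.load.+k into the
// field-access form of the rule above: a.load.+k <= b becomes
// a.load <= b.field_at_k. Returns None when the rule does not apply.
fn rewrite_field_access(lhs: &str, rhs: &str) -> Option<(String, String)> {
    let (base, off) = lhs.rsplit_once(".+")?;
    if !base.ends_with(".load") {
        return None;
    }
    let k: u64 = off.parse().ok()?;
    Some((base.to_string(), format!("{rhs}.field_at_{k}")))
}

fn main() {
    assert_eq!(
        rewrite_field_access("a.load.+8", "b"),
        Some(("a.load".to_string(), "b.field_at_8".to_string()))
    );
    // Constraints without the constant-add-through-load shape are untouched.
    assert_eq!(rewrite_field_access("a.store.+8", "b"), None);
}
```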
Merging should require tests to pass and no clippy warnings.
The question here is how to handle writes to an abstract location that are unbounded. Currently, for an abstract object named RSI we do RSI <= thing.load.@0sigma_size. This approach doesn't work well if some writes are bounded: then we will get RSI@0sigma_size <= thing.load.@0sigma_size. If the program is type safe with respect to our model, these two types must be equal. But that's going to result in a reflexive edge with a field to the abstract object.
Ideas of how to handle this:
#![warn(missing_docs)] was added to the library aspirationally. To do #2 we need to fix these clippy warnings.
Currently, we don't bind up information to the in-parameters of a variadic function, because we look at the function's formals to bind up the actuals. We should query Ghidra or use our own analysis to figure out how to bind these up.
In mooosl several pointers are getting misunified, i.e. byte buffers are unified with the pointers to nodes. This is because both of these pointers reach free param 1 and are returned from calloc. These pointer types are then being unified, causing mistyping of byte buffers. Solving this requires callsite tagging of formal parameters in order to support polymorphic types for functions such as free and calloc.
Here are the collected files, including the ls binary:
$ unzip -d /tmp/test-ls-bti ls-bti.zip
Archive: ls-bti.zip
creating: /tmp/test-ls-bti/1655908655320-0/
inflating: /tmp/test-ls-bti/1655908655320-0/interesting_tids.pb
inflating: /tmp/test-ls-bti/1655908655320-0/ir.json
inflating: /tmp/test-ls-bti/1655908655320-0/lattice.json
inflating: /tmp/test-ls-bti/1655908655320-0/additional_constraints.pb
inflating: /tmp/test-ls-bti/ls
Currently we ignore sign-extend and zero-extend copies and do not emit type constraints for them; we should infer from this behavior that the target is integral and match up the extended portion in the constraint. This issue is backlogged until the solver is implemented.
Currently, global type inference works in the following way:
After binding polymorphic types, we visit all sketch graphs looking for global variables referenced in SCCs.
For each variable, we collect the intersection of the global variable's subsketch each time it appears. This has the effect of concretizing the global type into one unified sketch. These unified sketches are then placed back into the SCC sketches so that parameter aliasing can use this updated information.
Later we create a graph for each global in the global type graph (bad name... it's the type graph of all SCCs). Each node that used to point to the global gets pointed to this new subgraph. This strategy doesn't allow for sharing structure between globals that share a structure type. To do this we would need to alias uses of a global in a parent global to the child global's representation.
At insn 00101734 there is a MOV [RBP-56], RDI, which corresponds to RSP-64 (with entry 0). No constraint is generated, so we fail to bind this var to constraints for the in-param.
If during constraint generation we get something like sub_x.in <= tau and we lose the plot for that variable (it isn't related to any other variables), we will get no constraints on that variable, and we won't observe that the constraint exists when building sketches. This means parameter types will be missing. We could transparently add sub_x.in <= top or bot <= sub_x.in to force the var constraints to be observed in the sketch and get .in as a node with [bottom, top] as the type. There may be some subtlety here with contravariant types.
I'm not sure I love adding these fake constraints on behalf of the user. On the other hand, maybe the user doing constraint generation can make a decision about how they are going to track recovered param types.
https://github.com/slog-rs/slog gives the hierarchy we need in the logs without having to pass around the logging object everywhere. Also saves us some code.
After retrieving points-to info, we should be able to generate a single intraprocedural-ish graph for each SCC instead of the ugly node filtering we do now.
The most reliable way to find a node after graph manipulations is by following the path of edges in the original sketch. Since sketches are prefix closed, if the node still exists it will always still be at that path. Currently this is leveraged in a bunch of places, but the mapping graph still identifies nodes by indices etc.; maybe we can share some infrastructure related to node identification utilizing graph paths.
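A minimal sketch of path-based identification (the edge-map representation is illustrative, not the mapping graph's real storage): prefix closure means a surviving node is always reachable at the same label path, so the path is a stable identifier across graph manipulations:

```rust
use std::collections::HashMap;

// Follow a path of edge labels from a root node; returns None if any edge
// along the path no longer exists.
fn find_node(
    edges: &HashMap<(u32, &'static str), u32>,
    root: u32,
    path: &[&'static str],
) -> Option<u32> {
    path.iter()
        .try_fold(root, |node, label| edges.get(&(node, *label)).copied())
}

fn main() {
    let edges = HashMap::from([((0, "load"), 1), ((1, "σ64@0"), 2)]);
    assert_eq!(find_node(&edges, 0, &["load", "σ64@0"]), Some(2));
    // A path through a deleted or never-present edge simply fails.
    assert_eq!(find_node(&edges, 0, &["store"]), None);
}
```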