trailofbits / binary_type_inference
License: GNU General Public License v3.0
Maintaining tests that use fixed type identifiers and expect concrete constraints has become impossible for larger test cases (e.g. mooosl). These tests are currently disabled; instead, we should check properties that we expect of these constraints and sketches.
Inferring global types requires two major features. The VSA does not track pointers into the global region. Even if we had pointers into the global region, we would still need some notion of a-locs (potentially, unless we wanted to treat offsets universally). One solution would be to just consume Ghidra's globals.
Additionally, we would have to actually collect constraints about globals and solve them at some point. We could maybe do this during function aggregation and just add globals as additional interesting variables.
Currently the get_intermediate_representation_for_reader utility function does not support loading a bare-metal binary. This capability will be required for embedded device analysis.
After binding a parameter in sub_00109050 in /bin/ls, the next parameter bind fails because of an assertion that type sketches should not create reflexive edges. This probably indicates a bug in binding the parameter prior to the one we fail on.
x <= a
x.store <= y
y <= b
The retypd GitHub implementation prints the empty set when the interesting variables are [a, b].
But consider that an elementary proof exists for the subtyping constraint a.store <= b:
From x <= a and the existence of x.store, the rule InheritL (subtypes inherit fields to the left of a subtyping constraint) gives that a.store exists.
From x <= a and the existence of a.store: .store is contravariant, so by the contravariant field rule a.store <= x.store.
From a.store <= x.store and x.store <= y, transitivity gives a.store <= y.
From a.store <= y and y <= b, transitivity gives a.store <= b. QED
The proof graph demonstrates that the proof connections exist; however, the initial variable is contravariant. We only explore covariant paths (i.e. initial covariant, end covariant). Observationally, the paths a.store <= b and b <= a.store both exist, but only the first path is a valid proof.
The first thing to check is whether the above is provable in normalized form. If it is, what is the graph representation of the normal-form proof, and why is it not represented? The issue appears to be with proofs that begin with a contravariant field rule.
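The elementary proof above can be checked mechanically. Below is a minimal saturation sketch (this is illustrative only, not the retypd implementation; variable names follow the issue) that applies the InheritL-guarded contravariant .store rule plus transitivity, and confirms that a.store <= b is derivable while the reverse b <= a.store is not:

```rust
use std::collections::HashSet;

type Cons = HashSet<(String, String)>;

// Is a derived type variable mentioned anywhere in the constraint set?
fn mentioned(cons: &Cons, v: &str) -> bool {
    cons.iter().any(|(l, r)| l == v || r == v)
}

// Saturate a toy constraint set under transitivity plus the contravariant
// .store field rule: x <= a and the existence of x.store imply
// a.store <= x.store (the guard on x.store keeps this terminating).
fn saturate(mut cons: Cons) -> Cons {
    loop {
        let mut new = Vec::new();
        for (x, a) in &cons {
            // InheritL + contravariance: only fire when x.store exists.
            if mentioned(&cons, &format!("{x}.store")) {
                new.push((format!("{a}.store"), format!("{x}.store")));
            }
        }
        // Transitivity.
        for (p, q) in &cons {
            for (q2, r) in &cons {
                if q == q2 {
                    new.push((p.clone(), r.clone()));
                }
            }
        }
        let before = cons.len();
        cons.extend(new);
        if cons.len() == before {
            return cons;
        }
    }
}

fn main() {
    let cons: Cons = [("x", "a"), ("x.store", "y"), ("y", "b")]
        .into_iter()
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect();
    let sat = saturate(cons);
    // The elementary proof: a.store <= x.store <= y <= b.
    assert!(sat.contains(&("a.store".to_string(), "b".to_string())));
    // The reverse direction is not derivable: only one path is a proof.
    assert!(!sat.contains(&("b".to_string(), "a.store".to_string())));
}
```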
Currently we infer nothing, but in one of our tests the only way to infer that variable x is an integer is to observe that it is tested for equality against an integer... should we unify across compares?
Currently the datalog selects a single type for a node by ruling out other types. I.e. the pointer constraints consist of a store and a load, suggesting the node is a pointer, and then make sure there aren't field rules, in rules, out rules, or add rules. Instead, we should collect every possible type for a node. If there is a single candidate we get a single type; otherwise we introduce a union.
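A minimal sketch of the proposed behavior (the type names and evidence predicates here are illustrative stand-ins, not the actual datalog relations): collect every candidate the evidence supports and only union when there is more than one, instead of ruling the others out:

```rust
// Illustrative candidate C types; the real lattice is richer.
#[derive(Debug, Clone, PartialEq)]
enum CType {
    Pointer,
    Integer,
    Union(Vec<CType>),
}

// Collect every type the evidence supports rather than selecting one by
// ruling the others out. Evidence flags are hypothetical simplifications of
// the load/store and add rules mentioned above.
fn infer(has_load_or_store: bool, has_add: bool) -> Option<CType> {
    let mut candidates = Vec::new();
    if has_load_or_store {
        candidates.push(CType::Pointer);
    }
    if has_add {
        candidates.push(CType::Integer);
    }
    match candidates.len() {
        0 => None,
        1 => candidates.pop(),
        _ => Some(CType::Union(candidates)),
    }
}

fn main() {
    assert_eq!(infer(true, false), Some(CType::Pointer));
    // A node with both pointer and integer evidence becomes a union rather
    // than having one candidate ruled out.
    assert_eq!(
        infer(true, true),
        Some(CType::Union(vec![CType::Pointer, CType::Integer]))
    );
}
```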
The current terminology is confusing. The GlobalGraph represents all types, while the graph of globals only represents the types of globals. The supergraph should represent all types to avoid confusion.
This would allow for more precise inference and attack trailofbits/BTIGhidra#19 by including the structure of the callee into the caller FSA inference.
a <= x.store
x.load <= c
Pretty obviously we should get a <= c here.
Formally S-Pointer gives us x.store <= x.load
Trans of (a <= x.store , x.store <= x.load) gives us a <= x.load
Trans of (a <= x.load, x.load <= c) gives us a <= c
This proof breaks down in our proof graph. This is related to the variance of stacks. We have to be careful to not break #6 while fixing this portion of the proof graph.
The trick is that a should have an edge to the covariant x store node, which would then get everything to work out. But this edge was removed in #6.
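As a sanity check on the derivation above, here is a small saturation sketch (illustrative only, not the proof-graph machinery) that adds S-Pointer edges v.store <= v.load whenever both capabilities appear, closes under transitivity, and derives a <= c:

```rust
use std::collections::HashSet;

type Cons = HashSet<(String, String)>;

// Saturate a toy constraint set under transitivity plus S-Pointer
// (v.store <= v.load whenever both capabilities are mentioned).
fn saturate(mut cons: Cons) -> Cons {
    loop {
        let mut new = Vec::new();
        // S-Pointer: collect every v such that v.store appears, then add
        // v.store <= v.load if v.load also appears.
        let vars: Vec<String> = cons
            .iter()
            .flat_map(|(l, r)| [l.clone(), r.clone()])
            .filter_map(|t| t.strip_suffix(".store").map(str::to_string))
            .collect();
        for v in vars {
            let load = format!("{v}.load");
            if cons.iter().any(|(l, r)| *l == load || *r == load) {
                new.push((format!("{v}.store"), load));
            }
        }
        // Transitivity.
        for (p, q) in &cons {
            for (q2, r) in &cons {
                if q == q2 {
                    new.push((p.clone(), r.clone()));
                }
            }
        }
        let before = cons.len();
        cons.extend(new);
        if cons.len() == before {
            return cons;
        }
    }
}

fn main() {
    let cons: Cons = [("a", "x.store"), ("x.load", "c")]
        .into_iter()
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect();
    let sat = saturate(cons);
    // S-Pointer gives x.store <= x.load; two transitivity steps give a <= c.
    assert!(sat.contains(&("a".to_string(), "c".to_string())));
}
```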
When accessing registers, and possibly points-to information, for formal parameters, the subtyping relationship goes in the wrong direction.
I.e. in the assignment x = y, where y is defined by a and b, we get:
a <= t
b <= t
t <= x
This works to get a passed up to x, since both must be subtypes (i.e. a <= t <= x). We are using this same method for formals, but it doesn't work:
We have sub.in_0 <= t, i.e. stating that the formal must be a subtype of its usages, so if the defines are a and b we want to get sub.in_0 <= a and sub.in_0 <= b.
So we want sub.in_0 <= t, t <= a, and t <= b, but we use the same register-access method. So we get (sub.in_0 <= t, a <= t, b <= t), and the subtyping relationship causes a and b to not constrain in_0 at all.
This can probably happen for the memaccess constraints as well. I.e. imagine a parameter on the stack; we get:
stack_obj <= x.load.stack_loc
sub.in_0 <= x.load.stack_loc
Once again, information on stack_obj does not constrain sub.in_0. Somewhat fortunately, x.load.stack_loc will still constrain it in this case.
We should write test cases for both of these. Is it the case that what should actually happen is x.load.stack_loc <= stack_obj?
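A transitive-closure toy model (names taken from the example above; this is not the real constraint solver) makes the directionality problem concrete: with the register-access orientation, a and b never relate to sub.in_0, while the flipped orientation gives the bindings we want:

```rust
use std::collections::HashSet;

// Transitive closure over a toy subtyping relation.
fn closure(
    mut c: HashSet<(&'static str, &'static str)>,
) -> HashSet<(&'static str, &'static str)> {
    loop {
        let mut new = Vec::new();
        for (p, q) in &c {
            for (q2, r) in &c {
                if q == q2 {
                    new.push((*p, *r));
                }
            }
        }
        let before = c.len();
        c.extend(new);
        if c.len() == before {
            return c;
        }
    }
}

fn main() {
    // Current (register-access) orientation: everything flows *into* t, so
    // nothing ever relates sub.in_0 to the defining variables a and b.
    let wrong = closure([("a", "t"), ("b", "t"), ("sub.in_0", "t")].into());
    assert!(!wrong.contains(&("sub.in_0", "a")));
    // Desired orientation for formals: t flows out to each definition, so
    // the formal constrains both uses.
    let right = closure([("t", "a"), ("t", "b"), ("sub.in_0", "t")].into());
    assert!(right.contains(&("sub.in_0", "a")));
    assert!(right.contains(&("sub.in_0", "b")));
}
```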
Property tests are fairly easy to construct for testing properties we want to hold for polymorphic parameter bindings. The idea is to generate sketch graphs that use other sketches. Then we check that every word in the language of the formal in-parameter is in the language of the actual in-parameter. For outs, we check that all the words in the actual out are in the formal out.
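A sketch of the property itself (the Sketch representation and labels here are illustrative stand-ins for the crate's real types): approximate a sketch as an edge-labeled graph, enumerate bounded-length words of one language, and check membership in the other:

```rust
use std::collections::HashMap;

// A sketch approximated as an edge-labeled graph rooted at "root". Its
// language is the set of label paths from the root.
type Sketch = HashMap<&'static str, Vec<(&'static str, &'static str)>>;

// Does the sketch accept this word of field labels?
fn accepts(s: &Sketch, word: &[&str]) -> bool {
    let mut node = "root";
    for lbl in word {
        match s.get(node).and_then(|es| es.iter().find(|(l, _)| l == lbl)) {
            Some((_, next)) => node = *next,
            None => return false,
        }
    }
    true
}

// All words of length <= depth in the language of s.
fn words(s: &Sketch, depth: usize) -> Vec<Vec<&'static str>> {
    let mut out = vec![Vec::new()];
    let mut frontier = vec![("root", Vec::new())];
    for _ in 0..depth {
        let mut next_frontier = Vec::new();
        for (node, path) in frontier {
            for (l, n) in s.get(node).into_iter().flatten() {
                let mut p = path.clone();
                p.push(*l);
                out.push(p.clone());
                next_frontier.push((*n, p));
            }
        }
        frontier = next_frontier;
    }
    out
}

fn main() {
    let formal_in: Sketch = HashMap::from([
        ("root", vec![("load", "p")]),
        ("p", vec![("σ64@0", "leaf")]),
    ]);
    let actual_in: Sketch = HashMap::from([
        ("root", vec![("load", "p")]),
        ("p", vec![("σ64@0", "l0"), ("σ64@8", "l1")]),
    ]);
    // Property for in-parameters: every word of the formal's language is a
    // word of the actual's language. A proptest-style generator would
    // replace the fixed sketches here.
    assert!(words(&formal_in, 4).iter().all(|w| accepts(&actual_in, w)));
    // The converse fails: the actual has σ64@8, which the formal lacks.
    assert!(!words(&actual_in, 4).iter().all(|w| accepts(&formal_in, w)));
}
```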
extern int additional_compute(int val);
void compute(int* ptr) {
*ptr = additional_compute(*ptr);
}
Ghidra's definition for the stub additional_compute has no arguments. This means the type inference constraint generation does not generate any actual -> formal argument constraints for the call to additional_compute. The decompiler does correctly guess that a single argument is passed to additional_compute in compute, so there should be some way to extract these constraints. The high function, however, does not contain info about where the argument is located (top of the stack in the case of x86), which would be required. Backlogging this because it only causes missed info if we do not have a type signature for the extern; if we have a type signature, the parameter will be available in Ghidra.
So currently we use the decompiler's guesses of call parameters and returns for actual calls, and guessed params and returns for actual rets. We should analyze the locators ourselves with more clever analyses. For actual calls, the rets would be the reaching definitions from the callee that are used.
Currently everything is solved in one pass. In the retypd paper, they first create a transducer over the type variables of functions to create function summary signatures, then use those to perform an interprocedural analysis efficiently.
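The summary-first phase ordering can be sketched as a reverse-topological walk of the call-graph condensation (the graph shape and names below are hypothetical; the real transducer construction is far more involved than the ordering itself):

```rust
use std::collections::HashMap;

// Summarize callees before callers by walking call-graph SCCs in reverse
// topological order. scc_deps maps an SCC to the SCCs it calls; the
// condensation is assumed to be precomputed, so the dependency graph is
// acyclic.
fn summary_order(
    scc_deps: &HashMap<&'static str, Vec<&'static str>>,
    roots: &[&'static str],
) -> Vec<&'static str> {
    fn visit(
        n: &'static str,
        deps: &HashMap<&'static str, Vec<&'static str>>,
        done: &mut Vec<&'static str>,
    ) {
        if done.contains(&n) {
            return;
        }
        for callee in deps.get(n).into_iter().flatten() {
            visit(callee, deps, done);
        }
        done.push(n);
    }
    let mut done = Vec::new();
    for r in roots {
        visit(r, scc_deps, &mut done);
    }
    done
}

fn main() {
    // main calls helper; helper calls leaf. Summaries are built leaf-first,
    // so each caller can instantiate its callees' signatures.
    let deps = HashMap::from([("main", vec!["helper"]), ("helper", vec!["leaf"])]);
    assert_eq!(summary_order(&deps, &["main"]), vec!["leaf", "helper", "main"]);
}
```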
Currently, after sketches are labeled, there is no knowledge of which nodes in a sketch correspond to nodes in the original constraint graph. Ideally, we would maintain some way to identify shared subgraphs between sketches, the idea being that we should build a dependency graph for lowering shared subgraphs, where a shared sketch subgraph is represented by a single C type.
These constraints represent a minimized set of constraints needed to prove (approximately) that sub_001014fb.in_0.load <= sub_001014fb.out
The constraints are organized so that the transitive property just flows
Relating τ320 to the out parameter:
τ320.load.σ64@0 ⊑ instr_00101544_0_$Uc000
instr_00101544_0_$Uc000 ⊑ instr_00101547_1_$Uc000
instr_00101547_1_$Uc000 ⊑ instr_00101434_0_RBP.store.σ64@-16
instr_00101434_0_RBP.store.σ64@-16 ⊑ sub_001014fb@RSP.σ64@-24
sub_001014fb@RSP.σ64@-24 ⊑ instr_00101434_0_RBP.load.σ64@-16
instr_00101434_0_RBP.load.σ64@-16 ⊑ instr_00101587_1_$Uc000
instr_00101587_1_$Uc000 ⊑ instr_00101587_2_RAX
instr_00101587_2_RAX ⊑ sub_001014fb.out
sub_001014fb.in_0 ⊑ instr_001016bd_0_RDI
sub_001014fb.in_0 ⊑ instr_0010176e_0_RDI
instr_001016bd_0_RDI ⊑ τ313
instr_0010176e_0_RDI ⊑ τ313
τ313 ⊑ instr_00101500_0_RBP.store.σ64@-24
instr_00101500_0_RBP.store.σ64@-24 ⊑ sub_001014fb@RSP.σ64@-32
sub_001014fb@RSP.σ64@-32 ⊑ instr_00101434_0_RBP.load.σ64@-24
instr_00101434_0_RBP.load.σ64@-24 ⊑ instr_0010153d_1_$Uc000
Add ties it together
Add(instr_0010153d_1_$Uc000,instr_00101535_1_RDX,τ320)
The point is we aren't currently handling add constraints, but the add constraint is the only thing currently tying the in parameter to the out parameter. We need to apply these inference rules when inspecting the sketch graph:
It's tricky because this has to happen before simplification, since we simplify out the relation; hopefully sketch building on unsimplified cons just works tm :)
In dca9410 I removed intersecting/unioning callsite types with the original representation. This decision was made because the sketch union of out parameters would result in a loss of information.
I.e. a type like:
sub_id.in <= sub_id.out
sub_id:callsite_0.out.field_at_0 <= int
sub_id:callsite_1.out.field_at_4 <= float
Would discard the structure because the identity function has no language intersection with the original type. We should more carefully examine the implications of this decision.
Need to look into the complexity theory of this, but I think it should be possible to generate a weighted pushdown system whose stack configurations are recognized by some regular language L; i.e. given an input language L, generate a PDS with stack configurations equal to L. If we can do that, then we can generate the set of constraints represented by the PDS and compute the FSA. Then we can check whether the FSA recognizes things outside the language L, or that it recognizes everything inside the language L.
If we can't get equality on the stack configurations, we could also get weaker checking by constructing a PDS whose configurations are at least guaranteed to be recognized by L, or guaranteed to contain all words in L.
let in_params = orig_repr
.quotient_graph
.get_node_mapping()
.iter()
.map(|(dtv, _idx)| dtv.clone())
.filter(|dtv| dtv.get_base_variable().get_cs_tag().is_none() && dtv.is_in_parameter());
for dtv in in_params.collect::<Vec<DerivedTypeVar>>() {
self.refine_formal_in(condensed, &mut orig_repr, dtv, target_idx);
}
let out_params = orig_repr
.quotient_graph
.get_node_mapping()
.iter()
.map(|(dtv, _idx)| dtv.clone())
.filter(|dtv| dtv.get_base_variable().get_cs_tag().is_none() && dtv.is_out_parameter());
for dtv in out_params.collect::<Vec<DerivedTypeVar>>() {
self.refine_formal_out(condensed, &mut orig_repr, dtv, target_idx);
}
So the code above has a bug. When we refine the formal in-parameters, that will copy in information from the parents, including nodes labeled by the parents' dtvs. Then the parent dtvs (which aren't callsite tagged) are in the namespace for this sketch, and the out-params from the parent will be included in the child's out-params, which is not correct.
This could be solved by collecting all dtvs before beginning refinement, but that addresses a symptom and not the cause. We shouldn't copy dtvs from the refinement at all; the labels should come from the type being refined.
In 517dd34 I added directionality to stores. The idea here was that if we observe a store of some type variable t to a pointer p, and p points to the abstract object a, then p.store ⊑ a, because we could have stored to p and therefore must be compatible. This kinda falls over because really we would want a must analysis here to guarantee that along some path we point to a; otherwise p.store may not have to be a subtype. Ignoring that issue for the moment: directionality made the following claim hold: if pointer z reaches p, then we should also conclude z.store ⊑ a.
This reasoning is fairly reasonable, but it has the undesirable consequence of also yielding the overly strong p ⊑ z. Given contravariance on store capabilities, we can then derive z.store ⊑ p.store. This says that any pointer defining a stored-to pointer must always store a subtype. That's undesirable, because imagine some sort of situation like:
void example(void) {
    void* i;
    void* z;
    void* p = z;
    *p = open();
    z = i;
    *z = 1;
}
In this case we get z.store ⊑ fd when z stores both int and fd, so in an ideal world our type here is int ⊑ z.store. I need to think more about this. I'm assuming the reason this was initially done was to handle how a-locs are generated (i.e. through address computations). Something like:
u=INT_ADD rsp, 8
STORE x, u
and without flipping the direction we get
rsp_defsite.+8 ⊑ unique
unique ⊑ t
t.store ⊑ stack.at_8_8
with that we get to conclude:
t.store ⊑ unique.store
unique.store ⊑ rsp_defsite.+8.store
but that seems fine to me
Lots of shared structure; I think Patricia sets would be good here. Might be part of the perf issue.
Currently, compound out-parameters are handled by creating a fake structure with a field for each storage location during lowering. This obviously isn't ideal. We should output compound variables as such in the CWE checker IR, then in constraint generation produce constraints that reflect the compound out or in parameter. Changing these as-is would break CWE checker abstract interpretation in a lot of ways. We can perhaps work around this by not touching their argument definition, which is primarily used for externals, and instead emitting our own argument definition specifically for use in type constraint generation.
There's probably a lot of string duplication due to holding owned strings in type variables; type vars should probably hold Rcs or something for the string.
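A minimal sketch of the Rc idea (the struct and field names are illustrative, not the crate's real TypeVariable): clones share one allocation instead of duplicating the string bytes:

```rust
use std::rc::Rc;

// A type variable holding Rc<str>, so cloning shares the allocation.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct TypeVariable {
    name: Rc<str>,
}

fn main() {
    let a = TypeVariable { name: Rc::from("sub_001014fb") };
    let b = a.clone(); // bumps a refcount; no string bytes are copied
    assert!(Rc::ptr_eq(&a.name, &b.name));
    assert_eq!(&*b.name, "sub_001014fb");
}
```

If type variables ever cross threads, Arc<str> would be the drop-in alternative.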
Constant addition prior to a load, i.e. a.load.+x, can be viewed as an access to a field. Currently we handle this by pushing adds through loads in the sketch graph, but this doesn't allow the property to be used between uninteresting variables in proofs.
Ideally we'd have some sort of proof rule in the FSA:
x \in Z /\ Var a.load.+x /\ Var b /\ a.load.+x <= b
-------------------------------------------------------
a.load <= b.field_at.+x
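The rule could also be phrased as a syntactic constraint rewrite; this sketch (the field_at naming and string encoding of derived type variables are hypothetical) shows the intended transformation:

```rust
// Rewrite a constraint (lhs, rhs) whose lhs has the shape a.load.+k into the
// field-access form of the rule above: a.load.+k <= b becomes
// a.load <= b.field_at_k. Returns None when the rule does not apply.
fn rewrite_field_access(lhs: &str, rhs: &str) -> Option<(String, String)> {
    let (base, off) = lhs.rsplit_once(".+")?;
    if !base.ends_with(".load") {
        return None;
    }
    let k: u64 = off.parse().ok()?;
    Some((base.to_string(), format!("{rhs}.field_at_{k}")))
}

fn main() {
    assert_eq!(
        rewrite_field_access("a.load.+8", "b"),
        Some(("a.load".to_string(), "b.field_at_8".to_string()))
    );
    // Constraints without the constant-add-through-load shape are untouched.
    assert_eq!(rewrite_field_access("a.store.+8", "b"), None);
}
```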
Merging should require tests to pass and no clippy warnings.
The question here is how to handle writes to an abstract location that are unbounded. Currently, for an abstract object named RSI we do RSI <= thing.load.@0sigma_size. This approach doesn't work well if some writes are bounded: then we will get RSI@0sigma_size <= thing.load.@0sigma_size. If the program is type safe with respect to our model, these two types must be equal. But that's going to result in a reflexive edge with a field to the abstract object.
Ideas of how to handle this:
#![warn(missing_docs)] was added to the library aspirationally. To do #2 we need to fix these clippy warnings.
Currently, we don't bind up information to the in-parameters of a variadic function, because we look at the function's formals to bind up the actuals. We should query Ghidra or use our own analysis to figure out how to bind these up.
In mooosl several pointers are getting misunified, i.e. byte buffers are unified with the pointers to nodes. This is because both of these pointers reach free param 1 and are returned from calloc. These pointer types are then being unified, causing mistyping of byte buffers. Solving this requires callsite tagging of formal parameters in order to support polymorphic types for functions such as free and calloc.
Here are the collected files, including the ls binary:
$ unzip -d /tmp/test-ls-bti ls-bti.zip
Archive: ls-bti.zip
creating: /tmp/test-ls-bti/1655908655320-0/
inflating: /tmp/test-ls-bti/1655908655320-0/interesting_tids.pb
inflating: /tmp/test-ls-bti/1655908655320-0/ir.json
inflating: /tmp/test-ls-bti/1655908655320-0/lattice.json
inflating: /tmp/test-ls-bti/1655908655320-0/additional_constraints.pb
inflating: /tmp/test-ls-bti/ls
Currently we ignore sign-extend and zero-extend copies and do not emit type constraints for them; we should infer from this behavior that the target is integral and match up the extended portion in the constraint. This issue is backlogged until the solver is implemented.
Currently, global type inference works in the following way:
After binding polymorphic types, we visit all sketch graphs looking for global variables referenced in SCCs.
For each variable, we collect the intersection of the global variable's subsketch each time it appears. This has the effect of concretizing the global type into one unified sketch. These unified sketches are then placed back into the SCC sketches so that parameter aliasing can use this updated information.
Later we create a graph for each global in the global type graph (bad name... it's the type graph of all SCCs). Each node that used to point to the global gets pointed to this new subgraph. This strategy doesn't allow for sharing structure between globals that share a structure type. To do this we would need to alias uses of a global in a parent global to the child global's representation.
At insn 00101734 there is a MOV [RBP-56], RDI, which corresponds to RSP-64 (with entry 0). No constraint is generated, so we fail to bind this var to constraints for the in-param.
If during constraint generation we get something like sub_x.in <= tau and we lose the plot for that variable (it isn't related to any other variables), we will get no constraints on that variable, and we won't observe that the constraint exists when building sketches. This means parameter types will be missing. We could transparently add sub_x.in <= top or bot <= sub_x.in to force the var constraints to be observed in the sketch and get .in as a node with [bottom, top] as the type. There may be some subtlety here with contravariant types.
I'm not sure I love adding these fake constraints on behalf of the user. On the other hand, maybe the user doing constraint generation can make a decision about how they are going to track recovered param types.
https://github.com/slog-rs/slog gives the hierarchy we need in the logs without having to pass around the logging object everywhere. Also saves us some code.
After retrieving points-to info, we should be able to generate a single intraprocedural-ish graph for each SCC instead of the ugly node filtering we do now.
The most reliable way to find a node after graph manipulations is by following the path of edges in the original sketch. Since sketches are prefix closed, if the node still exists it will always still be at that path. Currently this is leveraged in a bunch of places, but the mapping graph still identifies nodes by indices etc.; maybe we can share some infrastructure related to node identification utilizing graph paths.
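A minimal sketch of path-based identification (the edge-map representation is illustrative, not the mapping graph's real storage): prefix closure means a surviving node is always reachable at the same label path, so the path is a stable identifier across graph manipulations:

```rust
use std::collections::HashMap;

// Follow a path of edge labels from a root node; returns None if any edge
// along the path no longer exists.
fn find_node(
    edges: &HashMap<(u32, &'static str), u32>,
    root: u32,
    path: &[&'static str],
) -> Option<u32> {
    path.iter()
        .try_fold(root, |node, label| edges.get(&(node, *label)).copied())
}

fn main() {
    let edges = HashMap::from([((0, "load"), 1), ((1, "σ64@0"), 2)]);
    assert_eq!(find_node(&edges, 0, &["load", "σ64@0"]), Some(2));
    // A path through a deleted or never-present edge simply fails.
    assert_eq!(find_node(&edges, 0, &["store"]), None);
}
```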