yegord / snowman
Snowman decompiler
This project forked from smartdec/smartdec
The core::image class should be able to handle extra debug info, such as line numbers in the original source code, if the input format supports it.
The following program (http://people.mozilla.org/~jmuizelaar/snowman/array.o):
push ebp
mov ebp, esp
mov eax, [ebp+0xc]
mov ecx, [ebp+0x8]
movsx eax, word [ecx+eax*2]
pop ebp
decompiles to
int32_t _f(void* a1, int32_t a2) {
return (int32_t)*(int16_t*)((int32_t)a1 + a2 * 2);
}
It seems like we're very close to being able to turn this into:
int32_t _f(int16_t* a1, int32_t a2) {
return a1[a2];
}
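To make the gap concrete, here is a small sketch that puts both forms side by side. The names f_current and f_desired are made up for this example, and uintptr_t is swapped in for int32_t so the pointer arithmetic is also valid on a 64-bit host; neither is snowman output.

```cpp
#include <cassert>
#include <cstdint>

// Shape of the current output. uintptr_t replaces int32_t here
// (an assumption of this sketch) so it also works on 64-bit hosts.
int32_t f_current(void* a1, int32_t a2) {
    return (int32_t)*(int16_t*)((uintptr_t)a1 + a2 * 2);
}

// Desired output: the scale factor 2 equals sizeof(int16_t), so the
// address computation can be written as an array subscript.
int32_t f_desired(int16_t* a1, int32_t a2) {
    return a1[a2];
}
```

Both functions read the same element, so the rewrite only needs to recognise that the multiplier matches the size of the pointed-to type.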
Snowman gets pretty confused with the following dll:
http://people.mozilla.org/~jmuizelaar/snowman/switch.dll
__declspec(dllexport)
const char *get(int k)
{
switch (k+1) {
case 0:
return "zero";
case 1:
return "one";
case 2:
return "two";
case 3:
return "three";
default:
return "other";
}
}
const char *get2(int k)
{
switch (k) {
case 0:
return "zero";
case 1:
return "one";
case 2:
return "two";
case 3:
return "three";
default:
return "other";
}
}
__declspec(dllexport)
const char *(*get3)(int k) = get2;
int DllMain(long handle, long reason, void* reserved)
{
return 1;
}
Using the relocations in .reloc we can avoid treating the addresses in the jump table as instructions for disassembly.
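A minimal sketch of that idea, assuming 32-bit x86 where each .reloc entry marks a 4-byte slot holding an absolute address. The function name and the set-based representation are hypothetical, not part of snowman.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical input: start addresses of the 4-byte slots that the
// .reloc section marks as holding absolute addresses.
// Returns false if addr lies inside such a slot. A slot may be an operand
// of an instruction that starts earlier, but decoding should never start
// inside it, which keeps jump table entries out of the disassembly.
bool mayStartInstruction(const std::set<uint32_t>& relocSlots, uint32_t addr) {
    auto it = relocSlots.upper_bound(addr);    // first slot strictly above addr
    if (it == relocSlots.begin()) return true; // no slot at or below addr
    --it;                                      // largest slot start <= addr
    return addr - *it >= 4;                    // true if outside that slot
}
```

A jump table is a run of consecutive relocated slots, so every address inside it is rejected as an instruction start.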
I was reading Medusa's source, and the way they generate their disassembler blows me away! They use some Python scripts and a YAML file to auto-generate it. Not only that, x86.yaml covers all instructions up to AVX2, and arm.yaml up to ARMv8. Their scripts generate a C++ header and source file.
What do you think about this approach?
0 libsystem_malloc.dylib 0x00007fff81846ca0 tiny_malloc_from_free_list + 12
1 libsystem_malloc.dylib 0x00007fff818473c3 szone_malloc_should_clear + 320
2 libsystem_malloc.dylib 0x00007fff81849868 malloc_zone_malloc + 71
3 libsystem_malloc.dylib 0x00007fff8184a27c malloc + 42
4 libc++.1.dylib 0x00007fff898b528e operator new(unsigned long) + 30
5 snowman 0x0000000103c26df0 std::__1::pair<boost::unordered::iterator_detail::iterator<boost::unordered::detail::ptr_node<nc::core::ir::Term const*> >, bool> boost::unordered::detail::table_impl<boost::unordered::detail::set<std::__1::allocator<nc::core::ir::Term const*>, nc::core::ir::Term const*, boost::hash<nc::core::ir::Term const*>, std::__1::equal_to<nc::core::ir::Term const*> > >::emplace_impl<nc::core::ir::Term const* const&>(nc::core::ir::Term const* const&, nc::core::ir::Term const* const&&&) + 208 (unique.hpp:410)
6 snowman 0x0000000103c2b85f nc::core::ir::liveness::LivenessAnalyzer::makeLive(nc::core::ir::Term const*) + 79 (Liveness.h:46)
7 snowman 0x0000000103c2bb30 nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 592 (LivenessAnalyzer.cpp:220)
8 snowman 0x0000000103c2bcda nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 1018 (LivenessAnalyzer.cpp:237)
9 snowman 0x0000000103c2bbcb nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 747 (iterator:1171)
10 snowman 0x0000000103c2bb30 nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 592 (LivenessAnalyzer.cpp:220)
11 snowman 0x0000000103c2bcda nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 1018 (LivenessAnalyzer.cpp:237)
12 snowman 0x0000000103c2bbcb nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 747 (iterator:1171)
13 snowman 0x0000000103c2bb30 nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 592 (LivenessAnalyzer.cpp:220)
14 snowman 0x0000000103c2bcda nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 1018 (LivenessAnalyzer.cpp:237)
15 snowman 0x0000000103c2bbcb nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 747 (iterator:1171)
16 snowman 0x0000000103c2bb30 nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 592 (LivenessAnalyzer.cpp:220)
17 snowman 0x0000000103c2bcda nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 1018 (LivenessAnalyzer.cpp:237)
18 snowman 0x0000000103c2bbcb nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 747 (iterator:1171)
19 snowman 0x0000000103c2bb30 nc::core::ir::liveness::LivenessAnalyzer::propagateLiveness(nc::core::ir::Term const*) + 592 (LivenessAnalyzer.cpp:220)
I'll send the binary via email.
This can presumably be detected by looking for reads from 'ecx' without initializing it to anything and 'ret n' at the end.
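A rough sketch of that heuristic over a straight-line function body. The Insn record and the function name are invented for illustration; a real implementation would work on snowman's IR and would also need to treat idioms like 'xor ecx, ecx' as initialization rather than as a read.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, heavily simplified instruction record.
struct Insn {
    bool readsEcx;    // uses ecx as a source
    bool writesEcx;   // overwrites ecx
    bool isRet;       // is a return instruction
    unsigned retImm;  // n in 'ret n', 0 for a plain 'ret'
};

// True if ecx is read before anything in the body writes it and the
// body ends in 'ret n' with n > 0 (callee cleans up the stack).
bool readsEcxAndCleansStack(const std::vector<Insn>& body) {
    bool ecxWritten = false;
    bool ecxReadFirst = false;
    for (const Insn& i : body) {
        if (i.readsEcx && !ecxWritten) ecxReadFirst = true;
        if (i.writesEcx) ecxWritten = true;
    }
    return ecxReadFirst && !body.empty()
        && body.back().isRet && body.back().retImm > 0;
}
```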
This would be nice to have.
@yegord would you mind if we integrate the CMake scripts for finding external dependencies like libbfd or libELF?
The ELF parser we have today is good, but since different platforms (notably HPPA, Sparc and MIPS) have different extensions to their ELF ABI, wouldn't it be good to let an external library handle this if one is found?
It would be nice to reconstruct switch-case statements for readability.
This simple strlen implementation for Allegrex:
89014c4: move $v1, $a0
89014c8: lb $v0, 0($v1)
89014cc: bnez $v0, 0x089014C8
89014d0: addiu $v1, $v1, 1
89014d4: nor $v0, $zr, $a0
89014d8: jr $ra
89014dc: addu $v0, $v1, $v0
where only registers a0, v1 and v0 are used, gave me this strange C++ output:
void** strlen(unsigned char* a0, void** a1, void** a2, void** a3, void** t0, void** t1, void* t2) {
unsigned char* v1_8;
v1_8 = a0;
while (*v1_8) {
++v1_8;
}
return (uint32_t)(v1_8 + 1) + ~(uint32_t)a0;
}
Note: Allegrex can pass up to 8 arguments in registers (a0-a3, t0-t3).
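Apart from the extra arguments and the void** return type, the return expression itself is correct, just not simplified. A small sketch to check the arithmetic; the name strlen_decompiled is made up, and uintptr_t replaces uint32_t so it also runs on a 64-bit host.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the decompiled body above. uintptr_t replaces uint32_t
// (an assumption of this sketch; the real target is 32-bit).
uintptr_t strlen_decompiled(const unsigned char* a0) {
    const unsigned char* v1 = a0;
    while (*v1) {
        ++v1;
    }
    // (v1 + 1) + ~a0 == (v1 + 1) + (-a0 - 1) == v1 - a0
    return (uintptr_t)(v1 + 1) + ~(uintptr_t)a0;
}
```

So the expression could be reduced to v1_8 - a0, which is the string length.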
Application Specific Information: Assertion failed: ((jump->thenTarget().basicBlock() == thenBB && jump->elseTarget().basicBlock() == elseBB) || (jump->thenTarget().basicBlock() == elseBB && jump->elseTarget().basicBlock() == thenBB)), function makeExpression, file src/nc/core/ir/cgen/DefinitionGenerator.cpp, line 515.
I can send the binary required to reproduce this by email if desired.
It seems relatively common for compilers to take advantage of the implicit zero extension.
Currently snowman will split:
xorl %edx, %edx
into:
*(int32_t*)&rdx4 = 0;
*((int32_t*)&rdx4 + 1) = 0;
An example binary can be found here: http://people.mozilla.org/~jmuizelaar/snowman/f.dylib
The source is the same as issue #30
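For reference, a sketch of the rule involved: on x86-64, a write to a 32-bit register clears the upper 32 bits of the containing 64-bit register. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models a 32-bit register write on x86-64: the old upper half is
// discarded and the result is the zero-extended 32-bit value.
uint64_t write32(uint64_t /*oldValue*/, uint32_t value) {
    return (uint64_t)value;
}

// xorl %edx, %edx: the 32-bit result is 0, so all of rdx becomes 0.
uint64_t xorl_edx_edx(uint64_t rdx) {
    uint32_t edx = (uint32_t)rdx;
    return write32(rdx, edx ^ edx);
}
```

With that rule, the two half-assignments above could be emitted as a single rdx4 = 0.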
ARM decompilation currently seems to suffer quite a bit from confusing code and data and this should help there.
There are lots of options for a better technique
They are ugly and not needed.
We spend a very long time doing dataflow analysis. This seems to be because addDefinition/killDefinition don't scale very well.
The following:
case MIPS_INS_MULTU: {
auto operand0 = operand(0);
auto operand1 = operand(1);
_[
regizter(MipsRegisters::hilo()) ^= (zero_extend(std::move(operand0), 64) * zero_extend(std::move(operand1), 64))
];
break;
}
results in extra typecasts, since snowman assumes that 'hilo' is signed (int64_t) rather than, as it should be in this case, unsigned (uint64_t).
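To illustrate why the signedness matters, here is a small sketch of the two multiply flavours in plain C++. The function names multu and mult are just labels for this example.

```cpp
#include <cassert>
#include <cstdint>

// MULTU: unsigned 32x32 -> 64-bit product, matching the
// zero_extend form in the analyzer snippet above.
uint64_t multu(uint32_t rs, uint32_t rt) {
    return (uint64_t)rs * (uint64_t)rt;
}

// MULT, for contrast: signed 32x32 -> 64-bit product.
int64_t mult(int32_t rs, int32_t rt) {
    return (int64_t)rs * (int64_t)rt;
}
```

The same bit pattern 0xFFFFFFFF gives 0x1FFFFFFFE when multiplied by 2 as unsigned, but -2 as signed, so the result type of 'hilo' changes the meaning of the output.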
Hi,
Recently I integrated snowman into x64dbg and we noticed a weird thing. This code:
00007FF66E0D1444 | 48 83 EC 38 | sub rsp,38 |
00007FF66E0D1448 | 48 83 64 24 20 00 | and qword ptr ss:[rsp+20],0 |
00007FF66E0D144E | 41 B9 01 00 00 00 | mov r9d,1 |
00007FF66E0D1454 | 4C 8D 44 24 40 | lea r8,qword ptr ss:[rsp+40] |
00007FF66E0D1459 | 41 8D 51 10 | lea edx,dword ptr ds:[r9+10] |
00007FF66E0D145D | 48 C7 C1 FE FF FF FF | mov rcx,FFFFFFFFFFFFFFFE |
00007FF66E0D1464 | E8 AB FA FC FF | call 7FF66E0A0F14 |
00007FF66E0D1469 | 85 C0 | test eax,eax |
00007FF66E0D146B | 78 0A | js plzplz.7FF66E0D1477 |
00007FF66E0D146D | 80 7C 24 40 00 | cmp byte ptr ss:[rsp+40],0 |
00007FF66E0D1472 | 75 03 | jnz plzplz.7FF66E0D1477 |
00007FF66E0D1474 | CC | int3 |
00007FF66E0D1475 | EB 00 | jmp plzplz.7FF66E0D1477 |
00007FF66E0D1477 | 48 83 C4 38 | add rsp,38 |
00007FF66E0D147B | C3 | ret |
Shows as:
void fun_140001444() {
int32_t eax1;
signed char v2;
eax1 = fun_13ffd0f14();
if (!"intrinsic"() && v2 == 0) {
}
return;
}
The tree appears to go into infinite recursion:
Here is a binary (modified to only have this function, no malware): https://mega.co.nz/#!TgZz2LJa!RhX6Cc-SUiw8-IQufGamfLncm7PorI5odyCQf8mkk7Y
I plan to add a new architecture instead of using the current MIPS architecture, which is a work in progress, because ALLEGREX is not recognized by the Capstone framework, and the fact that the latter uses LLVM makes implementing ALLEGREX there too complex. There are subtle differences which make the MIPS architecture not viable for decompiling ALLEGREX code.
So, I am about to provide a specific disassembler (the one provided by pspdecompiler, based on the prxtools one, but with the addition of a decomposer) for the architecture analyser. That means handling all instructions, including VFPU.
I am pretty sure people from the uOFW project may be interested as well. But I am also pretty sure that it will be hard to decompile kernel modules, because they may use tricks which are not ABI compliant, so I expect more work than just making a simple disassembler/analyzer.
As for the author of http://lists.derevenets.com/pipermail/snowman/2015-August/000002.html, it would be great if he/she contributed here as well (PRX handling).
How does snowman handle the stack pointer? It seems like instructions/registers trying to access the stack pointer register will fail on MIPS.
For now they get defined as int32_t for me; however, I guess there is a way to create explicit statements to get the correct type, right?
I just decompiled the hello world binary from the examples page and got this:
// snowman doesn't resolve the symbol
int64_t g100001010 = 0x100000fa0;
void fun_100000f88(int64_t rdi) {
goto g100001010;
}
int64_t _main() {
fun_100000f88("Hello, World!");
return 0;
}
int64_t g100001000 = 0;
void fun_100000fa0() {
goto g100001000;
}
Shouldn't it be the following?:
int _main() {
puts("Hello, World!");
return 0x0;
}
Btw, the disassembly doesn't show symbol info in the asm code either.
Any suggestions?
Platform:
OSX 10.10
On MIPS the $gp register is saved across calls between functions.
How should this be taken care of? For now it gets marked as a local variable, and then the liveness analyzer kills it off.
"For the N32 and N64 ABIs, a function must preserve the $S0-$s7 registers, the global pointer ($gp or $28), the stack pointer ($sp or $29) and the frame pointer ($30). The O32 ABI is the same except the calling function is required to save the $gp register instead of the called function."
One should either add some algorithm choosing between AMD64 and Microsoft64 calling convention, or give the user the ability to choose a convention manually and redecompile everything.
On MIPS, for example, there are not only conditional jumps but also conditional calls.
Any hint on how to implement this would be nice.
Also, I don't see why the else target of a conditional jump cannot be a nullptr. On MIPS you have branch delay slots: the instruction in the slot becomes the directSuccessor and is run before the jump takes effect, so I actually want to jump to the directSuccessor()'s directSuccessor(). Any hints?
long q(long a, long b)
{
return a / b;
}
Sometimes I see pointers in the generated code that look like '(void **)0', a fairly obvious example of a 'NULL' for x86/AMD64. But there did (do?) exist architectures where this actually is a valid pointer.
I guess if we could match NULL pointers for every arch and replace them with 'NULL', readability would increase for the novice.
Hey Yegord.
I see you switched to the Capstone engine.
So the question is:
do you think it would be possible to add the MIPS architecture to the decompiler as well,
since Capstone already includes that architecture?
https://github.com/aquynh/capstone/tree/master/arch/Mips
Or would it be a lot of work to add MIPS support?
Regards
With some effort this could be implemented based on van Emmerik's approach in boomerang, allowing in-place customization of header files with function declarations. It would improve the readability and the type checking of decompiled code.
Most compilers allow inserting special instructions through intrinsic functions. This avoids having .asm files and helps the compiler know which registers are involved in an intrinsic function, so that it can clobber the necessary ones.
It would be interesting to let snowman emit specific intrinsic functions instead of inline assembly, so that it can link the terms used as arguments with previous statements and the result of the function with later statements, which inline assembly statements are unable to do.
Of course, those intrinsic functions are specific to an architecture and do not obey the same ABI rules as standard functions. They must be seen as user-named N-ary operators.
gcc-4.4 compiles indirect calls to something like this:
mov lr, pc
ldr pc, [r3, #0]
Snowman seems to miss this being a call because of the mov lr, pc idiom.
Here's a test case:
http://people.mozilla.org/~jmuizelaar/arm-indirect-call.o
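A sketch of how the idiom could be recognised, assuming ARM (not Thumb) state, where reading pc yields the address of the current instruction plus 8. The ArmInsn record and Kind enum are hypothetical simplifications, not snowman or Capstone types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of a decoded ARM instruction.
enum class Kind { MovLrPc, WritesPc, Other };

struct ArmInsn {
    uint32_t address;
    Kind kind;
};

// 'mov lr, pc' at address A sets lr to A + 8, the instruction after the
// next one. If the next instruction writes pc (e.g. 'ldr pc, [r3, #0]'),
// the pair behaves like an indirect call returning to A + 8.
bool isIndirectCallIdiom(const ArmInsn& first, const ArmInsn& second) {
    return first.kind == Kind::MovLrPc
        && second.kind == Kind::WritesPc
        && second.address == first.address + 4;
}

// Return address established by the 'mov lr, pc' half of the idiom.
uint32_t returnAddress(const ArmInsn& movLrPc) {
    return movLrPc.address + 8;
}
```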
I think this is one drawback of having our own ELF parser. I might find a fix by looking at the one in boomerang, though.
The following program is not properly decompiled when compiled for x86-64 with clang (Apple LLVM version 6.0)
long f(int x)
{
long l = 0;
while (x)
{
l *= l;
l += x;
x--;
}
return l;
}
_f:
pushq %rbp
movq %rsp, %rbp
xorl %eax, %eax
testl %edi, %edi
je 0x37
movslq %edi, %rcx
incq %rcx
xorl %edx, %edx
nopw %cs:_f(%rax,%rax)
movq %rdx, %rax
imulq %rax, %rax
decl %edi
leaq -0x1(%rcx,%rax), %rdx
leaq -0x1(%rcx), %rcx
jne 0x20
addq %rcx, %rax
popq %rbp
retq
The same program as in issue #30 is decompiled as:
_f is used as a 0 constant
int64_t _f(int32_t edi) {
int64_t rax2;
int64_t rcx3;
int64_t rdx4;
int64_t rax5;
int64_t rax6;
*(int32_t*)&rax2 = (int32_t)_f;
*((int32_t*)&rax2 + 1) = (int32_t)_f;
if (edi != _f) {
rcx3 = edi + 1;
*(int32_t*)&rdx4 = (int32_t)_f;
*((int32_t*)&rdx4 + 1) = (int32_t)_f;
do {
rax5 = rdx4;
rax6 = rax5 * rax5;
--edi;
rdx4 = rcx3 + rax6 + -1;
--rcx3;
} while (!(int1_t)(edi == _f));
rax2 = rax6 + rcx3;
}
return rax2;
}
An expression that ends up as something like "x = y + 64 - 64 + 4;" should be reduced to "x = y + 4;"
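A toy sketch of the folding step, working on a flattened "base plus constants" form. The Sum struct and fold function are invented for this example and say nothing about how snowman represents expressions; overflow of the constant sum is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flattened form of "base + c1 + c2 + ...";
// a subtraction is stored as a negative constant.
struct Sum {
    std::string base;
    std::vector<int64_t> constants;
};

// Adds up all constants and prints the simplified expression.
std::string fold(const Sum& s) {
    int64_t total = 0;
    for (int64_t c : s.constants) total += c;
    if (total == 0) return s.base;  // y + 0 becomes y
    if (total < 0) return s.base + " - " + std::to_string(-total);
    return s.base + " + " + std::to_string(total);
}
```

The same step also covers the trivial case of adding 0, which then disappears entirely.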
Hey @yegord
I am trying to build the snowman plugin for IDA 6.6 with Qt 4.8.7 and idasdk 6.6.
Nobody was able to load the last two prebuilt packages from http://derevenets.com/
(0.6 and 0.7) into IDA; the plugin could not be loaded.
I rebuilt Qt 4.8.7 for Windows in Release mode with the namespace QT,
and then configured snowman
with this CMake line:
cmake -G "Visual Studio 12" -D -DCMAKE_BUILD_TYPE=Release -D QT_NAMESPACE=QT -D IDA_PLUGIN_ENABLED=YES -D IDA_64_BIT_EA_T=NO -D NC_QT5=NO ../src
cmake --build .
This fails in the command prompt but builds in Visual Studio.
A note on the CMake configuration:
CMake apparently disregards the =Release flag and tries to build as Debug nevertheless.
Building as Release in Visual Studio works without trouble.
The build is recognised by IDA,
but after I switch to the snowman window, IDA crashes.
It looks like some heap problem.
A pastie of the crash dump is here:
http://pastebin.com/MyGDm2WF
The standalone snowman.exe works great, just not the plugin.
Any ideas?
EDIT!
I rebuilt Qt 4.8.4 with namespace QT, the version IDA itself uses, and got the same issue.
Is constant folding not supposed to simplify the expression v1 = v0 + 0; into v1 = v0;? I have a lot of lines like that, because on Allegrex move $v1, $v0 is in fact encoded as addiu $v1, $v0, $zero, whether $zero is chosen to return a register assigned the constant 0 or simply the constant 0 itself.
Is constant propagation also a feature that is present?
Currently there is a todo file in the repository, but is there a reason not to use issues for those? (I could add the issues if you don't have time for it).
On OS X it's common to do something like:
3: call dword 0x8
8: pop eax
This puts the EIP into eax which is then used for position independent accessing of data.
Snowman seems confused by this. Here's a sample:
http://people.mozilla.org/~jmuizelaar/snowman/const.o
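One way the idiom could be recognised, sketched over a hypothetical X86Insn record (not a snowman or Capstone type): a call whose target is its own fall-through address, immediately followed by a pop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified decoded instruction.
struct X86Insn {
    uint32_t address;
    uint32_t size;
    bool isCall;
    uint32_t callTarget;  // meaningful only if isCall
    bool isPopReg;        // 'pop <register>'
};

// A call pushes the address of the following instruction. If the call
// targets exactly that instruction and it is a pop, the popped register
// ends up holding the pop's own address, and no real call takes place.
bool isGetPcIdiom(const X86Insn& call, const X86Insn& next) {
    uint32_t fallThrough = call.address + call.size;
    return call.isCall
        && call.callTarget == fallThrough
        && next.address == fallThrough
        && next.isPopReg;
}
```

For the snippet above, the call at 3 is 5 bytes long and targets 8, so after the pop eax holds 8.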
Hey,
The documentation states you can use boost >= 1.49, but when using 1.57 there is a compile error. related: https://bugreports.qt.io/browse/QTBUG-23947
Solved by using 1.49 (I am building with VS13 and Qt 4.8.6)
I just put the entire section of ntdll.dll x64 in snowman (through the x64dbg module), and it crashes here
Any idea what it could be?
Boost version: 1.57.0
CMake Error at D:/CMake/share/cmake-3.3/Modules/FindQt4.cmake:1326 (message):
Found unsuitable Qt version "5.4.0" from
C:/Qt/Qt5.4.0/5.4/msvc2013/bin/qmake.exe, this code requires Qt 4.x
Call Stack (most recent call first):
CMakeLists.txt:107 (find_package)
Configuring incomplete, errors occurred!
See also "D:/snowman/src/build/CMakeFiles/CMakeOutput.log".
I tried to use CMake to generate snowman's MSVC project, but it tells me that only Qt 4.x is supported.
We could make some kind of glue to let the IDA plugin use IDA as the disassembly engine, which would improve the quality of the decompiled code.
gcc 4.6.4 yells on this:
[ 25%] Building CXX object nc/core/CMakeFiles/nc-core.dir/ir/cgen/CodeGenerator.cpp.o
In file included from /home/markus/Downloads/snowman/src/nc/core/likec/FunctionDeclaration.h:31:0,
from /home/markus/Downloads/snowman/src/nc/core/likec/FunctionDefinition.h:32,
from /home/markus/Downloads/snowman/src/nc/core/ir/cgen/CodeGenerator.cpp:42:
/home/markus/Downloads/snowman/src/nc/core/likec/ArgumentDeclaration.h: In constructor ‘nc::core::likec::ArgumentDeclaration::ArgumentDeclaration(nc::core::likec::Tree&, const QString&, const nc::core::likec::Type*)’:
/home/markus/Downloads/snowman/src/nc/core/likec/ArgumentDeclaration.h:47:48: error: conversion from ‘int’ to ‘std::unique_ptr<nc::core::likec::Expression>’ is ambiguous
/home/markus/Downloads/snowman/src/nc/core/likec/ArgumentDeclaration.h:47:48: note: candidates are:
/usr/include/c++/4.6/bits/unique_ptr.h:136:17: note: constexpr std::unique_ptr<_Tp, _Dp>::unique_ptr(std::nullptr_t) [with _Tp = nc::core::likec::Expression, _Dp = std::default_delete<nc::core::likec::Expression>, std::nullptr_t = std::nullptr_t]
/usr/include/c++/4.6/bits/unique_ptr.h:120:7: note: std::unique_ptr<_Tp, _Dp>::unique_ptr(std::unique_ptr<_Tp, _Dp>::pointer) [with _Tp = nc::core::likec::Expression, _Dp = std::default_delete<nc::core::likec::Expression>, std::unique_ptr<_Tp, _Dp>::pointer = nc::core::likec::Expression*]
make[2]: *** [nc/core/CMakeFiles/nc-core.dir/ir/cgen/CodeGenerator.cpp.o] Error 1
make[1]: *** [nc/core/CMakeFiles/nc-core.dir/all] Error 2
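The log suggests that the line at ArgumentDeclaration.h:47 passes 0 or NULL where a std::unique_ptr is expected; the source line itself is not shown here, so that is an assumption. If so, spelling it nullptr should pick the nullptr_t constructor unambiguously. A stand-alone sketch with made-up Expression and Declaration types:

```cpp
#include <cassert>
#include <memory>

// Stand-in for nc::core::likec::Expression (hypothetical, empty).
struct Expression {};

// Stand-in for a declaration that owns an optional initial value.
struct Declaration {
    std::unique_ptr<Expression> initialValue;

    // nullptr has type std::nullptr_t, so only the unique_ptr(nullptr_t)
    // candidate from the log applies; an int literal matches neither
    // candidate exactly.
    Declaration() : initialValue(nullptr) {}
};
```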
AFAIK x86 takes more than one register; MIPS has a total of 4 registers, and a pseudo-register cannot overlap all 4 at once (2 registers for integer operations and 2 for floating-point operations).
Decompilation of
8900470: addiu $sp, $sp, -8
8900474: sw $ra, 4($sp)
8900478: jal 0x0890C484
890047c: nop
8900480: lw $ra, 4($sp)
8900484: lui $v1, 0x0891
8900488: sw $v0, 11904($v1)
890048c: jr $ra
8900490: addiu $sp, $sp, 8
gave me
int32_t startTest(void** a0, void** a1, void** a2) {
int32_t v0_4;
v0_4 = sceKernelGetSystemTimeLow();
startSystemTime = v0_4;
return 0x8910000;
}
I would expect something like:
int32_t startTest() {
int32_t v0_4;
v0_4 = sceKernelGetSystemTimeLow();
startSystemTime = v0_4;
return v0_4;
}
As @aerosoul94 pointed out, the way to implement a syscall on most architectures would not yield a function call like '__syscall_XXX', which would be preferable:
[16/10/15 18:45:03 ] aerosoul94: so it would be like __syscall_XXX(r3, r4);
[16/10/15 18:45:06 ] Markus: yes
[16/10/15 18:45:31 ] aerosoul94: i think your method would output r11(r3, r4);
So we need either a way to mark a call as a syscall or the ability to rename the output from call().
This happens in the MIPS and Allegrex projects, specifically:
AllegrexInstructionAnalyzer:189,197
MipsInstructionAnalyzer:97,103
in the hlide and nihilus projects.
Now, I'm completely clueless as to how to solve these, so I'm just leaving the issue here.
Maybe someone else can fix it.
I'm using Visual Studio 2010 with qt 4.8.6 and boost 1.58.
/Users/nietzsche/Downloads/snowman/src/nc/core/likec/BinaryOperator.cpp:84:25: error:
no member named 'abs' in namespace 'std'; did you mean simply 'abs'?
int absPrecedence = std::abs(precedence);
^~~~~~~~
abs
/usr/include/stdlib.h:129:6: note: 'abs' declared here
int abs(int) __pure2;
^
#include <cstdlib> fixed this issue.
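For completeness, a minimal stand-alone version of the fix. The function name absPrecedence is borrowed from the variable in the error message; the rest is illustrative.

```cpp
#include <cassert>
#include <cstdlib>  // declares the integer overloads of std::abs

// Same call as in BinaryOperator.cpp:84, now with the header that
// guarantees std::abs(int) is visible on every standard library.
int absPrecedence(int precedence) {
    return std::abs(precedence);
}
```

Some standard libraries pull this header in indirectly, which is likely why the missing include went unnoticed on other platforms.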
Capstone 3.0.4 fixes some bugs for ARM (and MIPS), so we should upgrade to it.