libdasm -- simple x86 disassembly library

2004 - 2006 (c) [email protected]

2015 (c) [email protected]

1. Acknowledgements
===================

Thanks to skape, thief, spoonm and fine folks @nologin for bug reports, ideas and support. Special thanks to ero for creating and contributing pydasm and to skape for rbdasm.

2. What is libdasm?
===================

libdasm is a C library that tries to provide a simple and convenient way to disassemble raw Intel x86 opcode bytes (machine code). It can parse and print out opcodes in AT&T and Intel syntax.

The opcodes are based on IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference, order number 243667, year 2004. Non-Intel instructions are not supported at the moment. This includes Intel-compatible CPU extensions from other vendors, such as AMD 3DNow!.

libdasm should compile with all decent C-compilers (only gcc and MSVC tested).

3. How to use libdasm?
======================

Compiling your application with libdasm is very easy. As usual, there are several ways to do it:

  • Include "libdasm.c" in your source and compile as usual. Remember to copy "libdasm.h" and "opcode_tables.h" into the same directory, as they are included by the main C file.

  • Include "libdasm.h" and compile together with "libdasm.c" (and remember to also copy "opcode_tables.h").

  • Compile libdasm as a library and link against it statically or dynamically, depending on the system and your needs. Win32 DLL and Unix static/dynamic libraries can be built with the supplied CMake files.

  • Compile pydasm and use libdasm as a Python module (see directory "pydasm" for more information).

  • Compile rbdasm and use libdasm as a Ruby module (see directory "rbdasm" for more information).

For basic disassembling, there are only one or two libdasm functions you will need. The first and most important function is get_instruction.

3.1. get_instruction

get_instruction analyzes a data stream and fills in a structure representing the instruction. This structure, defined as struct INSTRUCTION, can later be used for formatting the instruction into printable form or for analyzing the instruction contents. The function is declared as follows:

int get_instruction(
      INSTRUCTION *inst,      // pointer to INSTRUCTION structure
      BYTE *addr,             // data buffer
      enum Mode mode          // mode: MODE_32 or MODE_16
);

The first argument is a pointer to an INSTRUCTION structure. There is no need to initialize the structure before the call; get_instruction takes care of filling it.

The second argument is the address of the code buffer. get_instruction reads data starting from that address and parses a single instruction. The INSTRUCTION structure is filled with the components of the parsed instruction. Normally you don't need to know about the contents of the structure, but if you do, read the next chapter.

The third argument, the mode, is either 32-bit (MODE_32) or 16-bit (MODE_16). This is the desired addressing mode. Note that an individual instruction might override the mode.
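As a rough illustration of how an instruction can override the requested mode: on IA-32 the operand-size prefix (0x66) and the address-size prefix (0x67) each flip the corresponding default size for a single instruction. The sketch below models only that rule. The enum and function names are made up for the example and are not part of libdasm.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the default mode passed to the decoder. */
enum default_mode { DEFAULT_16 = 16, DEFAULT_32 = 32 };

/*
 * Return the operand or address size in bits that applies to one
 * instruction: the default size, flipped when the matching override
 * prefix (0x66 for operand size, 0x67 for address size) is present.
 */
static int effective_bits(enum default_mode mode, bool has_override_prefix)
{
    if (!has_override_prefix)
        return (int)mode;
    return mode == DEFAULT_32 ? 16 : 32;
}
```

For example, with a 32-bit default the bytes 66 B8 34 12 encode "mov ax, 0x1234" (a 16-bit move) rather than a 32-bit one.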

get_instruction returns the instruction length. A return value of zero indicates an illegal instruction.

When get_instruction returns, you can print the instruction with get_instruction_string or do analysis of the instruction members. When ready, increment data buffer pointer to next instruction and call get_instruction again. Here is pseudocode presenting this procedure:

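INSTRUCTION inst;
int len, c = 0;
BYTE *buf;      // code buffer to disassemble (set by the caller)
int buflen;     // number of bytes in buf (set by the caller)

while (c < buflen) {
    len = get_instruction(&inst, buf + c, MODE_32);
    if (len == 0) {
        // illegal instruction: skip one byte so the loop always advances
        c++;
        continue;
    }

    // do something with the instruction

    c += len;
}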

3.2. get_instruction_string

get_instruction_string parses the instruction structure and fills in a string representing the instruction in the given format. Currently, AT&T and Intel formats are supported. The function is declared as:

int get_instruction_string(
    INSTRUCTION *instr,     // pointer to INSTRUCTION structure
    enum Format format,     // format: FORMAT_ATT or FORMAT_INTEL
    DWORD offset,           // instruction absolute address
    char *string,           // string buffer
    int length              // string length
);

The offset is needed only if you want relative offsets (jmp/call/loop etc.) to look nice. If you are parsing instructions at a known virtual address, use that virtual address. Otherwise, you can use zero. DWORD is defined in libdasm.h as an unsigned 32-bit number (libdasm only supports IA-32 at the moment). string is a pointer to the output buffer and length is the size of that buffer. Note that the text is truncated if it doesn't fit in the buffer.

get_instruction_string will initialize the string and terminate it correctly for convenience. It returns zero if the operation is not successful.
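To see why the offset matters for jmp/call/loop: a relative branch stores a signed displacement measured from the address of the next instruction, so a readable absolute target can only be printed when the address of the instruction itself is known. The sketch below shows just that arithmetic. The function name and parameters are hypothetical, and how libdasm stores the displacement or formats the target internally is not shown here.

```c
#include <stdint.h>

/*
 * Absolute target of a relative jmp/call/loop: the displacement is
 * relative to the address of the NEXT instruction, so the target is
 * instruction address + instruction length + signed displacement.
 * The arithmetic is done in uint32_t so it wraps at 4 GiB.
 */
static uint32_t branch_target(uint32_t insn_addr, int insn_len, int32_t rel)
{
    return insn_addr + (uint32_t)insn_len + (uint32_t)rel;
}
```

With an offset of zero the same formula yields a target relative to the start of the buffer, which is why zero is a usable default when no virtual address is known.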

That's it! Check out sample disassembler programs for examples.

3.3. Other libdasm functions

libdasm internally uses a lot of useful functions that might help with instruction formatting etc. For example, get_instruction_string calls get_mnemonic_string and get_operand_string for simple instruction formatting. These functions are declared as:

int get_mnemonic_string(
    INSTRUCTION *inst,
    enum Format format,
    char *string,
    int length
);

int get_operand_string(
    INSTRUCTION *inst,
    OPERAND *op,
    enum Format format,
    DWORD offset,
    char *string,
    int length
);

Both functions initialize and terminate the string buffer and return data formatted according to the "format" argument. There are also many useful helper functions defined in libdasm.h for analyzing instruction contents.

4. INSTRUCTION structure
========================

If all you need is to fetch and print out instructions in the data buffer, you can skip this chapter. But if you need to inspect the individual components that make up an instruction, you will need this information.

All libdasm functions inspect and/or manipulate INSTRUCTION structure. It is defined as follows:

typedef struct _INSTRUCTION {
    int length;             // Instruction length
    enum Instruction type;  // Instruction type
    enum Mode mode;         // Addressing mode
    BYTE opcode;            // Actual opcode
    BYTE modrm;             // MODRM byte
    BYTE sib;               // SIB byte
    int extindex;           // Extension table index
    int fpuindex;           // FPU table index
    int dispbytes;          // Displacement bytes (0 = no displacement)
    int immbytes;           // Immediate bytes (0 = no immediate)
    int sectionbytes;       // Section prefix bytes (0 = no section prefix)
    OPERAND op1;            // First operand (if any)
    OPERAND op2;            // Second operand (if any)
    OPERAND op3;            // Additional operand (if any)
    int flags;              // Instruction flags
} INSTRUCTION, *PINSTRUCTION;

The most important members are probably "length", "opcode", and the operands. "length" is the instruction size, also returned by get_instruction. If the instruction size is zero, the instruction is illegal. "opcode" is the instruction opcode byte. Some of the most common instructions also have a meaningful "type" member. This member can have one of the following values:

INSTRUCTION_TYPE_MOV,
INSTRUCTION_TYPE_ADD,
INSTRUCTION_TYPE_SUB,
INSTRUCTION_TYPE_INC,
INSTRUCTION_TYPE_DEC,
INSTRUCTION_TYPE_DIV,
INSTRUCTION_TYPE_MUL,
INSTRUCTION_TYPE_IMUL,
INSTRUCTION_TYPE_XOR,
INSTRUCTION_TYPE_LEA,
INSTRUCTION_TYPE_XCHG,
INSTRUCTION_TYPE_CMP,
INSTRUCTION_TYPE_TEST,
INSTRUCTION_TYPE_PUSH,  // includes enter, pusha and pushf
INSTRUCTION_TYPE_AND,
INSTRUCTION_TYPE_OR,
INSTRUCTION_TYPE_POP,   // includes popa and popf
INSTRUCTION_TYPE_JMP,   // includes jmpf
INSTRUCTION_TYPE_JMPC,  // conditional jump
INSTRUCTION_TYPE_LOOP,
INSTRUCTION_TYPE_CALL,  // includes callf
INSTRUCTION_TYPE_RET,   // includes leave, retn and retf
INSTRUCTION_TYPE_INT,   // interrupt
INSTRUCTION_TYPE_FPU,   // FPU-related instruction
INSTRUCTION_TYPE_OTHER, // Other instructions :-)

The list above is not complete; check out libdasm.h for the complete listing of all possible instruction types.
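One typical use of the "type" member is to spot instructions that can redirect execution, for instance when splitting code into basic blocks. The sketch below is a minimal stand-alone version of such a check. It uses a local enum with shortened, hypothetical names and placeholder values; with libdasm itself you would switch on inst.type using the INSTRUCTION_TYPE_* constants from libdasm.h.

```c
#include <stdbool.h>

/*
 * Local stand-in for a few of the type values listed above. The real
 * enum lives in libdasm.h; the numeric values here are placeholders.
 */
enum insn_type {
    TYPE_MOV, TYPE_JMP, TYPE_JMPC, TYPE_LOOP, TYPE_CALL, TYPE_RET,
    TYPE_INT, TYPE_OTHER
};

/* True if an instruction of this type can transfer control. */
static bool changes_control_flow(enum insn_type type)
{
    switch (type) {
    case TYPE_JMP:
    case TYPE_JMPC:
    case TYPE_LOOP:
    case TYPE_CALL:
    case TYPE_RET:
    case TYPE_INT:
        return true;
    default:
        return false;
    }
}
```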

Individual operands can be accessed by the OPERAND structures. All instructions have 0-3 operands which are ordered in INTEL order (op1 is the first operand in INTEL syntax). struct OPERAND is defined as:

typedef struct _OPERAND {
    enum Operand type;      // Operand type (register, memory, etc)
    int reg;                // Register (if any)
    int basereg;            // Base register (if any)
    int indexreg;           // Index register (if any)
    int scale;              // Scale (if any)
    int dispbytes;          // Displacement bytes (0 = no displacement)
    int dispoffset;         // Displacement offset (0 = no displacement)
    int immbytes;           // Immediate bytes (0 = no immediate)
    int immoffset;          // Immediate offset (0 = no immediate)
    int sectionbytes;       // Section prefix bytes (0 = no section prefix)
    WORD section;           // Section prefix value
    DWORD displacement;     // Displacement value
    DWORD immediate;        // Immediate value
    int flags;              // Operand flags
} OPERAND, *POPERAND;

Operand type is always defined in member "type". This member can have one of the following values:

    OPERAND_TYPE_NONE
    OPERAND_TYPE_MEMORY
    OPERAND_TYPE_REGISTER
    OPERAND_TYPE_IMMEDIATE

If the type is OPERAND_TYPE_NONE, operand is not present in the instruction.

If the type is OPERAND_TYPE_REGISTER, OPERAND member "reg" is present.

If the type is OPERAND_TYPE_MEMORY, some combination of the members "basereg", "indexreg", "scale", "dispbytes" and "displacement" is present. These members form the memory operand as follows:

	[ basereg + scale * indexreg + displacement ] (INTEL)
	displacement(basereg, indexreg, scale)        (ATT)
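Both notations describe the same address computation. As a minimal sketch of it, the code below evaluates basereg + scale * indexreg + displacement for given register contents. The struct is a hypothetical mirror of the relevant OPERAND members, REG_NONE is a made-up "no register" marker, and registers are simply indices into an array, so none of this is libdasm's own code.

```c
#include <stdint.h>

#define REG_NONE (-1)   /* hypothetical "no register" marker */

/* Hypothetical mirror of the memory-related OPERAND members. */
struct mem_operand {
    int basereg;           /* index into regs[], or REG_NONE */
    int indexreg;          /* index into regs[], or REG_NONE */
    int scale;             /* 1, 2, 4 or 8 when indexreg is used */
    uint32_t displacement; /* 0 when there is no displacement */
};

/* Compute basereg + scale * indexreg + displacement, modulo 2^32. */
static uint32_t effective_address(const struct mem_operand *op,
                                  const uint32_t regs[8])
{
    uint32_t ea = op->displacement;

    if (op->basereg != REG_NONE)
        ea += regs[op->basereg];
    if (op->indexreg != REG_NONE)
        ea += (uint32_t)op->scale * regs[op->indexreg];
    return ea;
}
```

An operand with no base and no index register, such as [0x30], reduces to the displacement alone.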

If the type is OPERAND_TYPE_IMMEDIATE, some combination of the members "immbytes", "sectionbytes", "section" and "immediate" is present. Section-specific members are used only in far call/jmp instructions. Member "immediate" is filled with the actual immediate value. Example: in "mov eax, 0x11" the "immediate" member of the second operand is 0x11.

If present, register members "reg", "basereg" and "indexreg" can have one of the following values:

REGISTER_EAX
REGISTER_ECX
REGISTER_EDX
REGISTER_EBX
REGISTER_ESP
REGISTER_EBP
REGISTER_ESI
REGISTER_EDI

If registers are not present, they are defined as REGISTER_NOP. Note that the register is not necessarily a general purpose register. To find out, inspect the operand flags or use the helper function get_register_type. The register type can be one of the following:

REGISTER_TYPE_GEN
REGISTER_TYPE_SEGMENT 
REGISTER_TYPE_DEBUG 
REGISTER_TYPE_CONTROL 
REGISTER_TYPE_TEST
REGISTER_TYPE_XMM
REGISTER_TYPE_MMX
REGISTER_TYPE_FPU

get_register_type returns one of these values only if the operand type is OPERAND_TYPE_REGISTER. If the operand is OPERAND_TYPE_MEMORY, the registers are always general purpose registers, and for immediate operands there are of course no registers involved.
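For readers who want to inspect the raw "modrm" member of INSTRUCTION themselves: the ModRM byte packs three fields, and its register fields use the hardware numbering eax=0, ecx=1, edx=2, ebx=3, esp=4, ebp=5, esi=6, edi=7, which is the same order as the register list above. The sketch below splits the byte into its fields. The struct and function names are made up for the example, and whether the REGISTER_* constants in libdasm.h carry exactly these numeric values is an assumption.

```c
#include <stdint.h>

/* Fields of a ModRM byte: mod (bits 7-6), reg (bits 5-3), rm (bits 2-0). */
struct modrm_fields {
    unsigned mod;
    unsigned reg;
    unsigned rm;
};

static struct modrm_fields split_modrm(uint8_t modrm)
{
    struct modrm_fields f;

    f.mod = (modrm >> 6) & 0x3u;
    f.reg = (modrm >> 3) & 0x7u;
    f.rm  = modrm & 0x7u;
    return f;
}

/* Hardware encoding order of the 32-bit general purpose registers. */
static const char *const reg32_names[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};
```

For instance, in the two-byte encoding 89 D8 ("mov eax, ebx") the ModRM byte 0xD8 splits into mod=3, reg=3 (ebx) and rm=0 (eax).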

5. Miscellaneous notes
======================

5.1. General output formatting

get_instruction_string tries to follow INTEL/ATT conventions but not too strictly. There are some compromises that are made to keep the implementation simple (or because the current implementation is already too complex..).

5.2. Segment prefix formatting

Libdasm is built on the assumption that an instruction has at most one memory operand. If there is a segment register override, the segment register is placed in front of the memory operand, like this:

mov eax, fs:[0x30]

If there are no memory operands, the segment prefix is placed in front of the instruction:

fs mov eax, 0x30

Some string instructions, like cmps, are also treated as having no memory operands. In reality, cmps has two memory operands. So the following:

fs cmpsd 

is equivalent to:

cmpsd fs:[esi], es:[edi]

And btw, if you are wondering what those weird "(bt)" and "(bnt)" prefixes in front of conditional jumps are, they are branch hint prefixes ("branch taken" and "branch not taken").
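For reference, the segment override prefixes are single bytes in front of the opcode, and two of them double as branch hints when they precede a conditional jump. The lookup sketch below maps the prefix bytes to the names used in the output above. The function names are hypothetical and this is not how libdasm stores or prints prefixes internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Name of the segment selected by an override prefix byte, or NULL. */
static const char *segment_prefix_name(uint8_t byte)
{
    switch (byte) {
    case 0x26: return "es";
    case 0x2E: return "cs";
    case 0x36: return "ss";
    case 0x3E: return "ds";
    case 0x64: return "fs";
    case 0x65: return "gs";
    default:   return NULL;
    }
}

/* Branch hint meaning of a prefix in front of a conditional jump. */
static const char *branch_hint_name(uint8_t byte)
{
    if (byte == 0x2E)
        return "(bnt)";   /* branch not taken */
    if (byte == 0x3E)
        return "(bt)";    /* branch taken */
    return NULL;
}
```

The earlier example "mov eax, fs:[0x30]" is encoded as 64 A1 30 00 00 00, where the leading 0x64 is the fs override.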

5.3. Instruction correctness

There is not much sanity checking in the current code. So if you feed libdasm enough random data or illegally constructed instructions, it will probably give wrong disassembly at some point. But libdasm should always disassemble "real" code correctly.

5.4. Boundary checks

Libdasm does not check read buffer boundaries. If an opcode requires additional data to be read and that data cannot be accessed, libdasm might cause an access violation, depending on the implementation. There is no platform-independent way of checking this condition, so you had better make sure of it yourself. If the data is real machine code, there is no problem (unless of course there is a bug in libdasm), because libdasm needs to read exactly what the instruction requires and of course the full instruction is in the buffer, right? But in some rare cases, when disassembling random data, this could cause trouble.
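One possible way to guard against this on the caller's side is to decode from a small zero-padded copy of the data instead of from the end of the real buffer. The sketch below prepares such a copy. A valid IA-32 instruction is at most 15 bytes long, but how far libdasm may read on malformed input is not stated here, so WINDOW_SIZE is an assumption with some slack, and the function name is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 32  /* assumed upper bound on bytes read per call */

/*
 * Copy the bytes from buf[pos] to the end of the buffer (at most
 * WINDOW_SIZE of them) into window and zero-fill the rest. Returns the
 * number of real bytes copied, so the caller can reject an instruction
 * whose decoded length exceeds it.
 */
static size_t fill_window(const uint8_t *buf, size_t buflen, size_t pos,
                          uint8_t window[WINDOW_SIZE])
{
    size_t avail = pos < buflen ? buflen - pos : 0;

    if (avail > WINDOW_SIZE)
        avail = WINDOW_SIZE;
    memset(window, 0, WINDOW_SIZE);
    if (avail > 0)
        memcpy(window, buf + pos, avail);
    return avail;
}
```

The caller would then pass the window to the decoder in place of buf + pos and treat any instruction longer than the returned count as truncated.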

5.5. Endianness

Endianness might not be identified correctly on all platforms (see libdasm.h for the definition of LITTLE_ENDIAN). If you encounter endianness-related problems, please report the system and a possible workaround for the problem.
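Some background on why endianness matters here: x86 stores immediates and displacements in little-endian byte order, so "mov eax, 0x11" is encoded as B8 11 00 00 00. The sketch below shows one way to read such a value that does not depend on the byte order of the host. It only illustrates the idea; the function name is hypothetical and this is not the code libdasm uses.

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian value from four consecutive bytes. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Because the value is built with shifts rather than by casting the byte pointer, the result is the same on little-endian and big-endian hosts.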

5.6. Inline functions

Some functions are defined as inline; this might not work with all compilers. Only gcc and MSVC are tested by the author.

5.7. Other issues

There are probably MANY unknown bugs in code and in instruction tables. Some known issues are listed in file TODO.txt.

6. Licensing
============

libdasm is public domain software. You can do whatever you like with it.
