c3rb3ru5d3d53c / binlex

A Binary Genetic Traits Lexer Framework

License: The Unlicense

CMake 0.58% Makefile 0.11% C++ 96.90% Python 1.85% C 0.54% Dockerfile 0.04%
malware malware-research malware-analysis yara genetic-algorithm machine-learning genetic-programming reverse-engineering

binlex's Introduction

  • 👋 Hi, I'm @c3rb3ru5d3d53c
  • 👀 I'm interested in malware
  • 🌱 I'm currently learning about many things
  • 💞️ I'm looking to collaborate on malware research
  • 📫 How to reach me: @c3rb3ru5d3d53c

binlex's People

Contributors

c3rb3ru5d3d53c, catalinv-ncc, g0nzu1, herrcore, idiom, jbx81-1337, jershmagersh, jgru, kayleylahaie, knightsc, markel-d00rt-tr, mihino89, mrexodia, oopo, pisco-sour, rpkrawczyk, sophia-brandt, victoriagray

binlex's Issues

IDA Plugin

Write an IDA plugin similar to the Cutter plugin.

Memory leak in ClearTrait()

The function ClearTrait overwrites trait->bytes_sha256 with NULL. Any memory that was allocated for it is not freed first, which results in a memory leak.
The same is true for trait->trait.
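
One possible fix is sketched below. The `Trait` struct here is a hypothetical, reduced stand-in that models only the two fields named in this issue, and the sketch assumes both fields are either null or were allocated with `malloc`. If ClearTrait is also called on uninitialized structs to set them up, the fields would have to be nulled at creation first, otherwise `free` would be called on garbage pointers.

```cpp
#include <cstdlib>

// Hypothetical, reduced stand-in for binlex's trait structure.
// Only the two fields mentioned in the issue are modeled.
struct Trait {
    char *trait;
    char *bytes_sha256;
};

// Assumption: both fields are either nullptr or point to memory obtained
// from malloc. If the real code allocates with new[], use delete[] instead.
void ClearTrait(Trait *t) {
    if (t == nullptr) {
        return;
    }
    std::free(t->trait);         // free(nullptr) is a no-op
    std::free(t->bytes_sha256);
    t->trait = nullptr;          // reset so a second call stays safe
    t->bytes_sha256 = nullptr;
}
```

Calling `ClearTrait` twice on the same struct is harmless in this sketch, because the pointers are reset after being freed.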

Windows: CMake Parameter Issue

The parameter "CMAKE_CXX_FLAGS_RELEASE:STRING=" ends up inside the C/C++ compiler options and breaks Visual Studio's IntelliSense. Removing it manually fixes the problem; I am still trying to understand how to remove it properly.

Granularity of the output

Hello, nice project! I'm wondering whether it would be useful to make the output more granular by additionally generating traits/bytes for individual blocks, instructions, opcodes and operands. I believe this would help subsequent processing in some cases, e.g. the first-place team in the Microsoft Malware Classification Challenge used the frequency of opcodes and their N-grams, among other things.
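
To illustrate the kind of downstream processing mentioned here, the following is a minimal sketch of opcode N-gram counting. It is not part of binlex: `CountNgrams` is a hypothetical helper, and it assumes the instruction mnemonics have already been extracted into a vector of strings.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Count how often each run of n consecutive mnemonics occurs.
// Keys are the mnemonics joined by single spaces, e.g. "push mov".
std::map<std::string, int> CountNgrams(const std::vector<std::string> &mnemonics,
                                       std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0 || mnemonics.size() < n) {
        return counts;  // no complete n-gram fits
    }
    for (std::size_t i = 0; i + n <= mnemonics.size(); ++i) {
        std::string key = mnemonics[i];
        for (std::size_t j = 1; j < n; ++j) {
            key += " " + mnemonics[i + j];
        }
        ++counts[key];
    }
    return counts;
}
```

For the sequence push, mov, push, mov, ret and n = 2, the bigram "push mov" is counted twice while "mov push" and "mov ret" are counted once each.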

Timeout

Add an option for an execution timeout for advanced users

Tagging

Add a CLI parameter to specify a tag string.

This will make it possible to find similar samples across a corpus and also to track threat-actor code reuse.

Refactoring

Refactoring of code to make C++ API more accessible and readable

Trait Assembly

Add assembly output to traits for debugging and additional information (optional cli switch)

Inconsistent Wildcarding

Proposed solution:

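// Wildcard the tail of `trait`, starting where a suffix-trimmed form of `bytes` first matches.
string Common::WildcardTrait(string trait, string bytes){
    size_t count = bytes.length();
    // Drop one trailing hex byte (3 characters) per pass, then search for the remainder.
    for (size_t i = 0; i + 2 < count; i = i + 3){
        bytes.erase(bytes.length() - 3);
        size_t index = trait.find(bytes, 0);
        if (index != string::npos){
            // Replace every byte from the match to the end of the trait with "??".
            for (size_t j = index; j < trait.length(); j = j + 3){
                trait.replace(j, 2, "??");
            }
            break;
        }
    }
    return TrimRight(trait);
}

// Wildcard the memory displacement of an x86 instruction, if it has one.
string Decompiler::WildcardInsn(cs_insn *insn){
    string bytes = HexdumpBE(insn->bytes, insn->size);
    string trait = bytes;
    for (int j = 0; j < insn->detail->x86.op_count; j++){
        cs_x86_op operand = insn->detail->x86.operands[j];
        switch(operand.type){
            case X86_OP_MEM:
                {
                    // Only memory operands with a non-zero displacement are wildcarded.
                    if (operand.mem.disp != 0){
                        trait = WildcardTrait(bytes, HexdumpBE(&operand.mem.disp, sizeof(uint64_t)));
                    }
                    break;
                }
            default:
                break;
        }
    }
    return TrimRight(trait);
}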

File Parsing Methods

Add ReadStream(char* bytes, int bytesize, ...), then have ReadFile just open the file and call ReadStream.
This way we can expose two low-level C APIs that can be called from Python:
one: passing a path
two: passing bytes
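
The layering proposed here could look roughly like the sketch below. The signatures are assumptions, since the issue elides the remaining parameters, and the body of `ReadStream` is a placeholder that only reports how many bytes it was given. The `extern "C"` linkage is one way to make both entry points callable from Python, for example through ctypes.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

// Entry point two: parse from a buffer. Placeholder body (assumption):
// real format parsing would go here. Returns the number of bytes consumed,
// or -1 on invalid input.
extern "C" long ReadStream(const char *bytes, std::size_t size) {
    if (bytes == nullptr && size != 0) {
        return -1;
    }
    return static_cast<long>(size);
}

// Entry point one: parse from a path. It only loads the file into memory
// and delegates to ReadStream, so all parsing logic lives in one place.
extern "C" long ReadFile(const char *path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return -1;  // file missing or unreadable
    }
    std::vector<char> data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    return ReadStream(data.data(), data.size());
}
```

With this split, any fix to the parser applies to both the path-based and the bytes-based API.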

Problems with function recognition

Hello! I wanted to process an OpenSSL library and noticed that the latest version of binlex recognized only a negligible number of functions: 7, whereas IDA recognized 1636. I used the command binlex -m pe:x86_64 -i <lib_name> | jq -r 'select(.type == ("function"))'. Am I doing something wrong, or is this a bug?

CIL/.NET Binary Support

This is already partially implemented in the branch pe_cil.

Work with this branch to add the necessary functionality.

Problems with instructions

The latest version of binlex (v1.1.1) outputs some seemingly incorrect instruction traits. Tested on an OpenSSL library (SHA1:ef406228f7694359c5f87e2ee7b4f760dcf160f6). The command binlex -m pe:x86_64 --instructions -i <lib_name> | jq -r 'select(.type == ("instruction")) | .trait' returns a number of odd traits such as 00 00, 00 ff, ??

File size should be long not int

In

int Common::GetFileSize(FILE *fd){
    int start = ftell(fd);
    fseek(fd, 0, SEEK_END);
    int size = ftell(fd);
    fseek(fd, start, SEEK_SET);
    return size;
}
the file size is returned as an `int`. Depending on the machine's architecture this may overflow; `ftell()` returns a `long`.
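
A minimal sketch of the suggested change, written as a free function rather than a `Common` member so it stands alone. The error handling (returning -1) is an addition that is not in the original code. Note that `long` is still only 32 bits wide on some platforms, for example 64-bit Windows, so very large files may need a 64-bit offset API such as POSIX `ftello` or MSVC's `_ftelli64`.

```cpp
#include <cstdio>

// Return the size of an open file in bytes, or -1 on failure.
// The current read position is restored before returning.
long GetFileSize(FILE *fd) {
    long start = std::ftell(fd);
    if (start < 0) {
        return -1;
    }
    if (std::fseek(fd, 0, SEEK_END) != 0) {
        return -1;
    }
    long size = std::ftell(fd);  // ftell() returns long, so no narrowing here
    if (std::fseek(fd, start, SEEK_SET) != 0) {
        return -1;
    }
    return size;
}
```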

TLSH Version Bump

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Bump TLSH Version

Describe alternatives you've considered
N/A

Additional context
N/A

Wildcard NOPs

Wildcard NOPs for x86/x86_64.

CloudEye uses NOPs at the beginning of functions; wildcard these so that leading wildcards can be parsed out.
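
A rough sketch of the idea, operating on the space-separated hex trait string rather than on disassembly. `WildcardLeadingNops` is a hypothetical helper, not an existing binlex function, and it assumes the trait starts on an instruction boundary. It only handles the single-byte NOP (0x90); multi-byte NOP encodings would need the disassembler.

```cpp
#include <cstddef>
#include <string>

// Replace leading single-byte NOPs ("90") in a space-separated hex trait
// with "??" wildcards. Stops at the first byte that is not 0x90.
std::string WildcardLeadingNops(std::string trait) {
    // Each byte occupies 3 characters: two hex digits and a separator.
    for (std::size_t i = 0; i + 2 <= trait.size(); i += 3) {
        if (trait.compare(i, 2, "90") != 0) {
            break;
        }
        trait.replace(i, 2, "??");
    }
    return trait;
}
```

A 0x90 byte that appears after the first non-NOP byte is left untouched, since at that point it may be an operand rather than an instruction.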

Recursive Decompiler

Modify class DecompilerREV for testing, and implement it before making the switch over.

Should just be able to replace DecompilerREV with Decompiler to use once ready.

Cleaning the docker.sh script

Using the blserver branch, investigate the use of Docker Swarm instead of docker.sh to generate the docker-compose.yml.

This should make deployment of our containers much more user friendly.

If we decide to go with Docker Swarm, let c3rb3ru5 know and begin working on replacing docker.sh.

Obfuscated Trait Detection, Thresholds and Recursion

  • โœ”๏ธ Cyclomatic Complexity
  • โœ”๏ธ Basic Block Size in Bytes
  • โœ”๏ธ Basic Block Instruction Count
  • โœ”๏ธ Function Size
  • โœ”๏ธ Average Instructions per Block
  • โœ”๏ธ Use cs_disasm_iter() for improved speed and control over program counter
  • โœ”๏ธ Pretty Print
  • โœ”๏ธ Recursive Decompilation by Instruction (fine-tuned control over exceptions)
  • โœ”๏ธ Wildcard Scalars

This allows users to fine-tune their output. These metrics are best calculated at the decompiler stage, especially cyclomatic complexity.

[
  {
    "average_instructions_per_block": 3,
    "blocks": 1,
    "bytes": "01 c3 29 c6 75 c1",
    "bytes_entropy": 0,
    "bytes_sha256": "5776a6a5e142981e2848b93a068268018809b786e310fca8b142cadd724f6f9a",
    "instructions": 3,
    "offset": 337,
    "size": 6,
    "type": "block"
  },
  {
    "average_instructions_per_block": 10,
    "blocks": 15,
    "bytes": "fc e8 8f 00 00 00 60 89 e5 31 d2 64 8b 52 30 8b 52 0c 8b 52 14 31 ff 8b 72 28 0f b7 4a 26 31 c0 ac 3c 61 7c 02 2c 20 c1 cf 0d 01 c7 49 75 ef 52 57 8b 52 10 8b 42 3c 01 d0 8b 40 78 85 c0 74 4c 01 d0 50 8b 58 20 8b 48 18 01 d3 85 c9 74 3c 49 8b 34 8b 01 d6 31 ff 31 c0 c1 cf 0d ac 01 c7 38 e0 75 f4 03 7d f8 3b 7d 24 75 e0 58 8b 58 24 01 d3 66 8b 0c 4b 8b 58 1c 01 d3 8b 04 8b 01 d0 89 44 24 24 5b 5b 61 59 5a 51 ff e0 58 5f 5a 8b 12 e9 80 ff ff ff 5d 68 33 32 00 00 68 77 73 32 5f 54 68 4c 77 26 07 89 e8 ff d0 b8 90 01 00 00 29 c4 54 50 68 29 80 6b 00 ff d5 6a 0a 68 5d b8 d8 22 68 02 00 11 5c 89 e6 50 50 50 50 40 50 40 50 68 ea 0f df e0 ff d5 97 6a 10 56 57 68 99 a5 74 61 ff d5 85 c0 74 0a ff 4e 08 75 ec e8 67 00 00 00 6a 00 6a 04 56 57 68 02 d9 c8 5f ff d5 83 f8 00 7e 36 8b 36 6a 40 68 00 10 00 00 56 6a 00 68 58 a4 53 e5 ff d5 93 53 6a 00 56 53 57 68 02 d9 c8 5f ff d5 83 f8 00 7d 28 58 68 00 40 00 00 6a 00 50 68 0b 2f 0f 30 ff d5 57 68 75 6e 4d 61 ff d5 5e 5e ff 0c 24 0f 85 70 ff ff ff e9 9b ff ff ff 01 c3 29 c6 75 c1 c3",
    "bytes_entropy": 0,
    "bytes_sha256": "ab8e5368e7965b1520f44ab6b7b66ebdf9c9d203b730e444eed758856a07cdb3",
    "instructions": 150,
    "offset": 0,
    "size": 344,
    "type": "function"
  }
]
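
For readers unfamiliar with the cyclomatic complexity metric listed in this issue, here is a small sketch using McCabe's formula M = E - N + 2P, with P = 1 for a single function. It is not binlex's implementation: `CyclomaticComplexity` is a hypothetical helper, and it assumes the control-flow graph is given as one successor list per basic block and forms a single connected component.

```cpp
#include <cstddef>
#include <vector>

// successors[i] lists the indices of the blocks that basic block i can
// branch to. Returns M = E - N + 2 for one connected function graph.
int CyclomaticComplexity(const std::vector<std::vector<std::size_t>> &successors) {
    if (successors.empty()) {
        return 0;  // no blocks, nothing to measure
    }
    std::size_t edges = 0;
    for (const auto &targets : successors) {
        edges += targets.size();
    }
    return static_cast<int>(edges) - static_cast<int>(successors.size()) + 2;
}
```

A single straight-line block yields 1, while an if/else diamond (four blocks, four edges) yields 2.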

Thresholds expressed as jq queries make hunting easy:

build/binlex -m raw:x86 -i tests/raw/raw.x86 | jq -r '.[] | select(.type == "block" and .size < 32 and .size > 0) | .bytes'
2c 20 c1 cf 0d 01 c7 49 75 ef
52 57 8b 52 10 8b 42 3c 01 d0 8b 40 78 85 c0 74 4c
01 d0 50 8b 58 20 8b 48 18 01 d3 85 c9 74 3c
49 8b 34 8b 01 d6 31 ff 31 c0 c1 cf 0d ac 01 c7 38 e0 75 f4
03 7d f8 3b 7d 24 75 e0
58 5f 5a 8b 12 e9 80 ff ff ff
ff 4e 08 75 ec
e8 67 00 00 00 6a 00 6a 04 56 57 68 02 d9 c8 5f ff d5 83 f8 00 7e 36
e9 9b ff ff ff
01 c3 29 c6 75 c1

โœ”๏ธ To achieve easier management of strings move to std::string and json instead of char *.

Trait format will change thus should be a minor version bump.

Article with research:
obf.pdf

CIL: Strange error when processing specific obfuscated .NET binary.

Description:
Strange error when processing specific obfuscated .NET binary.

To Reproduce:
Download pe.cil.2.zip

Run:

binlex -m auto -i pe.cil.2
Try to read 0x4 bytes from 0x153e00 (153e04) which is bigger than the binary's size
Try to read 0x4 bytes from 0x153e00 (153e04) which is bigger than the binary's size
Try to read 0x4 bytes from 0x153e00 (153e04) which is bigger than the binary's size

Expected Behavior:
Output traits

Affected OS/Version:
Linux/v1.1.1-rc1

Function Names

Attach function names to the queue when parsing shared libs, DLLs, etc

Function names shall be included in the json

x86 / x86_64 Recursion

Binlex has an issue with x86 code that ends abruptly; this should be handled with recursion.

Example code from emotet:

dump.bin.zip

The pe.h works great, just capstone being capstone.

QA: v1.1.1 Milestone

  • Track Feature Requests for Staging
  • Track which ones fail, which ones pass
  • feature -> qa (staging) -> milestone (v1.1.1) -> master (prod)

MongoDB Schema, Shards, Replicas, Configs and Routers & RabbitMQ Cluster & Binlex MongoDB / Messaging Queue Workers and HTTP API

To do frequency analysis on traits, we would need to track the file hashes associated with each trait.

To do this, we would need the equivalent of a stored procedure in MongoDB that runs when documents are posted and keeps a record of the hashes for each trait.

This would make the database a little more complex, but the payoff would be significant, as we would be able to search traits by sample hash and more.
