br0kej / bin2ml

A command line tool for extracting machine learning ready data from software binaries powered by Radare2

License: MIT License

bin2ml's Introduction

bin2ml

bin2ml is a command line tool to extract machine learning ready data from software binaries. It is aimed at researchers and hackers who want to easily extract data suitable for training machine learning models, such as natural language processing (NLP) models or Graph Neural Networks (GNNs), from software binaries.

  • Extract a range of different data from binaries, such as Attributed Control Flow Graphs, basic block random walks and function instruction strings, powered by Radare2.
  • Multithreaded data processing throughout, powered by Rayon.
  • Save processed data in ready-to-use formats, such as graphs saved as NetworkX compatible JSON objects.
  • Experimental support for creating machine learning embedded basic block CFGs using tch-rs and TorchScript traced models.
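
As a rough illustration of consuming the graph output, the sketch below parses a small JSON document in NetworkX's node-link layout using only the Python standard library. The sample document, the num_ins attribute, and the assumption that bin2ml emits node-link rather than adjacency JSON are all hypothetical; inspect a real output file before relying on it.

```python
import json

# Hypothetical sample in NetworkX node-link layout; real bin2ml output
# may use different keys and node attributes.
SAMPLE = """
{
  "directed": true,
  "multigraph": false,
  "graph": {},
  "nodes": [{"id": 0, "num_ins": 4}, {"id": 1, "num_ins": 2}, {"id": 2, "num_ins": 7}],
  "links": [{"source": 0, "target": 1}, {"source": 0, "target": 2}, {"source": 1, "target": 2}]
}
"""


def load_node_link(text):
    """Return (node_attrs, successors) from a node-link JSON string."""
    data = json.loads(text)
    # Map each node id to its remaining attributes.
    node_attrs = {
        node["id"]: {k: v for k, v in node.items() if k != "id"}
        for node in data["nodes"]
    }
    # Build a directed adjacency list keyed by source node id.
    successors = {node_id: [] for node_id in node_attrs}
    for edge in data["links"]:
        successors[edge["source"]].append(edge["target"])
    return node_attrs, successors


node_attrs, successors = load_node_link(SAMPLE)
edge_count = sum(len(targets) for targets in successors.values())
print(len(node_attrs), "nodes;", edge_count, "edges")
```

With NetworkX installed, networkx.readwrite.json_graph.node_link_graph applied to the parsed dictionary should build a graph object from the same layout.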

bin2ml is under active development and is in an alpha state. Things will change as the tool is developed and built upon further.

Pre-Requisites

  • Radare2 installed. Installation instructions are available in the radare2 repository (https://github.com/radareorg/radare2).

Quickstart

git clone https://github.com/br0kej/bin2ml
cd bin2ml
cargo build --release

Alternatively, two Dockerfiles are provided. Dockerfile.build builds the bin2ml binary without requiring cargo on your workstation, while Dockerfile builds bin2ml and also installs radare2 so that processing can be done inside the container.

Docs

bin2ml comes with some (incomplete) documentation, built with mdbook. To serve it locally, install the mdbook release for your platform and then run the commands below:

cd bin2ml/docs
mdbook serve

Alternatively, the raw documentation can be read in the docs folder of the repository.

License

The bin2ml source and documentation are released under the MIT license.

Citation

@misc{collyer2023bin2ml,
  author = {Josh Collyer},
  title = {bin2ml},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/br0kej/bin2ml/}},
}

bin2ml's Issues

Error "invalid value: integer" when running inside a podman/docker container

Modified Cargo.toml

[dependencies]
r2pipe = "0.7.0"

Modified Dockerfile (built with podman build . -t bin2ml)

FROM rust as builder

WORKDIR /opt/bin2ml

RUN env USER=root cargo init .

COPY Cargo.toml .
COPY src /opt/bin2ml/src

RUN cd /opt/bin2ml && \
    cargo install --locked --path . && \
    rm -rf /opt/bin2ml && \
    rm -rf /usr/local/cargo/registry

FROM rust

COPY --from=builder /usr/local/cargo/bin/bin2ml /usr/local/cargo/bin/bin2ml

RUN git clone https://github.com/radareorg/radare2 radare2
RUN cd radare2 ; sys/install.sh

CMD bin2ml --version

Executed from host

podman run --rm -it -v `pwd`/test-files:/data:z localhost/bin2ml bash
root@5b7b60c1c6a0:/data# RUST_BACKTRACE=1 bin2ml extract --fpath /bin/true --output-dir out --mode reg
[2023-12-06T03:25:11Z INFO  bin2ml] Creating extraction job
[2023-12-06T03:25:11Z INFO  bin2ml] Single file found
[2023-12-06T03:25:11Z INFO  bin2ml] Extraction Job Type: Register Behaviour
[2023-12-06T03:25:11Z INFO  bin2ml::extract] Starting register behaviour extraction
[2023-12-06T03:25:11Z INFO  bin2ml::extract] Getting function information from binary
[2023-12-06T03:25:11Z INFO  bin2ml::extract] Executing aeafj for each function
thread 'main' panicked at src/extract.rs:311:45:
Unable to convert to JSON object!: Error("invalid value: integer `18446744073709551615`, expected i64", line: 1, column: 2449)
stack backtrace:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aborted (core dumped)
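
The integer in the panic message is 2^64 - 1, the largest unsigned 64-bit value, which is also how -1 appears when its 64-bit two's-complement bit pattern is read as unsigned. The Python sketch below only checks that arithmetic. It is not bin2ml code, and the log does not show which radare2 field produced the value.

```python
import struct

REPORTED = 18446744073709551615  # value copied from the panic message
I64_MIN = -(2**63)
I64_MAX = 2**63 - 1
U64_MAX = 2**64 - 1

# Does the reported value fit the signed 64-bit range the parser expected?
fits_in_i64 = I64_MIN <= REPORTED <= I64_MAX
is_u64_max = REPORTED == U64_MAX

# Reinterpret the same 8 bytes as a signed 64-bit integer.
as_signed = struct.unpack("<q", struct.pack("<Q", REPORTED))[0]

print("fits in i64:", fits_in_i64)
print("equals u64::MAX:", is_u64_max)
print("bit pattern read as i64:", as_signed)
```

A field that can carry this value would need an unsigned 64-bit (or wider) type on the deserializing side.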

Feature Request: r2ghidra decompiler

Is there a way to add a decompiling feature to the tool that leverages a radare2 plugin to, say, decompile functions? In addition to the ESIL representation (normalized and non-normalized), I would like to have the decompiled function code alongside.

Right now, I am using pandas to read the ESIL JSON files of functions and testing various models to see how well they perform similarity searches (e.g. cosine similarity) between various versions of the same function. I would also like to be able to import each function's decompiled pseudocode into pandas alongside the ESIL in order to perform similar semantic evaluations.
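
The kind of comparison described here can be sketched without any third-party packages. The example treats each function as a bag of ESIL tokens and computes cosine similarity between the token-count vectors. The two ESIL strings and the comma/whitespace tokenization are made up for illustration and do not reflect bin2ml's actual output schema.

```python
import math
import re
from collections import Counter


def tokenize(esil):
    """Split an ESIL-like string on commas and whitespace (illustrative only)."""
    return [tok for tok in re.split(r"[,\s]+", esil) if tok]


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[tok] * counts_b[tok] for tok in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty function string has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical ESIL strings for two versions of the same function.
func_v1 = "rbp,8,rsp,-=,rsp,=[8] rsp,rbp,="
func_v2 = "rbp,8,rsp,-=,rsp,=[8] rsp,rbp,= 0,eax,="

print(round(cosine_similarity(func_v1, func_v1), 3))
print(round(cosine_similarity(func_v1, func_v2), 3))
```

Token counts are a crude baseline; learned embeddings would replace the Counter vectors while keeping the same similarity step.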

Not sure if this is within the intended scope of this tool. Great tool either way, and very useful in many ways.
Reference: r2ghidra

Question: How do I toggle normalization?

I see that I can toggle normalization, but turning it on or off does not change the output. Is this working correctly, or am I doing something wrong?

# with reg-norm
root@ae6a62491516:/app# bin2ml generate nlp --path file_cfg.json --instruction-type esil --data-out-path .  --output-format funcstring --pairs --reg-norm
root@ae6a62491516:/app# md5sum file_cfg-efs.json 
e4159bbfe995ef55d873e6c9552acc20  file_cfg-efs.json

# without reg-norm get same file
root@ae6a62491516:/app# bin2ml generate nlp --path file_cfg.json --instruction-type esil --data-out-path .  --output-format funcstring --pairs 
root@ae6a62491516:/app# md5sum file_cfg-efs.json 
e4159bbfe995ef55d873e6c9552acc20  file_cfg-efs.json

What I am trying to do is create a non-normalized as well as a normalized output.
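
One way to rule out a stale or overwritten file when comparing the two runs is to copy each output aside under its own name and then compare digests. The sketch below shows only the comparison step. The file names and contents are placeholders, and it says nothing about what --reg-norm itself changes.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_differ(path_a, path_b):
    """True when the two files have different contents."""
    return md5_of(path_a) != md5_of(path_b)


# Placeholder files standing in for copies of the two generated outputs.
with tempfile.TemporaryDirectory() as tmp:
    with_norm = Path(tmp) / "with-reg-norm.json"
    without_norm = Path(tmp) / "without-reg-norm.json"
    with_norm.write_text('["a"]')
    without_norm.write_text('["b"]')
    print("outputs differ:", outputs_differ(with_norm, without_norm))
```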

Formatting error when generating esil

This was a "dumb user" error: I passed the original binary rather than an extracted file. Perhaps an error message could let folks know that the input is not the valid .json file needed.

bin2ml generate nlp --path <file.so> --instruction-type esil --data-out-path . --output-format <single || funcstring>
thread 'main' panicked at src/files.rs:61:51:
Unable to read file: Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I also get an error if I pass an invalid JSON file.

bin2ml generate nlp --path file.so_reg.json --instruction-type esil --data-out-path . --output-format single
thread 'main' panicked at src/files.rs:295:18:
Unable to load and desearlize JSON: ()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

bin2ml generate nlp --path file.so_reg.json --instruction-type esil --data-out-path . --output-format funcstring
thread 'main' panicked at src/files.rs:208:18:
Unable to load and desearilize JSON: ()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
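
Until the tool reports a clearer message, a small user-side check can catch both cases before calling bin2ml generate. The sketch below is a hypothetical helper, not part of bin2ml: it reports whether a path holds UTF-8 text that parses as JSON. It does not verify that the JSON has the structure bin2ml expects.

```python
import json
import tempfile
from pathlib import Path


def check_json_input(path):
    """Return None if the file parses as JSON, else a short error message."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return f"{path}: not UTF-8 text; pass the extracted .json file, not the binary"
    except OSError as err:
        return f"{path}: cannot read file ({err})"
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{path}: invalid JSON at line {err.lineno}, column {err.colno}"
    return None


# Demo with placeholder files: raw bytes standing in for a binary, and valid JSON.
with tempfile.TemporaryDirectory() as tmp:
    binary = Path(tmp) / "file.so"
    binary.write_bytes(b"\x7fELF\xff\xfe\x00")
    good = Path(tmp) / "file_cfg.json"
    good.write_text("{}", encoding="utf-8")
    for candidate in (binary, good):
        print(check_json_input(candidate) or f"{candidate.name}: looks like JSON")
```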
