Coder Social home page Coder Social logo

jtrans's Introduction

jTrans

This repo is the official code of jTrans: Jump-Aware Transformer for Binary Code Similarity Detection.

Illustrating the performance of the proposed jTrans

News

  • [2023/3/2] Update an writeup on using jTrans for binary diffing in HackerGame2022.
  • [2022/7/7] We update BinaryCorp with the original binaries.
  • [2022/6/18] We release the code and models of jTrans.
  • [2022/6/9] We release the preprocessing code and BinaryCorp, the dataset we used in our paper.
  • [2022/5/26] jTrans is now on ArXiv.

Writeups

Get Started

Prerequisites

  • Linux (MacOS and Windows are not currently officially supported)
  • Python 3.8+
  • PyTorch 1.10+
  • CUDA 10.2+
  • IDA pro 7.5+ (only used for dataset processing)

Quick Start

a. Create a conda virtual environment and activate it.

conda create -n jtrans python=3.8 pandas tqdm -y
conda activate jtrans

b. Install PyTorch and other packages.

conda install pytorch cudatoolkit=11.0 -c pytorch
python -m pip install simpletransformers networkx pyelftools

c. Get code and models of jTrans.

git clone https://github.com/vul337/jTrans.git && cd jTrans

Download experiments.tar.gz and models.tar.gz and extract them.

tar -xzvf experiments.tar.gz && tar -xzvf models.tar.gz

d. Get the BinaryCorp dataset Download the processed dataset from this link

e. Finetune new models on the BinaryCorp

python finetune.py -h

d. Evaluation

python eval_save.py -h
python fasteval.py -h

try to evaluate jTrans on BinaryCorp-3M after extracting experiments.tar.gz

python fasteval.py

f. Try jTrans on your own binaries

Make sure you have IDA pro 7.5+ and following the instructions at datautils. After extracting features of your binaries, you can try jTrans on them such as the usage at eval_save.py.

Dataset

  • We present a new large-scale and diversified dataset, BinaryCorp, for the task of binary code similarity detection.
  • The description of the dataset can be found at here and we give an example for using BinaryCorp.
  • If you need to use features that we do not provide in advance, such as call graphs, you can download the raw binaries from here.

Acknowledgement

This project is not possible without multiple great open-sourced code bases. We list some notable examples below.

Bibtex

If this work or BinaryCorp dataset are helpful for your research, please consider citing the following BibTeX entry.

@inproceedings{10.1145/3533767.3534367,
author = {Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao},
title = {JTrans: Jump-Aware Transformer for Binary Code Similarity Detection},
year = {2022},
isbn = {9781450393799},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3533767.3534367},
doi = {10.1145/3533767.3534367},
abstract = {Binary code similarity detection (BCSD) has important applications in various fields such as vulnerabilities detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFG) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly-designed pre-training task. Additionally, we release to the community a newly-created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5% (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.},
booktitle = {Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis},
pages = {1โ€“13},
numpages = {13},
keywords = {Binary Analysis, Similarity Detection, Neural Networks, Datasets},
location = {Virtual, South Korea},
series = {ISSTA 2022}
}

@article{wang2022jtrans,
  title={jTrans: Jump-Aware Transformer for Binary Code Similarity},
  author={Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao},
  journal={arXiv preprint arXiv:2205.12713},
  year={2022}
}

jtrans's People

Contributors

hustcw avatar learner0x5a avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.