sheisc / goshawk

This project forked from yunlongs/goshawk

Goshawk is a static analysis tool that detects memory corruption bugs in C source code. It uses NLP to infer custom memory management functions, uses data flow analysis to abstract their behaviors, and then adopts these summaries to enhance bug detection.

Home Page: https://goshawk.code-analysis.org/

C++ 90.85% Python 9.06% CMake 0.09%

goshawk's Introduction

News

  • Goshawk now supports Clang-15.0.0

Code Structure

Directories

  • data_process: The scripts for pre-processing, parsing, and normalizing the function prototypes.
  • model: The pre-trained Siamese network, which can be used directly to classify functions.
  • plugins: Clang and CSA plugins used by Goshawk.
  • plugins_src: The source code of the Clang plugins.
  • subword_dataset: The learned subword vocabulary and embeddings for function prototype segmentation, and the official MM function list.

Main Scripts

  • run.py: The entry point of Goshawk; runs each step of the analysis.
  • train.py: Trains the Siamese network.
  • cal_metric.py: Evaluates the accuracy of the trained model.
  • similarity_inference.py: Uses the trained Siamese network to generate a similarity score for each function prototype.
  • mysegment.py: The ULM-based function prototype segmentation algorithm.
  • frontend_checker.py: Validates the MM functions according to their function prototypes and data flow.

Ⅰ. Environment Setup

robin-map
python 3.7+
tensorflow = 2.1
CodeChecker
Clang v15.0.0

Download the subword embeddings to the directory subword_dataset/word_embedding.

You can install robin-map from https://github.com/Tessil/robin-map.

You can install CodeChecker from https://github.com/Ericsson/codechecker.

Because Clang-15.0.0 was still under development and had no released version, I provide the version that I used. You can download this version of Clang-15.0.0 from this link and compile Clang yourself.

Ⅱ. How to use

Ⅱ.A Record compilation commands of your target project.

Before using this tool, you need to record the compilation command used for each source file of the project. All further analysis is based on these compilation commands.

CodeChecker can record the required compilation commands. For projects built with a Makefile, wrap the make command with CodeChecker's log -b option to record the build process:

CodeChecker log -b "make CC=clang HOSTCC=clang -jN" -o compilation.json

The compilation commands are recorded in the file compilation.json.
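
Before moving on, it can help to confirm that the recorded file is a non-empty compilation database. The sketch below assumes the usual JSON compilation database layout (a JSON array whose entries carry a "file" key and either a "command" or an "arguments" key). The function name summarize_compilation_db is hypothetical and is not part of Goshawk or CodeChecker.

```python
import json
from pathlib import Path


def summarize_compilation_db(path):
    """Return (entry_count, sorted_unique_source_files) for a compilation database."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(entries, list) or not entries:
        raise ValueError(f"{path} does not contain any compilation commands")
    files = set()
    for entry in entries:
        # Each entry names one translation unit and how it was compiled.
        if "file" not in entry or not ("command" in entry or "arguments" in entry):
            raise ValueError(f"malformed entry: {entry!r}")
        files.add(entry["file"])
    return len(entries), sorted(files)
```

Calling summarize_compilation_db("compilation.json") in the project directory returns the number of recorded commands and the source files they cover. An empty result usually means the build was already up to date when it was logged, so a clean rebuild may be needed.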

Ⅱ.B Run the full phases of Goshawk to analyze a target project.

Note: For a large project such as the Linux kernel, make sure that at least 300 GB of free disk space is available.

Currently, a single command is enough to analyze a project with Goshawk:

python3 run.py target_project_path

Make sure that a compilation.json file for your project exists under target_project_path.
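
The requirement above can be checked before a long analysis starts. The following is a minimal sketch, not part of Goshawk: the helper name build_goshawk_command is hypothetical, and it only verifies that compilation.json is present and assembles the same command line shown above.

```python
import sys
from pathlib import Path


def build_goshawk_command(target_project_path):
    """Return the argv list for run.py after checking compilation.json exists."""
    target = Path(target_project_path)
    database = target / "compilation.json"
    if not database.is_file():
        raise FileNotFoundError(f"expected {database}; record the build first")
    # Mirrors the documented invocation: python3 run.py target_project_path
    return [sys.executable, "run.py", str(target)]
```

The returned list can be passed to subprocess.run(command, check=True) from the directory that contains run.py.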

The MM functions and their corresponding memory operation synopses (MOSs) are written to output/alloc and output/free.

The bug detection results are written to output/report_html/index.html.

Ⅲ. Some useful components in Goshawk

Ⅲ.A Function Prototype Segmentation

The function normalize_prototype_file(in_file, out_file) in normalize.py can be used to segment function prototypes.

It segments and normalizes the function prototypes in in_file and saves the results to out_file.

For example,
    before: void * kmalloc_array(size_t n, size_t size, gfp_t flags)
    after:  <cls> <ptr> kmalloc array ( <noptr> n <dot> <noptr> size <dot> <noptr> flags )
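
To make the target format concrete, here is a toy sketch that reproduces the before/after pair above. It is not Goshawk's algorithm: it splits the function name on underscores instead of using the ULM-based segmentation in mysegment.py, and it assumes (from the example only) that <ptr> and <noptr> mark pointer and non-pointer types and that <dot> separates parameters. The name toy_normalize is hypothetical.

```python
import re


def toy_normalize(prototype):
    """Toy stand-in for the normalized form; splits names on '_' instead of ULM."""
    match = re.match(r"\s*(.*?)(\w+)\s*\((.*)\)\s*$", prototype)
    if match is None:
        raise ValueError(f"not a function prototype: {prototype!r}")
    return_type, name, params = match.groups()

    def pointer_tag(type_text):
        # Assumption: <ptr> marks a pointer type, <noptr> any other type.
        return "<ptr>" if "*" in type_text else "<noptr>"

    tokens = ["<cls>", pointer_tag(return_type)]
    tokens += [piece for piece in name.lower().split("_") if piece]
    tokens.append("(")
    param_list = [p.strip() for p in params.split(",") if p.strip()]
    for index, param in enumerate(param_list):
        if index > 0:
            tokens.append("<dot>")  # separator between parameters
        param_match = re.match(r"(.*?)(\w+)$", param)
        if param_match is None:
            raise ValueError(f"unsupported parameter: {param!r}")
        param_type, param_name = param_match.groups()
        tokens += [pointer_tag(param_type), param_name]
    tokens.append(")")
    return " ".join(tokens)


print(toy_normalize("void * kmalloc_array(size_t n, size_t size, gfp_t flags)"))
# prints: <cls> <ptr> kmalloc array ( <noptr> n <dot> <noptr> size <dot> <noptr> flags )
```

Names without underscores (for example, kmallocarray) would stay as one token here, which is exactly the case a learned subword vocabulary is meant to handle.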

Ⅲ.B Re-train the Siamese network for your own target function identification task (e.g., MM functions, crypto functions, ...).

1. Prepare your training function prototype dataset.

Taking crypto functions as an example, the dataset should contain the prototypes of the crypto functions you have collected, with one function prototype per line.

crypto.txt
-------------
int crypto_aead_encrypt(struct aead_request *req)
int crypto_aead_decrypt(struct aead_request *req)
static int crypto_aegis128_encrypt_generic(struct aead_request *req)
static int crypto_aegis128_decrypt_simd(struct aead_request *req)
void crypto_aegis128_encrypt_chunk_neon(void *state, void *dst, const void *src,unsigned int size)
void crypto_aegis128_decrypt_chunk_neon(void *state, void *dst, const void *src,unsigned int size)
static int crypto_authenc_esn_encrypt(struct aead_request *req)
static int crypto_authenc_esn_decrypt(struct aead_request *req)
...
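
A collected prototype list often contains blank lines, duplicates, and uneven spacing. The sketch below shows one way to tidy such a list into the one-prototype-per-line form described above. The helper clean_prototype_lines is hypothetical, and whether Re-train.py needs any further pre-processing is not stated here.

```python
def clean_prototype_lines(lines):
    """Strip whitespace, drop blank lines, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        prototype = " ".join(line.split())  # collapse runs of whitespace
        if not prototype or prototype in seen:
            continue
        seen.add(prototype)
        cleaned.append(prototype)
    return cleaned
```

The cleaned list can then be written back with "\n".join(cleaned) to produce a file such as crypto.txt.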

2. Train the Siamese network.

Re-training of the Siamese network is implemented in the script Re-train.py. It takes two arguments:

  • the training corpus
  • the name of your trained model

For example:

python Re-train.py crypto.txt crypto

After training finishes, your trained model, named "crypto", is saved in the directory "model/crypto".

3. Infer similarities.

The trained models are saved in the directory "model"; you can use them directly to infer similarity for other functions.

This functionality is implemented in the script similarity_inference.py.

You can call the function similarity_inference(model_name, filename) to infer similarity for the functions whose prototypes are saved in the file given by the argument filename.

Here, model_name should be the name of a model saved in the directory model.

For example, suppose a file named test.func contains the following functions:

test.func
---------
void * mem_malloc(unsigned long size)
void mem_free(void *ptr)
void CAST_set_key(CAST_KEY *key, int len, const unsigned char *data)

We can call the function similarity_inference to infer similarities for them.

from similarity_inference import similarity_inference
similarity_inference("alloc", "test.func")  # Infer the similarity for allocation functions.

The results are saved in "temp/func_name_similarity":

temp/func_name_similarity
----
mem_malloc 0.938829920833657
mem_free -0.9019584597976495
cast_set_key -0.9085114460471964
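
The result file shown above has one "name score" pair per line, so it is easy to post-process. The sketch below parses such lines and keeps the functions whose score reaches a cutoff. Both helper names and the default threshold of 0.5 are assumptions for illustration; the cutoff Goshawk itself applies is not stated here.

```python
def parse_similarity_lines(lines):
    """Parse 'name score' lines into a dict, skipping lines of any other shape."""
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        name, score = parts
        scores[name] = float(score)
    return scores


def select_candidates(scores, threshold=0.5):
    """Return the names whose similarity score is at least the threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)


example = [
    "mem_malloc 0.938829920833657",
    "mem_free -0.9019584597976495",
    "cast_set_key -0.9085114460471964",
]
print(select_candidates(parse_similarity_lines(example)))  # prints: ['mem_malloc']
```

With the "alloc" model, only mem_malloc scores above the cutoff in this example, which matches the intuition that mem_free and CAST_set_key are not allocation functions.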

Citation

We release Goshawk source code in the hope of benefiting others. If you find this project useful, please consider citing:

@INPROCEEDINGS {Goshawk,
    author = {Y. Lyu and Y. Fang and Y. Zhang and Q. Sun and S. Ma and E. Bertino and K. Lu and J. Li},
    booktitle = {2022 IEEE Symposium on Security and Privacy (SP)},
    title = {Goshawk: Hunting Memory Corruptions via Structure-Aware and Object-Centric Memory Operation Synopsis},
    year = {2022},
    issn = {2375-1207},
    pages = {1566-1566},
    doi = {10.1109/SP46214.2022.00137},
    url = {https://doi.ieeecomputersociety.org/10.1109/SP46214.2022.00137},
    publisher = {IEEE Computer Society},
    address = {Los Alamitos, CA, USA},
    month = {may}
}

If your research work is inspired by or benefits from the NLP based function similarity inference module in Goshawk, please also consider citing:

@INPROCEEDINGS{SparrowHawk,
    author = {Lyu, Yunlong and Gao, Wang and Ma, Siqi and Sun, Qibin and Li, Juanru},
    title = {SparrowHawk: Memory Safety Flaw Detection via Data-Driven Source Code Annotation},
    year = {2021},
    isbn = {978-3-030-88322-5},
    publisher = {Springer-Verlag},
    address = {Berlin, Heidelberg},
    url = {https://doi.org/10.1007/978-3-030-88323-2_7},
    doi = {10.1007/978-3-030-88323-2_7},
    booktitle = {Information Security and Cryptology: 17th International Conference, Inscrypt 2021, Virtual Event, August 12–14, 2021, Revised Selected Papers},
    pages = {129–148},
    numpages = {20},
}
