DexBERT

Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode

Environment

  • Java 11.0.11
  • Python 3.7.11
  • numpy 1.19.5
  • torch 1.7.1
  • torchvision 0.8.2
  • ptflops 0.6.8
  • tensorflow 2.6.0
  • tensorboard 2.7.0
  • scikit-learn 1.0.2
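
A quick way to confirm that an environment matches the versions listed above is to compare them against what is installed. The sketch below is not part of the DexBERT repository: the `PINNED` dictionary simply restates the Python packages from the list above, `installed_version` and `find_mismatches` are hypothetical helper names, and the lookup assumes the packages were installed with pip under these distribution names.

```python
# Pinned Python package versions, copied from the Environment list above.
PINNED = {
    "numpy": "1.19.5",
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "ptflops": "0.6.8",
    "tensorflow": "2.6.0",
    "tensorboard": "2.7.0",
    "scikit-learn": "1.0.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        # Available on Python 3.8+.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.7 (requires setuptools).
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # torch builds often carry a local suffix such as "+cu110"; ignore it.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or at a different version, and prints nothing when the environment matches.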

Usage

Instruction

  • If you only want to use a pre-trained DexBERT to generate class features for your own Android analysis tasks, skip the instructions below and go directly to the folder './user_package', where you can run our model without any prior knowledge.

  • If you want to replicate our experiments, follow the steps below to pre-train a DexBERT model and apply it to malicious code localization, app defect detection, and component type classification.

  • Please find some smali examples in the folder './Data/examples'.

DexBERT Pre-training

  • Data preparation:

    • First, find the APK hash list at: Data/data/pretraining_apks.txt
    • Second, download and process the APKs: python data4pretraining.py -d apk_dir -l apk_hash_list -cp cpu_number
  • Start pre-training:

    • sh pretrainDexBERT.sh
  • Infer a pre-trained model:

    • python InferBERT.py --model_cfg config_file_path --data_file pre-processed_data_file --model_file pre-trained_model_file --vocab vocabulary_path
  • You can skip the pre-training stage by downloading our pre-trained DexBERT model from this link: https://drive.google.com/file/d/1z6aZQXT1dS6wX1JgPnWJVS_e6Td2sBPg/view?usp=sharing
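
When the inference step is run repeatedly, for example over several pre-processed data files, it can help to assemble the command programmatically. The sketch below only wraps the `InferBERT.py` flags shown above; the function names `build_infer_command` and `run_inference` are hypothetical and not part of the repository, and the script is assumed to be run from the directory that contains `InferBERT.py`.

```python
import subprocess
import sys


def build_infer_command(model_cfg, data_file, model_file, vocab):
    """Assemble the InferBERT.py invocation using the flags documented above."""
    return [
        sys.executable, "InferBERT.py",
        "--model_cfg", model_cfg,
        "--data_file", data_file,
        "--model_file", model_file,
        "--vocab", vocab,
    ]


def run_inference(model_cfg, data_file, model_file, vocab):
    """Run InferBERT.py and raise CalledProcessError if it exits with an error."""
    cmd = build_infer_command(model_cfg, data_file, model_file, vocab)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.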

Malicious Code Localization

App Defect Detection

  • Data preparation:

  • Training & Evaluation:

    • python AppDefectDetection.py

Component Type Classification

  • Data preparation:
    • cd Data && python data4component.py
  • Training & Evaluation:
    • cd Models && python ComponentTypeClassification_FirstState768.py

Compute Model FLOPs

  • python count_flops.py

Notes:

  • Embedding Size
    • To find a reasonable trade-off between computation cost and performance, we conducted an ablation study on the impact of the DexBERT embedding size across the three downstream tasks. The experiments cover three sizes for the hidden embedding of the AutoEncoder (AE): 256, 128, and 64. We also evaluated performance when directly using the first state vector of the raw DexBERT embedding, which has a size of 768, without any dimension reduction by the AutoEncoder.
    • The results show that for Malicious Code Localization, decreasing the vector size does not lead to a significant loss in performance until the size is reduced to 128. For Defect Detection and Component Type Classification, a larger embedding size yields a considerable improvement in performance. Even so, a size of 128 offers a solid trade-off for these two tasks, with a metric score exceeding 0.9.
  • AutoEncoder Module: We considered two potential inputs for the AutoEncoder: the full DexBERT embedding (512x768) and the first state vector of the embedding (size 768). In our observations, both inputs yielded similar performance, but the first state vector was more efficient and led to faster convergence during fine-tuning for downstream tasks. We therefore use the first state vector as the default input for the AutoEncoder.
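
To give a rough sense of why the first state vector is the cheaper input, the sketch below counts the parameters of the first encoder layer for both candidate inputs. It assumes, purely for illustration, a single fully connected layer that maps the input directly to a 128-dimensional hidden embedding; the actual AutoEncoder architecture is not described here and may differ. The names `linear_params`, `SEQ_LEN`, `HIDDEN`, and `AE_SIZE` are hypothetical.

```python
SEQ_LEN = 512   # number of state vectors in a full DexBERT embedding (512x768)
HIDDEN = 768    # size of each state vector
AE_SIZE = 128   # one of the hidden embedding sizes studied (256, 128, 64)


def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Option 1: flatten the full 512x768 embedding before encoding (assumed layout).
full_input = SEQ_LEN * HIDDEN
full_params = linear_params(full_input, AE_SIZE)

# Option 2: encode only the first state vector.
first_params = linear_params(HIDDEN, AE_SIZE)

print(f"full embedding input: {full_input} values, {full_params:,} parameters")
print(f"first state vector:   {HIDDEN} values, {first_params:,} parameters")
```

Under this assumption the first layer is roughly 500 times smaller for the first-state-vector input. That is consistent with the faster convergence reported above, although it does not by itself explain it.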

Citation

If you find our work useful, please consider citing it.

@ARTICLE{10237047,
  author={Sun, Tiezhu and Allix, Kevin and Kim, Kisub and Zhou, Xin and Kim, Dongsun and Lo, David and Bissyandé, Tegawendé F. and Klein, Jacques},
  journal={IEEE Transactions on Software Engineering}, 
  title={DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode}, 
  year={2023},
  volume={},
  number={},
  pages={1-16},
  doi={10.1109/TSE.2023.3310874}}
