
This project forked from whyisyoung/bodmas


Code for our DLS'21 paper - BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware. BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS.

Home Page: https://whyisyoung.github.io/BODMAS/

License: BSD 2-Clause "Simplified" License



BODMAS Malware Dataset

Introduction

The BODMAS Malware Dataset is created and maintained by Blue Hexagon and UIUC.

It contains 57,293 malware and 77,142 benign Windows PE files, including binaries (disarmed malware only), feature vectors, and metadata.

Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [PDF], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).

If you end up building on this dataset as part of a project or publication, please include a reference to our paper:

@inproceedings{bodmas,
  title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
  author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
  booktitle = {4th Deep Learning and Security Workshop},
  year = {2021}
}

Download

Please visit this link for more details.

Installation

  1. Before getting started, please check your server's storage and memory. I ran most of the experiments on our lab cluster of 9 servers (see specification here). I use Fabric to distribute code to the different servers, which simplifies repetitive experiments. You can use a single server, but you will need to change some shell scripts; see the Examples section.

  2. Clone this repo to your home directory (you can use another directory, but then you need to change some scripts; see the warning in the Examples section):

    cd ~
    git clone git@github.com:whyisyoung/BODMAS.git

  3. We recommend setting up a Python 3.6.8 virtual environment (other versions of Python 3.6 or above might also work but have not been tested). Then install the dependencies and the package:

    cd BODMAS/code/
    pip install -r requirements.txt
    python setup.py install
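
The storage check in step 1 and the Python version in step 3 can be sketched as a small pre-flight script. This is a minimal sketch, not part of the repository: the function name `preflight` and the 10 GB free-space threshold are hypothetical placeholders (this README does not state a required amount), so adjust them to your setup.

```python
import shutil
import sys
from pathlib import Path


def preflight(path=None, min_free_gb=10.0, min_python=(3, 6)):
    """Return a list of problems found; an empty list means all checks passed."""
    # Default to the home directory, where the README suggests cloning the repo.
    path = Path.home() if path is None else Path(path)
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            "Python {}.{} is older than {}.{}".format(*sys.version_info[:2], *min_python)
        )
    # min_free_gb is an assumed threshold, not a documented requirement.
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free under {path}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

The sketch does not check memory, since the Python standard library has no portable call for that.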

Configuration

  1. For BODMAS, follow the guidelines of the Download section. Put bluehex_metadata.csv and bluehex.npz under BODMAS/code/multiple_data/.

  2. For Ember and UCSB-packerware, you can download the pre-processed feature vectors and metadata here (about 3.4 GB in total): Google Drive link. Note that for Ember, we combine Ember 2017 and Ember 2018 into a single dataset. Put the 4 downloaded files under BODMAS/code/multiple_data/.

  3. For SOREL-20M, you can download pre-trained LightGBM and DNN models here: https://github.com/sophos-ai/SOREL-20M. If you want to use the pre-trained SOREL-20M models, you need to specify the locations of the following folders in code/bodmas/config.py:

    'sophos_model_folder': '/home/datashare/sophos/baselines/checkpoints/lightGBM/',
    'sophos_features_folder': '/home/datashare/sophos/lightGBM-features/'
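
Before launching an experiment, it can help to confirm that the BODMAS files from step 1 are where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper names `missing_files` and `npz_array_names` are hypothetical, while the file names and the `multiple_data` directory come from the steps above. It relies only on the fact that an `.npz` file is a zip archive with one `.npy` member per array, so it lists array names without loading the data into memory.

```python
import zipfile
from pathlib import Path

# File names taken from step 1 of the Configuration section.
BODMAS_FILES = ("bluehex_metadata.csv", "bluehex.npz")


def missing_files(data_dir, names=BODMAS_FILES):
    """Return the expected file names that are absent from data_dir."""
    data_dir = Path(data_dir).expanduser()
    return [name for name in names if not (data_dir / name).is_file()]


def npz_array_names(npz_path):
    """List the array names stored in an .npz archive without loading them."""
    with zipfile.ZipFile(npz_path) as archive:
        return [Path(m).stem for m in archive.namelist() if m.endswith(".npy")]


if __name__ == "__main__":
    # Assumes the repo was cloned to the home directory, as in Installation.
    data_dir = Path("~/BODMAS/code/multiple_data").expanduser()
    print("missing:", missing_files(data_dir))
    npz = data_dir / "bluehex.npz"
    if npz.is_file():
        print("arrays:", npz_array_names(npz))
```

The same `missing_files` helper can be pointed at the Ember and UCSB files by passing their names through the `names` argument.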

Examples

Testing pre-trained models on our BODMAS dataset (Table II in our paper):

  1. Using Ember with random seed 0 as the training set (PLEASE change the hostname "angel" to your own):

    cd BODMAS/code/
    ./main_pretrain.sh

    For the other random seeds (1-4), uncomment the rest of the first code block of main_pretrain.sh and change the hostnames ("beast" "bishop" "colossus" "cyclops") to yours. If you don't have enough memory, we highly recommend running only 1 random seed at a time.

    Call graph:

    main_pretrain.sh -> fabric_pretrain.py -> run_pretrain.sh -> pretrain_model_test_on_bluehex.py

    WARNING: If this repo is not under your home directory (i.e., it does not appear as ~/BODMAS), you might need to change line 18 of fabric_pretrain.py. This also applies to fabric_multiclass.py (line 17).

  2. To use the Sophos pre-trained models, uncomment the second code block of main_pretrain.sh and change the hostname to yours. To use UCSB as the training set, uncomment the third code block of main_pretrain.sh and change the hostname accordingly. The code for Sophos-DNN is very similar and is therefore omitted here.

Incremental Retraining (Fig.1 in our paper)

  1. Before running the script, if you want to test Transcend, you need to request access to the Transcend code from Feargus Pendlebury and Lorenzo Cavallaro (https://s2lab.cs.ucl.ac.uk/). Please CC me as well. Otherwise, you can comment out the corresponding import and related code.

  2. Use the corresponding code blocks and change the hostname to yours accordingly:

    ./run_ember_drift.sh

  3. Call graph:

    run_ember_drift.sh -> concept_drift_ember.py

Training with New Data (Fig. 2 in our paper)

  1. Change the hostname to yours and run the following script. We highly recommend running the random seeds sequentially to avoid memory errors, unless you can run them on multiple servers.

    ./main_bluehex_binary.sh

  2. Call graph:

    main_bluehex_binary.sh -> bluehex_main.py
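
The advice to run each random seed sequentially can be sketched with a small driver that starts one process at a time and stops at the first failure. This is a minimal sketch under stated assumptions, not the repository's method: `run_sequentially` is a hypothetical helper, and the demo commands are placeholders, because the actual per-seed invocation lives inside the shell scripts and is not shown in this README.

```python
import subprocess
import sys


def run_sequentially(commands):
    """Run each command to completion before starting the next.

    Returns the exit codes of the commands that ran, stopping after the
    first non-zero exit code so a failed run is not silently skipped.
    """
    codes = []
    for cmd in commands:
        result = subprocess.run(cmd)
        codes.append(result.returncode)
        if result.returncode != 0:
            break
    return codes


if __name__ == "__main__":
    # Placeholder commands standing in for one experiment per random seed (0-4).
    demo = [[sys.executable, "-c", f"print('seed {seed} done')"] for seed in range(5)]
    print(run_sequentially(demo))
```

Because only one child process is alive at a time, peak memory stays at the level of a single run, at the cost of a longer total running time.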

Multi-class classification (Fig. 3, 4 in our paper)

  1. Use the corresponding code blocks and change the hostname to yours accordingly:

    ./main_bluehex_multiclass.sh

    Call graph:

    main_bluehex_multiclass.sh -> fabric_multiclass.py -> run_multiclass.sh -> bluehex_main.py

Contact

If you have any questions, please contact Limin ([email protected]).

Licensing

BSD 2-Clause "Simplified" License.
