kipple-data

This repository houses the data associated with the kipple project. It has two primary folders:

  • data, which contains zipped files of memmap'd feature arrays for adversarial malware, and
  • records, which contains, for each dat file, a list of the associated MD5/SHA-256 values (more below).

Note that each dat file in data has an associated txt file in records, which lists the MD5/SHA-256 values of the samples encoded in the array.
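
The one-to-one pairing between dat and txt files can be checked with a short script. This is a minimal sketch, assuming the two folders are named data and records as described above and that the arrays have already been unzipped to .dat files. The function name find_unpaired is hypothetical and not part of kipple.

```python
from pathlib import Path


def find_unpaired(data_dir, records_dir):
    """Return (dat_without_txt, txt_without_dat) as sorted lists of store names."""
    # A store name is the file name without its extension, e.g. "msf_normal".
    dat_names = {p.stem for p in Path(data_dir).glob("*.dat")}
    txt_names = {p.stem for p in Path(records_dir).glob("*.txt")}
    return sorted(dat_names - txt_names), sorted(txt_names - dat_names)


if __name__ == "__main__":
    missing_txt, missing_dat = find_unpaired("data", "records")
    print("dat files without a records file:", missing_txt)
    print("records files without a dat file:", missing_dat)
```

Both lists should be empty for a complete, unzipped checkout.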

In total, there are 13 data stores, matching the following table:

| Name | Description | Count |
|---|---|---|
| msf_normal | Randomly generated implants from msfvenom, no added-code parameter | 5884 |
| msf_sorel | Randomly generated implants from msfvenom, added-code from the SoReL dataset | 33633 |
| msf_vs | Randomly generated implants from msfvenom, added-code from VirusShare | 7614 |
| sorel_malware_rl | Adversarial malware generated using Malware RL over the SoReL dataset | 37553 |
| sorel_sml_gamma | Adversarial malware generated using the GAMMA attack from SecML Malware on the SoReL dataset | 5167 |
| sorel_small_pad | Adversarial malware generated using the padding attack with a small pad from SecML Malware on the SoReL dataset | 225 |
| sorel_large_pad | Adversarial malware generated using the padding attack with a large pad from SecML Malware on the SoReL dataset | 277 |
| sorel_header_ev | Adversarial malware generated using the DOS Header attack from SecML Malware on the SoReL dataset | 2590 |
| vs_malware_rl | Adversarial malware generated using Malware RL over malware from VirusShare | 24581 |
| vs_sml_gamma | Adversarial malware generated using the GAMMA attack from SecML Malware on malware from VirusShare | 5629 |
| vs_small_pad | Adversarial malware generated using the padding attack with a small pad from SecML Malware on malware from VirusShare | 2347 |
| vs_large_pad | Adversarial malware generated using the padding attack with a large pad from SecML Malware on malware from VirusShare | 2815 |
| vs_header_ev | Adversarial malware generated using the DOS Header attack from SecML Malware on malware from VirusShare | 2814 |
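
The counts can be cross-checked against the files on disk. The sketch below assumes, as the usage code in this README does, that each .dat file is a flat float32 array with one row per line of the matching records file. The feature dimension ndim is passed in by the caller (read it from the EMBER extractor rather than hard-coding it), and the function names are hypothetical.

```python
import os


def rows_in_dat(dat_path, ndim, itemsize=4):
    """Infer the number of rows in a flat float32 array file from its size."""
    row_bytes = ndim * itemsize  # float32 uses 4 bytes per value
    size = os.path.getsize(dat_path)
    if size % row_bytes != 0:
        raise ValueError(f"{dat_path}: size {size} is not a multiple of {row_bytes}")
    return size // row_bytes


def count_records(txt_path):
    """Count the lines (one hash per line is assumed) in a records file."""
    with open(txt_path) as f:
        return sum(1 for _ in f)


def counts_match(dat_path, txt_path, ndim):
    """True when the array row count equals the number of recorded hashes."""
    return rows_in_dat(dat_path, ndim) == count_records(txt_path)
```

A mismatch usually means the wrong ndim was supplied or the dat file is still zipped or truncated.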

Pre-requisites

This data is zipped. The main kipple repo assumes you have unzipped it, so we strongly recommend unzipping right after you download the repo. The data is zipped only to stay within file size requirements.
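
One way to unzip everything in a single pass is sketched below. It assumes the archives carry a .zip extension and sit directly in the data folder, which may not match the actual layout; unzip_all is a hypothetical helper, not something shipped with kipple.

```python
import zipfile
from pathlib import Path


def unzip_all(data_dir):
    """Extract every .zip archive in data_dir next to itself; return extracted names."""
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
            extracted.extend(zf.namelist())
    return extracted


if __name__ == "__main__":
    for name in unzip_all("data"):
        print("extracted", name)
```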

File Hashes

The records directory contains files listing the file hashes associated with each data array. Due to the different data sources, and some small code hiccups, there are some nuances in the naming convention:

  • All hashes under the "msf" category are the MD5 file hashes of the implant generated by msfvenom.
  • All hashes under the "vs" category are the MD5 file hashes of the original malware downloaded from VirusShare.
    • In some cases, multiple variants of the same original sample were created; in these cases, every variant after the first has a "-ABC-<N>.exe" suffix, where <N> is the variant number.
    • In some cases, a sha256 value may be used in place of an MD5.
  • All hashes under the "sorel" category are the hashes of the original malware.
    • SoReL modifies the malware binaries to be non-executable, giving them a different hash than the "active"/original malware.
    • The sha256 values correspond to the original version.
  • There may be some names solely consisting of "-".
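
Given these nuances, it can help to classify each record line before using it as a lookup key. The sketch below is one possible approach, not the project's own parser: it assumes one entry per line, treats a leading run of 32 hex characters as an MD5 and 64 as a SHA-256, and flags anything after the digest (such as a variant suffix) without trying to interpret it. The name classify_entry is hypothetical.

```python
import re

_LEADING_HEX = re.compile(r"^[0-9a-fA-F]+")


def classify_entry(entry):
    """Return (kind, has_suffix) for one line of a records file.

    kind is "md5", "sha256", "placeholder", or "unknown".
    """
    entry = entry.strip()
    # Names made up only of "-" carry no hash at all.
    if entry and set(entry) == {"-"}:
        return "placeholder", False
    match = _LEADING_HEX.match(entry)
    digest = match.group(0) if match else ""
    kind = {32: "md5", 64: "sha256"}.get(len(digest), "unknown")
    # Anything after the digest (for example a variant suffix) counts as a suffix.
    return kind, len(entry) > len(digest)
```

Counting the kinds per records file gives a quick view of how many entries are MD5, SHA-256, or placeholders.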

Why this format?

The memmap'd storage format probably isn't ideal; it would be better to store and share the malware as feature sets, similar to how EMBER stores its data. However, to save time during testing, we appended all newly generated malware samples to the existing memmap'd set, which let us run quicker tests. Hopefully at some point in the future I'll go through and revise the storage format.
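
To illustrate why appending is cheap with this layout, here is a minimal sketch of adding one sample to a store. It is not kipple's actual code: append_sample is a hypothetical helper, and it uses the standard library array module to write raw native-byte-order 32-bit floats, which is the flat layout that np.memmap with dtype=np.float32 reads on the same machine. The caller is responsible for passing exactly ndim feature values.

```python
from array import array


def append_sample(dat_path, txt_path, features, file_hash):
    """Append one feature row to the array file and its hash to the records file."""
    row = array("f", features)  # "f" is a C float: 4 bytes, native byte order
    with open(dat_path, "ab") as dat:
        row.tofile(dat)
    with open(txt_path, "a") as txt:
        txt.write(file_hash + "\n")
    return len(row)
```

Because a new row is just extra bytes at the end of the file, nothing already stored has to be rewritten; the records file grows by one line to keep the two in step.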

Usage

Assuming you've already unzipped the data, the following code is an example of running a classifier over the kipple data:

import gzip
import os

import lightgbm as lgb
import numpy as np
from ember.features import PEFeatureExtractor

# Load the EMBER feature extractor to get the number of feature dimensions
extractor = PEFeatureExtractor(feature_version=2, print_feature_warning=False)
ndim = extractor.dim

# Load the data array we want to use; each line of the records file is one row
target_data = "msf_normal"
with open(os.path.join("records", target_data + ".txt")) as f:
    num_entries = sum(1 for _ in f)
malware_data = np.memmap(os.path.join("data", target_data + ".dat"), dtype=np.float32, mode="r", shape=(num_entries, ndim))

# Load a local model (a gzipped LightGBM model text file)
model_location = "/exes/kipple_repo/kipple/models/initial.txt.gz"
with gzip.open(model_location, "rb") as f:
    md = f.read().decode("ascii")
mdl = lgb.Booster(model_str=md)

# Every sample is malware, so a score above 0.85 counts as a correct detection
num_correct = 0
for i in range(num_entries):
    if mdl.predict([malware_data[i]])[0] > 0.85:
        num_correct += 1
print(num_correct / num_entries)
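
The loop above calls predict once per row. LightGBM's Booster.predict also accepts a 2-D array, so the whole store can be scored in one call and the rate computed afterwards. The helper below is a small sketch of that second step; detection_rate is a hypothetical name, and the 0.85 default simply mirrors the threshold used in the example.

```python
def detection_rate(scores, threshold=0.85):
    """Fraction of scores strictly above the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores given")
    return sum(1 for s in scores if s > threshold) / len(scores)


# With the objects from the example above, all rows can be scored in one call:
#     scores = mdl.predict(malware_data)
#     print(detection_rate(scores))
```

Keeping the threshold as a parameter makes it easy to see how the detection rate changes as the cutoff moves.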

There are more examples in the main kipple repository.
