drjerry / cve-score

ML research on software vulnerabilities

License: BSD 2-Clause "Simplified" License

Python 92.41% Shell 4.68% JavaScript 0.53% R 2.38%
vulnerability-analysis machine-learning exploit-prediction vulnerability-data


Scoring software vulnerabilities

This project evolved as a collection of tools for analyzing software vulnerability data. It is largely a set of command-line utilities. Each script focuses on a single unit of work, the aim being that more complex processing pipelines are built via composition. This design makes it easy to leverage other command-line utilities while keeping the API surface minimal.

One of the main intended uses is training ML models for the exploit prediction problem. Please see the paper references for more background.

System requirements

The utilities target Python 3 (tested against 3.5-3.7). See requirements.txt for the Python dependencies.

jq 1.5+ is required for essentially all data processing tasks (see the data workflow below). You can download the latest stable version for your platform, or install it through the system package manager on most Linux distributions.

Data workflow

Exploit prediction is a supervised learning problem. Most machine learning workflows start by marshaling the data into a tabular format--an N-by-D feature matrix, together with an additional column for the labels--and perform all cleaning and feature engineering steps from there. The DataFrame structures in R and Pandas are designed around this.

The tools here emphasize a different, "opinionated" workflow whose point of departure is the fact that raw vulnerability data is most readily available in a hierarchically structured format like XML or JSON instead of flat tables. The target format for the data is a line-delimited file of JSON records -- the so-called JSONL format. Each data cleaning or feature engineering step consumes a JSONL file and emits a new one, thereby building a pipeline of processing steps with checkpoints along the way.
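
As a minimal illustration of this record-at-a-time style, the following is a sketch (not one of the project's scripts) of a JSONL filter that reads records from stdin and writes transformed records to stdout; the "summary" field and the derived "summary_length" feature are hypothetical:

#!/usr/bin/env python3
"""Minimal JSONL filter: one JSON record per input line, one per output line."""
import json
import sys

def transform(record):
    # Placeholder for a single cleaning or feature-engineering step.
    record["summary_length"] = len(record.get("summary", ""))
    return record

for line in sys.stdin:
    line = line.strip()
    if line:
        print(json.dumps(transform(json.loads(line))))

Filters of this shape compose with other command-line tools such as jq via ordinary shell pipes, which is what makes the checkpointed pipeline described above possible.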

One of the design decisions that is best made explicit early on involves the preferred way of defining and encoding features. Suppose that the input records have a top-level property called "foo", whose value in each record is an object of categorical attributes:

{..., "foo": {"type": "debian", "version": "1.2", "affected": true}, ...}
{..., "foo": {"type": "legacy", ""affected": false}, ...}
...

One possible approach is to create a feature for each of the paths foo.type, foo.version, foo.affected, etc., each of which will be a categorical variable with its own one-hot encoding. Instead, the preferred approach is to use a bag-of-words encoding for the top-level property. Its vocabulary is the space of all inter-object paths, e.g., type.debian, version.1.2, etc., so that the preprocessed records become:

{..., "foo": ["type.debian", "version.1.2", "affected.true"], ...}
{..., "foo": ["type.legacy", "affected.false"], ...}
...

The two approaches are mathematically equivalent. However, the latter helps keep the data wrangling steps simpler: for each data set, one only needs to specify the transforms and encodings for a bounded set of top-level properties.
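
For concreteness, here is a sketch of how such a nested object might be flattened into bag-of-words tokens; the function name and the lower-casing of values are assumptions rather than the project's actual preprocessor:

def flatten_categorical(obj):
    """Turn {"type": "debian", "version": "1.2"} into ["type.debian", "version.1.2"]."""
    return [f"{key}.{str(value).lower()}" for key, value in obj.items()]

print(flatten_categorical({"type": "debian", "version": "1.2", "affected": True}))
# -> ['type.debian', 'version.1.2', 'affected.true']
print(flatten_categorical({"type": "legacy", "affected": False}))
# -> ['type.legacy', 'affected.false']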

The data cleaning and feature engineering steps of the workflow operate on the data one record at a time (or "row-wise"), and the final encoding step then transforms the data into a columnar format consumable for ML training and evaluation. That target format is a Python dictionary associating feature names (the keys) to 2D numpy arrays. A given array will have shape (N, K), where N is the number of records and K is the dimensionality of the vector encoding for that feature. Note that the term "feature" here is applied loosely, as it may include the class labels for a supervised learning problem, in which case K=1.
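
For example, an encoded data set of N=3 records with one bag-of-words feature (K=4) and a binary label column (K=1) might look like the following sketch; the values are purely illustrative:

import numpy as np

# Illustrative encoded output: feature names mapped to 2D arrays of shape (N, K).
encoded = {
    "foo":   np.array([[1, 0, 1, 0],
                       [0, 1, 0, 1],
                       [1, 1, 0, 0]], dtype=np.float32),   # shape (3, 4)
    "label": np.array([[1], [0], [1]], dtype=np.float32),  # shape (3, 1)
}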

Workflow outline

  1. Create a file of JSON records, where all records have the same set of keys corresponding to the "features" of interest. A basic walk-through of data acquisition illustrates this.

  2. Apply the preprocessing script to the raw data, creating another file of JSON records with the same top-level keys, where the corresponding values are either arrays of strings (literally bags of tokens) or numeric values.

  3. Apply the encoding script to transform the preprocessed records into the target dictionary of numpy arrays.

Command line API

This section documents the preprocessing and encoding scripts in more detail. Each of these scripts consumes and emits files as part of a data pipeline that can be summarized as follows:

preprocess.py

Argument     State            Description
config       required input   JSON configuration file.
rawdata      required input   JSONL file of raw features.
processed    output           JSONL file of processed features.
vocabulary   optional output  JSON file of vocabularies.

encode.py

Argument     State            Description
config       required input   JSON configuration file.
vocabulary   required input   JSON file of vocabularies.
processed    required input   JSONL file of processed features.
encoded      output           Dictionary of numpy arrays.

config schema

Both scripts take a config argument that defines the preprocessing and encoding methods applied to each feature. It is a JSON array of objects, one for each feature, with the following schema:

[
  {
    "feature":      // key name in JSON input records.
    "preprocessor": // reference in preprocess.PREPROCESSORS
    "encoder":      // reference in encode.ENCODERS
    // optional key-word arguments for preprocessor and encoder methods.
  },
  ...
]
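
To make the schema concrete, the sketch below shows how such a config might drive a small registry of preprocessor functions. The feature names ("summary", "cvss"), the registry contents, and the dispatch logic are illustrative assumptions; the actual registries live in preprocess.PREPROCESSORS and encode.ENCODERS.

import json

# Hypothetical stand-ins for the real registries, which are project-specific.
PREPROCESSORS = {
    "tokenize": lambda value, **kw: [t.lower() for t in str(value).split()],
    "identity": lambda value, **kw: value,
}

def preprocess_record(record, config):
    """Apply the configured preprocessor to each configured feature (sketch)."""
    out = {}
    for entry in config:
        func = PREPROCESSORS[entry["preprocessor"]]
        # Any extra keys in the config entry are passed through as keyword arguments.
        kwargs = {k: v for k, v in entry.items()
                  if k not in ("feature", "preprocessor", "encoder")}
        out[entry["feature"]] = func(record[entry["feature"]], **kwargs)
    return out

config = json.loads("""
[
  {"feature": "summary", "preprocessor": "tokenize", "encoder": "bag_of_words"},
  {"feature": "cvss",    "preprocessor": "identity", "encoder": "numeric"}
]
""")
record = {"summary": "Buffer overflow in foo", "cvss": 7.5}
print(preprocess_record(record, config))
# -> {'summary': ['buffer', 'overflow', 'in', 'foo'], 'cvss': 7.5}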

vocabulary schema

When working with a feature (text or structured data) to which a bag-of-words encoding will be applied, it is important to extract the vocabulary for that feature, which fixes an assignment of each token to its dimension in the vector representation. Because the vector representation used to train an estimator must also be applied to new examples at inference time, the vocabulary is treated as an artifact of preprocessing that becomes an input to every encoding step.

The vocabulary artifact emitted by the preprocessing script is a JSON file with a simple nested format:

{
  <feature>: {
    <token_0>: <frac_0>,
    <token_1>: <frac_1>,
    ...
  }
  ...
}

The top level keys are features from the input data, but only those targeting bag-of-words encoding; numeric features are absent. The nested maps associate each token in that "feature space" to the fraction of records in which that token appears.

When this object is consumed by the encoding script, the only thing that matters for the vector representation of a feature is its "key space" of tokens, as the token-to-dimension mapping is established by sorting. This allows for different dimension-reduction strategies by pruning or otherwise transforming these nested objects in the input vocabulary; the numeric values (record fractions) are only provided as an aid toward these steps.
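
The sketch below illustrates both ideas, pruning rare tokens and then encoding a bag-of-tokens feature against the sorted key space; the threshold, helper names, and sample fractions are assumptions, not the project's encoder:

import numpy as np

vocabulary = {"foo": {"type.debian": 0.40, "type.legacy": 0.35,
                      "version.1.2": 0.02, "affected.true": 0.60,
                      "affected.false": 0.40}}

def prune(vocab, min_frac=0.05):
    """Drop rare tokens, one possible dimension-reduction step."""
    return {feat: {tok: frac for tok, frac in tokens.items() if frac >= min_frac}
            for feat, tokens in vocab.items()}

def encode_feature(records_tokens, tokens):
    """Map each record's token list to a binary vector; dimensions follow sorted token order."""
    index = {tok: i for i, tok in enumerate(sorted(tokens))}
    out = np.zeros((len(records_tokens), len(index)), dtype=np.float32)
    for row, toks in enumerate(records_tokens):
        for tok in toks:
            if tok in index:
                out[row, index[tok]] = 1.0
    return out

pruned = prune(vocabulary)
X = encode_feature([["type.debian", "version.1.2", "affected.true"],
                    ["type.legacy", "affected.false"]],
                   pruned["foo"])
print(X.shape)  # (2, 4): "version.1.2" was pruned, so it no longer gets a dimension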
