gugle-gu / feature-extraction-for-cert-insider-threat-test-datasets

This project forked from lcd-dal/feature-extraction-for-cert-insider-threat-test-datasets

Feature extraction for CERT insider threat test dataset

License: MIT License

Python 100.00%

Feature extraction for CERT insider threat test dataset

This script extracts features (in CSV format) from the CERT insider threat test dataset [1], [2], versions 4.1 to 6.2. For more details, see the paper: Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning.

[1] Lindauer, Brian (2020): Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1

[2] J. Glasser and B. Lindauer, "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data," 2013 IEEE Security and Privacy Workshops, San Francisco, CA, 2013, pp. 98-104, doi: 10.1109/SPW.2013.37.

Run feature_extraction script

  • Requires Python 3, numpy, pandas, and joblib. The script was written and tested on Linux only.
  • By default, the script extracts week, day, session, and sub-session data (as in the paper).
  • To run the script, place it in the folder of a CERT dataset (e.g. r4.2, decompressed from r4.2.tar.bz2 downloaded here), then run python3 feature_extraction.py
  • To change the number of cores used for parallelization (default 8), run python3 feature_extraction.py numberOfCores, e.g. python3 feature_extraction.py 16.
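
The optional numberOfCores argument described above can be handled along the following lines. This is a minimal sketch, not the script's actual code: the function name parse_num_cores and the error messages are hypothetical, and only the default of 8 and the position of the argument come from the instructions above.

```python
DEFAULT_CORES = 8  # default stated in the run instructions


def parse_num_cores(argv, default=DEFAULT_CORES):
    """Return the core count from an optional first positional argument.

    argv has the same layout as sys.argv: argv[0] is the script name and
    argv[1], when present, is numberOfCores.
    """
    if len(argv) < 2:
        return default
    try:
        cores = int(argv[1])
    except ValueError:
        raise SystemExit(f"numberOfCores must be an integer, got {argv[1]!r}")
    if cores < 1:
        raise SystemExit("numberOfCores must be at least 1")
    return cores


# Equivalent to "python3 feature_extraction.py" and
# "python3 feature_extraction.py 16" respectively.
default_cores = parse_num_cores(["feature_extraction.py"])
custom_cores = parse_num_cores(["feature_extraction.py", "16"])
print(default_cores, custom_cores)
```

In a real run the list would be sys.argv, and the resulting value would typically be passed as the n_jobs argument of joblib.Parallel.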

Extracted Data

Extracted data is stored in the ExtractedData subfolder.

Note that in the extracted data, insider is the label indicating the insider threat scenario (0 is normal). Some extracted features (subs_ind, starttime, endtime, sessionid, user, day, week) are informational and may or may not be used when training machine learning models.

Pre-extracted data from CERT insider threat test dataset r5.2 (gzipped) can be found here.
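
As an illustration of the note above, the sketch below separates the informational columns and the insider label from the remaining features before training. It rests on assumptions: the extracted CSV has a header row, all remaining columns are numeric, and the sample data (including the feature names n_logon and n_email) is made up for the example. Only the informational column names and the meaning of insider come from this README.

```python
import csv
import io

# Informational columns named in the README; a given file may contain
# only some of them, depending on the granularity (week, day, session...).
INFO_COLUMNS = {"subs_ind", "starttime", "endtime", "sessionid", "user", "day", "week"}
LABEL_COLUMN = "insider"


def split_rows(csv_file):
    """Yield (features, is_insider, scenario) for each row of an extracted CSV.

    features   dict of the remaining columns, converted to float (assumption)
    is_insider True when the scenario label is non-zero
    scenario   the raw insider label (0 is normal)
    """
    for row in csv.DictReader(csv_file):
        scenario = int(float(row[LABEL_COLUMN]))
        features = {
            name: float(value)
            for name, value in row.items()
            if name not in INFO_COLUMNS and name != LABEL_COLUMN
        }
        yield features, scenario != 0, scenario


# Hypothetical two-row sample; n_logon and n_email are made-up feature names.
sample = io.StringIO(
    "user,week,n_logon,n_email,insider\n"
    "1,0,5,12,0\n"
    "2,0,9,3,2\n"
)
rows = list(split_rows(sample))
for features, is_insider, scenario in rows:
    print(features, is_insider, scenario)
```

With pandas, the same selection can be written as df.drop(columns=[...], errors="ignore"), which tolerates informational columns that are absent from a particular file.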

Data representations

From the extracted data, temporal_data_representation.py can be used to generate different data representations, as presented in this paper: Anomaly Detection for Insider Threats Using Unsupervised Ensembles.

python3 temporal_data_representation.py --help

Sample classification and anomaly detection results

Sample code is provided in:

Citation

If you use the source code, or the extracted datasets, please cite the following paper:

D. C. Le, N. Zincir-Heywood and M. I. Heywood, "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning," in IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 30-44, March 2020, doi: 10.1109/TNSM.2020.2967721.

Data representations and anomaly detection:

D. C. Le, N. Zincir-Heywood, "Anomaly Detection for Insider Threats Using Unsupervised Ensembles," in IEEE Transactions on Network and Service Management, vol. 18, no. 2, pp. 1152-1164, June 2021, doi: 10.1109/TNSM.2021.3071928.
