
machine-learning-based-malware-detection-engine's Introduction

Machine Learning Malware Detection Engine

The X Lab

OKR

Objective

  • Build an experimental prototype of an ML-based malicious file detection engine. Malicious files = [Windows Files, PDF, Android APPs]

Key Results

  1. Basic
  • CPU Consumption: (to be filled in)
  • Memory Consumption: (to be filled in)
  • Size of the dataset: needs to be at the 5M scale
  • ML model training time using the whole dataset: within 12 hours
  • Average model prediction time per sample in the firewall scenario: needs to be at the 1 ms level
  • Maximum model prediction time per sample in the firewall scenario: needs to be at the 10 ms level
  • Model False Positive Rate: needs to be < 0.0001%
  • Model Update Frequency: needs to be once every day
  • Model accuracy: needs to be 99.999%
  • Self-evidence of effectiveness: effective discovery of unknown threats (e.g., cross-validation against the VT service)
  2. Medium
  • File type support extended from Windows files (PEs, DLLs) to MS-Office docs, PDFs, Android & iOS, Linux, etc.
  • Cascaded Malware Prediction Engine: from coarse-grained good/bad file classification to finer-grained malware family prediction
  3. Advanced
  • Construction of the cloud-based backend management system
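
The false positive rate and accuracy targets in the Basic tier can be checked from confusion-matrix counts. The following is a minimal sketch and not part of the original project: the class and function names are hypothetical, "positive" is assumed to mean "malicious", and the example counts are made up for illustration.

```python
from dataclasses import dataclass

# Targets taken from the Key Results above.
MAX_FALSE_POSITIVE_RATE = 0.0001 / 100   # < 0.0001 %
MIN_ACCURACY = 99.999 / 100              # 99.999 %


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from evaluating a binary detector (positive = malicious)."""
    true_positives: int
    false_positives: int
    true_negatives: int
    false_negatives: int


def false_positive_rate(c: ConfusionCounts) -> float:
    """Fraction of benign files wrongly flagged as malicious."""
    benign_total = c.false_positives + c.true_negatives
    if benign_total == 0:
        raise ValueError("no benign samples in the evaluation set")
    return c.false_positives / benign_total


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all files classified correctly."""
    total = (c.true_positives + c.false_positives
             + c.true_negatives + c.false_negatives)
    if total == 0:
        raise ValueError("empty evaluation set")
    return (c.true_positives + c.true_negatives) / total


def meets_targets(c: ConfusionCounts) -> bool:
    """True if both the FPR and accuracy targets are satisfied."""
    return (false_positive_rate(c) < MAX_FALSE_POSITIVE_RATE
            and accuracy(c) >= MIN_ACCURACY)


# Hypothetical counts for illustration only (5M samples in total).
example = ConfusionCounts(
    true_positives=1_999_990,
    false_positives=2,
    true_negatives=2_999_998,
    false_negatives=10,
)
print(false_positive_rate(example), accuracy(example), meets_targets(example))
```

A rate below 0.0001% means fewer than one false alarm per million benign files, so the benign part of the evaluation set has to contain well over a million samples before the measured rate says anything useful.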

Questions to ask

Accuracy

  • Upon how much data does the machine learning solution base its decisions? Is it enough?

  • From where does the data come? Is there a wide variety of sources, or are they dependent on third-party threat aggregator sites?

  • How often is the data collected?

  • How often are new models trained and propagated to the customer?

  • How is the system trained? Is it trained through a constant supply of rich data sets, so properties discovered can be used in future machine learning decisions?

  • How does the vendor handle false positives?

  • How does the vendor handle false negatives that the vendor later discovers (after the customer has run the malware)?

Speed

  • How quickly can the solution make a determination that leads to action?

  • How quickly can it obtain enough relevant new data to influence the decisions it makes?
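
One way to put a number on the first speed question is to time per-sample predictions and compare the mean and maximum with the 1 ms and 10 ms levels listed in the Key Results. This is a rough sketch under stated assumptions: `measure_latency_ms` and `dummy_predict` are hypothetical names, and `dummy_predict` is only a placeholder for a real model, not a detector.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Tuple

# Per-sample latency levels from the Key Results (firewall scenario).
MEAN_LATENCY_TARGET_MS = 1.0
MAX_LATENCY_TARGET_MS = 10.0


def measure_latency_ms(predict: Callable[[bytes], bool],
                       samples: Iterable[bytes]) -> Tuple[float, float]:
    """Return (mean, max) wall-clock prediction time in milliseconds."""
    timings = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    if not timings:
        raise ValueError("no samples to time")
    return mean(timings), max(timings)


def dummy_predict(sample: bytes) -> bool:
    """Placeholder for a real model: only checks for the 'MZ' header bytes."""
    return sample[:2] == b"MZ"


# Hypothetical inputs: short byte strings standing in for real files.
samples = [b"MZ\x90\x00", b"%PDF-1.7"] * 100
mean_ms, max_ms = measure_latency_ms(dummy_predict, samples)
within_targets = (mean_ms <= MEAN_LATENCY_TARGET_MS
                  and max_ms <= MAX_LATENCY_TARGET_MS)
print(f"mean={mean_ms:.4f} ms, max={max_ms:.4f} ms, within targets: {within_targets}")
```

Timing only the `predict` call leaves out file I/O and feature extraction. A real measurement in the firewall scenario would likely need to include those steps as well.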

Efficiency

  • Where and how quickly does the analysis take place?

  • What is the impact on the end-user system?

  • What type of analysis is done on incoming files? On endpoints only, on cloud only, or a combination?

  • Does it rely on post-event analysis (detecting rather than preventing)?

Rule of Thumb

  • Ongoing training of the model relies on continuous access to large amounts of new data.

  • The rate of false positives can be extremely high if the data set is not robust.

  • The model must have access to both benign and malicious data in order to distinguish accurately between the two.

  • Training a model based solely on bad data increases the chance of high false-positive rates.
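
The requirement that training data contain both benign and malicious samples can be enforced with a simple check before training. This is a minimal sketch: the label encoding (`0` for benign, `1` for malicious) and the function name `class_balance` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

BENIGN = 0      # assumed label encoding
MALICIOUS = 1   # assumed label encoding


def class_balance(labels: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of benign and malicious samples.

    Raises ValueError if either class is missing or a label is unknown.
    """
    counts = Counter(labels)
    unknown = set(counts) - {BENIGN, MALICIOUS}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if counts[BENIGN] == 0 or counts[MALICIOUS] == 0:
        raise ValueError(
            "training data must contain both benign and malicious samples")
    total = counts[BENIGN] + counts[MALICIOUS]
    return {
        "benign": counts[BENIGN] / total,
        "malicious": counts[MALICIOUS] / total,
    }


# Hypothetical label list for illustration.
balance = class_balance([BENIGN, BENIGN, BENIGN, MALICIOUS])
print(balance)
```

The check only guards against a missing class. It says nothing about whether the samples are varied enough for the data set to be robust.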
