Classification-of-Malicious-Code-by-ML-PY

Intro

The goal of this project is to perform malware analysis on datasets produced by scanning files with Python, and to classify the samples with basic machine learning algorithms: Gaussian Naive Bayes, Random Forest Classifier, Decision Tree Classifier, and Linear SVC. We use two datasets: one generated by scanning exe/dll files, and one sample dataset obtained from the Internet.

About PEFILE

A PE (Portable Executable) file is a file in a specific format: executable files (exe), dynamic link libraries (dll), and driver files (sys) all use the PE file format. The pefile module can parse, read, and modify PE files.

  • The structure of a PE file stored on disk differs from its structure after it is loaded into memory.
  • When the Windows loader loads a PE file into memory, the in-memory version is called a module (Module).
  • The starting address of the mapped file is called the module handle (hModule), also known as the base address (ImageBase).
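
As a rough illustration of the on-disk layout, the sketch below reads the ImageBase field straight from the file headers using only the standard library. The function name `read_image_base` is hypothetical and not part of this project. With pefile, the same value should be available as `pefile.PE(path).OPTIONAL_HEADER.ImageBase`. Note that this field is the preferred base address recorded in the file, and the loader may map the module at a different address.

```python
import struct


def read_image_base(data: bytes) -> int:
    """Return the ImageBase stored in the optional header of raw PE bytes."""
    # Every PE file starts with a DOS header whose first two bytes are "MZ".
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")

    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")

    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt = e_lfanew + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt)

    if magic == 0x10B:  # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
    elif magic == 0x20B:  # PE32+: 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    else:
        raise ValueError(f"unknown optional header magic: {magic:#x}")
    return image_base
```

To try it on a real file, read the bytes first, for example `read_image_base(open(path, "rb").read())`, where `path` points to any exe or dll.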

Install

pip install pefile

Import

import pefile
import os, string, shutil, re
import sys
import csv
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

Usage

python scan_file <insert exe file>
Read <Input>.csv: manually enter the csv file generated by scanFile.
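
The exact columns written by the scanner are not listed here, so the following is only a sketch of the scan-to-csv step under assumed column names. `coff_features` and `rows_to_csv` are hypothetical helpers, not functions from this project: they read four fields of the COFF file header with the standard library and format one csv row per scanned file. The project itself relies on pefile for parsing.

```python
import csv
import io
import struct


def coff_features(data: bytes) -> dict:
    """Return a few COFF file-header fields from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew (offset 0x3C) points at the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")

    # The 20-byte COFF file header follows the signature directly.
    (machine, n_sections, timestamp, _symtab, _n_symbols,
     _opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, e_lfanew + 4)

    # Column names are assumptions chosen for this sketch.
    return {
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "Characteristics": characteristics,
    }


def rows_to_csv(rows) -> str:
    """Format feature dicts as csv text with a header row."""
    rows = list(rows)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The returned text can be written to a file opened with `newline=""` and later loaded with `pandas.read_csv` for the analysis step.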

What to expect?

The scanner should produce an output dataset (csv) from the exe file. Using that csv, we should be able to run machine learning algorithms for analysis. The algorithms used are those listed with the results below.

The following results are from the internet example dataset:

  • Gaussian Naive Bayes model accuracy(in %): 32.24573030843742
  • Random Forest model accuracy(in %): 98.44506755034412
  • Decision Tree Classifier Accuracy: 71.32296711700229
  • Linear SVC Classifier accuracy(in %): 96.06853296619245
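
A comparison like the one above could be set up roughly as follows. This is a minimal sketch and not the project's actual script: it uses synthetic data from `make_classification` as a stand-in for the scanner csv, and the function name `compare_models`, the 75/25 split, the feature scaling, and the use of `LinearSVC` (the import list above uses `SVC`) are all assumptions. The printed numbers will not match the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, seed=0):
    """Fit four classifiers on one split and return test accuracy in %."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)

    models = {
        "Gaussian Naive Bayes": GaussianNB(),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, random_state=seed),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        # Linear SVMs are sensitive to feature scale, so scale first.
        "Linear SVC": make_pipeline(
            StandardScaler(), LinearSVC(dual=False, random_state=seed)),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = 100.0 * accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the scanner csv (an assumption for this sketch).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0)

for name, acc in compare_models(X, y).items():
    print(f"{name} accuracy (in %): {acc:.2f}")
```

With a real dataset, `X` and `y` would instead come from the csv, for example the feature columns and the label column of a pandas DataFrame.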

The next step:

Continue to study the pefile scanner. Complete the dataset output, and become able to identify and remove malicious files among exe programs.

Credit To

GitHub - pratikpv/malware_detect2: Malware Classification using Machine learning https://github.com/pratikpv/malware_detect2

GitHub - aayuv17/Malware-Analysis: Malware Analysis using Machine Learning https://github.com/aayuv17/Malware-Analysis

GitHub - bindog/ToyMalwareClassification: Kaggle Microsoft Malware Classification https://github.com/bindog/ToyMalwareClassification

erocarrera/pefile: pefile is a Python module to read and work with PE (Portable Executable) files (github.com) https://github.com/erocarrera/pefile

Malicious Code Analysis Practical Series Articles https://github.com/Vxer-Lee/MalwareAnalysis/tree/master/3.%20%E5%8A%A8%E6%80%81%E5%88%86%E6%9E%90%E5%9F%BA%E7%A1%80%E6%8A%80%E6%9C%AF

Machine learning for encrypted malicious traffic detection: Approaches, datasets and comparative study - ScienceDirect

In the future (something to investigate)

https://blog.csdn.net/weixin_46625757/article/details/124088469?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-8-124088469-blog-127131381.235^v29^pc_relevant_default_base3&spm=1001.2101.3001.4242.5&utm_relevant_index=10
