This project forked from colorado-mesa-university-cybersecurity/deeplearning-maliciousurls

Machine Learning Models to Detect and Classify Malicious URLs

License: MIT License

Introduction

This research project compares the accuracy of various machine learning algorithms and deep learning frameworks in detecting and classifying malicious URLs using lexical features.

Experimental results show that Random Forest, an ensemble-based classifier, outperformed not only 8 other traditional machine learning classifiers but also some deep neural network models built with popular frameworks such as TensorFlow and PyTorch in detecting and classifying malicious URLs using lexical features.

Dataset

Data Cleanup

  • dropped samples and attributes with NaN, Infinity and missing values
  • removed whitespace from column/attribute names
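The two cleanup strategies can be sketched with pandas as follows. This is a minimal illustration, not the notebooks' actual code: the helper names `clean_by_rows` and `clean_by_columns`, the toy frame, and its column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_by_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Strip column names, then drop every sample (row) with NaN or +/-Infinity."""
    df = df.rename(columns=lambda name: name.strip())
    df = df.replace([np.inf, -np.inf], np.nan)  # treat Infinity like a missing value
    return df.dropna(axis=0)


def clean_by_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip column names, then drop every attribute (column) with NaN or +/-Infinity."""
    df = df.rename(columns=lambda name: name.strip())
    df = df.replace([np.inf, -np.inf], np.nan)
    return df.dropna(axis=1)


# Tiny hypothetical frame standing in for All.csv (column names are made up).
toy = pd.DataFrame({
    " urlLen ": [10, 25, 40],
    "ratio": [0.5, np.inf, 0.1],
    "label": ["benign", "spam", "malware"],
})
rows_kept = clean_by_rows(toy)      # the row holding Infinity is removed
cols_kept = clean_by_columns(toy)   # the "ratio" column is removed instead
```

Dropping rows keeps every attribute but loses samples, while dropping columns keeps the samples but loses attributes.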

Dataset Summary

  • 5 labeled URL types with a total of 36,707 samples (before cleanup)
  • consists of 79 lexical features extracted from URLs
  • table below shows the original sample counts (Total) and the new totals after each data cleanup strategy
  • Total (Dropping NaN Rows) - remaining total samples after dropping samples with NaN values
    • ~17.7K rows were dropped (36,707 → 18,982)
  • Total (Dropping NaN Cols) - remaining total samples after dropping columns/attributes with NaN values
    • 7 attributes were dropped as a result, leaving 72 attributes
Dataset   URL Type     Total    Dropping NaN Rows   Dropping NaN Cols
All.csv   benign        7,781               2,709               7,781
          defacement    7,930               2,477               7,930
          malware       6,712               4,440               6,711
          phishing      7,586               4,014               7,577
          spam          6,698               5,342               6,698
          malicious    28,926              16,273              28,916
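The totals in the table can be cross-checked with a few lines of arithmetic. The counts below are copied from the table; the variable names are just for illustration.

```python
# Per-class sample counts copied from the table above.
total = {"benign": 7781, "defacement": 7930, "malware": 6712,
         "phishing": 7586, "spam": 6698}
after_row_drop = {"benign": 2709, "defacement": 2477, "malware": 4440,
                  "phishing": 4014, "spam": 5342}

# The "malicious" row is the sum of the four non-benign types.
malicious_total = sum(n for label, n in total.items() if label != "benign")
malicious_after_row_drop = sum(
    n for label, n in after_row_drop.items() if label != "benign")

# Number of samples lost when rows with NaN values are dropped.
rows_dropped = sum(total.values()) - sum(after_row_drop.values())

print(sum(total.values()), malicious_total, rows_dropped)
```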

Machine Learning Algorithms

  • performance results using various machine learning algorithms and deep learning frameworks are compared
  • authors of the dataset[1] evaluated 3 classifiers
    • C4.5 (Decision Tree)
    • KNN (K-Nearest Neighbors)
    • RF (Random Forest)
    • RF achieved the best overall results
      • 0.97 Precision and 0.97 Recall on Multi-class
      • ~0.99 Precision and 0.99 Recall on Single-class
  • we evaluate 9 ML classifiers provided in the scikit-learn framework
    1. Logistic Regression (LR)
    2. Linear Discriminant Analysis (LDA)
    3. K-Nearest Neighbors (KNN)
    4. Classification and Regression Trees (CART)
    5. Gaussian Naive Bayes (NB)
    6. Support Vector Machines (SVM)
    7. Random Forest (RF)
    8. Decision Tree (DT)
    9. AdaBoost (AB)
  • 2 linear classifiers (LR and LDA)
  • 5 nonlinear (KNN, CART, NB, SVM, and DT)
  • 2 ensemble-based (RF and AB)
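As a rough sketch, the nine classifiers could be instantiated in scikit-learn as shown below. The hyperparameters are assumptions, not the project's settings. The README does not say how CART and DT differ, so both are mapped to `DecisionTreeClassifier` here, with the entropy criterion for DT chosen purely for illustration. The data is synthetic and only stands in for the URL features.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Abbreviations follow the list above; all settings are illustrative assumptions.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(criterion="gini", random_state=0),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
}

# Synthetic 5-class data standing in for the lexical URL features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_redundant=0, n_classes=5, random_state=0)

# Fit each model and record its accuracy on the training data (a smoke test only).
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
```

Training accuracy is used here only to show that every model fits; it says nothing about generalization.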

Deep Learning Frameworks

fast.ai & PyTorch

  • fast.ai provides a high-level Python API on top of PyTorch, with the goal of making deep learning easier to use
  • PyTorch is an open-source machine learning framework for Python, based on Torch and developed by Facebook
  • PyTorch uses dynamic computational graphs (a.k.a. Define-by-Run approach), which let you process variable-length inputs and outputs
  • the network is defined dynamically via the actual forward computation

Keras, TensorFlow & Theano

  • Keras is an open-source, high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
  • Keras allows for easy and fast prototyping
    • through user friendliness, modularity, and extensibility
    • runs seamlessly on CPU and GPU
  • we experimented with TensorFlow and Theano as backends
  • TensorFlow is an open-source ML framework developed by Google
  • TensorFlow traditionally uses static computational graphs (a.k.a. Define-and-Run approach); since version 2.0 it executes eagerly by default
  • Theano is no longer maintained

Model Evaluations

  • use 10-fold cross-validation to estimate accuracy
  • split the dataset into 10 parts, train on 9 and test on 1, and repeat so that each part is used as the test set once
  • calculate the average accuracy across the 10 folds
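The procedure above can be sketched in plain Python. The names `k_fold_indices`, `cross_val_accuracy`, and the `fit_predict` callback are hypothetical and exist only for this example. The toy "model" simply predicts the majority class, and the labels are made up.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(fit_predict, X, y, k=10):
    """Average accuracy over k folds.

    fit_predict(X_train, y_train, X_test) is a hypothetical callback that
    trains a model and returns predicted labels for X_test.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        preds = fit_predict(X_train, y_train, X_test)
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def majority_class(X_train, y_train, X_test):
    """Toy 'model': always predict the most frequent training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Made-up data: 40 malicious and 10 benign samples.
X = [[i] for i in range(50)]
y = ["malicious"] * 40 + ["benign"] * 10
mean_accuracy = cross_val_accuracy(majority_class, X, y, k=10)
```

In practice scikit-learn's `cross_val_score` does this in one call; the manual version only makes the train-on-9, test-on-1 rotation explicit.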

Experiments and Results

Multi-class Classification (All.csv)

  • classification of URL types (benign, malware, spam, phishing, defacement)

Machine Learning Algorithm Results

Results on Dataset after Dropping NaN Rows

  • Comparison of Algorithms using Box plot

  • Validation Results from Best Classifier (Random Forest)

  • Confusion Matrix

  • Classification Report:

              precision    recall  f1-score   support

  defacement       0.97      0.95      0.96       526
      benign       0.95      0.97      0.96       546
     malware       0.98      0.98      0.98       913
    phishing       0.91      0.93      0.92       764
        spam       0.99      0.98      0.98      1048

    accuracy                           0.96      3797
   macro avg       0.96      0.96      0.96      3797
weighted avg       0.96      0.96      0.96      3797
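For readers unfamiliar with the report columns, the sketch below shows how precision, recall, F1, and the macro and weighted averages are derived from a confusion matrix. The function names and the 2x2 matrix are made up for illustration; the matrix is not taken from the experiments.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    report = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        support = sum(confusion[k])                     # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return report


def averages(report):
    """Return (macro, weighted) averages of precision, recall and F1."""
    keys = ("precision", "recall", "f1")
    total = sum(m["support"] for m in report.values())
    macro = {key: sum(m[key] for m in report.values()) / len(report)
             for key in keys}
    weighted = {key: sum(m[key] * m["support"] for m in report.values()) / total
                for key in keys}
    return macro, weighted


# Made-up confusion matrix: rows are true classes, columns are predictions.
labels = ["benign", "malicious"]
confusion = [[8, 2],
             [1, 9]]
report = per_class_metrics(confusion, labels)
macro, weighted = averages(report)
accuracy = (sum(confusion[k][k] for k in range(len(labels)))
            / sum(map(sum, confusion)))
```

The macro average treats every class equally, while the weighted average scales each class by its support, which is why weighted recall always equals overall accuracy.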

Results on Dataset after Dropping NaN Columns

  • Comparison of Algorithms using Box plot

  • Validation Results from Best Classifier (Random Forest)

  • Confusion Matrix

  • Classification Report:

              precision    recall  f1-score   support

  Defacement       0.98      0.98      0.98      1594
      benign       0.97      0.98      0.98      1541
     malware       0.99      0.98      0.98      1367
    phishing       0.95      0.95      0.95      1523
        spam       0.99      0.97      0.98      1315

    accuracy                           0.97      7340
   macro avg       0.97      0.97      0.97      7340
weighted avg       0.97      0.97      0.97      7340

Deep Learning Framework Results

Framework          CPU Accuracy (%)   GPU Accuracy (%)   TPU Accuracy (%)
Fast.AI            97.08              97.23              97.26
Keras-TensorFlow   96.37              95.79              95.60
Keras-Theano       *                  *                  *

Binary-class Classification (All.csv)

  • re-labeled defacement, malware, phishing, and spam as malicious (1) and benign as 0
  • detecting malicious URLs (malicious vs. benign)
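A minimal sketch of the re-labeling step follows. The helper name `to_binary` and the sample list are assumptions for illustration; the label spellings follow the URL types listed above.

```python
MALICIOUS_TYPES = {"defacement", "malware", "phishing", "spam"}


def to_binary(url_type: str) -> int:
    """Return 0 for benign and 1 for any of the four malicious URL types."""
    label = url_type.strip().lower()  # tolerate case and stray whitespace
    if label == "benign":
        return 0
    if label in MALICIOUS_TYPES:
        return 1
    raise ValueError(f"unknown URL type: {url_type!r}")


# Made-up sample of multi-class labels.
url_types = ["benign", "spam", "Defacement", "phishing", "malware"]
binary_labels = [to_binary(t) for t in url_types]
print(binary_labels)  # [0, 1, 1, 1, 1]
```

Raising on unknown labels, rather than silently mapping them to one class, keeps mislabeled rows from skewing the binary class balance.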

Machine Learning Algorithm Results

Results on Dataset after Dropping NaN Rows

  • Comparison of Algorithms using Box plot

  • Validation Results from Best Classifier (Random Forest)

  • Confusion Matrix

  • Classification Report:

              precision    recall  f1-score   support

      benign       0.95      0.98      0.97       546
   malicious       1.00      0.99      0.99      3251

    accuracy                           0.99      3797
   macro avg       0.97      0.99      0.98      3797
weighted avg       0.99      0.99      0.99      3797

Results on Dataset after Dropping NaN Columns

  • Comparison of Algorithms using Box plot

  • Validation Results from Best Classifier (Random Forest)

  • Confusion Matrix

  • Classification Report:

              precision    recall  f1-score   support

      benign       0.97      0.98      0.98      1541
   malicious       0.99      0.99      0.99      5799

    accuracy                           0.99      7340
   macro avg       0.98      0.99      0.98      7340
weighted avg       0.99      0.99      0.99      7340

Deep Learning Framework Results

Framework          CPU Accuracy (%)   GPU Accuracy (%)   TPU Accuracy (%)
Fast.AI            98.83              98.62              98.73
Keras-TensorFlow   98.62              98.70              98.79
Keras-Theano       *                  *                  *

References

  1. Mohammad Saiful Islam Mamun, Mohammad Ahmad Rathore, Arash Habibi Lashkari, Natalia Stakhanova and Ali A. Ghorbani, "Detecting Malicious URLs Using Lexical Analysis", Network and System Security, Springer International Publishing, pp. 467-482, 2016.

Contributors

rdunski, rambasnet, johnsonclayton, ndbellew, bkhadka2
