This research project compares the accuracy of various machine learning algorithms and deep learning frameworks in detecting and classifying malicious URLs using lexical features.
Experimental results show that Random Forest, an ensemble-based classifier, outperformed not only 8 other traditional machine learning classifiers but also some deep neural network models built with popular frameworks such as TensorFlow and PyTorch.
- downloaded from: https://www.unb.ca/cic/datasets/url-2016.html
- run the Baseline-Experiments.ipynb Jupyter notebook to download the dataset using the provided Bash script
- dropped samples and attributes with NaN, Infinity and missing values
- removed whitespace from column/attribute names
- labeled 5 URL types with total 36,707 samples (before cleanup)
- consists of 79 lexical features extracted from URLs
- table below shows original sample counts (Total) and New Totals (after data cleanup)
- Total (Dropping NaN Rows) - remaining total samples after dropping samples with NaN values
- ~17K rows were dropped
- Total (Dropping NaN Cols) - remaining total samples after dropping columns/attributes with NaN values
- 7 attributes are dropped as a result with total 72 attributes remaining
Dataset | URL Type | Total | Total (Dropping NaN Rows) | Total (Dropping NaN Cols) |
---|---|---|---|---|
All.csv | benign | 7,781 | 2,709 | 7,781 |
All.csv | defacement | 7,930 | 2,477 | 7,930 |
All.csv | malware | 6,712 | 4,440 | 6,711 |
All.csv | phishing | 7,586 | 4,014 | 7,577 |
All.csv | spam | 6,698 | 5,342 | 6,698 |
All.csv | malicious | 28,926 | 16,273 | 28,916 |
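The two cleanup strategies in the table (dropping rows versus dropping columns) can be sketched as follows. This is a minimal standard-library illustration, not the notebook's actual code; the column names, the sample values, and the helper names `is_missing`, `strip_column_names`, `drop_bad_rows` and `drop_bad_columns` are all made up for the example.

```python
import math

def is_missing(value):
    """Return True for None, empty strings, NaN, and +/-Infinity."""
    if value is None:
        return True
    if isinstance(value, str):
        text = value.strip()
        if text == "":
            return True
        try:
            value = float(text)
        except ValueError:
            return False  # non-numeric text such as the class label
    return not math.isfinite(value)

def strip_column_names(rows):
    """Remove leading/trailing whitespace from every column name."""
    return [{name.strip(): value for name, value in row.items()} for row in rows]

def drop_bad_rows(rows):
    """Keep only the rows in which every value is present and finite."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]

def drop_bad_columns(rows):
    """Keep every row, but remove any column that has a bad value in some row."""
    bad = {name for row in rows for name, v in row.items() if is_missing(v)}
    return [{k: v for k, v in row.items() if k not in bad} for row in rows]

# Toy records (hypothetical column names and values, for illustration only).
raw = [
    {" urlLen": "58", "argPathRatio ": "0.5", "URL_Type": "benign"},
    {" urlLen": "71", "argPathRatio ": "NaN", "URL_Type": "phishing"},
    {" urlLen": "40", "argPathRatio ": "Infinity", "URL_Type": "spam"},
]
rows = strip_column_names(raw)
by_rows = drop_bad_rows(rows)     # fewer samples, all attributes kept
by_cols = drop_bad_columns(rows)  # all samples kept, fewer attributes
print(len(by_rows), sorted(by_rows[0]))
print(len(by_cols), sorted(by_cols[0]))
```

Dropping rows keeps every attribute but loses samples, while dropping columns keeps every sample but loses attributes. That is the same trade-off the two "Dropping NaN" columns of the table show.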
- performance results using various machine learning algorithms and deep learning frameworks are compared
- authors of the dataset[1] have evaluated 3 classifiers
- C4.5 (Decision Tree)
- KNN (K-Nearest Neighbors)
- RF (Random Forest)
- RF achieved the best overall results
- 0.97 Precision and 0.97 Recall on multi-class classification
- ~0.99 Precision and 0.99 Recall on single-class classification
- we evaluate 9 ML classifiers provided in the scikit-learn framework
- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)
- Random Forest (RF)
- Decision Tree (DT)
- AdaBoost (AB)
- 2 linear classifiers (LR and LDA)
- 5 nonlinear (KNN, CART, NB, SVM, and DT)
- 2 Ensemble-based (RF, AB)
- fast.ai provides a high-level Python API wrapper over PyTorch with the goal of making deep learning easier to use
- PyTorch is an open-source Python version of the Torch machine learning framework, developed by Facebook
- PyTorch uses dynamic computational graphs (a.k.a. Define-by-Run approach) which let you process variable-length inputs and outputs
- network is defined dynamically via the actual forward computation
- Keras is an open-source high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
- Keras allows for easy and fast prototyping
- through user friendliness, modularity, and extensibility
- runs seamlessly on CPU and GPU
- we experimented with TensorFlow and Theano as backends
- TensorFlow is an open-source ML framework developed by Google
- TensorFlow uses static computational graphs (a.k.a. Define-and-Run approach)
- Theano is no longer maintained
- use 10-fold cross validation to estimate accuracy results
- split the dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits
- calculate the average accuracy
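The 10-fold procedure described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the notebook's code: the source does not say whether the folds are shuffled or stratified, so the shuffling here is an assumption, and the majority-class "classifier" and the toy data are placeholders.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat, and average."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / k, scores

# Placeholder "classifier": always predicts the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, X_test):
    return [model] * len(X_test)

# Made-up data: 100 samples, 70 labelled "malicious" and 30 "benign".
X = [[i] for i in range(100)]
y = ["malicious"] * 70 + ["benign"] * 30
mean_accuracy, fold_scores = cross_val_accuracy(X, y, fit_majority, predict_majority)
print(f"mean accuracy over {len(fold_scores)} folds: {mean_accuracy:.2f}")
```

Every sample is used for testing exactly once, so the averaged score is less dependent on one lucky or unlucky split than a single train-test split would be.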
- classification of URL types (benign, malware, spam, phishing, defacement)
- Comparison of Algorithms using Box plot
- Validation Results from Best Classifier (Random Forest)
- Confusion Matrix
- Classification Report:
precision recall f1-score support
defacement 0.97 0.95 0.96 526
benign 0.95 0.97 0.96 546
malware 0.98 0.98 0.98 913
phishing 0.91 0.93 0.92 764
spam 0.99 0.98 0.98 1048
accuracy 0.96 3797
macro avg 0.96 0.96 0.96 3797
weighted avg 0.96 0.96 0.96 3797
- Comparison of Algorithms using Box plot
- Validation Results from Best Classifier (Random Forest)
- Confusion Matrix
- Classification Report:
precision recall f1-score support
Defacement 0.98 0.98 0.98 1594
benign 0.97 0.98 0.98 1541
malware 0.99 0.98 0.98 1367
phishing 0.95 0.95 0.95 1523
spam 0.99 0.97 0.98 1315
accuracy 0.97 7340
macro avg 0.97 0.97 0.97 7340
weighted avg 0.97 0.97 0.97 7340
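As a reminder of how the precision, recall, f1-score, support, macro avg and weighted avg columns in a report like the one above are derived from a confusion matrix, here is a minimal standard-library sketch. The 2x2 counts are made up for illustration and are not results from these experiments; `per_class_metrics` and `summarize` are hypothetical helper names.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    report = {}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum: predicted as class i
        support = sum(confusion[i])                   # row sum: truly class i
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return report

def summarize(report, confusion):
    """Return overall accuracy, macro-averaged F1 and support-weighted F1."""
    total = sum(m["support"] for m in report.values())
    accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
    macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
    weighted_f1 = sum(m["f1"] * m["support"] for m in report.values()) / total
    return accuracy, macro_f1, weighted_f1

# Made-up counts: rows are true classes, columns are predicted classes.
labels = ["benign", "malicious"]
confusion = [[90, 10],
             [5, 95]]
report = per_class_metrics(confusion, labels)
accuracy, macro_f1, weighted_f1 = summarize(report, confusion)
for label, m in report.items():
    print(f"{label:10s} {m['precision']:.2f} {m['recall']:.2f} {m['f1']:.2f} {m['support']}")
print(f"accuracy {accuracy:.3f}")
```

The macro average weights every class equally, while the weighted average weights each class by its support. The two can differ noticeably when the classes are imbalanced.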
Framework | CPU Accuracy (%) | GPU Accuracy (%) | TPU Accuracy (%) |
---|---|---|---|
Fast.AI | 97.08 | 97.23 | 97.26 |
Keras-TensorFlow | 96.37 | 95.79 | 95.60 |
Keras-Theano | * | * | * |
- re-labeled defacement, malware, phishing, and spam as the malicious type (1) and benign as 0
- detecting malicious URLs (malicious vs. benign)
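The relabeling step can be sketched as a simple mapping from the five URL types to a binary label. This is an illustrative sketch only; the function name `to_binary_label` and the case-insensitive matching are assumptions, not taken from the notebook.

```python
MALICIOUS_TYPES = {"defacement", "malware", "phishing", "spam"}

def to_binary_label(url_type):
    """Map a URL type to 1 (malicious) or 0 (benign)."""
    name = url_type.strip().lower()
    if name == "benign":
        return 0
    if name in MALICIOUS_TYPES:
        return 1
    raise ValueError(f"unknown URL type: {url_type!r}")

labels = ["benign", "spam", "Defacement", "malware", "phishing", "benign"]
binary = [to_binary_label(t) for t in labels]
print(binary)  # [0, 1, 1, 1, 1, 0]
```

Raising on an unknown type, instead of silently mapping it to 0, keeps a mislabeled sample from being counted as benign.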
- Comparison of Algorithms using Box plot
- Validation Results from Best Classifier (Random Forest)
- Confusion Matrix
- Classification Report:
precision recall f1-score support
benign 0.95 0.98 0.97 546
malicious 1.00 0.99 0.99 3251
accuracy 0.99 3797
macro avg 0.97 0.99 0.98 3797
weighted avg 0.99 0.99 0.99 3797
- Comparison of Algorithms using Box plot
- Validation Results from Best Classifier (Random Forest)
- Confusion Matrix
- Classification Report:
precision recall f1-score support
benign 0.97 0.98 0.98 1541
malicious 0.99 0.99 0.99 5799
accuracy 0.99 7340
macro avg 0.98 0.99 0.98 7340
weighted avg 0.99 0.99 0.99 7340
Framework | CPU Accuracy (%) | GPU Accuracy (%) | TPU Accuracy (%) |
---|---|---|---|
Fast.AI | 98.83 | 98.62 | 98.73 |
Keras-TensorFlow | 98.62 | 98.70 | 98.79 |
Keras-Theano | * | * | * |
1. Mohammad Saiful Islam Mamun, Mohammad Ahmad Rathore, Arash Habibi Lashkari, Natalia Stakhanova and Ali A. Ghorbani, "Detecting Malicious URLs Using Lexical Analysis", Network and System Security, Springer International Publishing, pp. 467-482, 2016
2. Your First Machine Learning Project in Python Step-by-Step - https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
3. Define-by-Run - https://docs.chainer.org/en/stable/guides/define_by_run.html
4. TensorFlow: Static Graphs - https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
5. Home - Keras Documentation - https://keras.io/
6. TensorFlow | TensorFlow - https://www.tensorflow.org/