popseli / prediction-of-phishing-webpages

This project develops an ML binary classification model to predict phishing webpages.

Jupyter Notebook 100.00%
machine-learning python phishing-websites-detection model-explainability mysql sql webscraping dns-queries

prediction-of-phishing-webpages's Introduction

Project Overview

The aim of this project was to develop a binary classifier that distinguishes phishing webpages from legitimate ones using features based on URL and webpage structure. Phishing webpages belong to websites that attackers operate to collect users' sensitive information, such as email addresses, passwords and bank account details. The information is then used to impersonate victims in order to carry out malicious activities, including theft of money, further cyberattacks and cyberespionage. Because phishing webpages closely resemble legitimate webpages that prompt for sensitive information, users find it difficult to tell the two apart, and most users, even those with cybersecurity awareness, fall for phishing attacks. The purpose of this project, therefore, is to build an automated ML-based classifier that exploits key structural differences between the two types of webpages to detect phishing webpages automatically, thus protecting online users from these attacks. The classifier could serve as the basis of an application deployed as a built-in web browser feature or a browser plug-in, offering protection at the moment a user attempts to access any webpage that prompts for sensitive information. The dataset, feature descriptions and code of the project are available in the GitHub repository.

Objectives

To achieve the aim stated above, the following tasks were to be performed:

  • Investigating and identifying potential features based on webpage structure for distinguishing the two types of webpages.
  • Extracting the features from reliable data sources of phishing and legitimate webpages.
  • Identifying the most relevant feature set for the prediction task.
  • Evaluating the relevant feature set using various ML classification algorithms to identify the best-performing algorithm across a wide range of metrics.
  • Performing model explainability analysis to learn the influence of each feature on the model's output.

Tasks Performed

  • Retrieving sets of active phishing and legitimate webpages from online repositories
  • Creating a dataset by extracting feature values from the webpages
  • Data profiling
  • Data cleaning
  • Exploratory data analysis
  • Feature correlation analysis
  • Automated feature selection
  • Model evaluation
  • ROC analysis
  • Hyperparameter tuning
  • Model explainability analysis
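
As an illustration of the feature correlation analysis step listed above, here is a minimal sketch in plain Python. It computes the Pearson correlation between feature columns and flags pairs whose absolute correlation meets a threshold. The feature names, the toy values and the 0.9 threshold are assumptions made for this example; they are not taken from the project's notebooks.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Toy columns with hypothetical feature names (not the project's data).
toy = {
    "url_length": [10, 20, 30, 40, 50],
    "fqdn_length": [11, 19, 31, 42, 49],  # nearly proportional to url_length
    "num_forms": [3, 1, 4, 1, 5],
}
flagged = highly_correlated_pairs(toy)
```

A pair flagged this way is a candidate for dropping one of its two features before automated feature selection, since both carry largely the same signal.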

Dataset

We identified a total of 35 features for the prediction task, based on URL and webpage structure, webpage contents and third-party information related to a webpage. We then retrieved 12,691 phishing and 13,494 legitimate webpages that prompt for sensitive information from the PhishTank and Tranco online repositories. From each active webpage, we extracted feature values to form a dataset of 26,115 records, which were stored in a MySQL database.
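
To give a feel for what URL-based feature extraction can look like, the sketch below derives a few values from a URL using only the Python standard library. The feature names and the set of characters counted as obfuscation markers are hypothetical. The project's actual 35 features are defined in its feature descriptions and may be computed differently.

```python
from urllib.parse import urlsplit

# Assumption for illustration: characters counted as obfuscation markers.
OBFUSCATION_CHARS = set("-_@%")

def url_features(url):
    """Extract a few illustrative URL-based features (hypothetical names)."""
    parts = urlsplit(url)
    fqdn = parts.hostname or ""  # fully qualified domain name, lower-cased
    return {
        "url_length": len(url),
        "fqdn_length": len(fqdn),
        "dots_in_fqdn": fqdn.count("."),
        "obfuscation_chars_in_fqdn": sum(ch in OBFUSCATION_CHARS for ch in fqdn),
        "uses_https": int(parts.scheme == "https"),
    }

example = url_features("http://secure-login.example.com/account?id=1")
```

Each webpage would produce one such dictionary, which maps naturally onto one row of a database table.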

Key Software and Libraries Used

  • Python
  • Scikit-learn
  • NumPy
  • Pandas
  • MySQL
  • GeoIP2 database
  • Google and Bing search engines
  • Google Translator
  • PhishTank
  • Beautiful Soup
  • Matplotlib
  • Seaborn
  • Category encoders
  • SHAP

Prediction Result Summary

CatBoost outperformed the other algorithms across most metrics, achieving an accuracy of 98.67%, a false positive rate (FPR) of 0.89% and a false negative rate (FNR) of 1.81%. After hyperparameter tuning, its performance improved to an accuracy of 98.76% and an FNR of 1.58%.
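
For reference, the three reported metrics follow directly from the confusion matrix. The sketch below computes them in plain Python, assuming that phishing is the positive class (label 1). The source does not state this encoding, and the labels used here are toy values, not the project's predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, false positive rate and false negative rate for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1  # phishing page classified as legitimate
        else:
            if pred == positive:
                fp += 1  # legitimate page classified as phishing
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }

# Toy labels: 4 phishing (1) and 6 legitimate (0) pages.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Under this encoding, the FNR is the share of phishing pages that slip through undetected, and the FPR is the share of legitimate pages that are wrongly blocked.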

The summary of the performance results and the ROC analysis of the evaluated algorithms indicate that CatBoost is the best performer.
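
ROC analysis is often summarized by the area under the ROC curve (AUC). The AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it this way in plain Python on toy scores. It is only an illustration of the metric, not the project's evaluation code, and it assumes label 1 marks the positive class.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy classifier scores: one positive (0.4) is ranked below one negative (0.5).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, scores)
```

The pairwise loop is quadratic in the number of examples, which is fine for a demonstration. A library routine based on sorting would be the practical choice for a dataset of this project's size.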

The model explainability analysis using SHAP ranks the features by their influence on the model output, and thus by their importance. FQDNBlacklistCounts is the most important feature among the best features, whereas obfuscationCharFQDN is the least important.
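
A common way to turn per-sample SHAP values into a global ranking is to average their absolute values per feature. The sketch below shows that aggregation in plain Python. The SHAP values are invented for illustration, exampleMiddleFeature is a hypothetical name, and whether the project ranked its features by exactly this statistic is an assumption.

```python
def rank_by_mean_abs(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, most influential first."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Invented SHAP values: one row per sample, one column per feature.
feature_names = ["FQDNBlacklistCounts", "exampleMiddleFeature", "obfuscationCharFQDN"]
shap_rows = [
    [0.50, -0.20, 0.01],
    [-0.40, 0.10, -0.02],
    [0.60, -0.30, 0.00],
]
ranking = rank_by_mean_abs(shap_rows, feature_names)
```

Taking absolute values matters because a feature that pushes some predictions strongly toward phishing and others strongly toward legitimate would otherwise average out to nearly zero.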
