azure-databricks-malware-prediction

End-to-end Machine Learning Pipeline demo using Delta Lake, MLflow and AzureML in Azure Databricks

The Problem

The problem I set out to solve is this public Kaggle competition hosted by Microsoft earlier this year. Essentially, Microsoft has provided datasets containing Windows telemetry data for a variety of machines; in other words, a dump of various Windows features (OS build, patch version, etc.) for machines like our laptops. The idea is to use the test.csv and train.csv datasets to develop a Machine Learning model that predicts a Windows machine's probability of getting infected with various families of malware.

Architecture

TBD.

01-kaggle-dataset-download: Connecting to Kaggle via API and copying competition files to Azure Blob Storage

The Kaggle API allows us to connect to various competitions and datasets hosted on the platform: API documentation.

Pre-requisite: You should have downloaded the kaggle.json containing the API username and key and localized the notebook below.

In this notebook, we will -

  1. Mount a container called bronze in Azure Blob Storage
  2. Import the competition data set in .zip format from Kaggle to the mounted container
  3. Unzip the downloaded data set and remove the zip file
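Step 3 above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's actual code: the function name `unzip_and_remove` and the example paths are hypothetical, and it assumes the mounted container is reachable as a local directory on the driver (on Databricks, mounts are typically visible under `/dbfs/mnt/...`).

```python
import zipfile
from pathlib import Path


def unzip_and_remove(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir, then delete the archive.

    Returns the list of extracted member names.
    """
    zip_path = Path(zip_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()

    # Remove the archive only after extraction has succeeded.
    zip_path.unlink()
    return names
```

A call might look like `unzip_and_remove("/dbfs/mnt/bronze/competition.zip", "/dbfs/mnt/bronze")`, where both paths are placeholders for wherever the download landed.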

02-extract-transform-load: EXTRACT, TRANSFORM, LOAD from BRONZE to SILVER Zone

Here is the Data Lake Architecture we are emulating:

TBD.

Pre-requisite: You should have run 01-kaggle-dataset-download to download the Kaggle dataset to BRONZE Zone.

In this notebook, we will -

  1. EXTRACT the downloaded Kaggle train.csv dataset from BRONZE Zone into a dataframe
  2. Perform various TRANSFORMATIONS on the dataframe to enhance/clean the data
  3. LOAD the data into SILVER Zone in Delta Lake format
  4. Repeat the above three steps for test.csv
  5. Take the Curated test.csv data and enhance it further for ML scoring later on.
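The extract, transform, load flow above can be sketched as a single function that takes an existing `SparkSession` (the `spark` object Databricks notebooks provide). The function name, the paths, and the choice of `dropDuplicates` as the only transformation are assumptions made for illustration; the actual notebook's transformations are not described here.

```python
def bronze_to_silver(spark, csv_path, delta_path):
    """Read a CSV from BRONZE, apply light cleaning, write Delta to SILVER.

    `spark` is an existing SparkSession; the cleaned DataFrame is returned.
    """
    # EXTRACT: load the raw CSV, letting Spark infer column types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # TRANSFORM: illustrative only; a real pipeline would do more here.
    cleaned = df.dropDuplicates()

    # LOAD: persist in Delta Lake format, replacing any earlier output.
    cleaned.write.format("delta").mode("overwrite").save(delta_path)
    return cleaned
```

In a notebook this might be invoked as `bronze_to_silver(spark, "/mnt/bronze/train.csv", "/mnt/silver/malware_train_delta")` and then repeated for test.csv; the mount paths are hypothetical.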

03-data-visualization: Data Visualization

I'm leveraging a lot of the visualization/data exploration done by the brilliant folks over at Kaggle who have already spent a lot of time exploring this dataset.

Pre-requisite: You should have run 02-extract-transform-load and have the curated data ready to go in SILVER Zone.

In this notebook, we will -

  1. Import the malware_train_delta training dataset from SILVER Zone into a dataframe
  2. Explore live visualization capabilities built into Databricks GUI
  3. Explore the top 10 features most correlated with the HasDetections column - the column of interest
  4. Generate a correlation heatmap for the top 10 features
  5. Explore top 3 features via various plots to visualize the data
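The feature ranking in step 3 amounts to computing each numeric column's Pearson correlation with HasDetections and keeping the strongest ones. Below is a plain-Python sketch of that idea; the notebook itself most likely relies on a dataframe library instead, and the function names and the column-dictionary input format are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_correlated(columns, target, k=10):
    """Return the k (name, correlation) pairs with the largest |correlation| to target.

    `columns` maps column names to lists of numeric values.
    """
    scores = [
        (name, pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    ]
    # Rank by absolute value: strong negative correlations matter too.
    scores.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scores[:k]
```

Ranking by absolute value is a design choice: a feature that is strongly negatively correlated with HasDetections is as informative as a strongly positive one.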

04-train-model-pipelines: Use MLflow to create Machine Learning Pipelines and Track Experiments

Tracking Experiments with MLflow

MLflow Tracking is one of the three main components of MLflow. It is a logging API specific to machine learning and agnostic to the libraries and environments that do the training. It is organized around the concept of runs, which are executions of data science code. Runs are aggregated into experiments: many runs can be part of a given experiment, and an MLflow server can host many experiments.

MLflow tracking also serves as a model registry so tracked models can easily be stored and, as necessary, deployed into production.

In this notebook, we will -

  1. Load our train dataset from SILVER Zone
  2. Use MLflow to create a Logistic Regression ML Pipeline
  3. Explore the run details using MLflow integration in Databricks
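To make the run and experiment concepts concrete, here is a minimal sketch of logging one run with the MLflow tracking API. The helper names `accuracy` and `log_run`, the experiment name, and the logged parameter and metric keys are all hypothetical; the notebook's own pipeline code is not reproduced here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_run(experiment_name, params, metrics):
    """Record one MLflow run under experiment_name and return its run ID."""
    # Imported lazily so this module loads even where MLflow is not installed.
    import mlflow

    # Selects the experiment, creating it if it does not exist yet.
    mlflow.set_experiment(experiment_name)

    # Each `with` block is one run; everything logged inside belongs to it.
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        return run.info.run_id
```

A call such as `log_run("/Users/someone@example.com/malware", {"regParam": 0.1}, {"accuracy": accuracy(labels, preds)})` would add one run to that experiment, which can then be inspected in the Databricks MLflow UI. The experiment path and values are placeholders.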

05-model-serving: Model Serving - Batch Scoring and REST API

Operationalizing machine learning models in a consistent and reliable manner is one of the most relevant technical challenges in the industry today.

Docker, a tool designed to make it easier to package, deploy and run applications via containers, is almost always involved in the operationalization/model serving process. Containers essentially abstract away the underlying operating system and machine-specific dependencies, allowing a developer to package an application with all of its dependency libraries and ship it out as one self-contained package.

By using Dockerized containers and a container hosting tool, such as Kubernetes or Azure Container Instances, our focus shifts to connecting the operationalized ML pipeline (such as the MLflow pipeline we explored earlier) with a robust model serving tool to manage and (re)deploy our models as they mature.

Azure Machine Learning Services

We'll be using Azure Machine Learning Service to track experiments and deploy our model as a REST API via Azure Container Instances.


Pre-requisite: You should have run 02-extract-transform-load to have the validation data ready to go in SILVER Zone, and 04-train-model-pipelines to have the model file stored on the Databricks cluster.

In this notebook, we will -

  1. Batch score test_validation SILVER Zone data via our MLflow trained Logistic Regression model
  2. Use MLflow and Azure ML services to build and deploy an Azure Container Image for live REST API scoring via HTTP
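Once the container is deployed, live scoring is an HTTP POST of feature rows to the service's scoring URI. The sketch below uses only the standard library. The function names are hypothetical, and the payload shape (a split-oriented `columns`/`data` object) is an assumption: the exact JSON format the endpoint expects depends on the MLflow version used to build the image.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, columns, rows):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    # Assumed payload shape; adjust to whatever the deployed image expects.
    payload = {"columns": columns, "data": rows}
    return urllib.request.Request(
        scoring_uri,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(scoring_uri, columns, rows, timeout=30):
    """Send the rows to the endpoint and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, columns, rows)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the payload easy to inspect before any network call is made. The scoring URI itself would come from the deployed Azure Container Instances web service.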

azure-databricks-malware-prediction's People

Contributors

mdrakiburrahman
