Credit Card Churn

This is my chosen project for the Capstone Project in the Azure Machine Learning Engineer Nanodegree by Udacity. In this project, I attempt to build a model that helps a bank manager predict which customers are likely to churn. The manager's intention is to engage these customers proactively in order to prevent churn.

Two approaches are used: one with Automated ML and one with the HyperDrive service. The better-performing model is deployed, and sample requests are sent to the deployed endpoint.

Table of Contents

  • Project Set Up and Installation
  • Dataset
  • Automated ML
  • Hyperparameter Tuning
  • Model Deployment
  • Screen Recording
  • Future Work

Project Set Up and Installation

To run this experiment, try the following:

  1. Download this repository into your Azure ML workspace. The raw data file is also provided in this repository.
  2. Set up a compute instance to run the Jupyter notebooks. A DS2_V2 instance is sufficient.
  3. Run each cell in the notebooks automl.ipynb and hyperparameter_tuning.ipynb. I ran the notebooks one at a time so I could note the differences between them.
  4. The notebooks contain the configuration to set up the compute clusters used for training.
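The cluster setup in step 4 can be sketched as follows, assuming Azure ML SDK v1. The cluster name, VM size, and node count here are illustrative; the notebooks contain the exact configuration.

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Assumes a config.json downloaded from the Azure ML workspace
ws = Workspace.from_config()

# Provision (or reuse) a small CPU cluster for training
provisioning_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS2_V2",  # same family as the notebook compute
    max_nodes=4,                # illustrative upper bound
)
cluster = ComputeTarget.create(ws, "cpu-cluster", provisioning_config)
cluster.wait_for_completion(show_output=True)
```

This is a configuration sketch that requires an Azure subscription and workspace to run.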

This is the flow of the project as expected by the instructor:

Project Flow

Dataset

Overview

The dataset was obtained from Kaggle here. It contains details of a bank's credit card customers, with 22 columns and 10,127 rows. The dataset provider advises discarding the last two columns, which I have done in this project.

The data contains details such as education status, number of dependents, inactivity period, transaction counts, credit limits, and so on. The label is Attrition_Flag, which indicates whether a customer is an existing or an attrited customer.

Task

The aim of the project is to predict the likelihood of churn, which is denoted by the Attrition_Flag column of the raw dataset. It is worth noting that the dataset is imbalanced: only 1,627 of the 10,127 customers have churned.
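The imbalance works out to a churn rate of roughly 16%, which a quick check makes explicit (the counts below come from the dataset; the label strings match the Attrition_Flag values):

```python
# Label counts from the Kaggle dataset: 1627 attrited out of 10127 customers.
labels = ["Attrited Customer"] * 1627 + ["Existing Customer"] * (10127 - 1627)
churn_rate = labels.count("Attrited Customer") / len(labels)
print(round(churn_rate, 4))  # 0.1607, i.e. roughly a 16% churn rate
```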

Access

I use the TabularDatasetFactory class to download the data from my GitHub repo:

from azureml.data.dataset_factory import TabularDatasetFactory

# cc_data holds the URL of the raw data file in the repo
cc_cust = TabularDatasetFactory.from_delimited_files(path=cc_data, separator=',')

Automated ML

I ran the experiment a few times with timeout settings of 20 and 40 minutes, but these were too short for any child runs to start. I therefore set experiment_timeout_minutes to 60 to allow enough time for training several child models. The primary metric is accuracy, with an exit score of 0.95 to prevent continued runs when the score is no longer improving significantly. I chose to set an exit score because, after a few attempts at running the experiment, I noticed the score does not improve much after the first five child runs.
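Assuming Azure ML SDK v1, an AutoMLConfig consistent with the settings above might look like the following. Here train_ds and compute_target are placeholders for objects created earlier in the notebook, and the exact arguments in the notebook may differ.

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    experiment_timeout_minutes=60,   # 20/40 min proved too short for child runs
    experiment_exit_score=0.95,      # stop early once accuracy reaches 0.95
    training_data=train_ds,          # tabular dataset (placeholder)
    label_column_name="Attrition_Flag",
    compute_target=compute_target,   # cluster provisioned earlier (placeholder)
)
```

This is a configuration fragment; it is submitted to an Experiment in the notebook.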

Results

The accuracy score of the best model was 0.97. I ran the experiment more than once and noticed that the accuracy score differed each time; differences in compute resources may have influenced the outcome.

The image below shows run details:

Run details AutoML

This shows the best model:

Best model details

Best model chart

Here are the parameters of the best model:

LightGBMClassifier
{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 100,
 'n_jobs': 1,
 'num_leaves': 31,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'verbose': -10}

In the future, I would like to see whether letting the experiment run for several hours yields much better results, and whether increasing the number of cross-validation folds or using a larger dataset reduces bias.

Hyperparameter Tuning

The Logistic Regression algorithm from the scikit-learn library was used for the HyperDrive experiment. These parameters were tuned:

  • C is the inverse of the regularization strength: higher values of C mean less regularization. I used the values (1, 2, 3, 4).
  • max_iter is the maximum number of iterations allowed for the solver to converge. I used the values (40, 80, 120, 130, 200).
  • Random sampling over this search space allowed us to optimize resource usage.
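The search space above can be expressed with HyperDrive's random sampling, assuming Azure ML SDK v1. The "--C" and "--max_iter" argument names assume the training script parses them with those flags, and script_config is a placeholder for the script run configuration built in the notebook.

```python
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice)

# Discrete search space matching the values listed above
param_sampling = RandomParameterSampling({
    "--C": choice(1, 2, 3, 4),
    "--max_iter": choice(40, 80, 120, 130, 200),
})

hd_config = HyperDriveConfig(
    run_config=script_config,               # training script config (placeholder)
    hyperparameter_sampling=param_sampling,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                      # illustrative run budget
)
```

This is a configuration fragment; the notebook submits hd_config to an Experiment.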

Results

The HyperDrive experiment yielded an accuracy score of 0.8986 with a regularization parameter C of 4 and max_iter of 200.
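As a rough illustration of the winning combination, the following trains a scikit-learn Logistic Regression with C=4 and max_iter=200. The data here is synthetic (with a class imbalance similar to the churn data); the real training script uses the Kaggle dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features, ~16% minority class
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Best HyperDrive combination reported above: C=4, max_iter=200
clf = LogisticRegression(C=4, max_iter=200).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(round(acc, 4))
```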

In the future, I would like to improve the experiment by testing a different sampling technique.

The image below shows run details with the run in progress:

Run details hyperdrive

This shows best run:

Best run hyperdrive

Model Deployment

The best model was registered and deployed; I chose to deploy the AutoML model. An inference config was set up to handle the deployment environment and scoring script. The model was deployed to a container instance with 1 CPU core and 1 GB of RAM, which is sufficient for development use.
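The deployment step can be sketched as follows, assuming Azure ML SDK v1. The entry script name, environment object, model object, and service name are placeholders for the artifacts produced in the notebook.

```python
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Scoring script and environment for the container (placeholders)
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Container instance sized for development use, as described above
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "churn-endpoint", [model],
                       inference_config, aci_config)
service.wait_for_deployment(show_output=True)
```

This is a configuration sketch that requires a registered model and workspace to run.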

Registered model

A sample JSON payload was created from five rows of the original dataset and sent to the model endpoint for scoring. I then compared the response to the Attrition_Flag values for those rows in the original dataset. The code detailing this is in the AutoML notebook.
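Building such a payload can be sketched as follows. The column names shown are a small illustrative subset, not the full schema, and the scoring URI is a placeholder; AutoML ACI endpoints typically expect a body of the form {"data": [...]}.

```python
import json

# Two illustrative rows; the real payload uses five rows with all
# feature columns from the dataset (names here are examples only)
rows = [
    {"Customer_Age": 45, "Dependent_count": 3, "Total_Trans_Ct": 42},
    {"Customer_Age": 49, "Dependent_count": 5, "Total_Trans_Ct": 33},
]

payload = json.dumps({"data": rows})
print(payload)

# Posting to the endpoint (URI is a placeholder):
# import urllib.request
# req = urllib.request.Request("http://<scoring-uri>/score",
#                              data=payload.encode("utf-8"),
#                              headers={"Content-Type": "application/json"})
# response = urllib.request.urlopen(req).read()
```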

Here are the endpoint details showing it in healthy state:

Endpoint healthy state

Screen Recording

You may view my screencast here, in which I briefly run through the tasks, demonstrating with the files in my GitHub repo.

Future Work

I would like to work with a larger dataset, experiment with other sampling methods, and enable Application Insights if the model were to be used in a production environment.
