Aim: Based on a customer's profile, predict which type of touchpoint has the highest probability of resulting in a purchase

Project overview:

Created a tool that predicts the best touchpoint for a customer based on their profile
Optimized Random Forest and XGBoost classifiers with GridSearchCV to find the best model
Made the model deployment-ready with pickle

Workflow:

data-cleaning_and_eda.ipynb -> model-building.ipynb

Data Cleaning

1. Check for missing values in every column and drop duplicates

2. Remove rows with no touchpoints value or with nTouchpoints = 0
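The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration on toy data, not the notebook's actual code: only `nTouchpoints` comes from this README, while the `age` and `income` columns and the `clean` helper are assumptions made for the example.

```python
import pandas as pd

# Toy data; every column except nTouchpoints is hypothetical.
raw = pd.DataFrame({
    "age": [34, 34, 51, 28, 45],
    "income": [52000, 52000, 88000, None, 61000],
    "nTouchpoints": [3, 3, 0, 2, 5],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: report missing values per column, then drop exact duplicate rows.
    print(df.isna().sum())
    df = df.drop_duplicates()
    # Step 2: keep only customers with at least one recorded touchpoint.
    # Rows where nTouchpoints is missing are dropped too, since NaN > 0 is False.
    df = df[df["nTouchpoints"] > 0]
    return df.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Note that this sketch only reports missing values in other columns; it does not drop or impute them, since the README does not say how they were handled.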

EDA

1. Explore the relationship of the segment variable with other variables in the dataset

2. Check for multicollinearity and gauge its degree with a correlation heatmap

3. Visualize the distribution of variables with distribution plots and bar plots

4. One hot encode categorical variables

For each categorical variable, I created one binary (0/1) column per category.
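As an illustration of the one-hot encoding step, here is a small sketch using `pandas.get_dummies`. The column name `marital_status` and its values are assumptions for the example (the README only mentions a marital status with an 'unknown' category); the notebook may have used a different encoder.

```python
import pandas as pd

# Hypothetical slice of a customer profile table.
df = pd.DataFrame({
    "age": [34, 51, 28],
    "marital_status": ["married", "single", "unknown"],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["marital_status"], dtype=int)
print(encoded)
```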

Model Building

Metrics for evaluating models:

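Multiclass log loss: since we are predicting a probability for each touchpoint, log loss measures how far the predicted probability distribution is from the true label, averaged over all customers.
F1-score (micro): chosen because the class labels are imbalanced. For a single-label multiclass problem, micro-averaged F1 is equal to accuracy.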
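To make the two metrics concrete, the sketch below computes them with scikit-learn on a made-up set of four predictions over three classes. The probabilities and labels are invented for illustration and have nothing to do with the project's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Four customers, three touchpoint classes (0, 1, 2); values are illustrative only.
y_true = np.array([0, 1, 2, 2])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],  # the true class (2) gets the lowest probability here
])
y_pred = proba.argmax(axis=1)

# Log loss: mean negative log of the probability assigned to the true class.
ll = log_loss(y_true, proba, labels=[0, 1, 2])

# Micro-averaged F1 pools true/false positives over all classes.
f1_micro = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

print(ll, f1_micro, acc)
```

Log loss penalizes the confident miss in the last row heavily, which a metric based only on the predicted class would not capture.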

1a. Standardize/normalize numerical data

1b. Stratified train test split

I wrote a custom script that splits the dataset into train, validation and test sets using stratified sampling: 80% train, 10% validation and 10% test.
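The custom script itself is not shown in this README. One common way to get an 80/10/10 stratified split is to call scikit-learn's `train_test_split` twice, as sketched below; the helper name `stratified_three_way_split` and the toy data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, seed=42):
    # First carve off 20% as a holdout, preserving class proportions.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Then split the holdout in half: 10% validation and 10% test overall.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Imbalanced toy labels: 70% / 20% / 10%.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 20 + [2] * 10)

X_train, X_val, X_test, y_train, y_val, y_test = stratified_three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))
print(np.bincount(y_train), np.bincount(y_val), np.bincount(y_test))
```

Stratifying both calls keeps the minority class present in every split, which matters when computing per-class metrics on the validation and test sets.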

2. Try baseline ensemble model: Random Forest

Random Forest

I picked the RF classifier simply because it runs fast, which lets me iterate efficiently toward the best model with GridSearchCV. After initializing and tuning my RandomForestClassifier with GridSearchCV, I got a train accuracy of 1.0 and a test accuracy of 0.77688, a gap that indicates overfitting.

Our RF classifier seems to give the most importance to average spending, income and age.
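The Random Forest tuning and the train/test comparison described above might look roughly like the following. The data is synthetic, and the parameter grid and the `neg_log_loss` scoring choice are assumptions, since the README does not list what the notebook searched over.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the customer data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid; limiting depth and leaf size is one way to curb overfitting.
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, scoring="neg_log_loss", cv=3)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
train_acc = best_rf.score(X_train, y_train)
test_acc = best_rf.score(X_test, y_test)
# A large gap between the two accuracies points to overfitting.
print(search.best_params_, train_acc, test_acc)

# Impurity-based importances, one value per feature, summing to 1.
importances = best_rf.feature_importances_
print(importances)
```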

3. Explore ensemble model: XGBoost

XGBoost

Initial XGB model

XGB model after tuning max_depth, min_child_weight and reg_alpha with GridSearchCV

Our XGBoost model gives high importance to the 'unknown' marital status feature. One possible explanation is that only 44 customers have an 'unknown' marital status, so the importance assigned to this rare feature may not be reliable and should be interpreted with caution.

XGBoost Accuracy: 0.9678972712680578

XGBoost F1-Score (Micro): 0.9678972712680578

I will pick the final XGBoost model since it gives a significantly higher F1-score and accuracy. Overfitting can also be controlled by further tuning the reg_alpha value in the model.

Model Deployment

I included a pickle file so that the model can be deployed behind a Flask API in the future. For productionization, a Flask API endpoint can be hosted on a server; it would take in a list of values from a customer's profile and return the recommended touchpoint.
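The pickle round trip and the prediction logic an endpoint would wrap can be sketched as follows. This is only an outline under stated assumptions: the stand-in model, the four-feature profile, the touchpoint labels and the `recommend_touchpoint` helper are all hypothetical, and the Flask route itself is left out.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model; feature layout and touchpoint labels are hypothetical.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.array(["email", "phone", "sms"] * 20)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Serialize and restore; pickle.dump / pickle.load do the same with a file object.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

def recommend_touchpoint(profile):
    """Return the touchpoint class with the highest predicted probability."""
    proba = restored.predict_proba([profile])[0]
    return restored.classes_[int(np.argmax(proba))]

print(recommend_touchpoint([0.1, -0.3, 1.2, 0.0]))
```

A request handler would parse the profile values from the request, pass them to a function like this one and return the result. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.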

Repository: engom/data-analysis-model-xgboost
