engom / data-analysis-model-xgboost

This project forked from ernestng11/touchpoint-prediction

Completed Project - Predicting customer touchpoint using XGBoost tuned with GridSearchCV

Jupyter Notebook 100.00%

data-analysis-model-xgboost's Introduction

Aim: Based on a customer's profile, predict which type of touchpoint has the highest probability of resulting in a purchase

Project overview:

  1. Created a tool that predicts the best touchpoint for a customer based on their profile
  2. Optimized Random Forest and XGBoost classifiers using GridSearchCV to get the best model
  3. Made the model deployment-ready with pickle

Workflow:

data-cleaning_and_eda.ipynb -> model-building.ipynb

Data Cleaning

1. Check for missing values in every column and drop duplicates

2. Remove rows with no touchpoints value or with nTouchpoints = 0
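
The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the toy rows and the column names other than nTouchpoints are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only "nTouchpoints" is a name taken from the README.
df = pd.DataFrame({
    "age": [25, 25, 40, 31, np.nan],
    "income": [3000, 3000, 5200, 4100, 2800],
    "nTouchpoints": [2, 2, 0, 3, 1],
    "touchpoints": ["1 2", "1 2", None, "3 1 4", "2"],
})

# 1. Count missing values per column, then drop exact duplicate rows.
missing_per_column = df.isna().sum()
df = df.drop_duplicates()

# 2. Keep only rows that have a touchpoints value and at least one touchpoint.
has_touchpoints = df["touchpoints"].notna() & (df["nTouchpoints"] > 0)
df = df[has_touchpoints].reset_index(drop=True)

print(missing_per_column)
print(df)
```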

EDA

1. Explore the relationship of the segment variable with other variables in the dataset

Figures: income line plot; average spending line plot

2. Check for multicollinearity, and how strong it is, with a heatmap

Figure: collinearity heatmap
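
As a rough sketch of the multicollinearity check, the correlation matrix behind such a heatmap can be computed with pandas and scanned for strongly correlated pairs. The data here is synthetic, and the column names and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(5000, 1500, size=200)

# Synthetic stand-in data; column names are assumptions.
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "income": income,
    # Built from income on purpose so that one strongly correlated pair exists.
    "average_spending": 0.3 * income + rng.normal(0, 100, size=200),
})

corr = df.corr()  # Pearson correlation matrix

# List each feature pair whose absolute correlation exceeds the threshold.
threshold = 0.8
high_pairs = [
    (a, b, float(round(corr.loc[a, b], 3)))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Passing `corr` to a plotting call such as `seaborn.heatmap(corr, annot=True)` would render the matrix as a heatmap.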

3. Visualize the distribution of variables with distribution plots and bar plots

Figures: marital plot; segment plot; social media plot; credit rating plot; nTouchpoints plot; age plot; age distribution plot; income distribution plot; average spending distribution plot

4. One-hot encode categorical variables

For each categorical variable, I created one column per category, so that every category becomes a binary (0/1) variable.
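
A minimal sketch of this encoding with `pandas.get_dummies`, assuming hypothetical column names and category values:

```python
import pandas as pd

# Hypothetical profile columns; names and values are assumptions.
df = pd.DataFrame({
    "marital": ["single", "married", "unknown", "married"],
    "segment": ["A", "B", "A", "C"],
    "age": [25, 40, 31, 52],
})

# Each category of each listed column becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["marital", "segment"], dtype=int)
print(encoded.columns.tolist())
```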

Model Building

Metrics for evaluating models:

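  1. Multiclass log loss: since the model predicts a probability for each possible next touchpoint, log loss measures how far those predicted probabilities are from the true labels, averaged over all samples.
  2. F1-score (micro), since the label classes are imbalanced. For single-label multiclass predictions, micro-averaged F1 equals accuracy.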
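
Both metrics are available in scikit-learn. The sketch below uses made-up labels and probabilities for three classes, only to show how the two numbers are computed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Made-up true labels and predicted class probabilities (each row sums to 1).
y_true = np.array([0, 1, 2, 2, 1, 0])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],  # misclassified: class 1 gets the highest probability
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
])
y_pred = proba.argmax(axis=1)

# Log loss: mean negative log of the probability given to the true class.
ll = log_loss(y_true, proba, labels=[0, 1, 2])

# Micro-averaged F1 pools all decisions before computing precision and recall.
micro_f1 = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

print(ll, micro_f1, acc)
```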

1a. Standardize/normalize numerical data

Figures: age distribution plot; income distribution plot; average spending distribution plot
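
One common way to standardize numerical columns is scikit-learn's `StandardScaler`; the README does not say which scaler the notebook uses, so treat this as an assumed choice with made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows of (age, income, average spending).
X_num = np.array([
    [25, 3000.0, 120.0],
    [40, 5200.0, 310.0],
    [31, 4100.0, 200.0],
    [52, 8000.0, 450.0],
])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

When a separate test set exists, fitting the scaler on the training rows only and reusing it with `scaler.transform` on the other sets avoids leaking test statistics into training.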

1b. Stratified train test split

I wrote a custom script to split my dataset into train, validation and test sets using stratified sampling, so that each set keeps the same class proportions. The train set is 80% of the data; the validation and test sets are 10% each.
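
The custom script itself is not shown in this README. One way to get an 80/10/10 stratified split is to call scikit-learn's `train_test_split` twice, as in this sketch; the helper name `stratified_split` and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(X, y, val_size=0.1, test_size=0.1, seed=42):
    """Hypothetical helper: split into train/validation/test, stratified on y."""
    holdout = val_size + test_size
    # First carve off the held-out 20%, preserving class proportions.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, stratify=y, random_state=seed
    )
    # Then split the held-out part into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_size / holdout,
        stratify=y_hold, random_state=seed,
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.repeat([0, 1, 2], [700, 200, 100])  # imbalanced synthetic labels

train, val, test = stratified_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
print(np.bincount(test[1]))
```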

2. Try baseline ensemble model: Random Forest

Random Forest

I picked the Random Forest classifier simply because it trains quickly, which lets GridSearchCV iterate efficiently toward the best model. After initializing and tuning my RandomForestClassifier with GridSearchCV, I got a train accuracy of 1.0 and a test accuracy of 0.77688; the gap between the two indicates overfitting.
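
A grid search over a Random Forest might look like the sketch below. The parameter grid, the synthetic dataset and the scoring choice are assumptions; the README does not list the values that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the customer profiles.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed grid; limiting depth and leaf size is one way to curb overfitting.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_micro",  # equals accuracy for single-label multiclass
    cv=3,
)
search.fit(X_train, y_train)

# A large gap between these two scores is the overfitting signal.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(search.best_params_, train_acc, test_acc)
```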

Figure: feature importance (FI) plot

Our RF classifier seems to pay the most attention to average spending, income and age.
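
Rankings like this can be read from a fitted forest's `feature_importances_` attribute. In the sketch below the data is synthetic and the feature names are attached arbitrarily, so the printed ranking says nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names attached to synthetic columns, for illustration only.
feature_names = ["average_spending", "income", "age", "marital_unknown"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```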

3. Explore ensemble model: XGBoost

XGBoost

Initial XGB model

Figures: mean logloss plot; mean error plot

XGB model after tuning max_depth, min_child_weight and reg_alpha with GridSearchCV

Figures: mean logloss plot; mean error plot

Figure: feature importance (FI) plot

Our XGBoost model assigns high importance to the 'unknown' marital status. This could be because only 44 customers have an 'unknown' marital status, so the model may assign more weight to the 'unknown' feature to reduce bias on that small group.

XGBoost Accuracy: 0.9678972712680578

XGBoost F1-Score (Micro): 0.9678972712680578

I will pick the final XGBoost model since it gives a significantly higher accuracy and F1-score (about 0.968, against a test accuracy of 0.77688 for the Random Forest). Overfitting can also be controlled by further tuning the reg_alpha value in our model.

Model Deployment

I included a pickle file so that the model can later be deployed behind a Flask API. For productionization, a Flask API endpoint can be hosted on a server; it would take in a list of values from a customer's profile and return the recommended touchpoint.
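
The pickle round trip and the prediction step that such an endpoint would wrap can be sketched as below. The stand-in model is a small Random Forest on synthetic data rather than the project's tuned XGBoost model, and `recommend_touchpoint` is a hypothetical helper name.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (not the project's XGBoost model).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Serialize and restore; pickle.dump / pickle.load do the same with a file.
payload = pickle.dumps(model)
restored = pickle.loads(payload)


def recommend_touchpoint(profile_values):
    """Hypothetical helper: return the most probable class for one profile."""
    row = np.asarray(profile_values, dtype=float).reshape(1, -1)
    proba = restored.predict_proba(row)[0]
    return int(restored.classes_[proba.argmax()])


print(recommend_touchpoint(X[0]))
```

A Flask route would parse the list of profile values from the request body, call a function like this one, and return the result as JSON. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.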
