mspcvsp / starbuckscapstonechallenge

Udacity Data Scientist Nanodegree Project - Analyze Starbucks Capstone Challenge Dataset

License: GNU General Public License v3.0

Jupyter Notebook 96.86% Python 3.14%

StarbucksCapstoneChallenge

Udacity Data Scientist Nanodegree Capstone Project - Analyze Starbucks Capstone Challenge Dataset

Project Overview

Customer satisfaction drives business success, and data analytics provides insight into what customers think. For example, the phrase "360-degree customer view" refers to aggregating data that describes a customer's purchases and customer service interactions.

The Starbucks Udacity Data Scientist Nanodegree Capstone challenge data set is a simulation of customer behavior on the Starbucks rewards mobile application. Periodically, Starbucks sends users offers, each of which may be an advertisement, a discount, or a buy one get one free (BOGO) deal. An important characteristic of this dataset is that not all users receive the same offer.

This data set contains three files. The first file describes the characteristics of each offer, including its duration and the amount a customer needs to spend to complete it (difficulty). The second file contains customer demographic data including their age, gender, income, and when they created an account on the Starbucks rewards mobile application. The third file describes customer purchases and when they received, viewed, and completed an offer. An offer is only successful when a customer both views an offer and meets or exceeds its difficulty within the offer's duration.
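The success rule in the last sentence can be sketched in a few lines of Python. This is only an illustration, not the code in clean_data.py: the `Offer` class, the `offer_successful` function, and the argument names are hypothetical. It assumes that all times share one unit with the offer duration, and that any spending inside the offer window counts toward the difficulty.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Offer:
    difficulty: float  # minimum spend required to complete the offer
    duration: float    # validity period, in the same unit as the event times


def offer_successful(
    offer: Offer,
    received_at: float,
    viewed_at: Optional[float],
    transactions: List[Tuple[float, float]],
) -> bool:
    """Return True when the offer was viewed and the difficulty was met in time.

    transactions is a list of (time, amount) pairs for one customer.
    """
    window_end = received_at + offer.duration

    # The customer must have viewed the offer while it was still valid.
    viewed = viewed_at is not None and received_at <= viewed_at <= window_end

    # Assumption: every purchase inside the window counts toward the difficulty.
    spent = sum(
        amount for time, amount in transactions
        if received_at <= time <= window_end
    )
    return viewed and spent >= offer.difficulty


offer = Offer(difficulty=10, duration=7)
print(offer_successful(offer, 0, 2, [(3, 6.0), (5, 4.0)]))  # viewed, spent 10 in time
print(offer_successful(offer, 0, None, [(3, 12.0)]))        # never viewed
```

A stricter variant could count only purchases made after the offer was viewed. The description above does not say which interpretation the dataset intends.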

Problem Statement / Metrics

The problem that I chose to solve is to build a model that predicts whether a customer will respond to an offer. My strategy for solving this problem has four steps. First, I will combine the offer portfolio, customer profile, and transaction data. Each row of this combined dataset will describe an offer's attributes, customer demographic data, and whether the offer was successful. Second, I will assess the accuracy and F1-score of a naive model that assumes all offers were successful. This gives me a baseline for evaluating the performance of the models that I construct. Accuracy measures the fraction of offers whose outcome a model predicts correctly. However, if the percentage of successful or unsuccessful offers is very low, accuracy is not a good measure of model performance. In that situation, a model's precision and recall provide better insight into its performance. I chose the F1-score metric because it is "a weighted average of the precision and recall metrics" (specifically, their harmonic mean). Third, I will compare the performance of logistic regression, random forest, and gradient boosting models. Fourth, I will refine the parameters of the model that has the highest accuracy and F1-score.
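As a rough illustration of the naive baseline, the sketch below scores a predictor that marks every offer as successful. The function name `naive_baseline_metrics` is hypothetical. The 47.1% share of successful offers is an assumption inferred from the naive predictor accuracy reported in the Results Summary, not a figure taken from the notebook. The function also assumes that at least one offer is successful, so that precision and recall are defined.

```python
def naive_baseline_metrics(labels):
    """Score a predictor that labels every offer as successful (1)."""
    total = len(labels)
    true_positives = sum(1 for label in labels if label == 1)
    false_positives = total - true_positives

    accuracy = true_positives / total
    precision = true_positives / (true_positives + false_positives)
    recall = 1.0  # predicting 1 everywhere never misses a successful offer
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, f1


# Assumed class balance: 471 successful offers out of 1000.
labels = [1] * 471 + [0] * 529
accuracy, f1 = naive_baseline_metrics(labels)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")  # accuracy=0.471, f1=0.640
```

Because recall is 1 for this predictor, its F1-score reduces to 2p / (1 + p), where p is the fraction of successful offers. That is why the baseline F1-score sits well above the baseline accuracy.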

Results Summary

  • Model ranking based on training data accuracy
    1. RandomForestClassifier model accuracy: 0.742
    2. GradientBoostingClassifier model accuracy: 0.736
    3. LogisticRegression model accuracy: 0.722
    4. Naive predictor accuracy: 0.471
  • Model ranking based on training data F1-score
    1. RandomForestClassifier model f1-score: 0.735
    2. GradientBoostingClassifier model f1-score: 0.725
    3. LogisticRegression model f1-score: 0.716
    4. Naive predictor f1-score: 0.640
  • Results suggest that the random forest model has the best training data accuracy and F1-score

Bias and variance are two characteristics of a machine learning model. Bias refers to inherent model assumptions regarding the decision boundary between different classes. On the other hand, variance refers to a model's sensitivity to changes in its training data. A logistic regression model constructs a linear decision boundary to separate successful and unsuccessful offers. However, my exploratory analysis of customer demographics for each offer suggests that this decision boundary will be non-linear. Therefore, an ensemble method like random forest or gradient boosting should perform better.

Both random forest and gradient boosting models are a combination of multiple decision trees. A random forest classifier randomly samples the training data with replacement to construct a set of decision trees that are combined using majority voting. In contrast, gradient boosting iteratively constructs a set of decision trees with the goal of reducing the number of misclassified training data samples from the previous iteration. A consequence of these model construction strategies is that the decision trees generated during random forest model training are typically deeper than gradient boosting weak learners: averaging many deep trees reduces model variance, whereas boosting shallow trees reduces model bias. Gradient boosting often performs better than a random forest classifier. However, gradient boosting may overfit the training data and requires additional effort to tune. A random forest classifier is less prone to overfitting because it constructs decision trees from random training data samples. Also, a random forest classifier's hyperparameters are easier to optimize (1).
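The three-way comparison described above could be set up along the following lines with scikit-learn. This is a minimal sketch, not the notebook's code. It uses synthetic data from `make_classification` as a stand-in for the combined offer dataset, and the hyperparameters shown are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined offer / customer / outcome table.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=50, random_state=0
    ),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record training data accuracy and F1-score.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_train)
    scores[name] = (
        accuracy_score(y_train, predictions),
        f1_score(y_train, predictions),
    )

for name, (accuracy, f1) in scores.items():
    print(f"{name}: accuracy={accuracy:.3f}, f1={f1:.3f}")
```

The scores this prints describe the synthetic data only and say nothing about the Starbucks results. Training data scores also tend to flatter flexible models, so a held-out set such as `X_test` is the safer basis for a final comparison.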

The problem that I chose to solve was to build a model that predicts whether a customer will respond to an offer. My strategy for solving this problem had four steps. First, I combined offer portfolio, customer profile, and transaction data. Second, I assessed the accuracy and F1-score of a naive model that assumes all offers were successful. Third, I compared the performance of logistic regression, random forest, and gradient boosting models. This analysis suggests that a random forest model has the best training data accuracy and F1-score. Fourth, I refined random forest model hyperparameters using a grid search. My analysis suggests that the resulting random forest model has a training data accuracy of 0.753 and an F1-score of 0.746. The test data set accuracy of 0.736 and F1-score of 0.727 suggest that the random forest model I constructed did not overfit the training data.
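A grid search over random forest hyperparameters might look like the sketch below. It assumes scikit-learn's `GridSearchCV`. The parameter grid, the scoring choice, and the synthetic data are illustrative assumptions, since the grid actually searched is not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the combined offer dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; the values are placeholders, not the author's settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# Compare training and test scores to check for overfitting.
best_model = search.best_estimator_
for split, features, labels in [
    ("train", X_train, y_train),
    ("test", X_test, y_test),
]:
    predictions = best_model.predict(features)
    print(
        f"{split}: accuracy={accuracy_score(labels, predictions):.3f}, "
        f"f1={f1_score(labels, predictions):.3f}"
    )
```

A small gap between the training and test scores is the kind of evidence used above to argue that the tuned model did not overfit.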

"Feature importance" refers to a numerical value that describes a feature's contribution to building a model that maximizes its evaluation metric. A random forest classifier is an example of a model that estimates feature importance during training. My analysis of the Starbucks Capstone Challenge customer offer effectiveness training data suggests that the top five features based on their importance are:

1. Offer difficulty (how much money a customer must spend to complete an offer)  
2. Offer duration   
3. Offer reward  
4. Customer income  
5. Whether a customer created an account on the Starbucks rewards mobile application in 2018  
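A ranking like this one can be read from a fitted scikit-learn random forest through its `feature_importances_` attribute. The sketch below uses synthetic data, and the column names are hypothetical labels chosen to mirror the list above, so the order it prints carries no meaning for the Starbucks data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column names attached to synthetic features.
feature_names = ["difficulty", "duration", "reward", "income", "joined_2018"]

X, y = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```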

Since the top three features are associated with a customer offer, it may be possible to improve the performance of a random forest model by creating features that describe an offer's success rate as a function of offer difficulty, duration, and reward. These additional features should provide a random forest classifier the opportunity to construct a better decision boundary that separates successful and unsuccessful customer offers.
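One way to build such a feature is sketched below with pandas, using offer difficulty as the example. The column names and the toy rows are hypothetical. The success rates are computed from training rows only, so that test set outcomes do not leak into the feature.

```python
import pandas as pd

# Toy training rows: one offer per row, with its difficulty and outcome.
train = pd.DataFrame(
    {
        "difficulty": [5, 5, 10, 10, 10, 20],
        "success": [1, 0, 1, 1, 0, 0],
    }
)

# Success rate observed in the training data for each difficulty level.
rate_by_difficulty = train.groupby("difficulty")["success"].mean()

# Attach the rate as a new feature column.
train["difficulty_success_rate"] = train["difficulty"].map(rate_by_difficulty)
print(train)

# Test rows reuse the rates learned from the training data.
test = pd.DataFrame({"difficulty": [10, 5]})
test["difficulty_success_rate"] = test["difficulty"].map(rate_by_difficulty)
```

The same pattern applies to duration and reward. A difficulty level that never appears in the training data maps to NaN and would need a fallback value, for example the overall success rate.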

(1) How can the performance of a Gradient Boosting Machine be worse than Random Forests.

Files

  • Starbucks_Capstone_notebook.ipynb
    • Jupyter notebook that performs three tasks:
      • Combines offer portfolio, customer demographic, and customer transaction data
      • Generates training customer demographic data visualizations and computes summary statistics
      • Generates logistic regression, random forest, & gradient boosting models
  • clean_data.py
    • Python software that combines offer portfolio, customer demographic, and customer transaction data
  • exploratory_data_analysis.py
    • Generates training customer demographic data visualizations and computes summary statistics
  • LICENSE
    • Repository license file
  • .gitignore
    • Describes files and/or directories that should not be checked into revision control
  • README.md
    • Markdown file that summarizes this repository

Python Libraries Used

-pandas: Python Data Analysis Library
-Numpy
-Matplotlib
-seaborn: Statistical Data Visualization
-re: Regular expression operations
-os: Miscellaneous operating system interfaces
-scikit-learn: Machine Learning in Python
-Joblib: running Python functions as pipeline jobs
