Coder Social home page Coder Social logo

udacity_spark's Introduction

Sparkify

This purpose to build this prediction model is to predict the loss of current clients.

Initialization

The pacakges required for running these script are pyspark, seaborn, re

File Description

Sparkify.ipynb: the jupyter file that relates to the data exploration, data modeling process. The final output of this scrpit is a trained model.

Modeling

A model function is created, which help to split dataset, build model pipeline, as well as tuning the model. The model used here is Random Forest Classifier.

Data Preprocessing

The key variables (gender and level as well as the label Churn User) have been converted from string to vectors for the modeling process.

Implementation

A pyspark machine learning pipeline which includes all necessary steps, such as one hot encoding, model training and model tuning are generated.

Refinement

The initial modeling result by using Random Forest Classifier reaches a F1 socre around 73%. By adding parameters grid to ajust hyperparameters such as numTrees, maxDepth, the F1 score now reaches 84.55%, much higher than the initial model result.

Results

Model Evaluation and Validation

In the very begining of this study, I am meant to use Logistic Regression Model instead of this Random Forest. However, per quick test run, the model metrics score is not as high as expected. Therefore, I switch to use Random Forest, which successfully increases the result as expected. In terms of the model hyperparameters, as per my past experience with scikit-learn package, the num of trees and tree depath are two areas worthwhile to try.

Justification

By using level, gender to train the model, the successfully rate to predict the loss of client is around 85%. We have tried following methods to achieve this result.

  • testing on different models;
  • tuning the model hyperparameters;
  • using cross validation

Conclusion

Reflection

As per above modeling process, we now have generated a well trained model that can be used to predict if the user will lose based on gender and level details. Throughout this study, the build of model pipeline is silghtly difficult especially when we need to deal with multiple categorical variables.

Improvement

For furture cases, we could try to include the number of variables considered, trying to achieve a better F1 Score. Meanwhile, the current model is run on a local machine to predict static data. A better way to do this is to deploy this model to a server so that once the user is labelled with high potential to lose, we could quick find out and take actions.

udacity_spark's People

Contributors

brunoyang0918 avatar

Watchers

 avatar

Forkers

aakashbehl1994

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.