This project develops the following skills:
- Loading large datasets into Spark and manipulating them using Spark SQL and Spark DataFrames
- Using the machine learning APIs within Spark ML to build and tune models
- Integrating the skills I've learned in the Spark course and the Data Scientist Nanodegree program
The goal is to predict churned users from their activity and attribute data, and to deploy the solution on a distributed system. The original dataset is 12 GB; due to the limited compute power of the free IBM Cloud tier, a medium-sized subset is used instead.
- PySpark SQL, PySpark ML, and the libraries they build on.
- Matplotlib for visualization.
- IBM Cloud (free tier) or other cloud services.
Analysis procedure:
- Data cleaning
- Data exploration
- Feature engineering
- Modelling
- Deployment on IBM Cloud
Results:
Two models, logistic regression and random forest, were tested with different hyperparameters. Random forest yielded the better performance (AUC of 0.6) and was selected as the final model.
Summary and some reflections on this project: Medium post
The dataset was kindly provided by the Udacity team, which also prepared some of the instructions in the notebook.