Module 3 Final Project

Congratulations! You've made it through another intense module, and now you're ready to show off your newfound Machine Learning skills!

All that remains for Module 3 is to complete the final project!

The Project

The main goal of this project is to create a classification model. For this project you have the choice to either:

choose a data set from a curated list
choose a pre-approved dataset brought to the group by one of the members

After choosing your dataset, you'll then use everything you've learned about Data Science and Machine Learning thus far to source a dataset, preprocess and explore it, and then build and interpret a classification model that answers your chosen question.

Datasets

You are allowed to select one of the data sets described below. Each comes with its own advantages and disadvantages, and, of course, its own associated business problem and stakeholders. It may be desirable to flesh out your understanding of the audience or the business proposition a little more than sketched out here. If you select one of these three data sets, you need no further approval from your instructor.

Chicago Car Crash Data. Note this links also to Vehicle Data and to Driver/Passenger Data.

Build a classifier to predict the primary contributory cause of a car accident, given information about the car, the people in the car, the road conditions etc. You might imagine your audience as a Vehicle Safety Board who's interested in reducing traffic accidents, or as the City of Chicago who's interested in becoming aware of any interesting patterns. Note that this is a multi-class classification problem. You will almost certainly want to bin or trim or otherwise limit the number of target categories on which you ultimately predict. Note e.g. that some primary contributory causes have very few samples.
Customer Churn Data

Build a classifier to predict whether a customer will ("soon") stop doing business with SyriaTel, a telecommunications company. Note that this is a binary classification problem.

Most naturally, your audience here would be the telecom business itself, interested in losing money on customers who don't stick around very long. Are there any predictable patterns here?
Tanzanian Water Well Data

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

Build a classifier to predict the condition of a water well, using information about the sort of pump, when it was installed, etc. Note that this is a ternary classification problem.

DELIVERABLES

A public GitHub repository.
An environment.yml file that contains all the necessary packages needed to recreate your conda environment.
A standalone src/ directory that stores all relevant source code.
- All functions have docstrings that act as professional-quality documentation.
- If applicable, well documented SQL queries with appropriate single-line or multiline comments.
- Quality classification model
  - Whenever necessary, briefly explain in comments the changes made from one iteration to the next, and why you made these choices
A standalone data/ directory that stores all relevant raw and processed data files.
- Be sure to include how the data was obtained!
- All large files are labeled in the .gitignore file to avoid having them accidentally live in your commit history.
A standalone references/ directory that stores all relevant literature, data dictionaries, or useful references that were used to help you during the project.
- Use this directory to store physical copies of the .pdf files; or
- Create a README.md file that cites external resources that were used.
A standalone reports/ directory that stores your presentation.pdf files
A standalone notebooks/ directory that stores both your exploratory and report notebooks
- A record of your workflow should be stored in notebooks/exploratory. Don't be afraid to leave in error messages, so you know what didn't work!
A user-focused README.md file that briefly covers your process, methodology and findings.
- Someone with no context on your project should be able to use this document to understand the structure of your project, and adapt your code for their needs.
- Implied above is a schema which diagramming the route to important files.
One final Jupyter Notebook file stored in notebooks/report that focuses on visualization and presentation
- The very beginning of the notebook contains a description of the purpose of the notebook.
  - This is helpful for your future self and anyone of your colleagues that needs to view your notebook. Without this context, you’re implicitly asking your peers to invest a lot of energy to help solve your problem. Help them to jump into your project by providing them the purpose of this Jupyter Notebook.
- Explanation of the data sources and where one can retrieve them
  - Whenever possible, link to the corresponding data dictionary
- Custom functions and classes are imported from Python modules and are not created directly in the notebook. As soon as you have a working function in one of your exploratory notebooks, copy it over to src so it is reusable.
- At least 4 meaningful data visualizations, with corresponding interpretations. All visualizations are well labeled with axes labels, a title, and a legend (when appropriate)
- Take the time to make sure that you craft your story well, and clearly explain your process and findings in a way that clearly shows both your technical expertise and your ability to communicate your results!
An "Executive Summary" Keynote/PowerPoint/Google Slide presentation (delivered as a PDF export) that explains what you have found.
- Make sure to also add and commit this file as presentation.pdf of your non-technical presentation to your repository with a file name of reports/presentation.pdf.
- Contain between 5-10 professional quality slides detailing:
  - A high-level overview of your methodology
  - The results you’ve uncovered
  - Any real-world recommendations you would like to make based on your findings (ask yourself--why should the executive team care about what you found? How can your findings help the company/stakeholder?)
  - Avoid technical jargon and explain results in a clear, actionable way for non-technical audiences.
- The slides should use visualizations whenever possible, and avoid walls of text
- All visualizations included in this presentation should also be exported as image files (e.g. with plt.savefig, not by taking a screenshot) and saved under reports/figures/

The Process

These steps are informed by Smart Vision's¹ description of the CRISP-DM process.

1. Business Understanding

No matter what dataset you pick:

Start by reading this document, and write down some of the questions one could answer using the data.
Think through a specific context in which someone would be asking a question, or be faced with a decision, for which the dataset could provide guidance. Write down the specific question or decision you decide to investigate.
Figure out, given the specific context and question / decision you decided to investigate, whether all predictions should be treated the same. Are there some types of predictions that would be more costly to get wrong than others? Are there some types that would provide a larger benefit than others?
Generate a list of questions a person asking the question / facing the decision from above would ask about your dataset. (For example: a person faced with a decision to buy the best affordable house she can, and presented with a dataset of housing prices, might ask "what's the effect on prices of having a porch? Of having a swing on that porch?")

Three things to be sure you establish during this phase are:

Objectives: what questions are you trying to answer, and for whom?
Project plan: you may want to establish more formal project management practices, such as daily stand-ups or using a Trello board, to plan the time you have remaining. Regardless you should determine the division of labor, communication expectations, and timeline.
Success criteria: what does a successful project look like? How will you know when you have achieved it?

2. Data Understanding

Write a script to download the data (or instructions for future users on how to manually download it), and explore it. Do you understand what the columns mean? Which column is the target? What kind of data cleaning is required?

It may be useful to generate visualizations of the data during this phase.

3. Data Preparation

Through Pandas (and/or SQL, if you like), perform any necessary data cleaning and develop a filtering (or querying) process that brings all relevant data for analysis in one dataframe. Be sure to document any data that you choose to drop or otherwise exclude. This is also the phase to consider any feature scaling or one-hot encoding required to feed the data into a classification model.

4. EDA

Using the list of questions you generated during the Business Understanding step above as a guide, explore the relationships in your data without using an algorithm. How do the features change in response to each other? How does a feature change in response to the target? Strategize how you can answer as many questions as you can without using modeling.

5. Modeling

The focus this time is on prediction. Good prediction is a matter of the model generalizing well. Steps we can take to assure good generalization include: testing the model on unseen data, cross-validation, and regularization. What sort of model should you build? A diverse portfolio is probably best. Classification models we've looked at so far include logistic regression, knn, decision trees, bagging, and boosting (on Monday), each of these with different flavors. You are encouraged to try any or all of these.

6. Evaluation

Recall that there are many different metrics we might use for evaluating a classification model. Accuracy is intuitive, but can be misleading, especially if you have class imbalances in your target. Perhaps, depending on you're defining things, it is more important to minimize false positives, or false negatives. It might therefore be more appropriate to focus on precision or recall. You might also calculate the AUC-ROC to measure your model's discrimination.

Use your understanding of the business context to drive your decisions.

7. Deployment

In this case, your "deployment" comes in the form of the deliverables listed above. Make sure you can answer the following questions about your process:

"How did you pick the question(s) that you did?"
"Why are these questions important from a business perspective?"
"How did you decide on the data cleaning options you performed?"
"Why did you choose a given method or library?"
"Why did you select those visualizations and what did you learn from each of them?"
"Why did you pick those features as predictors?"
"How would you interpret the results?"
"How confident are you in the predictive quality of the results?"
"What are some of the things that could cause the results to be wrong?"

Citation

"What is the CRISP-DM Methodology?" Smart Vision Europe. Available at: https://www.sv-europe.com/crisp-dm-methodology/

learn-co-students / dsc-mod-3-project-chi-sea-ds-051120 Goto Github PK

dsc-mod-3-project-chi-sea-ds-051120's Introduction

Module 3 Final Project

The Project

Datasets

DELIVERABLES

The Process

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. EDA

5. Modeling

6. Evaluation

7. Deployment

Citation

dsc-mod-3-project-chi-sea-ds-051120's People

Contributors

Watchers

Forkers

Recommend Projects

Recommend Topics

Recommend Org