
Practical Applications in Machine Learning - Homework 1

The goal of the Homework 1 assignment is to build your first end-to-end Machine Learning (ML) pipeline using public datasets and datasets you create yourself. The learning outcomes for this assignment are:

  • Build a framework for an end-to-end ML pipeline in Streamlit. Create your first web application!
  • Develop a web application that walks users through the steps of an ML pipeline, starting with the data visualization and preprocessing steps.

This assignment contains two parts:

  1. End-to-End ML Pipeline: Many ML projects are never used in production or are not easily used by others, including ML engineers interested in exploring prior ML models, testing those models on new datasets, and helping users explore ML models. To address this challenge, the goal of this assignment is to implement the front and back end of an ML project, focusing on the dataset exploration and preprocessing steps. These homework assignments help students showcase ML skills in building end-to-end ML pipelines and deploying ML web applications for users, which we build on in future assignments.

  2. Dataset Curation (In-Class Activity): It is often challenging to collect datasets when none exists in your problem domain. Thus, it is important to understand how to curate new datasets and explore existing methodologies for data collection. Part II of HW1 focuses on how to collect datasets, annotate the data, and evaluate the annotations in preparation for ML tasks.

HW1 serves as an outline for the remaining assignments in the course, building end-to-end ML pipelines and deploying useful web applications using those models. This assignment in particular focuses on the data exploration and preprocessing steps.

  • Due: Friday February 17, 2023 at 11:00PM
  • What to turn in: Submit responses on GitHub AutoGrader
  • Assignment Type: Individual
  • Time Estimate: 9 Hours
  • Submit code via GitHub: https://classroom.github.com/a/fiL30jIe
  • Submit Reflection Assessment via Canvas (multiple choice, 5 questions)

Figure: This shows a demonstration of the web application for End-to-End ML pipelines.

Installation

Install Streamlit

pip install streamlit     # Install streamlit
streamlit hello           # Test installation

Next, let's update the libraries. First, let's update conda itself:

conda update -c defaults -n base conda

And recreate the environment:

conda env create -f environment.yml

Start Jupyter

(Optional) If Jupyter Notebook is not installed already, install it and register the kernel:

python3 -m ipykernel install --user --name=python3

And that's it! You can now start Jupyter like this:

jupyter notebook

This should open up your browser, and you should see Jupyter's tree view, with the contents of the current directory.

Update This Project and its Libraries

I regularly update the notebooks to fix issues and add support for new libraries. So make sure you update this project regularly.

For this, open a terminal, and run:

cd $HOME # or whatever development directory you chose earlier
cd homework1 # go to this project's directory
git pull

If you get an error, it's probably because you modified a notebook. In this case, before running git pull you will first need to commit your changes. I recommend doing this in your own branch, or else you may get conflicts:

git checkout -b my_branch # you can use another branch name if you want
git add -u
git commit -m "describe your changes here"
git checkout master
git pull

Run Github AutoGrader

Run the Github autograder using the following command in the terminal:

pytest
The repository contains the following files:

  • end-to-end-ml-pipeline.ipynb: the example from the textbook on predicting housing prices. We will use this notebook to create an online end-to-end ML pipeline, focusing on the data collection and preprocessing steps.
  • end-to-end-ml-pipeline.py: HW1 assignment template using Streamlit for the web application UI and workflow of activities.
  • pages/*.py files: contains code to explore the data, preprocess it, and prepare it for ML. It includes the checkpoints for the homework assignment.
  • datasets: folder that contains the dataset used for HW1 in 'housing/housing.csv'
  • notebooks: contains example notebooks for HW1
  • test_homework1.py: contains the Github autograder functions
  • images/: contains images for the readme

1. Build End-to-End ML Pipeline

The first part of HW1 focuses on ‘Building an End-to-End ML Pipeline’, which consists of creating modules that perform the following tasks: exploring and visualizing the data to gain insights, and preprocessing and preparing the data for machine learning algorithms.

1.1 California Housing Dataset

Create useful visualizations for machine learning tasks. This assignment focuses on visualizing features from a dataset: given an input .csv file (stored locally or in the cloud), the application is expected to read the input dataset. Use the pandas read_csv function to read in a local file, and use Streamlit layouts to provide multiple options for interacting with and understanding the dataset.

This assignment involves testing the end-to-end pipeline in a web application using the California Housing dataset from the textbook: Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc., 2022 [GitHub]. The dataset was captured from California census data in 1990 and contains the following features:

  • longitude - longitudinal coordinate
  • latitude - latitudinal coordinate
  • housing_median_age - median age of district
  • total_rooms - total number of rooms per district
  • total_bedrooms - total number of bedrooms per district
  • population - total population of district
  • households - total number of households per district
  • median_income - median income
  • ocean_proximity - distance from the ocean
  • median_house_value - median house value

We will explore these features further in the remaining sections, including in Reflection questions.

1.2 Explore and visualize the data to gain insights.

Task 1: Import data from local machine (Checkpoint 1; 1 point)

The goal of this task is to create ML pipeline modules on two pages. The first step is to upload a dataset and create two columns on the ‘Explore Dataset’ and ‘Preprocess Data’ pages:

  • Column 1: upload a dataset from your local machine.
  • Column 2: upload a dataset from a cloud source.
  • Both columns: restore a dataset from prior pages using the Streamlit session_state dictionary, and update this data structure with variables needed on other pages of the website.

The ‘Explore Dataset’ and ‘Preprocess Data’ pages require data to be uploaded or restored. Your task is to complete the load_dataset() function on the Explore Dataset page and the restore_dataset() function on the ‘Preprocess Data’ page. Use the following functions to upload data and restore variables from other pages.

  • st.file_uploader() - uploads a file from a user's local machine
  • st.session_state - dictionary that stores the state of variables across pages
  • pd.read_csv() - reads a csv file and stores it in a pandas dataframe

data = st.file_uploader("Upload a CSV file")  # the label argument is required
df = pd.read_csv(data)
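
For reference, here is a minimal sketch of what a load_dataset() implementation might look like, assuming the upload and session-state pattern above; the session-state key 'house_df' and the widget label are illustrative assumptions, not part of the template:

import pandas as pd
import streamlit as st

def load_dataset():
    # Sketch of load_dataset() (Checkpoint 1); the key 'house_df' is illustrative.
    data = st.file_uploader("Upload a CSV file", type=["csv"])
    if data is not None:
        df = pd.read_csv(data)
        st.session_state["house_df"] = df  # save for other pages to restore
        return df
    # otherwise fall back to a dataset restored from a prior page, if any
    return st.session_state.get("house_df", None)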

Figure: This shows an example of adding a button to upload a dataset (Checkpoint 1).

Next, explore the dataset features using the helper functions to summarize features in the dataset and visualize them on a plot (see figures below). Provide the option to select one plot to display (scatterplot, lineplot, histogram, or boxplot) using the Streamlit selectbox function. Once a plot has been selected, the visualization should update with the appropriate figure. Use the user_input_features function to collect filters for each figure and update the plot accordingly; a sketch of this flow follows the figures below.

Figure: Summary of features in the Housing dataset.

Figure: Example visualization of latitude and longitude features.
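
A minimal sketch of the plot-selection flow, assuming df is the loaded dataframe; the widget labels and chart names are illustrative:

import matplotlib.pyplot as plt
import streamlit as st

chart_select = st.selectbox("Select a chart type",
                            ["Scatterplot", "Lineplot", "Histogram", "Boxplot"])
x_col = st.selectbox("x-axis feature", df.columns)
y_col = st.selectbox("y-axis feature", df.columns)

fig, ax = plt.subplots()
if chart_select == "Scatterplot":
    ax.scatter(df[x_col], df[y_col])
elif chart_select == "Lineplot":
    ax.plot(df[x_col], df[y_col])
elif chart_select == "Histogram":
    ax.hist(df[x_col].dropna(), bins=30)
else:  # Boxplot
    ax.boxplot(df[x_col].dropna())
ax.set_xlabel(x_col)
ax.set_ylabel(y_col)
st.pyplot(fig)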

Task 2: Show correlations between a primary feature and one or more secondary features. Correlation is a quantitative measure of how much two variables/features are correlated ranging from -1 (negatively correlated) to 1 (positively correlated). Your goal is to provide options to explore correlations between multiple features using the pandas correlation function. (Checkpoint 2; 1 point)

correlation = df.corr()[feature]  # feature is the name of a feature (string type)
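
For example, a sketch of reporting correlations between a primary feature and user-selected secondary features; the feature names and widget label are assumptions, and numeric_only requires pandas 1.5 or later:

import streamlit as st

primary = "median_house_value"  # illustrative primary feature
secondary = st.multiselect("Secondary features",
                           ["median_income", "total_rooms", "housing_median_age"])

corr_matrix = df.corr(numeric_only=True)  # pairwise Pearson correlations
for feature in secondary:
    st.write(f"Correlation between {primary} and {feature}: "
             f"{corr_matrix[primary][feature]:.3f}")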

Figure: This shows the correlation of multiple pairs of features.

1.3 Preprocess and prepare the data for machine learning algorithms.

This step in the ML pipeline cleans the dataset by removing bad/unhelpful data points, imputing missing values, performing correlation analysis, and formatting the data for ML tasks.

Task 3: Remove irrelevant/useless features. Collect a user's preferences on one or more features to remove using the Streamlit multiselect function, then remove the selected features using the pandas drop function. (Checkpoint 3; 1 point)
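
A minimal sketch of this flow, assuming df is the working dataframe; the widget label and session-state key are illustrative:

import streamlit as st

features_to_remove = st.multiselect("Select features to remove", df.columns)
if features_to_remove:
    df = df.drop(columns=features_to_remove)  # drop the selected columns
    st.session_state["house_df"] = df         # illustrative key; keeps pages in sync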

Figure: This shows an example of removing multiple features.

Task 4 - Impute data. Collect a user’s preference on a feature to impute using the Streamlit selectbox function. Perform data imputation by skipping missing values, or by replacing them with zero, the mean, or the median of the feature across the dataset. Identify the ‘bad’ values and provide the aforementioned options for data imputation. Then, report summary statistics to help users understand how much imputation is required, including the

  • number of features with missing values,
  • average number of data points missing per category, and the
  • total number of missing values in the dataset.

Use the following functions: dropna(), mean(), and median(). (Checkpoint 4; 1 point)
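
A minimal imputation sketch under these requirements; the widget labels and option names are illustrative:

import streamlit as st

feature = st.selectbox("Select a feature to impute", df.columns)
method = st.selectbox("Imputation method",
                      ["Skip missing values", "Zero", "Mean", "Median"])

if method == "Skip missing values":
    df = df.dropna(subset=[feature])  # drop rows with missing values
elif method == "Zero":
    df[feature] = df[feature].fillna(0)
elif method == "Mean":
    df[feature] = df[feature].fillna(df[feature].mean())
else:  # Median
    df[feature] = df[feature].fillna(df[feature].median())

missing = df.isna().sum()  # missing values per feature
st.write("Features with missing values:", int((missing > 0).sum()))
st.write("Average missing values per feature:", float(missing.mean()))
st.write("Total missing values in the dataset:", int(missing.sum()))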

Figure: This shows an example of imputing multiple features.

Task 5: Summarize descriptive statistics. Collect a user's preferences on one or more features to summarize using the Streamlit multiselect function. Then, use the same function to collect the statistics to report. Summarize descriptive statistics for a selection of features and multiple statistics, including minimum, maximum, mean, and median, using the pandas library. Your task is to populate the out_dict dictionary with an output string showing the statistics for each variable. (Checkpoint 5; 1 point)
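
A sketch of populating out_dict; the widget labels and output format are assumptions:

import streamlit as st

features = st.multiselect("Select features to summarize", df.columns)
stats = st.multiselect("Select statistics",
                       ["Minimum", "Maximum", "Mean", "Median"])

stat_funcs = {"Minimum": "min", "Maximum": "max", "Mean": "mean", "Median": "median"}
out_dict = {}
for feature in features:
    parts = [f"{name}: {getattr(df[feature], stat_funcs[name])():.2f}" for name in stats]
    out_dict[feature] = ", ".join(parts)
    st.write(f"{feature}: {out_dict[feature]}")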

Figure: Summary of descriptive statistics.

Task 6 - Set the train and test split of the dataset. Collect the percentage of data for the test set using the Streamlit number_input function, compute the corresponding percentage for the training set, and split the dataset using the train_test_split function. (Checkpoint 6)

from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.3)
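
A sketch connecting number_input to the split; the widget label and variable names are assumptions:

import streamlit as st
from sklearn.model_selection import train_test_split

test_pct = st.number_input("Test set percentage", min_value=1, max_value=99, value=30)
train_df, test_df = train_test_split(df, test_size=test_pct / 100.0)
st.write(f"Training set: {100 - test_pct}% ({len(train_df)} rows)")
st.write(f"Test set: {test_pct}% ({len(test_df)} rows)")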

Figure: This shows an example of splitting the dataset into train and test datasets.

1.4 Helper Functions/Code

  • The fetch_housing_data() function fetches data from a cloud/online source and stores it on a local computer in a ‘./datasets/housing/’ directory (a sketch follows this list).

  • The display_features() function looks up the features in the dataframe in a feature_lookup table to summarize each feature.

  • The user_input_features() function enables users to visualize features. The assignment code includes options for the x- and y-axis to display features (see Figure above). Create a sidebar that links to a figure on the main ‘Explore Dataset‘ page to help users visualize features. The sidebar should contain a menu to filter parameter settings and update the figure as appropriate. Students are encouraged to review the documentation on [Streamlit layouts](https://docs.streamlit.io/library/api-reference/layout).
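
For reference, a sketch of fetch_housing_data() modeled on the textbook's helper; the download URL points at the book's GitHub repository and may differ from the assignment template:

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)           # create ./datasets/housing/
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)      # unpack housing.csv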

1.5 Testing Code with Github AutoGrader

pytest

Test the end-to-end pipeline application using:

streamlit run end-to-end-ml-pipeline.py

2. Reflection Assessment

Submit on Canvas.

Further Issues and questions ❓

If you have issues or questions, don't hesitate to contact the teaching team.
