ubc-mds / 525-group23

This repository is used for the DSCI 525 - Web and Cloud Computing course project

License: MIT License

Jupyter Notebook 37.62% HTML 62.38%
rainfall-prediction ensamble-methods download-da preprocess-dataset

525-group23's Issues

1. Team-work contract

Similar to what you did in DSCI 522 and DSCI 524, create a teamwork contract. The contract should outline how you are committed to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it, and share a link to it, or a copy, with us in your Canvas submission to prove you did this.

3. Downloading the data

  • 1. Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
  • 2. Extract the zip file, again programmatically, similar to how we did it in class.

You could download the data and unzip it manually, but we learned about APIs, so we can do it in a reproducible way with the requests library, similar to how we did it in class (a sketch is given below).

There are 5 files in the figshare repo. The one we want is: data.zip
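For illustration, here is a minimal sketch of that programmatic route with requests; the figshare article ID, output folder, and file handling are placeholder assumptions, not the official solution.

import os
import zipfile

import requests

article_id = 123456  # placeholder -- substitute your dataset's figshare article ID
url = f"https://api.figshare.com/v2/articles/{article_id}"
output_dir = "figshare_data"
os.makedirs(output_dir, exist_ok=True)

# List the files attached to the article and pick out data.zip.
files = requests.get(url).json()["files"]
data_file = next(f for f in files if f["name"] == "data.zip")

# Stream the zip to disk, then extract it programmatically.
zip_path = os.path.join(output_dir, "data.zip")
with requests.get(data_file["download_url"], stream=True) as r:
    r.raise_for_status()
    with open(zip_path, "wb") as fh:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            fh.write(chunk)

with zipfile.ZipFile(zip_path) as z:
    z.extractall(output_dir)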

2. Creating repository and project structure

  • 1. Similar to previous project courses, create a public repository under UBC-MDS org for your project.
  • 2. Write a brief introduction of the project in the README.
  • 3. Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Reflection

Discuss any challenges or difficulties you faced when dealing with this large data on your laptops. Briefly explain your approach to overcoming the challenges, or the reasons why you were not able to overcome them.

6. Wrangle the data in preparation for machine learning

Description:

rubric={correctness:20}

Our data currently covers all of NSW, but say that our client wants us to create a machine learning model to predict rainfall over Sydney only. There's a bit of wrangling that needs to be done for that:

  • We need to query our data for only the rows that contain information covering Sydney.
  • We need to wrangle our data into a format suitable for training a machine learning model. That will require pivoting, resampling, grouping, etc.

  • 6.1) Get the data from s3

  • 6.2) First query for Sydney data and then drop the lat and lon columns (we don't need them).

syd_lat = -33.86
syd_lon = 151.21

Expected shape: (1150049, 2).
  • 6.3) Save this processed file to s3 for later use:

Save it as a CSV file ml_data_SYD.csv to s3://mds-s3-student96/output/, expected shape (46020, 26). This includes all the models as columns, plus an additional column Observed loaded from observed_daily_rainfall_SYD.csv on s3. A sketch of the whole step is given below.
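A hedged sketch of steps 6.1-6.3, assuming the combined data sits on s3 as a parquet file with bounding-box columns lat_min/lat_max/lon_min/lon_max and a rain (mm/day) column (the file key and all column names are assumptions; adjust them to your actual schema). Reading s3:// paths with pandas requires s3fs.

import pandas as pd

syd_lat = -33.86
syd_lon = 151.21

# 6.1) Get the data from s3 (the file key is a placeholder).
df = pd.read_parquet("s3://mds-s3-student96/combined_model_data.parquet")

# 6.2) Keep only rows whose grid cell covers Sydney, then drop the coordinate columns.
syd = df[
    (df["lat_min"] <= syd_lat) & (df["lat_max"] >= syd_lat)
    & (df["lon_min"] <= syd_lon) & (df["lon_max"] >= syd_lon)
].drop(columns=["lat_min", "lat_max", "lon_min", "lon_max"])

# Pivot so each climate model becomes a column, then add the Observed column.
ml_data = syd.reset_index().pivot_table(index="time", columns="model", values="rain (mm/day)")
observed = pd.read_csv("s3://mds-s3-student96/observed_daily_rainfall_SYD.csv",
                       index_col="time", parse_dates=True)
ml_data["Observed"] = observed["rain (mm/day)"]

# 6.3) Save the processed file back to s3 for later use.
ml_data.to_csv("s3://mds-s3-student96/output/ml_data_SYD.csv")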

4. Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas, please put a link where TAs can find the following:

  • This notebook with solution to 1 & 3
  • Screenshot from
    • Output after trying curl. Here is a sample. This is just an example; your input/output doesn't have to look like this, and you can design it the way you like, but at a minimum it should show your prediction value. A hypothetical request is sketched below.
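For reference only, such a curl call might look like the following; the host, port, endpoint path, and JSON shape all depend on how you design your own API.

curl -X POST "http://<your-ec2-host>:8080/predict" \
     -H "Content-Type: application/json" \
     -d '{"data": [1.2, 3.4, 0.0, 2.2, 1.9]}'   # in practice, send all 25 feature values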

Milestone 2 checklist

    • Setup your EC2 instance with JupyterHub.
    • Install everything needed on your UNIX server (Amazon EC2 instance).
    • Setup your S3 bucket.
    • Move the data you wrangled in your last milestone to s3.
    • Get the data from S3 in your notebook and make the data ready for machine learning.

3. Setup the server

Description:

rubric={correctness:20}

  • 3.1) Log in to the server (instance). The person who spins up the EC2 instance is the only one with access to the server, as only they have the private key. If someone else wants to log in to that instance, they need to get hold of that private key (refer to 1.10). Need to know more? Click here

  • 3.2) Set up a common data folder for downloading data; this folder should be accessible by all users in the JupyterHub. The following commands make a folder and make it accessible to everyone. Want to learn more about basic UNIX commands? Click here.

sudo mkdir -p /srv/data/my_shared_data_folder    # create the shared folder (and any missing parents)
sudo chmod 777 /srv/data/my_shared_data_folder/  # give every user read/write/execute access
  • 3.3) (OPTIONAL, no bonus points) If you want a shared notebook environment, then check out this. If you plan to do this, make sure you install the "members" package on your server: run sudo apt-get install members.

  • 3.4) Install AWS CLI. More details here.

NOTE: We are installing this on our EC2 instance, but we can install it anywhere to interact with s3. For example, you could install it on your local machine and move data to s3.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
  • 3.5) Set up your access key and secret. Do this from your AWS console. Make sure you keep your "Access key ID" & secret key somewhere safe.

  • 3.6) Use these credentials to configure the AWS CLI (aws configure). More details here. You can leave "Default region" and "output format" empty.

  • 3.7) The AWS CLI can be used to interact with a lot of services. Check this out. To get a feel for it, we will use the CLI to interact with s3; wait for step 6.

Please attach this screenshot from your group for grading.
Make sure you mask the IP address; refer here.

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/3_result.png

1. Develop your API

Description:

rubric={mechanics:45}

You probably got how to set up primary URL endpoints from the sampleproject.ipynb notebook and have them process and return some data. Here we are going to create a new endpoint that accepts a POST request with the features required to run the machine learning model that you trained and saved in the last milestone (i.e., a user will post the 25 climate models' rainfall predictions, i.e., the features, needed to predict with your machine learning model). Your code should then process this data, use your model to make a prediction, and return that prediction to the user. To get you started, I've given you a template which you should fill out to set up this functionality:

NOTE: You won't be able to test the flask module (or the API you make here) until you go through the steps in 2. Deploy your API. However, here you can make sure that you develop all your functions and inputs properly. A minimal sketch of such an endpoint follows.
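This sketch is an assumption-laden starting point, not the course template: the model path model.joblib, the /predict route, the port, and the JSON shape are all placeholders to adapt.

from flask import Flask, jsonify, request
from joblib import load

app = Flask(__name__)
model = load("model.joblib")  # the model trained and saved in the last milestone

@app.route("/predict", methods=["POST"])
def predict():
    content = request.json            # e.g. {"data": [25 rainfall values, one per climate model]}
    features = content["data"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=True)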

1. Setup your EC2 instance

Description:

rubric={correctness:20}

Follow the instructions shown during the lecture to set up your EC2 instance. You can use this as your reference, but please make sure you follow the instructions below.

  • 1.1) Choose AMI "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type 64-bit (x86)".

  • 1.2) Choose an Instance Type t2.xlarge.

  • 1.3) Make sure you go with the default VPC & subnet.

  • 1.4) Get the configuration code from step 7 in the above link and replace "admin-user-name" (remove the < > as well) with your AWS IAM username.

  • 1.5) For storage, use Root with size 30 GB.

  • 1.6) Add a tag: enter "Owner" under the Key field. In the Value field in the Name row, give your IAM username.

  • 1.7) Select the existing security group named "DSCI525".

  • 1.8) The review page should look like this before you launch the instance.

  • 1.9) In the pop-up, "Select an existing key pair or create a new key pair." If you are setting up your instance for the first time, click on create a new key pair and name it as your "IAM user account". Download the private key and keep it secure. Next time you set up an EC2 instance, make sure you select "Choose an existing key pair" and pick the one you already created.

  • 1.10) Search for your "IAM user account" under instances to see if it's running. Give it 15-20 minutes: even once it shows running, it takes more time to set up JupyterHub. So please wait...!

  • 1.11) Check out the "Connect" button to determine how you can connect to the instance. Now you have the DOOR access to the server that we mentioned in our first class. :)

Please attach this screenshot from your group for grading.
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/1_result.png

5. Setup your S3 bucket and move data

Description:

rubric={correctness:20}

  • 5.1) Get comfortable with the S3 UI. Go to it from the AWS console.

  • 5.2) Create a bucket there. The name should be mds-s3-xxx. Replace xxx with your "IAM user account".

  • 5.3) Leave all other options as they are. (Make sure the AWS region is Canada.)

  • 5.4) Create your first folder called "output".

  • 5.5) Move the "observed_daily_rainfall_SYD.csv" file from the Milestone 1 data folder to your s3 bucket from your local computer. (It's a tiny file, so you can easily use the UI to upload it.)

  • 5.6) Move the parquet file we downloaded in step 4 to S3 using the CLI we set up in step 3.7. Refer to this document and figure it out yourself!

Hint: We are interested in the cp command; local is the directory path on our server. An illustrative version is given below.
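For illustration only (so you still get to figure out the details), the general shape of the command is below; the local path and bucket name are placeholders.

aws s3 cp /srv/data/my_shared_data_folder/combined_model_data.parquet s3://mds-s3-xxx/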

Please attach this screenshot from your group for grading.
Make sure it has 3 objects.

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/4_result.png

2. Setup your browser, jupyter environment & connect to the master node

rubric={correctness:25}

  • 2.1) Under cluster summary > Application user interfaces > On-cluster user interfaces, click on Enable an SSH Connection.
  • 2.2) From the instructions in the popup from Step 2.1, use Step 1: Open an SSH Tunnel to the Amazon EMR Master Node. Remember you are running this from your laptop terminal; after running, it will look like this.
  • 2.3) From the instructions in the popup from Step 2.1, please ignore Step 2: Configure a proxy management tool. Instead, follow the instructions given here under the section Example: Configure FoxyProxy for Firefox. Get FoxyProxy Standard here.
  • 2.4) Move to the Application user interfaces tab and use the JupyterHub URL to access it.
  • 2.4.1) Username: jovyan, Password: jupyter. These are the defaults; more details here.
  • 2.5) [OPTIONAL] Remember, we are using the EMR-managed JupyterHub, and its setup is different from TLJH. So before you add users to JupyterHub, run the following by SSHing into the master node. Follow the instructions under cluster summary > Connect to the Master Node Using SSH. Remember, you are running this from your laptop terminal. Once you get inside the server/instance, add your team members:
 sudo docker exec jupyterhub useradd -m -s /bin/bash -N <your team member IAM id>                              # create the user inside the jupyterhub container
 sudo docker exec jupyterhub bash -c "echo <your team member IAM id>:<your team member password> | chpasswd"   # set that user's password
  • 2.6) Log in to the master node from your laptop terminal (cluster summary > Connect to the Master Node Using SSH) and install the necessary packages. Here are the packages needed for my solution; you might have to install others depending on your approach.
sudo yum install python3-devel
sudo pip3 install pandas
sudo pip3 install s3fs

IMPORTANT: Make sure ssh -i ~/ggeorgeAD.pem -ND 8157 [email protected] is running in your terminal window before trying to access your Jupyter URL. Sometimes the connection might drop; in that case, run that step again to access your JupyterHub.

Please attach this screenshot from your group for grading.
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/images/Task2.png

Submission

From DSCI-525 slack channel: https://ubc-mds.slack.com/archives/C24J4AQT1/p1618859045315300?thread_ts=1618858868.314700&cid=C24J4AQT1

SUBMISSION: Please put a link on Canvas where TAs can find the following:

  • Python 3 notebook, with the code for the ML model in scikit-learn. (You can develop this on your existing JupyterHub in your EC2 instance from milestone 2.)

  • PySpark notebook, with the code for obtaining the best hyperparameter settings. (For this you have to use the PySpark notebook in your EMR cluster.)

  • Screenshot from:

    • Setup your EMR cluster (Task 1).
    • Setup your browser, jupyter environment & connect to the master node (Task 2).
    • Your S3 bucket showing the model.joblib file. (From Task 3: Develop an ML model using scikit-learn.)

5. Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas for the Milestone 1 assignment include:

  • The URL of your public project's repository
  • The URL of your notebook for this milestone.

4. Combining data CSVs

  1. Use one of the following options to combine the data CSVs into a single CSV.

  2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON). A sketch is given after the warning below.

  3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do this on your laptop. It's fine if you're unable to; just make sure you check memory usage and discuss the reasons why you might not have been able to run it on your laptop.
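As one example of the plain-pandas route, here is a hedged sketch; the folder name and the filename convention are assumptions based on the example above, and other options (e.g. chunking or Dask) follow the same pattern.

import glob
import os

import pandas as pd

# Collect all per-model CSVs extracted from data.zip (folder name is a placeholder).
files = glob.glob("figshare_data/*_daily_rainfall_NSW.csv")

frames = []
for path in files:
    # e.g. "SAM0-UNICON_daily_rainfall_NSW.csv" -> model name "SAM0-UNICON"
    model = os.path.basename(path).replace("_daily_rainfall_NSW.csv", "")
    frames.append(pd.read_csv(path).assign(model=model))

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("figshare_data/combined_data.csv", index=False)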

2. Setup your JupyterHub

Description:

rubric={correctness:20}

  • 2.1) Under description, check for "IPv4 Public IP" and paste that IP address into your browser to reach your JupyterHub.

  • 2.2) Enter your "IAM user account" and use a strong password; note it down somewhere, as what you enter here will be the admin password.

  • 2.3) In your JupyterHub, go to "Control Panel" --> "admin". Here, add the other members of your group by their "IAM user account" names and make them admins.

  • 2.4) Check whether other members can log in to the JupyterHub from their machines by giving them the URL to connect. Step 2.2 applies to the other members here as well.

Please attach this screenshot from your group for grading.
I want to see all the group members in this screenshot: https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/2_result.png

Deliverables for Milestone 1:

In the textbox provided on Canvas for the Milestone 1 assignment include:

  • The URL of your public project's repository
  • The URL of your notebook for this milestone

Also:

  • The URL of the team-work contract

4. Get the data we wrangled in our first milestone

Description:

You have to install the packages that are needed. Refer to this TLJH document, specifically the pip section.

Don't forget to add the option -E. This way, all packages that you install will be available to other users in your JupyterHub. You must install these packages, along with any other packages needed for your wrangling.

sudo -E pip install pandas
sudo -E pip install pyarrow
sudo -E pip install s3fs

As in the last milestone, we looked at getting the data transferred from Python to R, and we have different solutions. A minimal read-from-s3 sketch is given below.
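A minimal sketch of the Python side, assuming the wrangled file from milestone 1 was uploaded to your bucket as parquet (the bucket name, key, and credentials are placeholders).

import pandas as pd

# Read the milestone 1 parquet file straight from s3 (needs pandas + pyarrow + s3fs).
df = pd.read_parquet(
    "s3://mds-s3-xxx/combined_model_data.parquet",
    storage_options={"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"},
)
print(df.shape)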

1. Setup your EMR cluster

rubric={correctness:25}

Follow the instructions shown during the lecture to set up your EMR cluster. Please make sure you follow the instructions below.

  • 1.1) Go to advanced options.
  • 1.2) Choose Release 6.2.0.
  • 1.3) Check JupyterHub 1.1.0 & Spark 3.0.1.
  • 1.4) Set core instances to 0, master to 1.
  • 1.5) Root device EBS volume size 30 GB.
  • 1.6) Give your cluster a name.
  • 1.7) Uncheck Termination protection.
  • 1.8) Add a tag: enter "Owner" under the Key field. In the Value field in the Name row, give your IAM username.
  • 1.9) Select the keypair you used in your previous milestone (milestone 2).
  • 1.10) For the EC2 security group, go with the default. Remember this is a managed service; as we learned from the shared responsibility model, AWS will take care of many things. EMR is on the list of container services. Check this.
  • 1.11) Wait for the cluster to start. This takes around 15 min. Wait for your cluster status to be Waiting.

Please attach this screenshot from your group for grading.
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/images/Task1.png

Milestone 1 Feedback

  • Well-designed readme file.
  • The report was perfect, and you have successfully done all the sections.
  • Very good reasoning in part 3.
  • Nice exploration in part 4!
