ubc-mds / 525-group23

This repository is used for the DSCI 525 - Web and Cloud Computing course project

License: MIT License

Jupyter Notebook 37.62% HTML 62.38%
rainfall-prediction ensamble-methods download-da preprocess-dataset

525-group23's Issues

1. Team-work contract

Similar to what you did in DSCI 522 and DSCI 524, create a teamwork contract. The contract should outline how you are committed to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it, and share a link to it, or a copy, with us in your Canvas submission to prove you did this.

3. Downloading the data

  • 1. Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
  • 2. Extract the zip file, again programmatically, similar to how we did it in class.

You could download the data and unzip it manually, but we learned about APIs, so we can do it in a reproducible way with the requests library, similar to how we did it in class (a sketch is given below).

There are 5 files in the figshare repo. The one we want is: data.zip
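For illustration, here is a minimal sketch of that programmatic route with requests; the figshare article ID, output folder, and file handling are placeholder assumptions, not the official solution.

import os
import zipfile

import requests

article_id = 123456  # placeholder -- substitute your dataset's figshare article ID
url = f"https://api.figshare.com/v2/articles/{article_id}"
output_dir = "figshare_data"
os.makedirs(output_dir, exist_ok=True)

# List the files attached to the article and pick out data.zip.
files = requests.get(url).json()["files"]
data_file = next(f for f in files if f["name"] == "data.zip")

# Stream the zip to disk, then extract it programmatically.
zip_path = os.path.join(output_dir, "data.zip")
with requests.get(data_file["download_url"], stream=True) as r:
    r.raise_for_status()
    with open(zip_path, "wb") as fh:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            fh.write(chunk)

with zipfile.ZipFile(zip_path) as z:
    z.extractall(output_dir)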

2. Creating repository and project structure

  • 1. Similar to previous project courses, create a public repository under UBC-MDS org for your project.
  • 2. Write a brief introduction of the project in the README.
  • 3. Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Reflection

Discuss any challenges or difficulties you faced when dealing with this large data on your laptops. Briefly explain your approach to overcoming the challenges, or the reasons why you were not able to overcome them.

6. Wrangle the data in preparation for machine learning

Description:

rubric={correctness:20}

Our data currently covers all of NSW, but say that our client wants us to create a machine learning model to predict rainfall over Sydney only. There's a bit of wrangling that needs to be done for that:

  • We need to query our data for only the rows that contain information covering Sydney.
  • We need to wrangle our data into a format suitable for training a machine learning model. That will require pivoting, resampling, grouping, etc.

  • 6.1) Get the data from s3

  • 6.2) First query for Sydney data and then drop the lat and lon columns (we don't need them).

syd_lat = -33.86
syd_lon = 151.21

Expected shape: (1150049, 2).
  • 6.3) Save this processed file to s3 for later use:

Save it as a CSV file ml_data_SYD.csv to s3://mds-s3-student96/output/, expected shape (46020, 26). This includes all the models as columns, plus an additional column Observed loaded from observed_daily_rainfall_SYD.csv on s3. A sketch of the whole step is given below.
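A hedged sketch of steps 6.1-6.3, assuming the combined data sits on s3 as a parquet file with bounding-box columns lat_min/lat_max/lon_min/lon_max and a rain (mm/day) column (the file key and all column names are assumptions; adjust them to your actual schema). Reading s3:// paths with pandas requires s3fs.

import pandas as pd

syd_lat = -33.86
syd_lon = 151.21

# 6.1) Get the data from s3 (the file key is a placeholder).
df = pd.read_parquet("s3://mds-s3-student96/combined_model_data.parquet")

# 6.2) Keep only rows whose grid cell covers Sydney, then drop the coordinate columns.
syd = df[
    (df["lat_min"] <= syd_lat) & (df["lat_max"] >= syd_lat)
    & (df["lon_min"] <= syd_lon) & (df["lon_max"] >= syd_lon)
].drop(columns=["lat_min", "lat_max", "lon_min", "lon_max"])

# Pivot so each climate model becomes a column, then add the Observed column.
ml_data = syd.reset_index().pivot_table(index="time", columns="model", values="rain (mm/day)")
observed = pd.read_csv("s3://mds-s3-student96/observed_daily_rainfall_SYD.csv",
                       index_col="time", parse_dates=True)
ml_data["Observed"] = observed["rain (mm/day)"]

# 6.3) Save the processed file back to s3 for later use.
ml_data.to_csv("s3://mds-s3-student96/output/ml_data_SYD.csv")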

4. Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas, please put a link where TAs can find the following:

  • This notebook with solution to 1 & 3
  • Screenshot from
    • Output after trying curl. Here is a sample. This is just an example; your input/output doesn't have to look like this, and you can design it the way you like, but at a minimum it should show your prediction value. A hypothetical request is sketched below.
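For reference only, such a curl call might look like the following; the host, port, endpoint path, and JSON shape all depend on how you design your own API.

curl -X POST "http://<your-ec2-host>:8080/predict" \
     -H "Content-Type: application/json" \
     -d '{"data": [1.2, 3.4, 0.0, 2.2, 1.9]}'   # in practice, send all 25 feature values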

Milestone 2 checklist

    • Setup your EC2 instance with JupyterHub.
    • Install everything needed on your UNIX server (Amazon EC2 instance).
    • Setup your S3 bucket.
    • Move the data you wrangled in your last milestone to s3.
    • Get the data from S3 in your notebook and make the data ready for machine learning.

3. Setup the server

Description:

rubric={correctness:20}

  • 3.1) Log in to the server (instance). The person who spins up the EC2 instance is the only one with access to the server, as only they have the private key. If someone else wants to log in to that instance, they need to get hold of that private key (refer to 1.10). Need to know more? Click here

  • 3.2) Set up a common data folder for downloading data; this folder should be accessible by all users in the JupyterHub. The following commands make a folder and make it accessible to everyone. Want to learn more about basic UNIX commands? Click here.

sudo mkdir -p /srv/data/my_shared_data_folder    # create the shared folder (and any missing parents)
sudo chmod 777 /srv/data/my_shared_data_folder/  # give every user read/write/execute access
  • 3.3) (OPTIONAL, no bonus points) If you want a shared notebook environment, then check out this. If you plan to do this, make sure you install the "members" package on your server: run sudo apt-get install members.

  • 3.4) Install AWS CLI. More details here.

NOTE: We are installing this on our EC2 instance, but we can install it anywhere to interact with s3. For example, you could install it on your local machine and move data to s3.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
  • 3.5) Set up your access key and secret. Do this from your AWS console. Make sure you keep your "Access key ID" & secret key somewhere safe.

  • 3.6) Use these credentials to configure the AWS CLI (aws configure). More details here. You can leave "Default region" and "output format" empty.

  • 3.7) The AWS CLI can be used to interact with a lot of services. Check this out. To get a feel for it, we will use the CLI to interact with s3; wait for step 6.

Please attach this screenshot from your group for grading.
Make sure you mask the IP address; refer here.

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/3_result.png

1. Develop your API

Description:

rubric={mechanics:45}

You probably got how to set up primary URL endpoints from the sampleproject.ipynb notebook and have them process and return some data. Here we are going to create a new endpoint that accepts a POST request with the features required to run the machine learning model that you trained and saved in the last milestone (i.e., a user will post the 25 climate models' rainfall predictions, i.e., the features, needed to predict with your machine learning model). Your code should then process this data, use your model to make a prediction, and return that prediction to the user. To get you started, I've given you a template which you should fill out to set up this functionality:

NOTE: You won't be able to test the flask module (or the API you make here) until you go through the steps in 2. Deploy your API. However, here you can make sure that you develop all your functions and inputs properly. A minimal sketch of such an endpoint follows.
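This sketch is an assumption-laden starting point, not the course template: the model path model.joblib, the /predict route, the port, and the JSON shape are all placeholders to adapt.

from flask import Flask, jsonify, request
from joblib import load

app = Flask(__name__)
model = load("model.joblib")  # the model trained and saved in the last milestone

@app.route("/predict", methods=["POST"])
def predict():
    content = request.json            # e.g. {"data": [25 rainfall values, one per climate model]}
    features = content["data"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=True)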

1. Setup your EC2 instance

Description:

rubric={correctness:20}

Follow the instructions shown during the lecture to set up your EC2 instance. You can use this as your reference, but please make sure you follow the instructions below.

  • 1.1) Choose AMI "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type 64-bit (x86)".

  • 1.2) Choose an Instance Type t2.xlarge.

  • 1.3) Make sure you go with the default VPC & subnet.

  • 1.4) Get the configuration code from step 7 in the above link and replace "admin-user-name" (remove the < > as well) with your AWS IAM username.

  • 1.5) For storage, use Root with size 30 GB.

  • 1.6) Add a tag: enter "Owner" under the Key field. In the Value field in the Name row, give your IAM username.

  • 1.7) Select the existing security group named "DSCI525".

  • 1.8) The review page should look like this before you launch the instance.

  • 1.9) In the pop-up, "Select an existing key pair or create a new key pair." If you are setting up your instance for the first time, click on create a new key pair and name it as your "IAM user account". Download the private key and keep it secure. Next time you set up an EC2 instance, make sure you select "Choose an existing key pair" and pick the one you already created.

  • 1.10) Search for your "IAM user account" under instances to see if it's running. Give it 15-20 minutes: even once it shows running, it takes more time to set up JupyterHub. So please wait...!

  • 1.11) Check out the "Connect" button to determine how you can connect to the instance. Now you have the DOOR access to the server that we mentioned in our first class. :)

Please attach this screenshot from your group for grading.
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/1_result.png

5. Setup your S3 bucket and move data

Description:

rubric={correctness:20}

  • 5.1) Get comfortable with the S3 UI. Go to it from the AWS console.

  • 5.2) Create a bucket there. The name should be mds-s3-xxx. Replace xxx with your "IAM user account".

  • 5.3) Leave all other options as they are. (Make sure the AWS region is Canada.)

  • 5.4) Create your first folder called "output".

  • 5.5) Move the "observed_daily_rainfall_SYD.csv" file from the Milestone 1 data folder to your s3 bucket from your local computer. (It's a tiny file, so you can easily use the UI to upload it.)

  • 5.6) Move the parquet file we downloaded in step 4 to S3 using the CLI we set up in step 3.7. Refer to this document and figure it out yourself!

Hint: We are interested in the cp command; local is the directory path on our server. An illustrative version is given below.
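For illustration only (so you still get to figure out the details), the general shape of the command is below; the local path and bucket name are placeholders.

aws s3 cp /srv/data/my_shared_data_folder/combined_model_data.parquet s3://mds-s3-xxx/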

Please attach this screenshot from your group for grading.
Make sure it has 3 objects.

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/4_result.png

2. Setup your browser, jupyter environment & connect to the master node

rubric={correctness:25}

  • 2.1) Under cluster summary > Application user interfaces > On-cluster user interfaces, click on Enable an SSH Connection.
  • 2.2) From the instructions in the popup from Step 2.1, use Step 1: Open an SSH Tunnel to the Amazon EMR Master Node. Remember you are running this from your laptop terminal; after running, it will look like this.
  • 2.3) From the instructions in the popup from Step 2.1, please ignore Step 2: Configure a proxy management tool. Instead, follow the instructions given here under the section Example: Configure FoxyProxy for Firefox. Get FoxyProxy Standard here.
  • 2.4) Move to the Application user interfaces tab and use the JupyterHub URL to access it.
  • 2.4.1) Username: jovyan, Password: jupyter. These are the defaults; more details here.
  • 2.5) [OPTIONAL] Remember, we are using the EMR-managed JupyterHub, and its setup is different from TLJH. So before you add users to JupyterHub, run the following by SSHing into the master node. Follow the instructions under cluster summary > Connect to the Master Node Using SSH. Remember, you are running this from your laptop terminal. Once you get inside the server/instance, add your team members:
 sudo docker exec jupyterhub useradd -m -s /bin/bash -N <your team member IAM id>                              # create the user inside the jupyterhub container
 sudo docker exec jupyterhub bash -c "echo <your team member IAM id>:<your team member password> | chpasswd"   # set that user's password
  • 2.6) Log in to the master node from your laptop terminal (cluster summary > Connect to the Master Node Using SSH) and install the necessary packages. Here are the packages needed for my solution; you might have to install others depending on your approach.
sudo yum install python3-devel
sudo pip3 install pandas
sudo pip3 install s3fs

IMPORTANT: Make sure ssh -i ~/ggeorgeAD.pem -ND 8157 [email protected] is running in your terminal window before trying to access your Jupyter URL. Sometimes the connection might drop; in that case, run that step again to access your JupyterHub.

Please attach this screenshot from your group for grading.
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/images/Task2.png

Submission

From DSCI-525 slack channel: https://ubc-mds.slack.com/archives/C24J4AQT1/p1618859045315300?thread_ts=1618858868.314700&cid=C24J4AQT1

SUBMISSION: Please put a link on Canvas where TAs can find the following:

  • Python 3 notebook, with the code for the ML model in scikit-learn. (You can develop this on your existing JupyterHub in your EC2 instance from milestone 2.)

  • PySpark notebook, with the code for obtaining the best hyperparameter settings. (For this you have to use the PySpark notebook in your EMR cluster.)

  • Screenshot from:

    • Setup your EMR cluster (Task 1).
    • Setup your browser, jupyter environment & connect to the master node (Task 2).
    • Your S3 bucket showing the model.joblib file. (From Task 3: Develop an ML model using scikit-learn.)

5. Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas for the Milestone 1 assignment include:

  • The URL of your public project's repository
  • The URL of your notebook for this milestone.

4. Combining data CSVs

  1. Use one of the following options to combine the data CSVs into a single CSV.

  2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON). A sketch is given after the warning below.

  3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do this on your laptop. It's fine if you're unable to; just make sure you check memory usage and discuss the reasons why you might not have been able to run it on your laptop.
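As one example of the plain-pandas route, here is a hedged sketch; the folder name and the filename convention are assumptions based on the example above, and other options (e.g. chunking or Dask) follow the same pattern.

import glob
import os

import pandas as pd

# Collect all per-model CSVs extracted from data.zip (folder name is a placeholder).
files = glob.glob("figshare_data/*_daily_rainfall_NSW.csv")

frames = []
for path in files:
    # e.g. "SAM0-UNICON_daily_rainfall_NSW.csv" -> model name "SAM0-UNICON"
    model = os.path.basename(path).replace("_daily_rainfall_NSW.csv", "")
    frames.append(pd.read_csv(path).assign(model=model))

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("figshare_data/combined_data.csv", index=False)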

2. Setup your JupyterHub

Description:

rubric={correctness:20}

  • 2.1) Under description, check for "IPv4 Public IP" and paste that IP address into your browser to reach your JupyterHub.

  • 2.2) Enter your "IAM user account" and use a strong password; note it down somewhere, as what you enter here will be the admin password.

  • 2.3) In your JupyterHub, go to "Control Panel" --> "admin". Here, add the other members of your group by their "IAM user account" names and make them admins.

  • 2.4) Check whether other members can log in to the JupyterHub from their machines by giving them the URL to connect. Step 2.2 applies to the other members here as well.

Please attach this screenshot from your group for grading.
I want to see all the group members in this screenshot: https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/2_result.png

Deliverables for Milestone 1:

In the textbox provided on Canvas for the Milestone 1 assignment include:

  • The URL of your public project's repository
  • The URL of your notebook for this milestone

Also:

  • The URL of the team-work contract

4. Get the data we wrangled in our first milestone

Description:

You have to install the packages that are needed. Refer to this TLJH document, specifically the pip section.

Don't forget to add the option -E. This way, all packages that you install will be available to other users in your JupyterHub. You must install these packages, along with any other packages needed for your wrangling.

sudo -E pip install pandas
sudo -E pip install pyarrow
sudo -E pip install s3fs

As in the last milestone, we looked at getting the data transferred from Python to R, and we have different solutions. A minimal read-from-s3 sketch is given below.
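A minimal sketch of the Python side, assuming the wrangled file from milestone 1 was uploaded to your bucket as parquet (the bucket name, key, and credentials are placeholders).

import pandas as pd

# Read the milestone 1 parquet file straight from s3 (needs pandas + pyarrow + s3fs).
df = pd.read_parquet(
    "s3://mds-s3-xxx/combined_model_data.parquet",
    storage_options={"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"},
)
print(df.shape)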

1. Setup your EMR cluster

rubric={correctness:25}

Follow the instructions shown during the lecture to set up your EMR cluster. Please make sure you follow the instructions below.

  • 1.1) Go to advanced options.
  • 1.2) Choose Release 6.2.0.
  • 1.3) Check JupyterHub 1.1.0 & Spark 3.0.1.
  • 1.4) Set core instances to 0, master to 1.
  • 1.5) Root device EBS volume size 30 GB.
  • 1.6) Give your cluster a name.
  • 1.7) Uncheck Termination protection.
  • 1.8) Add a tag: enter "Owner" under the Key field. In the Value field in the Name row, give your IAM username.
  • 1.9) Select the keypair you used in your previous milestone (milestone 2).
  • 1.10) For the EC2 security group, go with the default. Remember this is a managed service; as we learned from the shared responsibility model, AWS will take care of many things. EMR is on the list of container services. Check this.
  • 1.11) Wait for the cluster to start. This takes around 15 min. Wait for your cluster status to be Waiting.

Please attach this screenshot from your group for grading.
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/images/Task1.png

Milestone 1 Feedback

  • Well-designed readme file.
  • The report was perfect, and you have successfully done all the sections.
  • Very good reasoning in part 3.
  • Nice exploration in part 4!
