
SF DAT 17 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (9/14/15 - 12/02/15).

Instructors: Sinan Ozdemir (who is super cool!!!!!!)

Teaching Assistants: David, Matt, and Sri (who are all way more awesome)

Office hours: All will be held in the student center at GA, 225 Bush Street

Course Project Information

Course Project Examples

| Monday | Wednesday |
| --- | --- |
| 9/14: Introduction / Expectations / Intro to Data Science | 9/16: Git / Python |
| 9/21: Data Science Workflow / Pandas | 9/23: More Pandas! |
| 9/28: Intro to Machine Learning / Numpy / KNN | 9/30: Scikit-learn / Model Evaluation |
| *Project Milestone: Question and Data Set* | *Homework 1 Due* |
| 10/5: Linear Regression | 10/7: Logistic Regression |
| 10/12: Columbus Day (NO CLASS) | 10/14: Working on a Data Problem |
| 10/19: Clustering | 10/21: Natural Language Processing |
| 10/26: Naive Bayes | 10/28: Decision Trees |
| *Milestone: First Draft Due* | |
| 11/2: Ensembling Techniques | 11/4: Dimension Reduction |
| *Milestone: Peer Review Due* | |
| 11/9: Support Vector Machines | 11/11: Web Development with Flask |
| 11/16: Recommendation Engines | 11/18: Neural Networks Continued |
| 11/23: SQL | 11/25: Turkey Day (NO CLASS) |
| 11/30: Projects | 12/2: Projects |

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "SF_DAT_17 team" and add your photo!

Resources

Class 1: Introduction / Expectations / Intro to Data Science

  • Introduction to General Assembly
  • Course overview: our philosophy and expectations (slides)
  • Intro to Data Science: (slides)
  • Tools: check for proper setup of Git, Anaconda, overview of Slack

Homework

  • Make sure you have everything installed as specified above in "Installation and Setup" by Wednesday

Class 2: Git / Python

  • Introduction to Git
  • Intro to Python: (code)

Homework

  • Go through the Python file and finish any exercises you weren't able to complete in class
  • Make sure you have all of the repos cloned and ready to go
    • You should have both "SF_DAT_17" and "SF_DAT_17_WORK"
  • Read Greg Reda's Intro to Pandas

Resources:

  • In-depth Git/GitHub tutorial series made by a GA-DC Data Science instructor here

  • Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)

    • Here is a video of Wes McKinney going through his notebook!

Class 3: Pandas

Agenda

  • Intro to Pandas walkthrough here
    • I will give you semi-cleaned data allowing us to work on step 3 of the data science workflow
    • Pandas is an excellent tool for exploratory data analysis
    • It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
    • Pandas Lab!
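The manipulation and summary steps described above can be sketched on a toy DataFrame (the data here is invented for illustration; the class lab uses its own dataset):

```python
import pandas as pd

# A tiny made-up dataset standing in for the semi-cleaned data used in class
df = pd.DataFrame({
    "city": ["SF", "SF", "Oakland", "Oakland"],
    "rent": [3500, 4200, 2400, 2600],
})

print(df.shape)                # rows and columns: (4, 2)
print(df["rent"].describe())   # count, mean, std, min, quartiles, max
mean_rent = df.groupby("city")["rent"].mean()  # average rent per city
```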

Homework

  • Begin thinking about potential projects that you'd want to work on. Consider the problems discussed in class today (we will see more next time and next Monday as well)
    • Do you want a predictive model?
    • Do you want to cluster similar objects (like words or other)?

Resources:

Class 4 - More Pandas

Agenda

  • Class code on Pandas here
  • We will work with 3 different data sets today:
  • Pandas Lab! here

Homework

  • Please review the readme for the first homework. It is due NEXT Wednesday (9/30/2015)
  • The one-pager for your project is also due. Please see project guidelines

Class 5 - Intro to ML / Numpy / KNN

Agenda

  • Intro to numpy code
    • Numerical Python, code adapted from tutorial here
    • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides
    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Iris pre-work code and code solutions
    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
  • Lab to create our own KNN model
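As a preview of the lab, here is a minimal from-scratch KNN classifier in numpy — a simplified sketch (Euclidean distance, majority vote) on invented data, not the lab solution itself:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # distance to each training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                       # most common label wins

# Toy data: class 0 clustered near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X, y, np.array([0.5, 0.5]), k=3)  # lands in the class-0 neighborhood
```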

Homework

  • The one page project milestone as well as the pandas homework!
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?

Resources:

Class 6: scikit-learn, Model Evaluation Procedures

  • Introduction to scikit-learn with iris data (code)
  • Exploring the scikit-learn documentation: user guide, module reference, class documentation
  • Discuss the article on the bias-variance tradeoff
  • Look at some code on the bias-variance tradeoff
    • To run this, we use a module called "seaborn"
    • To install it, open your terminal (Git Bash) and type sudo pip install seaborn
  • Model evaluation procedures (slides, code)
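The evaluation procedure can be sketched on the iris data as below. Note one assumption: the `train_test_split` import path shown is the modern `sklearn.model_selection` one; scikit-learn versions current in 2015 used `sklearn.cross_validation` instead.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

# Hold out 30% of the data to estimate out-of-sample accuracy
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
```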

Homework:

Optional:

  • Practice what we learned in class today!
    • If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
    • If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
    • Either way, you can submit your commented code to your SF_DAT_17_WORK repo, and we'll give you feedback.

Resources:

Class 7: Linear Regression
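No notes are linked for this class, so as an illustration only, here is a minimal scikit-learn linear regression sketch on synthetic data (the data and numbers are invented for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly, so the fit recovers those numbers
X = np.array([[0], [1], [2], [3], [4]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)
slope, intercept = model.coef_[0], model.intercept_
```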

Homework:

Resources:

Class 8: Logistic Regression
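Again no notes are linked here, so as a hedged illustration, a minimal scikit-learn logistic regression on invented 1-D data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 1 whenever the feature is large
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns P(class 0) and P(class 1) per row; keep P(class 1)
probs = clf.predict_proba([[0], [5]])[:, 1]
```

Unlike linear regression, the model outputs a probability that is squashed through the logistic function, which is why the two extremes land on opposite sides of 0.5.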

Homework:

Resources:

Class 9: Working on a Data Problem

  • Today we will work on a real world data problem! We will have 3 options.

  • Option 1 (stocks): Use over 7 months of stock data from a fictional company, ZYX, including Twitter sentiment, volume, and stock price. Our goal is to create a predictive model that predicts forward returns. data here

    • Project overview (slides)
      • Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...
  • Option 2: Using ingredients to predict the type of recipe ([Kaggle](https://www.kaggle.com/c/whats-cooking))

  • Option 3: San Francisco Crime Classification ([Kaggle](https://www.kaggle.com/c/sf-crime))

Class 10: Clustering and Visualization

  • The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
  • The code for today focuses on two main examples:
    • We will investigate simple clustering using the iris data set.
    • We will take a look at a harder example, using Pandora songs as data. See data. See code here
    • Checking out some of the limitations of K-Means Clustering here
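The simple clustering example can be sketched roughly like this, with toy blobs standing in for the class data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; K-Means should recover them without seeing any labels
X = np.array([[0, 0], [0.5, 0], [0, 0.5],
              [10, 10], [10.5, 10], [10, 10.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)   # cluster assignment for each row
```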

Homework:

  • Project Milestone 2 is due in one week!
  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu.
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install textblob and pip install lda.

Resources:

Class 11: Natural Language Processing

Agenda

  • Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will explore techniques from this field
  • code showing topics in NLP
  • lab analyzing tweets about the stock market
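One core idea — turning words into numbers — can be shown with a pure-Python bag-of-words counter (a simplified sketch, not the class code):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count term frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bow = bag_of_words("Buy stocks now! Stocks are up, up, up.")  # e.g. a stock-market tweet
```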

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Wednesday. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
    • Confusion matrix: a good guide roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
  • You should definitely be working on your project! First draft is due Monday!!

Class 12: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand-new classification model called Naive Bayes!

Agenda

  • Learn about ROC/AUC curves
  • Learn the Naive Bayes Classifier
    • Slides here
    • Code here
    • In the code file above we will create our own spam classifier!
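The class's spam classifier isn't reproduced here, but the core Naive Bayes computation — summing log word likelihoods with Laplace smoothing — can be sketched from scratch on an invented toy corpus:

```python
import math
from collections import Counter

# Invented toy corpus; the class lab uses real spam data
spam = ["win cash now", "free cash prize", "win a free prize"]
ham = ["meeting at noon", "lunch at the cafe", "see you at the meeting"]

spam_counts = Counter(w for msg in spam for w in msg.split())
ham_counts = Counter(w for msg in ham for w in msg.split())
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(msg, counts):
    """Sum of log P(word | class), with add-one (Laplace) smoothing for unseen words."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in msg.split())

def classify(msg):
    # The class priors are equal here (3 messages each), so they cancel out
    return "spam" if log_likelihood(msg, spam_counts) > log_likelihood(msg, ham_counts) else "ham"
```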

Resources

Class 13: Decision Trees

We will look into a slightly more complex model today, the Decision Tree.
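A minimal scikit-learn sketch of a decision tree, on toy one-feature data invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label is 1 exactly when the single feature exceeds 2
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)                    # learns a single split around 2.5
preds = tree.predict([[1.5], [4.5]])
```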

Agenda

Homework

  • Project reviews due next Wednesday!

Resources

  • Chapter 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
  • The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
  • For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams:
    • Here is a link to a static version, as well as a link to a dynamic version with collapsible nodes.
    • If this is something you are interested in, Gary Sieling wrote a nice function in Python to take the output of a scikit-learn tree and convert it into JSON format.
    • If you are interested in learning d3.js, this is a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
  • Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough

Class 14: Ensembling
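As one example of an ensembling technique, a random forest (a bag of decision trees voting together) can be sketched on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

# 100 decision trees, each fit on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)  # mean accuracy on the held-out set
```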

Resources:

Class 15: Dimension Reduction

Resources

  • Some hardcore math in python here
  • PCA using the iris data set here and with 2 components here
  • PCA step by step here
  • Check out Pyxley
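The PCA walkthroughs linked above boil down to centering the data and projecting onto the top singular vectors; a minimal numpy sketch on toy data invented for illustration:

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # coordinates in component space

# Toy data lying nearly on a line, so one component captures almost all the variance
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
Z = pca(X, n_components=1)
```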

Class 16: Neural Networks and SVM

Agenda

Resources

Class 18: Recommendation Engines

  • Recommendation Engines slides
  • Recommendation Engine Example code

Resources:

Class 19: More Neural Networks

  • We will need a new package! sudo pip install pybrain
  • Recap here
  • Let's build our own! here
  • Let's use Pybrain! here
  • A talk from OpenGov
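"Build our own" can be illustrated with the forward pass of a tiny 2-2-1 network whose weights are hand-picked to compute XOR — a classic textbook sketch, not the class notebook:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    """Forward pass of a 2-2-1 network with hand-picked weights that computes XOR."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # hidden unit acting like OR
    h2 = sigmoid(-20 * x1 - 20 * x2 + 30)   # hidden unit acting like NAND
    return sigmoid(20 * h1 + 20 * h2 - 30)  # output unit: AND of the two

outputs = [int(round(xor_net(a, b))) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Training replaces the hand-picked weights with ones learned by backpropagation, which is what the calculus links below walk through.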

Resources

  • Code adapted from here and here
  • Calculus adapted from here
  • Sklearn will come out with their own supervised neural network soon! here

Class 20: Databases and SQL
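A minimal taste of SQL from Python, using the standard-library sqlite3 module with an invented table:

```python
import sqlite3

# An in-memory database, so the example leaves nothing on disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (name TEXT, score INTEGER)")
cur.executemany("INSERT INTO students VALUES (?, ?)",
                [("Ada", 95), ("Grace", 88), ("Alan", 92)])

# A simple aggregate query: the average score across all rows
cur.execute("SELECT AVG(score) FROM students")
avg_score = cur.fetchone()[0]
conn.close()
```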

Resources

Next Steps

The hardest thing to do now is to stay sharp! I have a few recommendations on next steps in order to make sure that you don't forget what we learned here!

Thank you all for such a wonderful time and I truly hope to stay in touch.
