ECO 395M: Data Mining and Statistical Learning

Welcome to ECO 395M, a course on data mining and statistical learning for students in the Master's program in Economics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Exercises

I will post the exercises here and will call attention to their due dates in the week-by-week outline below.

Week-by-week running outline

Every week I will update this outline to reflect what we're currently working on, as well as to give you a preview of what's coming.

Raw RMarkdown files for all slides are in this GitHub repo.

Weeks 12-13: Unsupervised learning, continued: PCA, networks, and association rules

Slides on PCA here.

Intro slides on networks.

Slides on association rules here.

Miscellaneous:

Gephi, a great piece of software for exploring graphs
The Gephi quick-start tutorial
A little Python utility for scraping Spotify playlists

Reading: rest of chapter 10 of Introduction to Statistical Learning.

Week 11: Clustering

Slides here.

Reading: chapter 10.3 of Introduction to Statistical Learning.

I have posted a set of project guidelines to give you some more specifics about how to prepare and submit your reports. But the basic idea is what we've discussed before: find a problem and data set that interest you, approach it using the tools we've learned in class, and write a report. Remember that, if you'd like to get feedback on your project idea, I'm asking you to turn in a prospectus by 5 PM on Friday, April 19. Send the prospectus to me at [email protected] with the subject: "ECO 395 Project Prospectus: (your names)."

The prospectus is optional. Ideally you should address the question, proposed methods, and data sources you will pursue, but really, just be as specific as you can. This is not for a grade; it's just an opportunity to bounce your ideas off me. If you can't address all of these points, or can't find the time to send in a prospectus, that's OK.

I've also posted the fourth and final set of exercises this semester, on unsupervised learning techniques. These are due at 5 PM on Friday, April 26.

Note on topic order: we're skipping chapter 7 in the interest of time, although it's good stuff to know. We will come back to chapter 8 (and possibly chapter 9) after we've done chapter 10.

Weeks 9-10: Model selection and regularization

Slides here.

Reading: chapter 6 of Introduction to Statistical Learning.

Week 8: Resampling methods (CV, bootstrap)

Slides here.

In class:

Week 7: Classification, continued (multinomial logit, Bayes)

Slides same as last week.

In class:

Week 6: Classification

Slides here.

Reading: Chapter 4 of "Introduction to Statistical Learning."

In class:

glass.R

Weeks 4-5: Linear regression

Slides here.

Reading: Chapter 3 of "Introduction to Statistical Learning."

In class:

oj.R and oj.csv
saratoga_lm.R

Week 3: Basic concepts in statistical learning

Slides here.

Reading: Chapters 1-2 of "Introduction to Statistical Learning."

In class:

Week 2: Data visualization and practice with R

Contingency tables and bar plots; basic plots for numerical data (scatterplot, boxplot, histogram, line graphs); lattice plots. Introduction to ggplot2.

Examples of bad graphics. Baby set of slides here.

Some software walkthroughs that show some of the capabilities of basic R graphics:

Survival on the Titanic: summarizing variation in categorical variables.
City temperatures: measuring and visualizing dispersion in one numerical variable.
Test scores and GPA for UT grads: association between numerical and categorical variables.

If you really want to get good at plotting in R, you should learn ggplot2. Here are two references, written by the ggplot2 package author (Hadley Wickham), that are pretty useful for getting the basics:

Some examples of ggplot2 in action, from the basic to the advanced (and truly beautiful):

Further references:

Excerpts from my course notes on data science. We'll look at some example graphics in Chapter 1.
Good graphics: scan through some of the New York Times' best data visualizations. Lots of good stuff here but for our purposes, the best things to look at are those in the "Data Visualizations" section, about 60% of the way down the page. Control-F for "Data Visualization" and you'll find it. Here are three examples:

Week 1: The data scientist's toolbox

Slides here.

Topics: Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

The first thing to do is to install R and then RStudio on your own computer. Detailed instructions for installing these two programs can be found here. Both are free.

R is the underlying data-analysis program we'll use in this course, while RStudio provides a nice front-end interface to R that makes certain repetitive steps (e.g. loading data, saving plots) very simple. I will use RStudio in class most days this semester, and you will use it most weeks for your homework. RStudio depends upon having R available behind the scenes, so make sure you install both, even though you won't need to interact directly with R.

Please install these on your own computer; you'll need them for the second day of class. At some point before class next week, complete the following R walkthroughs if you need an R refresher. If you're comfortable with R, you can safely skip these.

Important links:

Looking ahead to next week: data visualization. The following software walkthroughs will help you get your feet wet -- a lot of this will probably be a reminder!

Survival on the Titanic: summarizing variation in categorical variables.
City temperatures: measuring and visualizing dispersion in one numerical variable.
Test scores and GPA for UT grads: association between numerical and categorical variables.
