Welcome to ECO 395M, a course on data mining and statistical learning for students in the Master's program in Economics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
I will post the exercises here and will call attention to their due dates in the week-by-week outline below.
Every week I will update this outline to reflect what we're currently working on, as well as to give you a preview of what's coming.
Raw RMarkdown files for all slides are in this GitHub repo.
Slides on association rules here.
Miscellaneous:
- Gephi, a great piece of software for exploring graphs
- The Gephi quick-start tutorial
- a little Python utility for scraping Spotify playlists
Reading: rest of chapter 10 of Introduction to Statistical Learning.
Reading: chapter 10.3 of Introduction to Statistical Learning.
I have posted a set of project guidelines to give you some more specifics about how to prepare and submit your reports. But the basic idea is what we've discussed before: find a problem and data set that interest you, approach it using the tools we've learned in class, and write a report. Remember that, if you'd like to get feedback on your project idea, I'm asking you to turn in a prospectus by 5 PM on Friday, April 19. Send the prospectus to me at [email protected] with the subject: "ECO 395 Project Prospectus: (your names)."
The prospectus is optional. Ideally you should address the question, proposed methods, and data sources you will pursue. But really, just be as specific as you can. If you can't address all of these points, that's OK. This is not for a grade; it's just an opportunity to bounce your ideas off me, and if you can't find the time to send in a prospectus, that's OK too.
I've also posted the fourth and final set of exercises this semester, on unsupervised learning techniques. These are due at 5 PM on Friday, April 26.
Note on topic order: we're skipping chapter 7 in the interest of time, although it's good stuff to know. We will come back to chapter 8 (and possibly 9) after we've done chapter 10.
Reading: chapter 6 of Introduction to Statistical Learning.
In class:
In class:
Reading: Chapter 4 of "Introduction to Statistical Learning."
In class:
Reading: Chapter 3 of "Introduction to Statistical Learning."
In class:
- oj.R and oj.csv
- saratoga_lm.R
Reading: Chapters 1-2 of "Introduction to Statistical Learning."
In class:
Contingency tables and bar plots; basic plots for numerical data (scatterplot, boxplot, histogram, line graphs); lattice plots. Introduction to ggplot2.
Examples of bad graphics. Baby set of slides here.
Some software walkthroughs that show some of the capabilities of basic R graphics:
- Survival on the Titanic: summarizing variation in categorical variables
- City temperatures: measuring and visualizing dispersion in one numerical variable.
- Test scores and GPA for UT grads: association between numerical and categorical variables.
If you really want to get good at plotting in R, you should learn ggplot2. Here are two references, written by the ggplot2 package author (Hadley Wickham), that are pretty useful for learning the basics:
Some examples of ggplot2 in action, from the basic to the advanced (and truly beautiful):
Further references:
- excerpts from my course notes on data science. We'll look at some example graphics in Chapter 1.
- Good graphics: scan through some of the New York Times' best data visualizations. Lots of good stuff here but for our purposes, the best things to look at are those in the "Data Visualizations" section, about 60% of the way down the page. Control-F for "Data Visualization" and you'll find it. Here are three examples:
- Low-income students in college
- The French presidential election
- LeBron James's playoff scoring record
Topics: Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
The first thing to do is to install R and then RStudio on your own computer. Detailed instructions for installing these two programs can be found here. Both are free.
R is the underlying data-analysis program we'll use in this course, while RStudio provides a nice front-end interface to R that makes certain repetitive steps (e.g. loading data, saving plots) very simple. I will use RStudio in class most days this semester, and you will use it most weeks for your homework. RStudio depends upon having R available behind the scenes, so make sure you install both, even though you won't need to interact directly with R.
Please install these on your own computer; you'll need them for the second day of class. At some point before class next week, complete the following R walkthroughs if you need an R refresher. If you're comfortable with R, you can safely skip these.
Important links:
- Introduction to RMarkdown
- RMarkdown tutorial
- Introduction to GitHub
- Jeff Leek's guide to sharing data
Looking ahead to next week: data visualization. The following software walkthroughs will help you get your feet wet -- a lot of this will probably be a reminder!
- Survival on the Titanic: summarizing variation in categorical variables
- City temperatures: measuring and visualizing dispersion in one numerical variable.
- Test scores and GPA for UT grads: association between numerical and categorical variables.