ECO 395M: Data Mining and Statistical Learning

Welcome to ECO 395M, a course on data mining and statistical learning for students in the Master's program in Economics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Exercises

I will post the exercises here and will call attention to their due dates in the week-by-week outline below.

Week-by-week running outline

Every week I will update this outline to reflect what we're currently working on, as well as to give you a preview of what's coming.

Raw RMarkdown files for all slides are in this GitHub repo.

Weeks 12-13: Unsupervised learning, continued: PCA, networks, and association rules

Slides on PCA here.

Intro slides on networks.

Slides on association rules here.

Miscellaneous:

Reading: rest of chapter 10 of Introduction to Statistical Learning.

Week 11: Clustering

Slides here.

Reading: chapter 10.3 of Introduction to Statistical Learning.

I have posted a set of project guidelines to give you some more specifics about how to prepare and submit your reports. But the basic idea is what we've discussed before: find a problem and data set that interest you, approach it using the tools we've learned in class, and write a report. Remember that, if you'd like to get feedback on your project idea, I'm asking you to turn in a prospectus by 5 PM on Friday, April 19. Send the prospectus to me at [email protected] with the subject: "ECO 395 Project Prospectus: (your names)."

The prospectus is optional and is not for a grade. Ideally it should address the question, proposed methods, and data sources you will pursue, but really, just be as specific as you can; if you can't address all of these, that's OK. It's just an opportunity to bounce your ideas off me, and if you can't find the time to send one in, that's OK too.

I've also posted the fourth and final set of exercises this semester, on unsupervised learning techniques. These are due at 5 PM on Friday, April 26.

Note on topic order: we're skipping chapter 7 in the interest of time, although it's good stuff to know. We will come back to chapter 8 (and possibly 9) after we've done chapter 10.

Weeks 9-10: Model selection and regularization

Slides here.

Reading: chapter 6 of Introduction to Statistical Learning.

Week 8: Resampling methods (CV, bootstrap)

Slides here.

In class:

Week 7: Classification, continued (multinomial logit, Bayes)

Slides same as last week.

In class:

Week 6: Classification

Slides here.

Reading: Chapter 4 of "Introduction to Statistical Learning."

In class:

Weeks 4-5: Linear regression

Slides here.

Reading: Chapter 3 of "Introduction to Statistical Learning."

In class:

Week 3: Basic concepts in statistical learning

Slides here.

Reading: Chapters 1-2 of "Introduction to Statistical Learning."

In class:

Week 2: Data visualization and practice with R

Contingency tables and bar plots; basic plots for numerical data (scatterplots, boxplots, histograms, line graphs); lattice plots. Introduction to ggplot2.

Examples of bad graphics. Baby set of slides here.

Some software walkthroughs that show some of the capabilities of basic R graphics:

If you really want to get good at plotting in R, you should learn ggplot2. Here are two references, written by the ggplot2 package author (Hadley Wickham), that are pretty useful for learning the basics:

Some examples of ggplot2 in action, from the basic to the advanced (and truly beautiful):

Further references:

  • excerpts from my course notes on data science. We'll look at some example graphics in Chapter 1.
  • Good graphics: scan through some of the New York Times' best data visualizations. Lots of good stuff here but for our purposes, the best things to look at are those in the "Data Visualizations" section, about 60% of the way down the page. Control-F for "Data Visualization" and you'll find it. Here are three examples:
  1. Low-income students in college
  2. The French presidential election
  3. LeBron James's playoff scoring record

Week 1: The data scientist's toolbox

Slides here.

Topics: Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

The first thing to do is to install R and then RStudio on your own computer. Detailed instructions for installing these two programs can be found here. Both are free.

R is the underlying data-analysis program we'll use in this course, while RStudio provides a nice front-end interface to R that makes certain repetitive steps (e.g. loading data, saving plots) very simple. I will use RStudio in class most days this semester, and you will use it most weeks for your homework. RStudio depends upon having R available behind the scenes, so make sure you install both, even though you won't need to interact directly with R.

Please install these on your own computer; you'll need them for the second day of class. At some point before class next week, complete the following R walkthroughs if you need an R refresher. If you're comfortable with R, you can safely skip these.

Important links:

Looking ahead to next week: data visualization. The following software walkthroughs will help you get your feet wet -- a lot of this will probably be a reminder!
