jwarmenhoven / islr-python

An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013): Python code

License: MIT License

Jupyter Notebook 100.00%
machine-learning predictive-modeling islr statistical-learning islr-python

islr-python's Introduction

ISLR-python

This repository contains Python code for a selection of tables, figures and LAB sections from the first edition of the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, Tibshirani (2013).

For Bayesian data analysis using PyMC3, take a look at this repository.

2018-01-15:
Minor updates to the repository due to changes/deprecations in several packages. The notebooks have been tested with these package versions. Thanks @lincolnfrias and @telescopeuser.

2016-08-30:
Chapter 6: I included Ridge/Lasso regression code using the new python-glmnet library. This is a Python wrapper for the Fortran library used in the R package glmnet.

Chapter 3 - Linear Regression
Chapter 4 - Classification
Chapter 5 - Resampling Methods
Chapter 6 - Linear Model Selection and Regularization
Chapter 7 - Moving Beyond Linearity
Chapter 8 - Tree-Based Methods
Chapter 9 - Support Vector Machines
Chapter 10 - Unsupervised Learning

Extra: Misclassification rate simulation - SVM and Logistic Regression

This great book gives a thorough introduction to the field of statistical/machine learning. The book is available for download (see link below), but I think it is one of those books that is definitely worth buying. It contains sections with applications in R based on public datasets that are available for download or are part of the R package ISLR. Furthermore, there is a Stanford University online course based on this book and taught by the authors (see the course catalogue for the current schedule).

Since Python is my language of choice for data analysis, I decided to try and do some of the calculations and plots in Jupyter Notebooks using:

  • pandas
  • numpy
  • scipy
  • scikit-learn
  • python-glmnet
  • statsmodels
  • patsy
  • matplotlib
  • seaborn

Creating these notebooks was a good way to learn more about machine learning in Python. I recreated some of the figures and tables from the chapters and worked through some of the LAB sections. I realize that at certain points it may look like I tried too hard to make the output identical to the tables and R plots in the book, but I did this to explore some details of the libraries mentioned above (mostly matplotlib and seaborn). Note that this repository is not a standalone tutorial; you should probably have a copy of the book to follow along. Suggestions for improvement and help with unsolved issues are welcome! See Hastie et al. (2009) for an advanced treatment of these topics.

References:

James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R, Springer Science+Business Media, New York. https://www.statlearning.com/

James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R, Second Edition, Springer Science+Business Media, New York. https://www.statlearning.com/

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, Second Edition, Springer Science+Business Media, New York. http://statweb.stanford.edu/~tibs/ElemStatLearn/

islr-python's People

Contributors

jwarmenhoven, njannasch

islr-python's Issues

glmnet load error

Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgfortran.5.dylib
Reason: image not found

Chapter 3 - Figure 3.2 : RSS contour plot not symmetrical

What seems to be the problem with this plot? I think I created the meshgrid correctly, but the contours do not have the same symmetrical shape as the ones in the book.

Python: [figure: fig3_2_python]

ISL: [figure: fig3_2]

I contacted Trevor Hastie to ask him about the R code for the plot on the left. He was kind enough to send me the following code. You will need to load the advertising data first and skip the first two lines and the last line.

load("Chapter3.RData")
postscript(file="../Figs/leastsqexample1.ps",width=7,height=7,pointsize=14,horizontal=F)
set.seed(22)
par(mfrow=c(1,1),mar=c(5,5,2,2))
g=50
x=advertising$TV-mean(advertising$TV)
y=advertising$Sales
b=sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
a=mean(y)-b*mean(x)
RSS.min=sum((y-as.vector(cbind(1,x)%*%c(a,b)))^2)/100000
a.grid=seq(a-2,a+2,length=g)
b.grid=seq(b-.02,b+.02,length=g)
grid=as.matrix(expand.grid(a.grid,b.grid))

RSS=rep(0,g^2)
for (i in 1:(g^2)){
yhat=as.vector(cbind(1,x)%*%grid[i,])
RSS[i]=sum((y-yhat)^2)/1000
}
RSS=matrix(RSS,g,g)
m=which.min(RSS)

contour(a.grid-b*mean(advertising$TV),b.grid,RSS,xlab=expression(beta[0]),ylab=expression(beta[1]),levels=c(2.11,2.15,2.2,2.3,2.5,3),axes=T,frame.plot=T,col=4,drawlabels=T,cex.lab=1.5,labcex=1.3)

points(a-b*mean(advertising$TV),b,col=2,pch=19,cex=1.5)

dev.off()
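
The R code above centers TV before building the grid. With a centered predictor the cross term between intercept and slope drops out of the RSS, so the contours are ellipses aligned with the axes; the intercept axis is then shifted back by b*mean(TV) for plotting. A minimal Python sketch of the same computation follows. It uses only the standard library, and the names rss_grid, linspace, da and db are hypothetical, not taken from the notebooks. Unlike the R code, it does not rescale the RSS by 1000.

```python
from statistics import mean


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def rss_grid(x, y, da=2.0, db=0.02, g=50):
    """Evaluate RSS on a g-by-g grid around the least-squares fit.

    The predictor is centered first, mirroring the R code above.
    Returns (beta0_grid, beta1_grid, rss) where rss[i][j] is the RSS
    for beta0_grid[i] and beta1_grid[j].
    """
    x_bar, y_bar = mean(x), mean(y)
    xc = [xi - x_bar for xi in x]  # centered predictor

    # Least-squares slope and intercept in the centered parametrization.
    b = sum(u * (v - y_bar) for u, v in zip(xc, y)) / sum(u * u for u in xc)
    a = y_bar

    a_grid = linspace(a - da, a + da, g)
    b_grid = linspace(b - db, b + db, g)

    rss = [
        [sum((v - ai - bj * u) ** 2 for u, v in zip(xc, y)) for bj in b_grid]
        for ai in a_grid
    ]

    # Shift the intercept axis back to the uncentered scale for plotting,
    # as the R code does with a.grid - b*mean(TV).
    beta0_grid = [ai - b * x_bar for ai in a_grid]
    return beta0_grid, b_grid, rss
```

When plotting with matplotlib's contour, note that it expects the value array indexed as [y][x], so with beta0 on the horizontal axis the rss list returned here would need to be transposed first.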

Chapter 8: Fix pydot and output from graphviz

I moved my local repository to another environment and need to fix the graphviz/pydot setup to be able to create the graphical representation of the decision trees. I accidentally pushed an update to GitHub.

AttributeError in Chapter-8 notebook

Hi, I executed the following snippet from the Chapter 8 notebook (Tree-Based Methods):

graph2 = print_tree(clf, features=X2.columns, class_names=['No', 'Yes'])
Image(graph2.create_png())

IPython gave me the following error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-fc4f6f6b365b> in <module>()
      1 graph2 = print_tree(clf, features=X2.columns, class_names=['No', 'Yes'])
----> 2 Image(graph2.create_png())

AttributeError: 'list' object has no attribute 'create_png'

My Python environment is Anaconda Python 3.5 with pydot 1.2.3. Any suggestions for fixing this problem? Thanks.
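
The error message says that graph2 is a list rather than a single graph. Newer pydot releases return a list of graphs from graph_from_dot_data, where older releases returned one graph object. Assuming print_tree passes that return value through unchanged (an assumption about the notebook's helper), a small wrapper can handle both cases. The name first_graph is hypothetical.

```python
def first_graph(result):
    """Return a single graph from the output of a pydot parser call.

    Accepts either a list/tuple of graphs (newer pydot) or a single
    graph object (older pydot) and returns one graph in both cases.
    """
    if isinstance(result, (list, tuple)):
        if not result:
            raise ValueError("no graph was parsed from the DOT data")
        return result[0]
    return result


# Hypothetical usage in the notebook:
#   graph2 = first_graph(print_tree(clf, features=X2.columns,
#                                   class_names=['No', 'Yes']))
#   Image(graph2.create_png())
```

This keeps the rest of the notebook cell unchanged and works regardless of which pydot version is installed.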

Have you tried rpy2?

I've also transcribed the Statistical Learning material into Python but only for my own reference. Yours is very elegant and I wish I'd discovered it at the beginning of the course rather than the end.

I noticed that you've gone to the extra step of dumping the R data to file and then loading into the Python environment in Jupyter. In my notes I wanted to have the R and Python code on top of each other for easy reference, so I installed an R virtualenv and built a "bilingual" Jupyter kernel that handles both languages. So far I've had good results using this version of rpy2 with Python 3.5 to invoke R magics in Jupyter and intersperse the two languages in the same notebook.

See my attached notebook, in case you find it useful.
StatLearning_Chapter4R_inPython.ipynb.zip
ch4_screen_shot

AttributeError: module 'glmnet' has no attribute 'ElasticNet'

Hi,
I am trying to run the same code as in the Chapter 6 notebook provided here:

When I reach the point:
In[7]:

     grid = 10**np.linspace(10,-2,100)

     ridge3 = gln.ElasticNet(alpha=0, lambda_path=grid)
     ridge3.fit(X, y)

I get this error:

AttributeError Traceback (most recent call last)
in ()
1 grid = 10**np.linspace(10,-2,100)
2
----> 3 ridge3 = gln.ElasticNet(alpha=0, lambda_path=grid)
4 ridge3.fit(X, y)

AttributeError: module 'glmnet' has no attribute 'ElasticNet'

What can I do about it?
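
The message means that the module imported under the name glmnet does not expose ElasticNet. One possible cause, stated here as an assumption and not a confirmed diagnosis, is that a different package or a local file called glmnet is being imported instead of python-glmnet. A minimal sketch for checking which module Python actually loaded follows; the helper name inspect_module is hypothetical.

```python
import importlib


def inspect_module(module_name, attribute):
    """Report where a module was imported from and whether it has an attribute."""
    module = importlib.import_module(module_name)
    return {
        "file": getattr(module, "__file__", None),
        "version": getattr(module, "__version__", None),
        "has_attribute": hasattr(module, attribute),
    }


# Hypothetical usage:
#   print(inspect_module("glmnet", "ElasticNet"))
```

If the reported file path points somewhere unexpected, such as a glmnet.py in the working directory, that module is shadowing the intended library.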

Chapter 6: In[7] path

X_train = pd.read_csv('Data/Hitters_X_train.csv', index_col=0)
y_train = pd.read_csv('Data/Hitters_y_train.csv', index_col=0)
X_test = pd.read_csv('Data/Hitters_X_test.csv', index_col=0)
y_test = pd.read_csv('Data/Hitters_y_test.csv', index_col=0)
