YWBAT (You Will Be Able To)
- explain bias/variance tradeoff
- explain ridge regression
- explain lasso regression
- explain AIC and BIC
Features and Target
- Linear Relationship between the features and the target
- Multicollinearity - features should not be highly correlated with one another
Assumptions on your Residuals
- Normality - residuals should be approximately normally distributed
- Homoscedasticity - residuals should have roughly constant variance
- No autocorrelation - residuals should not be correlated with one another
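In equation form (a standard textbook statement, nothing fit in this notebook yet), these assumptions describe the linear model

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)$$

i.e. the errors are normally distributed, share a single variance $\sigma^2$ (homoscedasticity), and are independent of one another and of the features.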
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.datasets import california_housing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
cal_housing = california_housing.fetch_california_housing()
y = cal_housing.target
X = cal_housing.data
features = cal_housing.feature_names
df = pd.DataFrame(X, columns=features)
df['target'] = y
df.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
Why multicollinearity matters: it takes away from the interpretability of the linear equation.

Suppose two features f1 and f2 are correlated in

yhat = b0 + b1*f1 + b2*f2

Giving these some numbers:

gallons_per_mile = 2.5 x car_weight + 3.8 x engine_size

Normally, increasing car_weight by 1 unit (holding engine_size fixed) increases gallons_per_mile by 2.5. But if car_weight and engine_size are multicollinear, one can't move while the other stays fixed, so the 2.5 and 3.8 can no longer be interpreted that way.
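One common way to quantify multicollinearity (an optional aside, not part of the original notebook) is the variance inflation factor from statsmodels; a VIF well above roughly 5-10 flags a feature as highly collinear with the others.

# optional check: variance inflation factor (VIF) for each housing feature
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_feat = df.drop("target", axis=1)
vifs = pd.Series(
    [variance_inflation_factor(X_feat.values, i) for i in range(X_feat.shape[1])],
    index=X_feat.columns,
)
print(vifs.sort_values(ascending=False))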
# let's build an OLS model using statsmodels (baseline)
# note: no constant/intercept is added here, which is why the summary below
# reports an "uncentered" R-squared
ols = sm.OLS(y, df.drop("target", axis=1))
results = ols.fit()
results.summary()
| Dep. Variable: | y | R-squared (uncentered): | 0.892 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.892 |
| Method: | Least Squares | F-statistic: | 2.137e+04 |
| Date: | Thu, 12 Sep 2019 | Prob (F-statistic): | 0.00 |
| Time: | 17:33:56 | Log-Likelihood: | -24087. |
| No. Observations: | 20640 | AIC: | 4.819e+04 |
| Df Residuals: | 20632 | BIC: | 4.825e+04 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| MedInc | 0.5135 | 0.004 | 120.594 | 0.000 | 0.505 | 0.522 |
| HouseAge | 0.0157 | 0.000 | 33.727 | 0.000 | 0.015 | 0.017 |
| AveRooms | -0.1825 | 0.006 | -29.673 | 0.000 | -0.195 | -0.170 |
| AveBedrms | 0.8651 | 0.030 | 28.927 | 0.000 | 0.806 | 0.924 |
| Population | 7.792e-06 | 5.09e-06 | 1.530 | 0.126 | -2.19e-06 | 1.78e-05 |
| AveOccup | -0.0047 | 0.001 | -8.987 | 0.000 | -0.006 | -0.004 |
| Latitude | -0.0639 | 0.004 | -17.826 | 0.000 | -0.071 | -0.057 |
| Longitude | -0.0164 | 0.001 | -14.381 | 0.000 | -0.019 | -0.014 |
| Omnibus: | 4353.392 | Durbin-Watson: | 0.909 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 14087.489 |
| Skew: | 1.069 | Prob(JB): | 0.00 |
| Kurtosis: | 6.436 | Cond. No. | 1.03e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.03e+04. This might indicate that there are strong multicollinearity or other numerical problems.
A skew of 1.069 means the residuals are moderately positively skewed (perfectly normal residuals would have skew near 0).
A kurtosis of 6.436, well above the value of 3 for a normal distribution, means heavy tails, i.e. a lot of outliers.
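To eyeball the residual assumptions themselves (a quick sketch, not part of the original notebook), the fitted `results` object gives us everything we need:

# rough residual diagnostics for the OLS model fit above
resid = results.resid
fitted = results.fittedvalues

# normality: points on a Q-Q plot should hug the 45-degree line
sm.qqplot(resid, line="45", fit=True)
plt.show()

# homoscedasticity: residuals vs. fitted values should show no funnel shape
plt.scatter(fitted, resid, alpha=0.2)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()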
df.corr()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.000000 | -0.119034 | 0.326895 | -0.062040 | 0.004834 | 0.018766 | -0.079809 | -0.015176 | 0.688075 |
| HouseAge | -0.119034 | 1.000000 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 | 0.105623 |
| AveRooms | 0.326895 | -0.153277 | 1.000000 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.027540 | 0.151948 |
| AveBedrms | -0.062040 | -0.077747 | 0.847621 | 1.000000 | -0.066197 | -0.006181 | 0.069721 | 0.013344 | -0.046701 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.000000 | 0.069863 | -0.108785 | 0.099773 | -0.024650 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.000000 | 0.002366 | 0.002476 | -0.023737 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.000000 | -0.924664 | -0.144160 |
| Longitude | -0.015176 | -0.108197 | -0.027540 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.000000 | -0.045967 |
| target | 0.688075 | 0.105623 | 0.151948 | -0.046701 | -0.024650 | -0.023737 | -0.144160 | -0.045967 | 1.000000 |
# AveRooms is highly correlated with AveBedrms (0.85) and Latitude with Longitude (-0.92),
# so drop AveRooms plus the location pair to reduce multicollinearity
X = df.drop(["target", "AveRooms", "Latitude", "Longitude"], axis=1)
y = df.target
ols = sm.OLS(y, X)
results = ols.fit()
results.summary()
| Dep. Variable: | target | R-squared (uncentered): | 0.884 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.884 |
| Method: | Least Squares | F-statistic: | 3.140e+04 |
| Date: | Thu, 12 Sep 2019 | Prob (F-statistic): | 0.00 |
| Time: | 17:38:19 | Log-Likelihood: | -24870. |
| No. Observations: | 20640 | AIC: | 4.975e+04 |
| Df Residuals: | 20635 | BIC: | 4.979e+04 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| MedInc | 0.4210 | 0.003 | 165.642 | 0.000 | 0.416 | 0.426 |
| HouseAge | 0.0160 | 0.000 | 45.980 | 0.000 | 0.015 | 0.017 |
| AveBedrms | -0.0185 | 0.010 | -1.902 | 0.057 | -0.038 | 0.001 |
| Population | 1.665e-05 | 4.6e-06 | 3.618 | 0.000 | 7.63e-06 | 2.57e-05 |
| AveOccup | -0.0047 | 0.001 | -8.713 | 0.000 | -0.006 | -0.004 |
| Omnibus: | 4262.669 | Durbin-Watson: | 0.758 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 9935.375 |
| Skew: | 1.167 | Prob(JB): | 0.00 |
| Kurtosis: | 5.471 | Cond. No. | 3.16e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+03. This might indicate that there are strong multicollinearity or other numerical problems.

Dropping those features did bring the condition number down (from 1.03e+04 to 3.16e+03), but AIC and BIC both went up (4.975e+04 / 4.979e+04 versus 4.819e+04 / 4.825e+04 for the first model). Since lower AIC/BIC indicate a better trade-off between fit and complexity, the smaller model gives up more fit than it saves in complexity.
Linear regression predicts a target based on features.

What are we using to make these predictions?
- Parameters, also known as coefficients, also known as weights

How do we find the best parameters?
- We look for the parameters that give the smallest error, usually the smallest mean squared error
- One common way to find them is gradient descent, which is an iterative process
- What are the ingredients for gradient descent?
  - an initial guess for the parameters
  - a loss function -> a way of calculating error
  - an update rule: adjust the weights based on the gradient of the error with respect to the parameters
  - the weights with the lowest error are the ones we keep
- The catch: chasing the lowest training error pushes r2 up and error down, but it can also overfit
  - this is where ridge and lasso come in
  - they penalize large coefficients (lasso can even shrink some to exactly zero), so the model cannot lean too heavily on many parameters
  - what else do they do?
    - keep the parameter estimates stable
    - prevent us from overfitting
  - a sketch of gradient descent with a ridge penalty follows this list
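To make the process concrete, here is a minimal sketch (not from the original session) of gradient descent on a mean squared error loss with an L2 (ridge) penalty; alpha, lr, and n_iter are illustrative values, not tuned choices.

# minimal gradient descent for linear regression with a ridge (L2) penalty
def ridge_gradient_descent(X, y, alpha=1.0, lr=0.01, n_iter=1000):
    w = np.zeros(X.shape[1])          # initial guess of the parameters
    n = len(y)
    for _ in range(n_iter):
        error = X @ w - y             # loss ingredient: how far off the predictions are
        grad = 2 * (X.T @ error) / n + 2 * alpha * w   # gradient of MSE + alpha * ||w||^2
        w -= lr * grad                # update weights along the negative gradient
    return w

# standardize the features and center the target so one learning rate works for all weights
X_std = (df[features] - df[features].mean()) / df[features].std()
w_hat = ridge_gradient_descent(X_std.values, (df.target - df.target.mean()).values)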
xtrain, xtest, ytrain, ytest = train_test_split(df.drop('target', axis=1), df.target, test_size=0.20)
linreg = LinearRegression()
linreg.fit(xtrain, ytrain)
linreg.score(xtest, ytest)
0.6200923803673022
plt.bar(features, linreg.coef_)
plt.xticks(range(len(linreg.coef_)), features, rotation=70)
plt.show()
ridge = Ridge(alpha=10.0)
ridge.fit(xtrain, ytrain)
ridge.score(xtest, ytest)
0.620053583047234
ridge.coef_.sum()
0.08746926231461982
plt.bar(features, ridge.coef_)
plt.xticks(range(len(ridge.coef_)), features, rotation=70)
plt.show()
lasso = Lasso(alpha=0.5)
lasso.fit(xtrain, ytrain)
lasso.score(xtest, ytest)
0.4601413921538754
plt.bar(features, lasso.coef_)
plt.xticks(range(len(lasso.coef_)), features, rotation=70)
plt.show()
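One optional way to compare the three fits (not in the original notebook) is to line up their coefficients; ridge shrinks them toward zero, while lasso can push some exactly to zero.

# compare fitted coefficients of the plain, ridge, and lasso models
coef_comparison = pd.DataFrame(
    {"linreg": linreg.coef_, "ridge": ridge.coef_, "lasso": lasso.coef_},
    index=features,
)
print(coef_comparison)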
Why would one want to use ridge or lasso instead of un-penalized regression? What do these penalties affect, and why are they important?
- The issues with linear regression
  - residuals need to be approximately normal
  - features should not be multicollinear
- Learned about gradient descent
  - it's a process for finding the parameters that give the least error
- Lasso and ridge regression
  - help us find better parameters by penalizing large coefficients (lasso can zero some out entirely)
  - prevent overfitting
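A practical way to settle the ridge-vs-lasso question is to let cross-validation pick the penalty strength for each and compare held-out scores. A hedged sketch using scikit-learn's RidgeCV and LassoCV (not part of the original session; the alpha grids are arbitrary):

# choose alpha by cross-validation, then compare test-set r2
from sklearn.linear_model import RidgeCV, LassoCV

ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(xtrain, ytrain)
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0]).fit(xtrain, ytrain)

print("ridge: alpha =", ridge_cv.alpha_, ", test r2 =", ridge_cv.score(xtest, ytest))
print("lasso: alpha =", lasso_cv.alpha_, ", test r2 =", lasso_cv.score(xtest, ytest))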