YWBAT (You Will Be Able To)
- explain bias/variance tradeoff
- explain ridge regression
- explain lasso regression
- explain AIC and BIC
Features and Target
- Linear Relationship between the features and the target
- Multicollinearity - features should not be highly correlated with one another
Assumptions on your Residuals
- Normality - residuals should be approximately normally distributed
- Homoscedasticity - residuals should have roughly constant variance
- No autocorrelation - residuals should not be correlated with one another
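In equation form (a standard textbook statement, nothing fit in this notebook yet), these assumptions describe the linear model

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)$$

i.e. the errors are normally distributed, share a single variance $\sigma^2$ (homoscedasticity), and are independent of one another and of the features.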
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.datasets import california_housing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
cal_housing = california_housing.fetch_california_housing()
y = cal_housing.target
X = cal_housing.data
features = cal_housing.feature_names
df = pd.DataFrame(X, columns=features)
df['target'] = y
df.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
Why multicollinearity matters: it takes away from the interpretability of the linear equation.

Suppose two features f1 and f2 are correlated in

yhat = b0 + b1*f1 + b2*f2

Giving these some numbers:

gallons_per_mile = 2.5 x car_weight + 3.8 x engine_size

Normally, increasing car_weight by 1 unit (holding engine_size fixed) increases gallons_per_mile by 2.5. But if car_weight and engine_size are multicollinear, one can't move while the other stays fixed, so the 2.5 and 3.8 can no longer be interpreted that way.
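One common way to quantify multicollinearity (an optional aside, not part of the original notebook) is the variance inflation factor from statsmodels; a VIF well above roughly 5-10 flags a feature as highly collinear with the others.

# optional check: variance inflation factor (VIF) for each housing feature
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_feat = df.drop("target", axis=1)
vifs = pd.Series(
    [variance_inflation_factor(X_feat.values, i) for i in range(X_feat.shape[1])],
    index=X_feat.columns,
)
print(vifs.sort_values(ascending=False))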
# let's build an OLS model using statsmodels (baseline)
# note: no constant/intercept is added here, which is why the summary below
# reports an "uncentered" R-squared
ols = sm.OLS(y, df.drop("target", axis=1))
results = ols.fit()
results.summary()
| Dep. Variable: | y | R-squared (uncentered): | 0.892 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.892 |
| Method: | Least Squares | F-statistic: | 2.137e+04 |
| Date: | Thu, 12 Sep 2019 | Prob (F-statistic): | 0.00 |
| Time: | 17:33:56 | Log-Likelihood: | -24087. |
| No. Observations: | 20640 | AIC: | 4.819e+04 |
| Df Residuals: | 20632 | BIC: | 4.825e+04 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| MedInc | 0.5135 | 0.004 | 120.594 | 0.000 | 0.505 | 0.522 |
| HouseAge | 0.0157 | 0.000 | 33.727 | 0.000 | 0.015 | 0.017 |
| AveRooms | -0.1825 | 0.006 | -29.673 | 0.000 | -0.195 | -0.170 |
| AveBedrms | 0.8651 | 0.030 | 28.927 | 0.000 | 0.806 | 0.924 |
| Population | 7.792e-06 | 5.09e-06 | 1.530 | 0.126 | -2.19e-06 | 1.78e-05 |
| AveOccup | -0.0047 | 0.001 | -8.987 | 0.000 | -0.006 | -0.004 |
| Latitude | -0.0639 | 0.004 | -17.826 | 0.000 | -0.071 | -0.057 |
| Longitude | -0.0164 | 0.001 | -14.381 | 0.000 | -0.019 | -0.014 |
| Omnibus: | 4353.392 | Durbin-Watson: | 0.909 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 14087.489 |
| Skew: | 1.069 | Prob(JB): | 0.00 |
| Kurtosis: | 6.436 | Cond. No. | 1.03e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.03e+04. This might indicate that there are strong multicollinearity or other numerical problems.
A skew of 1.069 means the residuals are moderately positively skewed (perfectly normal residuals would have skew near 0).
A kurtosis of 6.436, well above the value of 3 for a normal distribution, means heavy tails, i.e. a lot of outliers.
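To eyeball the residual assumptions themselves (a quick sketch, not part of the original notebook), the fitted `results` object gives us everything we need:

# rough residual diagnostics for the OLS model fit above
resid = results.resid
fitted = results.fittedvalues

# normality: points on a Q-Q plot should hug the 45-degree line
sm.qqplot(resid, line="45", fit=True)
plt.show()

# homoscedasticity: residuals vs. fitted values should show no funnel shape
plt.scatter(fitted, resid, alpha=0.2)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()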
df.corr()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.000000 | -0.119034 | 0.326895 | -0.062040 | 0.004834 | 0.018766 | -0.079809 | -0.015176 | 0.688075 |
| HouseAge | -0.119034 | 1.000000 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 | 0.105623 |
| AveRooms | 0.326895 | -0.153277 | 1.000000 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.027540 | 0.151948 |
| AveBedrms | -0.062040 | -0.077747 | 0.847621 | 1.000000 | -0.066197 | -0.006181 | 0.069721 | 0.013344 | -0.046701 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.000000 | 0.069863 | -0.108785 | 0.099773 | -0.024650 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.000000 | 0.002366 | 0.002476 | -0.023737 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.000000 | -0.924664 | -0.144160 |
| Longitude | -0.015176 | -0.108197 | -0.027540 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.000000 | -0.045967 |
| target | 0.688075 | 0.105623 | 0.151948 | -0.046701 | -0.024650 | -0.023737 | -0.144160 | -0.045967 | 1.000000 |
# AveRooms is highly correlated with AveBedrms (0.85) and Latitude with Longitude (-0.92),
# so drop AveRooms plus the location pair to reduce multicollinearity
X = df.drop(["target", "AveRooms", "Latitude", "Longitude"], axis=1)
y = df.target
ols = sm.OLS(y, X)
results = ols.fit()
results.summary()
| Dep. Variable: | target | R-squared (uncentered): | 0.884 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.884 |
| Method: | Least Squares | F-statistic: | 3.140e+04 |
| Date: | Thu, 12 Sep 2019 | Prob (F-statistic): | 0.00 |
| Time: | 17:38:19 | Log-Likelihood: | -24870. |
| No. Observations: | 20640 | AIC: | 4.975e+04 |
| Df Residuals: | 20635 | BIC: | 4.979e+04 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| MedInc | 0.4210 | 0.003 | 165.642 | 0.000 | 0.416 | 0.426 |
| HouseAge | 0.0160 | 0.000 | 45.980 | 0.000 | 0.015 | 0.017 |
| AveBedrms | -0.0185 | 0.010 | -1.902 | 0.057 | -0.038 | 0.001 |
| Population | 1.665e-05 | 4.6e-06 | 3.618 | 0.000 | 7.63e-06 | 2.57e-05 |
| AveOccup | -0.0047 | 0.001 | -8.713 | 0.000 | -0.006 | -0.004 |
| Omnibus: | 4262.669 | Durbin-Watson: | 0.758 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 9935.375 |
| Skew: | 1.167 | Prob(JB): | 0.00 |
| Kurtosis: | 5.471 | Cond. No. | 3.16e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+03. This might indicate that there are strong multicollinearity or other numerical problems.

Dropping those features did bring the condition number down (from 1.03e+04 to 3.16e+03), but AIC and BIC both went up (4.975e+04 / 4.979e+04 versus 4.819e+04 / 4.825e+04 for the first model). Since lower AIC/BIC indicate a better trade-off between fit and complexity, the smaller model gives up more fit than it saves in complexity.
Linear regression predicts a target based on features.

What are we using to make these predictions?
- Parameters, also known as coefficients, also known as weights

How do we find the best parameters?
- We look for the parameters that give the smallest error, usually the smallest mean squared error
- One common way to find them is gradient descent, which is an iterative process
- What are the ingredients for gradient descent?
  - an initial guess for the parameters
  - a loss function -> a way of calculating error
  - an update rule: adjust the weights based on the gradient of the error with respect to the parameters
  - the weights with the lowest error are the ones we keep
- The catch: chasing the lowest training error pushes r2 up and error down, but it can also overfit
  - this is where ridge and lasso come in
  - they penalize large coefficients (lasso can even shrink some to exactly zero), so the model cannot lean too heavily on many parameters
  - what else do they do?
    - keep the parameter estimates stable
    - prevent us from overfitting
  - a sketch of gradient descent with a ridge penalty follows this list
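To make the process concrete, here is a minimal sketch (not from the original session) of gradient descent on a mean squared error loss with an L2 (ridge) penalty; alpha, lr, and n_iter are illustrative values, not tuned choices.

# minimal gradient descent for linear regression with a ridge (L2) penalty
def ridge_gradient_descent(X, y, alpha=1.0, lr=0.01, n_iter=1000):
    w = np.zeros(X.shape[1])          # initial guess of the parameters
    n = len(y)
    for _ in range(n_iter):
        error = X @ w - y             # loss ingredient: how far off the predictions are
        grad = 2 * (X.T @ error) / n + 2 * alpha * w   # gradient of MSE + alpha * ||w||^2
        w -= lr * grad                # update weights along the negative gradient
    return w

# standardize the features and center the target so one learning rate works for all weights
X_std = (df[features] - df[features].mean()) / df[features].std()
w_hat = ridge_gradient_descent(X_std.values, (df.target - df.target.mean()).values)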
xtrain, xtest, ytrain, ytest = train_test_split(df.drop('target', axis=1), df.target, test_size=0.20)
linreg = LinearRegression()
linreg.fit(xtrain, ytrain)
linreg.score(xtest, ytest)
0.6200923803673022
plt.bar(features, linreg.coef_)
plt.xticks(range(len(linreg.coef_)), features, rotation=70)
plt.show()
ridge = Ridge(alpha=10.0)
ridge.fit(xtrain, ytrain)
ridge.score(xtest, ytest)
0.620053583047234
ridge.coef_.sum()
0.08746926231461982
plt.bar(features, ridge.coef_)
plt.xticks(range(len(ridge.coef_)), features, rotation=70)
plt.show()
lasso = Lasso(alpha=0.5)
lasso.fit(xtrain, ytrain)
lasso.score(xtest, ytest)
0.4601413921538754
plt.bar(features, lasso.coef_)
plt.xticks(range(len(lasso.coef_)), features, rotation=70)
plt.show()
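One optional way to compare the three fits (not in the original notebook) is to line up their coefficients; ridge shrinks them toward zero, while lasso can push some exactly to zero.

# compare fitted coefficients of the plain, ridge, and lasso models
coef_comparison = pd.DataFrame(
    {"linreg": linreg.coef_, "ridge": ridge.coef_, "lasso": lasso.coef_},
    index=features,
)
print(coef_comparison)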
Why would one want to use ridge or lasso instead of un-penalized regression? What do these penalties affect, and why are they important?
- The issues with linear regression
  - residuals need to be approximately normal
  - features should not be multicollinear
- Learned about gradient descent
  - it's a process for finding the parameters that give the least error
- Lasso and ridge regression
  - help us find better parameters by penalizing large coefficients (lasso can zero some out entirely)
  - prevent overfitting
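A practical way to settle the ridge-vs-lasso question is to let cross-validation pick the penalty strength for each and compare held-out scores. A hedged sketch using scikit-learn's RidgeCV and LassoCV (not part of the original session; the alpha grids are arbitrary):

# choose alpha by cross-validation, then compare test-set r2
from sklearn.linear_model import RidgeCV, LassoCV

ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(xtrain, ytrain)
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0]).fit(xtrain, ytrain)

print("ridge: alpha =", ridge_cv.alpha_, ", test r2 =", ridge_cv.score(xtest, ytest))
print("lasso: alpha =", lasso_cv.alpha_, ", test r2 =", lasso_cv.score(xtest, ytest))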