ds-skills-ridge-lasso-intro's Introduction

import pandas as pd
import warnings
df = pd.read_csv('Housing_Prices/train.csv')
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
#Previous Naive Models
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_squared_log_error

features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col!='SalePrice']
X = df[features]
#Impute null values
for col in X:
#     avg = X[col].mean()
    X[col] = X[col].fillna(value=0)
y = df.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X,y)
ols = LinearRegression()
ols.fit(X_train, y_train)
print('Training r^2:', ols.score(X_train, y_train))
print('Testing r^2:', ols.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ols.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ols.predict(X_test)))
Training r^2: 0.823908907201
Testing r^2: 0.785239568446
Training MSE: 1083602888.48
Testing MSE: 1448610037.61

Model Tuning


Another preprocessing technique is called normalization. (Don't confuse this with regularization below!) Normalization rescales all of your variables to a consistent scale. Traditionally, this means converting variables to a scale of 0 to 1, although many other normalizations are possible. Older versions of sklearn make this easy: you simply pass the parameter normalize=True to the regression object when initializing it. (The parameter was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, scale the features with a transformer such as StandardScaler before fitting.) For example:

features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col!='SalePrice']
X = df[features]
#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)
y = df.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X,y)
ols = LinearRegression(normalize=True)
ols.fit(X_train, y_train)
print('Training r^2:', ols.score(X_train, y_train))
print('Testing r^2:', ols.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ols.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ols.predict(X_test)))
Training r^2: 0.800636556693
Testing r^2: 0.853560391708
Training MSE: 1342224346.23
Testing MSE: 736512635.577
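
The 0-to-1 rescaling described above can also be done explicitly. The following is a minimal sketch using sklearn's MinMaxScaler on a small made-up array (the values and the names X_toy and X_scaled are illustrative assumptions, not part of the housing data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[8450.0, 5.0],
                  [9600.0, 8.0],
                  [14260.0, 6.0]])

scaler = MinMaxScaler()                  # rescales each column to the range [0, 1]
X_scaled = scaler.fit_transform(X_toy)   # learn each column's min/max, then rescale

print(X_scaled.min(axis=0))  # every column's minimum is now 0
print(X_scaled.max(axis=0))  # every column's maximum is now 1
```

In a real workflow you would typically fit the scaler on the training data only and then apply the same transformation to the test data.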

Feature Engineering Dummy Variables

df = pd.get_dummies(df)
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 1 60 65.0 8450 7 5 2003 2003 196.0 706 ... 0 0 0 1 0 0 0 0 1 0
1 2 20 80.0 9600 6 8 1976 1976 0.0 978 ... 0 0 0 1 0 0 0 0 1 0
2 3 60 68.0 11250 7 5 2001 2002 162.0 486 ... 0 0 0 1 0 0 0 0 1 0
3 4 70 60.0 9550 7 5 1915 1970 0.0 216 ... 0 0 0 1 1 0 0 0 0 0
4 5 60 84.0 14260 8 5 2000 2000 350.0 655 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 290 columns
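
To see why the column count grows from 81 to 290, it may help to look at what pd.get_dummies does on a tiny example. The frame below is made up for illustration (toy and dummies are hypothetical names); each categorical column is replaced by one indicator column per category, while numeric columns pass through unchanged:

```python
import pandas as pd

# Toy frame (made-up rows) with one numeric and one categorical column
toy = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                    'Street': ['Pave', 'Grvl', 'Pave']})

# The 'Street' column becomes one indicator column per category
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
```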

X = df.drop('SalePrice', axis=1)
#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)
y = df.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X,y)
ols = LinearRegression(normalize=True)
ols.fit(X_train, y_train)
print('Training r^2:', ols.score(X_train, y_train))
print('Testing r^2:', ols.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ols.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ols.predict(X_test)))
Training r^2: 0.934139831372
Testing r^2: -2.03880441798e+26
Training MSE: 412016810.023
Testing MSE: 1.31683626299e+36


Notice the severe overfitting above: our training r^2 is quite high, but the testing r^2 is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
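
A negative r^2 can look surprising. A short sketch with made-up numbers shows how it arises: r^2 compares the model's squared error with that of always predicting the mean, so predictions that are worse than the mean give a value below zero. The helper r2_manual is a hypothetical name written for this illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])        # made-up targets
y_mean = np.full_like(y_true, y_true.mean())   # always predict the mean
y_bad = np.array([10.0, -5.0, 12.0, -8.0])     # wildly wrong predictions

def r2_manual(y, y_pred):
    """r^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r2_manual(y_true, y_mean))  # predicting the mean gives exactly 0
print(r2_manual(y_true, y_bad))   # worse than the mean gives a negative value
print(r2_score(y_true, y_bad))    # sklearn's r2_score agrees with the manual formula
```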


The coefficients of our regression model work the same way as in simple single-variable regression.

Mathematically we have,

$\hat{y} = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + ... + b_nX_n$

where n is the number of features (or columns) in X.

Thus our coefficient weights are what you multiply the value of each feature by to get the predicted output value. With our current model, the weights are found by minimizing the squared errors between the model's predictions and the desired output. Remember that this is done only with the training data. In other words, we first made the train-test split. Then the .fit() method of our regression object optimized the coefficients ($b_0$ through $b_n$ above) to minimize the squared errors (residuals) between the model's predictions, $\hat{y}_{train}$ (read "y hat train"), and the actual data, y_train.
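
The prediction formula can be checked directly against a fitted model. The sketch below uses synthetic data (X_toy, y_toy and the true weights are assumptions made up for illustration) and rebuilds the predictions by hand from the fitted intercept and coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 rows, 3 features, known weights plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 4.0 + X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_toy, y_toy)
b0 = model.intercept_   # b_0 in the formula
b = model.coef_         # b_1 through b_n in the formula

# y_hat = b_0 + b_1*X_1 + ... + b_n*X_n, computed for every row at once
y_hat_manual = b0 + X_toy @ b

print(np.allclose(y_hat_manual, model.predict(X_toy)))
```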


The primary tool for dealing with overfitting is a process called regularization.

Regularization works by changing what we are trying to optimize via regression. While ordinary regression minimizes squared error, regularization techniques add a penalty for the size of the coefficients. This shrinks the coefficients, which reduces overfitting and, in the case of Lasso, can act as a method of feature selection by driving some coefficients to exactly zero. The two most common penalties of this kind are known as Lasso and Ridge regression.

Lasso regression adds a penalty proportional to the sum of the absolute values of the model coefficients. This discourages large coefficients, which helps prevent one feature from dominating the regression and reduces overfitting.

Recall that our predictions $\hat{y}$ are given by: $\hat{y} = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + ... + b_nX_n$

and so the error function to be minimized when we train our model becomes:

$\sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |b_j|$

where m is the number of observations (rows) and n is, as before, the number of features. Here lambda ($\lambda$) is the strength of regularization we want to perform. Lower values add less penalty and more closely resemble ordinary regression, while higher values increase the penalty and lead to more regularization.

Similarly, ridge regression adds a penalty proportional to the sum of the squared values of the model coefficients:

$\sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} b_j^2$
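
To make the two error functions concrete, here is a minimal sketch that evaluates them with numpy: the sum of squared residuals over all rows, plus lambda times the summed penalty over all coefficients (the intercept is left unpenalized). The function names lasso_cost and ridge_cost and the toy numbers are assumptions for illustration only:

```python
import numpy as np

def lasso_cost(X, y, b0, b, lam):
    """Squared error plus lambda times the sum of absolute coefficients."""
    residuals = y - (b0 + X @ b)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(b))

def ridge_cost(X, y, b0, b, lam):
    """Squared error plus lambda times the sum of squared coefficients."""
    residuals = y - (b0 + X @ b)
    return np.sum(residuals ** 2) + lam * np.sum(b ** 2)

# Made-up data and coefficients
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0])
b0, b = 0.0, np.array([0.5, -2.0])

print(lasso_cost(X_toy, y_toy, b0, b, lam=0.0))  # plain squared error
print(lasso_cost(X_toy, y_toy, b0, b, lam=1.0))  # adds |0.5| + |-2.0|
print(ridge_cost(X_toy, y_toy, b0, b, lam=1.0))  # adds 0.5**2 + (-2.0)**2
```

Note that sklearn calls this strength alpha rather than lambda, and its Lasso scales the squared-error term by 1/(2m), so alpha values are not numerically interchangeable with the $\lambda$ in the formulas here.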

Finding the Minimum via Gradient Descent

In this form, the underlying solution also gets a little more complicated. Previously, in the ordinary least squares case, there was a closed-form solution, which our LinearRegression() object from sklearn used to compute the optimal model. Having a closed-form solution means there is a standard procedure or formula to follow and you are guaranteed to reach the optimal solution.

Ridge regression keeps a closed-form solution, because its squared penalty is differentiable everywhere. Lasso regression does not: the absolute value in its penalty has no derivative at zero, so we are reduced to searching, albeit intelligently, for an optimal solution. A common way to describe this kind of intelligent searching is gradient descent. Gradient descent calculates the derivative, or rate of change, of the error function at a given point. This point is the current set of coefficient values ($b_0$ through $b_n$). Gradient descent then makes small changes to these coefficient values based on the derivative in order to reduce the error function, and the process continues until we reach a minimum. (sklearn's Lasso uses a related iterative method, coordinate descent, which updates one coefficient at a time.)

Iterative searches come with caveats, though. For a general error function, the algorithm might converge to a minimum that is not the absolute minimum: picture a curve with two valleys, where the search settles in the shallower valley while the global minimum lies in the deeper one.

For such functions, it is often advisable to run the algorithm multiple times from different starting points and take the lowest value among the runs. The lasso and ridge error functions are convex, so they do not have this problem; the practical concern is instead whether the search is given enough iterations to converge. Let's take a look at the most important aspects of tuning a lasso or ridge regression.
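
The gradient descent loop described above can be sketched in a few lines. This example minimizes the ridge error function (without an intercept, for brevity) on synthetic data, then checks the result against a direct linear solve of the equations obtained by setting the gradient to zero. The data, learning_rate, and iteration count are assumptions chosen for illustration, and this is not how sklearn fits its models internally:

```python
import numpy as np

# Synthetic data: 100 rows, 3 features, known weights plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)

lam = 1.0              # regularization strength (lambda)
learning_rate = 0.001  # size of each step; too large a value would diverge
b = np.zeros(3)        # starting point for the coefficients

for _ in range(1000):
    residuals = y_toy - X_toy @ b
    # Derivative of sum(residuals**2) + lam * sum(b**2) with respect to b
    gradient = -2 * X_toy.T @ residuals + 2 * lam * b
    b = b - learning_rate * gradient   # small step downhill

# Direct solution of (X^T X + lam * I) b = X^T y for comparison
b_direct = np.linalg.solve(X_toy.T @ X_toy + lam * np.eye(3), X_toy.T @ y_toy)

print(b)
print(b_direct)
```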

Tuning Parameters

  • Alpha: regularization parameter
  • Max iterations

The first, and arguably most important, parameter to tune is alpha, the regularization strength (the $\lambda$ above). This is set when you first initialize the Lasso regression object. The second important parameter to be aware of is max_iter. It forces the iterative algorithm to terminate early if it has not reached a minimum after that many steps. Setting max_iter to a lower number will make the function run faster but can impede results by terminating the algorithm early.
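
As a quick sketch of how these two parameters are passed, the example below fits a Lasso on synthetic data (X_toy and y_toy are made up for illustration) and inspects n_iter_, the number of iterations the solver actually ran:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features matter
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = 5.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# alpha sets the penalty strength; max_iter caps the number of solver iterations
lasso = Lasso(alpha=0.1, max_iter=1000)
lasso.fit(X_toy, y_toy)

print('Iterations used:', lasso.n_iter_)
print('Nonzero coefficients:', np.sum(lasso.coef_ != 0))
```

If the solver reaches max_iter before converging, sklearn emits a ConvergenceWarning, which is a signal to raise max_iter or rescale the features.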

from sklearn.linear_model import Lasso, Ridge
L1 = Lasso()  # Lasso penalizes the L1 norm of the coefficients
L1.fit(X_train, y_train)
print('Training r^2:', L1.score(X_train, y_train))
print('Testing r^2:', L1.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, L1.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, L1.predict(X_test)))
Training r^2: 0.941150558085
Testing r^2: 0.647003659993
Training MSE: 368158172.604
Testing MSE: 2279955728.58

Notice that while our training performance is still far superior to testing, the testing r^2 is now positive and the testing MSE is at least of a comparable magnitude.

L1 = Lasso(alpha=5)  # Lasso penalizes the L1 norm of the coefficients
L1.fit(X_train, y_train)
print('Training r^2:', L1.score(X_train, y_train))
print('Testing r^2:', L1.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, L1.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, L1.predict(X_test)))
Training r^2: 0.940744502048
Testing r^2: 0.65690522635
Training MSE: 370698432.016
Testing MSE: 2216002847.54


Iterate over a range of alpha values such as np.linspace(start=0.01, stop=10, num=25) and fit a Lasso regression using each one as the regularization parameter. Store the training and testing r^2 for each and plot them on a graph where the x-axis is the alpha used for the model and the y-axis is the r^2 value.

# Your code here

Repeat this process for Ridge regression.

Compare the performance between the two.

#Your code here
# Your comparisons / observations here

