Cross validation is very useful for determining optimal model hyperparameters such as our regularization parameter alpha. It first divides the training set into subsets, or folds (older versions of sklearn use 3 by default; version 0.22 and later use 5), and then selects the hyperparameter value (in this case alpha) that gives the best average performance on the held-out folds. For example, if we have 3 splits, A, B and C, we can do 3 training and testing combinations and then average the test performance as an overall estimate of model performance for the given parameters. (The three combinations are: train on A+B and test on C, train on A+C and test on B, train on B+C and test on A.) We can do this across various alpha values in order to determine an optimal regularization parameter. By default, sklearn will even generate a set of candidate alpha values for you, or you can explicitly check the performance of specific alpha values.
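To make the A/B/C procedure above concrete, here is a minimal sketch of 3-fold cross validation done by hand. It uses synthetic data from `make_regression` rather than the housing data, and the names `X_demo`, `y_demo` and `candidate_alphas` are assumptions chosen only for illustration. `LassoCV` automates roughly this loop, although its internals differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic data stands in for the housing data (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=150, n_features=10, n_informative=4,
                                 noise=10.0, random_state=0)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid of alphas
kfold = KFold(n_splits=3, shuffle=True, random_state=0)  # the splits A, B and C

mean_mse_by_alpha = {}
for alpha in candidate_alphas:
    fold_mse = []
    # Each iteration trains on two folds and tests on the remaining one.
    for train_idx, test_idx in kfold.split(X_demo):
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        residuals = y_demo[test_idx] - model.predict(X_demo[test_idx])
        fold_mse.append(np.mean(residuals ** 2))
    # Average the three held-out scores for this alpha.
    mean_mse_by_alpha[alpha] = np.mean(fold_mse)

# The alpha with the lowest average held-out error is selected.
best_alpha = min(mean_mse_by_alpha, key=mean_mse_by_alpha.get)
print(mean_mse_by_alpha)
print('Best alpha:', best_alpha)
```

Here mean squared error is used as the score, so lower is better.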
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
df = pd.read_csv('Housing_Prices/train.csv')
print(len(df))
df.head()
1460
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
from sklearn.linear_model import LassoCV, RidgeCV
#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]
X = df[feats].drop('SalePrice', axis=1)
#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)
y = df.SalePrice
print('Number of X features: {}'.format(len(X.columns)))
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
L1 = LassoCV()
print('Model Details:\n', L1)
L1.fit(X_train, y_train)
print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
count = 0
for num in L1.coef_:
    if num == 0:
        count += 1
print(count)
print('Number of coefficients set to zero: {}'.format(count))
Number of X features: 37
Model Details:
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
Optimal alpha: 198489.80980228688
First 5 coefficients:
[-2.80735194 -0. -0. 0.25507382 0. ]
25
Number of coefficients set to zero: 25
The Lasso technique also has a very important and profound effect: feature selection. That is, many of your feature coefficients will be shrunk to exactly zero, effectively removing their impact on the model. This can be useful in practice when trying to reduce the number of features in the model. Note that which variables are set to zero can change if multicollinearity is present in the data. That is, if two features within the X space are highly correlated, then which one takes precedence in the model is somewhat arbitrary, and as such, multiple runs of .fit() on different training splits could lead to substantially different coefficient values.
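The effect of correlated features can be seen in a small sketch. The data here is synthetic, and the names `x1`, `x2`, `x3` and the choice of `alpha=0.1` are assumptions for illustration only: `x2` is a near copy of `x1`, and each run fits on a different random subset to mimic a fresh train/test split.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples = 200
x1 = rng.normal(size=n_samples)
x2 = x1 + rng.normal(scale=0.01, size=n_samples)  # near-duplicate of x1
x3 = rng.normal(size=n_samples)                   # independent feature
X_demo = np.column_stack([x1, x2, x3])
y_demo = 3 * x1 + 2 * x3 + rng.normal(scale=0.5, size=n_samples)

coefs_per_run = []
for seed in range(3):
    # Each run fits on a different random subset, mimicking a new train/test split.
    subset = np.random.RandomState(seed).choice(n_samples, size=150, replace=False)
    model = Lasso(alpha=0.1).fit(X_demo[subset], y_demo[subset])
    coefs_per_run.append(model.coef_)
    print('Run', seed, 'coefficients:', np.round(model.coef_, 3),
          '| zeros:', int(np.sum(model.coef_ == 0)))
coefs_per_run = np.array(coefs_per_run)
```

Comparing the first two coefficients across runs shows how the weight is shared between the correlated pair, while their combined effect stays close to the true value of 3.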
#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]
X = df[feats].drop('SalePrice', axis=1)
#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)
y = df.SalePrice
print('Number of X features: {}'.format(len(X.columns)))
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
L1 = LassoCV(normalize = True)
print('Model Details:\n', L1)
L1.fit(X_train, y_train)
print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
count = 0
for num in L1.coef_:
    if num == 0:
        count += 1
print(count)
print('Number of coefficients set to zero: {}'.format(count))
Number of X features: 37
Model Details:
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=True, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
Optimal alpha: 141.0984264427501
First 5 coefficients:
[ -0.00000000e+00 -5.95275404e+01 0.00000000e+00 1.60217484e-01
2.00649624e+04]
21
Number of coefficients set to zero: 21
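Note that the `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so `LassoCV(normalize=True)` is not accepted by recent versions. A common substitute, sketched here on synthetic data (`X_demo` and `y_demo` are assumptions, not the housing data), is to standardize the features in a pipeline. This is not numerically identical to the old `normalize=True`, which divided each centered feature by its L2 norm, so the selected alpha will be on a different scale.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=5.0,
                                 random_state=0)

# Scale each feature to zero mean and unit variance, then fit LassoCV.
scaled_lasso = make_pipeline(StandardScaler(), LassoCV(cv=3))
scaled_lasso.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
lasso_step = scaled_lasso.named_steps['lassocv']
print('Optimal alpha:', lasso_step.alpha_)
print('Number of coefficients set to zero:', int((lasso_step.coef_ == 0).sum()))
```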
Calculate the mean squared error of both of the models above on the test set.
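As a reminder of what the metric computes, here is a small sketch on made-up numbers (the values in `y_true` and `y_pred` are hypothetical, not taken from the housing data). For the models above, the same call would take the test targets and the model's predictions on the test features.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000, 150000, 310000])  # hypothetical sale prices
y_pred = np.array([210000, 140000, 300000])  # hypothetical predictions

# sklearn's helper and the manual formula give the same value.
mse_sklearn = mean_squared_error(y_true, y_pred)
mse_manual = np.mean((y_true - y_pred) ** 2)
print(mse_sklearn, mse_manual)
```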
# Your code here
# Your code here
Use some of our previous techniques including normalization, feature engineering, and dummy variables on the dataset. Then, repeat fitting and tuning a model, observing the performance impact compared to above.
# Your code here