Cross Validation

Cross validation is useful for choosing model hyperparameters such as our regularization parameter alpha. It divides the training set into subsets, or folds (the sklearn version used here defaults to 3; versions 0.22 and later default to 5), and selects the hyperparameter value (in this case alpha) with the best average performance on the held-out folds. For example, with 3 splits A, B and C, we can run 3 training and testing combinations and average the test performance as an overall estimate of model performance for the given parameters. (The three combinations are: train on A+B and test on C, train on A+C and test on B, train on B+C and test on A.) Repeating this across various alpha values lets us determine an optimal regularization parameter. By default, sklearn will even generate a grid of candidate alpha values for you, or you can explicitly check the performance of specific alpha values.
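The fold-by-fold procedure described above can be sketched by hand. This is only an illustration of the idea, not what LassoCV does internally: the synthetic data, the names `X_demo` and `y_demo`, and the candidate alpha list are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration, not the housing data)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

# Hypothetical alpha values to compare
candidate_alphas = [0.01, 0.1, 1.0, 10.0]

# 3 folds: each fold is held out once while the other two are used for training
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

mean_test_mse = {}
for alpha in candidate_alphas:
    fold_scores = []
    for train_idx, test_idx in kfold.split(X_demo):
        model = Lasso(alpha=alpha)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        preds = model.predict(X_demo[test_idx])
        fold_scores.append(mean_squared_error(y_demo[test_idx], preds))
    # Average the held-out error over the 3 folds
    mean_test_mse[alpha] = np.mean(fold_scores)

# The alpha with the lowest average held-out error wins
best_alpha = min(mean_test_mse, key=mean_test_mse.get)
print(mean_test_mse)
print('Best alpha: {}'.format(best_alpha))
```

LassoCV wraps a loop of this kind and then refits on the full training set with the selected alpha.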

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split

df = pd.read_csv('Housing_Prices/train.csv')
print(len(df))
df.head()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

from sklearn.linear_model import LassoCV, RidgeCV

#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]

X = df[feats].drop('SalePrice', axis=1)

#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)

y = df.SalePrice

print('Number of X features: {}'.format(len(X.columns)))

#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
L1 = LassoCV()
print('Model Details:\n', L1)

L1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
count = 0
for num in L1.coef_:
    if num == 0:
        count += 1
print(count)
print('Number of coefficients set to zero: {}'.format(count))

Number of X features: 37
Model Details:
 LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
Optimal alpha: 198489.80980228688
First 5 coefficients:
 [-2.80735194 -0.         -0.          0.25507382  0.        ]
25
Number of coefficients set to zero: 25

Notes on Coefficients and Using Lasso for Feature Selection

The Lasso technique also has a very important side effect: feature selection. That is, many of your feature coefficients will be driven to exactly zero, effectively removing those features from the model. This can be useful in practice when trying to reduce the number of features in the model. Note that which variables are set to zero can change if multicollinearity is present in the data. That is, if two features within the X space are highly correlated, then which one takes precedence in the model is somewhat arbitrary, and as such, refitting with .fit() (for example, on a different random train/test split) could lead to substantially different coefficient values.
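A small sketch of this behavior, using synthetic data rather than the housing set. The feature layout (one near-duplicate pair, one independent signal, three pure-noise columns) and the fixed alpha of 0.5 are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n = 200

x0 = rng.normal(size=n)
x1 = x0 + rng.normal(scale=0.01, size=n)  # near-duplicate of x0 (multicollinearity)
x2 = rng.normal(size=n)                   # independent signal feature
noise_feats = rng.normal(size=(n, 3))     # features unrelated to the target

X_demo = np.column_stack([x0, x1, x2, noise_feats])
y_demo = 4.0 * x0 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Lasso produces exact zeros, so a direct comparison is safe here
n_zero = int(np.sum(model.coef_ == 0))
print('Coefficients:', model.coef_)
print('Number of coefficients set to zero: {}'.format(n_zero))
```

The three noise columns should come out at zero. How the weight for the shared signal is divided between the first two columns depends on details such as the data sample, which is the arbitrariness described above.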

With Normalization

#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]

X = df[feats].drop('SalePrice', axis=1)

#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)

y = df.SalePrice

print('Number of X features: {}'.format(len(X.columns)))

#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
L1 = LassoCV(normalize = True)
print('Model Details:\n', L1)
L1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
count = 0
for num in L1.coef_:
    if num == 0:
        count += 1
print(count)
print('Number of coefficients set to zero: {}'.format(count))

Number of X features: 37
Model Details:
 LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=True, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
Optimal alpha: 141.0984264427501
First 5 coefficients:
 [ -0.00000000e+00  -5.95275404e+01   0.00000000e+00   1.60217484e-01
   2.00649624e+04]
21
Number of coefficients set to zero: 21
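The `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so that cell only runs on older versions. One possible alternative on newer versions is to scale the features in a pipeline, sketched below on synthetic data. Note that StandardScaler is not numerically identical to `normalize=True` (which centered each feature and divided by its l2 norm), so the selected alpha will be on a different scale. The data and the names `X_demo`, `y_demo` and `scaled_lasso` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic features on very different scales (assumption, not the housing data)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 100.0, 0.01, 10.0])
y_demo = 2.0 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# The scaler is fit inside the pipeline, then LassoCV runs on the scaled features
scaled_lasso = make_pipeline(StandardScaler(), LassoCV(cv=3))
scaled_lasso.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
fitted = scaled_lasso.named_steps['lassocv']
print('Optimal alpha: {}'.format(fitted.alpha_))
print('Coefficients on standardized features:', fitted.coef_)
```

Because the coefficients now refer to standardized features, their magnitudes are comparable across columns, which is not the case for the unscaled fit.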

Calculate the Mean Squared Error

Calculate the mean squared error on the test set for both of the models above.

# Your code here

Repeat this Process for the Ridge Regression Object

# Your code here

Practice Preprocessing and Feature Engineering

Use some of our previous techniques including normalization, feature engineering, and dummy variables on the dataset. Then, repeat fitting and tuning a model, observing the performance impact compared to above.

# Your code here
