In this lab, you'll learn how to evaluate your model results and how to select appropriate features using stepwise selection and recursive feature elimination.
You will be able to:
- Use stepwise selection methods to determine the most important features for a model
- Use recursive feature elimination to determine the most important features for a model
import pandas as pd
import numpy as np
ames = pd.read_csv('ames.csv')
continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']
ames_cont = ames[continuous]
# log features
log_names = [f'{column}_log' for column in ames_cont.columns]
ames_log = np.log(ames_cont)
ames_log.columns = log_names
# normalize (subtract mean and divide by std)
def normalize(feature):
    return (feature - feature.mean()) / feature.std()
ames_log_norm = ames_log.apply(normalize)
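# one hot encode categoricals
# (dtype=float keeps the dummies numeric; recent pandas versions return booleans by default,
# which statsmodels OLS cannot combine with the float columns)
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True, dtype=float)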
preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)
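Feature selection needs the predictors and the target as separate objects. The following is a minimal sketch of that split on a toy frame, assuming the target is the log-transformed and standardized sale price column (`SalePrice_log`); the toy values and the names `toy`, `X_toy`, and `y_toy` are invented for illustration and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Ames data; the values are invented for illustration only.
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, 11250.0, 9550.0],
    'SalePrice': [208500.0, 181500.0, 223500.0, 140000.0],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# Same steps as the preprocessing above: log-transform, standardize, one-hot encode.
toy_log = np.log(toy[['LotArea', 'SalePrice']]).add_suffix('_log')
toy_log_norm = (toy_log - toy_log.mean()) / toy_log.std()
toy_ohe = pd.get_dummies(toy[['Street']], prefix=['Street'],
                         drop_first=True, dtype=float)
toy_preprocessed = pd.concat([toy_log_norm, toy_ohe], axis=1)

# Separate the predictors from the target before running any selection method.
X_toy = toy_preprocessed.drop(columns='SalePrice_log')
y_toy = toy_preprocessed['SalePrice_log']
```

The same two lines at the end, applied to `preprocessed`, would give the `X` and `y` that the selection methods in this lab expect.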
The function for stepwise selection is copied below. Use this provided function on your preprocessed Ames Housing data.
import statsmodels.api as sm
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """
    Perform a forward-backward feature selection
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed = False
        # forward step
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index=excluded, dtype='float64')
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add {:30} with p-value {:.6}'.format(best_feature, best_pval))
        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # null if pvalues is empty
        if worst_pval > threshold_out:
            changed = True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included
# Your code here
# Your code here
Use feature ranking to select the 5 most important features
# Your code here
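As an illustration of feature ranking, here is a minimal sketch of recursive feature elimination with scikit-learn's `RFE` wrapped around a `LinearRegression`. It runs on synthetic data rather than the Ames data; the names `X_demo`, `y_demo`, and the coefficients are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the preprocessed predictors (not the Ames data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 10)),
                      columns=[f'x{i}' for i in range(10)])

# Only x0..x4 actually drive the target; x5..x9 are pure noise.
true_coefs = np.array([3.0, -2.0, 1.5, 1.0, -1.0])
y_demo = X_demo.iloc[:, :5].to_numpy() @ true_coefs + rng.normal(scale=0.1, size=200)

# Repeatedly refit the model and drop the weakest feature until 5 remain.
selector = RFE(LinearRegression(), n_features_to_select=5)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for every kept feature.
selected_columns = X_demo.columns[selector.support_]
print(list(selected_columns))
print(selector.ranking_)
```

Because `RFE` ranks features by coefficient size here, it is only meaningful when the predictors are on comparable scales, which is one reason the continuous features were standardized earlier.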
Fit the linear regression model again using the 5 selected columns
# Your code here
Now, predict the target values using your model. You can use the .predict() method in scikit-learn.
# Your code here
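A minimal sketch of the fit-then-predict pattern in scikit-learn, again on synthetic data. The frame `X_sel` stands in for the five selected columns; it and the coefficients are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a frame holding only the 5 selected columns.
rng = np.random.default_rng(1)
X_sel = pd.DataFrame(rng.normal(size=(100, 5)),
                     columns=[f'f{i}' for i in range(5)])
y_true = (X_sel.to_numpy() @ np.array([1.0, 0.5, -0.5, 2.0, -1.0])
          + rng.normal(scale=0.2, size=100))

# Fit on the selected columns, then predict for the same rows.
linreg = LinearRegression()
linreg.fit(X_sel, y_true)
y_hat = linreg.predict(X_sel)

print(linreg.coef_)
print(y_hat[:5])
```

The predictions in `y_hat` are what the R-squared computation compares against the observed target.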
Now, using the formulas for R-squared and adjusted R-squared and your Python/NumPy knowledge, compute both values and compare them with the R-squared and adjusted R-squared in the statsmodels output from stepwise selection. Which of the two models would you prefer?
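For reference, the standard definitions are given here, with $n$ the number of observations and $p$ the number of predictors (not counting the intercept):

```latex
\begin{align*}
SS_{res} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
SS_{tot} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
R^2 &= 1 - \frac{SS_{res}}{SS_{tot}} \\
R^2_{adj} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\end{align*}
```

Adjusted R-squared penalizes each extra predictor, so it only rises when a new feature improves the fit by more than chance alone would.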
# Your code here
# r_squared is 0.239434
# adjusted_r_squared is 0.236818
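A minimal NumPy sketch of this computation on synthetic data, cross-checked against scikit-learn's `r2_score`. The data and variable names are assumptions for illustration, so the printed values will not match the Ames results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem: n observations, p predictors.
rng = np.random.default_rng(2)
n, p = 150, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(size=n)

y_hat = LinearRegression().fit(X, y).predict(X)

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared: penalize by the number of predictors p.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(r_squared, adjusted_r_squared)
print(np.isclose(r_squared, r2_score(y, y_hat)))
```

In statsmodels, the corresponding values are exposed on a fitted OLS result as `rsquared` and `rsquared_adj`, which makes the comparison with the stepwise model direct.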
- Perform variable selection using forward selection, using this resource: https://planspace.org/20150423-forward_selection_with_statsmodels/. Note that this time features are added based on the adjusted R-squared!
- Tweak the code in the stepwise_selection() function written above to perform only forward selection based on the p-value
Great! You practiced your feature selection skills by applying stepwise selection and recursive feature elimination to the Ames Housing dataset!