Multicollinearity of Features - Lab
Introduction
In this lab you'll identify multicollinearity in the Boston Housing Data set.
Objectives
You will be able to:
- Plot heatmaps for the predictors of the Boston dataset
- Understand and calculate correlation matrices
Correlation matrix for the Boston Housing data
Let's reimport the Boston Housing data and let's use the data with the categorical variables for tax_dummy
and rad_dummy
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
# first, create bins for based on the values observed. 5 values will result in 4 bins
bins = [0, 3, 4 , 5, 24]
bins_rad = pd.cut(boston_features['RAD'], bins)
bins_rad = bins_rad.cat.as_unordered()
# first, create bins for based on the values observed. 5 values will result in 4 bins
bins = [0, 250, 300, 360, 460, 712]
bins_tax = pd.cut(boston_features['TAX'], bins)
bins_tax = bins_tax.cat.as_unordered()
tax_dummy = pd.get_dummies(bins_tax, prefix="TAX")
rad_dummy = pd.get_dummies(bins_rad, prefix="RAD")
boston_features = boston_features.drop(["RAD","TAX"], axis=1)
boston_features = pd.concat([boston_features, rad_dummy, tax_dummy], axis=1)
boston_features.head()
Scatter matrix
Create the scatter matrix for the Boston Housing data.
This took a while to load. Not surprisingly, the categorical variables didn't really provide any meaningful result. remove the categorical columns associated with "RAD" and "TAX" from the data again and look at the scatter matrix again.
Correlation matrix
Next, let's look at the correlation matrix
Return "True" for positive or negative correlations that are bigger than 0.75.
Remove the most problematic feature from the data.
Summary
Good job! You've now edited the Boston Housing Data so highly correlated variables are removed.