
Correlation, Regression and Multicollinearity

Here you'll practice some of the concepts from the previous lesson and begin to investigate multicollinearity in your data. Multicollinearity occurs when features in your X space are correlated with each other. This can lead to some interesting problems in regression, where it becomes difficult to determine precise coefficient weights for the various features. The general idea is that with highly correlated features, you can get nearly identical performance from different combinations of those features: 1 of this and 1 of that, 2 of this and none of that, 1.5 of this and 0.5 of that, or 50 of this and -48 of that would all produce the same result if the two features were perfectly correlated.
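To make this concrete, here is a minimal sketch (using made-up numbers, not the Walmart data used below) showing that when one feature is an exact multiple of another, many different coefficient combinations produce exactly the same predictions:

import numpy as np

# Two perfectly correlated features: x2 is always exactly twice x1
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2 * x1
y = 3 * x1  # the "true" relationship only involves x1

# Any coefficient pair (a, b) with a + 2*b = 3 reproduces y exactly
for a, b in [(3, 0), (1, 1), (-1, 2), (51, -24)]:
    preds = a * x1 + b * x2
    print(a, b, np.allclose(preds, y))  # prints True for every pair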

In the following, we break down and demonstrate these concepts in more detail.

Standard Package Imports

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Practice Merging

As in the previous class, practice importing and merging two datasets. The two datasets are in two files: Walmart_Sales_Forecasting.csv and features.csv. Import both of these with pandas and then merge the two into a single master dataframe. (The two columns you should merge on are the date and the store number.)

# Your code here
# Import and merge the two files.
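If you get stuck, here is one possible approach, a minimal sketch that assumes the merge-key columns are named 'Store' and 'Date' in both files (verify the actual column names with .columns first):

# One possible approach (assumes both files share 'Store' and 'Date' columns)
sales = pd.read_csv('Walmart_Sales_Forecasting.csv')
features = pd.read_csv('features.csv')
df = pd.merge(sales, features, on=['Store', 'Date'])
df.head()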

pd.plotting.scatter_matrix()

Another very useful built-in pandas feature is the scatter matrix function.
You can use it both to examine the pairwise relationships between columns and to see the distribution of each individual feature along the diagonal.

For example, notice how MarkDown1 and MarkDown4 appear to have some substantial correlation. Also notice the histograms along the diagonal; these are the distributions of the individual variables. For example, Store in the upper left is relatively uniform (we have roughly equal amounts of data for the various store numbers), and Temperature is roughly normally distributed. In contrast, most of the MarkDown features are highly skewed (there are a few high outliers, but most of the data falls within a much smaller range).

#Warning: This will take several minutes to generate! (May also freeze computers with limited specs)
pd.plotting.scatter_matrix(df.drop('IsHoliday', axis=1), figsize=(15,15));

(figure: scatter matrix of the numeric features)

Correlation

As we can see, there are quite a few correlated features here!
We can also investigate a single relationship between two variables further with the plt.scatter(x, y) method, or calculate the Pearson correlation coefficient with NumPy's built-in np.corrcoef() function:

x , y = df.MarkDown1, df.MarkDown4
print(np.corrcoef(x,y))
plt.scatter(x,y)
[[nan nan]
 [nan nan]]

<matplotlib.collections.PathCollection at 0x1a1cd16c88>

(figure: scatter plot of MarkDown1 vs. MarkDown4)

The correlation comes back as nan because both MarkDown columns contain missing values. Filtering out the nulls first gives a valid coefficient:

temp = df[(~df.MarkDown1.isnull())
         & (~df.MarkDown4.isnull())]
x , y = temp.MarkDown1, temp.MarkDown4
print(np.corrcoef(x,y))
plt.scatter(x,y)
[[1.         0.81923816]
 [0.81923816 1.        ]]

<matplotlib.collections.PathCollection at 0x1a1cdc04e0>

(figure: scatter plot of MarkDown1 vs. MarkDown4 after removing nulls)

Correlation versus causation

As you may have heard before, correlation does not equal causation. One fun example of this is ice cream sales and shark attacks. Imagine a bizarre dataset recording ice cream sales and shark attacks on given days at various beach towns; plotting the two, we notice a distinct correlation between them. This does not mean that more ice cream sales cause more shark attacks. In this case, both variables (ice cream sales and shark attacks) are correlated with a third feature we have yet to examine: temperature. In summer, as the temperature rises, both ice cream sales and shark attacks increase, while in winter there are comparatively few of either. In sum, don't assume that just because two variables are correlated there is any direct causal relationship between them.

Multicollinearity

Multicollinearity is when we have multiple predictive variables that are highly correlated with one another. This leads to a number of issues when we then go to perform regression, which we will investigate in more depth below.

In our current example, MarkDown1 and MarkDown4 were highly correlated, which will greatly impact our regression analysis. Let's investigate this briefly.

Regression

Regression algorithms are designed to predict a numeric value. That could be the anticipated number of votes for a candidate, the gross sales for a product, the value of a home, or the number of Yelp reviews for a restaurant. If you're trying to predict a numerical value, you'll want to use a regression algorithm. You may have seen simple linear regression in a previous class. This type of regression is known as ordinary least squares; it determines a line of best fit by minimizing the sum of squared errors between the model's predictions and the actual data. In algebra and statistics classes, this is often limited to the simple two-variable case of y = mx + b, but the process generalizes to multiple predictive variables. While there can be a lot of underlying mathematics, Python and scikit-learn make generating and tuning these models incredibly straightforward. Here's a general outline:

  • Import packages
  • Define X and y; X can be multiple columns, y is what you want to predict
    • X must have null values removed
  • Initialize a regression object
  • Fit a regression model
  • Use your regression model to generate predictions
  • Measure the accuracy of your model's predictions
df.head(2)
|   | Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday | Dept | Weekly_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | 1 | 24924.50 |
| 1 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | 2 | 50605.27 |
df.columns
Index(['Store', 'Date', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday', 'Dept', 'Weekly_Sales'],
      dtype='object')
#1) Import packages
from sklearn.linear_model import LinearRegression

#2) Define X and y
X = df[['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
        'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Dept']] #Only numeric features work

y = df['Weekly_Sales']

#3) Initialize a regression object
linreg = LinearRegression()

# 4) Fit the model
linreg.fit(X, y)

# 5) Use the model to predict outputs
df['Estimated_Weekly_Sales'] = linreg.predict(X)

# 6) Measure performance
# Here we print the model's R^2 to measure overall performance: the proportion of variance in y explained by the model
print('R^2 model score:', linreg.score(X,y), '\n')
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-51-1c69de66056d> in <module>()
     12 
     13 # 4) Fit the model
---> 14 linreg.fit(X, y)
     15 
     16 # 5) Use the model to predict outputs


~/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py in fit(self, X, y, sample_weight)
    480         n_jobs_ = self.n_jobs
    481         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 482                          y_numeric=True, multi_output=True)
    483 
    484         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:


~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
    574     if multi_output:
    575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,


~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
    454 
    455     shape_repr = _shape_repr(array.shape)


~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Handling Null Values

As you can see, the model won't run with null values in our dataset. There are many ways to handle null values, including imputation methods such as filling in the mean or median. For now, we'll simply drop all rows that contain null values.

df = df.dropna()
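As an aside, a minimal sketch of the imputation alternative (not used in this lesson) could look like the following; it fills each numeric column's missing values with that column's mean instead of discarding the rows:

# Alternative (not used here): fill numeric nulls with each column's mean instead of dropping rows
numeric_cols = df.select_dtypes(include='number').columns
df_imputed = df.copy()
df_imputed[numeric_cols] = df_imputed[numeric_cols].fillna(df_imputed[numeric_cols].mean())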

Rerunning the Modeling Process

# Code from above

#1) Import packages
from sklearn.linear_model import LinearRegression

#2) Define X and y
X = df[['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
        'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Dept']] #Only numeric features work

y = df['Weekly_Sales']

#3) Initialize a regression object
linreg = LinearRegression()

# 4) Fit the model
linreg.fit(X, y)

# 5) Use the model to predict outputs
df['Estimated_Weekly_Sales'] = linreg.predict(X)

# 6) Measure performance
# Here we print the model's R^2 to measure overall performance: the proportion of variance in y explained by the model
print('R^2 model score:', linreg.score(X,y), '\n')

#Print the coefficients for the model's formula
print('Model feature coefficients:')
weight_dict = dict(zip(X.columns, linreg.coef_))
for feat, coeff in list(zip(X.columns, linreg.coef_)):
    print(feat, 'Coefficient weight: {}'.format(round(coeff, 4)))
R^2 model score: 0.034802219578679794 

Model feature coefficients:
Store Coefficient weight: -108.1617
Temperature Coefficient weight: 38.4918
Fuel_Price Coefficient weight: -827.9562
MarkDown1 Coefficient weight: 0.1411
MarkDown2 Coefficient weight: 0.0635
MarkDown3 Coefficient weight: 0.1624
MarkDown4 Coefficient weight: -0.0512
MarkDown5 Coefficient weight: 0.2407
CPI Coefficient weight: -30.2301
Unemployment Coefficient weight: -615.576
Dept Coefficient weight: 112.2021

Problems with multicollinearity

There are a few considerations to keep in mind when it comes to interpreting regression models built on data with multicollinearity. One is that the model's coefficients lose interpretability. Under ideal conditions, we would like to interpret our coefficients literally. For example, the coefficient associated with Temperature is 38.49. If this coefficient were stable, we could say something along the lines of "as the temperature goes up by 1, weekly sales go up by 38.49." Unfortunately, with multicollinearity we cannot make such claims, because the coefficients associated with highly correlated features may vary widely depending on which other features are included. For example, observe what happens to the coefficients when we remove a feature that is correlated with the others:

# Code from above

#1) Import packages
from sklearn.linear_model import LinearRegression

#2) Define X and y
#Removed MarkDown4
X = df[['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
        'MarkDown5', 'CPI', 'Unemployment', 'Dept']] #Only numeric features work

y = df['Weekly_Sales']

#3) Initialize a regression object
linreg = LinearRegression()

# 4) Fit the model
linreg.fit(X, y)

# 5) Use the model to predict outputs
df['Estimated_Weekly_Sales'] = linreg.predict(X)

# 6) Measure performance
# Here we print the model's R^2 to measure overall performance: the proportion of variance in y explained by the model
print('R^2 model score:', linreg.score(X,y), '\n')

#Save results
new_weight_dict = dict(zip(X.columns, linreg.coef_))

#Print the coefficients for the model's formula
print('Model feature coefficients:')
for feat, coeff in list(zip(X.columns, linreg.coef_)):
    print(feat, 'Coefficient weight: {}'.format(round(coeff, 4)))
R^2 model score: 0.03473682576876003 

Model feature coefficients:
Store Coefficient weight: -108.0325
Temperature Coefficient weight: 39.7291
Fuel_Price Coefficient weight: -692.1175
MarkDown1 Coefficient weight: 0.1078
MarkDown2 Coefficient weight: 0.0658
MarkDown3 Coefficient weight: 0.1621
MarkDown5 Coefficient weight: 0.2407
CPI Coefficient weight: -29.8958
Unemployment Coefficient weight: -612.1828
Dept Coefficient weight: 112.1943

What this demonstrates is that some of these coefficients are unstable and depend on what other features are included in the model. Adding features that are correlated with features already in the model will never reduce the model's overall performance (at least on the training data), but they can be thought of as 'not adding much new information'. In this way, adding correlated features is unlikely to increase model performance to any substantial degree. This also makes it difficult to judge the importance of a particular variable: its apparent importance depends on which variables are already present (if highly correlated variables are already in the model, then the new variable will add little predictive information).

weights = pd.DataFrame.from_dict(weight_dict, orient='index').reset_index()
weights.columns = ['Feature', 'Original_Weight']
weights['New_Weight'] = weights.Feature.map(new_weight_dict)
weights['Change'] = np.abs(weights['New_Weight'] - weights['Original_Weight']) #Net change (absolute value)
weights = weights.sort_values(by='Change', ascending=False)
weights['MkDwn4_Corr'] = weights.Feature.map(lambda feat: np.corrcoef(df.MarkDown4, df[feat])[0][1])
weights['Percent_Change'] = weights.Change / weights.Original_Weight
weights.sort_values(by='Percent_Change', ascending=False)
|    | Feature | Original_Weight | New_Weight | Change | MkDwn4_Corr | Percent_Change |
|---|---|---|---|---|---|---|
| 3  | MarkDown1 | 0.141131 | 0.107823 | 0.033308 | 0.828928 | 0.236005 |
| 4  | MarkDown2 | 0.063488 | 0.065768 | 0.002280 | -0.017517 | 0.035913 |
| 1  | Temperature | 38.491844 | 39.729088 | 1.237244 | -0.057895 | 0.032143 |
| 5  | MarkDown3 | 0.162446 | 0.162095 | 0.000351 | -0.080216 | 0.002158 |
| 7  | MarkDown5 | 0.240696 | 0.240669 | 0.000027 | 0.101114 | 0.000114 |
| 10 | Dept | 112.202109 | 112.194276 | 0.007833 | 0.004650 | 0.000070 |
| 0  | Store | -108.161678 | -108.032453 | 0.129226 | 0.002050 | -0.001195 |
| 9  | Unemployment | -615.575960 | -612.182834 | 3.393126 | 0.018817 | -0.005512 |
| 8  | CPI | -30.230099 | -29.895812 | 0.334287 | -0.041662 | -0.011058 |
| 2  | Fuel_Price | -827.956231 | -692.117515 | 135.838715 | -0.025784 | -0.164065 |
| 6  | MarkDown4 | -0.051211 | NaN | NaN | 1.000000 | NaN |

Also notice that MarkDown1, the feature most highly correlated with the MarkDown4 feature we removed, is the coefficient with the largest proportional change. This shows how the original coefficient weight was influenced by the highly correlated feature.
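If you want to see which features co-vary most strongly before fitting a model, one quick check (a minimal sketch, not part of the original lesson code) is to build the pairwise correlation matrix of the numeric columns and list the most correlated pairs:

# List the most strongly correlated pairs of numeric features
corr = df.select_dtypes(include='number').corr()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)    # keep the upper triangle, excluding the diagonal
pairs = corr.where(mask).stack()                        # long format: (feature_a, feature_b) -> correlation
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(10))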

Dummy Variables

Another big problem with our current model is the data we fed into it. Specifically, we passed two columns, Store and Dept, which are numeric but are really categorical variables rather than quantities on a discrete or continuous scale. For example, store 2 and store 1 aren't really numeric quantities; they are simply labels designating which store, A or B, we are talking about. The proper way to encode this data is as binary flags (0 or 1) designating 'Is_This_Store_A?' and 'Is_This_Store_B?'. This leads to a large number of new columns, one for each possible value the categorical variable can take on. For example, we have 45 stores in our dataset, so this requires 45 new columns, one per store. Each of these columns holds a binary value (0 or 1) indicating whether that particular data point is associated with that particular store. While this may sound like a lot of tedious work, in practice pandas has a useful built-in function, pd.get_dummies(), which makes the process easy.

Before Transformation

df.Store.head()
6587    1
6588    1
6589    1
6590    1
6591    1
Name: Store, dtype: int64

After Transformation

pd.get_dummies(df.Store).head()
|      | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6587 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6588 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6589 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6590 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6591 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 45 columns

df.head(2)
|      | Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday | Dept | Weekly_Sales | Estimated_Weekly_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6587 | 1 | 2011-11-11 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | 1 | 18689.54 | 14168.563441 |
| 6588 | 1 | 2011-11-11 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | 2 | 44936.47 | 14280.757717 |
df.columns
Index(['Store', 'Date', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday', 'Dept', 'Weekly_Sales', 'Estimated_Weekly_Sales'],
      dtype='object')

Similarly, we map our boolean IsHoliday variable onto a numeric 0/1 scale:

df.IsHoliday.value_counts()
False    87064
True      9992
Name: IsHoliday, dtype: int64
df.IsHoliday = df.IsHoliday.map({True:1, False:0})
df.IsHoliday.value_counts()
0    87064
1     9992
Name: IsHoliday, dtype: int64

If your categorical variables are non-numeric, you can dummy-encode several of them at once; pd.get_dummies() applied to a dataframe only transforms the non-numeric columns, so we first cast Store and Dept to strings:

for col in ['Store', 'Dept']:
    df[col] = df[col].astype(str)  # Cast to string (non-numeric) so get_dummies will encode it
X = pd.get_dummies(df[['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
        'MarkDown5', 'CPI', 'Unemployment', 'Dept', 'IsHoliday']])
X.head()
|      | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown5 | CPI | Unemployment | IsHoliday | Store_1 | ... | Dept_90 | Dept_91 | Dept_92 | Dept_93 | Dept_94 | Dept_95 | Dept_96 | Dept_97 | Dept_98 | Dept_99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6587 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 6551.42 | 217.998085 | 7.866 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6588 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 6551.42 | 217.998085 | 7.866 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6589 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 6551.42 | 217.998085 | 7.866 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6590 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 6551.42 | 217.998085 | 7.866 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6591 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 6551.42 | 217.998085 | 7.866 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 135 columns
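One related caveat worth knowing (a side note, not pursued further in this lesson): the full set of dummy columns for a variable is itself perfectly collinear, since the dummies for each store always sum to 1. If coefficient interpretability matters, you can drop one level per categorical variable with get_dummies' drop_first option, for example:

# Optional: drop one dummy level per categorical variable to avoid perfect collinearity among the dummies
X_reduced = pd.get_dummies(df[['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
                               'MarkDown5', 'CPI', 'Unemployment', 'Dept', 'IsHoliday']], drop_first=True)
X_reduced.shape  # two fewer columns than before (one level dropped for Store, one for Dept)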

Practice

Using your newly defined X above, fit a new linear regression model and print the model's R^2 score.

#Your code here
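If you want to check your work afterwards, one possible approach (a sketch that reuses the X and y defined above) is:

# One possible approach: refit the regression on the dummy-encoded X
linreg = LinearRegression()
linreg.fit(X, y)
print('R^2 model score:', linreg.score(X, y))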
