
Linear Transformations

Introduction

Linear transformations are a valuable tool to help make your linear regression model more interpretable. They can involve transforming the scale, the mean, or both.

Objectives

You will be able to:

  • Determine if a linear transformation would be useful for a specific model or set of data
  • Identify an appropriate linear transformation technique for a specific model or set of data
  • Apply linear transformations to independent and dependent variables in linear regression
  • Interpret the coefficients of variables that have been transformed using a linear transformation

Why Apply Linear Transformations?

Linear transformations don't impact the overall model performance metrics of an ordinary least-squares linear regression. So why apply them?

The main reason to apply a linear transformation is so that the modeling results are more useful or interpretable to a stakeholder. There are also some machine learning models that assume that variables have been transformed to have the same scale, although this is not applicable to the regression models we are currently using.

For each common type of linear transformation we'll walk through a reason why it might be useful, how to apply it, and how to interpret the resulting coefficients.

Scaling

Let's say we have this model, using the Auto MPG dataset:

import pandas as pd
data = pd.read_csv("auto-mpg.csv")
data
|     | mpg  | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name                  |
|-----|------|-----------|--------------|------------|--------|--------------|------------|--------|---------------------------|
| 0   | 18.0 | 8         | 307.0        | 130        | 3504   | 12.0         | 70         | 1      | chevrolet chevelle malibu |
| 1   | 15.0 | 8         | 350.0        | 165        | 3693   | 11.5         | 70         | 1      | buick skylark 320         |
| 2   | 18.0 | 8         | 318.0        | 150        | 3436   | 11.0         | 70         | 1      | plymouth satellite        |
| 3   | 16.0 | 8         | 304.0        | 150        | 3433   | 12.0         | 70         | 1      | amc rebel sst             |
| 4   | 17.0 | 8         | 302.0        | 140        | 3449   | 10.5         | 70         | 1      | ford torino               |
| ... | ...  | ...       | ...          | ...        | ...    | ...          | ...        | ...    | ...                       |
| 387 | 27.0 | 4         | 140.0        | 86         | 2790   | 15.6         | 82         | 1      | ford mustang gl           |
| 388 | 44.0 | 4         | 97.0         | 52         | 2130   | 24.6         | 82         | 2      | vw pickup                 |
| 389 | 32.0 | 4         | 135.0        | 84         | 2295   | 11.6         | 82         | 1      | dodge rampage             |
| 390 | 28.0 | 4         | 120.0        | 79         | 2625   | 18.6         | 82         | 1      | ford ranger               |
| 391 | 31.0 | 4         | 119.0        | 82         | 2720   | 19.4         | 82         | 1      | chevy s-10                |

392 rows × 9 columns

import statsmodels.api as sm

y_initial = data["mpg"]
X_initial = data[["cylinders", "weight", "model year"]]

initial_model = sm.OLS(y_initial, sm.add_constant(X_initial))
initial_results = initial_model.fit()
initial_results.rsquared_adj
0.8069069309563753
initial_results.params
const        -13.907606
cylinders     -0.151729
weight        -0.006366
model year     0.752020
dtype: float64

You are preparing to present these findings to your stakeholders when you realize that all of the units are imperial, but your stakeholders are used to the metric system. None of your coefficients are going to make any sense to them!

To address this issue, you can apply scaling to the variables. This just means multiplying the variables by an appropriate value. We'll use pandas broadcasting to multiply everything in a column at once.

Scaling a Feature

First, we'll scale the weight predictor so that the units are kilograms rather than pounds.

X_metric = X_initial.copy()
# 1 lb = 0.453592 kg
X_metric["weight"] = X_metric["weight"] * 0.453592

X_metric
|     | cylinders | weight      | model year |
|-----|-----------|-------------|------------|
| 0   | 8         | 1589.386368 | 70         |
| 1   | 8         | 1675.115256 | 70         |
| 2   | 8         | 1558.542112 | 70         |
| 3   | 8         | 1557.181336 | 70         |
| 4   | 8         | 1564.438808 | 70         |
| ... | ...       | ...         | ...        |
| 387 | 4         | 1265.521680 | 82         |
| 388 | 4         | 966.150960  | 82         |
| 389 | 4         | 1040.993640 | 82         |
| 390 | 4         | 1190.679000 | 82         |
| 391 | 4         | 1233.770240 | 82         |

392 rows × 3 columns

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.histplot(data=X_initial, x="weight", label="Imperial", ax=ax, color="blue")
sns.histplot(data=X_metric, x="weight", label="Metric", color="orange", ax=ax)
ax.legend();

(Figure: overlaid histograms of car weight in pounds (Imperial) vs. kilograms (Metric))

kg_model = sm.OLS(y_initial, sm.add_constant(X_metric))
kg_results = kg_model.fit()

print(f"""
Initial model adjusted R-Squared:      {initial_results.rsquared_adj}
Weight in kg model adjusted R-Squared: {kg_results.rsquared_adj}
""")
Initial model adjusted R-Squared:      0.8069069309563753
Weight in kg model adjusted R-Squared: 0.8069069309563753

We have just built the "same" model, as you can see from the comparison of adjusted R-Squared values. But now let's look at the coefficients:

initial_results.params
const        -13.907606
cylinders     -0.151729
weight        -0.006366
model year     0.752020
dtype: float64
kg_results.params
const        -13.907606
cylinders     -0.151729
weight        -0.014034
model year     0.752020
dtype: float64

They are all the same except for weight. The coefficient for weight still has the same sign (negative), but it now represents weight on a different scale.

The initial model is saying:

For each increase of 1 lb in weight, we see an associated decrease of about .006 in MPG

The second model is saying:

For each increase of 1 kg in weight, we see an associated decrease of about .014 in MPG

This is telling you the exact same information, just expressed in different units.

We actually could have calculated this without building an entirely new model! We just apply the inverse of the feature's transformation to the coefficient.

kg_results.params["weight"]
-0.014033972159816015
initial_results.params["weight"] / 0.453592
-0.014033972159816346

Scaling the Target

But you'll notice that even though we adjusted the units of weight, the target is still measured in miles per gallon, which is an imperial unit; the conventional metric unit for fuel economy is kilometers per liter. For this to make sense to our stakeholders, we need to make sure that all of the units are metric, not imperial.

So let's transform the units of y as well:

# 1 mpg ≈ 0.425144 km/L
y_metric = data["mpg"] * 0.425144
# "mpg" is no longer an accurate name, so rename
y_metric.name = "km/L"

y_metric
0       7.652592
1       6.377160
2       7.652592
3       6.802304
4       7.227448
         ...    
387    11.478888
388    18.706336
389    13.604608
390    11.904032
391    13.179464
Name: km/L, Length: 392, dtype: float64
fig, ax = plt.subplots(figsize=(15,5))
sns.histplot(data=y_initial, label="Imperial", ax=ax)
sns.histplot(data=y_metric, label="Metric", color="orange", ax=ax)
ax.legend();

(Figure: overlaid histograms of fuel economy in MPG (Imperial) vs. km/L (Metric))

metric_model = sm.OLS(y_metric, sm.add_constant(X_metric))
metric_results = metric_model.fit()

print(f"""
Initial model adjusted R-Squared:      {initial_results.rsquared_adj}
Weight in kg model adjusted R-Squared: {kg_results.rsquared_adj}
Metric model adjusted R-Squared:       {metric_results.rsquared_adj}
""")
Initial model adjusted R-Squared:      0.8069069309563753
Weight in kg model adjusted R-Squared: 0.8069069309563753
Metric model adjusted R-Squared:       0.8069069309563753
kg_results.params
const        -13.907606
cylinders     -0.151729
weight        -0.014034
model year     0.752020
dtype: float64
metric_results.params
const        -5.912735
cylinders    -0.064507
weight       -0.005966
model year    0.319717
dtype: float64

We are still getting the same adjusted R-Squared, but our coefficients look quite different now that we've transformed the target.

Interpreting the weight coefficient specifically, the second model was saying:

For each increase of 1 kg in weight, we see an associated decrease of about .014 in MPG

Whereas the model with a transformed target was saying:

For each increase of 1 kg in weight, we see an associated decrease of about 0.006 in km/L

Again, this is the same information, except now both the predictor and the target are expressed in metric units.

Once again we could have just transformed the coefficients rather than building a new model, but the math gets more complicated:

metric_results.params["weight"]
-0.00596645905991282
# For target transformations, don't invert
# --> We multiplied the target, so multiply the coefficient
kg_results.params["weight"] * 0.425144
-0.00596645905991282
# To go from the original values, both divide and multiply
# --> We multiplied the predictor, so divide the coefficient
# --> We multiplied the target, so multiply the coefficient
initial_results.params["weight"] / 0.453592 * 0.425144
-0.005966459059912961

The key takeaway here is that when you apply scaling, you are building the same model, except it is expressed using different units. You can change the units either before or after fitting the model.
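One way to convince yourself that these really are the same model is to compare their fitted values after converting to a common unit. Here is a minimal sketch using the results objects defined above:

import numpy as np

# Fitted values from the all-metric model, converted from km/L back to MPG
metric_fitted_in_mpg = metric_results.fittedvalues / 0.425144

# Fitted values from the original imperial model (already in MPG)
initial_fitted = initial_results.fittedvalues

# These agree up to floating point error, confirming it is the "same" model
np.allclose(metric_fitted_in_mpg, initial_fitted)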

Shifting

While scaling means multiplying or dividing a variable by some value, shifting means adding or subtracting a value from the variable. Scaling a predictor changes that predictor's coefficient, whereas shifting a predictor changes only the constant coefficient (i.e. the intercept).

Shifting to Improve Dataset Interpretability

One example of shifting might be if we wanted to change the values of model year so that instead of being "years since 1900" (e.g. 70) they are just "years CE" (e.g. 1970).

X_years_ce = X_initial.copy()
X_years_ce["model year"] = X_years_ce["model year"] + 1900
X_years_ce
|     | cylinders | weight | model year |
|-----|-----------|--------|------------|
| 0   | 8         | 3504   | 1970       |
| 1   | 8         | 3693   | 1970       |
| 2   | 8         | 3436   | 1970       |
| 3   | 8         | 3433   | 1970       |
| 4   | 8         | 3449   | 1970       |
| ... | ...       | ...    | ...        |
| 387 | 4         | 2790   | 1982       |
| 388 | 4         | 2130   | 1982       |
| 389 | 4         | 2295   | 1982       |
| 390 | 4         | 2625   | 1982       |
| 391 | 4         | 2720   | 1982       |

392 rows × 3 columns

fig, ax = plt.subplots(figsize=(15,5))
sns.histplot(data=X_initial, x="model year", label="Years since 1900", ax=ax)
sns.histplot(data=X_years_ce, x="model year", label="Years CE", color="orange", ax=ax)
ax.legend();

(Figure: overlaid histograms of model year as years since 1900 vs. years CE)

This makes the dataset easier to understand, and potentially avoids some Y2K-type errors. The resulting model has the same adjusted R-Squared and the same coefficients for everything except const:

years_ce_model = sm.OLS(y_initial, sm.add_constant(X_years_ce))
years_ce_results = years_ce_model.fit()

print(f"""
Initial model adjusted R-Squared:     {initial_results.rsquared_adj}
Years in CE model adjusted R-Squared: {years_ce_results.rsquared_adj}
""")
Initial model adjusted R-Squared:     0.8069069309563753
Years in CE model adjusted R-Squared: 0.8069069309563753
initial_results.params
const        -13.907606
cylinders     -0.151729
weight        -0.006366
model year     0.752020
dtype: float64
years_ce_results.params
const        -1442.745699
cylinders       -0.151729
weight          -0.006366
model year       0.752020
dtype: float64

The intercept (const coefficient) is not particularly interpretable either way.

In the first model, it is saying:

For a car with 0 cylinders, weighing 0 lbs, and built in 1900, we expect an MPG of about -14

In the newest model, it is saying:

For a car with 0 cylinders, weighing 0 lbs, and built in the year 0, we expect an MPG of about -1443

Neither of those hypothetical cars is particularly realistic, so we haven't really "broken" anything by making our dataset more interpretable.

It is possible to compute this change after the fact, though the math is a bit more involved than for scaling (a minimal sketch for this example is shown below). In general, if you want to shift your data, it makes more sense to do it before building the model.
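Here is that sketch for this particular example: when a predictor is shifted by a constant, its slope is unchanged and the intercept moves by (slope × shift).

# "model year" was shifted by +1900, so the new intercept is the old intercept
# minus (model year coefficient * 1900); all of the slopes stay the same
initial_results.params["const"] - initial_results.params["model year"] * 1900
# This matches the const value in years_ce_results.params (about -1442.75)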

Shifting to Improve Intercept Interpretability

In all of the examples so far, the intercept has been an impossible value (negative fuel economy) resulting from an impossible set of predictor values (e.g. weight of 0). What if we want to calculate a more interpretable intercept instead?

To do this we'll shift the predictors so that a value of 0 represents the mean rather than representing 0. This specific approach is typically called zero-centering (or simply centering) and some machine learning models work much better with data centered around 0.

Note that this is essentially the opposite approach of shifting to make the dataset more interpretable. You will need to consider whether an interpretable dataset matters more, or an interpretable intercept matters more, for your particular context. You also might find that you need to build and report on multiple models to express these different aspects.

X_initial.describe()
|       | cylinders  | weight      | model year |
|-------|------------|-------------|------------|
| count | 392.000000 | 392.000000  | 392.000000 |
| mean  | 5.471939   | 2977.584184 | 75.979592  |
| std   | 1.705783   | 849.402560  | 3.683737   |
| min   | 3.000000   | 1613.000000 | 70.000000  |
| 25%   | 4.000000   | 2225.250000 | 73.000000  |
| 50%   | 4.000000   | 2803.500000 | 76.000000  |
| 75%   | 8.000000   | 3614.750000 | 79.000000  |
| max   | 8.000000   | 5140.000000 | 82.000000  |
X_centered = X_initial.copy()

for col in X_centered.columns:
    X_centered[col] = X_centered[col] - X_centered[col].mean()
    
X_centered.describe()
|       | cylinders     | weight        | model year    |
|-------|---------------|---------------|---------------|
| count | 3.920000e+02  | 3.920000e+02  | 3.920000e+02  |
| mean  | -7.250436e-17 | 3.712223e-14  | -4.640279e-15 |
| std   | 1.705783e+00  | 8.494026e+02  | 3.683737e+00  |
| min   | -2.471939e+00 | -1.364584e+03 | -5.979592e+00 |
| 25%   | -1.471939e+00 | -7.523342e+02 | -2.979592e+00 |
| 50%   | -1.471939e+00 | -1.740842e+02 | 2.040816e-02  |
| 75%   | 2.528061e+00  | 6.371658e+02  | 3.020408e+00  |
| max   | 2.528061e+00  | 2.162416e+03  | 6.020408e+00  |
fig, axes = plt.subplots(nrows=3, figsize=(15,15))

for index, col in enumerate(X_initial.columns):
    sns.histplot(data=X_initial, x=col, label="Initial", ax=axes[index])
    sns.histplot(data=X_centered, x=col, label="Centered", color="orange", ax=axes[index])
    axes[index].legend()

(Figure: histograms of each predictor, initial vs. centered)

Note that the means of each column went from about 5.5, about 3k, and about 76 to being about 0 for each. (Due to floating point rounding the actual means are very small positive or negative values, but you can consider them to equal 0.)

The counts and standard deviations are the same, while the minimum, maximum, and percentile values have shifted to reflect the new mean.

Let's build a model with these centered predictors:

centered_model = sm.OLS(y_initial, sm.add_constant(X_centered))
centered_results = centered_model.fit()

print(f"""
Initial model adjusted R-Squared:  {initial_results.rsquared_adj}
Centered model adjusted R-Squared: {centered_results.rsquared_adj}
""")
Initial model adjusted R-Squared:  0.8069069309563753
Centered model adjusted R-Squared: 0.8069069309563753
initial_results.params
const        -13.907606
cylinders     -0.151729
weight        -0.006366
model year     0.752020
dtype: float64
centered_results.params
const         23.445918
cylinders     -0.151729
weight        -0.006366
model year     0.752020
dtype: float64

As expected, our coefficients for the predictors are the same. For example, for each increase of 1 lb in weight, we see an associated decrease of about 0.006 in MPG.

However we now have a more meaningful intercept. In our initial model, the intercept interpretation was this:

For a car with 0 cylinders, weighing 0 lbs, and built in 1900, we would expect an MPG of about -13.9

That is an impossible MPG, for an impossible car.

In our zero-centered model, the intercept interpretation is this:

For a car with the average number of cylinders, average weight, and average model year, we would expect an MPG of about 23.4

That makes a lot more sense! Now the intercept is something that might be worth reporting to stakeholders.

However you should also consider that this "average" car might be impossible as well. For example, if we look at the cylinders average, it is:

data["cylinders"].mean()
5.471938775510204

Can a car actually have 5.5 cylinders? Probably not! So this intercept interpretation is really only 100% realistic if all of the predictors are continuous variables. But you still may find it relevant for stakeholders, so long as you report it with the right caveats.
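If you want to report a number that corresponds to a realistic car, one option is to predict at concrete, plausible predictor values instead of the raw means. Here is a minimal sketch; the choice of 4 cylinders is just an illustrative assumption.

# Hypothetical "typical" car: 4 cylinders, average weight, average model year
typical_car = pd.DataFrame({
    "const": [1.0],
    "cylinders": [4],
    "weight": [data["weight"].mean()],
    "model year": [data["model year"].mean()],
})

# Slightly higher than the centered intercept of about 23.4, since 4 cylinders
# is below the mean and the cylinders coefficient is negative
initial_results.predict(typical_car)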

Standardizing: Centering + Scaling

Standardization is a combination of zero-centering the variables and dividing by the standard deviation.

$$x' = \dfrac{x - \bar x}{\sigma}$$

After performing this transformation, $x'$ will have a mean of 0 and a standard deviation of 1.

X_initial.describe()
|       | cylinders  | weight      | model year |
|-------|------------|-------------|------------|
| count | 392.000000 | 392.000000  | 392.000000 |
| mean  | 5.471939   | 2977.584184 | 75.979592  |
| std   | 1.705783   | 849.402560  | 3.683737   |
| min   | 3.000000   | 1613.000000 | 70.000000  |
| 25%   | 4.000000   | 2225.250000 | 73.000000  |
| 50%   | 4.000000   | 2803.500000 | 76.000000  |
| 75%   | 8.000000   | 3614.750000 | 79.000000  |
| max   | 8.000000   | 5140.000000 | 82.000000  |
X_standardized = X_initial.copy()

for col in X_standardized:
    X_standardized[col] = (X_standardized[col] - X_standardized[col].mean()) \
                            / X_standardized[col].std()
    

X_standardized.describe()
|       | cylinders     | weight        | model year    |
|-------|---------------|---------------|---------------|
| count | 3.920000e+02  | 3.920000e+02  | 3.920000e+02  |
| mean  | -7.250436e-17 | 3.625218e-17  | -1.232574e-15 |
| std   | 1.000000e+00  | 1.000000e+00  | 1.000000e+00  |
| min   | -1.449152e+00 | -1.606522e+00 | -1.623241e+00 |
| 25%   | -8.629108e-01 | -8.857216e-01 | -8.088504e-01 |
| 50%   | -8.629108e-01 | -2.049490e-01 | 5.540071e-03  |
| 75%   | 1.482053e+00  | 7.501341e-01  | 8.199306e-01  |
| max   | 1.482053e+00  | 2.545808e+00  | 1.634321e+00  |
fig, axes = plt.subplots(nrows=3, figsize=(15,15))

for index, col in enumerate(X_initial.columns):
    sns.histplot(data=X_initial, x=col, label="Initial", ax=axes[index])
    sns.histplot(
        data=X_standardized,
        x=col,
        label="Standardized",
        color="orange",
        ax=axes[index]
    )
    axes[index].legend()

(Figure: histograms of each predictor, initial vs. standardized)

In linear regression analysis, the most common reason for standardizing data is so that you can compare the coefficients to each other.

In our centered model, the coefficients are all using different units:

centered_results.params
const         23.445918
cylinders     -0.151729
weight        -0.006366
model year     0.752020
dtype: float64

The model year coefficient has the largest magnitude, but can we say that model year "matters most" or "has the most impact"? Probably not, because it is measured in years whereas the other features are measured in cylinders and pounds. How can we compare those?

Standardization changes the units of the coefficients so that they are in standard deviations rather than the specific units of each predictor. This allows us to make just that comparison:

standardized_model = sm.OLS(y_initial, sm.add_constant(X_standardized))
standardized_results = standardized_model.fit()

print(f"""
Centered model adjusted R-Squared:     {centered_results.rsquared_adj}
Standardized model adjusted R-Squared: {standardized_results.rsquared_adj}
""")
Centered model adjusted R-Squared:     0.8069069309563753
Standardized model adjusted R-Squared: 0.8069069309563752
standardized_results.params
const         23.445918
cylinders     -0.258817
weight        -5.407040
model year     2.770244
dtype: float64

We have the same intercept as the zero-centered model (since this model's features were also centered), but now the coefficients look quite different. We can interpret them like this:

For each increase of 1 standard deviation in the number of cylinders, we see an associated decrease of about 0.26 MPG

For each increase of 1 standard deviation in the weight, we see an associated decrease of about 5.4 MPG

For each increase of 1 standard deviation in the model year, we see an associated increase of about 2.8 MPG

Comparing these three, we can conclude that weight "is the most important" or "has the most impact" because it has the largest standardized coefficient (in absolute value). This might be surprising because the previous model had the smallest coefficient for weight, but that was because weight is measured in pounds, with a much larger standard deviation than the other two predictors:

data["weight"].std()
849.4025600429492
data["cylinders"].std()
1.7057832474527845
data["model year"].std()
3.6837365435778295

(Every model is different, and sometimes the largest coefficient before standardizing will also be the largest after standardizing. This is just an example of how much of a difference it can make!)

Also, just like you can get transformed coefficients from un-transformed data by applying the inverse of the transformation, you can get un-transformed coefficients from transformed data by applying the same transformation to the coefficient.

For example, let's say you have this standardized model as your final model, because you knew that stakeholders would want to know which feature was most important:

standardized_results.params
const         23.445918
cylinders     -0.258817
weight        -5.407040
model year     2.770244
dtype: float64

You have answered the question about which is most important (weight) but now the stakeholder wants you to interpret the coefficient. You start to say "Each increase of 1 standard deviation..." but that is too confusing. A typical business stakeholder might not have a clear sense of what a "standard deviation" is.

Fortunately, to convert those coefficients back to the un-transformed units (i.e. cylinders, pounds, and years respectively), you can just divide each of them by the corresponding standard deviation:

standardized_results.params["cylinders"] / data["cylinders"].std()
-0.15172901259381377
standardized_results.params["weight"] / data["weight"].std()
-0.006365697499915229
standardized_results.params["model year"] / data["model year"].std()
0.7520200488347161

These are now the same as the initial model params!

initial_results.params[1:]
cylinders    -0.151729
weight       -0.006366
model year    0.752020
dtype: float64

Other Popular Transformations

We won't build any models with them here, but it's helpful to be aware that there are other transformations you might see; a brief code sketch of all three follows the descriptions below.

Min-Max Scaling

When performing min-max scaling, you transform $x$ into the transformed $x'$ using the formula:

$$x' = \dfrac{x - \min(x)}{\max(x)-\min(x)}$$

This way of scaling brings all values between 0 and 1. There are also implementations that scale to ranges other than 0 to 1 by performing an extra scaling step.

Mean Normalization

When performing mean normalization, you use the following formula:

$$x' = \dfrac{x - \text{mean}(x)}{\max(x)-\min(x)}$$

The distribution will have values between -1 and 1, and a mean of 0.

Unit Vector Transformation

When performing unit vector transformations, you use the following formula:

$$x'= \dfrac{x}{{||x||}}$$

Recall that the norm of $x$ is $||x|| = \sqrt{x_1^2 + x_2^2 + ... + x_n^2}$.

This will also create a distribution between 0 and 1.
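As promised above, here is a brief pandas sketch of all three formulas, applied column-wise to the predictors from earlier. This is just to illustrate the calculations; these transformed versions are not used in any of the models in this lesson.

# Min-max scaling: each column is mapped into the range 0 to 1
X_min_max = (X_initial - X_initial.min()) / (X_initial.max() - X_initial.min())

# Mean normalization: each column has a mean of 0 and values between -1 and 1
X_mean_norm = (X_initial - X_initial.mean()) / (X_initial.max() - X_initial.min())

# Unit vector transformation: divide each column by its Euclidean norm
X_unit = X_initial / ((X_initial ** 2).sum()) ** 0.5

X_min_max.describe()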

Additional Information

Scikit-learn provides tools to apply various feature transformations:

Have a look at the built-in functions and code examples in the scikit-learn preprocessing documentation.
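For example, here is a minimal sketch using two of scikit-learn's preprocessing classes (this assumes scikit-learn is installed; it isn't used elsewhere in this lesson):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization; note that StandardScaler divides by the population standard
# deviation, so results differ very slightly from the pandas .std() version above
X_standardized_sklearn = StandardScaler().fit_transform(X_initial)

# Min-max scaling to the range 0 to 1
X_min_max_sklearn = MinMaxScaler().fit_transform(X_initial)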

To learn more about feature scaling in general, you can have a look at this blog post: https://sebastianraschka.com/Articles/2014_about_feature_scaling.html (up until "bottom-up approaches").

Summary

In this lesson, you learned about why linear transformations are useful, how to apply them, and how to interpret the resulting coefficients.
