Project: Regression Modeling with the Boston Housing Dataset

Introduction

In this final lab, you'll apply the regression analysis and diagnostics techniques covered in this section to the famous "Boston Housing" dataset. You performed a detailed EDA for this dataset earlier on, and hopefully, you more or less recall how this data is structured! In this lab, you'll use some of the features in this dataset to create a linear model to predict the house price!

Objectives

You will be able to:

Build many linear models with the Boston housing data using OLS
Analyze OLS diagnostics for model validity
Visually explain the results and interpret the diagnostics from Statsmodels
Comment on the goodness of fit for a simple regression model

Let's get started

Import necessary libraries and load 'BostonHousing.csv' as a pandas dataframe

# Your code here
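A minimal sketch of the loading step, assuming the CSV sits in the working directory under the name given above. So that the snippet runs on its own, it reads the first five rows (copied from the preview shown in this lab) from an in-memory string; with the real file you would pass the filename to `pd.read_csv` instead.

```python
import io
import pandas as pd

# First five rows of the dataset, copied from the preview in this lab.
sample_csv = """crim,zn,indus,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0.00632,18.0,2.31,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
0.02731,0.0,7.07,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
0.02729,0.0,7.07,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0.0,2.18,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0.0,2.18,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
"""

# With the real file: df = pd.read_csv('BostonHousing.csv')
df = pd.read_csv(io.StringIO(sample_csv))

print(df.shape)   # rows, columns
print(df.dtypes)  # numeric type of each column
```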

The columns in the Boston housing data represent the dependent and independent variables. The dependent variable here is the median house value, MEDV. The description of the other variables is available on Kaggle.

Inspect the columns of the dataset and comment on type of variables present

# Your code here


	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

# Record your observations here

Create histograms for all variables in the dataset and comment on their shape (uniform or not?)

# Your code here

# Your observations here

Based on this, we preselected some features for you which appear to be more 'normal' than others.

Create a new dataset with `['crim', 'dis', 'rm', 'zn', 'age', 'medv']`

# Your code here
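One way to build the reduced dataset is to index the dataframe with the list of column names. The sketch below uses a small stand-in dataframe (first three rows of the preview, a handful of columns) so it runs on its own; the names `df`, `selected`, and `data` are assumptions, not names required by the lab.

```python
import pandas as pd

# Stand-in for the full dataframe: first three preview rows, a few columns only.
df = pd.DataFrame({
    'crim': [0.00632, 0.02731, 0.02729],
    'zn': [18.0, 0.0, 0.0],
    'indus': [2.31, 7.07, 7.07],
    'rm': [6.575, 6.421, 7.185],
    'age': [65.2, 78.9, 61.1],
    'dis': [4.0900, 4.9671, 4.9671],
    'medv': [24.0, 21.6, 34.7],
})

selected = ['crim', 'dis', 'rm', 'zn', 'age', 'medv']

# Indexing with a list keeps only those columns, in the order given.
# .copy() makes the result independent of df for later modifications.
data = df[selected].copy()

print(data.head())
```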


	crim	dis	rm	zn	age	medv
0	0.00632	4.0900	6.575	18.0	65.2	24.0
1	0.02731	4.9671	6.421	0.0	78.9	21.6
2	0.02729	4.9671	7.185	0.0	61.1	34.7
3	0.03237	6.0622	6.998	0.0	45.8	33.4
4	0.06905	6.0622	7.147	0.0	54.2	36.2

Check for linearity assumption for all chosen features with target variable using scatter plots

# Your code here

# Your observations here

Clearly, your data needs a lot of preprocessing to improve the results. The key behind a Kaggle competition is to process the data in such a way that you can identify the relationships and make predictions in the best possible way. For now, we'll leave the dataset untouched and just move on with the regression. The assumptions are not all exactly fulfilled, but they still hold to a level that lets us move on.

Let's do Regression

Now, let's perform a number of simple regression experiments between the chosen independent variables and the dependent variable (price). You'll do this in a loop and in every iteration, you should pick one of the independent variables. Perform the following steps:

Run a simple OLS regression between independent and dependent variables
Plot a regression line on the scatter plots
Plot the residuals using sm.graphics.plot_regress_exog()
Plot a Q-Q plot for regression residuals normality test
Store the following values in an array for each iteration:
- 'Independent Variable'
- 'r_squared'
- 'intercept'
- 'slope'
- 'p-value'
- 'normality (JB)'
Comment on each output

# Your code here
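The numeric part of the loop described above can be sketched as follows. This is one possible approach, not the lab's reference solution: it runs on synthetic stand-in data (not the Boston values) so that it is self-contained, uses the statsmodels formula API, and leaves out the plotting steps. The layout of `results` mirrors the table printed by `pd.DataFrame(results)` in this lab.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera

# Synthetic stand-in data (NOT the Boston values):
# medv depends linearly on rm plus noise, and is unrelated to age.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    'rm': rng.normal(6.3, 0.7, n),
    'age': rng.uniform(0, 100, n),
})
data['medv'] = -30 + 9 * data['rm'] + rng.normal(0, 5, n)

# Header row first, then one row per independent variable.
results = [['ind_var', 'r_squared', 'intercept', 'slope', 'p-value', 'normality (JB)']]

for col in ['rm', 'age']:
    formula = f'medv~{col}'
    model = smf.ols(formula=formula, data=data).fit()

    # jarque_bera returns (statistic, p-value, skew, kurtosis); keep the statistic.
    jb_stat = jarque_bera(model.resid)[0]

    results.append([
        col,
        model.rsquared,
        model.params['Intercept'],
        model.params[col],
        model.pvalues[col],
        jb_stat,
    ])

print(pd.DataFrame(results))
```

Inside the loop, the diagnostic plots could be added with `sm.graphics.plot_regress_exog(model, col)` and `sm.graphics.qqplot(model.resid, line='45', fit=True)` after `import statsmodels.api as sm`; they are omitted here to keep the sketch free of plotting dependencies.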

Boston Housing DataSet - Regression Analysis and Diagnostics for formula: medv~crim
-------------------------------------------------------------------------------------

Press Enter to continue...
Boston Housing DataSet - Regression Analysis and Diagnostics for formula: medv~dis
-------------------------------------------------------------------------------------

Press Enter to continue...
Boston Housing DataSet - Regression Analysis and Diagnostics for formula: medv~rm
-------------------------------------------------------------------------------------

Press Enter to continue...
Boston Housing DataSet - Regression Analysis and Diagnostics for formula: medv~zn
-------------------------------------------------------------------------------------

Press Enter to continue...
Boston Housing DataSet - Regression Analysis and Diagnostics for formula: medv~age
-------------------------------------------------------------------------------------

Press Enter to continue...

pd.DataFrame(results)


	0	1	2	3	4	5
0	ind_var	r_squared	intercept	slope	p-value	normality (JB)
1	crim	0.15078	24.0331	-0.41519	1.17399e-19	295.404
2	dis	0.0624644	18.3901	1.09161	1.20661e-08	305.104
3	rm	0.483525	-34.6706	9.10211	2.48723e-74	612.449
4	zn	0.129921	20.9176	0.14214	5.71358e-17	262.387
5	age	0.142095	30.9787	-0.123163	1.56998e-18	456.983

# Your observations here

Clearly, the results are not very reliable. The best R-squared is obtained with rm, so in this analysis, this is our best predictor.
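To make the fitted line for rm concrete, the short example below plugs the intercept and slope from the results table into the regression equation for the first row of the dataset preview (rm = 6.575, observed medv = 24.0). The variable names are chosen for illustration only.

```python
# Coefficients for medv~rm, taken from the results table above.
intercept = -34.6706
slope = 9.10211

# First row of the dataset preview: rm = 6.575, observed medv = 24.0.
rm = 6.575
observed = 24.0

# Simple linear regression prediction: y_hat = intercept + slope * x
predicted = intercept + slope * rm

# Residual: observed value minus predicted value.
residual = observed - predicted

print(f'predicted medv: {predicted:.2f}')
print(f'residual: {residual:.2f}')
```

For this row the model overshoots the observed value by a little more than one unit of medv. Individual residuals say little on their own; the R-squared of about 0.48 indicates that rm alone accounts for roughly half of the variance in medv.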

How can you improve these results?

Preprocessing

This is where preprocessing of data comes in. Dealing with outliers, normalizing data, scaling values etc. can help regression analysis get more meaningful results from the given data.
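As a rough illustration of two of these steps, the sketch below removes outliers with the 1.5 × IQR rule of thumb and then standardizes the remaining values. The data are the five crim values from the preview plus one made-up outlier (8.5) added purely for demonstration; the IQR rule is one common choice among several, not a method prescribed by this lab.

```python
import pandas as pd

# Five crim values from the dataset preview plus one made-up outlier (8.5).
s = pd.Series([0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 8.5], name='crim')

# Keep values within 1.5 * IQR of the quartiles (a common rule of thumb).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = s[mask]

# Standardize what remains: zero mean, unit (sample) standard deviation.
scaled = (trimmed - trimmed.mean()) / trimmed.std()

print(trimmed.tolist())
print(scaled.round(3).tolist())
```

Whether dropping, capping, or transforming outliers is appropriate depends on the variable; refitting the regression after each choice shows how it affects the goodness of fit.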

Advanced Analytical Methods

Simple regression is a very basic analysis technique, and trying to fit a straight-line solution to complex analytical questions may prove to be very ineffective. Later on, you'll look at multiple regression, where you can use multiple features at once to define a relationship with the outcome. You'll also look at some preprocessing and data simplification techniques and revisit the Boston dataset with an improved toolkit.

Level up - Optional

Apply some data wrangling skills that you have learned in the previous section to pre-process the set of independent variables we chose above. You can start off with outliers and think of a way to deal with them. See how it affects the goodness of fit.

Summary

In this lab, you applied your skills learned so far on a new data set. You looked at the outcome of your analysis and realized that the data might need some preprocessing to see a clear improvement in results. You'll pick this back up later on, after learning about more preprocessing techniques and advanced modeling techniques.
