This is the final project for the CEBD-1160 course, based on the Boston housing dataset.
Name | Date |
---|---|
Arwa Sheraky | 28 March, 2019 |
This repository includes:
- Simple pseudocode: `Pseudocode.md`
- Python script for the Boston housing data analysis: `boston_analysis.py`
- Result figures: `Figures/`
- Dockerfile for the experiment: `Dockerfile`
- Runtime instructions: `RUNME.md`
Knowing the average prices of houses and the features that could affect them, can we predict the average prices of new houses from their 13 features? How accurate would that prediction be?
The dataset used in this project is publicly available among the scikit-learn datasets and can be imported directly into any Python application from the `sklearn` library.
The data was collected in suburbs of Boston in the 1970s and contains 506 instances, each with 13 features and one target value (MEDV):
- CRIM: Per capita crime rate by town
- ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
- INDUS: Proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: Nitric oxides concentration (parts per 10 million)
- RM: Average number of rooms per dwelling
- AGE: Proportion of owner-occupied units built prior to 1940
- DIS: Weighted distances to five Boston employment centers
- RAD: Index of accessibility to radial highways
- TAX: Full-value property tax rate per $10,000
- PTRATIO: Pupil-teacher ratio by town
- B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
- LSTAT: Percentage of lower status of the population
- MEDV: Median value of owner-occupied homes in $1000s (the prediction target)
A simple pseudocode outline of this analysis is available in `Pseudocode.md`. The main goal of this project is to build a strong regression model that predicts house prices from the 13 features of each house, based on existing training data. The following explains how we tried to achieve this goal.
By applying different regressors and comparing their performance using R-squared and MSE, we can find the one best suited to the problem. The regressors used are:
- Linear Regression
- Bayesian Ridge
- Lasso
- Gradient Boosting
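The comparison described above (fit several regressors, then score each with MSE and R-squared on held-out data) could be sketched roughly as follows. This is a minimal illustration, not the code in `boston_analysis.py`: the synthetic data from `make_regression`, the 75/25 split, and the `random_state` values are all assumptions made so that the snippet runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with the same shape as the Boston set
# (506 rows, 13 features). Replace X and y with the real features and MEDV.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Assumption: a 75/25 train/test split; the original split ratio is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regressors = {
    "Linear Regression": LinearRegression(),
    "Bayesian Ridge": BayesianRidge(),
    "Lasso": Lasso(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training data and score it on the held-out test data.
scores = {}
for name, model in regressors.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }

# The candidate with the lowest test MSE is treated as the best one.
best_name = min(scores, key=lambda n: scores[n]["mse"])

for name, s in scores.items():
    print(f"{name}: MSE={s['mse']:.3f}, R2={s['r2']:.3f}")
print("Lowest MSE:", best_name)
```

Because the stand-in data is generated by a linear model, the ranking printed here says nothing about the ranking on the housing data; only the procedure carries over.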
After that, we used cross-validation to make the evaluation of the model more robust. With `k-fold = [3, 5, 7, 10, 15, 20]`, we chose the most accurate k and built the final model using this information (regressor and k).
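One possible way to pick k from the list above is to compute cross-validated predictions for each value and keep the k with the lowest MSE. The sketch below assumes synthetic stand-in data, shuffled folds with a fixed seed, and a reduced `n_estimators=50` to keep the run short; none of these settings are taken from the original script.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

# Assumption: synthetic stand-in data shaped like the Boston set (506 x 13).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

k_values = [3, 5, 7, 10, 15, 20]
mse_by_k = {}

for k in k_values:
    # Assumption: shuffled folds with a fixed seed for repeatability.
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # Assumption: fewer trees than the default so the sketch runs quickly.
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    # Each sample is predicted by a model that did not see it during fitting.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    mse_by_k[k] = mean_squared_error(y, y_pred)

# Keep the number of folds with the lowest cross-validated MSE.
best_k = min(mse_by_k, key=mse_by_k.get)

for k, mse in mse_by_k.items():
    print(f"k={k}: MSE={mse:.3f}")
print("Best k:", best_k)
```

Differences between values of k are often small and depend on how the folds are drawn, so the selected k may change with a different seed.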
Model | Linear Regression | Bayesian Ridge | Lasso | Gradient Boosting |
---|---|---|---|---|
MSE | 24.275 | 25.968 | 32.288 | 8.355 |
As shown in the previous subplots and the performance table, the Gradient Boosting Regressor predicted the average prices closest to the real values, with the lowest MSE and the highest R-squared. The gap is wide: its MSE of 8.355 is roughly a third of the next best value, 24.275 for Linear Regression.
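For reference, the two metrics used in this comparison can be computed by hand. The sketch below uses plain Python and a few made-up prices (in $1000s) purely for illustration; the helper names `mse` and `r_squared` are hypothetical, and in practice `sklearn.metrics` provides `mean_squared_error` and `r2_score`.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only, not taken from the dataset results.
y_true = [20.0, 22.0, 30.0, 36.0]
y_pred = [21.0, 21.0, 31.0, 35.0]

print(mse(y_true, y_pred))                  # 1.0
print(f"{r_squared(y_true, y_pred):.3f}")   # 0.976
```

A lower MSE means smaller errors in the units of the target squared, while an R-squared closer to 1 means the model explains more of the variance in the prices.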
By applying cross-validation to the chosen regressor and comparing 6 different values of k, we found that `k = 10` gives the lowest MSE and the most accurate results.
According to the previous observations and calculations, the best regressor is Gradient Boosting and the best number of folds for cross-validation is `k = 10`. Using this information, we can build the model that predicts the prices as closely as possible. The plot below shows the final model, with `MSE = 17.062`:
- The algorithm behind the Gradient Boosting regressor can be found here, and how the regressor is used in Python is described in the scikit-learn documentation mentioned above.
- The cross-validation method used here is `cross_val_predict`, provided by scikit-learn.
Gradient Boosting was the best at solving the problem among 4 randomly chosen regressors, with a Mean Squared Error of `17.062`. There might be better methods that solve this problem with a lower error and better accuracy. These methods could be discovered later by understanding the dataset better and studying the data science field in more depth.
- The main libraries used in this analysis are:
  - `pandas` and `numpy`: Creating DataFrames and calculating the statistical summary.
  - `matplotlib`, `seaborn` and `plotly`: Plotting histograms, scatterplots and regression lines.
  - `sklearn`: Importing the dataset, splitting data, applying regressors and CV, and calculating performance.
All references are included above.