etownbetty / real_estate_analysis

Analysis and recommendations for most important features to predict housing prices based on a publicly-available data set

Real Estate Analysis

The data set contains a number of variables that are potentially related to the price of houses in the SF Bay Area, along with the outcome variable, price (the price at last sale). A quick look shows variables related to house and lot size (sqft_living, sqft_above, sqft_basement, sqft_lot, sqft_living15, sqft_lot15), location, whether the house has a view or is on the waterfront, whether the house was renovated, and some general condition information. All of these could be related to house value, but a model can show objectively whether each variable is important in determining the sale price, and how important. I engineered some variables that could also be predictive of house price: the age of the house and whether it was renovated (as opposed to when it was renovated). I also collapsed the zip code variable into larger bins based on the first 4 digits of the zip code, which groups houses in nearby zip codes.
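The engineered features could be built along these lines. This is a minimal sketch, not the notebook's actual code: the column names yr_built, yr_renovated and zipcode, the sale year, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame, sale_year: int) -> pd.DataFrame:
    """Add age, renovated, and ziplarge columns (input column names are assumed)."""
    out = df.copy()
    # Age of the house at the (assumed) year of sale.
    out["age"] = sale_year - out["yr_built"]
    # Binary flag: was the house ever renovated? A value of 0 is assumed to mean "never".
    out["renovated"] = (out["yr_renovated"] > 0).astype(int)
    # Collapse 5-digit zip codes to their first 4 digits.
    out["ziplarge"] = out["zipcode"].astype(str).str.zfill(5).str[:4]
    return out


# Made-up rows, only to show the shape of the result.
houses = pd.DataFrame({
    "yr_built": [1955, 2001, 1987],
    "yr_renovated": [1991, 0, 0],
    "zipcode": [94110, 94112, 94301],
})
features = engineer_features(houses, sale_year=2015)
print(features[["age", "renovated", "ziplarge"]])
```

Keeping ziplarge as a string makes it easy to treat as a categorical variable later, rather than as a number.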

Exploratory

Before fitting a model, I wanted to see how the variables were distributed and if there were any apparent relationships between them and the outcome price variable. I plotted histograms of the continuous variables, a scatterplot matrix of the continuous variables and box plots of the categorical variables with respect to the sale price of the houses.

A simple model

As a starting point, I fit a model of price vs. sqft. It had an R-squared value of 0.492 and showed that sqft was a significant predictor of house price. On inspection of the residual plot, however, there is a pattern to the residuals: the larger the house price, the larger the errors, as seen below.
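A one-feature fit and the residual check can be sketched as follows. The data here is synthetic (generated with multiplicative noise so that errors grow with price), so the numbers will not match the 0.492 reported above; the function name fit_simple_ols is hypothetical.

```python
import numpy as np


def fit_simple_ols(x, y):
    """Fit y = b0 + b1 * x by least squares; return coefficients, fitted values, residuals, R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept column plus the feature
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    residuals = y - fitted
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return coef, fitted, residuals, r_squared


# Synthetic stand-in for the housing data (an assumption, not the real data set).
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 5000, size=2000)
price = np.exp(12 + 0.0004 * sqft + rng.normal(0, 0.3, size=2000))

coef, fitted, residuals, r_squared = fit_simple_ols(sqft, price)

# Compare residual spread for cheaper vs. more expensive predicted prices.
upper = fitted > np.median(fitted)
spread_low = residuals[~upper].std()
spread_high = residuals[upper].std()
print(f"R^2 = {r_squared:.3f}")
print(f"residual std (lower half) = {spread_low:,.0f}")
print(f"residual std (upper half) = {spread_high:,.0f}")
```

A clearly larger spread in the upper half is the numeric counterpart of the fan shape in a residual plot.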

Simple Model Residuals

So I took a log transform of the house price and refit the model. The sqft feature stayed significant, and the residuals were more randomly distributed around a mean of zero, with the exception of some outliers, which were also seen in the original residual plot.
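The effect of the log transform on the residuals can be illustrated on synthetic data. This is a sketch under the assumption that price errors are roughly multiplicative; the helper residual_spread_ratio is hypothetical and simply compares residual spread between the upper and lower halves of the fitted values.

```python
import numpy as np

# Synthetic stand-in for the housing data (an assumption, not the real data set).
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 5000, size=2000)
price = np.exp(12 + 0.0004 * sqft + rng.normal(0, 0.3, size=2000))


def residual_spread_ratio(x, y):
    """Ratio of residual std above vs. below the median fitted value (1.0 means even spread)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted
    upper = fitted > np.median(fitted)
    return residuals[upper].std() / residuals[~upper].std()


ratio_raw = residual_spread_ratio(sqft, price)
ratio_log = residual_spread_ratio(sqft, np.log(price))
print(f"spread ratio, raw price: {ratio_raw:.2f}")
print(f"spread ratio, log price: {ratio_log:.2f}")
```

A ratio close to 1 after the transform corresponds to residuals that are evenly scattered around zero.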

Simple Model Log-transformed Residuals

A full model

I then added features that appeared to have a relatively linear relationship with the log price variable, according to the correlation plot matrix and the box plots against the log price outcome, and that were not redundant with each other. The features I added (age of the house, whether it had been renovated, the grade, the view, and waterfront property status) were all significantly associated with the log price outcome, and the model explained more of the variance: the adjusted R-squared went up to 0.638. The residual plot showed that the errors were randomly distributed around a mean of zero, and the outliers appear to be better explained by the more complex model.
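For reference, adjusted R-squared penalizes plain R-squared for the number of predictors, which is why it is the fairer measure when features are added. A minimal sketch of the standard formula (the function name and the example inputs are hypothetical):

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# With many observations and few predictors the penalty is small.
print(adjusted_r_squared(0.64, n_obs=1000, n_predictors=6))
```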

Full Model Log-transformed Residuals

A complex model

I also wanted to add some information about the location of the houses with respect to price, so I added the simplified zip code as a categorical variable in my model. This further improved the amount of variance explained, with an adjusted R-squared value of 0.691.
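One common way to enter a categorical variable such as the simplified zip code into a linear model is one-hot (dummy) encoding with one level dropped as the baseline. The sketch below assumes pandas and uses made-up rows; it is not necessarily how the notebook encodes the variable.

```python
import pandas as pd

# Made-up rows; ziplarge holds the first 4 digits of the zip code as a string.
houses = pd.DataFrame({
    "ln_price": [13.1, 13.4, 14.0, 13.2],
    "ziplarge": ["9411", "9430", "9458", "9411"],
})

# One indicator column per zip group; the first group is dropped and acts as the baseline.
zip_dummies = pd.get_dummies(houses["ziplarge"], prefix="zip", drop_first=True, dtype=int)
design = pd.concat([houses.drop(columns="ziplarge"), zip_dummies], axis=1)
print(design)
```

Dropping one level avoids perfect collinearity with the intercept; each remaining coefficient is then read relative to the baseline zip group.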

Lasso for the win

I cross-checked my model using a lasso procedure to do feature selection from all the available features. Using alpha=0.01, I found that every feature in my complex model was significant and that none of them were redundant with each other. I also added a bathroom feature (the number of bathrooms), which further increased the adjusted R-squared value to 0.696.
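Lasso-based feature selection can be sketched with scikit-learn as below. The library choice, the synthetic data, and the feature names noise_a and noise_b are assumptions for illustration; only alpha=0.01 comes from the text above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features and two pure-noise features (all made up).
rng = np.random.default_rng(0)
n = 2000
names = ["sqft_living", "grade", "noise_a", "noise_b"]
X = rng.normal(size=(n, len(names)))
ln_price = 13.0 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.05, size=n)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.01).fit(X_scaled, ln_price)

# Features whose coefficients were shrunk exactly to zero are dropped.
selected = [name for name, c in zip(names, model.coef_) if c != 0.0]
print(dict(zip(names, model.coef_.round(3))))
print("selected:", selected)
```

The L1 penalty tends to zero out coefficients of uninformative or redundant features, which is what makes the surviving set a useful cross-check on a hand-built model.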

My final model to predict housing prices is:

ln_price = bathrooms + age + renovated + sqft_living + grade + view + ziplarge + waterfront

Where ziplarge is my super-zip code feature, i.e. the first 4 digits of the zip code.
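Because the outcome is the natural log of price, predictions and coefficients need a small conversion to read in dollar terms. A minimal sketch, with hypothetical function names and a hypothetical coefficient value (the fitted coefficients are not listed here):

```python
import math


def price_from_log(ln_price: float) -> float:
    """Convert a predicted ln_price back to the price scale."""
    return math.exp(ln_price)


def percent_change_per_unit(coefficient: float) -> float:
    """Approximate percent change in price for a one-unit increase in a feature."""
    return (math.exp(coefficient) - 1.0) * 100.0


hypothetical_coef = 0.05  # illustrative only, not a fitted value
print(f"{percent_change_per_unit(hypothetical_coef):.1f}% change in price per unit")
print(f"predicted price: {price_from_log(13.2):,.0f}")
```

Note that exponentiating a log-scale prediction gives an estimate closer to the median price than the mean when the errors are roughly normal on the log scale.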
