Coder Social home page Coder Social logo

home-buying's Introduction

Home Buying

photo

Objective

The housing market is nutoriouosly inefficient. Identifying the true value of a home compared to the price gives us an opportunity to use this inefficiency to our advantage. The goal is to build an algorythm that can identify how much a house is worth so that we can find opportunities for arbitrage when a home hits the market below its worth. If we can move fast enough via web scrapping thousands of for sale records and generating this output, we may be able make money before other bidders drive up the price.

Used tools

Programs:

  1. Jupiter - python
  2. Tableau
  3. Powerpoint

Methods:

  1. Linear Regression
  2. Random Forest
  3. Neural Networks
  4. Web Scrapping
  5. API Connection 
  2. Histograms, distribution plots, heat maps, box-whisker plots, bar plots, Chi squared testing, Hyperparameter tuning

Libraries:

  1. Normalizer
  2. StandardScaler
  3. Min Max Scaler
  4. math
  5. statistics 
  6. matplotlib.pyplot
  7. seaborn
  8. statsmodels.api
  9. train_test_split
  10. one hot encoder
  11. numpy
  12. pandas
  13. statsmodels.formula.api
  14. linear_model
  15. is_numeric_dtype
  16. Redfin
  17. scipy
  18. requests
  19. urllib.parse
  20. chi2_contingency
  21. DBSCAN
  22. LinearRegression
  23. sklearn.metrics
  24. sklearn.feature_selection
  25. sklearn.linear_model
  26. mean_squared_error
  27. r2_score
  28. mean_absolute_error
  29. statsmodels.api 
  30. tensorflow
  31. train_test_split
  32. RandomForestRegressor
  33. make_regression
  34. variance_inflation_factor
  35. GridSearchCV
  36. sklearn.decomposition.PCA

Workflow

  1. Gather the Data 
  2. Explore the Data 
  3. Dealing with Nulls 
  4. Clean the Data 
      a. Categorical data: 
        i. needs to be grouped into fewer buckets
        ii. dropped if there are too many instances
        iii. cleaned if there are various values with the same meaning 
      b. Numerical data: 
        i. needs to be converted to number if object (may require functions for data transformations)
  5. Dealing with Outliers 
  6. Transform numerical columns to conform to a more normal standard distribution
  7. Check Multicollinearity
      a. Numeric- VIF and correlation matrix
      b. Categorical - chi squared testing
  8. Normalize, Min-max, or standardize the data 
  9. Build Clusters using DBScan
  10. Apply Train-Test Split 
  11. Hyoerparameter tuning for feature selection
  12. Train and Test random forest (since extra features don't hinder the model)
  12. Remove extra features based on feature importance
  11. Train Model Linear and Neural Network models
  12. Test Model Linear and Neural Network models

Transformations

Tested a variety of methods to convert my continuos variables into normal distributions:
1. log
2. square root
3. square
4. cubed 
Transformation Numerical Factor
sqrt totalfavoritescount
sqrt taxesdue
sqrt publicschoolrate
sqrt basement_square_feet
sqrt land_square_footage
sqrt acres
sqrt taxrate
sqr totalamenities
none latitude
none longitude
none daysonmarket
none publicstudentteacherratio
none publicstudentcnt
none privatestudentcnt
log squarefeet
log viewcount
log totaltourcount
log privaterate
log garage_parking_square_feet
cbd yearbuilt
cbd yearrenovated

Scaling

Tested a variety of scaling methods:
  1. Normalizer
  2. Standard Scaler
  3. Min โ€“ Max Scaler

Feature Importance:

Feature Regression P-Val Random Forest Feature Importance
daysonmarket 0.518 0.00592214
publicstudentteacherratio 0.344 0.004352648
publicstudentcnt 0.946 0.005487916
privatestudentcnt 0.385 0.006992907
squarefeet 0 0.02421013
totaltourcount 0.161 0.001894489
privaterate 0.128 0.005642923
garage_parking_square_feet 0.009 0.0081063
totalfavoritescount 0 0.008720093
publicschoolrate 0.001 0.006671266
basement_square_feet 0.038 0.001758376
land_square_footage 0.002 0.01105095
acres 0.319 0.02747586
taxrate 0 0.04166951
totalamenities 0.488 0.005520898
x2_Condo&Manufactured 0.19 0.002256557
x2_Multi-Family 0.195 0.001505547
Foundation_Continuous Footing 0 0.02464942
Foundation_other 0 0.007086104
Garage_Finished 0.02 0.002233992
Garage_other 0.328 0.00693243
Building_Quality_0 0 0.4383543
Building Quality_3 0 0.08868372
Walls_0 0.005 0.002632515
Walls_1 0.006 0.004978199
Walls_4 0 0.007193782
Roof_0 0.065 0.06866194
Roof_1 0 0.005233747
Roof_2 0 0.006545969
Roof_3 0 0.1469811
Location_2 0.068 0.00129702
Out of all the methods, Random Forest and Min-max scaler produced the highest accuracy:

photo

Results and conlusions

  The accuracy of the random forest model on test set is: 44,124 and on the for sale set is: 50,423. Assuming there is no difference between the homes in the two sets, it stands to reason the additional discrepenct from test to sale is the market inefficiency.

photo

Dataset

The dataset after gathering the data is finalDatecsv. The dataset after all cleaning and including all the models' predictions is withpredictions.xlsx.

This data set includes:

  1. Home Buying data:

    daysonmarket basement_square_feet
    publicstudentteacherratio land_square_footage
    publicstudentcnt acres
    privatestudentcnt taxrate
    squarefeet totalamenities
    totaltourcount x2_Condo&Manufactured
    privaterate x2_Multi-Family
    garage_parking_square_feet Foundation_Continuous Footing
    totalfavoritescount Foundation_other
    publicschoolrate Garage_Finished

Tableau

A full analysis with visuals breaking down key drivers of home value can be found in Home Buying.

Presentation

A powerpoint containing the agenda and tying it all together can be found in Home Buying.pptx.

Jupyter Notebook

A python code for data gathering can be found in Data Gathering.ipynb. A python code for the analysis can be found in Home-buying-Analysis.ipynb.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.