Purchasing a home is generally one of the largest purchases an individual makes in a lifetime, so understanding the value of a house is an important step in buying or selling one. Zillow, an online real-estate database, allows consumers to research estimated market values in different areas. Zillow calls this estimate a “Zestimate” and calculates it daily from public housing information and user-submitted data. While Zestimates are not available in all areas, Zillow has reported that since its launch the median margin of error has improved from 14% to approximately 5.5%.
In an effort to improve the median margin of error of its algorithm, Zillow has released some of its data for a Kaggle competition. The goal of the competition is to create an algorithm that improves on the current median margin of error of 5.5%. We will work within the goals of the Kaggle competition, using the Kaggle scores for submissions and the median margin of error to judge the effectiveness of the prediction algorithm.
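To make the evaluation measure concrete, the following is a minimal sketch of how a median margin of error could be computed. It assumes that the term means the median absolute percentage difference between an estimate and the eventual sale price; that definition is not stated here. The function name and the sample prices are hypothetical and for illustration only.

```python
import statistics


def median_abs_pct_error(estimates, sale_prices):
    """Median of |estimate - price| / price, expressed in percent.

    Assumption: this is what "median margin of error" refers to.
    """
    if len(estimates) != len(sale_prices):
        raise ValueError("estimates and sale_prices must have the same length")
    if not estimates:
        raise ValueError("at least one estimate is required")
    errors = [
        abs(estimate - price) / price * 100
        for estimate, price in zip(estimates, sale_prices)
    ]
    return statistics.median(errors)


# Made-up values for illustration only, not taken from the Zillow data.
example_estimates = [210_000, 290_000, 500_000]
example_prices = [200_000, 300_000, 500_000]
example_error = median_abs_pct_error(example_estimates, example_prices)
```

Under this reading, a median margin of error of 5.5% would mean that half of the estimates fall within 5.5% of the sale price.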
In this paper, we perform the initial exploratory data analysis, using summary statistics and visualizations to gain an understanding of the data for future analysis.
We use the data set properties_2016.csv from the Zillow Prize: Zillow's Home Value Prediction (Zestimate) Kaggle competition. The data set contains all properties in Los Angeles County, Orange County, and Ventura County in California, along with their home features for 2016. Zillow compiled the data from public records and from user-entered data points.
The data set has 58 variables describing the features or characteristics of a home, including the number of bedrooms, location details, square footage, number of bathrooms, and type of heating and cooling systems.
[Placeholder: list of the variables and their values.]
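One way to produce a list of the variables and their values is sketched below using pandas. This is an illustrative sketch rather than the method used in this paper: the helper name summarize_variables and the choice of summary columns are assumptions, and the only detail taken from the text is the file name properties_2016.csv.

```python
import pandas as pd


def summarize_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per variable: dtype, percent missing,
    number of distinct non-missing values, and one example value."""
    rows = []
    for name in df.columns:
        col = df[name]
        non_null = col.dropna()
        rows.append(
            {
                "variable": name,
                "dtype": str(col.dtype),
                "pct_missing": round(100 * col.isna().mean(), 1),
                "n_unique": col.nunique(),  # NaN is excluded by default
                "example": non_null.iloc[0] if len(non_null) else None,
            }
        )
    return pd.DataFrame(rows).set_index("variable")


# Usage with the competition file (path assumed to be the working directory):
# properties = pd.read_csv("properties_2016.csv", low_memory=False)
# print(summarize_variables(properties).to_string())
```

Reporting the share of missing values alongside each variable is a deliberate choice here, since data gathered from public records and user entries may be incomplete for some features.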