
francescomariottini / residential-real-estate-analysis


The real estate company "ImmoEliza" wanted to create a machine learning model to predict sale prices of residential real estate in Belgium. This repository provides a complete analysis and interpretation of the dataset.

belgium residential real-estate analysis price

residential-real-estate-analysis's Introduction

Cleaning, preliminary analysis and interpretation of residential real estate sales in Belgium (What)

The real estate company "ImmoEliza" wanted to create a machine learning model to predict sale prices of residential real estate in Belgium. This repository provides a complete analysis and interpretation of the dataset.

The Mission (Why)

  • Be able to use pandas
  • Be able to use data visualisation libraries (Matplotlib or Seaborn).
  • Be able to establish conclusions about a dataset.

Features

The project results follow, section by section. The related presentation, including graphs produced with Matplotlib/Seaborn, is available here.

Everything needs to be uploaded by Friday 23/10/20 (non-exhaustive list).

A cleaned dataset

The provided dataset (available here) is cleaned of:

  • duplicates
  • blank spaces (ex: " I love python " => "I love python")
  • errors
  • empty values
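As an illustration of the cleaning steps listed above, here is a minimal pandas sketch. It is not the project's actual code: the function name `clean_basic` and the sample rows are made up for this example, and the "errors" step is left out because it depends on column-specific rules.

```python
import numpy as np
import pandas as pd


def _strip_or_nan(value):
    """Strip blanks from strings; turn strings that end up empty into NaN."""
    if isinstance(value, str):
        value = value.strip()
        return value if value else np.nan
    return value


def clean_basic(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: strip blanks, drop duplicates and fully empty rows."""
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].map(_strip_or_nan)
    # Exact duplicates first, then rows where every value is missing.
    out = out.drop_duplicates().dropna(how="all")
    return out.reset_index(drop=True)


# Made-up rows for illustration only.
raw = pd.DataFrame(
    {
        "locality": [" Gent ", "Gent", "Namur", ""],
        "price": [250000.0, 250000.0, 180000.0, np.nan],
    }
)
cleaned = clean_basic(raw)
print(cleaned)
```

Stripping before `drop_duplicates` matters here: " Gent " and "Gent" only become duplicates once the blanks are removed.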

Data analysis overview (QUESTIONS TO BE REPLACED WITH ANSWERS)

The main results from the preliminary data analysis follow:

  • Which variable is the target?
  • How many rows and columns are there?
  • What is the correlation between each variable and the target? (Why?)
  • What is the correlation between the variables themselves? (Why?)
  • Which variables have the greatest influence on the target?
  • Which variables have the least influence on the target?
  • How many qualitative and quantitative variables are there? How would you transform these values into numerical values?
  • What is the percentage of missing values per column?
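Several of the questions above map onto short pandas calls. The sketch below assumes `price` is the target and uses a few invented rows with column names borrowed from the dataset description; it shows one possible way to get the numbers, not the results of the project.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame(
    {
        "price": [200000, 350000, 150000, 500000, 275000],
        "area": [90, 160, 70, 240, np.nan],
        "rooms_number": [2, 4, 1, 5, 3],
        "house_is": [False, True, False, True, True],
        "building_state": ["good", "new", None, "new", "good"],
    }
)

# Rows and columns.
n_rows, n_cols = df.shape

# Percentage of missing values per column.
missing_pct = df.isna().mean().mul(100).round(1)

# Quantitative (numeric and boolean) versus qualitative columns.
numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
qualitative = [col for col in df.columns if col not in numeric.columns]

# Pearson correlation of every numeric variable with the assumed target.
corr_with_price = numeric.corr()["price"].drop("price").sort_values(ascending=False)

# One way to turn a qualitative column into numbers: one-hot encoding.
encoded = pd.get_dummies(df["building_state"], prefix="state")

print(n_rows, n_cols)
print(missing_pct)
print(corr_with_price)
```

One-hot encoding is only one option; an ordinal ranking may suit ordered categories such as the state of the building better.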

Data interpretation questions (non-exhaustive list) (QUESTIONS TO BE REPLACED WITH ANSWERS)

The main results from the data interpretation follow:

  • Are there any outliers? If yes, which ones and why?
  • Which variables would you delete and why?
  • In your opinion, which 5 variables are the most important and why?
  • What are the most expensive municipalities in Belgium? (Average price, median price, price per square meter)
  • What are the most expensive municipalities in Wallonia? (Average price, median price, price per square meter)
  • What are the most expensive municipalities in Flanders? (Average price, median price, price per square meter)
  • What are the least expensive municipalities in Belgium? (Average price, median price, price per square meter)
  • What are the least expensive municipalities in Wallonia? (Average price, median price, price per square meter)
  • What are the least expensive municipalities in Flanders? (Average price, median price, price per square meter)
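The municipality questions above could be answered with a single `groupby`. The following is a sketch on invented rows; the `price_per_sqm` column and the placeholder localities "A", "B", "C" are assumptions for this example. Restricting the ranking to Wallonia or Flanders would additionally need a region column (for example derived from the postcode), which the dataset description does not include.

```python
import pandas as pd

# Invented sales with placeholder localities, for illustration only.
sales = pd.DataFrame(
    {
        "locality": ["A", "A", "B", "B", "C"],
        "price": [300000, 500000, 200000, 220000, 900000],
        "area": [100, 200, 100, 110, 300],
    }
)
sales["price_per_sqm"] = sales["price"] / sales["area"]

ranking = (
    sales.groupby("locality")
    .agg(
        mean_price=("price", "mean"),
        median_price=("price", "median"),
        mean_price_per_sqm=("price_per_sqm", "mean"),
        n_sales=("price", "size"),
    )
    .sort_values("median_price", ascending=False)
)

most_expensive = ranking.head(2)
least_expensive = ranking.tail(2)
print(ranking)
```

Keeping `n_sales` next to the averages makes it easier to spot municipalities whose ranking rests on one or two listings.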

Presentation (26/10/20)

Presentation is available here.

SY prepared the first draft of the presentation and the template. It was agreed to have 1 or 2 slides per person to keep the total presentation time within 5 minutes. No code was included in the presentation.

Who did the project (Who):

Contributors: Joachim Kotek (JK), Francesco Mariottini (FM), Orhan Nurkan (ON), Saba Yahyaa (SY)

Development (How)

Communication and Management

Communication took place mainly through live discussion on-site and, to a smaller extent, on Discord. Project management was mainly carried out on Trello, with each person independently adding labels and tasks and involving other team members in them.

Merging datasets from different sources (How)

Different independent teams worked on a merged dataset to be used by all the teams. On the first day (21/10/20) the CUDA team split the sources (source 5 was excluded as not being good enough) as follows: JK worked on sources 3 and 4, FM worked on sources 1 and 7, ON worked on sources 2, 3 and 6. Group 3 (shared by JK and ON) required collaboration. Additional cleaning work was carried out on 22/10/20 by JK to improve the merged dataset for all the teams.
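A merge of this kind can be sketched with `pd.concat`, tagging each row with its source number. The frames below are invented stand-ins (in practice each would be loaded from a team's file), and dropping duplicates on `hyperlink` is an assumption about how repeated listings could be identified, not a description of what the team did.

```python
import pandas as pd

# Hypothetical per-source frames keyed by source number.
source_frames = {
    1: pd.DataFrame({"hyperlink": ["u1", "u2"], "price": [100000, 200000]}),
    2: pd.DataFrame(
        {"hyperlink": ["u2", "u3"], "price": [200000, 300000], "area": [80, 120]}
    ),
}

# Stack all sources; columns missing from a source are filled with NaN.
merged = pd.concat(
    [frame.assign(source=src) for src, frame in source_frames.items()],
    ignore_index=True,
    sort=False,
)

# Keep the first occurrence of each listing (assumed to be identified by its link).
deduplicated = merged.drop_duplicates(subset="hyperlink", keep="first").reset_index(
    drop=True
)
print(deduplicated)
```

Keeping the `source` column makes it possible to trace a suspicious value back to the team that scraped it.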

PyCharm & GitHub training (How)

At least 2 person-days were spent on technical teaching (and installation) and clarifications about PyCharm (FM), git (FM, JK) and statistics (FM) to allow everybody to work on the project. SY spent additional self-training time on understanding and replicating the code already developed by the team.

Code merging (How)

JK took sole responsibility for merging the code, in order to effectively integrate code from different sources (git and Jupyter files), and for reviewing the code where necessary.

Data formatting and values cleaning (How)

Data cleaning was split into two main groups: initial formatting for similar types of columns (FM) and additional specific formatting for particularly complex cleaning. FM took responsibility for the overall cleaning, including: formatting to the required types, identification of strings representing NA (and their replacement by NA), and extraction of simple numbers from text. ON worked on cleaning and aggregating the categorical values with multiple text values, such as subtype of property, location and state of the building. JK cleaned the postcode and took over the price cleaning.
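To make the NA detection and number extraction more concrete, here is a small sketch. The set `NA_STRINGS` and the helpers `parse_int` and `clean_int_column` are hypothetical names chosen for this example; the team's real formatting function may differ.

```python
import re

import pandas as pd

# Assumed spellings of "no value"; the real list depends on the scraped sources.
NA_STRINGS = {"", "na", "n/a", "none", "null", "unknown", "-"}
_FIRST_INT = re.compile(r"\d+")


def parse_int(value):
    """Return the first integer found in value, or None if nothing usable is there."""
    if value is None or (not isinstance(value, str) and pd.isna(value)):
        return None
    text = str(value).strip()
    if text.lower() in NA_STRINGS:
        return None
    match = _FIRST_INT.search(text)
    return int(match.group()) if match else None


def clean_int_column(series: pd.Series) -> pd.Series:
    """Apply parse_int to a column and store the result as nullable integers."""
    return pd.Series(
        [parse_int(v) for v in series], index=series.index, dtype="Int64"
    )


# Invented raw values for illustration only.
raw = pd.Series(["3 bedrooms", " N/A ", "120 m2", None, "no terrace", 4])
clean = clean_int_column(raw)
print(clean)
```

The nullable `Int64` dtype keeps the column integer-typed even when some values are missing. Prices written with thousands separators (for example "250.000") would need extra handling that this sketch does not attempt.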

The resulting dataset before the first analysis is the following:

| Information | Column name | Variable type | Example(s) or description |
|---|---|---|---|
| Source (team) | source | int | from 1 to 7 |
| Hyperlink | hyperlink | str | |
| Locality | locality | str | |
| Postcode | postcode | int | |
| Type of property (House/apartment) | house_is | bool | |
| Subtype of property | property_subtype | str | Bungalow, Chalet, Mansion, ... |
| Price | price | int | |
| Type of sale (Exclusion of life sales) | sale | str | |
| Number of rooms | rooms_number | int | |
| Area | area | int | |
| Fully equipped kitchen (Yes/No) | kitchen_has | bool | |
| Furnished (Yes/No) | furnished | bool | |
| Open fire (Yes/No) | open_fire | bool | |
| Terrace (Yes/No) | terrace | bool | |
| Terrace Area | terrace_area | int | |
| Garden (Yes/No) | garden | bool | |
| Garden Area | garden_area | int | |
| Surface of the land | land_surface | int | |
| Surface area of the plot of land | land_plot_surface | int | |
| Number of facades | facades_number | int | |
| Swimming pool (Yes/No) | swimming_pool_has | bool | |
| State of the building | building_state | str | New, to be renovated, ... |
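The column types above can be enforced in code with a schema dictionary. The sketch below maps int, bool and str to the pandas nullable dtypes `Int64`, `boolean` and `string`; that mapping is an assumption made here so that missing values survive the conversion, and `apply_schema` is a hypothetical helper.

```python
import pandas as pd

# Column names from the dataset description; nullable dtypes are an assumption.
SCHEMA = {
    "source": "Int64",
    "hyperlink": "string",
    "locality": "string",
    "postcode": "Int64",
    "house_is": "boolean",
    "property_subtype": "string",
    "price": "Int64",
    "sale": "string",
    "rooms_number": "Int64",
    "area": "Int64",
    "kitchen_has": "boolean",
    "furnished": "boolean",
    "open_fire": "boolean",
    "terrace": "boolean",
    "terrace_area": "Int64",
    "garden": "boolean",
    "garden_area": "Int64",
    "land_surface": "Int64",
    "land_plot_surface": "Int64",
    "facades_number": "Int64",
    "swimming_pool_has": "boolean",
    "building_state": "string",
}


def apply_schema(df: pd.DataFrame, schema: dict = SCHEMA) -> pd.DataFrame:
    """Cast every column that appears in the schema; leave other columns alone."""
    out = df.copy()
    for col, dtype in schema.items():
        if col in out.columns:
            out[col] = out[col].astype(dtype)
    return out


# Invented rows for illustration only.
sample = pd.DataFrame(
    {
        "postcode": [9000.0, None],
        "house_is": [True, None],
        "locality": ["Gent", None],
    }
)
typed = apply_schema(sample)
print(typed.dtypes)
```

With plain `int` and `bool`, a single missing value would force the column to float or object, which is why nullable dtypes are used in this sketch.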

Data interpretation

TBD

Future improvements

TBD

Takeaways

  1. Excel may be an effective solution on a single table analysis but joining different tables through pandas could be more effective.
  2. Task(s) must be fully clarified and agreed to avoid overlaps.
  3. Teaching and self-training (code understanding and replication) should be limited in the amount of time and effort spent during a project.

Collecting Data (When)

  • Repository: challenge-data-analysis
  • Type of Challenge: Consolidation
  • Duration: 4 people * 3 days, plus out-of-hours work
  • Deadline: 23/10/2020 17:00
  • Presentation: 26/10/2020 9:00
  • Team challenge: 4

residential-real-estate-analysis's People

Contributors

francescomariottini, jotwo, orhannurkan, sabayahyaa


residential-real-estate-analysis's Issues

Test Create Issue on Github App

1) Planned code cells (22/10/20):

1a) Import the dataset with the requested types. If errors occur, issue a warning to be fixed later.

1b) Type formatting: np.NaN where conversion is not possible. None is kept as None.
Ref: based on FM's formatting function for Elissa

1c) Describe the complete dataframe and the dataframe per group (numerical and boolean values only)
Objectives:

  • double check obtained formatting
  • overview of null values
  • check consistency of numerical values (median, outliers)
  • identify possible outliers

1d) Advanced formatting (regex?) to harmonize/aggregate object-type values (e.g. categories).
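Step 1c could look roughly like the sketch below. The rows are invented, and the interquartile-range rule in `iqr_outliers` (a hypothetical helper) is just one common convention for flagging possible outliers; the issue itself does not say which rule was intended.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame(
    {
        "house_is": [True, True, True, False, False, False],
        "price": [300000, 320000, 5000000, 150000, 160000, 170000],
        "area": [150, 160, 155, 60, 65, 70],
    }
)

# Whole dataframe; booleans are cast to int so their mean reads as a share.
overall = df.astype({"house_is": int}).describe()

# Same statistics per group (houses versus apartments).
per_group = df.groupby("house_is")[["price", "area"]].describe()


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


flags = iqr_outliers(df["price"])
print(per_group)
print(df[flags])
```

Comparing the median with the mean in the `describe` output is a quick consistency check: a large gap usually points at the same rows the outlier flag picks up.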

Data analysis Considerations

AFTER removing duplicates

a) The correlation between price and area or rooms is too good but also too obvious. I would suggest introducing price/area as the target variable.
b) Categorical variables may also be important, but we need to rank them before trying an explicit correlation. If the results make no sense, we could adjust the ranking.
b1) For status, from 5 (new) to 1 (to be renovated)?
b2) For building type, from 5 (single detached house) to 1 (big apartment block). The number of facades could be even better (the more facades, the more independent the building), but I am not sure how much data we have.
b3) To rank postcodes we would need some additional information from the internet; I would probably skip it.
c) Filtering before correlation.
c1) At least filter by house or apartment.
d) Correlation analysis for the variables, also using filters, is required.
e) Additional filters could be considered if the price changes too much with some variables.
e1) Range of price and range of area, determined through the percentiles (describe function).
e2) In addition to postcode, building type and building status could be considered.

To do:
a) Add a price/area column.
b) Agree on the ranking of categorical variables.
d) Check the Pearson correlation.
e) Create data bins if the variability is too high.
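The to-do items above could be sketched as follows. The rows are invented, and `STATE_RANK` is a hypothetical ranking (the issue only fixes the endpoints, 5 for new and 1 for to be renovated; the value 3 for "good" is an assumption). The bin labels are also chosen just for this example.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame(
    {
        "price": [200000, 260000, 150000, 480000, 330000, 120000],
        "area": [100, 120, 90, 200, 150, 80],
        "building_state": [
            "good", "new", "to be renovated", "new", "good", "to be renovated",
        ],
        "house_is": [False, True, False, True, True, False],
    }
)

# a) Price per unit of area as an alternative target.
df["price_per_area"] = df["price"] / df["area"]

# b) Hypothetical ordinal ranking of the building state.
STATE_RANK = {"to be renovated": 1, "good": 3, "new": 5}
df["state_rank"] = df["building_state"].map(STATE_RANK)

# d) Pearson correlation on the whole set, then filtered by house/apartment.
pearson = df[["price_per_area", "state_rank", "area"]].corr(method="pearson")
per_type = {
    bool(is_house): group["price_per_area"].corr(group["area"])
    for is_house, group in df.groupby("house_is")
}

# e) Quantile-based bins for a variable with high variability.
df["area_bin"] = pd.qcut(df["area"], q=3, labels=["small", "medium", "large"])

print(pearson)
print(per_type)
print(df[["area", "area_bin"]])
```

`pd.qcut` splits on percentiles, so each bin holds roughly the same number of rows; `pd.cut` with explicit edges would be the alternative if fixed ranges are preferred.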
