Alberghi Classification

Exploratory Data Analysis + Data Visualization + Modelling

1 - Abstract

In this project I carried out Exploratory Data Analysis, Data Visualisation and, lastly, Modelling with 9 different models. In the Exploratory Data Analysis I removed irrelevant data and NaN values and changed data types for convenience. In the second part I present the data in plots, such as the number of places by type and by province. After these steps I looked at the Pearson and Spearman correlations, which gave very similar results, as expected. Before modelling I split the data into training and testing sets, with a test size of 0.33. I then applied each model; the algorithms used in this project are Logistic Regression, K Neighbors Classification, Decision Tree Classification, Random Forest Classification, AdaBoost Classification, Gradient Boosting Classification, XGB Classification, ExtraTrees Classification and Bagging Classification. Finally, the Random Forest Classifier gives the best result, but tuning the algorithms or cleaning the data further (which I believe would reduce the size of the dataset a lot) could be effective.
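
As a rough sketch of the split described above (the file name, DataFrame name and the numeric-only feature selection are assumptions, not taken from the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the notebook may load the data differently.
df = pd.read_csv("alberghi.csv")

# OUTPUT is the target. As a simplification, keep only numeric feature columns,
# since the notebook's exact encoding of the object columns is not shown here.
X = df.drop(columns=["OUTPUT"]).select_dtypes(include="number")
y = df["OUTPUT"]

# Test size of 0.33 as stated in the abstract; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```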

2 - Data

The dataset contains 6775 rows and 25 columns. Description and type of each column:

  • ID int64 - id
  • PROVINCIA object - id of state
  • COMUNE object - name of city
  • LOCALITA object - name of town
  • CAMERE int64 - number of rooms
  • SUITE int64 - number of suites
  • LETTI int64 - number of beds
  • BAGNI int64 - number of bathrooms
  • PRIMA_COLAZIONE int64 - breakfast included or not
  • IN_ABITATO float64 - in a built-up area or not
  • SUL_LAGO float64 - close to lake or not
  • VICINO_ELIPORTO float64 - close to heliport or not
  • VICINO_AEREOPORTO float64 - close to airport or not
  • ZONA_CENTRALE float64 - in the central area or not
  • VICINO_IMP_RISALITA float64 - close to ski lifts or not
  • ZONA_PERIFERICA float64 - suburb or not
  • ZONA_STAZIONE_FS float64 - close to station or not
  • ATTREZZATURE_VARIE object - equipment types (elevator, parking, restaurant etc.)
  • CARTE_ACCETTATE object - accepted credit cards (Visa, Mastercard etc.)
  • LINGUE_PARLATE object - languages spoken by the host or hotel
  • SPORT object - sport options (football, table tennis etc.)
  • CONGRESSI object - congress room(s)
  • LATITUDINE float64 - latitude
  • LONGITUDINE float64 - longitude
  • OUTPUT object - type of place (target variable)

3 - Exploratory Data Analysis

Firstly, I checked the data types and the number of NaN values in each column. After that I decided which columns and which rows to delete. I dropped the LOCALITA, SPORT, CONGRESSI, LATITUDINE and LONGITUDINE columns, and I dropped the rows with NaN values in the IN_ABITATO, SUL_LAGO, VICINO_ELIPORTO, VICINO_AEREOPORTO, ZONA_CENTRALE, VICINO_IMP_RISALITA, ZONA_PERIFERICA and ZONA_STAZIONE_FS columns. However, I kept 3 columns that contain a very high number of NaN values, because the data they contain could be helpful for future work.
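
A minimal pandas sketch of these cleaning steps (assuming the DataFrame is named `df`; the notebook's exact calls may differ):

```python
# Inspect data types and missing values per column.
df.info()
print(df.isna().sum())

# Drop the columns that are not used for modelling.
df = df.drop(columns=["LOCALITA", "SPORT", "CONGRESSI", "LATITUDINE", "LONGITUDINE"])

# Drop rows with NaN in the binary location flags.
flag_cols = [
    "IN_ABITATO", "SUL_LAGO", "VICINO_ELIPORTO", "VICINO_AEREOPORTO",
    "ZONA_CENTRALE", "VICINO_IMP_RISALITA", "ZONA_PERIFERICA", "ZONA_STAZIONE_FS",
]
df = df.dropna(subset=flag_cols)
```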

Pearson Correlation

Spearman Correlation
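
The two heatmaps above can be reproduced along these lines (a sketch; the figure size and colour map are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations are only defined for numeric columns.
numeric_df = df.select_dtypes(include="number")

fig, axes = plt.subplots(1, 2, figsize=(18, 7))
sns.heatmap(numeric_df.corr(method="pearson"), cmap="coolwarm", ax=axes[0]).set_title("Pearson Correlation")
sns.heatmap(numeric_df.corr(method="spearman"), cmap="coolwarm", ax=axes[1]).set_title("Spearman Correlation")
plt.tight_layout()
plt.show()
```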

4 - Data Visualization

Number of Places According to Their Types

Number of Hotels According to Province

Number of Rooms Compared to Beds

Importance of Columns
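
A rough sketch of how the first two count plots could be drawn (plot styling is an assumption; column names are those listed in section 2):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of places according to their types.
sns.countplot(y="OUTPUT", data=df, order=df["OUTPUT"].value_counts().index)
plt.title("Number of Places According to Their Types")
plt.show()

# Number of hotels according to province.
df["PROVINCIA"].value_counts().plot(kind="bar", figsize=(12, 4))
plt.title("Number of Hotels According to Province")
plt.show()
```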

5 - Modelling

  • 5.1 - Logistic Regression

is used to predict a categorical dependent variable from a given set of independent variables.

Logistic Regression
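
A minimal scikit-learn sketch, using the hypothetical `X_train`/`X_test` split from the abstract (hyperparameters are assumptions, not the notebook's):

```python
from sklearn.linear_model import LogisticRegression

# max_iter raised to help convergence on unscaled features; an assumption.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic Regression Score:", log_reg.score(X_test, y_test))
```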

  • 5.2 - K Neighbors Classification

a non-parametric classification method.

K Neighbors Classification
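
A minimal sketch with scikit-learn (the value of k is the library default, an assumption here):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is the sklearn default, not necessarily the notebook's
knn.fit(X_train, y_train)
print("K Neighbors Classifier Score:", knn.score(X_test, y_test))
```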

  • 5.3 Decision Tree Classification

breaks the data into smaller subsets in the form of a tree structure.

Decision Tree Classification
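
A minimal sketch (tree depth and other hyperparameters left at defaults, which may differ from the notebook):

```python
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
print("DecisionTree Classifier Score:", dtree.score(X_test, y_test))
```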

  • 5.4 - Random Forest Classification

consists of many decision trees built with bagging and feature randomness, then averages/votes over their predictions to give the result.

Random Forest Classification
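
A minimal sketch of the best-performing model (100 trees is the sklearn default, an assumption here):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest Classifier Score:", rf.score(X_test, y_test))
```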

  • 5.5 - AdaBoost Classification

is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, adjusting the weights of incorrectly classified instances.

AdaBoost Classification
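
A minimal sketch (default base estimator and number of estimators; the notebook's settings may differ):

```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost Classifier Score:", ada.score(X_test, y_test))
```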

  • 5.6 - Gradient Boosting Classification

combines weak learning models to create a strong model.

Gradient Boosting Classification
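
A minimal sketch (learning rate and number of estimators are the library defaults, assumptions here):

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
print("Gradient Boosting Classifier Score:", gb.score(X_test, y_test))
```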

  • 5.7 - XGB Classification

an implementation of gradient boosted decision trees, optimised for speed and performance.

XGB Classification
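
A minimal sketch with the xgboost package (label encoding of the string target is an assumption about how the notebook handles the OUTPUT column):

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode the target first.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

xgb = XGBClassifier(eval_metric="mlogloss", random_state=42)
xgb.fit(X_train, y_train_enc)
print("XGB Classifier Score:", xgb.score(X_test, y_test_enc))
```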

  • 5.8 - ExtraTrees Classification

implements a meta-estimator that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging/voting to improve prediction.

ExtraTrees Classification
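
A minimal sketch (defaults only; the notebook's hyperparameters may differ):

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_estimators=100, random_state=42)
et.fit(X_train, y_train)
print("ExtraTrees Classifier Score:", et.score(X_test, y_test))
```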

  • 5.9 - Bagging Classification

an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.

Bagging Classification
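
A minimal sketch (the default base estimator is a decision tree; all settings here are assumptions):

```python
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(n_estimators=10, random_state=42)
bag.fit(X_train, y_train)
print("Bagging Classifier Score:", bag.score(X_test, y_test))
```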

6 - Result & Future Work

  • Logistic Regression Score: 0.6460699681962744
  • K Neighbors Classifier Score: 0.6338028169014085
  • DecisionTree Classifier Score: 0.8391640163562017
  • Random Forest Classifier Score: 0.8727850976828714
  • AdaBoost Classifier Score: 0.5483870967741935
  • Gradient Boosting Classifier Score: 0.8714220808723308
  • XGB Classifier Score: 0.7878237164925034
  • ExtraTree Classifier Score: 0.8632439800090868
  • Bagging Classifier Score: 0.8514311676510677

According to the scores, the Random Forest Classifier gives the best result with 0.872785. Gradient Boosting comes very close to the Random Forest Classifier with 0.871422, and AdaBoost gives the worst performance with 0.548387. In the end the Random Forest Classifier gives the best result, but tuning the XGB Classifier might increase its score.
