Alberghi Classification

Exploratory Data Analysis + Data Visualization + Modelling

1 - Abstract

In this project I carried out Exploratory Data Analysis, Data Visualisation and, lastly, Modelling with 9 different models. In the Exploratory Data Analysis I removed irrelevant data and NaN values and changed data types for convenience. In the second part I present the data in plots, such as the number of places by type and by province. After these steps I looked at the Pearson and Spearman correlations, which gave very similar results, as expected. Before modelling I split the data into training and testing sets, with a test size of 0.33. I then applied each model; the algorithms used in this project are Logistic Regression, K Neighbors Classification, Decision Tree Classification, Random Forest Classification, AdaBoost Classification, Gradient Boosting Classification, XGB Classification, ExtraTrees Classification and Bagging Classification. Finally, the Random Forest Classifier gives the best result, but tuning the algorithms or cleaning the data further (which I believe would reduce the size of the dataset a lot) could be effective.
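
As a rough sketch of the split described above (the file name, DataFrame name and the numeric-only feature selection are assumptions, not taken from the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the notebook may load the data differently.
df = pd.read_csv("alberghi.csv")

# OUTPUT is the target. As a simplification, keep only numeric feature columns,
# since the notebook's exact encoding of the object columns is not shown here.
X = df.drop(columns=["OUTPUT"]).select_dtypes(include="number")
y = df["OUTPUT"]

# Test size of 0.33 as stated in the abstract; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```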

2 - Data

The dataset contains 6775 rows and 25 columns. Description and type of each column:

  • ID int64 - id
  • PROVINCIA object - id of state
  • COMUNE object - name of city
  • LOCALITA object - name of town
  • CAMERE int64 - number of rooms
  • SUITE int64 - number of suites
  • LETTI int64 - number of beds
  • BAGNI int64 - number of bathrooms
  • PRIMA_COLAZIONE int64 - breakfast included or not
  • IN_ABITATO float64 - in a built-up area or not
  • SUL_LAGO float64 - close to lake or not
  • VICINO_ELIPORTO float64 - close to heliport or not
  • VICINO_AEREOPORTO float64 - close to airport or not
  • ZONA_CENTRALE float64 - in the central area or not
  • VICINO_IMP_RISALITA float64 - close to ski lifts or not
  • ZONA_PERIFERICA float64 - suburb or not
  • ZONA_STAZIONE_FS float64 - close to station or not
  • ATTREZZATURE_VARIE object - equipment types (elevator, parking, restaurant etc.)
  • CARTE_ACCETTATE object - accepted credit cards (Visa, Mastercard etc.)
  • LINGUE_PARLATE object - languages spoken by the host or hotel
  • SPORT object - sport options (football, table tennis etc.)
  • CONGRESSI object - congress room(s)
  • LATITUDINE float64 - latitude
  • LONGITUDINE float64 - longitude
  • OUTPUT object - type of place (target variable)

3 - Exploratory Data Analysis

Firstly, I checked the data types and the number of NaN values in each column. After that I decided which columns and which rows to delete. I dropped the LOCALITA, SPORT, CONGRESSI, LATITUDINE and LONGITUDINE columns, and I dropped the rows with NaN values in the IN_ABITATO, SUL_LAGO, VICINO_ELIPORTO, VICINO_AEREOPORTO, ZONA_CENTRALE, VICINO_IMP_RISALITA, ZONA_PERIFERICA and ZONA_STAZIONE_FS columns. However, I kept 3 columns that contain a very high number of NaN values, because the data they contain could be helpful for future work.
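
A minimal pandas sketch of these cleaning steps (assuming the DataFrame is named `df`; the notebook's exact calls may differ):

```python
# Inspect data types and missing values per column.
df.info()
print(df.isna().sum())

# Drop the columns that are not used for modelling.
df = df.drop(columns=["LOCALITA", "SPORT", "CONGRESSI", "LATITUDINE", "LONGITUDINE"])

# Drop rows with NaN in the binary location flags.
flag_cols = [
    "IN_ABITATO", "SUL_LAGO", "VICINO_ELIPORTO", "VICINO_AEREOPORTO",
    "ZONA_CENTRALE", "VICINO_IMP_RISALITA", "ZONA_PERIFERICA", "ZONA_STAZIONE_FS",
]
df = df.dropna(subset=flag_cols)
```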

Pearson Correlation

Spearman Correlation
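
The two heatmaps above can be reproduced along these lines (a sketch; the figure size and colour map are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations are only defined for numeric columns.
numeric_df = df.select_dtypes(include="number")

fig, axes = plt.subplots(1, 2, figsize=(18, 7))
sns.heatmap(numeric_df.corr(method="pearson"), cmap="coolwarm", ax=axes[0]).set_title("Pearson Correlation")
sns.heatmap(numeric_df.corr(method="spearman"), cmap="coolwarm", ax=axes[1]).set_title("Spearman Correlation")
plt.tight_layout()
plt.show()
```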

4 - Data Visualization

Number of Places According to Their Types

Number of Hotels According to Province

Number of Rooms Compared to Beds

Importance of Columns
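
A rough sketch of how the first two count plots could be drawn (plot styling is an assumption; column names are those listed in section 2):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of places according to their types.
sns.countplot(y="OUTPUT", data=df, order=df["OUTPUT"].value_counts().index)
plt.title("Number of Places According to Their Types")
plt.show()

# Number of hotels according to province.
df["PROVINCIA"].value_counts().plot(kind="bar", figsize=(12, 4))
plt.title("Number of Hotels According to Province")
plt.show()
```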

5 - Modelling

  • 5.1 - Logistic Regression

is used to predict a categorical dependent variable from a given set of independent variables.

Logistic Regression
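
A minimal scikit-learn sketch, using the hypothetical `X_train`/`X_test` split from the abstract (hyperparameters are assumptions, not the notebook's):

```python
from sklearn.linear_model import LogisticRegression

# max_iter raised to help convergence on unscaled features; an assumption.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic Regression Score:", log_reg.score(X_test, y_test))
```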

  • 5.2 - K Neighbors Classification

a non-parametric classification method.

K Neighbors Classification
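
A minimal sketch with scikit-learn (the value of k is the library default, an assumption here):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is the sklearn default, not necessarily the notebook's
knn.fit(X_train, y_train)
print("K Neighbors Classifier Score:", knn.score(X_test, y_test))
```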

  • 5.3 Decision Tree Classification

breaks the data into smaller subsets in the form of a tree structure.

Decision Tree Classification
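
A minimal sketch (tree depth and other hyperparameters left at defaults, which may differ from the notebook):

```python
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
print("DecisionTree Classifier Score:", dtree.score(X_test, y_test))
```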

  • 5.4 - Random Forest Classification

consists of many decision trees built with bagging and feature randomness, then averages/votes over their predictions to give the result.

Random Forest Classification
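
A minimal sketch of the best-performing model (100 trees is the sklearn default, an assumption here):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest Classifier Score:", rf.score(X_test, y_test))
```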

  • 5.5 - AdaBoost Classification

is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, adjusting the weights of incorrectly classified instances.

AdaBoost Classification
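
A minimal sketch (default base estimator and number of estimators; the notebook's settings may differ):

```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost Classifier Score:", ada.score(X_test, y_test))
```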

  • 5.6 - Gradient Boosting Classification

combines weak learning models to create a strong model.

Gradient Boosting Classification
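
A minimal sketch (learning rate and number of estimators are the library defaults, assumptions here):

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
print("Gradient Boosting Classifier Score:", gb.score(X_test, y_test))
```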

  • 5.7 - XGB Classification

an implementation of gradient boosted decision trees, optimised for speed and performance.

XGB Classification
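
A minimal sketch with the xgboost package (label encoding of the string target is an assumption about how the notebook handles the OUTPUT column):

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode the target first.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

xgb = XGBClassifier(eval_metric="mlogloss", random_state=42)
xgb.fit(X_train, y_train_enc)
print("XGB Classifier Score:", xgb.score(X_test, y_test_enc))
```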

  • 5.8 - ExtraTrees Classification

implements a meta-estimator that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging/voting to improve prediction.

ExtraTrees Classification
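
A minimal sketch (defaults only; the notebook's hyperparameters may differ):

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_estimators=100, random_state=42)
et.fit(X_train, y_train)
print("ExtraTrees Classifier Score:", et.score(X_test, y_test))
```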

  • 5.9 - Bagging Classification

an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.

Bagging Classification
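
A minimal sketch (the default base estimator is a decision tree; all settings here are assumptions):

```python
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(n_estimators=10, random_state=42)
bag.fit(X_train, y_train)
print("Bagging Classifier Score:", bag.score(X_test, y_test))
```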

6 - Result & Future Work

  • Logistic Regression Score: 0.6460699681962744
  • K Neighbors Classifier Score: 0.6338028169014085
  • DecisionTree Classifier Score: 0.8391640163562017
  • Random Forest Classifier Score: 0.8727850976828714
  • AdaBoost Classifier Score: 0.5483870967741935
  • Gradient Boosting Classifier Score: 0.8714220808723308
  • XGB Classifier Score: 0.7878237164925034
  • ExtraTree Classifier Score: 0.8632439800090868
  • Bagging Classifier Score: 0.8514311676510677

According to the scores, the Random Forest Classifier gives the best result with 0.872785. Gradient Boosting comes very close to the Random Forest Classifier with 0.871422, and AdaBoost gives the worst performance with 0.548387. In the end the Random Forest Classifier gives the best result, but tuning the XGB Classifier might increase its score.
