This GitHub repository contains all the code used for Report n.3 of the Digital Data Analysis course. The notebooks are described below in the order of the workflow that guided the writing of the report.
This notebook presents a single plot, which shows why a Machine Learning approach was chosen.
Principal Component Analysis (P.C.A.) is performed and described. The curious case of P.C.A. explainability is explored further here and in this article: https://towardsdatascience.com/p-c-a-meets-explainability-ba1ba5e4636
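One common way to look at P.C.A. explainability is to inspect how much each original feature weighs in each component. The following is a minimal sketch of that idea, not the notebook's actual code: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the report's dataset, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): the report's dataset is not reproduced here.
X, _ = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Standardize first so that no feature dominates only because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Each row of components_ holds the weights of the original features in one component.
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:3]  # three largest absolute weights
    described = ", ".join(f"{feature_names[j]} ({component[j]:+.2f})" for j in top)
    share = pca.explained_variance_ratio_[i]
    print(f"PC{i + 1} ({share:.1%} of variance): {described}")
```

Features with large absolute weights in a component are the ones that component mostly summarizes, which gives a rough reading of what each P.C.A. axis means.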
Four datasets were used to train a linear Support Vector Machine (SVM):
- P.C.A. most informative features dataset
- P.C.A. dataset and other features dataset
- Original dataset
- Most informative features original dataset
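A comparison of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the data are synthetic, `SelectKBest` with `f_classif` is a stand-in for however the "most informative" features were actually chosen, and the dictionary keys are hypothetical labels for the four datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): the report's dataset is not reproduced here.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# For brevity the transforms are fitted on all the data; a stricter evaluation
# would fit them inside each cross-validation fold to avoid information leakage.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=3).fit_transform(X_scaled)
X_best = SelectKBest(f_classif, k=3).fit_transform(X_scaled, y)  # assumed selection method

feature_sets = {
    "pca": X_pca,
    "pca_plus_selected": np.hstack([X_pca, X_best]),
    "original": X_scaled,
    "selected_original": X_best,
}

# Same linear SVM on every feature set, scored with 5-fold cross-validation.
scores = {}
for name, features in feature_sets.items():
    model = SVC(kernel="linear", C=1.0)
    scores[name] = cross_val_score(model, features, y, cv=5).mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

Keeping the model fixed and changing only the input features means that any difference in score comes from the representation, not from the classifier.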
Since the P.C.A. dataset performed better, a nonlinear Support Vector Machine algorithm was applied next.
Here, the three-component P.C.A. dataset is used to train a nonlinear Support Vector Machine for two-class classification.
Here, the three-component P.C.A. dataset is used to train a nonlinear Support Vector Machine for three-class classification.
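The two-class and three-class cases can share the same code path, since scikit-learn's `SVC` handles multiclass problems internally. Below is a hedged sketch: the RBF kernel, the parameter grid, the synthetic three-feature data standing in for the three P.C.A. components, and the helper name `fit_rbf_svm` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC


def fit_rbf_svm(n_classes):
    """Tune an RBF-kernel SVM on synthetic 3-feature data (hypothetical helper)."""
    # Three features stand in for the three P.C.A. components (assumption).
    X, y = make_classification(
        n_samples=300, n_features=3, n_informative=3, n_redundant=0,
        n_classes=n_classes, n_clusters_per_class=1, random_state=0,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    # Small illustrative grid over the regularization strength and kernel width.
    param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
    grid.fit(X_train, y_train)
    return grid.best_params_, grid.score(X_test, y_test)


results = {n: fit_rbf_svm(n) for n in (2, 3)}
for n, (params, accuracy) in results.items():
    print(f"{n} classes: best params {params}, test accuracy {accuracy:.3f}")
```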
In this notebook, Decision Tree and Random Forest models are used for three-class classification. Hyperparameter tuning is carried out to find the best Random Forest model.
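A minimal sketch of this step, assuming scikit-learn and a grid search as the tuning strategy (the notebook may use a different search method or grid). The data are synthetic and the parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: a single decision tree with default settings.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Illustrative grid for the random forest; the values are not taken from the report.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = search.score(X_test, y_test)
print(f"Decision tree test accuracy: {tree_acc:.3f}")
print(f"Best forest params: {search.best_params_}, test accuracy: {forest_acc:.3f}")
```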
The best tuned SVM from the previous notebooks is used to perform a first classification. This classification is then boosted with the best Random Forest model from the previous notebook, applied in the region where the Support Vector Machine makes the most wrong predictions. An overall accuracy of 82% is obtained.
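One possible reading of this two-stage scheme is sketched below. It is an interpretation, not the notebook's method: here the "most wrong prediction area" is assumed to be the class that the SVM predicts least reliably on a validation split, and the Random Forest overrides the SVM only for test samples that the SVM assigns to that class. The data are synthetic, and names such as `weak_class` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

svm = SVC(kernel="rbf").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Find the predicted class for which the SVM is wrong most often on validation data.
val_pred = svm.predict(X_val)
classes = np.unique(y_train)
error_rates = {}
for c in classes:
    mask = val_pred == c
    error_rates[c] = np.mean(y_val[mask] != c) if mask.any() else 0.0
weak_class = max(error_rates, key=error_rates.get)

# Keep the SVM prediction everywhere except where it predicts the weak class.
svm_pred = svm.predict(X_test)
hybrid_pred = svm_pred.copy()
weak_mask = svm_pred == weak_class
if weak_mask.any():
    hybrid_pred[weak_mask] = forest.predict(X_test[weak_mask])

print(f"SVM accuracy:    {np.mean(svm_pred == y_test):.3f}")
print(f"Hybrid accuracy: {np.mean(hybrid_pred == y_test):.3f}")
```

On synthetic data the hybrid is not guaranteed to beat the SVM alone; the sketch only shows the mechanics of routing part of the predictions to a second model.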
The Report
The dataset is the same as the one used in Report 1, which can be found in this GitHub repository
Chierichini Simone, Paialunga Piero