Wine Quality

Predict the quality of red wine from its measured attributes.

Goal

Use the Random Forest algorithm to build machine learning models
Evaluate the model using cross-validation
Select the best model using grid search
Predict using the best model

Background

A typical machine learning process involves training different models on the dataset, evaluating the performance of each one, and selecting the model that performs best.

Two factors largely determine which algorithm performs best:

Performance of the algorithm on the cross-validation set
Choice of hyperparameters for the algorithm

Cross Validation for model accuracy

Normally the training set is used to train the model and the test set is used to evaluate it. However, a single train/test split can give a high-variance estimate: the accuracy obtained on one test set may differ considerably from the accuracy obtained on another test set with the same algorithm.

K-Fold Cross-Validation addresses this problem:

Divide the data into K folds.
In each round, K-1 folds are used for training and the remaining fold is used for testing.
The algorithm is trained and tested K times, each time with a different fold as the test set.
The final result of K-Fold Cross-Validation is the average of the results obtained across the K rounds.
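
The steps above can be sketched with scikit-learn's KFold. This is a minimal illustration, not the project's code: the synthetic data from make_classification, the small forest of 50 trees, and the random seeds are all assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 samples, 8 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

K = 5
kfold = KFold(n_splits=K, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # K-1 folds train the model; the remaining fold tests it.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# The cross-validation result is the mean of the K per-fold accuracies.
cv_accuracy = float(np.mean(fold_scores))
print("Per-fold accuracy:", np.round(fold_scores, 3))
print(f"Mean accuracy: {cv_accuracy:.3f}")
```

The spread of the per-fold accuracies gives a rough idea of how much a single train/test split could have misled.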

Grid Search for Hyperparameter selection

Choosing hyperparameters by trial and error is tedious. It also makes comparisons between algorithms unreliable, because which algorithm comes out ahead can depend on the hyperparameter values each one was given: an algorithm that wins with one set of values may lose with another.

Grid Search automatically finds the best parameters for a particular model by evaluating every combination of values in a predefined grid.

Create a dictionary of the hyperparameters and the candidate values to test for each.
- Each key is a parameter name and each value is the list of values to try for that parameter.
Create an instance of the algorithm class.
Train and evaluate the algorithm with each combination of values from the dictionary.
Identify the hyperparameter combination that returns the highest accuracy.
Report the accuracy obtained with the best parameters.
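
To show what happens inside a grid search, the steps above can be written out by hand with scikit-learn's ParameterGrid and cross_val_score. This is a sketch under assumptions: the data is synthetic and the grid is deliberately tiny so that it runs quickly. In practice GridSearchCV wraps this loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Keys are parameter names; values are the candidate values to try.
# These small values are illustrative only.
param_grid = {
    "n_estimators": [10, 30],
    "criterion": ["gini", "entropy"],
}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    # New instance of the algorithm class for this combination.
    model = RandomForestClassifier(random_state=0, **params)
    # Mean accuracy over 5 cross-validation folds.
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, params

print("Best parameters:", best_params)
print(f"Best cross-validated accuracy: {best_score:.3f}")
```

The number of model fits is the number of combinations times the number of folds (here 4 x 5 = 20), which is why large grids become expensive.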

Dependencies

Pandas
Scikit-learn

pip install -r requirements.txt

Dataset

UCI archive ML data: https://archive.ics.uci.edu/ml/datasets/wine+quality
Saved in: data/winequality-red.csv

Data Preprocessing

Separate the features and labels
Prepare data for cross-validation
- All the data is kept in the training set; cross-validation creates its own train/test folds
Scale the training data
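
A minimal sketch of these preprocessing steps with pandas and scikit-learn's StandardScaler. The rows below are made-up toy values using a few column names from the UCI file, and treating "quality" as the label column is an assumption about how the project reads the data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows with made-up values (assumption), using a subset of the UCI columns.
# With the real file this would be something like:
#   df = pd.read_csv("data/winequality-red.csv")  # add sep=";" if semicolon-delimited
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 6.9],
    "alcohol": [9.4, 9.8, 10.5, 12.0],
    "pH": [3.51, 3.20, 3.16, 3.40],
    "quality": [5, 5, 6, 7],
})

# Separate features and labels; "quality" is assumed to be the label column.
X = df.drop(columns="quality")
y = df["quality"]

# No hold-out split: all rows stay in the training set for cross-validation.
# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Fitting the scaler on all rows before cross-validation lets each validation fold influence the scaling. Wrapping the scaler and the model in a scikit-learn Pipeline is a common way to avoid that.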

Implementing the algorithm

Random Forest Classifier
Number of estimators (trees) = 300
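
A sketch of this configuration with scikit-learn's RandomForestClassifier. The synthetic three-class data and the random_state value are assumptions for illustration; only the estimator count of 300 comes from this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 11 features, 3 classes.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# Random forest with 300 trees, as stated above.
classifier = RandomForestClassifier(n_estimators=300, random_state=0)
classifier.fit(X, y)

print("Trees in the forest:", len(classifier.estimators_))
print("Predictions for the first 5 rows:", classifier.predict(X[:5]))
```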

Implementing Cross Validation

Cross Validation Accuracy
Number of Folds = 5
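
With scikit-learn, 5-fold cross-validation accuracy can be obtained in one call to cross_val_score. This sketch assumes synthetic data in place of the wine dataset; the 300 trees and 5 folds follow the settings listed in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 11 features, 3 classes.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=3, random_state=0
)

classifier = RandomForestClassifier(n_estimators=300, random_state=0)

# One accuracy value per fold; cv=5 gives 5 folds.
accuracies = cross_val_score(classifier, X, y, cv=5, scoring="accuracy")

print(f"Mean accuracy: {accuracies.mean():.3f}")
print(f"Standard deviation across folds: {accuracies.std():.3f}")
```

The standard deviation across folds indicates how much the estimate would vary from one test set to another.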

Parameter selection for best model

Grid Search
Estimators: 100, 300, 500, 700, 1000
Criteria: gini, entropy
With and without bootstrap
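
The grid above maps directly onto scikit-learn's GridSearchCV. The sketch below defines the full grid but fits a reduced one on synthetic data so that it runs quickly; the helper name run_grid_search, the reduced grid, and the data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid from this section: 5 x 2 x 2 = 20 combinations.
PARAM_GRID = {
    "n_estimators": [100, 300, 500, 700, 1000],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}

def run_grid_search(X, y, param_grid=PARAM_GRID, folds=5):
    """Hypothetical helper: cross-validate every combination and refit the best."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=0),
        param_grid=param_grid,
        scoring="accuracy",
        cv=folds,
    )
    return search.fit(X, y)

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# Reduced grid (assumption) to keep the demo fast; pass PARAM_GRID for the full search.
small_grid = {
    "n_estimators": [10, 30],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}
search = run_grid_search(X, y, param_grid=small_grid)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")

# The best model is refit on all the data and can be used for prediction.
print("Predictions:", search.best_estimator_.predict(X[:5]))
```

With the full grid and 5 folds, the search trains 100 forests plus one final refit, so it can take a while on the real dataset.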

Conclusion

K-Fold Cross-Validation evaluates the performance of a model while reducing the variance that comes from relying on a single test set.
Grid Search is used to identify the best-performing hyperparameters for the algorithm.
