likarajo / wine_quality

Wine Quality

Predict the quality of wine (red wine) based on different attributes.

Goal

  • Use the Random Forest algorithm to build machine learning models
  • Evaluate the model using cross-validation
  • Select the best model using grid-search
  • Predict using the best model

Background

A typical machine learning process involves training different models on the dataset, evaluating the performance of each, and selecting the one that performs best.

Two factors largely determine which algorithm performs best:

  • Performance of the algorithm on the cross-validation set
  • Choice of hyperparameters for the algorithm

Cross Validation for model accuracy

Usually the training set is used to train the model and the test set is used to evaluate it. With a single split, however, the result can suffer from a variance problem: the accuracy obtained on one test set may differ considerably from the accuracy obtained on another test set using the same algorithm.

K-Fold Cross-Validation addresses this problem:

  • Divide the data into K folds.
  • In each round, K-1 folds are used for training and the remaining fold is used for testing.
  • The algorithm is trained and tested K times, each time with a different fold as the test set and the remaining folds as the training set.
  • The result of K-Fold Cross-Validation is the average of the results obtained across the K test folds.
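
The steps above can be sketched in plain Python. This is only an illustration of the mechanism, not the notebook's code: the helper names (k_fold_indices, cross_validate), the toy majority-label "model", and the ten-sample data are all made up for the example, and the folds are taken contiguously without shuffling.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more sample
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(X, y, k, fit, score):
    """Train and test k times; each fold serves as the test set exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return mean(scores), scores


# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)


# Made-up data: ten samples, two of which carry label 1.
X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
average, per_fold = cross_validate(X, y, 5, fit_majority, accuracy)
print(per_fold)  # [1.0, 0.5, 1.0, 0.5, 1.0]
print(average)   # 0.8
```

The per-fold scores swing between 0.5 and 1.0 depending on which samples are held out, which is the variance problem described above; the average is the more stable figure.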

Grid Search for Hyperparameter selection

Choosing hyperparameters by trial and error is tedious. It also makes it hard to compare algorithms fairly: one algorithm may outperform another with one set of hyperparameter values and perform worse once those values are changed.

Grid Search automates this: it evaluates every combination of the candidate hyperparameter values supplied for a model and reports the combination that performs best.

  • Create a dictionary of the hyperparameters and the candidate values to test for each.
    • Each key is a parameter name and each value is the list of values to try for that parameter.
  • Create an instance of the algorithm class.
  • Pass the instance and the dictionary to the grid search, which trains and evaluates the model for each combination of values.
  • Check which hyperparameters return the highest accuracy.
  • Find the accuracy obtained using the best parameters.
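
A minimal sketch of the search loop itself, in plain Python. The function grid_search and the scoring function toy_score are hypothetical names introduced here; in practice the score would come from training and cross-validating the model with the given parameters, and scikit-learn's GridSearchCV does that work for you.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every combination in param_grid and return the best one.

    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate:   callable taking a dict of parameters and returning a score
                (higher is better), e.g. a mean cross-validation accuracy.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scoring function standing in for "train and cross-validate".
def toy_score(params):
    bonus = 0.01 if params["bootstrap"] else 0.0
    return bonus - abs(params["n_estimators"] - 300) / 1000


grid = {"n_estimators": [100, 300, 500], "bootstrap": [True, False]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)  # {'n_estimators': 300, 'bootstrap': True}
```

The number of evaluations is the product of the list lengths (here 3 x 2 = 6), so each added parameter or value multiplies the cost of the search.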

Dependencies

  • Pandas
  • Scikit-learn

pip install -r requirements.txt

Dataset

UCI archive ML data: https://archive.ics.uci.edu/ml/datasets/wine+quality
Saved in: data/winequality-red.csv

Data Preprocessing

  • Separate the features and labels
  • Prepare data for cross-validation
    • All the data is kept in the training set
  • Scale the training data
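
These steps might look like the following with pandas and scikit-learn. This is a sketch under assumptions rather than the notebook's actual code: the helper name prepare is made up, the label column is assumed to be called quality as in the UCI file, and StandardScaler is one common scaler that the source does not name explicitly. The small frame is invented data so the example runs without the CSV.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare(df, label_column="quality"):
    """Separate features from labels and standardize the features."""
    X = df.drop(columns=[label_column])
    y = df[label_column]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # each column gets zero mean and unit variance
    return X_scaled, y, scaler


# Invented rows using two of the dataset's column names, for illustration only.
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})
X_scaled, y, scaler = prepare(demo)
print(X_scaled.shape)  # (4, 2)
```

For the real data, the frame would come from something like pd.read_csv("data/winequality-red.csv", sep=";"), assuming the saved copy keeps the semicolon delimiter of the original UCI file.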

Implementing the algorithm

  • Random Forest Classifier
  • Number of estimators (trees) = 300

Implementing Cross Validation

  • Cross Validation Accuracy
  • Number of Folds = 5
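
A sketch of how a 300-tree Random Forest could be evaluated with 5-fold cross-validation in scikit-learn. The synthetic data from make_classification only stands in for the scaled wine features so the example is self-contained, and random_state=0 is an assumption added for reproducibility; neither comes from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); the project uses the 11 wine attributes.
X, y = make_classification(n_samples=200, n_features=11, n_informative=5, random_state=0)

classifier = RandomForestClassifier(n_estimators=300, random_state=0)

# cv=5 trains and tests the classifier five times, once per held-out fold.
scores = cross_val_score(classifier, X, y, cv=5, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```

Reporting the standard deviation next to the mean shows how much the accuracy varies from fold to fold.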

Parameter selection for best model

  • Grid Search
  • Number of estimators: 100, 300, 500, 700, 1000
  • Split criterion: gini, entropy
  • Bootstrap sampling: with and without
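
The grid above maps onto scikit-learn's GridSearchCV roughly as follows. This is a hedged sketch, not the notebook's code: the helper select_best_model, the synthetic data, the 5-fold setting for the search, and random_state=0 are assumptions, and the demo call uses a deliberately smaller grid (quick_grid) only so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid from the list above: 5 x 2 x 2 = 20 combinations.
param_grid = {
    "n_estimators": [100, 300, 500, 700, 1000],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}


def select_best_model(X, y, grid, folds=5):
    """Cross-validate every combination in `grid` and refit the best one on all data."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=0),
        param_grid=grid,
        scoring="accuracy",
        cv=folds,
    )
    search.fit(X, y)
    return search


# Synthetic stand-in data (assumption) and a reduced grid to keep the demo fast;
# pass param_grid instead of quick_grid to run the full search.
X, y = make_classification(n_samples=120, n_features=11, n_informative=5, random_state=0)
quick_grid = {"n_estimators": [10, 30], "criterion": ["gini", "entropy"], "bootstrap": [True, False]}

search = select_best_model(X, y, quick_grid)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)

# Predict using the best model, which GridSearchCV refits on the full data by default.
predictions = search.best_estimator_.predict(X[:5])
```

With the full grid and 5 folds, the search fits 100 forests plus one final refit, so it can take noticeably longer than a single cross-validation run.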

Conclusion

  • K-Fold Cross-Validation evaluates a model's performance while reducing the variance that comes from relying on a single test set.
  • Grid Search is used to identify the best model and its best hyperparameters.
