
Gradient Boosting - Lab

Introduction

In this lab, we'll learn how to use both AdaBoost and Gradient Boosting classifiers from scikit-learn!

Objectives

You will be able to:

  • Compare and contrast AdaBoost and Gradient Boosting
  • Use AdaBoost to make predictions on a dataset
  • Use Gradient Boosting to make predictions on a dataset

Getting Started

In this lab, we'll learn how to use boosting algorithms to make classifications on the Pima Indians Diabetes dataset. You will find the data stored in the file pima-indians-diabetes.csv. Our goal is to use boosting algorithms to classify each person as having or not having diabetes. Let's get started!

We'll begin by importing everything we need for this lab. In the cell below (a sketch of these imports follows the list):

  • Import numpy, pandas, and matplotlib.pyplot, and set the standard alias for each. Also set matplotlib visualizations to display inline.
  • Set a random seed of 0 by using np.random.seed(0)
  • Import train_test_split and cross_val_score from sklearn.model_selection
  • Import StandardScaler from sklearn.preprocessing
  • Import AdaBoostClassifier and GradientBoostingClassifier from sklearn.ensemble
  • Import accuracy_score, f1_score, confusion_matrix, and classification_report from sklearn.metrics
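
A sketch of what this import cell might look like (assuming a Jupyter notebook and a standard scikit-learn installation):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Set a random seed for reproducibility
np.random.seed(0)

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report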

Now, use pandas to read in the data stored in pima-indians-diabetes.csv and store it in a DataFrame. Display the head to inspect the data we've imported and ensure everything loaded correctly.

df = pd.read_csv('pima-indians-diabetes.csv')
df.head()

Cleaning, Exploration, and Preprocessing

The target we're trying to predict is the 'Outcome' column. A 1 denotes a patient with diabetes.

By now, you're quite familiar with exploring and preprocessing a dataset, so we won't hold your hand for this step.

In the following cells:

  • Store our target column in a separate variable and remove it from the dataset
  • Check for null values and deal with them as you see fit (if any exist)
  • Check the distribution of our target
  • Scale the dataset
  • Split the dataset into training and testing sets, with a test_size of 0.25
target = df['Outcome']
df = df.drop('Outcome', axis=1)
df.isna().sum()        # check for null values
target.value_counts()  # check the distribution of the target
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
scaled_df.head()
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.25)

Training the Models

Now that we've cleaned and preprocessed our dataset, we're ready to fit some models!

In the cell below:

  • Create an AdaBoostClassifier
  • Create a GradientBoostingClassifier
adaboost_clf = AdaBoostClassifier()
gbt_clf = GradientBoostingClassifier()

Now, train each of the classifiers using the training data.
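
A minimal sketch of this step, using scikit-learn's standard fit API:

# Fit each classifier on the training data
adaboost_clf.fit(X_train, y_train)
gbt_clf.fit(X_train, y_train)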

Now, let's create some predictions using each model so that we can calculate the training and testing accuracy for each.

adaboost_train_preds = adaboost_clf.predict(X_train)
adaboost_test_preds = adaboost_clf.predict(X_test)
gbt_clf_train_preds = gbt_clf.predict(X_train)
gbt_clf_test_preds = gbt_clf.predict(X_test)

Now, complete the following function and use it to calculate the training and testing accuracy and f1-score for each model.

def display_acc_and_f1_score(true, preds, model_name):
    # Compute and print the accuracy and F1-score for a set of predictions
    acc = accuracy_score(true, preds)
    f1 = f1_score(true, preds)
    print("Model: {}".format(model_name))
    print("Accuracy: {}".format(acc))
    print("F1-Score: {}".format(f1))
    
print("Training Metrics")
display_acc_and_f1_score(y_train, adaboost_train_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_train, gbt_clf_train_preds, model_name='Gradient Boosted Trees')
print("")
print("Testing Metrics")
display_acc_and_f1_score(y_test, adaboost_test_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_test, gbt_clf_test_preds, model_name='Gradient Boosted Trees')

Let's go one step further and create a confusion matrix and classification report for each. Do so in the cell below.

adaboost_confusion_matrix = confusion_matrix(y_test, adaboost_test_preds)
adaboost_confusion_matrix
gbt_confusion_matrix = confusion_matrix(y_test, gbt_clf_test_preds)
gbt_confusion_matrix
adaboost_classification_report = classification_report(y_test, adaboost_test_preds)
print(adaboost_classification_report)
gbt_classification_report = classification_report(y_test, gbt_clf_test_preds)
print(gbt_classification_report)

Question: How did the models perform? Interpret the evaluation metrics above to answer this question.

Write your answer below this line:


As a final performance check, let's calculate the cross_val_score for each model! Do so now in the cells below.

Recall that to compute the cross-validation score, we need to pass in:

  • A classifier
  • All training data
  • All labels
  • The number of folds we want to use in our cross-validation

Since we're computing the cross-validation score, we'll want to pass in the entire (scaled) dataset, as well as all of the labels. We don't need to split the data into training and testing sets first, because cross_val_score handles that step internally.

In the cells below, compute the mean cross validation score for each model. For the data, use our scaled_df variable. The corresponding labels are in the variable target. Also set cv=5.

print('Mean AdaBoost Cross-Val Score (k=5):')
print(cross_val_score(adaboost_clf, scaled_df, target, cv=5).mean())
# Expected Output: 0.7631270690094218
print('Mean GBT Cross-Val Score (k=5):')
print(cross_val_score(gbt_clf, scaled_df, target, cv=5).mean())
# Expected Output: 0.7591715474068416

These models didn't do poorly, but we could probably do a bit better by tuning some of the important hyperparameters, such as the learning rate.
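
For instance, both AdaBoostClassifier and GradientBoostingClassifier accept a learning_rate parameter. A hypothetical starting point for tuning (the variable names and values below are illustrative, not tuned results):

# Smaller learning rates often trade training speed for better generalization
adaboost_tuned = AdaBoostClassifier(learning_rate=0.5)      # default is 1.0
gbt_tuned = GradientBoostingClassifier(learning_rate=0.05)  # default is 0.1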

Summary

In this lab, we learned how to use scikit-learn's implementations of popular boosting algorithms such as AdaBoost and Gradient Boosted Trees to make classification predictions on a real-world dataset!
