Phase 3 Review

To solidify our knowledge of gradient descent, we will use Sklearn's stochastic gradient descent algorithm for regression, SGDRegressor. Sklearn estimators share many methods and parameters, such as fit/predict, but some have useful additions. SGDRegressor has an extra method called partial_fit, which performs a single pass over the data each time it is called. This allows us to inspect the calculated coefficients as gradient descent progresses.
We will use the diabetes dataset for this task.

from sklearn.datasets import load_diabetes
import numpy as np

data = load_diabetes(as_frame=True)
X = data['data']
y = data['target']

X.shape

X.head()

from sklearn.linear_model import SGDRegressor

# Instantiate an SGDRegressor object and run partial_fit on X and y. For now, pass the argument `penalty=None`

one_random_student(quanggang)

# Inspect the coefficient array

one_random_student(quanggang)

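# Import mean_squared_error from metrics, and pass in the true ys, an array of predictions,
# and the argument squared=False (this returns the RMSE rather than the MSE).
# Note: recent scikit-learn releases deprecate `squared`; np.sqrt of the MSE is an equivalent alternative.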

one_random_student(quanggang)
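For reference, the RMSE requested with `squared=False` is the square root of the mean squared residual. The following is a minimal pure-Python sketch of that formula; the values in `y_true` and `y_pred` are made up for illustration and are not taken from the diabetes data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (not from the diabetes dataset).
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))  # sqrt(5/3), roughly 1.29
```

Because RMSE is in the same units as the target, it is usually easier to interpret than the MSE.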

# Repeat the partial fit. Inspect the RMSE and the coefficients.

one_random_student(quanggang)
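One possible way to repeat the partial fit and track the results is sketched below. It is not the only answer. The assumptions are: `return_X_y=True` is used to get plain arrays (the DataFrame version from the notebook should behave the same way), `random_state=42` is set only for repeatability, and five passes is an arbitrary number.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# return_X_y=True gives plain NumPy arrays instead of DataFrames.
X, y = load_diabetes(return_X_y=True)

# random_state is fixed only so repeated runs give the same numbers.
sgd = SGDRegressor(penalty=None, random_state=42)

rmse_history = []
coef_history = []
for _ in range(5):  # five passes is an arbitrary choice
    sgd.partial_fit(X, y)  # each call makes one pass over the data
    y_hat = sgd.predict(X)
    # Take the square root ourselves, so no `squared` argument is needed.
    rmse_history.append(float(np.sqrt(mean_squared_error(y, y_hat))))
    coef_history.append(sgd.coef_.copy())  # copy, in case coef_ is updated in place

print(rmse_history)
print(coef_history[-1])
```

Comparing consecutive entries of `coef_history` shows how far each coefficient moved between passes.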

Pick a coefficient, and explain the gradient descent update.
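To make the update concrete, here is a minimal sketch of a single stochastic gradient descent step for one observation. It assumes a loss of 0.5 * (prediction - y)^2, no penalty, and a constant learning rate. SGDRegressor also applies a learning-rate schedule that is not modelled here, and `sgd_step` is a hypothetical helper, not part of sklearn.

```python
def sgd_step(weights, bias, x, y, learning_rate):
    """One stochastic gradient descent update for a single observation.

    Assumes loss = 0.5 * (prediction - y) ** 2 and no penalty term.
    """
    prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = prediction - y
    # d(loss)/d(w_j) = error * x_j and d(loss)/d(bias) = error
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

weights, bias = sgd_step([0.0, 0.0], 0.0, x=[1.0, 2.0], y=3.0, learning_rate=0.1)
print(weights, bias)  # approximately [0.3, 0.6] and 0.3
```

Each coefficient moves against its own partial derivative, so a feature with a larger value on this observation receives a larger update.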

Together, let's plot the trajectory of one coefficient against the loss.

# code

Compare that to a full fit of the SGDRegressor.

# code

Logistic Regression and Modeling

What type of target do we feed the logistic regression model?

one_random_student(quanggang)

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True)
X = data['data']
y = data['target']

# Perform a train-test split

one_random_student(quanggang)

Question: What is the purpose of train/test split?

one_random_student(quanggang)

Question: Why should we never fit to the test portion of our dataset?

one_random_student(quanggang)
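The same principle applies to preprocessing: any statistic learned from data should come from the training portion only. The sketch below shows this for standard scaling in pure Python. The helper names and the feature values are hypothetical. It uses the population standard deviation, which is what StandardScaler computes.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative feature values
test = [6.0, 0.0]

mean, std = fit_scaler(train)             # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # test is transformed, never fit
print(mean, std)
print(test_scaled)
```

If the test values were included in `fit_scaler`, information about unseen data would leak into the model and the test score would no longer be an honest estimate.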

# Scale the training set using a standard scaler
ss = None
X_train_scaled = None

one_random_student(quanggang)

X_train_scaled.head()

Question: Why is scaling our data important? As part of your answer, relate it to one of the advantages of logistic regression over another classifier.

# fit model with logistic regression to the appropriate portion of our dataset

one_random_student(quanggang)

Now that we have fit our classifier, the object lr has been filled up with information about the best fit parameters. Take a look at the coefficients held in the lr object. Interpret what their magnitudes mean.

# Inspect the .coef_ attribute of lr and interpret

one_random_student(quanggang)

Logistic regression has a predict method just like linear regression. Use the predict method to generate a set of predictions (y_hat_train) for the training set.

# use predict to generate a set of predictions
y_hat_train = None

one_random_student(quanggang)
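The steps so far can be put together as follows. This is one possible sketch, not the reference solution. It assumes NumPy arrays via `return_X_y=True` (so `.head()` is not available on the results), and arbitrary choices of `test_size=0.25`, `random_state=42`, and `max_iter=1000`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# test_size and random_state are arbitrary choices for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = ss.transform(X_test)        # reuse the train statistics

lr = LogisticRegression(max_iter=1000)  # generous iteration cap for convergence
lr.fit(X_train_scaled, y_train)

y_hat_train = lr.predict(X_train_scaled)
print(lr.coef_.shape)    # one row of coefficients, one column per feature
print(y_hat_train[:10])
```

Because the features were standardized, the magnitudes in `lr.coef_` are on a comparable scale across features.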

Confusion Matrix

Confusion matrices are a great way to visualize the performance of our classifiers.

Question: What does a good confusion matrix look like?

one_random_student(quanggang)

# create a confusion matrix for our logistic regression model fit on the scaled training data

one_random_student(quanggang)
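To see what the four cells of a binary confusion matrix count, here is a small pure-Python sketch. The helper `confusion_counts` is hypothetical and the labels are made up. The layout, with rows as true classes and columns as predicted classes, follows the convention of sklearn's confusion_matrix.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Illustrative labels, not output from the breast cancer model.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```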

Accuracy/Precision/Recall/F_1 Score

We have a number of additional metrics, most of which we can compute from the confusion matrix (CM).

Question: Define accuracy. What is the accuracy score of our classifier?

# Confirm accuracy in code

one_random_student(quanggang)

Question: Why might accuracy fail to be a good representation of the quality of a classifier?

one_random_student(quanggang)
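A quick way to see the problem is with a hypothetical imbalanced dataset and a "classifier" that always predicts the majority class. The numbers below are invented for illustration.

```python
# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "classifier" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks strong even though the model never identifies a single positive case.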

Question: Define recall. What is the recall score of our classifier?

# Confirm recall in code

one_random_student(quanggang)

Question: Define precision. What is the precision score of our classifier?

# Confirm precision in code

one_random_student(quanggang)

Question: Define F1 score. What is the F1 score of our classifier?

one_random_student(quanggang)
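All four metrics in this section can be written directly in terms of the confusion-matrix cells. The sketch below uses a hypothetical helper and illustrative counts. It does not guard against division by zero, which a real implementation would need to handle.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Illustrative counts, not results from the breast cancer model.
accuracy, precision, recall, f1 = classification_metrics(tn=3, fp=1, fn=2, tp=4)
print(accuracy, precision, recall, f1)
```

With these counts the accuracy is 0.7, the precision is 0.8, the recall is 2/3, and the F1 score is 8/11.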

Auc_Roc

The ROC curve can't be deduced from a single confusion matrix, because it summarizes performance across all classification thresholds rather than just one. Look here for some nice visualizations of AUC_ROC. Describe what the ROC curve shows. What does a good ROC curve look like? What is a good AUC score?

one_random_student(quanggang)
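One useful way to read the AUC score is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in pure Python. The helper name and the scores are illustrative assumptions. In practice sklearn's roc_auc_score would be used.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.6, 0.42, 0.55, 0.8, 0.9]  # illustrative probabilities
print(auc_by_ranking(y_true, scores))  # 0.8125
```

A score of 0.5 corresponds to random ranking, and 1.0 means every positive is scored above every negative.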

One of the advantages of logistic regression is that it generates a probability associated with each prediction. What is the default threshold? How would decreasing or increasing the threshold affect the true positive and false positive rates?

For our scaled X_train, generate an array of predicted probabilities for the positive class.

# your code here

one_random_student(quanggang)

Now, using those probabilities, create two arrays: one that converts the probabilities to label predictions using the default threshold, and one that uses a threshold of 0.4. How does the change affect our metrics?
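The effect of moving the threshold can be sketched with a handful of made-up probabilities. The helper `rates_at_threshold` is hypothetical, and it treats a probability greater than or equal to the threshold as a positive prediction, which is an assumption of this sketch.

```python
def rates_at_threshold(y_true, probabilities, threshold):
    """Return (true positive rate, false positive rate) at a given cut-off."""
    preds = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    positives = sum(t == 1 for t in y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

# Illustrative labels and probabilities, not output from our model.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.1, 0.3, 0.45, 0.6, 0.42, 0.55, 0.8, 0.9]

print(rates_at_threshold(y_true, probabilities, 0.5))  # (0.75, 0.25)
print(rates_at_threshold(y_true, probabilities, 0.4))  # (1.0, 0.5)
```

In this example, lowering the threshold catches every positive but doubles the false positive rate.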

# Plot the AUC_ROC curve for our classifier

More Algorithms

Much of the sklearn syntax is shared across classifiers and regressors. Fit, predict, score, and more are methods shared by all sklearn classifiers, but they work differently under the hood. KNN's fit method simply stores the training set in memory, whereas logistic regression's .fit() does the hard work of calculating coefficients.
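To illustrate the point about KNN, here is a toy 1-nearest-neighbour classifier in pure Python. `TinyOneNN` is a hypothetical class written for this sketch and is not how sklearn implements KNN internally. It only shows that fit can be trivial while predict does the work.

```python
class TinyOneNN:
    """Hypothetical 1-nearest-neighbour classifier, for illustration only."""

    def fit(self, X, y):
        # "Training" is just remembering the data.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        # All the work happens here: compare each query with every stored row.
        predictions = []
        for query in X:
            distances = [
                sum((a - b) ** 2 for a, b in zip(query, row)) for row in self.X_
            ]
            nearest = distances.index(min(distances))
            predictions.append(self.y_[nearest])
        return predictions

model = TinyOneNN().fit([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], ["a", "a", "b"])
print(model.predict([[0.2, 0.1], [4.0, 4.5]]))  # ['a', 'b']
```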

However, each algo also has specific parameters and attributes associated with it. For example, decision trees have feature importances and logistic regression has coefficients. KNN has n_neighbors and decision trees have max_depth.

Getting to know the algos and their associated properties is an important area of study.

That being said, you are now getting to the point where, no matter which algorithm you choose, you can run the code to create a model as long as you have the data in the correct shape. Most importantly, the target must be in the appropriate form (continuous/categorical) and isolated from the predictors.

Here are the algos we know so far.

Linear Regression
Lasso/Ridge Regression
Logistic Regression
Naive-Bayes
KNN
Decision Trees

Note that KNN and decision trees also have regression classes in sklearn.

Here are two datasets, one from seaborn and one from sklearn. Let's work through the process of creating simple models for each.

import seaborn as sns
penguins = sns.load_dataset('penguins')
penguins.head()

Question: What algorithm would be appropriate based on the target?

# split target from predictors

one_random_student(quanggang)

For the first simple model, let's just use the numeric predictors.

one_random_student(quanggang)

# isolate numeric predictors

one_random_student(quanggang)

# Scale appropriately

one_random_student(quanggang)

# instantiate appropriate model and fit to appropriate part of data.

one_random_student(quanggang)

# Create a set of predictions

y_hat_train = None
y_hat_test = None

one_random_student(quanggang)

# Create and analyze appropriate metrics

one_random_student(quanggang)

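# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']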

Question: What algorithm would be appropriate based on the target?

one_random_student(quanggang)

# split target from predictors

one_random_student(quanggang)

For the first simple model, let's just use the numeric predictors.

# isolate numeric predictors

one_random_student(quanggang)

# Scale appropriately

one_random_student(quanggang)

# instantiate appropriate model and fit to appropriate part of data.

one_random_student(quanggang)

# Create a set of predictions

y_hat_train = None
y_hat_test = None

one_random_student(quanggang)

# Create and analyze appropriate metrics

one_random_student(quanggang)
