
Breast Cancer Diagnosis with Naive Bayes Classifier - Lab

Introduction

Breast cancer is the most common form of cancer in women, and the second most common form of cancer worldwide. The American Cancer Society states that 1,688,780 cancer cases occurred in the United States in 2017, 35.6% of which led to death. The early diagnosis and prognosis of breast cancer involve the detection and classification of cancerous cells. This has led biomedical and bioinformatics specialists to take an interest in applying Machine Learning and other AI methods, which have proven very effective at identifying pathological conditions in cells and organs.

Objectives:

You will be able to:

  • Perform a detailed classification experiment with scikit-learn's implementation of Naive Bayes and the Wisconsin Breast Cancer dataset
  • Perform the necessary data cleaning and pre-processing for machine learning tasks
  • Observe the accuracy of the Naive Bayes classifier and take steps to improve it

Load necessary libraries

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Dataset

The Breast Cancer dataset, first obtained from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison, is composed of 30 continuous variables and 569 observations. The dataset is based on ten original features describing cancerous cell nuclei, derived from a digitized image of a fine needle aspirate of a breast mass. For each of these ten features, the mean, standard error, and 'worst' value (defined as the mean of the three largest values) have been calculated, resulting in a total of 30 continuous features. The original variable "area", for example, has been split into three separate features: area_mean, area_SE, and area_worst. The dataset reports only these derived features, not the original variables. The response variable is categorical, indicating whether the tumor is malignant (M) or benign (B). The dataset contains 357 benign and 212 malignant examples. The distribution of every variable with respect to the response variable is visualized below.

Further details of the dataset can be viewed at the UCI Machine Learning Repository. We have downloaded it for you as a CSV file: data.csv.

Import data.csv as a Pandas DataFrame, then split the dataset to create X (all features) and Y (the target variable)

# Import the dataset
dataset = None
# print("Cancer data set dimensions : {}".format(dataset.shape))
# print(dataset.head())
X = None
Y = None
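
If you get stuck, here is a minimal sketch of one possible solution. It assumes data.csv sits in the working directory and follows the standard layout of this dataset: an id column, the diagnosis column, 30 numeric feature columns, and an empty trailing Unnamed: 32 column (which explains the 33 columns you will see below).

# One possible solution (assumes the column layout described above)
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 2:32].values  # the 30 numeric features
Y = dataset.iloc[:, 1].values     # the 'diagnosis' column: 'M' or 'B'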

Find the dimensions of the dataset using the pandas 'shape' attribute.

# Your code here

# Cancer data set dimensions : (569, 33)

Identify "Malignant" and "Benign" cases in the dataset

# Your code here

# diagnosis
# B    357
# M    212
# dtype: int64
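
One possible way to produce the two outputs above, assuming dataset is the DataFrame loaded earlier:

# Dimensions of the dataset
print("Cancer data set dimensions : {}".format(dataset.shape))

# Class counts for the target variable
print(dataset['diagnosis'].value_counts())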

Visualize the dataset, showing distributions of all features with respect to both target classes

# Visualization of data

# Code here 
# diagnosis
# B    [[AxesSubplot(0.125,0.779333;0.103333x0.100667...
# M    [[AxesSubplot(0.125,0.779333;0.103333x0.100667...
# dtype: object

[Two output figures: grids of per-feature histograms, one for the benign (B) class and one for the malignant (M) class]
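
The truncated output above hints at pandas' grouped histograms; here is a minimal sketch along those lines (dropping the non-feature columns id and Unnamed: 32 is an assumption about the CSV layout):

# Per-feature histograms, one grid per diagnosis class
features = dataset.drop(columns=['id', 'Unnamed: 32'], errors='ignore')
features.groupby('diagnosis').hist(figsize=(16, 16))
plt.show()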

Categorical Data

The data pre-processing for this experiment requires standardizing all features and coding the categorical response variable as a binary vector (equal to 1 if the tumor is malignant, and 0 otherwise).

We will use scikit-learn's LabelEncoder to encode the categorical data. LabelEncoder converts categorical (text) data into numbers, which our predictive models can better understand.

Click here for more details on Label Encoder

Encode "Malignant" and "Benign" in Y to 0/1

# Code here
# array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
#        1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
#        0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
#        0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
#        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
#        1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
#        0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
#        1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
#        0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
#        0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
#        1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
#        1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
#        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
#        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
#        0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
#        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
#        0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
#        0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0])
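
A sketch with LabelEncoder; the encoder assigns integer labels in alphabetical order, so 'B' becomes 0 and 'M' becomes 1, matching the output above:

from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers: 'B' -> 0, 'M' -> 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Y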

Data Splitting for Hold-out Validation Testing

Perform an 80/20 train/test split on the X and Y arrays

# Split the dataset into the Training set and Test set for X and Y 
# Code here 
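
A sketch with train_test_split; test_size=0.2 gives the 80/20 split, and random_state=0 is an arbitrary seed added here only for reproducibility:

from sklearn.model_selection import train_test_split

# 80% of the data for training, 20% held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)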

Feature Scaling

Our dataset contains features that vary widely in magnitude, units, and range (run dataset.describe() to inspect this). We need to bring all features to the same level of magnitude. This can be achieved by scaling, i.e. transforming the data so that it fits within a specific scale, like 0–100 or 0–1.

We will use scikit-learn's StandardScaler method to standardize features by removing the mean and scaling to unit variance. Click here to learn more about StandardScaler

Apply StandardScaler() to all features in X_train and X_test

#Feature Scaling
# Code here 
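
A sketch with StandardScaler; note that the scaler is fit on X_train only and then reused to transform X_test, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean/variance from the training data
X_test = sc.transform(X_test)        # apply the same transformation to the test data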

Model Development

With our pre-processing in place, let's build our model. We shall use GaussianNB to model our data. For this you need to:

  • Initialize an instance of the classifier
  • Fit the model to the X_train and Y_train datasets

This step is the same for pretty much all models in scikit-learn. Here is the official doc with a few code examples to get you going.

Fit the Naive Bayes Classifier

#Fitting Naive_Bayes
# Code here 
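
A sketch of the two steps; the variable name classifier is chosen here to match the class_prior_ snippet below:

from sklearn.naive_bayes import GaussianNB

# Initialize and fit the Gaussian Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, Y_train)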

By default, the GaussianNB() implemented in scikit-learn estimates the class priors from the training data (recent versions also accept an optional priors parameter). If you read the online documentation, you'll see that .class_prior_ is an attribute rather than a parameter. Once you fit the GaussianNB() model, you can access the class_prior_ attribute; it is calculated by simply counting the occurrences of the different labels in your training sample.

# Uncomment below to run
# classifier.class_prior_


# array([0.63736264, 0.36263736])

Now we can use model.predict(test_set) to make predictions for our test data. Here is some help on making predictions in scikit-learn. As mentioned earlier, this process is almost the same for all models in scikit-learn.

Make predictions from the trained classifier

# Make Predictions
Y_pred = None
Y_pred


# array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
#        1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
#        1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
#        1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
#        0, 1, 1, 0])
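
A one-line sketch, assuming the classifier fitted above:

# Predict class labels for the held-out test set
Y_pred = classifier.predict(X_test)
Y_pred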

Calculate Accuracy

Great, now we can bring in our Y_test and compare it against Y_pred to check the accuracy.

  • Simply measure the number of correct predictions your classifier makes, divide by the total number of test examples, and the result is the accuracy of your classifier.

## Calculate accuracy using formula 
acc = None
print(acc)

# 0.9035087719298246
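
A sketch of the manual calculation; comparing the two arrays element-wise yields a boolean array, and the count of True values divided by the test-set size is the accuracy:

# Correct predictions divided by total test examples
acc = np.sum(Y_pred == Y_test) / len(Y_test)
print(acc)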

Scikit-learn has a built-in method to do this. Check here on how to use it.

# Calculate accuracy using scikit learn
# Code here 

# 0.9035087719298246
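
A sketch using scikit-learn's accuracy_score, which should agree with the manual calculation above:

from sklearn.metrics import accuracy_score

accuracy_score(Y_test, Y_pred)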

Level up

  • Predict a single example
  • Train the classifier using 5-fold cross-validation to monitor any improvement/reduction in accuracy (see the sketch after this list)
  • Run this dataset through the NumPy implementation from the last lab and compare the results
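
For the cross-validation item, a minimal sketch using cross_val_score; running it on the full dataset rather than only the training split is one reasonable choice here, not the only one:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(GaussianNB(), X, Y, cv=5)
print(scores, scores.mean())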

Summary

In this lab, we learned to train and predict from a Naive Bayes classifier in scikit-learn. We only evaluated accuracy; we could dig deeper into Type 1 and Type 2 errors, i.e. false positives and false negatives, to check sensitivity and specificity. We shall leave detailed evaluation for a later lesson on classification. Next, we shall look at a more popular use case of Naive Bayes: text classification and NLP.
