
Breast Cancer Diagnosis with Naive Bayes Classifier - Lab

Introduction

Breast cancer is the most common form of cancer in women, and the second most common form of cancer worldwide. The American Cancer Society states that 1,688,780 cancer cases occurred in the United States in 2017, 35.6% of which led to death. The early diagnosis and prognosis of breast cancer involve the detection and classification of cancerous cells. This has led biomedical and bioinformatics specialists to take an interest in applying Machine Learning and other AI methods, which have proven very effective at identifying pathological conditions in cells and organs.

Objectives:

You will be able to:

  • Perform a detailed classification experiment with scikit-learn's implementation of Naive Bayes and the Wisconsin Breast Cancer dataset
  • Perform the necessary data cleaning and pre-processing for machine learning tasks
  • Observe the accuracy of the Naive Bayes classifier and take steps to improve it

Load necessary libraries

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Dataset

The Breast Cancer dataset, first obtained from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison, is composed of 30 continuous variables and 569 observations. The dataset is based on ten original features describing cancerous cell nuclei, derived from a digitized image of a fine needle aspirate of a breast mass. For each of these ten features, the mean, standard error, and 'worst' value (defined as the mean of the three largest values) have been calculated, resulting in a total of 30 continuous features. The original variable "area", for example, has been split into three separate features: area_mean, area_SE, and area_worst. The dataset reports only these derived features, not the original variables. The response variable is categorical, indicating whether the tumor is malignant (M) or benign (B). The dataset contains 357 benign and 212 malignant examples. The distribution of every variable with respect to the response variable is visualized below.

Further details of the dataset can be viewed at the UCI Machine Learning Repository. We have downloaded it for you as a CSV file: data.csv.

Import data.csv as a Pandas DataFrame, then split the dataset to create X (all features) and Y (the target variable)

# Import the dataset
dataset = None
# print("Cancer data set dimensions : {}".format(dataset.shape))
# print(dataset.head())
X = None
Y = None
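
If you get stuck, here is a minimal sketch of one possible solution. It assumes data.csv sits in the working directory and follows the standard layout of this dataset: an id column, the diagnosis column, 30 numeric feature columns, and an empty trailing Unnamed: 32 column (which explains the 33 columns you will see below).

# One possible solution (assumes the column layout described above)
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 2:32].values  # the 30 numeric features
Y = dataset.iloc[:, 1].values     # the 'diagnosis' column: 'M' or 'B'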

Find the dimensions of the dataset using the pandas 'shape' attribute.

# Your code here

# Cancer data set dimensions : (569, 33)

Identify "Malignant" and "Benign" cases in the dataset

# Your code here

# diagnosis
# B    357
# M    212
# dtype: int64
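
One possible way to produce the two outputs above, assuming dataset is the DataFrame loaded earlier:

# Dimensions of the dataset
print("Cancer data set dimensions : {}".format(dataset.shape))

# Class counts for the target variable
print(dataset['diagnosis'].value_counts())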

Visualize the dataset, showing distributions of all features with respect to both target classes

# Visualization of data

# Code here 
# diagnosis
# B    [[AxesSubplot(0.125,0.779333;0.103333x0.100667...
# M    [[AxesSubplot(0.125,0.779333;0.103333x0.100667...
# dtype: object

[Two output figures: grids of per-feature histograms, one for the benign (B) class and one for the malignant (M) class]
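
The truncated output above hints at pandas' grouped histograms; here is a minimal sketch along those lines (dropping the non-feature columns id and Unnamed: 32 is an assumption about the CSV layout):

# Per-feature histograms, one grid per diagnosis class
features = dataset.drop(columns=['id', 'Unnamed: 32'], errors='ignore')
features.groupby('diagnosis').hist(figsize=(16, 16))
plt.show()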

Categorical Data

The data pre-processing for this experiment requires standardizing all features and coding the categorical response variable as a binary vector (equal to 1 if the tumor is malignant, and 0 otherwise).

We will use scikit-learn's LabelEncoder to encode the categorical data. LabelEncoder converts categorical (text) data into numbers, which our predictive models can better understand.

Click here for more details on Label Encoder

Encode "Malignant" and "Benign" in Y to 0/1

# Code here
# array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
#        1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
#        0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
#        0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
#        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
#        1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
#        0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
#        1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
#        0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
#        0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
#        1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
#        1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
#        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
#        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
#        0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
#        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
#        0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
#        0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0])
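
A sketch with LabelEncoder; the encoder assigns integer labels in alphabetical order, so 'B' becomes 0 and 'M' becomes 1, matching the output above:

from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers: 'B' -> 0, 'M' -> 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Y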

Data Splitting for Hold-out Validation Testing

Perform an 80/20 train/test split on the X and Y arrays

# Split the dataset into the Training set and Test set for X and Y 
# Code here 
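
A sketch with train_test_split; test_size=0.2 gives the 80/20 split, and random_state=0 is an arbitrary seed added here only for reproducibility:

from sklearn.model_selection import train_test_split

# 80% of the data for training, 20% held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)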

Feature Scaling

Our dataset contains features that vary widely in magnitude, units, and range (run dataset.describe() to inspect this). We need to bring all features to the same level of magnitude. This can be achieved by scaling, i.e. transforming the data so that it fits within a specific scale, like 0–100 or 0–1.

We will use scikit-learn's StandardScaler method to standardize features by removing the mean and scaling to unit variance. Click here to learn more about StandardScaler

Apply StandardScaler() to all features in X_train and X_test

#Feature Scaling
# Code here 
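
A sketch with StandardScaler; note that the scaler is fit on X_train only and then reused to transform X_test, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean/variance from the training data
X_test = sc.transform(X_test)        # apply the same transformation to the test data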

Model Development

With our pre-processing in place, let's build our model. We shall use GaussianNB to model our data. For this you need to:

  • Initialize an instance of the classifier
  • Fit the model to the X_train and Y_train datasets

This step is the same for pretty much all models in scikit-learn. Here is the official doc with a few code examples to get you going.

Fit the Naive Bayes Classifier

#Fitting Naive_Bayes
# Code here 
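
A sketch of the two steps; the variable name classifier is chosen here to match the class_prior_ snippet below:

from sklearn.naive_bayes import GaussianNB

# Initialize and fit the Gaussian Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, Y_train)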

By default, the GaussianNB() implemented in scikit-learn estimates the class priors from the training data (recent versions also accept an optional priors parameter). If you read the online documentation, you'll see that .class_prior_ is an attribute rather than a parameter. Once you fit the GaussianNB() model, you can access the class_prior_ attribute; it is calculated by simply counting the occurrences of the different labels in your training sample.

# Uncomment below to run
# classifier.class_prior_


# array([0.63736264, 0.36263736])

Now we can use model.predict(test_set) to make predictions for our test data. Here is some help on making predictions in scikit-learn. As mentioned earlier, this process is almost the same for all models in scikit-learn.

Make predictions from the trained classifier

# Make Predictions
Y_pred = None
Y_pred


# array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
#        1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
#        1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
#        1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
#        0, 1, 1, 0])
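
A one-line sketch, assuming the classifier fitted above:

# Predict class labels for the held-out test set
Y_pred = classifier.predict(X_test)
Y_pred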

Calculate Accuracy

Great, now we can bring in our Y_test and compare it against Y_pred to check the accuracy.

  • Simply measure the number of correct predictions your classifier makes, divide by the total number of test examples, and the result is the accuracy of your classifier.

## Calculate accuracy using formula 
acc = None
print(acc)

# 0.9035087719298246
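
A sketch of the manual calculation; comparing the two arrays element-wise yields a boolean array, and the count of True values divided by the test-set size is the accuracy:

# Correct predictions divided by total test examples
acc = np.sum(Y_pred == Y_test) / len(Y_test)
print(acc)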

Scikit-learn has a built-in method to do this. Check here on how to use it.

# Calculate accuracy using scikit learn
# Code here 

# 0.9035087719298246
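
A sketch using scikit-learn's accuracy_score, which should agree with the manual calculation above:

from sklearn.metrics import accuracy_score

accuracy_score(Y_test, Y_pred)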

Level up

  • Predict a single example
  • Train the classifier using 5-fold cross-validation to monitor any improvement/reduction in accuracy (see the sketch after this list)
  • Run this dataset through the NumPy implementation from the last lab and compare the results
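
For the cross-validation item, a minimal sketch using cross_val_score; running it on the full dataset rather than only the training split is one reasonable choice here, not the only one:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(GaussianNB(), X, Y, cv=5)
print(scores, scores.mean())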

Summary

In this lab, we learned to train and predict from a Naive Bayes classifier in scikit-learn. We only evaluated accuracy; we could dig deeper into Type 1 and Type 2 errors, i.e. false positives and false negatives, to check sensitivity and specificity. We shall leave detailed evaluation for a later lesson on classification. Next, we shall look at a more popular use case of Naive Bayes: text classification and NLP.
