Coder Social home page Coder Social logo

machinelearningproject's Introduction

Practical Machine Learning Project

Goal

The goal of the project is to utilize the Weight Lifting Exercise Dataset from http://groupware.les.inf.puc-rio.br/har in order to create a machine learning algorithm to rate the quality of a repetition of an activity. To put it more technically, we want to utilize the previously mentioned data set to create a machine learning model that can predict the "classe" variable. To the end we will discuss the decisions we made to clean the data, how we implement cross validation, and our estimated out of sample error rate.

Preliminaries

We utilize the caret package and the doParallel package. We utilize the caret package to train our model and we utilize the doParallel package to distribute the model training across all the cores on the machine generating the model. Finally, we set the seed to 4567 in order to allow the reader to generate the exact same results as the author. The code to accomplish the previous tasks is below.

library(caret)
library(doParallel)

set.seed(4567)

cl <- makeCluster(detectCores()) 
registerDoParallel(cl)

Getting and Cleaning the Data

Upon preliminary examination of the data set we notice that many values associated with kurtosis columns have a value of #DIV/0! which means when we load the data set we want to ensure the #DIV/0! values are set to NA. To ensure #DIV/0!, blanks, and "NA" all become NA we load the data utilizing the function below

getTrainingData <- function() {
  training <- read.csv("machinelearningproject/training.csv", header = T, na.strings = c("#DIV/0!", "", "NA"))
}

We further clean the data with the function below. After showing the function we will discuss why we cleaned the data the way we did.

cleanData <- function(dataToBeCleaned) {
  hasNoNa <- apply(dataToBeCleaned, 2, function(column) { sum(is.na(column)) == 0 })
  dataToBeCleaned <- dataToBeCleaned[, hasNoNa]
  columnsToGetRidOf <- c("X", "user_name", "raw_timestamp_part_1","raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window")
  dataToBeCleaned <- dataToBeCleaned[,!(names(dataToBeCleaned) %in% columnsToGetRidOf)]
}

We noticed the training data has many columns which are mostly filled with NA. We wish to filter out those columns we select only the columns without and NA values. Next, there are several columns that do not relate to the core data set so we will also remove those columns from the training set. Finally, the cleanData function accepts a data frame so that we can clean the test set the same way. After cleaning our data we are left with 53 columns of 19,622 variables.

Cross Validation

We split our training data into 70% training data and 30% validation data with the commands below.

  training <- cleanData(getTrainingData())
  inTraining <- createDataPartition(training$classe, p=.7, list=F)
  
  trainingSubset <- training[inTraining,]
  trainingSubsetForValidation <- training[-inTraining,]

We decided to utilize the random forest model because we have many variables with unknown relationships to each other. Thus, we create a model with the function below.

createModel <- function(dataToFit) {
  modelFit <- train(classe ~. , model="rf", data=dataToFit, prox=T)
}

Once we generate our model we then generate a prediction on the validation set using the code below.

validationPrediction <- predict(modelFit, trainingSubsetForValidation)

Then we generate a confusion matrix with the below command.

confusionMatrix(validationPrediction, trainingSubsetForValidation$classe)

Which generates the following analysis.

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1673    7    0    0    0
         B    0 1128    6    1    0
         C    1    4 1013   12    2
         D    0    0    7  951    5
         E    0    0    0    0 1075

Overall Statistics
                                          
               Accuracy : 0.9924          
                 95% CI : (0.9898, 0.9944)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9903          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9994   0.9903   0.9873   0.9865   0.9935
Specificity            0.9983   0.9985   0.9961   0.9976   1.0000
Pos Pred Value         0.9958   0.9938   0.9816   0.9875   1.0000
Neg Pred Value         0.9998   0.9977   0.9973   0.9974   0.9985
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2843   0.1917   0.1721   0.1616   0.1827
Detection Prevalence   0.2855   0.1929   0.1754   0.1636   0.1827
Balanced Accuracy      0.9989   0.9944   0.9917   0.9920   0.9968

From the above analysis we see that the out of sample accuracy rate is 0.9924 which implies the estimated out of sample error rate of 0.0076.

machinelearningproject's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.