-> Change your working directory to the folder that contains the 'UCI HAR Dataset' (Samsung data) folder
-> Place the script 'run_analysis.R' in this folder
-> Load the following library: dplyr
-> Source the 'run_analysis.R' script
-> The script performs the steps described in the project, writes the output of Step 5 to a text file called 'newData.txt', and places that file in the current working directory
-> The rest of this document explains the code step by step
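The setup steps above can be sketched as a short R session. This is only an illustration: `project_dir` is a hypothetical variable that defaults to the current working directory, and the guard simply reports what is missing rather than failing.

```r
# Assumption: project_dir is the folder holding both 'UCI HAR Dataset'
# and 'run_analysis.R'. Replace getwd() with your own path if needed.
project_dir <- getwd()

data_ok   <- dir.exists(file.path(project_dir, "UCI HAR Dataset"))
script_ok <- file.exists(file.path(project_dir, "run_analysis.R"))

if (data_ok && script_ok) {
  setwd(project_dir)
  library(dplyr)            # the script relies on dplyr being loaded
  source("run_analysis.R")  # writes newData.txt to the working directory
  # Read the result back to inspect it
  tidy <- read.table("newData.txt", header = TRUE)
  str(tidy[, 1:4])
} else {
  message("Expected 'UCI HAR Dataset' and 'run_analysis.R' in ", project_dir)
}
```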
#read data sets and then combine them all in one main data set
x1 <- read.table(paste(getwd(),"/UCI HAR Dataset/test/X_test.txt",sep = ""))
y1 <- read.table(paste(getwd(),"/UCI HAR Dataset/test/y_test.txt",sep = ""))
z1 <- read.table(paste(getwd(),"/UCI HAR Dataset/test/subject_test.txt",sep = ""))
x2 <- read.table(paste(getwd(),"/UCI HAR Dataset/train/X_train.txt",sep = ""))
y2 <- read.table(paste(getwd(),"/UCI HAR Dataset/train/y_train.txt",sep = ""))
z2 <- read.table(paste(getwd(),"/UCI HAR Dataset/train/subject_train.txt",sep = ""))
combinedData1 <- cbind(z1,y1,x1)
combinedData2 <- cbind(z2,y2,x2)
combinedData <- rbind(combinedData1,combinedData2)
The X_test data set is stored in x1 and X_train in x2. Similarly, y_test and y_train are stored in y1 and y2, and subject_test and subject_train in z1 and z2.
combinedData1 contains subject_test, y_test and X_test, combined by column. combinedData2 contains subject_train, y_train and X_train, combined by column.
Finally, combinedData is formed by binding combinedData1 and combinedData2 by row.
Step 1 is complete.
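To make the column-then-row binding concrete, here is a small sketch on made-up data. The values and the column names (`subject`, `activity`, `f1`, `f2`) are assumptions for illustration only; the real files have 561 feature columns and default names such as V1, V2.

```r
# Toy stand-ins for the test files (2 observations)
subject_test <- data.frame(subject = c(2, 4))
y_test       <- data.frame(activity = c(5, 1))
x_test       <- data.frame(f1 = c(0.1, 0.2), f2 = c(0.3, 0.4))

# Toy stand-ins for the train files (3 observations)
subject_train <- data.frame(subject = c(1, 3, 5))
y_train       <- data.frame(activity = c(2, 3, 6))
x_train       <- data.frame(f1 = c(0.5, 0.6, 0.7), f2 = c(0.8, 0.9, 1.0))

# Bind by column within each set: subject, activity, then features
test_part  <- cbind(subject_test, y_test, x_test)
train_part <- cbind(subject_train, y_train, x_train)

# Bind the two sets by row: test rows first, then train rows
toy_combined <- rbind(test_part, train_part)
dim(toy_combined)
```

The column count stays the same after `rbind`, while the row count is the sum of the two sets.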
First, the feature names and the activity names are read from their text files, and the feature names are then assigned as the column names of combinedData.
#read data sets for feature names and activity names
featureName <- read.table(paste(getwd(),"/UCI HAR Dataset/features.txt",sep = ""))
activityLabels <- read.table(paste(getwd(),"/UCI HAR Dataset/activity_labels.txt",sep = ""))
#read data set for column names and assign it to the column names of the main data set
columnNames <- c("SubjectID","ActivityID",as.vector(featureName$V2))
names(combinedData) <- columnNames
The code below first checks for duplicate column names and then removes the duplicated columns from combinedData, keeping only the first occurrence of each name.
#remove duplicate columns
duplicateColumnName <- duplicated(colnames(combinedData))
combinedData <- combinedData[,!duplicateColumnName]
#Extract columns with mean or std in their name
meanStdCheck <- grep("mean|std",colnames(combinedData))
meanStdData <- cbind(combinedData[,(1:2)],combinedData[,meanStdCheck])
combinedData <- meanStdData
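Once that is done, the columns whose names contain 'mean' or 'std' are extracted together with the subject ID and activity ID and stored in combinedData. Because the pattern is case-sensitive, it also matches the meanFreq() features but not the angle() features whose names contain 'Mean'.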
This completes Step 2.
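The duplicate removal and the 'mean|std' selection can be tried on a handful of column names. The sketch below uses a few names taken from features.txt on a made-up two-row data frame; `toy` and `keep` are hypothetical names used only here.

```r
toy_names <- c("SubjectID", "ActivityID",
               "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
               "fBodyAcc-meanFreq()-X",
               "fBodyAcc-bandsEnergy()-1,8",
               "fBodyAcc-bandsEnergy()-1,8",   # duplicated name
               "angle(X,gravityMean)")

# Made-up values: 2 rows, 8 columns
toy <- as.data.frame(matrix(1:16, nrow = 2))
names(toy) <- toy_names

# duplicated() flags the second and later occurrences, so the first is kept
toy <- toy[, !duplicated(names(toy))]

# Case-sensitive match: picks up mean(), std() and meanFreq(),
# but not gravityMean (capital M)
keep <- grep("mean|std", names(toy))
names(toy)[keep]
```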
The activity IDs in combinedData are matched against the IDs in the activity labels data set and replaced with the corresponding activity names.
#Uses descriptive activity names to name the activities in the data set
combinedData[,2] <- activityLabels[match(combinedData[,2],activityLabels[,1]),2]
Step 3 is complete.
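The lookup relies on `match()`, which returns, for each activity ID, the row of the labels table where that ID is found. A minimal sketch, assuming a labels table shaped like activity_labels.txt and a made-up vector of IDs:

```r
# Same layout as activity_labels.txt: ID in column 1, name in column 2
labels <- data.frame(
  V1 = 1:6,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
         "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

ids <- c(5, 1, 6, 1)  # made-up activity IDs

# match() gives the row index in labels for each ID;
# indexing column 2 with it yields the activity names
activity_names <- labels[match(ids, labels[, 1]), 2]
activity_names
```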
Column names are replaced with more descriptive versions.
Example: 'std' is replaced with 'StandardDeviation'.
#Appropriately labels the data set with descriptive variable names.
colnames(combinedData) <- gsub("\\(|\\)","",colnames(combinedData))
colnames(combinedData) <- gsub("std","StandardDeviation",colnames(combinedData))
colnames(combinedData) <- gsub("mad","MedianDeviation",colnames(combinedData))
colnames(combinedData) <- gsub("sma","SignalMagnitudeArea",colnames(combinedData))
colnames(combinedData) <- gsub("iqr","InterquartileRange",colnames(combinedData))
colnames(combinedData) <- gsub("arCoeff","AutoregressionCoeff",colnames(combinedData))
colnames(combinedData) <- gsub("maxInds","LargestMagnitudeIndex",colnames(combinedData))
Step 4 is complete.
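The effect of the first two substitutions can be seen on a few sample names. This is a sketch on a hypothetical vector `toy_names`, not on the full data set:

```r
toy_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y", "fBodyGyro-meanFreq()-Z")

# Drop the parentheses; both characters must be escaped in the regex
toy_names <- gsub("\\(|\\)", "", toy_names)

# Expand the abbreviation
toy_names <- gsub("std", "StandardDeviation", toy_names)
toy_names
```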
#Create data set grouped by subject and activity and containing average of each feature
lastDataSet <- combinedData[order(combinedData$SubjectID, decreasing = FALSE),]
lastDataSet <- lastDataSet %>% group_by(SubjectID,ActivityID) %>% summarise_each(funs(mean(.,)))
#write the data set just created in a file
write.table(lastDataSet, file = "newData.txt", row.names = FALSE)
After the averages are calculated for each subject and activity, the resulting data set is written to a text file called 'newData.txt' in the current working directory.
Step 5 is complete.
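For readers on a recent dplyr, where `summarise_each()` and `funs()` are deprecated, the same grouping and averaging can be sketched with `across()`. This is an alternative, not what run_analysis.R itself does; it assumes dplyr 1.0.0 or later, and the data frame `toy` is made up for illustration.

```r
library(dplyr)

toy <- data.frame(
  SubjectID  = c(1, 1, 1, 2),
  ActivityID = c("WALKING", "WALKING", "SITTING", "WALKING"),
  f1 = c(1, 3, 5, 7),
  f2 = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# One row per subject/activity pair, each feature replaced by its mean.
# everything() covers all non-grouping columns.
toy_avg <- toy %>%
  group_by(SubjectID, ActivityID) %>%
  summarise(across(everything(), mean), .groups = "drop")

toy_avg
```

Because `group_by()` followed by `summarise()` already returns the groups in sorted order, a separate `order()` call beforehand is not needed for this version.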