Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health or to find patterns in their behavior. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. The goal of this project is to predict the manner in which the participants did the exercise. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants are used to build a prediction model. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Cross-validation is done by splitting the original training data set, without replacement, into two subsets: ‘training’ (75%) and ‘validation’ (25%). Strictly speaking this is a single hold-out split rather than k-fold cross-validation. The model is built on the training subset and then tested on the validation subset. The model that performs best on the validation subset is selected.
The task is to predict the ‘classe’ variable from the other variables. Candidate models are built to predict ‘classe’ from different feature sets, and the model with the highest accuracy, and therefore the lowest out-of-sample error, is selected as the final model. For this particular problem, ‘Random Forests’ with 52 features is the best model.
The expected out-of-sample error will correspond to 1-Accuracy on the validation data set.
The outcome variable “classe” is an unordered factor variable, so the error is measured as 1 - accuracy. The large sample size allows the training data set to be subdivided into two data sets, ‘training’ and ‘validation’, which are used for cross-validation. To improve model building, only relevant features are included: variables with NAs and variables with near-zero variability are dropped, and variables such as name, timestamp, and window are also excluded.
Loading the required packages
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
Downloading the data
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', 'train.csv')
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', 'test.csv')
Reading the data
training <- read.csv('train.csv')
test <- read.csv('test.csv')
Splitting the data to create a validation set
trainIndex <- createDataPartition(training$classe, p = 0.75, list = FALSE)
validation <- training[-trainIndex,]
training <- training[trainIndex,]
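No seed is set before the split, so the partition (and every number that follows) changes from run to run. The sketch below shows, in base R only, what a reproducible class-stratified 75/25 split without replacement looks like. It uses the built-in iris data as a stand-in so it runs on its own; the seed value and the names toy_train and toy_valid are illustrative assumptions, and the rounding may differ slightly from what createDataPartition does.

```r
set.seed(1234)  # arbitrary seed, chosen only so the example is reproducible

# Group row indices by class, then sample 75% of each group without replacement
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
split_idx <- unname(unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = floor(0.75 * length(rows)))
})))

toy_train <- iris[split_idx, ]   # rows used to fit the model
toy_valid <- iris[-split_idx, ]  # held-out rows, never seen during fitting

# Class proportions are preserved in both subsets
table(toy_train$Species)
table(toy_valid$Species)
```

Calling set.seed() once before createDataPartition() in the code above would make the real split reproducible in the same way.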
Removing columns with NA values
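# Indices of columns that contain at least one NA value
nacol <- which(colSums(is.na(training)) > 0)
# Drop them only if any were found: negative indexing with an
# empty index vector would otherwise drop every column
if (length(nacol) > 0) {
  training <- training[-nacol]
}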
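Dropping columns by negative index has an edge case worth knowing about: if no column contains an NA, the index vector is empty and the negative-index expression selects zero columns instead of all of them. A small sketch on toy data frames (the names toy, no_na, and cleaned are made up for illustration):

```r
# Toy data: column b contains an NA, columns a and c do not
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c("x", "y", "z"))

na_cols <- which(colSums(is.na(toy)) > 0)
cleaned <- if (length(na_cols) > 0) toy[-na_cols] else toy
names(cleaned)            # "a" "c"

# The pitfall: no column has an NA, so the index vector is empty
no_na <- data.frame(a = 1:3, c = 4:6)
empty_idx <- which(colSums(is.na(no_na)) > 0)
ncol(no_na[-empty_idx])   # 0, i.e. every column was dropped
```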
Removing the first seven columns (name, timestamp, and window variables)
training <- training[-(1:7)]
Removing columns with near zero variance
nzv <- nearZeroVar(training)
# Guard against an empty result, which would otherwise drop every column
if (length(nzv) > 0) {
  training <- training[-nzv]
}
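To make the filtering rule concrete, the sketch below approximates in base R what nearZeroVar flags with its default cutoffs (freqCut = 95/5, uniqueCut = 10): a column is flagged if it has a single distinct value, or if its most common value is far more frequent than the second most common one while the share of distinct values is small. The helper near_zero_var_sketch and the toy data are hypothetical and only meant for illustration; caret's own implementation should be used in practice.

```r
# Hypothetical helper approximating caret::nearZeroVar's default rule
near_zero_var_sketch <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    if (length(counts) <= 1) return(TRUE)           # zero variance
    freq_ratio <- counts[[1]] / counts[[2]]         # most vs. second most common
    pct_unique <- 100 * length(counts) / length(x)  # share of distinct values
    freq_ratio > freq_cut && pct_unique <= unique_cut
  }, logical(1))
  which(flagged)
}

toy <- data.frame(
  constant        = rep(1, 100),
  nearly_constant = c(rep(0, 98), 1, 2),
  informative     = seq_len(100)
)

nzv_toy <- near_zero_var_sketch(toy)
names(nzv_toy)   # "constant" "nearly_constant"
```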
Training a model using random forests
model <- train(classe~., data = training, method = 'rf')
Summarizing the model
model
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 14718, 14718, 14718, 14718, 14718, 14718, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
##    2    0.9894646  0.9866746  0.001999372  0.002518893
##   27    0.9898770  0.9871967  0.001759753  0.002219028
##   52    0.9840419  0.9798175  0.003342155  0.004216335
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
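The output above shows caret's default resampling scheme, 25 bootstrap repetitions, which can be slow for random forests. If k-fold cross-validation is preferred, a minimal sketch might look like the following. It uses the built-in iris data as a stand-in so that it runs on its own; the 5 folds, the fixed mtry, and the seed are illustrative assumptions, not settings used in this report.

```r
library(caret)  # the 'rf' method also needs the randomForest package installed

set.seed(1234)  # arbitrary seed so the folds are reproducible

# 5-fold cross-validation instead of the default 25 bootstrap resamples
ctrl <- trainControl(method = "cv", number = 5)

# iris stands in for the accelerometer data; mtry is fixed to keep the run short
cv_model <- train(Species ~ ., data = iris, method = "rf",
                  trControl = ctrl, tuneGrid = data.frame(mtry = 2))

# Cross-validated accuracy and the corresponding error estimate
cv_model$results$Accuracy
1 - cv_model$results$Accuracy
```

With this setup the accuracy reported by the model is itself a cross-validated estimate, although a separate validation set still gives an independent check.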
In-sample error, estimated from the resampled accuracy at mtry = 27 (0.9899), is about 1.01%.
valRes <- predict(model, newdata = validation)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.1
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
confusionMatrix(valRes, validation$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    2    0    0    0
##          B    0  947    0    0    0
##          C    0    0  855    3    0
##          D    0    0    0  800    0
##          E    0    0    0    1  901
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9973, 0.9996)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9979   1.0000   0.9950   1.0000
## Specificity            0.9994   1.0000   0.9993   1.0000   0.9998
## Pos Pred Value         0.9986   1.0000   0.9965   1.0000   0.9989
## Neg Pred Value         1.0000   0.9995   1.0000   0.9990   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1931   0.1743   0.1631   0.1837
## Detection Prevalence   0.2849   0.1931   0.1750   0.1631   0.1839
## Balanced Accuracy      0.9997   0.9989   0.9996   0.9975   0.9999
Out-of-sample error, estimated as 1 - accuracy on the validation set, is 1 - 0.9988 = 0.0012, or about 0.12%.
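As a sanity check, the validation error can be recomputed by hand from the counts in the confusion matrix above. The matrix below is typed in from the printed output; the variable names are illustrative.

```r
# Counts copied from the printed confusion matrix
# (rows = predictions, columns = reference classes)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1395,   2,   0,   0,   0,
                  0, 947,   0,   0,   0,
                  0,   0, 855,   3,   0,
                  0,   0,   0, 800,   0,
                  0,   0,   0,   1, 901),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error    <- 1 - accuracy             # out-of-sample error estimate

sum(cm)                    # 4904 validation samples
sum(cm) - sum(diag(cm))    # 6 misclassified samples
round(accuracy, 4)         # 0.9988
round(error, 4)            # 0.0012
```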
answers <- predict(model, newdata = test)
answers
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E