Introduction

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health and to find patterns in their behavior. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. The goal of this project is to predict the manner in which participants performed a weight-lifting exercise. Six participants were asked to perform barbell lifts correctly and incorrectly in five different ways, and data from accelerometers on the belt, forearm, arm, and dumbbell will be used to build a prediction model. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Cross-validation

Cross-validation is done by subsampling the training data set, without replacement, into two samples: 'training' and 'validation'. The model is built on the training set and then tested on the validation set; a model that performs well on the validation set is expected to generalize to unseen data.

Model Building

The task is to predict the 'classe' variable from the other variables. Several candidate models are built from different feature sets, and the model with the highest validation accuracy (equivalently, the lowest out-of-sample error) is selected as the final model. For this problem, a random forest trained on 52 features turned out to be the best model.

Expected out-of-sample error

The expected out-of-sample error corresponds to 1 - accuracy on the validation data set, i.e., the misclassification rate.

Justification of choices

The outcome variable "classe" is an unordered factor variable; hence, error is measured as '1 - accuracy'. The large sample size permits further subdividing the training data into two sets, 'training' and 'validation', which are used for cross-validation. To improve model building, only relevant features are included: variables containing NAs and variables with near-zero variability are dropped, as are bookkeeping variables such as the participant's name, timestamps, and window indicators.

Analysis

Data Loading & Processing

Loading the required packages

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

Downloading the data

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', 'train.csv')
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', 'test.csv')
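
To avoid re-downloading on repeated runs, these calls could be wrapped in a file-existence check (an optional sketch, not part of the original run):

if (!file.exists('train.csv')) {
  download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', 'train.csv')
}
if (!file.exists('test.csv')) {
  download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', 'test.csv')
}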

Reading the data

training <- read.csv('train.csv')
test <- read.csv('test.csv')
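
As an aside, this dataset is known to encode some missing values as empty strings and as '#DIV/0!'; these could be mapped to NA at read time (an alternative not used for the results below, since the near-zero-variance filter later removes the affected columns anyway):

training <- read.csv('train.csv', na.strings = c('NA', '', '#DIV/0!'))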

Splitting the data to create a validation set

trainIndex <- createDataPartition(training$classe, p = 0.75, list = FALSE)
validation <- training[-trainIndex,]
training <- training[trainIndex,]
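
Note that createDataPartition samples rows at random, so calling set.seed() beforehand (with any fixed value) would make the split reproducible. A quick sanity check of the resulting split sizes (illustrative; output not shown):

dim(training)    # about 75% of the rows (14718 here, per the model summary below)
dim(validation)  # the remaining 25%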

Removing columns with NA values

nacol <- numeric()
for (i in seq_along(training)) {
  # record the index of every column that contains at least one NA
  if (any(is.na(training[i]))) nacol <- append(nacol, i)
}

# guard against the edge case where no column has NAs
if (length(nacol) > 0) training <- training[-nacol]
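
The same filtering can also be written as a vectorized one-liner that keeps only the columns with no missing values:

training <- training[, colSums(is.na(training)) == 0]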

Removing identifier, timestamp, and window columns

# the first seven columns (row index, user_name, timestamps, and window
# markers) carry no sensor signal, so they are dropped
training <- training[-(1:7)]

Removing columns with near zero variance

nzv <- nearZeroVar(training)
# guard against the edge case where nothing is flagged
if (length(nzv) > 0) training <- training[-nzv]
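
Had we wanted to inspect why columns were flagged before dropping them, nearZeroVar can return per-column diagnostics instead of indices (illustrative; output not shown):

nzvMetrics <- nearZeroVar(training, saveMetrics = TRUE)  # freqRatio, percentUnique, zeroVar, nzv
head(nzvMetrics)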

Model Fitting

Training a model using random forests

model <- train(classe ~ ., data = training, method = 'rf')
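
With caret's defaults, train() resamples 25 bootstrap replicates for 'rf', which is slow on roughly 15,000 rows; explicit k-fold cross-validation is a common speed-up. A hypothetical alternative, not used for the results reported below:

model <- train(classe ~ ., data = training, method = 'rf',
               trControl = trainControl(method = 'cv', number = 5))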

Summarizing the model

model
## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 14718, 14718, 14718, 14718, 14718, 14718, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9894646  0.9866746  0.001999372  0.002518893
##   27    0.9898770  0.9871967  0.001759753  0.002219028
##   52    0.9840419  0.9798175  0.003342155  0.004216335
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

The estimated in-sample error is 1 - 0.9899 ≈ 1.01% (one minus the resampling accuracy of the selected model, mtry = 27).
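
The same figure can be pulled straight from the fitted object, as one minus the best resampling accuracy in the tuning results:

1 - max(model$results$Accuracy)  # approximately 0.0101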

Testing the model on the validation set

valRes <- predict(model, newdata = validation)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.1
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
confusionMatrix(valRes, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    2    0    0    0
##          B    0  947    0    0    0
##          C    0    0  855    3    0
##          D    0    0    0  800    0
##          E    0    0    0    1  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9973, 0.9996)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9985          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9979   1.0000   0.9950   1.0000
## Specificity            0.9994   1.0000   0.9993   1.0000   0.9998
## Pos Pred Value         0.9986   1.0000   0.9965   1.0000   0.9989
## Neg Pred Value         1.0000   0.9995   1.0000   0.9990   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1931   0.1743   0.1631   0.1837
## Detection Prevalence   0.2849   0.1931   0.1750   0.1631   0.1839
## Balanced Accuracy      0.9997   0.9989   0.9996   0.9975   0.9999

The out-of-sample error on the validation set is 1 - 0.9988 ≈ 0.12%.
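
This can be computed directly from the confusion-matrix object rather than read off the printout:

cm <- confusionMatrix(valRes, validation$classe)
1 - as.numeric(cm$overall['Accuracy'])  # approximately 0.0012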

Testing the model on the test set

answers <- predict(model, newdata = test)
answers
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
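
For the course submission, each prediction is typically written to its own file; a minimal sketch, assuming the problem_id_N.txt naming convention used by the assignment:

# write one prediction per file (illustrative)
for (i in seq_along(answers)) {
  write.table(answers[i], file = paste0('problem_id_', i, '.txt'),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}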