The data

We work with a Human Activity Recognition dataset from Ugulino et al. (2012), consisting of the activity variable "classe" (sitting-down, standing-up, standing, walking, and sitting) as well as several variables from the wearable accelerometers. Four healthy subjects participated in eight hours of activities.

Data preparation

The training data (pml-training.csv) was partitioned further into a training part (65%), a test part (15%) and a validation part (10%). We neither can (classe is missing) nor should use the real test data (pml-testing.csv) to build the model.

The training part from pml-training.csv is the bulk of the data used for building the model. The test part from pml-training.csv was used to compare the performance of different algorithms and to fine-tune the hyperparameters of the chosen final algorithm. To get a more realistic estimate of the out-of-sample error rate, we also created a validation part.
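A minimal sketch of such a split, assuming the analysis is done in R with the caret package (the seed and the exact second-stage proportion are illustrative, not taken from the report):

```r
library(caret)

pml <- read.csv("pml-training.csv")

set.seed(1234)                                   # illustrative seed
# 65% of the data goes to the training part
in_train <- createDataPartition(pml$classe, p = 0.65, list = FALSE)
training <- pml[in_train, ]
rest     <- pml[-in_train, ]

# Split the remainder between the test and validation parts
# (p = 0.6 keeps the 15:10 ratio between the two parts stated in the text)
in_test    <- createDataPartition(rest$classe, p = 0.6, list = FALSE)
testing    <- rest[in_test, ]
validation <- rest[-in_test, ]
```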

A number of columns containing only summary statistics were removed, since the same information is already present as raw data in other columns.
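A sketch of that cleaning step; the prefixes below are my assumption about which columns hold the summary statistics (kurtosis, skewness, max, min, amplitude, variance, average, standard deviation):

```r
# Drop derived summary-statistic columns, keeping only the raw sensor signals
summary_prefixes <- c("kurtosis_", "skewness_", "max_", "min_",
                      "amplitude_", "var_", "avg_", "stddev_")
pattern   <- paste0("^(", paste(summary_prefixes, collapse = "|"), ")")
drop_cols <- grepl(pattern, names(training))
training  <- training[, !drop_cols]
```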

The following heatmap shows the data we intend to use for training, by value of the outcome variable classe, rescaled so that each variable has mean 0 and standard deviation 1. Variables with stronger colors (red or blue) may be more helpful for prediction than the others.
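A sketch of the rescaling and heatmap (the plotting choices, such as summarising by per-classe means and the color scale, are mine and not necessarily those used for the figure):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Scale every numeric predictor to mean 0 and standard deviation 1
numeric_vars <- training %>% select(where(is.numeric))
scaled <- as.data.frame(scale(numeric_vars))
scaled$classe <- training$classe

# Heatmap of the per-classe means of the scaled variables
scaled %>%
  group_by(classe) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(-classe, names_to = "variable", values_to = "mean") %>%
  ggplot(aes(x = classe, y = variable, fill = mean)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red")
```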

In the model building below, however, I decided to proceed with unscaled data.

Choice of algorithm, cross-validation and final model

I first ran a multinomial logistic regression on the training data to get a benchmark accuracy score. "Classe" was used as the outcome, and predictors were selected by stepwise feature selection (both directions).
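A minimal sketch of that benchmark, assuming nnet::multinom for the multinomial model and MASS::stepAIC for the stepwise selection (the report does not say which implementation was used, so both choices are assumptions):

```r
library(nnet)
library(MASS)

# Multinomial logistic regression with classe as outcome;
# predictors are assumed to be the cleaned columns in `training`
full_fit <- multinom(classe ~ ., data = training, MaxNWts = 5000, trace = FALSE)

# Stepwise feature selection in both directions (AIC-based)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# Benchmark accuracy on the test part
mean(predict(step_fit, newdata = testing) == testing$classe)
```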

I then tried naive Bayes and random forest algorithms with 10-fold cross-validation (k = 10) and otherwise default hyperparameters. This means the model is trained on k - 1 = 9 folds and validated on the remaining fold, rotating through all folds. The random forest algorithm was far superior in terms of performance (table 1).
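A sketch of the cross-validated fits, assuming the caret package (method codes "nb" and "rf"; the exact calls and seeds are my assumption):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

set.seed(2023)
nb_fit <- train(classe ~ ., data = training, method = "nb", trControl = ctrl)

set.seed(2023)
rf_fit <- train(classe ~ ., data = training, method = "rf",
                trControl = ctrl, ntree = 500)

# Compare the fitted models on the held-out test part
confusionMatrix(predict(nb_fit, testing), factor(testing$classe))
confusionMatrix(predict(rf_fit, testing), factor(testing$classe))
```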

Table 1. Model performance (test part)
Model                       Accuracy   Bal. accuracy   Sensitivity   Specificity
Random forest (500 trees)   0.9990     0.9993          0.9989        0.9998
Naive Bayes                 0.7493     0.8333          0.7318        0.9348
Multinomial logistic        0.6945
Table 2. Performance of the random forest model for different numbers of trees
Model      No. trees   Bal. accuracy   Accuracy   Specificity   Sensitivity
Model 1    100         0.9990          0.9985     0.9996        0.9984
Model 2    200         0.9992          0.9988     0.9997        0.9987
Model 3    300         0.9993          0.9990     0.9997        0.9989
Model 4    400         0.9990          0.9985     0.9996        0.9984
Model 5    500         0.9993          0.9990     0.9998        0.9988
Model 6    600         0.9993          0.9990     0.9998        0.9988
Model 7    700         0.9997          0.9995     0.9999        0.9995
Model 8    800         0.9997          0.9995     0.9999        0.9995
Model 9    900         0.9995          0.9993     0.9998        0.9992
Model 10   1,000       0.9995          0.9993     0.9998        0.9992

I decided to go forward with the random forest model and next ran 10 models with different numbers of trees (table 2). In general, the model does not seem to be particularly sensitive to the number of trees. The model with 500 trees seems to suffice, as the scores increase only marginally for higher numbers of trees. With 500 trees, the highest accuracy was achieved with a little less than 30 predictors (graph 2).
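A sketch of that sweep over tree counts, reusing the caret setup above (loop structure and object names are illustrative):

```r
tree_counts <- seq(100, 1000, by = 100)

rf_models <- lapply(tree_counts, function(nt) {
  set.seed(2023)
  train(classe ~ ., data = training, method = "rf",
        trControl = ctrl, ntree = nt)
})

# Test-part accuracy for each tree count
sapply(seq_along(tree_counts), function(i) {
  cm <- confusionMatrix(predict(rf_models[[i]], testing), factor(testing$classe))
  c(trees = tree_counts[i], accuracy = unname(cm$overall["Accuracy"]))
})

# Cross-validated accuracy against the number of randomly selected
# predictors (mtry) for the 500-tree model, as referred to in graph 2
plot(rf_models[[5]])
```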

Expected out-of-sample error

Breiman and Cutler (2023) note that "in random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error." Thus we should be able to rely on the accuracy scores already achieved, implying that the out-of-sample error should be close to zero. To confirm this, we predicted on the validation part using the final random forest model. Table 3 shows the results, confirming that the out-of-sample error (1 - accuracy) is indeed close to zero.
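A sketch of that check, assuming the 500-tree caret model (rf_fit) and the validation part from the sketches above:

```r
# Predict the validation part with the final random forest model
val_pred <- predict(rf_fit, newdata = validation)
cm <- confusionMatrix(val_pred, factor(validation$classe))

cm$overall["Accuracy"]        # accuracy on the validation part
1 - cm$overall["Accuracy"]    # estimated out-of-sample error
```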

Table 3. Performance of the final random forest model on the validation part
Bal. accuracy   Accuracy   Specificity   Sensitivity
0.9992          0.9989     0.9997        0.9987

References

Breiman, L. and Cutler, A. (2023), "Random Forests", available online: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiu, R. and Fuks, H. (2012), "Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements", in Advances in Artificial Intelligence - SBIA 2012, Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Appendix

Table A1. Mean by outcome value (classe A-E), and min and max, for the training part
Variable   A   B   C   D   E   Min   Max
raw_timestamp_part_1 132.3 132.3 132.3 132.3 132.3 132.2 132.3
raw_timestamp_part_2 497.1 498.8 494.2 506.1 516.7 0.3 998.8
roll_belt 59.8 64.8 63.8 60.2 74.1 -28.8 162
pitch_belt 0.4 0 -1.5 1.6 0.9 -55.8 60.3
yaw_belt -11.7 -13.3 -7.5 -18 -6.2 -179 179
total_accel_belt 10.7 11.1 11 11.2 12.6 0 29
gyros_belt_x 0 0 0 0 0 -1 2.2
gyros_belt_y 0 0 0 0 0 -0.5 0.6
gyros_belt_z -0.1 -0.1 -0.1 -0.1 -0.1 -1.5 1.5
accel_belt_x -6.3 -5 -3.5 -8.1 -5.3 -120 85
accel_belt_y 29.2 31.8 30.5 30.1 29.2 -69 164
accel_belt_z -63.6 -73 -69.5 -67.9 -91.4 -275 105
magnet_belt_x 57.5 49 57.8 49 62.9 -49 485
magnet_belt_y 602.1 599.5 599.6 595.4 568 354 673
magnet_belt_z -337.6 -336.2 -337.9 -339.2 -378.2 -621 286
roll_arm -2.4 32.9 26.1 21.5 21.3 -178 180
pitch_arm 3.8 -6.5 -1.3 -10.6 -12.5 -88.2 88.5
yaw_arm -11.5 8.1 4.6 5 -2 -180 180
total_accel_arm 27.4 26.5 24.6 23.4 24.8 1 65
gyros_arm_x 0 0 0.1 0 0.1 -6.4 4.9
gyros_arm_y -0.2 -0.3 -0.3 -0.3 -0.3 -3.4 2.8
gyros_arm_z 0.3 0.3 0.3 0.3 0.3 -2.3 3
accel_arm_x -133.3 -41.3 -81.7 14.3 -18.3 -383 435
accel_arm_y 46.1 25.5 42.1 24.6 15.6 -315 308
accel_arm_z -75.2 -95.1 -55.8 -50.1 -78.9 -630 292
magnet_arm_x -24.6 236.6 150.4 398.9 321.5 -584 782
magnet_arm_y 235.7 128.7 189.3 95.1 81.6 -384 583
magnet_arm_z 412.2 194.6 360 293.8 212.2 -596 694
roll_dumbbell 21.4 35.4 -14.1 50.7 25.7 -153.7 153.5
pitch_dumbbell -18.9 2.9 -25.3 -3.1 -6.6 -149.6 137
yaw_dumbbell 2.4 14.4 -17.7 -0.2 6.8 -150.9 154.8
total_accel_dumbbell 14.5 14.3 12.7 11.5 14.3 0 42
gyros_dumbbell_x 0.2 0.2 0.2 0.2 0.1 -2 2.2
gyros_dumbbell_y 0 0 0.1 0 0.1 -2.1 4.4
gyros_dumbbell_z -0.1 -0.1 -0.2 -0.1 -0.1 -2.4 1.9
accel_dumbbell_x -50 -1.3 -40 -23.8 -17.3 -237 235
accel_dumbbell_y 51.3 68.1 29.6 54.7 54.4 -189 315
accel_dumbbell_z -55.1 -16.2 -53.2 -35.9 -22.5 -334 318
magnet_dumbbell_x -387.7 -250 -382 -326.2 -285.5 -639 583
magnet_dumbbell_y 216.8 261.2 164.4 227.8 236.6 -3,600 633
magnet_dumbbell_z 8.7 46.9 65.7 57.6 69.9 -262 452
roll_forearm 25.9 34 58.8 16.8 41.2 -180 180
pitch_forearm -7.4 14.8 12.1 28 16.9 -72.5 89.8
yaw_forearm 23.1 13.9 38 5.3 10.4 -180 180
total_accel_forearm 32.2 35.2 34.9 36.2 36.5 0 73
gyros_forearm_x 0.2 0.1 0.2 0.1 0.1 -5 3.5
gyros_forearm_y 0.1 0.1 0.1 0 0 -6.7 6.1
gyros_forearm_z 0.1 0.2 0.2 0.1 0.1 -5.6 4.3
accel_forearm_x -2.3 -73.4 -45.2 -155.8 -69 -496 477
accel_forearm_y 170.8 138.7 214.5 151.1 142.2 -585 591
accel_forearm_z -59.2 -48.8 -58.3 -48.2 -62 -391 287
magnet_forearm_x -199.3 -322.5 -324.2 -458.4 -323.7 -1,280 672
magnet_forearm_y 476.5 280.1 500.9 315.3 266.7 -896 1,480
magnet_forearm_z 406.4 384.5 451.8 361.4 354.6 -966 1,070