We work with a Human Activity Recognition dataset from Ugulino et al. (2012) consisting of the activity variable “classe” (sitting-down, standing-up, standing, walking, and sitting) as well as several variables from the wearable accelerometers. Four healthy subjects participated in 8 hours of activities.
The training data (pml-training.csv) was partitioned further into a training part (65%), a test part (15%) and a validation part (10%). We neither can (classe is missing) nor should use the real test data (pml-testing.csv) to build the model.
The training part of pml-training.csv is the bulk of the data and was used for building the model. The test part of pml-training.csv was used to compare the performance of different algorithms and to fine-tune the hyperparameters of the chosen final algorithm. To get a more realistic estimate of the out-of-sample error rate we also created a validation part.
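The three-way partition described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for this report: the function name `three_way_split`, the seed, and the fractions in the usage lines are assumptions chosen for the example, and the split is a plain random shuffle without stratification on classe.

```python
import random

def three_way_split(rows, train_frac, test_frac, seed=0):
    """Shuffle rows and split them into train, test and validation parts.

    The validation part receives whatever remains after the train and
    test parts have been taken.
    """
    if not (train_frac > 0 and test_frac > 0 and train_frac + test_frac < 1):
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Illustrative fractions only; 1,000 row indices stand in for pml-training.csv.
train, test, validation = three_way_split(range(1000), 0.70, 0.15, seed=42)
print(len(train), len(test), len(validation))
```

Keeping the three parts disjoint is the point of the exercise: rows used to pick an algorithm or tune it should not be reused to estimate the final error.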
A number of columns containing only summary statistics were removed since the same information is already present in other columns but as raw data.
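One way to drop such columns is to filter on their names. The sketch below assumes that the summary-statistic columns can be recognised by a name prefix; the prefix list and the helper `drop_summary_columns` are hypothetical and are not taken from this report.

```python
# Assumed prefixes of summary-statistic columns; the report does not list them.
SUMMARY_PREFIXES = ("avg_", "var_", "stddev_", "min_", "max_",
                    "amplitude_", "kurtosis_", "skewness_")

def drop_summary_columns(columns, prefixes=SUMMARY_PREFIXES):
    """Return the column names that do not start with a summary prefix."""
    return [name for name in columns if not name.startswith(prefixes)]

# A few example column names; only the raw measurements and classe are kept.
columns = ["roll_belt", "avg_roll_belt", "kurtosis_roll_belt",
           "magnet_arm_x", "classe"]
kept = drop_summary_columns(columns)
print(kept)
```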
The following heatmap shows the data we intend to use for training by the values of the outcome variable classe, but rescaled so that the mean of each variable is 0 and standard deviation 1. Variables with stronger colors (red or blue) may be more helpful for prediction than the other variables.
In the model building below, however, I decided to proceed with unscaled data.
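The rescaling used for the heatmap (mean 0, standard deviation 1 per variable) can be sketched as below. The sketch assumes the sample standard deviation (divisor n - 1); the report does not say which variant was used, and the input values are made up for illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1), an assumption
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - m) / s for v in values]

# Hypothetical readings for a single predictor.
example = [10.0, 12.0, 14.0, 16.0, 18.0]
z = standardize(example)
print([round(v, 3) for v in z])
```

Tree-based models such as random forests split on thresholds and are unaffected by this kind of monotone rescaling, which is consistent with proceeding on unscaled data.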
I first ran a multinomial logistic regression on the training data to get an accuracy score as a benchmark. “Classe” was used as the outcome, and the predictors were selected by stepwise feature selection (both directions).
I then tried naive Bayes and random forest algorithms with 10-fold cross-validation (k = 10) and otherwise standard hyperparameters. This means that in each round the model is trained on 9 (k-1) folds and validated on the remaining fold. The random forest algorithm appeared to be far superior in terms of performance (table 1).
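The fold logic of k-fold cross-validation can be sketched as follows. This only generates the row indices for each round and does not fit any model; the function name `k_fold_indices` and the row count in the usage lines are assumptions made for the example.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        # Train on the other k-1 folds, validate on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

splits = list(k_fold_indices(100, k=10, seed=1))
for train_idx, valid_idx in splits[:2]:
    print(len(train_idx), len(valid_idx))
```

Every row is used for validation exactly once, so the averaged score uses all of the training part without ever scoring a row with a model that saw it.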
| | Model | Accuracy | Bal.Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 3 | Random forest (500 trees) | 0.9990 | 0.9993 | 0.9989 | 0.9998 |
| 2 | Naive Bayes | 0.7493 | 0.8333 | 0.7318 | 0.9348 |
| 1 | Multinomial logistic | 0.6945 | | | |
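The scores in table 1 can all be derived from a confusion matrix. The sketch below assumes one-vs-rest sensitivity and specificity averaged over the classes, with balanced accuracy taken as the mean of the two; the values reported for naive Bayes are consistent with that reading, since (0.7318 + 0.9348) / 2 = 0.8333. The function name and the example matrix are hypothetical.

```python
def one_vs_rest_metrics(confusion):
    """Compute summary scores from a square confusion matrix.

    confusion[i][j] is the number of rows with true class i that were
    predicted as class j. Every class must occur at least once.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    sens, spec = [], []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn))  # share of class c rows that were found
        spec.append(tn / (tn + fp))  # share of other rows correctly rejected
    sensitivity = sum(sens) / n_classes
    specificity = sum(spec) / n_classes
    return {
        "accuracy": correct / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Made-up 3-class confusion matrix, for illustration only.
example = [[50, 2, 0],
           [3, 45, 2],
           [0, 1, 47]]
scores = one_vs_rest_metrics(example)
print({name: round(value, 4) for name, value in scores.items()})
```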
| | No_Trees | Bal_Accuracy | Accuracy | Specificity | Sensitivity |
|---|---|---|---|---|---|
| Model 1 | 100 | 0.9990 | 0.9985 | 0.9996 | 0.9984 | 
| Model 2 | 200 | 0.9992 | 0.9988 | 0.9997 | 0.9987 | 
| Model 3 | 300 | 0.9993 | 0.9990 | 0.9997 | 0.9989 | 
| Model 4 | 400 | 0.9990 | 0.9985 | 0.9996 | 0.9984 | 
| Model 5 | 500 | 0.9993 | 0.9990 | 0.9998 | 0.9988 | 
| Model 6 | 600 | 0.9993 | 0.9990 | 0.9998 | 0.9988 | 
| Model 7 | 700 | 0.9997 | 0.9995 | 0.9999 | 0.9995 | 
| Model 8 | 800 | 0.9997 | 0.9995 | 0.9999 | 0.9995 | 
| Model 9 | 900 | 0.9995 | 0.9993 | 0.9998 | 0.9992 | 
| Model 10 | 1,000 | 0.9995 | 0.9993 | 0.9998 | 0.9992 | 
I decided to go forward with the random forest model and next ran 10 models with different numbers of trees (table 2). In general, the model does not seem to be particularly sensitive to the number of trees. The model with 500 trees seems to suffice, as the scores increase only in the fourth decimal place for a higher number of trees. With 500 trees, the highest accuracy was achieved with a little less than 30 predictors (graph 2).
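How little the number of trees matters can be checked with a few lines of arithmetic on table 2. The sketch assumes that the second score after the tree count in each row is the accuracy; the variable names are chosen for illustration.

```python
# Accuracy values from table 2, keyed by number of trees
# (assumed to be the second score after the tree count in each row).
accuracy_by_trees = {
    100: 0.9985, 200: 0.9988, 300: 0.9990, 400: 0.9985, 500: 0.9990,
    600: 0.9990, 700: 0.9995, 800: 0.9995, 900: 0.9993, 1000: 0.9993,
}

# Smallest tree count that reaches the highest accuracy.
best_trees = max(accuracy_by_trees, key=accuracy_by_trees.get)
# Difference between the best and the worst setting.
spread = max(accuracy_by_trees.values()) - min(accuracy_by_trees.values())
# What is gained by going beyond 500 trees.
gain_over_500 = accuracy_by_trees[best_trees] - accuracy_by_trees[500]

print(f"best: {best_trees} trees")
print(f"spread across all settings: {spread:.4f}")
print(f"gain over 500 trees: {gain_over_500:.4f}")
```

Differences of this size are small enough that they may reflect random variation between runs rather than a real effect of the tree count.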
Breiman and Cutler (2023) note that “in random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error.” Thus we should be able to rely on the accuracy scores already achieved, implying that the out-of-sample error should be close to zero. In order to confirm this, we predict the validation dataset using the final random forest model. Table 3 shows the results, confirming that the out-of-sample error (1 - accuracy) is indeed close to zero, at roughly 0.1%.
| | Bal_Accuracy | Accuracy | Specificity | Sensitivity |
|---|---|---|---|---|
| Accuracy | 0.9992 | 0.9989 | 0.9997 | 0.9987 | 
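The claim quoted from Breiman and Cutler rests on the out-of-bag rows: each tree is grown on a bootstrap sample drawn with replacement, so some rows are never seen by that tree and can act as test data for it. The simulation below is a sketch of that sampling step only (no trees are grown); the function name, the row count and the seeds are assumptions made for the example.

```python
import random

def out_of_bag_fraction(n_rows, seed=0):
    """Share of rows left out of one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    # Draw n_rows row indices with replacement; the set keeps the distinct ones.
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(in_bag) / n_rows

# Average over 20 simulated bootstrap samples of 1,000 rows each.
fractions = [out_of_bag_fraction(1000, seed=s) for s in range(20)]
average = sum(fractions) / len(fractions)
# Probability that a given row is never drawn in 1,000 draws.
expected = (1 - 1 / 1000) ** 1000

print(f"simulated out-of-bag share: {average:.3f}")
print(f"theoretical value: {expected:.3f}")
```

With roughly a third of the rows held out per tree, every row ends up being predicted by a sizeable subset of the trees that never saw it, which is what makes the out-of-bag error usable as a test-set estimate.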
Breiman and Cutler (2023), “Random Forests”, available online: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Ugulino W., Cardador D., Vega K., Velloso E., Milidiu R., Fuks H. (2012), “Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements”. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
| Variable | A | B | C | D | E | Min | Max |
|---|---|---|---|---|---|---|---|
| raw_timestamp_part_1 | 132.3 | 132.3 | 132.3 | 132.3 | 132.3 | 132.2 | 132.3 | 
| raw_timestamp_part_2 | 497.1 | 498.8 | 494.2 | 506.1 | 516.7 | 0.3 | 998.8 | 
| roll_belt | 59.8 | 64.8 | 63.8 | 60.2 | 74.1 | -28.8 | 162 | 
| pitch_belt | 0.4 | 0 | -1.5 | 1.6 | 0.9 | -55.8 | 60.3 | 
| yaw_belt | -11.7 | -13.3 | -7.5 | -18 | -6.2 | -179 | 179 | 
| total_accel_belt | 10.7 | 11.1 | 11 | 11.2 | 12.6 | 0 | 29 | 
| gyros_belt_x | 0 | 0 | 0 | 0 | 0 | -1 | 2.2 | 
| gyros_belt_y | 0 | 0 | 0 | 0 | 0 | -0.5 | 0.6 | 
| gyros_belt_z | -0.1 | -0.1 | -0.1 | -0.1 | -0.1 | -1.5 | 1.5 | 
| accel_belt_x | -6.3 | -5 | -3.5 | -8.1 | -5.3 | -120 | 85 | 
| accel_belt_y | 29.2 | 31.8 | 30.5 | 30.1 | 29.2 | -69 | 164 | 
| accel_belt_z | -63.6 | -73 | -69.5 | -67.9 | -91.4 | -275 | 105 | 
| magnet_belt_x | 57.5 | 49 | 57.8 | 49 | 62.9 | -49 | 485 | 
| magnet_belt_y | 602.1 | 599.5 | 599.6 | 595.4 | 568 | 354 | 673 | 
| magnet_belt_z | -337.6 | -336.2 | -337.9 | -339.2 | -378.2 | -621 | 286 | 
| roll_arm | -2.4 | 32.9 | 26.1 | 21.5 | 21.3 | -178 | 180 | 
| pitch_arm | 3.8 | -6.5 | -1.3 | -10.6 | -12.5 | -88.2 | 88.5 | 
| yaw_arm | -11.5 | 8.1 | 4.6 | 5 | -2 | -180 | 180 | 
| total_accel_arm | 27.4 | 26.5 | 24.6 | 23.4 | 24.8 | 1 | 65 | 
| gyros_arm_x | 0 | 0 | 0.1 | 0 | 0.1 | -6.4 | 4.9 | 
| gyros_arm_y | -0.2 | -0.3 | -0.3 | -0.3 | -0.3 | -3.4 | 2.8 | 
| gyros_arm_z | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | -2.3 | 3 | 
| accel_arm_x | -133.3 | -41.3 | -81.7 | 14.3 | -18.3 | -383 | 435 | 
| accel_arm_y | 46.1 | 25.5 | 42.1 | 24.6 | 15.6 | -315 | 308 | 
| accel_arm_z | -75.2 | -95.1 | -55.8 | -50.1 | -78.9 | -630 | 292 | 
| magnet_arm_x | -24.6 | 236.6 | 150.4 | 398.9 | 321.5 | -584 | 782 | 
| magnet_arm_y | 235.7 | 128.7 | 189.3 | 95.1 | 81.6 | -384 | 583 | 
| magnet_arm_z | 412.2 | 194.6 | 360 | 293.8 | 212.2 | -596 | 694 | 
| roll_dumbbell | 21.4 | 35.4 | -14.1 | 50.7 | 25.7 | -153.7 | 153.5 | 
| pitch_dumbbell | -18.9 | 2.9 | -25.3 | -3.1 | -6.6 | -149.6 | 137 | 
| yaw_dumbbell | 2.4 | 14.4 | -17.7 | -0.2 | 6.8 | -150.9 | 154.8 | 
| total_accel_dumbbell | 14.5 | 14.3 | 12.7 | 11.5 | 14.3 | 0 | 42 | 
| gyros_dumbbell_x | 0.2 | 0.2 | 0.2 | 0.2 | 0.1 | -2 | 2.2 | 
| gyros_dumbbell_y | 0 | 0 | 0.1 | 0 | 0.1 | -2.1 | 4.4 | 
| gyros_dumbbell_z | -0.1 | -0.1 | -0.2 | -0.1 | -0.1 | -2.4 | 1.9 | 
| accel_dumbbell_x | -50 | -1.3 | -40 | -23.8 | -17.3 | -237 | 235 | 
| accel_dumbbell_y | 51.3 | 68.1 | 29.6 | 54.7 | 54.4 | -189 | 315 | 
| accel_dumbbell_z | -55.1 | -16.2 | -53.2 | -35.9 | -22.5 | -334 | 318 | 
| magnet_dumbbell_x | -387.7 | -250 | -382 | -326.2 | -285.5 | -639 | 583 | 
| magnet_dumbbell_y | 216.8 | 261.2 | 164.4 | 227.8 | 236.6 | -3,600 | 633 | 
| magnet_dumbbell_z | 8.7 | 46.9 | 65.7 | 57.6 | 69.9 | -262 | 452 | 
| roll_forearm | 25.9 | 34 | 58.8 | 16.8 | 41.2 | -180 | 180 | 
| pitch_forearm | -7.4 | 14.8 | 12.1 | 28 | 16.9 | -72.5 | 89.8 | 
| yaw_forearm | 23.1 | 13.9 | 38 | 5.3 | 10.4 | -180 | 180 | 
| total_accel_forearm | 32.2 | 35.2 | 34.9 | 36.2 | 36.5 | 0 | 73 | 
| gyros_forearm_x | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 | -5 | 3.5 | 
| gyros_forearm_y | 0.1 | 0.1 | 0.1 | 0 | 0 | -6.7 | 6.1 | 
| gyros_forearm_z | 0.1 | 0.2 | 0.2 | 0.1 | 0.1 | -5.6 | 4.3 | 
| accel_forearm_x | -2.3 | -73.4 | -45.2 | -155.8 | -69 | -496 | 477 | 
| accel_forearm_y | 170.8 | 138.7 | 214.5 | 151.1 | 142.2 | -585 | 591 | 
| accel_forearm_z | -59.2 | -48.8 | -58.3 | -48.2 | -62 | -391 | 287 | 
| magnet_forearm_x | -199.3 | -322.5 | -324.2 | -458.4 | -323.7 | -1,280 | 672 | 
| magnet_forearm_y | 476.5 | 280.1 | 500.9 | 315.3 | 266.7 | -896 | 1,480 | 
| magnet_forearm_z | 406.4 | 384.5 | 451.8 | 361.4 | 354.6 | -966 | 1,070 |