The data

We work with a Human Activity Recognition dataset from Ugulino et al. (2012), consisting of the activity variable "classe" (sitting-down, standing-up, standing, walking, and sitting) as well as several variables from the wearable accelerometers. Four healthy subjects participated in eight hours of activities.

Data preparation

The training data (pml-training.csv) was partitioned further into a training part (65%), a test part (15%) and a validation part (10%). We neither can (classe is missing) nor should use the real test data (pml-testing.csv) to build the model.

The training part from pml-training.csv is the bulk of the data used for building the model. The test part from pml-training.csv was used to compare the performance of different algorithms and to fine-tune the hyperparameters of the chosen final algorithm. To get a more realistic estimate of the out-of-sample error rate, we also created a validation part.
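A minimal sketch of such a split, assuming the analysis is done in R with the caret package (the seed and the exact second-stage proportion are illustrative, not taken from the report):

```r
library(caret)

pml <- read.csv("pml-training.csv")

set.seed(1234)                                   # illustrative seed
# 65% of the data goes to the training part
in_train <- createDataPartition(pml$classe, p = 0.65, list = FALSE)
training <- pml[in_train, ]
rest     <- pml[-in_train, ]

# Split the remainder between the test and validation parts
# (p = 0.6 keeps the 15:10 ratio between the two parts stated in the text)
in_test    <- createDataPartition(rest$classe, p = 0.6, list = FALSE)
testing    <- rest[in_test, ]
validation <- rest[-in_test, ]
```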

A number of columns containing only summary statistics were removed, since the same information is already present as raw data in other columns.
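A sketch of that cleaning step; the prefixes below are my assumption about which columns hold the summary statistics (kurtosis, skewness, max, min, amplitude, variance, average, standard deviation):

```r
# Drop derived summary-statistic columns, keeping only the raw sensor signals
summary_prefixes <- c("kurtosis_", "skewness_", "max_", "min_",
                      "amplitude_", "var_", "avg_", "stddev_")
pattern   <- paste0("^(", paste(summary_prefixes, collapse = "|"), ")")
drop_cols <- grepl(pattern, names(training))
training  <- training[, !drop_cols]
```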

The following heatmap shows the data we intend to use for training, by value of the outcome variable classe, rescaled so that each variable has mean 0 and standard deviation 1. Variables with stronger colors (red or blue) may be more helpful for prediction than the others.
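A sketch of the rescaling and heatmap (the plotting choices, such as summarising by per-classe means and the color scale, are mine and not necessarily those used for the figure):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Scale every numeric predictor to mean 0 and standard deviation 1
numeric_vars <- training %>% select(where(is.numeric))
scaled <- as.data.frame(scale(numeric_vars))
scaled$classe <- training$classe

# Heatmap of the per-classe means of the scaled variables
scaled %>%
  group_by(classe) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(-classe, names_to = "variable", values_to = "mean") %>%
  ggplot(aes(x = classe, y = variable, fill = mean)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red")
```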

In the model building below, however, I decided to proceed with unscaled data.

Choice of algorithm, cross-validation and final model

I first ran a multinomial logistic regression on the training data to get a benchmark accuracy score. "Classe" was used as the outcome, and predictors were selected by stepwise feature selection (both directions).
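A minimal sketch of that benchmark, assuming nnet::multinom for the multinomial model and MASS::stepAIC for the stepwise selection (the report does not say which implementation was used, so both choices are assumptions):

```r
library(nnet)
library(MASS)

# Multinomial logistic regression with classe as outcome;
# predictors are assumed to be the cleaned columns in `training`
full_fit <- multinom(classe ~ ., data = training, MaxNWts = 5000, trace = FALSE)

# Stepwise feature selection in both directions (AIC-based)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# Benchmark accuracy on the test part
mean(predict(step_fit, newdata = testing) == testing$classe)
```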

I then tried naive Bayes and random forest algorithms with 10-fold cross-validation (k = 10) and otherwise default hyperparameters. This means the model is trained on k - 1 = 9 folds and validated on the remaining fold, rotating through all folds. The random forest algorithm was far superior in terms of performance (table 1).
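A sketch of the cross-validated fits, assuming the caret package (method codes "nb" and "rf"; the exact calls and seeds are my assumption):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

set.seed(2023)
nb_fit <- train(classe ~ ., data = training, method = "nb", trControl = ctrl)

set.seed(2023)
rf_fit <- train(classe ~ ., data = training, method = "rf",
                trControl = ctrl, ntree = 500)

# Compare the fitted models on the held-out test part
confusionMatrix(predict(nb_fit, testing), factor(testing$classe))
confusionMatrix(predict(rf_fit, testing), factor(testing$classe))
```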

Table 1. Model performance (test part)
Model                       Accuracy   Bal. accuracy   Sensitivity   Specificity
Random forest (500 trees)   0.9990     0.9993          0.9989        0.9998
Naive Bayes                 0.7493     0.8333          0.7318        0.9348
Multinomial logistic        0.6945
Table 2. Performance of the random forest model for different numbers of trees
Model      No. trees   Bal. accuracy   Accuracy   Specificity   Sensitivity
Model 1    100         0.9990          0.9985     0.9996        0.9984
Model 2    200         0.9992          0.9988     0.9997        0.9987
Model 3    300         0.9993          0.9990     0.9997        0.9989
Model 4    400         0.9990          0.9985     0.9996        0.9984
Model 5    500         0.9993          0.9990     0.9998        0.9988
Model 6    600         0.9993          0.9990     0.9998        0.9988
Model 7    700         0.9997          0.9995     0.9999        0.9995
Model 8    800         0.9997          0.9995     0.9999        0.9995
Model 9    900         0.9995          0.9993     0.9998        0.9992
Model 10   1,000       0.9995          0.9993     0.9998        0.9992

I decided to go forward with the random forest model and next ran 10 models with different numbers of trees (table 2). In general, the model does not seem to be particularly sensitive to the number of trees. The model with 500 trees seems to suffice, as the scores increase only marginally for higher numbers of trees. With 500 trees, the highest accuracy was achieved with a little less than 30 predictors (graph 2).
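A sketch of that sweep over tree counts, reusing the caret setup above (loop structure and object names are illustrative):

```r
tree_counts <- seq(100, 1000, by = 100)

rf_models <- lapply(tree_counts, function(nt) {
  set.seed(2023)
  train(classe ~ ., data = training, method = "rf",
        trControl = ctrl, ntree = nt)
})

# Test-part accuracy for each tree count
sapply(seq_along(tree_counts), function(i) {
  cm <- confusionMatrix(predict(rf_models[[i]], testing), factor(testing$classe))
  c(trees = tree_counts[i], accuracy = unname(cm$overall["Accuracy"]))
})

# Cross-validated accuracy against the number of randomly selected
# predictors (mtry) for the 500-tree model, as referred to in graph 2
plot(rf_models[[5]])
```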

Expected out-of-sample error

Breiman and Cutler (2023) note that "in random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error." Thus we should be able to rely on the accuracy scores already achieved, implying that the out-of-sample error should be close to zero. To confirm this, we predicted on the validation part using the final random forest model. Table 3 shows the results, confirming that the out-of-sample error (1 - accuracy) is indeed close to zero.
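A sketch of that check, assuming the 500-tree caret model (rf_fit) and the validation part from the sketches above:

```r
# Predict the validation part with the final random forest model
val_pred <- predict(rf_fit, newdata = validation)
cm <- confusionMatrix(val_pred, factor(validation$classe))

cm$overall["Accuracy"]        # accuracy on the validation part
1 - cm$overall["Accuracy"]    # estimated out-of-sample error
```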

Table 3. Performance of the final random forest model on the validation part
Bal. accuracy   Accuracy   Specificity   Sensitivity
0.9992          0.9989     0.9997        0.9987

References

Breiman, L. and Cutler, A. (2023), "Random Forests", available online: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiu, R. and Fuks, H. (2012), "Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements", in Advances in Artificial Intelligence - SBIA 2012, Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Appendix

Table A1. Mean by outcome value (classe A-E), and min and max, for the training part
Variable   A   B   C   D   E   Min   Max
raw_timestamp_part_1 132.3 132.3 132.3 132.3 132.3 132.2 132.3
raw_timestamp_part_2 497.1 498.8 494.2 506.1 516.7 0.3 998.8
roll_belt 59.8 64.8 63.8 60.2 74.1 -28.8 162
pitch_belt 0.4 0 -1.5 1.6 0.9 -55.8 60.3
yaw_belt -11.7 -13.3 -7.5 -18 -6.2 -179 179
total_accel_belt 10.7 11.1 11 11.2 12.6 0 29
gyros_belt_x 0 0 0 0 0 -1 2.2
gyros_belt_y 0 0 0 0 0 -0.5 0.6
gyros_belt_z -0.1 -0.1 -0.1 -0.1 -0.1 -1.5 1.5
accel_belt_x -6.3 -5 -3.5 -8.1 -5.3 -120 85
accel_belt_y 29.2 31.8 30.5 30.1 29.2 -69 164
accel_belt_z -63.6 -73 -69.5 -67.9 -91.4 -275 105
magnet_belt_x 57.5 49 57.8 49 62.9 -49 485
magnet_belt_y 602.1 599.5 599.6 595.4 568 354 673
magnet_belt_z -337.6 -336.2 -337.9 -339.2 -378.2 -621 286
roll_arm -2.4 32.9 26.1 21.5 21.3 -178 180
pitch_arm 3.8 -6.5 -1.3 -10.6 -12.5 -88.2 88.5
yaw_arm -11.5 8.1 4.6 5 -2 -180 180
total_accel_arm 27.4 26.5 24.6 23.4 24.8 1 65
gyros_arm_x 0 0 0.1 0 0.1 -6.4 4.9
gyros_arm_y -0.2 -0.3 -0.3 -0.3 -0.3 -3.4 2.8
gyros_arm_z 0.3 0.3 0.3 0.3 0.3 -2.3 3
accel_arm_x -133.3 -41.3 -81.7 14.3 -18.3 -383 435
accel_arm_y 46.1 25.5 42.1 24.6 15.6 -315 308
accel_arm_z -75.2 -95.1 -55.8 -50.1 -78.9 -630 292
magnet_arm_x -24.6 236.6 150.4 398.9 321.5 -584 782
magnet_arm_y 235.7 128.7 189.3 95.1 81.6 -384 583
magnet_arm_z 412.2 194.6 360 293.8 212.2 -596 694
roll_dumbbell 21.4 35.4 -14.1 50.7 25.7 -153.7 153.5
pitch_dumbbell -18.9 2.9 -25.3 -3.1 -6.6 -149.6 137
yaw_dumbbell 2.4 14.4 -17.7 -0.2 6.8 -150.9 154.8
total_accel_dumbbell 14.5 14.3 12.7 11.5 14.3 0 42
gyros_dumbbell_x 0.2 0.2 0.2 0.2 0.1 -2 2.2
gyros_dumbbell_y 0 0 0.1 0 0.1 -2.1 4.4
gyros_dumbbell_z -0.1 -0.1 -0.2 -0.1 -0.1 -2.4 1.9
accel_dumbbell_x -50 -1.3 -40 -23.8 -17.3 -237 235
accel_dumbbell_y 51.3 68.1 29.6 54.7 54.4 -189 315
accel_dumbbell_z -55.1 -16.2 -53.2 -35.9 -22.5 -334 318
magnet_dumbbell_x -387.7 -250 -382 -326.2 -285.5 -639 583
magnet_dumbbell_y 216.8 261.2 164.4 227.8 236.6 -3,600 633
magnet_dumbbell_z 8.7 46.9 65.7 57.6 69.9 -262 452
roll_forearm 25.9 34 58.8 16.8 41.2 -180 180
pitch_forearm -7.4 14.8 12.1 28 16.9 -72.5 89.8
yaw_forearm 23.1 13.9 38 5.3 10.4 -180 180
total_accel_forearm 32.2 35.2 34.9 36.2 36.5 0 73
gyros_forearm_x 0.2 0.1 0.2 0.1 0.1 -5 3.5
gyros_forearm_y 0.1 0.1 0.1 0 0 -6.7 6.1
gyros_forearm_z 0.1 0.2 0.2 0.1 0.1 -5.6 4.3
accel_forearm_x -2.3 -73.4 -45.2 -155.8 -69 -496 477
accel_forearm_y 170.8 138.7 214.5 151.1 142.2 -585 591
accel_forearm_z -59.2 -48.8 -58.3 -48.2 -62 -391 287
magnet_forearm_x -199.3 -322.5 -324.2 -458.4 -323.7 -1,280 672
magnet_forearm_y 476.5 280.1 500.9 315.3 266.7 -896 1,480
magnet_forearm_z 406.4 384.5 451.8 361.4 354.6 -966 1,070