As seen last week in a post on grid search cross-validation, crossval contains generic functions for statistical/machine learning cross-validation in R. A 4-fold cross-validation procedure is presented below:
In this post, I present some examples of using crossval on a linear model, and on the popular xgboost and randomForest models. The error measure used is the Root Mean Squared Error (RMSE), which is currently the only one implemented.
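For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch in base R (the function name rmse is mine for illustration, not necessarily the one used inside crossval):

```r
# Root Mean Squared Error between observed and predicted values
rmse <- function(y_true, y_pred) {
  stopifnot(length(y_true) == length(y_pred))
  sqrt(mean((y_true - y_pred)^2))
}

# errors are 0, 0 and -2, so the result is sqrt(4 / 3), about 1.155
print(rmse(c(1, 2, 3), c(1, 2, 5)))
```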
Installation
From GitHub, in the R console:
devtools::install_github("thierrymoudiki/crossval")
Demo
We use a simulated dataset for this demo, containing 100 examples and 5 explanatory variables:
# dataset creation
set.seed(123)
n <- 100 ; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
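Note that y is drawn independently of X here, so there is no signal to learn: any model's out-of-sample RMSE should hover around the standard deviation of y, which is close to 1. A quick check, as a sketch (baseline_rmse and max_abs_cor are illustrative names of mine):

```r
# same simulated dataset as in the demo
set.seed(123)
n <- 100 ; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# RMSE obtained by always predicting the sample mean of y
baseline_rmse <- sqrt(mean((y - mean(y))^2))
# largest absolute correlation between y and any column of X
max_abs_cor <- max(abs(cor(X, y)))

print(c(baseline_rmse = baseline_rmse, max_abs_cor = max_abs_cor))
```

This is a useful yardstick when reading the cross-validation errors in the rest of the post.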
Linear model
- X contains the explanatory variables
- y is the response
- k is the number of folds in k-fold cross-validation
- repeats is the number of repeats of the k-fold cross-validation procedure
Linear model example:
crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3)
## $folds
## repeat_1 repeat_2 repeat_3
## fold_1 0.8987732 0.9270326 0.7903096
## fold_2 0.8787553 0.8704522 1.2394063
## fold_3 1.0810407 0.7907543 1.3381991
## fold_4 1.0594537 1.1981031 0.7368007
## fold_5 0.7593157 0.8913229 0.7734180
##
## $mean
## [1] 0.9488758
##
## $sd
## [1] 0.1902999
##
## $median
## [1] 0.8913229
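To make explicit what k and repeats control, here is a hand-rolled sketch of repeated k-fold cross-validation for a linear model in base R. It is not the code used inside crossval, and the fold assignment differs, so the numbers will not match the output above; only the shape of the result is the same. In the output above, the mean is taken over all 15 fold-by-repeat RMSEs, and the sketch summarises its results the same way.

```r
set.seed(123)
n <- 100 ; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

k <- 5 ; repeats <- 3
df <- data.frame(y = y, X)

# one RMSE per (fold, repeat) pair
cv_errors <- matrix(NA_real_, nrow = k, ncol = repeats,
                    dimnames = list(paste0("fold_", seq_len(k)),
                                    paste0("repeat_", seq_len(repeats))))

for (r in seq_len(repeats)) {
  # each repeat draws a new random assignment of the n rows to k folds
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (i in seq_len(k)) {
    test <- fold_id == i
    fit <- lm(y ~ ., data = df[!test, ])          # fit on the other k - 1 folds
    pred <- predict(fit, newdata = df[test, ])    # predict the held-out fold
    cv_errors[i, r] <- sqrt(mean((df$y[test] - pred)^2))
  }
}

res <- list(folds = cv_errors, mean = mean(cv_errors),
            sd = sd(cv_errors), median = median(cv_errors))
print(res)
```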
Linear model example, with a validation set (here p = 0.8, so 20% of the observations are set aside for validation):
crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3, p = 0.8)
## $folds
## repeat_1 repeat_2 repeat_3
## fold_training_1 1.1256933 0.9144503 0.9746044
## fold_validation_1 0.9734644 0.9805410 0.9761265
## fold_training_2 1.0124938 0.9652489 0.7257494
## fold_validation_2 0.9800293 0.9577811 0.9631389
## fold_training_3 0.7695705 1.0091999 0.9740067
## fold_validation_3 0.9753250 1.0373943 0.9863062
## fold_training_4 1.0482233 0.9194648 0.9680724
## fold_validation_4 0.9984861 0.9596531 0.9742874
## fold_training_5 0.9210179 1.0455006 0.9886350
## fold_validation_5 1.0126038 0.9658146 0.9658412
##
## $mean_training
## [1] 0.9574621
##
## $mean_validation
## [1] 0.9804529
##
## $sd_training
## [1] 0.1018837
##
## $sd_validation
## [1] 0.02145046
##
## $median_training
## [1] 0.9740067
##
## $median_validation
## [1] 0.975325
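The general technique of combining a hold-out validation set with k-fold cross-validation can be sketched in base R as follows. This is an assumption-laden illustration, not crossval's implementation: the post does not say how the split is drawn internally, nor exactly which errors the fold_training_ and fold_validation_ rows report, so the names p_train, held_out_fold and validation below are mine.

```r
set.seed(123)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
y <- rnorm(n)
df <- data.frame(y = y, X)

p_train <- 0.8   # assumed proportion of rows used for cross-validation
k <- 5
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# set aside 1 - p_train of the rows as a validation set
idx <- sample(n, size = round(p_train * n))
df_cv  <- df[idx, ]
df_val <- df[-idx, ]

# k-fold cross-validation on the remaining rows
fold_id <- sample(rep(seq_len(k), length.out = nrow(df_cv)))
errs <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ ., data = df_cv[fold_id != i, ])
  c(held_out_fold = rmse(df_cv$y[fold_id == i],
                         predict(fit, newdata = df_cv[fold_id == i, ])),
    validation    = rmse(df_val$y, predict(fit, newdata = df_val)))
})
print(errs)
```

Each column of errs holds, for one fold, the error on the held-out fold and the error of the same fitted model on the validation rows that were never used for fitting.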
Random Forest
randomForest example:
require(randomForest)
# fit randomForest with mtry = 4
crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3,
                      fit_func = randomForest::randomForest, predict_func = predict,
                      packages = "randomForest", fit_params = list(mtry = 4))
## $folds
## repeat_1 repeat_2 repeat_3
## fold_1 0.9820183 0.9895682 0.8752296
## fold_2 0.8701763 0.8771651 1.2719188
## fold_3 1.1869986 0.7736392 1.3521407
## fold_4 1.0946892 1.1204090 0.7100938
## fold_5 0.9847612 1.0565001 0.9194678
##
## $mean
## [1] 1.004318
##
## $sd
## [1] 0.1791315
##
## $median
## [1] 0.9847612
randomForest with parameter mtry = 4, and a validation set:
crossval::crossval_ml(x = X, y = y, k = 5, repeats = 2, p = 0.8,
                      fit_func = randomForest::randomForest, predict_func = predict,
                      packages = "randomForest", fit_params = list(mtry = 4))
## $folds
## repeat_1 repeat_2
## fold_training_1 1.0819863 0.9096807
## fold_validation_1 0.8413615 0.8415839
## fold_training_2 0.9507086 1.0014771
## fold_validation_2 0.5631285 0.6545253
## fold_training_3 0.7020669 0.9632402
## fold_validation_3 0.5090071 0.9129895
## fold_training_4 0.8932151 1.0315366
## fold_validation_4 0.8299454 0.7147867
## fold_training_5 0.9158418 1.1093461
## fold_validation_5 0.6438410 0.7644071
##
## $mean_training
## [1] 0.9559099
##
## $mean_validation
## [1] 0.7275576
##
## $sd_training
## [1] 0.1151926
##
## $sd_validation
## [1] 0.133119
##
## $median_training
## [1] 0.9569744
##
## $median_validation
## [1] 0.7395969
xgboost
In this case, the fitting function names its response and covariates arguments ‘label’ and ‘data’ rather than y and x. So (for now), we wrap it in a function with the expected (x, y) signature:
# xgboost example -----
require(xgboost)
f_xgboost <- function(x, y, ...) xgboost::xgboost(data = x, label = y, ...)
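The same adapter pattern applies to any fitting function whose interface is not (x, y, ...). As an illustration that runs with base R only, the sketch below wraps the formula-based stats::lm. The names f_lm and predict_f_lm are hypothetical, and the argument order that crossval_ml uses when it calls predict_func is an assumption I have not checked, so the wrappers are exercised directly here rather than through crossval_ml.

```r
# adapter: expose the formula-based stats::lm through an (x, y, ...) interface
f_lm <- function(x, y, ...) {
  df <- as.data.frame(x)
  names(df) <- paste0("V", seq_len(ncol(df)))   # fixed, predictable column names
  df$y <- y
  stats::lm(y ~ ., data = df, ...)
}

# matching prediction adapter: rebuild the same column names for new data
predict_f_lm <- function(fit, newx) {
  newdf <- as.data.frame(newx)
  names(newdf) <- paste0("V", seq_len(ncol(newdf)))
  as.numeric(stats::predict(fit, newdata = newdf))
}

set.seed(123)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- rnorm(100)

fit <- f_lm(X[1:80, ], y[1:80])            # fit on the first 80 rows
preds <- predict_f_lm(fit, X[81:100, ])    # predict the remaining 20
print(head(preds))
```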
Fit xgboost with nrounds = 10:
crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3,
                      fit_func = f_xgboost, predict_func = predict,
                      packages = "xgboost",
                      fit_params = list(nrounds = 10, verbose = FALSE))
## $folds
## repeat_1 repeat_2 repeat_3
## fold_1 0.9487191 1.2019850 0.9160024
## fold_2 0.9194731 0.8990731 1.2619773
## fold_3 1.2775092 0.7691470 1.3942022
## fold_4 1.1893053 1.1250443 0.7173760
## fold_5 1.1200368 1.1686622 0.9986680
##
## $mean
## [1] 1.060479
##
## $sd
## [1] 0.1965465
##
## $median
## [1] 1.120037
Fit xgboost with nrounds = 10, and a validation set:
crossval::crossval_ml(x = X, y = y, k = 5, repeats = 2, p = 0.8,
                      fit_func = f_xgboost, predict_func = predict,
                      packages = "xgboost",
                      fit_params = list(nrounds = 10, verbose = FALSE))
## $folds
## repeat_1 repeat_2
## fold_training_1 1.1063607 1.0350719
## fold_validation_1 0.7891655 1.0025217
## fold_training_2 1.0117042 1.1723135
## fold_validation_2 0.4325200 0.5050369
## fold_training_3 0.7074600 1.0101371
## fold_validation_3 0.1916094 0.9800865
## fold_training_4 0.9131272 1.2411424
## fold_validation_4 0.8998582 0.7521359
## fold_training_5 0.9462418 1.0543695
## fold_validation_5 0.5432650 0.6850912
##
## $mean_training
## [1] 1.019793
##
## $mean_validation
## [1] 0.678129
##
## $sd_training
## [1] 0.147452
##
## $sd_validation
## [1] 0.2600431
##
## $median_training
## [1] 1.023388
##
## $median_validation
## [1] 0.7186136
Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!
Under License Creative Commons Attribution 4.0 International.