Model calibration in the context of this post, is about finding *optimal* **hyperparameters** for Statistical/Machine learning (ML) models. *Optimal* in the sense that they minimize a given criterion such as model’s accuracy on unseen data, model’s precision, Root Mean Squared Error (RMSE), etc. **What are ML models’ hyperparameters**? Let’s take the example of a **linear model**:

```
y = beta_0 + beta_1 x_1 + beta_2 x_2
```

Imagine that `y`

is a car’s fuel consumption in Miles/(US) gallon. `x_1`

is its horsepower, and `x_2`

its number of cylinders. Knowing the values of `x_1`

and `x_2`

, we would like to estimate the average value of `y`

for many different cars. `beta_1`

and `beta_2`

are **unknown model parameters**, typically estimated by minimizing the distance between the observed car’s consumption `y`

, and the model `beta_0 + beta_1 x_1 + beta_2 x_2`

. With such a model, we can obtain for example that:

```
estimated fuel cons. = 0.1 + 0.4 x horsepower + 0.7 x no. of cylinders
```

Sometimes, when designing our linear model, we will want the unknown coefficients `beta_1`

and `beta_2`

to be bounded (`beta_1`

and `beta_2`

could otherwise exhibit a high variance). Or, we could want to consider a different polynomial degree `d`

for `x_1`

or `x_2`

. Whereas `beta_1`

are `beta_2`

are model **parameters**, the polynomial degree `d`

on explanatory variables and the bound `s`

put on parameters `beta_1`

and `beta_2`

are model **hyperparameters**.

Hyperparameters are those parameters that you can tune, in order to increase/decrease the model’s performance. `d`

is a **degree of freedom**. It controls model’s flexibility. The higher `d`

, the more flexible our model - meaning than it could almost fit “anything”. `s`

is a **regularization parameter** that stabilizes model estimates. Increasing `d`

might lead to **overfitting**, and a lower `d`

, to **underfitting**. Overfitting or underfitting are about: **too much flexibility or not enough**. We’ll use the `mtcars`

dataset to illustrate these concepts. This dataset is available from `R`

console, as:

```
data(mtcars)
```

According to its description, `mtcars`

is extracted from 1974 *Motor Trend* US magazine, and comprises **fuel consumption** (in Miles/(US) gallon) and **10 aspects of automobile design and performance** for 32 automobiles (1973–74 models). We’ll use 5 explanatory variables among 10 here.

```
mpg: Miles/(US) gallon # this is y
cyl: Number of cylinders # this is x_1
disp: Displacement (cu.in.) # this is x_2
hp: Gross horsepower # this is x_3
wt: Weight (1000 lbs) # this is x_4
carb: Number of carburetors # this is x_5
```

Below, are the correlations between the target variable (to be explained), `mpg`

, and explanatory variables `x_1`

, …, `x_5`

. We use R package `corrplot`

to plot these correlations.

All the explanatory variables are negatively correlated to the fuel consumption (in Miles/(US) gallon). A marginal increase in any of them leads, on average, to a decrease in fuel consumption. Now, in order to illustrate the concepts of overfitting and underfitting, we fit a **linear model** and a **smoothing spline** to `mpg`

(consumption) and `hp`

(horsepower).

**On the left**: model fitting on 23 cars, for a linear model and a spline. The linear model fits all the points parsimoniously, but the spline tries to memorize the patterns. **On the right**: errors obtained by each model on the 9 remaining cars, as a function of the spline’s degrees of freedom. **That’s overfitting, illustrated**. In other situations, a linear model **can also fit (very) poorly**, because it’s not flexible enough.

So, how do we **find a good compromise between overfitting or underfitting**? One way to achieve it is to use a hold-out sample, as we did on the previous example - with 23 cars out of 32 in the training procedure, and 9 for testing. Another way is to use **cross-validation**. The idea of cross-validation is to divide the whole dataset into k parts (usually called **folds**); **each part being successively included into a training set or a testing set**.

On this graph, we have k=5. `crossval`

is a - work in progress - `R`

package, for doing just that. **WHY** did I implement it? Because `R`

models are contributed by many different people. So, you’re not using a unified interface when training them. For example, in order to obtain predictions for 2 different models, you can have 2 different specifications of function `predict`

:

```
predict(fitting_obj_model1, newx)
```

or

```
predict(fitting_obj_model2, newdata)
```

`fitting_obj_model*`

are the trained models `1`

and `2`

. `newx`

and `newdata`

are the unseen data on which we would like to test the trained model. The position of arguments in function calls do also matter a lot. **Idea**: use a common cross-validation interface for many different models. Hence, `crossval`

. There is still room for improvement. If you find cases that are not covered by `crossval`

, you can contribute them here. Currently, the package can be **installed from Github** as (in R console):

```
library(devtools)
devtools::install_github("thierrymoudiki/crossval")
```

Here is an example of use of `crossval`

applied to glmnet (with my old school `R`

syntax yeah, I like it!):

```
require(glmnet)
require(Matrix)
# load the dataset
data("mtcars")
df <- mtcars[, c(1, 2, 3, 4, 6, 11)]
summary(df)
# create response and explanatory variables
X <- as.matrix(df[, -1])
y <- df$mpg
# grid of model hyperparameters
tuning_grid <- expand.grid(alpha = c(0, 0.5, 1),
lambda = c(0.01, 0.1, 1))
n_params <- nrow(tuning_grid)
# list of cross-validation results
# - 5-fold cross-validation (`k`)
# - repeated 3 times (`repeats`)
# - cross-validation on 80% of the data (`p`)
# - validation on the remaining 20%
cv_results <- lapply(1:n_params,
function(i)
crossval::crossval(
x = X,
y = y,
k = 5,
repeats = 3,
p = 0.8,
fit_func = glmnet::glmnet,
predict_func = predict.glmnet,
packages = c("glmnet", "Matrix"),
fit_params = list(alpha = tuning_grid[i, "alpha"],
lambda = tuning_grid[i, "lambda"])
))
names(cv_results) <- paste0("params_set", 1:n_params)
print(cv_results)
```

Many other examples of use of the package can be found in the README.
Also, `R`

packages like `caret`

or `mlr`

do similar things, but with a different philosophy. You may want to try them out too.

**Note:** I am currently looking for a *gig*. You can hire me on Malt or send me an email: **thierry dot moudiki at pm dot me**. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!

Under License Creative Commons Attribution 4.0 International.

Comments powered by Talkyard.