What hyper-parameters are, and what to do with them; an illustration with ridge regression

This blog post is an excerpt of my ebook *Modern R with the tidyverse*, which you can read for
free. This is taken from Chapter 7, which deals
with statistical models. In the text below, I explain what hyper-parameters are, and as an example
I run a ridge regression using the `{glmnet}` package. The book is still being written, so
comments are more than welcome!

Hyper-parameters are parameters of the model that cannot be directly learned from the data.
A linear regression does not have any hyper-parameters, but a random forest, for instance, has several.
You might have heard of ridge regression, lasso and elastic net. These are
extensions of linear models that avoid over-fitting by penalizing *large* models, that is, models
with large coefficients. These extensions have hyper-parameters that the practitioner has to tune. There
are several ways to tune these parameters: a grid search, a random
search over the grid, or more elaborate methods. To introduce hyper-parameters, let’s get
to know ridge regression, also called Tikhonov regularization.
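To make the difference between a grid search and a random search a bit more concrete, here is a minimal sketch of how the candidate values could be generated for a single hyper-parameter. The name `lambda`, the range from 0.01 to 1000 and the number of candidates are arbitrary choices for illustration, not recommendations:

```r
# Candidate values for one hyper-parameter, called lambda here.
# The range 10^-2 to 10^3 and the 20 candidates are arbitrary illustrative choices.

# Grid search: a fixed grid, evenly spaced on the log scale
lambda_grid <- 10^seq(-2, 3, length.out = 20)

# Random search: the same number of candidates, drawn at random from the same range
set.seed(1234)
lambda_random <- 10^runif(20, min = -2, max = 3)

range(lambda_grid)
summary(lambda_random)
```

In both cases, the model would then be fitted once per candidate value, and the candidate with the best out-of-sample performance would be kept.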

Ridge regression is used when the data you are working with has a lot of explanatory variables,
or when there is a risk that a simple linear regression might overfit to the training data, because,
for example, your explanatory variables are collinear.
If you are training a linear model and then you notice that it generalizes very badly to new,
unseen data, it is very likely that the linear model you trained overfits the data.
In this case, ridge regression might prove useful. The way ridge regression works might seem
counter-intuitive; it boils down to fitting a *worse* model to the training data, but in return,
this worse model will generalize better to new data.

The closed form solution of the ordinary least squares estimator is defined as:

\[ \widehat{\beta} = (X'X)^{-1}X'Y \]

where \(X\) is the design matrix (the matrix made up of the explanatory variables) and \(Y\) is the dependent variable. For ridge regression, this closed form solution changes a little bit:

\[ \widehat{\beta} = (X'X + \lambda I_p)^{-1}X'Y \]

where \(\lambda \geq 0\) is a hyper-parameter and \(I_p\) is the identity matrix of dimension \(p\) (\(p\) is the number of explanatory variables). The formula above is the closed form solution to the following optimisation program:

\[ \min_{\beta} \sum_{i=1}^n \left(y_i - \sum_{j=1}^px_{ij}\beta_j\right)^2 \]

such that:

\[ \sum_{j=1}^p(\beta_j)^2 \leq c \]

for some strictly positive \(c\) (each value of \(\lambda\) corresponds to a value of \(c\)).
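The closed form solution above can be checked directly with a few lines of matrix algebra. The sketch below uses simulated data without an intercept; the function name `ridge_closed_form()`, the simulated coefficients and the value `lambda = 50` are all made up for the illustration:

```r
# Simulated data (no intercept), purely for illustration
set.seed(42)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta_true <- c(2, -1, 0.5)
y <- drop(X %*% beta_true + rnorm(n))

# Hypothetical helper implementing (X'X + lambda * I_p)^{-1} X'Y
ridge_closed_form <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols   <- ridge_closed_form(X, y, lambda = 0)
beta_ridge <- ridge_closed_form(X, y, lambda = 50)

# With lambda = 0 the formula gives back the OLS estimator
all.equal(beta_ols, unname(coef(lm.fit(X, y))))

# With lambda > 0 the coefficients are shrunk towards zero
sum(beta_ridge^2) < sum(beta_ols^2)
```

Keep in mind that `glmnet()` standardises the explanatory variables by default and scales the sum of squares by the number of observations, so its `lambda` is not on the same scale as the \(\lambda\) of this formula.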

The `glmnet()` function from the `{glmnet}` package can be used for ridge regression, by setting
the `alpha` argument to 0 (setting it to 1 would do LASSO, and setting it to a number between
0 and 1 would do elastic net). But in order to compare linear regression and ridge regression,
let me first divide the data into a training set and a testing set. I will be using the `Housing`
data from the `{Ecdat}` package:

```
library(tidyverse)
library(Ecdat)
library(glmnet)
```

```
index <- 1:nrow(Housing)

# fix the seed so that the split is reproducible
set.seed(12345)

# 90% of the rows go to the training set, the rest to the testing set
train_index <- sample(index, round(0.90*nrow(Housing)), replace = FALSE)
test_index <- setdiff(index, train_index)

# explanatory variables (x) and target (y) for each set
train_x <- Housing[train_index, ] %>%
  select(-price)
train_y <- Housing[train_index, ] %>%
  pull(price)
test_x <- Housing[test_index, ] %>%
  select(-price)
test_y <- Housing[test_index, ] %>%
  pull(price)
```

I do the train/test split this way, because `glmnet()` requires a design matrix as input, and not
a formula. Design matrices can be created using the `model.matrix()` function:

```
train_matrix <- model.matrix(train_y ~ ., data = train_x)
test_matrix <- model.matrix(test_y ~ ., data = test_x)
```
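To see what `model.matrix()` does to the data, here is a small sketch on a toy data frame. The data frame `toy` and its values are invented for the illustration; only the column names mimic the `Housing` data:

```r
# Toy data frame with one numeric column and one factor column (invented values)
toy <- data.frame(
  lotsize  = c(3000, 4500, 6000),
  driveway = factor(c("yes", "no", "yes"))
)

# model.matrix() adds an intercept column and turns the factor into a 0/1 dummy
toy_matrix <- model.matrix(~ ., data = toy)

colnames(toy_matrix)
# "(Intercept)" "lotsize" "drivewayyes"
```

This is where names such as `drivewayyes` in the coefficient tables further down come from. It also seems to explain why two `(Intercept)` rows show up in the `glmnet()` output: the design matrix already contains a constant column, and `glmnet()` fits its own intercept on top of it by default.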

To run an unpenalized linear regression, we can set the penalty to 0:

`model_lm_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 0)`

The model above provides the same result as a linear regression. Let’s compare the coefficients between the two:

`coef(model_lm_ridge)`

```
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -3247.030393
## (Intercept) .
## lotsize 3.520283
## bedrooms 1745.211187
## bathrms 14337.551325
## stories 6736.679470
## drivewayyes 5687.132236
## recroomyes 5701.831289
## fullbaseyes 5708.978557
## gashwyes 12508.524241
## aircoyes 12592.435621
## garagepl 4438.918373
## prefareayes 9085.172469
```

And now the coefficients of the linear regression (because I provide a design matrix, I have to use
`lm.fit()` instead of `lm()`, which requires a formula, not a matrix):

`coef(lm.fit(x = train_matrix, y = train_y))`

```
## (Intercept) lotsize bedrooms bathrms stories
## -3245.146665 3.520357 1744.983863 14336.336858 6737.000410
## drivewayyes recroomyes fullbaseyes gashwyes aircoyes
## 5686.394123 5700.210775 5709.493884 12509.005265 12592.367268
## garagepl prefareayes
## 4439.029607 9085.409155
```

As you can see, the coefficients are almost identical; the small differences come from the numerical optimisation that `glmnet()` uses. Let’s compute the RMSE for the unpenalized linear regression:

```
preds_lm <- predict(model_lm_ridge, test_matrix)
rmse_lm <- sqrt(mean((preds_lm - test_y)^2))
```

The RMSE for the linear unpenalized regression is equal to 14463.08.
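Since the same RMSE computation comes back for every model, it could also be wrapped in a small helper. The function name `compute_rmse()` is my own invention for this sketch; it simply implements the square root of the mean squared prediction error:

```r
# Hypothetical helper: root mean squared error between predictions and actual values
compute_rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Small check by hand: the errors are 1, -1 and 0, so the RMSE is sqrt(2/3)
compute_rmse(c(2, 4, 6), c(1, 5, 6))
# 0.8164966
```

With such a helper, the RMSE above would be obtained with `compute_rmse(preds_lm, test_y)`.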

Let’s now run a ridge regression, with `lambda` equal to 100, and see if the RMSE is smaller:

`model_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 100)`

and let’s compute the RMSE again:

```
preds <- predict(model_ridge, test_matrix)
rmse <- sqrt(mean((preds - test_y)^2))
```

The RMSE for the linear penalized regression is equal to 14460.71, which is slightly smaller than before.
But which value of `lambda` gives the smallest RMSE? To find out, one must fit the model over a grid of
`lambda` values and pick the model with the lowest RMSE. This procedure is available in the `cv.glmnet()`
function, which picks the best value for `lambda` by cross-validation:

```
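# cv.glmnet() passes `alpha` on to glmnet(), whose default is alpha = 1 (LASSO);
# add alpha = 0 to the call to cross-validate a ridge penalty instead
best_model <- cv.glmnet(train_matrix, train_y)
# lambda that minimises the cross-validated MSE
best_model$lambda.min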
```

`## [1] 66.07936`

According to `cv.glmnet()` the best value for `lambda` is 66.0793576.
In the next section, we will implement cross validation ourselves, in order to find the hyper-parameters
of a random forest.
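Before moving on, the idea of running the model over a grid of `lambda` values can be sketched in base R. This is not how `cv.glmnet()` is implemented; it is a simplified illustration on simulated data, using the closed form ridge solution, 5 folds and the RMSE as the criterion (`cv.glmnet()` uses 10 folds and the mean squared error by default). The helper `ridge_fit()`, the grid and the simulated coefficients are assumptions made for the example:

```r
# Simulated data (no intercept), purely for illustration
set.seed(2024)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- drop(X %*% c(3, -2, 0, 0, 1) + rnorm(n, sd = 2))

# Hypothetical helper: closed form ridge estimator (X'X + lambda * I_p)^{-1} X'Y
ridge_fit <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# Grid of candidate values and random assignment of each row to one of 5 folds
lambda_grid <- 10^seq(-2, 3, length.out = 30)
n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = n))

# For each lambda: fit on 4 folds, predict the held-out fold, average the RMSE
cv_rmse <- sapply(lambda_grid, function(lambda) {
  fold_rmse <- sapply(seq_len(n_folds), function(k) {
    held_out <- fold_id == k
    beta <- ridge_fit(X[!held_out, , drop = FALSE], y[!held_out], lambda)
    preds <- drop(X[held_out, , drop = FALSE] %*% beta)
    sqrt(mean((preds - y[held_out])^2))
  })
  mean(fold_rmse)
})

# The lambda with the lowest cross-validated RMSE
best_lambda <- lambda_grid[which.min(cv_rmse)]
best_lambda
```

The important design choice is that each candidate `lambda` is judged on observations that were not used to fit the model, which is what keeps the search from simply picking the least penalized model.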
