A tutorial on tidy cross-validation with R
Analyzing NetHack data, part 1: What kills the players
Analyzing NetHack data, part 2: What players kill the most
Building a shiny app to explore historical newspapers: a step-by-step guide
Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1
Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2
Curly-Curly, the successor of Bang-Bang
Dealing with heteroskedasticity; regression with robust standard errors using R
Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport
Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer
Fast food, causality and R packages, part 1
Fast food, causality and R packages, part 2
For posterity: install {xml2} on GNU/Linux distros
Forecasting my weight with R
From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack
Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick}
Getting data from pdfs using the pdftools package
Getting the data from the Luxembourguish elections out of Excel
Going from a human readable Excel file to a machine-readable csv with {tidyxl}
Historical newspaper scraping with {tesseract} and R
How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data
Imputing missing values in parallel using {furrr}
Intermittent demand, Croston and Die Hard
Looking into 19th century ads from a Luxembourguish newspaper with R
Making sense of the METS and ALTO XML standards
Manipulate dates easily with {lubridate}
Manipulating strings with the {stringr} package
Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data
Missing data imputation and instrumental variables regression: the tidy approach
Modern R with the tidyverse is available on Leanpub
Objects types and some useful R functions for beginners
Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()`
R or Python? Why not both? Using Anaconda Python within R with {reticulate}
Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach
Some fun with {gganimate}
Split-apply-combine for Maximum Likelihood Estimation of a linear model
Statistical matching, or when one single data source is not enough
The best way to visit Luxembourguish castles is doing data science + combinatorial optimization
The never-ending editor war (?)
The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse
Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century
Using a genetic algorithm for the hyperparameter optimization of a SARIMA model
Using cosine similarity to find matching documents: a tutorial using Seneca's letters to his friend Lucilius
Using linear models with binary dependent variables, a simulation study
Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods
What hyper-parameters are, and what to do with them; an illustration with ridge regression
{disk.frame} is epic
{pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr}
Building formulae
Functional peace of mind
Get basic summary statistics for all the variables in a data frame
Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash
Importing 30GB of data into R with sparklyr
Introducing brotools
It's lists all the way down
It's lists all the way down, part 2: We need to go deeper
Keep trying that api call with purrr::possibly()
Lesser known dplyr 0.7* tricks
Lesser known dplyr tricks
Lesser known purrr tricks
Make ggplot2 purrr
Mapping a list of functions to a list of datasets with a list of columns as arguments
Predicting job search by training a random forest on an unbalanced dataset
Teaching the tidyverse to beginners
Why I find tidyeval useful
tidyr::spread() and dplyr::rename_at() in action
Easy peasy STATA-like marginal effects with R
Functional programming and unit testing for data munging with R available on Leanpub
How to use jailbreakr
My free book has a cover!
Work on lists of datasets instead of individual datasets by using functional programming
Method of Simulated Moments with R
New website!
Nonlinear Gmm with R - Example with a logistic regression
Simulated Maximum Likelihood with R
Bootstrapping standard errors for difference-in-differences estimation with R
Careful with tryCatch
Data frame columns as arguments to dplyr functions
Export R output to a file
I've started writing a 'book': Functional programming and unit testing for data munging with R
Introduction to programming econometrics with R
Merge a list of datasets together
Object Oriented Programming with R: An example with a Cournot duopoly
R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest?
Read a lot of datasets at once with R
Unit testing with R
Update to Introduction to programming econometrics with R
Using R as a Computer Algebra System with Ryacas

Analyzing a lot of datasets can be tedious. In my work, I often have to compute descriptive statistics, or plot some graphs for some variables, for a lot of datasets. The variables in question have the same name across the datasets but are measured for different years. As an example, imagine you have this situation:

```
data2000 <- mtcars
data2001 <- mtcars
```

For the sake of argument, imagine that `data2000` is data from a survey conducted in the year 2000 and `data2001` is the same survey but conducted in the year 2001. For illustration purposes, I use the `mtcars` dataset, but I could have used any other example. In this sort of situation, the variables are named the same in both datasets. Now if I want to check the summary statistics of a variable, I might do it by running:

```
summary(data2000$cyl)
summary(data2001$cyl)
```

but this can get quite tedious, especially if instead of only two years of data, you have 20. Another possibility is to merge both datasets and then check the summary statistics of the variable of interest. But this might require a lot of preprocessing, and sometimes you really just want a quick check, or some quick and dirty graphs. So you might be tempted to write a loop, which would require putting the two datasets in some kind of structure, such as a list:

```
list_data <- list("data2000" = data2000, "data2001" = data2001)

for (i in 1:2){
  print(summary(list_data[[i]]$cyl))
}
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000 
```

But this also might get tedious, especially if you want to do this for a lot of different variables, and want to use functions other than `summary()`.

Another, simpler way of doing this is to use `purrr::map()` or `lapply()`. But there is a catch: how do we specify the column we want to work on? Let's try some things out:

```
library(purrr)

map(list_data, summary(cyl))
```

```
## Error in summary(cyl) : object 'cyl' not found
```

This fails because the expression `summary(cyl)` gets evaluated on the spot, before `map()` can supply each dataset, so R looks for an object called `cyl` that does not exist.

Maybe this will work:

```
map(list_data, summary, cyl)
```

```
## $data2000
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000  
## 
## $data2001
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000  
```

Not quite! You get the summary statistics of every variable; `cyl` simply gets ignored. This might be ok in our small toy example, but if you have dozens of datasets with hundreds of variables, the output becomes unreadable. The solution is to use an anonymous function:

```
map(list_data, (function(x) summary(x$cyl)))
```

```
## $data2000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000 
## 
## $data2001
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000 
```
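As a side note, `purrr` also accepts a one-sided formula as a shorthand for an anonymous function, which saves a bit of typing; a small sketch (rebuilding the same `list_data` so the snippet runs on its own):

```r
library(purrr)

data2000 <- mtcars
data2001 <- mtcars
list_data <- list("data2000" = data2000, "data2001" = data2001)

# ~summary(.x$cyl) is shorthand for function(x) summary(x$cyl);
# .x stands for the current element of the list
map(list_data, ~summary(.x$cyl))
```

Both spellings do exactly the same thing, so use whichever you find more readable.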

This is, in my opinion, much more readable than a loop, and the output of this is another list, so it’s easy to save it:

```
summary_cyl <- map(list_data, (function(x) summary(x$cyl)))
str(summary_cyl)
```

```
## List of 2
##  $ data2000:Classes 'summaryDefault', 'table'  Named num [1:6] 4 4 6 6.19 8 ...
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
##  $ data2001:Classes 'summaryDefault', 'table'  Named num [1:6] 4 4 6 6.19 8 ...
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
```

With the loop, you would need to “allocate” an empty list that you would fill at each iteration.
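For comparison, the loop version with that preallocation step might look like this (a sketch, rebuilding the same `list_data` so the snippet runs on its own):

```r
data2000 <- mtcars
data2001 <- mtcars
list_data <- list("data2000" = data2000, "data2001" = data2001)

# preallocate an empty list of the right length, then fill it element by element
summary_cyl <- vector("list", length(list_data))
names(summary_cyl) <- names(list_data)

for (i in seq_along(list_data)) {
  summary_cyl[[i]] <- summary(list_data[[i]]$cyl)
}
```

It works, but it is three extra lines of bookkeeping that `map()` handles for you.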

So this is already nice, but wouldn't it be nicer to simply have to type:

`summary(list_data$cyl)`

and get the summary of variable `cyl` for each dataset in the list? Well, it is possible with the following function I wrote to make my life easier:

```
to_map <- function(func){
  function(list, column, ...){
    if(missing(column)){
      res <- purrr::map(list, (function(x) func(x, ...)))
    } else {
      res <- purrr::map(list, (function(x) func(x[column], ...)))
    }
    res
  }
}
```

By following this chapter of Hadley Wickham's book *Advanced R*, I was able to write this function. What does it do? It basically *generalizes* a function to work on a list of datasets instead of a single dataset. So, for example, in the case of `summary()`:

```
summarymap <- to_map(summary)
summarymap(list_data, "cyl")
```

```
## $data2000
##       cyl       
##  Min.   :4.000  
##  1st Qu.:4.000  
##  Median :6.000  
##  Mean   :6.188  
##  3rd Qu.:8.000  
##  Max.   :8.000  
## 
## $data2001
##       cyl       
##  Min.   :4.000  
##  1st Qu.:4.000  
##  Median :6.000  
##  Mean   :6.188  
##  3rd Qu.:8.000  
##  Max.   :8.000  
```

So now every time I want to have summary statistics for a variable, I just need to use `summarymap()`:

```
summarymap(list_data, "mpg")
```

```
## $data2000
##       mpg       
##  Min.   :10.40  
##  1st Qu.:15.43  
##  Median :19.20  
##  Mean   :20.09  
##  3rd Qu.:22.80  
##  Max.   :33.90  
## 
## $data2001
##       mpg       
##  Min.   :10.40  
##  1st Qu.:15.43  
##  Median :19.20  
##  Mean   :20.09  
##  3rd Qu.:22.80  
##  Max.   :33.90  
```

If I want the summary statistics for every variable, I simply omit the column name:

```
summarymap(list_data)
```

```
## $data2000
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000  
## 
## $data2001
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000  
```

I can use any function:

```
tablemap <- to_map(table)
tablemap(list_data, "cyl")
```

```
## $data2000
## 
##  4  6  8 
## 11  7 14 
## 
## $data2001
## 
##  4  6  8 
## 11  7 14 
```

```
tablemap(list_data, "mpg")
```

```
## $data2000
## 
## 10.4 13.3 14.3 14.7   15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7 
##    2    1    1    1    1    2    1    1    1    1    1    1    1    2    1 
##   21 21.4 21.5 22.8 24.4   26 27.3 30.4 32.4 33.9 
##    2    2    1    2    1    1    1    2    1    1 
## 
## $data2001
## 
## 10.4 13.3 14.3 14.7   15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7 
##    2    1    1    1    1    2    1    1    1    1    1    1    1    2    1 
##   21 21.4 21.5 22.8 24.4   26 27.3 30.4 32.4 33.9 
##    2    2    1    2    1    1    1    2    1    1 
```
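Because `to_map()` forwards extra arguments through `...`, options of the underlying function still work. A small sketch, assuming you want `table()`'s `useNA` option (the snippet rebuilds `list_data` and `to_map()` so it runs on its own):

```r
library(purrr)

data2000 <- mtcars
data2001 <- mtcars
list_data <- list("data2000" = data2000, "data2001" = data2001)

# same function factory as above
to_map <- function(func){
  function(list, column, ...){
    if(missing(column)){
      purrr::map(list, function(x) func(x, ...))
    } else {
      purrr::map(list, function(x) func(x[column], ...))
    }
  }
}

tablemap <- to_map(table)

# useNA is an argument of table(), forwarded through ... ;
# with mtcars there are no missing values, so the counts are unchanged
tablemap(list_data, "cyl", useNA = "ifany")
```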

I hope you will find this little function useful. As usual, for any comments just drop me an email by clicking the red envelope in the top right corner, or tweet me.