Mapping a list of functions to a list of datasets with a list of columns as arguments

2018/01/19 R

This week I had the opportunity to teach R at my workplace again. This was the “advanced R” course, and unlike the one I taught at the end of last year, this one ran one day longer (3 days in total), which gave me more time to show my colleagues the joys of the tidyverse and R.

To finish the section on programming with R, which was the very last section of the whole 3-day course, I wanted to blow their minds. I had already shown them packages from the tidyverse in the previous days, such as dplyr, purrr and stringr, among others. I taught them how to use ggplot2, broom and modelr. They also liked janitor and rio very much. I noticed that it took them a bit more time and effort to digest purrr::map() and purrr::reduce(), but they all seemed to see how powerful these functions were. To finish on a very high note, I showed them the ultimate purrr::map() use case.

Imagine you are working on a list of datasets. These datasets might have the same structure but cover different years or different countries, or they might be completely different datasets. If you used rio::import_list() to read them into R, you will have them in a nice list. Let’s consider the following list as an example:

library(tidyverse)

data(mtcars)
data(iris)

data_list = list(mtcars, iris)

I made the choice to have completely different datasets. Now, I would like to map some functions to the columns of these datasets. If I only worked on one, for example on mtcars, I would do something like:

my_summarise_f = function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}

And then I would use my function like so:

mtcars %>%
  my_summarise_f(quos(mpg, drat, hp), quos(mean, sd, max))

##   mpg_mean drat_mean  hp_mean   mpg_sd   drat_sd    hp_sd mpg_max drat_max
## 1 20.09062  3.596563 146.6875 6.026948 0.5346787 68.56287    33.9     4.93
##   hp_max
## 1    335

my_summarise_f() takes a dataset, a list of columns and a list of functions as arguments and uses tidy evaluation to apply mean(), sd(), and max() to the columns mpg, drat and hp of mtcars. That’s pretty useful, but not useful enough! Now I want to apply this to the list of datasets I defined above. For this, let’s define the list of columns I want to work on:

cols_mtcars = quos(mpg, drat, hp)
cols_iris = quos(Sepal.Length, Sepal.Width)

cols_list = list(cols_mtcars, cols_iris)
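
Before mapping over cols_list, it can help to check what quos() actually captured. The snippet below is a small sketch, not part of the course material: it turns each quosure back into a readable label with rlang::as_label(). The name col_labels is only used here for illustration.

```r
library(dplyr)   # re-exports quos() from rlang

cols_mtcars <- quos(mpg, drat, hp)
cols_iris   <- quos(Sepal.Length, Sepal.Width)
cols_list   <- list(cols_mtcars, cols_iris)

# For each list of quosures, convert every captured expression to a string.
# USE.NAMES = FALSE drops the empty names that quos() attaches.
col_labels <- lapply(cols_list, function(cols) {
  vapply(cols, rlang::as_label, character(1), USE.NAMES = FALSE)
})

col_labels[[1]]
col_labels[[2]]
```

Each element of cols_list is a list of unevaluated column names, which is what `!!!` later splices into vars() inside my_summarise_f().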

Now, let’s use some purrr magic to apply the functions I want to the columns I have defined in cols_list:

map2(data_list,
     cols_list,
     my_summarise_f, funcs = quos(mean, sd, max))

## [[1]]
##   mpg_mean drat_mean  hp_mean   mpg_sd   drat_sd    hp_sd mpg_max drat_max
## 1 20.09062  3.596563 146.6875 6.026948 0.5346787 68.56287    33.9     4.93
##   hp_max
## 1    335
## 
## [[2]]
##   Sepal.Length_mean Sepal.Width_mean Sepal.Length_sd Sepal.Width_sd
## 1          5.843333         3.057333       0.8280661      0.4358663
##   Sepal.Length_max Sepal.Width_max
## 1              7.9             4.4
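
To make the mechanics of the call above a little more explicit, here is a toy sketch of how map2() walks over two lists in parallel and forwards any extra argument unchanged to every call. The function add_and_scale() and its scale argument are made up for this illustration.

```r
library(purrr)

# Hypothetical helper: combines one element from each list,
# then applies an extra argument shared by every call.
add_and_scale <- function(x, y, scale) {
  (x + y) * scale
}

# Call 1 receives x = 1, y = 10; call 2 receives x = 2, y = 20.
# scale = 2 is passed along to both calls, just like funcs = quos(...) above.
paired <- map2(list(1, 2), list(10, 20), add_and_scale, scale = 2)

paired
```

In the real example, data_list plays the role of the first list, cols_list the second, and funcs is the shared extra argument.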

That’s pretty useful, but not useful enough! I also want to apply different functions to different datasets!

Well, let’s define a list of functions then:

funcs_mtcars = quos(mean, sd, max)
funcs_iris = quos(median, min)

funcs_list = list(funcs_mtcars, funcs_iris)

Because there is no map3(), we need to use pmap():

pmap(
  list(
    dataset = data_list,
    cols = cols_list,
    funcs = funcs_list
  ),
  my_summarise_f)

## [[1]]
##   mpg_mean drat_mean  hp_mean   mpg_sd   drat_sd    hp_sd mpg_max drat_max
## 1 20.09062  3.596563 146.6875 6.026948 0.5346787 68.56287    33.9     4.93
##   hp_max
## 1    335
## 
## [[2]]
##   Sepal.Length_median Sepal.Width_median Sepal.Length_min Sepal.Width_min
## 1                 5.8                  3              4.3               2
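
One detail worth spelling out about the pmap() call: because the elements of the list are named dataset, cols and funcs, pmap() matches them to the arguments of my_summarise_f() by name, so their order in the list does not matter. The toy sketch below illustrates this with a hypothetical describe() function that has the same argument names.

```r
library(purrr)

# Hypothetical stand-in with the same argument names as my_summarise_f()
describe <- function(dataset, cols, funcs) {
  paste(dataset, cols, funcs)
}

# The list elements are deliberately given in a different order than
# the function's arguments; pmap() still matches them by name.
described <- pmap(
  list(
    funcs   = c("f1", "f2"),
    dataset = c("d1", "d2"),
    cols    = c("c1", "c2")
  ),
  describe)

described
```

If the list were unnamed, the elements would be matched by position instead, so naming them is a cheap way to avoid mix-ups.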

Now I’m satisfied! Let me tell you, this blew their minds 😄!

To get to solutions like this one, I told them to always solve the problem for a single example first, and from there, try to generalize their solution using the functional programming tools found in purrr.
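
For readers on a more recent dplyr, where funs() has been deprecated (since dplyr 0.8.0), the same single-case-then-generalize workflow can be sketched with across(), which is available from dplyr 1.0.0. This is one possible adaptation, not the version I showed in the course: my_summarise_across() is a hypothetical name, columns are given as character vectors instead of quosures, and functions as named lists.

```r
library(dplyr)
library(purrr)

# Step 1: solve the problem for a single dataset.
# cols is a character vector, funcs a named list of functions.
my_summarise_across <- function(dataset, cols, funcs) {
  dataset %>%
    summarise(across(all_of(cols), funcs))
}

single <- my_summarise_across(
  mtcars,
  cols  = c("mpg", "drat", "hp"),
  funcs = list(mean = mean, sd = sd, max = max))

# Step 2: generalize to a list of datasets with pmap().
data_list  <- list(mtcars, iris)
cols_list  <- list(c("mpg", "drat", "hp"),
                   c("Sepal.Length", "Sepal.Width"))
funcs_list <- list(list(mean = mean, sd = sd, max = max),
                   list(median = median, min = min))

results <- pmap(
  list(dataset = data_list, cols = cols_list, funcs = funcs_list),
  my_summarise_across)

results
```

Note that across() orders the result by column first (mpg_mean, mpg_sd, mpg_max, ...), whereas the output shown earlier is ordered by function first.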

If you found this blog post useful, you might want to follow me on Twitter for blog post updates.