Lesser known dplyr 0.7* tricks

2017/07/02 R

This blog post is an update to an older one I wrote in March. Back then, dplyr was at version 0.5.0, but a major update has since introduced changes that make some of the tips in that post obsolete. So here I revisit the March post using dplyr 0.7.0.

Create new columns with `mutate()` and `case_when()`

The basics, such as selecting columns, renaming them and filtering, did not change with this new version. What did change, however, is how you create new columns using case_when(). First, load dplyr and the mtcars dataset:

library("dplyr")
data(mtcars)

This was how it was done in version 0.5.0 (notice the ‘.$’ symbol before the variable ‘carb’):

mtcars %>%
    mutate(carb_new = case_when(.$carb == 1 ~ "one",
                                .$carb == 2 ~ "two",
                                .$carb == 4 ~ "four",
                                TRUE ~ "other")) %>%
    head(5)

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb carb_new
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     four
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     four
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      one
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      one
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      two

This has been simplified to:

mtcars %>%
    mutate(carb_new = case_when(carb == 1 ~ "one",
                                carb == 2 ~ "two",
                                carb == 4 ~ "four",
                                TRUE ~ "other")) %>%
    head(5)

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb carb_new
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     four
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     four
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      one
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      one
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      two

No need for .$ anymore.
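
One detail worth keeping in mind when writing these conditions: case_when() checks them in order and uses the first one that matches. Here is a minimal sketch of that behaviour; the column name mpg_class and the thresholds are made up for illustration and are not from the original post:

```r
library(dplyr)

# Conditions are tried top to bottom; the first match wins.
# A car with mpg = 30 satisfies both mpg >= 25 and mpg >= 18,
# but it is labelled "high" because that condition comes first.
cars_labelled <- mtcars %>%
    mutate(mpg_class = case_when(mpg >= 25 ~ "high",
                                 mpg >= 18 ~ "medium",
                                 TRUE ~ "low"))

head(cars_labelled[, c("mpg", "mpg_class")])
```

If you swapped the first two conditions, no row would ever be labelled "high", so put the most specific conditions first.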

Apply a function to certain columns only, by rows, with `purrrlyr`

dplyr wasn’t the only package to get an overhaul; purrr got the same treatment.

In the past, I applied a function to certain columns like this:

mtcars %>%
    select(am, gear, carb) %>%
    purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)

Now, by_row() does not exist in purrr anymore. Instead, a new package called purrrlyr was introduced to hold the functions that don’t really fit in either purrr or dplyr:

mtcars %>%
    select(am, gear, carb) %>%
    purrrlyr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)

## # A tibble: 6 x 4
##      am  gear  carb sum_am_gear_carb
##   <dbl> <dbl> <dbl>            <dbl>
## 1     1     4     4                9
## 2     1     4     4                9
## 3     1     4     1                6
## 4     0     3     1                4
## 5     0     3     2                5
## 6     0     3     1                4

Think of purrrlyr as purrr’s and dplyr’s love child.
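
For this particular example, where the function is just a sum, you may not need a row-wise helper at all. The sketch below uses plain vectorised arithmetic inside mutate() instead of by_row(); it is a different technique that only works because + is vectorised, and the name mtcars3 is mine:

```r
library(dplyr)

# Vectorised alternative to by_row(sum, ...): add the columns directly.
# Unlike by_row(), this keeps the result as a data frame with row names.
mtcars3 <- mtcars %>%
    select(am, gear, carb) %>%
    mutate(sum_am_gear_carb = am + gear + carb)

head(mtcars3)
```

For functions that are not vectorised, by_row() remains the more general tool.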

Using `dplyr` functions inside your own functions, or what is `tidyeval`

Programming with dplyr has been simplified a lot. Before version 0.7.0, one needed to use dplyr in conjunction with lazyeval to use dplyr functions inside one’s own functions. It was not always very easy, especially if you mixed columns and values inside your functions. Here’s the example from the March blog post:

extract_vars <- function(data, some_string){

  data %>%
    select_(lazyeval::interp(~contains(some_string))) -> data

  return(data)
}

extract_vars(mtcars, "spam")

More examples are available in this other blog post.

I will revisit them now with dplyr’s new tidyeval syntax. I’d recommend you read the Tidy evaluation vignette here. This vignette is part of the rlang package, which gets used under the hood by dplyr for all your programming needs. Here is the function I called simpleFunction(), written with the old dplyr syntax:

simpleFunction <- function(dataset, col_name){
  dataset %>%
    group_by_(col_name) %>%
    summarise(mean_mpg = mean(mpg)) -> dataset
  return(dataset)
}


simpleFunction(mtcars, "cyl")

## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

With the new syntax, it must be rewritten a little bit:

simpleFunction <- function(dataset, col_name){
  col_name <- enquo(col_name)
  dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg)) -> dataset
  return(dataset)
}


simpleFunction(mtcars, cyl)

## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

What has changed? Forget the underscore versions of the usual functions such as select_(), group_by_(), etc. Now, you must quote the column name using enquo() (or just quo() if working interactively, outside a function), which returns a quosure. This quosure can then be unquoted by putting !! in front of it inside the usual dplyr functions, which evaluate it as if you had typed the column name directly.
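
To make the quo() case concrete, here is a minimal sketch of the same grouping done interactively, outside a function. The variable names grouping_var and by_cyl are only for illustration:

```r
library(dplyr)

# Outside a function there is no argument to capture,
# so quote the column name directly with quo().
grouping_var <- quo(cyl)

# !! unquotes the quosure, so this behaves like group_by(cyl).
by_cyl <- mtcars %>%
    group_by(!!grouping_var) %>%
    summarise(mean_mpg = mean(mpg))

by_cyl
```

This should give the same three rows as the simpleFunction(mtcars, cyl) call above.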

Let’s look at another example:

simpleFunction <- function(dataset, col_name, value){
  filter_criteria <- lazyeval::interp(~y == x, .values=list(y = as.name(col_name), x = value))
  dataset %>%
    filter_(filter_criteria) %>%
    summarise(mean_cyl = mean(cyl)) -> dataset
  return(dataset)
}


simpleFunction(mtcars, "am", 1)

##   mean_cyl
## 1 5.076923

As you can see, it’s a bit more complicated, as you needed to use lazyeval::interp() to make it work. With the improved dplyr, here’s how it’s done:

simpleFunction <- function(dataset, col_name, value){
  col_name <- enquo(col_name)
  dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl)) -> dataset
  return(dataset)
}


simpleFunction(mtcars, am, 1)

##   mean_cyl
## 1 5.076923

Much, much easier! There is something that you must pay attention to though. Notice that I’ve written:

filter((!!col_name) == value)

and not:

filter(!!col_name == value)

I have enclosed !!col_name inside parentheses. I struggled with this, but thanks to help from @dmi3k and @_lionelhenry I was able to understand what was happening (isn’t the #rstats community on twitter great?).
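
The reason comes down to how R parses the expression: ! has lower precedence than ==, so without the parentheses the comparison is grouped first and the two ! are applied to its result. You can check this in base R, without dplyr, by quoting both forms and inspecting the parse tree. This is only a sketch of the parsing behaviour; the exact error or result you get from filter() may depend on your rlang version:

```r
# Without parentheses: the outermost call is `!`,
# and the comparison sits underneath both negations.
unparenthesised <- quote(!!col_name == value)
unparenthesised[[1]]            # the function being called at the top level
inner <- unparenthesised[[2]][[2]]
inner                           # what is left after stripping both `!`

# With parentheses: the outermost call is `==`,
# so !! applies to col_name only.
parenthesised <- quote((!!col_name) == value)
parenthesised[[1]]
```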

One last thing: let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable? Easy:

simpleFunction <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(mean((!!mean_col))) -> dataset
  return(dataset)
}


simpleFunction(mtcars, am, cyl, 1)

##   mean(cyl)
## 1  5.076923

«That’s very nice Bruno, but mean(cyl) as a column name in the output looks ugly as sin» you might think, and you’d be right. It is possible to set the name of the column in the output using := instead of =:

simpleFunction <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  mean_name <- paste0("mean_", mean_col)[2]
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(!!mean_name := mean((!!mean_col))) -> dataset
  return(dataset)
}


simpleFunction(mtcars, am, cyl, 1)

##   mean_cyl
## 1 5.076923

To get the name of the column I added this line:

mean_name <- paste0("mean_", mean_col)[2]

To see what it does, try the following inside an R interpreter (remember to use quo() instead of enquo() outside functions!):

paste0("mean_", quo(cyl))

## [1] "mean_~"   "mean_cyl"

enquo() quotes the input, and with paste0() it gets converted to a string that can be used as a column name. However, the ~ is in the way and the output of paste0() is a vector of two strings: the correct name is contained in the second element, hence the [2]. There might be a more elegant way of doing that, but for now this has been working well for me.
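
One possibly more elegant way is quo_name(), which dplyr re-exports from rlang and which converts a quosure to a string directly, so the [2] is not needed. The sketch below assumes quo_name() is available in your installed version; the function name mean_by_value() is made up so it does not clash with simpleFunction():

```r
library(dplyr)

mean_by_value <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  # quo_name() turns the quosure into the string "cyl",
  # so there is no "~" element to skip over.
  mean_name <- paste0("mean_", quo_name(mean_col))
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(!!mean_name := mean(!!mean_col))
}

result <- mean_by_value(mtcars, am, cyl, 1)
result
```

The result should be the same single mean_cyl column as before.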

That was it folks! I do recommend you read the Programming with dplyr vignette here as well as other blog posts, such as the one recommended to me by @dmi3k here.

Have fun with dplyr 0.7.0!
