Lesser known dplyr tricks

2017/03/08 R

In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr.

Removing unneeded columns

Did you know that you can use - in front of a column name to remove it from a data frame?

mtcars %>% 
    select(-disp) %>% 
    head()

##                    mpg cyl  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 105 2.76 3.460 20.22  1  0    3    1

Re-ordering columns

Still using select(), it is easy te re-order columns in your data frame:

mtcars %>% 
    select(cyl, disp, hp, everything()) %>% 
    head()

##                   cyl disp  hp  mpg drat    wt  qsec vs am gear carb
## Mazda RX4           6  160 110 21.0 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       6  160 110 21.0 3.90 2.875 17.02  0  1    4    4
## Datsun 710          4  108  93 22.8 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      6  258 110 21.4 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   8  360 175 18.7 3.15 3.440 17.02  0  0    3    2
## Valiant             6  225 105 18.1 2.76 3.460 20.22  1  0    3    1

As its name implies everything() simply means all the other columns.

Renaming columns with `rename()`

mtcars <- rename(mtcars, spam_mpg = mpg)
mtcars <- rename(mtcars, spam_disp = disp)
mtcars <- rename(mtcars, spam_hp = hp)

head(mtcars)

##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1
## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1
## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0
## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0
## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1

Selecting columns with a regexp

It is easy to select the columns that start with “spam” with some helper functions:

mtcars %>% 
    select(contains("spam")) %>% 
    head()

##                   spam_mpg spam_disp spam_hp
## Mazda RX4             21.0       160     110
## Mazda RX4 Wag         21.0       160     110
## Datsun 710            22.8       108      93
## Hornet 4 Drive        21.4       258     110
## Hornet Sportabout     18.7       360     175
## Valiant               18.1       225     105

take also a look at starts_with(), ends_with(), contains(), matches(), num_range(), one_of() and everything().

Create new columns with `mutate()` and `if_else()`

mtcars %>% 
    mutate(vs_new = if_else(
        vs == 1, 
        "one", 
        "zero", 
        NA_character_)) %>% 
    head()

##   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb vs_new
## 1     21.0   6       160     110 3.90 2.620 16.46  0  1    4    4   zero
## 2     21.0   6       160     110 3.90 2.875 17.02  0  1    4    4   zero
## 3     22.8   4       108      93 3.85 2.320 18.61  1  1    4    1    one
## 4     21.4   6       258     110 3.08 3.215 19.44  1  0    3    1    one
## 5     18.7   8       360     175 3.15 3.440 17.02  0  0    3    2   zero
## 6     18.1   6       225     105 2.76 3.460 20.22  1  0    3    1    one

You might want to create a new variable conditionally on several values of another column:

mtcars %>% 
    mutate(carb_new = case_when(.$carb == 1 ~ "one",
                                .$carb == 2 ~ "two",
                                .$carb == 4 ~ "four",
                                 TRUE ~ "other")) %>% 
    head(15)

##    spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb
## 1      21.0   6     160.0     110 3.90 2.620 16.46  0  1    4    4
## 2      21.0   6     160.0     110 3.90 2.875 17.02  0  1    4    4
## 3      22.8   4     108.0      93 3.85 2.320 18.61  1  1    4    1
## 4      21.4   6     258.0     110 3.08 3.215 19.44  1  0    3    1
## 5      18.7   8     360.0     175 3.15 3.440 17.02  0  0    3    2
## 6      18.1   6     225.0     105 2.76 3.460 20.22  1  0    3    1
## 7      14.3   8     360.0     245 3.21 3.570 15.84  0  0    3    4
## 8      24.4   4     146.7      62 3.69 3.190 20.00  1  0    4    2
## 9      22.8   4     140.8      95 3.92 3.150 22.90  1  0    4    2
## 10     19.2   6     167.6     123 3.92 3.440 18.30  1  0    4    4
## 11     17.8   6     167.6     123 3.92 3.440 18.90  1  0    4    4
## 12     16.4   8     275.8     180 3.07 4.070 17.40  0  0    3    3
## 13     17.3   8     275.8     180 3.07 3.730 17.60  0  0    3    3
## 14     15.2   8     275.8     180 3.07 3.780 18.00  0  0    3    3
## 15     10.4   8     472.0     205 2.93 5.250 17.98  0  0    3    4
##    carb_new
## 1      four
## 2      four
## 3       one
## 4       one
## 5       two
## 6       one
## 7      four
## 8       two
## 9       two
## 10     four
## 11     four
## 12    other
## 13    other
## 14    other
## 15     four

Mind the .$ before the variable carb. There is a github issue about this, and it is already fixed in the development version of dplyr, which means that in the next version of dplyr, case_when() will work as any other specialized dplyr function inside mutate().

Apply a function to certain columns only, by rows

mtcars %>%
    select(am, gear, carb) %>%
    purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)

For this, I had to use purrr’s by_row() function. You can then add this column to your original data frame:

mtcars <- cbind(mtcars, "sum_am_gear_carb" = mtcars2$sum_am_gear_carb)
head(mtcars)

##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1
## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1
## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0
## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0
## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0
##                   gear carb sum_am_gear_carb
## Mazda RX4            4    4                9
## Mazda RX4 Wag        4    4                9
## Datsun 710           4    1                6
## Hornet 4 Drive       3    1                4
## Hornet Sportabout    3    2                5
## Valiant              3    1                4

Use `do()` to do any arbitrary operation

mtcars %>% 
    group_by(cyl) %>% 
    do(models = lm(spam_mpg ~ drat + wt, data = .)) %>% 
    broom::tidy(models)

## # A tibble: 9 x 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1     4 (Intercept)   33.2      17.1       1.94  0.0877 
## 2     4 drat           1.32      3.45      0.384 0.711  
## 3     4 wt            -5.24      2.22     -2.37  0.0456 
## 4     6 (Intercept)   30.7       7.51      4.08  0.0151 
## 5     6 drat          -0.444     1.17     -0.378 0.725  
## 6     6 wt            -2.99      1.57     -1.91  0.129  
## 7     8 (Intercept)   29.7       7.09      4.18  0.00153
## 8     8 drat          -1.47      1.63     -0.903 0.386  
## 9     8 wt            -2.45      0.799    -3.07  0.0107

do() is useful when you want to use any R function (user defined functions work too!) with dplyr functions. First I grouped the observations by cyl and then ran a linear model for each group. Then I converted the output to a tidy data frame using broom::tidy().

Using `dplyr` functions inside your own functions

extract_vars <- function(data, some_string){
    
  data %>%
    select_(lazyeval::interp(~contains(some_string))) -> data
    
  return(data)
}

extract_vars(mtcars, "spam")

##                     spam_mpg spam_disp spam_hp
## Mazda RX4               21.0     160.0     110
## Mazda RX4 Wag           21.0     160.0     110
## Datsun 710              22.8     108.0      93
## Hornet 4 Drive          21.4     258.0     110
## Hornet Sportabout       18.7     360.0     175
## Valiant                 18.1     225.0     105
## Duster 360              14.3     360.0     245
## Merc 240D               24.4     146.7      62
## Merc 230                22.8     140.8      95
## Merc 280                19.2     167.6     123
## Merc 280C               17.8     167.6     123
## Merc 450SE              16.4     275.8     180
## Merc 450SL              17.3     275.8     180
## Merc 450SLC             15.2     275.8     180
## Cadillac Fleetwood      10.4     472.0     205
## Lincoln Continental     10.4     460.0     215
## Chrysler Imperial       14.7     440.0     230
## Fiat 128                32.4      78.7      66
## Honda Civic             30.4      75.7      52
## Toyota Corolla          33.9      71.1      65
## Toyota Corona           21.5     120.1      97
## Dodge Challenger        15.5     318.0     150
## AMC Javelin             15.2     304.0     150
## Camaro Z28              13.3     350.0     245
## Pontiac Firebird        19.2     400.0     175
## Fiat X1-9               27.3      79.0      66
## Porsche 914-2           26.0     120.3      91
## Lotus Europa            30.4      95.1     113
## Ford Pantera L          15.8     351.0     264
## Ferrari Dino            19.7     145.0     175
## Maserati Bora           15.0     301.0     335
## Volvo 142E              21.4     121.0     109

About this last point, you can read more about it here.

Hope you liked this small list of tricks!