About Me Blog
Building formulae Easy peasy STATA-like marginal effects with R Functional peace of mind It's lists all the way down It's lists all the way down, part 2: We need to go deeper Mapping a list of functions to a list of datasets with a list of columns as arguments Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Introducing brotools Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

tidyr::spread() and dplyr::rename_at() in action

I was recently confronted to a situation that required going from a long dataset to a wide dataset, but with a small twist: there were two datasets, which I had to merge into one. You might wonder what kinda crappy twist that is, right? Well, let’s take a look at the data:

data1; data2
## # A tibble: 20 x 4
##    country       date       variable_1 value
##      <chr>      <chr>            <chr> <int>
##  1      lu 01/01/2005            maybe    22
##  2      lu 01/07/2005            maybe    13
##  3      lu 01/01/2006            maybe    40
##  4      lu 01/07/2006            maybe    25
##  5      lu 01/01/2005    totally_agree    42
##  6      lu 01/07/2005    totally_agree    17
##  7      lu 01/01/2006    totally_agree    25
##  8      lu 01/07/2006    totally_agree    16
##  9      lu 01/01/2005 totally_disagree    39
## 10      lu 01/07/2005 totally_disagree    17
## 11      lu 01/01/2006 totally_disagree    23
## 12      lu 01/07/2006 totally_disagree    21
## 13      lu 01/01/2005   kinda_disagree    69
## 14      lu 01/07/2005   kinda_disagree    12
## 15      lu 01/01/2006   kinda_disagree    10
## 16      lu 01/07/2006   kinda_disagree     9
## 17      lu 01/01/2005      kinda_agree    38
## 18      lu 01/07/2005      kinda_agree    31
## 19      lu 01/01/2006      kinda_agree    19
## 20      lu 01/07/2006      kinda_agree    12
## # A tibble: 20 x 4
##    country       date       variable_2 value
##      <chr>      <chr>            <chr> <int>
##  1      lu 01/01/2005      kinda_agree    22
##  2      lu 01/07/2005      kinda_agree    13
##  3      lu 01/01/2006      kinda_agree    40
##  4      lu 01/07/2006      kinda_agree    25
##  5      lu 01/01/2005    totally_agree    42
##  6      lu 01/07/2005    totally_agree    17
##  7      lu 01/01/2006    totally_agree    25
##  8      lu 01/07/2006    totally_agree    16
##  9      lu 01/01/2005 totally_disagree    39
## 10      lu 01/07/2005 totally_disagree    17
## 11      lu 01/01/2006 totally_disagree    23
## 12      lu 01/07/2006 totally_disagree    21
## 13      lu 01/01/2005            maybe    69
## 14      lu 01/07/2005            maybe    12
## 15      lu 01/01/2006            maybe    10
## 16      lu 01/07/2006            maybe     9
## 17      lu 01/01/2005   kinda_disagree    38
## 18      lu 01/07/2005   kinda_disagree    31
## 19      lu 01/01/2006   kinda_disagree    19
## 20      lu 01/07/2006   kinda_disagree    12

As explained in Hadley (2014), this is how you should keep your data… But for a particular purpose, I had to transform these datasets. What I was asked to do was to merge these into a single wide data frame. Doing this for one dataset is easy:

data1 %>%
  spread(variable_1, value)
## # A tibble: 4 x 7
##   country       date kinda_agree kinda_disagree maybe totally_agree
## *   <chr>      <chr>       <int>          <int> <int>         <int>
## 1      lu 01/01/2005          38             69    22            42
## 2      lu 01/01/2006          19             10    40            25
## 3      lu 01/07/2005          31             12    13            17
## 4      lu 01/07/2006          12              9    25            16
## # ... with 1 more variables: totally_disagree <int>

But because data1 and data2 have the same levels for variable_1 and variable_2, this would not work. So the solution I found online, in this SO thread was to use tidyr::spread() with dplyr::rename_at() like this:

data1 <- data1 %>%
  spread(variable_1, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable1:", .)))

glimpse(data1)
## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
data2 <- data2 %>%
  spread(variable_2, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable2:", .)))

glimpse(data2)
## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

rename_at() needs variables which you pass to vars(), a helper function to select variables, and a function that will do the renaming, passed to funs(). The function I use is simply paste0(), which pastes a string, for example “variable1:” with the name of the columns, given by the single ‘.’, a dummy argument. Now these datasets can be merged:

data1 %>%
  full_join(data2) %>%
  glimpse()
## Joining, by = c("country", "date")
## Observations: 4
## Variables: 12
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

Hope this post helps you understand the difference between long and wide datasets better, as well as dplyr::rename_at()!