Econometrics and Free Software by Bruno Rodrigues.
RSS feed for blog post updates.
Follow me on Mastodon, twitter, or check out my Github.
Check out my package that adds logging to R functions, {chronicler}.
Or read my free ebooks, to learn some R and build reproducible analytical pipelines..
You can also watch my youtube channel or find the slides to the talks I've given here.
Buy me a coffee, my kids don't let me sleep.

tidyr::spread() and dplyr::rename_at() in action

R

I was recently confronted to a situation that required going from a long dataset to a wide dataset, but with a small twist: there were two datasets, which I had to merge into one. You might wonder what kinda crappy twist that is, right? Well, let’s take a look at the data:

data1; data2
## # A tibble: 20 x 4
##    country date       variable_1       value
##    <chr>   <chr>      <chr>            <int>
##  1 lu      01/01/2005 maybe               22
##  2 lu      01/07/2005 maybe               13
##  3 lu      01/01/2006 maybe               40
##  4 lu      01/07/2006 maybe               25
##  5 lu      01/01/2005 totally_agree       42
##  6 lu      01/07/2005 totally_agree       17
##  7 lu      01/01/2006 totally_agree       25
##  8 lu      01/07/2006 totally_agree       16
##  9 lu      01/01/2005 totally_disagree    39
## 10 lu      01/07/2005 totally_disagree    17
## 11 lu      01/01/2006 totally_disagree    23
## 12 lu      01/07/2006 totally_disagree    21
## 13 lu      01/01/2005 kinda_disagree      69
## 14 lu      01/07/2005 kinda_disagree      12
## 15 lu      01/01/2006 kinda_disagree      10
## 16 lu      01/07/2006 kinda_disagree       9
## 17 lu      01/01/2005 kinda_agree         38
## 18 lu      01/07/2005 kinda_agree         31
## 19 lu      01/01/2006 kinda_agree         19
## 20 lu      01/07/2006 kinda_agree         12
## # A tibble: 20 x 4
##    country date       variable_2       value
##    <chr>   <chr>      <chr>            <int>
##  1 lu      01/01/2005 kinda_agree         22
##  2 lu      01/07/2005 kinda_agree         13
##  3 lu      01/01/2006 kinda_agree         40
##  4 lu      01/07/2006 kinda_agree         25
##  5 lu      01/01/2005 totally_agree       42
##  6 lu      01/07/2005 totally_agree       17
##  7 lu      01/01/2006 totally_agree       25
##  8 lu      01/07/2006 totally_agree       16
##  9 lu      01/01/2005 totally_disagree    39
## 10 lu      01/07/2005 totally_disagree    17
## 11 lu      01/01/2006 totally_disagree    23
## 12 lu      01/07/2006 totally_disagree    21
## 13 lu      01/01/2005 maybe               69
## 14 lu      01/07/2005 maybe               12
## 15 lu      01/01/2006 maybe               10
## 16 lu      01/07/2006 maybe                9
## 17 lu      01/01/2005 kinda_disagree      38
## 18 lu      01/07/2005 kinda_disagree      31
## 19 lu      01/01/2006 kinda_disagree      19
## 20 lu      01/07/2006 kinda_disagree      12

As explained in Hadley (2014), this is how you should keep your data… But for a particular purpose, I had to transform these datasets. What I was asked to do was to merge these into a single wide data frame. Doing this for one dataset is easy:

data1 %>%
  spread(variable_1, value)
## # A tibble: 4 x 7
##   country date       kinda_agree kinda_disagree maybe totally_agree
##   <chr>   <chr>            <int>          <int> <int>         <int>
## 1 lu      01/01/2005          38             69    22            42
## 2 lu      01/01/2006          19             10    40            25
## 3 lu      01/07/2005          31             12    13            17
## 4 lu      01/07/2006          12              9    25            16
## # ... with 1 more variable: totally_disagree <int>

But because data1 and data2 have the same levels for variable_1 and variable_2, this would not work. So the solution I found online, in this SO thread was to use tidyr::spread() with dplyr::rename_at() like this:

data1 <- data1 %>%
  spread(variable_1, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable1:", .)))

glimpse(data1)
## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
data2 <- data2 %>%
  spread(variable_2, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable2:", .)))

glimpse(data2)
## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

rename_at() needs variables which you pass to vars(), a helper function to select variables, and a function that will do the renaming, passed to funs(). The function I use is simply paste0(), which pastes a string, for example “variable1:” with the name of the columns, given by the single ‘.’, a dummy argument. Now these datasets can be merged:

data1 %>%
  full_join(data2) %>%
  glimpse()
## Joining, by = c("country", "date")
## Observations: 4
## Variables: 12
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

Hope this post helps you understand the difference between long and wide datasets better, as well as dplyr::rename_at()!