tidyr::spread() and dplyr::rename_at() in action

2017/07/27 R

I was recently confronted to a situation that required going from a long dataset to a wide dataset, but with a small twist: there were two datasets, which I had to merge into one. You might wonder what kinda crappy twist that is, right? Well, let’s take a look at the data:

data1; data2

## # A tibble: 20 x 4
##    country date       variable_1       value
##    <chr>   <chr>      <chr>            <int>
##  1 lu      01/01/2005 maybe               22
##  2 lu      01/07/2005 maybe               13
##  3 lu      01/01/2006 maybe               40
##  4 lu      01/07/2006 maybe               25
##  5 lu      01/01/2005 totally_agree       42
##  6 lu      01/07/2005 totally_agree       17
##  7 lu      01/01/2006 totally_agree       25
##  8 lu      01/07/2006 totally_agree       16
##  9 lu      01/01/2005 totally_disagree    39
## 10 lu      01/07/2005 totally_disagree    17
## 11 lu      01/01/2006 totally_disagree    23
## 12 lu      01/07/2006 totally_disagree    21
## 13 lu      01/01/2005 kinda_disagree      69
## 14 lu      01/07/2005 kinda_disagree      12
## 15 lu      01/01/2006 kinda_disagree      10
## 16 lu      01/07/2006 kinda_disagree       9
## 17 lu      01/01/2005 kinda_agree         38
## 18 lu      01/07/2005 kinda_agree         31
## 19 lu      01/01/2006 kinda_agree         19
## 20 lu      01/07/2006 kinda_agree         12

## # A tibble: 20 x 4
##    country date       variable_2       value
##    <chr>   <chr>      <chr>            <int>
##  1 lu      01/01/2005 kinda_agree         22
##  2 lu      01/07/2005 kinda_agree         13
##  3 lu      01/01/2006 kinda_agree         40
##  4 lu      01/07/2006 kinda_agree         25
##  5 lu      01/01/2005 totally_agree       42
##  6 lu      01/07/2005 totally_agree       17
##  7 lu      01/01/2006 totally_agree       25
##  8 lu      01/07/2006 totally_agree       16
##  9 lu      01/01/2005 totally_disagree    39
## 10 lu      01/07/2005 totally_disagree    17
## 11 lu      01/01/2006 totally_disagree    23
## 12 lu      01/07/2006 totally_disagree    21
## 13 lu      01/01/2005 maybe               69
## 14 lu      01/07/2005 maybe               12
## 15 lu      01/01/2006 maybe               10
## 16 lu      01/07/2006 maybe                9
## 17 lu      01/01/2005 kinda_disagree      38
## 18 lu      01/07/2005 kinda_disagree      31
## 19 lu      01/01/2006 kinda_disagree      19
## 20 lu      01/07/2006 kinda_disagree      12

As explained in Hadley (2014), this is how you should keep your data… But for a particular purpose, I had to transform these datasets. What I was asked to do was to merge these into a single wide data frame. Doing this for one dataset is easy:

data1 %>%
  spread(variable_1, value)

## # A tibble: 4 x 7
##   country date       kinda_agree kinda_disagree maybe totally_agree
##   <chr>   <chr>            <int>          <int> <int>         <int>
## 1 lu      01/01/2005          38             69    22            42
## 2 lu      01/01/2006          19             10    40            25
## 3 lu      01/07/2005          31             12    13            17
## 4 lu      01/07/2006          12              9    25            16
## # ... with 1 more variable: totally_disagree <int>

But because data1 and data2 have the same levels for variable_1 and variable_2, this would not work. So the solution I found online, in this SO thread was to use tidyr::spread() with dplyr::rename_at() like this:

data1 <- data1 %>%
  spread(variable_1, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable1:", .)))

glimpse(data1)

## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21

data2 <- data2 %>%
  spread(variable_2, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable2:", .)))

glimpse(data2)

## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

rename_at() needs variables which you pass to vars(), a helper function to select variables, and a function that will do the renaming, passed to funs(). The function I use is simply paste0(), which pastes a string, for example “variable1:” with the name of the columns, given by the single ‘.’, a dummy argument. Now these datasets can be merged:

data1 %>%
  full_join(data2) %>%
  glimpse()

## Joining, by = c("country", "date")

## Observations: 4
## Variables: 12
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

Hope this post helps you understand the difference between long and wide datasets better, as well as dplyr::rename_at()!