About Me Blog
Getting data from pdfs using the pdftools package Imputing missing values in parallel using {furrr} {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

tidyr::spread() and dplyr::rename_at() in action

I was recently confronted to a situation that required going from a long dataset to a wide dataset, but with a small twist: there were two datasets, which I had to merge into one. You might wonder what kinda crappy twist that is, right? Well, let’s take a look at the data:

data1; data2
## # A tibble: 20 x 4
##    country date       variable_1       value
##    <chr>   <chr>      <chr>            <int>
##  1 lu      01/01/2005 maybe               22
##  2 lu      01/07/2005 maybe               13
##  3 lu      01/01/2006 maybe               40
##  4 lu      01/07/2006 maybe               25
##  5 lu      01/01/2005 totally_agree       42
##  6 lu      01/07/2005 totally_agree       17
##  7 lu      01/01/2006 totally_agree       25
##  8 lu      01/07/2006 totally_agree       16
##  9 lu      01/01/2005 totally_disagree    39
## 10 lu      01/07/2005 totally_disagree    17
## 11 lu      01/01/2006 totally_disagree    23
## 12 lu      01/07/2006 totally_disagree    21
## 13 lu      01/01/2005 kinda_disagree      69
## 14 lu      01/07/2005 kinda_disagree      12
## 15 lu      01/01/2006 kinda_disagree      10
## 16 lu      01/07/2006 kinda_disagree       9
## 17 lu      01/01/2005 kinda_agree         38
## 18 lu      01/07/2005 kinda_agree         31
## 19 lu      01/01/2006 kinda_agree         19
## 20 lu      01/07/2006 kinda_agree         12
## # A tibble: 20 x 4
##    country date       variable_2       value
##    <chr>   <chr>      <chr>            <int>
##  1 lu      01/01/2005 kinda_agree         22
##  2 lu      01/07/2005 kinda_agree         13
##  3 lu      01/01/2006 kinda_agree         40
##  4 lu      01/07/2006 kinda_agree         25
##  5 lu      01/01/2005 totally_agree       42
##  6 lu      01/07/2005 totally_agree       17
##  7 lu      01/01/2006 totally_agree       25
##  8 lu      01/07/2006 totally_agree       16
##  9 lu      01/01/2005 totally_disagree    39
## 10 lu      01/07/2005 totally_disagree    17
## 11 lu      01/01/2006 totally_disagree    23
## 12 lu      01/07/2006 totally_disagree    21
## 13 lu      01/01/2005 maybe               69
## 14 lu      01/07/2005 maybe               12
## 15 lu      01/01/2006 maybe               10
## 16 lu      01/07/2006 maybe                9
## 17 lu      01/01/2005 kinda_disagree      38
## 18 lu      01/07/2005 kinda_disagree      31
## 19 lu      01/01/2006 kinda_disagree      19
## 20 lu      01/07/2006 kinda_disagree      12

As explained in Hadley (2014), this is how you should keep your data… But for a particular purpose, I had to transform these datasets. What I was asked to do was to merge these into a single wide data frame. Doing this for one dataset is easy:

data1 %>%
  spread(variable_1, value)
## # A tibble: 4 x 7
##   country date       kinda_agree kinda_disagree maybe totally_agree
##   <chr>   <chr>            <int>          <int> <int>         <int>
## 1 lu      01/01/2005          38             69    22            42
## 2 lu      01/01/2006          19             10    40            25
## 3 lu      01/07/2005          31             12    13            17
## 4 lu      01/07/2006          12              9    25            16
## # ... with 1 more variable: totally_disagree <int>

But because data1 and data2 have the same levels for variable_1 and variable_2, this would not work. So the solution I found online, in this SO thread was to use tidyr::spread() with dplyr::rename_at() like this:

data1 <- data1 %>%
  spread(variable_1, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable1:", .)))

glimpse(data1)
## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
data2 <- data2 %>%
  spread(variable_2, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable2:", .)))

glimpse(data2)
## Observations: 4
## Variables: 7
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

rename_at() needs variables which you pass to vars(), a helper function to select variables, and a function that will do the renaming, passed to funs(). The function I use is simply paste0(), which pastes a string, for example “variable1:” with the name of the columns, given by the single ‘.’, a dummy argument. Now these datasets can be merged:

data1 %>%
  full_join(data2) %>%
  glimpse()
## Joining, by = c("country", "date")
## Observations: 4
## Variables: 12
## $ country                      <chr> "lu", "lu", "lu", "lu"
## $ date                         <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree`      <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree`   <int> 69, 10, 12, 9
## $ `variable1:maybe`            <int> 22, 40, 13, 25
## $ `variable1:totally_agree`    <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
## $ `variable2:kinda_agree`      <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree`   <int> 38, 19, 31, 12
## $ `variable2:maybe`            <int> 69, 10, 12, 9
## $ `variable2:totally_agree`    <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21

Hope this post helps you understand the difference between long and wide datasets better, as well as dplyr::rename_at()!