Econometrics and Free Software by Bruno Rodrigues.
RSS feed for blog post updates.
Follow me on Mastodon, twitter, or check out my Github.
Check out my package that adds logging to R functions, {chronicler}.
Or read my free ebooks, to learn some R and build reproducible analytical pipelines..
You can also watch my youtube channel or find the slides to the talks I've given here.
Buy me a coffee, my kids don't let me sleep.

Imputing missing values in parallel using {furrr}

R

Today I saw this tweet on my timeline:

and as a heavy {purrr} user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out {furrr}. For those unfamiliar with {purrr}, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: {purrr} contains so-called higher order functions, which are functions that take other functions as argument. One such function is map(). Consider the following simple example:

numbers <- seq(1, 10)

If you want the square root of this numbers, you can of course simply use the sqrt() function, because it is vectorized:

sqrt(numbers)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if sqrt() was not vectorized:

sqrt_numbers <- rep(0, 10)

for(i in length(numbers)){
  sqrt_numbers[i] <- sqrt(numbers[i])
}

First, you need to initialize a container, and then you have to populate the sqrt_numbers list with the results. Using, {purrr} is way easier:

library(tidyverse)
map(numbers, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 2.645751
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 3
## 
## [[10]]
## [1] 3.162278

map() is only one of the nice functions that are bundled inside {purrr}. Mastering {purrr} can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the {mice} package. {mice} comes with an example dataset called boys, let’s take a look at it:

library(mice)

data(boys)

brotools::describe(boys) %>%
  select(variable, type, n_missing, everything())
## # A tibble: 9 x 13
##   variable type    n_missing  nobs   mean    sd mode     min   max   q25
##   <chr>    <chr>       <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
## 1 age      Numeric         0   748   9.16  6.89 0.035  0.035  21.2  1.58
## 2 bmi      Numeric        21   748  18.1   3.05 14.54 11.8    31.7 15.9 
## 3 hc       Numeric        46   748  51.5   5.91 33.7  33.7    65   48.1 
## 4 hgt      Numeric        20   748 132.   46.5  50.1  50     198   84.9 
## 5 tv       Numeric       522   748  11.9   7.99 <NA>   1      25    4   
## 6 wgt      Numeric         4   748  37.2  26.0  3.65   3.14  117.  11.7 
## 7 gen      Factor        503   748  NA    NA    <NA>  NA      NA   NA   
## 8 phb      Factor        503   748  NA    NA    <NA>  NA      NA   NA   
## 9 reg      Factor          3   748  NA    NA    south NA      NA   NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

In the code above I use the describe() function from my personal package to get some summary statistics of the boys dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the mice function, it is very easy to impute them:

start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start

print(end)
## Time difference of 3.290611 mins

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase maxit, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The mice() function returns a mids object. If you want to look at the data, you have to use the complete() function (careful, there is also a complete() function in the {tidyr} package, so to avoid problems, I suggest you explicitely call mice::complete()):

imp_boys <- mice::complete(imp_boys, "long")

brotools::describe(imp_boys) %>%
  select(variable, type, n_missing, everything())
## # A tibble: 11 x 13
##    variable type   n_missing  nobs   mean     sd mode     min   max    q25
##    <chr>    <chr>      <int> <int>  <dbl>  <dbl> <chr>  <dbl> <dbl>  <dbl>
##  1 .id      Numer…         0  7480 374.   216.   1      1     748   188.  
##  2 .imp     Numer…         0  7480   5.5    2.87 1      1      10     3   
##  3 age      Numer…         0  7480   9.16   6.89 0.035  0.035  21.2   1.58
##  4 bmi      Numer…         0  7480  18.0    3.03 14.54 11.8    31.7  15.9 
##  5 hc       Numer…         0  7480  51.6    5.89 33.7  33.7    65    48.3 
##  6 hgt      Numer…         0  7480 131.    46.5  50.1  50     198    83   
##  7 tv       Numer…         0  7480   8.39   8.09 2      1      25     2   
##  8 wgt      Numer…         0  7480  37.1   26.0  3.65   3.14  117.   11.7 
##  9 gen      Factor         0  7480  NA     NA    G1    NA      NA    NA   
## 10 phb      Factor         0  7480  NA     NA    P1    NA      NA    NA   
## 11 reg      Factor         0  7480  NA     NA    south NA      NA    NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

As expected, no more missing values. The “long” argument inside mice::complete() is needed if you want the complete() function to return a long dataset. Doing the above “manually” using {purrr} is possible with the following code:

start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)
## Time difference of 3.393966 mins

What this does is map the function ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE) to a list of 1s, and creates 10 imputed data sets. m = . means that m will be equal to whatever is inside the list we are mapping our function over, so 1, then 1 then another 1 etc…. It took around the same amount of time as using mice() directly.

imp_boys_purrr is now a list of 10 mids objects. We thus need to map mice::complete() to imp_boys_purrr to get the data:

imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)

Now, imp_boys_purrr_complete is a list of 10 datasets. Let’s map brotools::describe() to it:

map(imp_boys_purrr_complete, brotools::describe)
## [[1]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.7   5.90 33.7  33.7    65   48.3    53.1  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   84     146.  175. 
## 5 tv       Numer…   748   8.35  8.00 3      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.0  3.65   3.14  117.  11.7    34.7  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[2]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.88 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.5   145.  175  
## 5 tv       Numer…   748   8.37  8.02 1      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.9    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P2    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[3]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.04 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.87 33.7  33.7    65   48.5    53.3  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   145.  175  
## 5 tv       Numer…   748   8.46  8.14 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[4]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.02 14.54 11.8    31.7 15.9    17.5  19.4
## 3 hc       Numer…   748  51.7   5.93 33.7  33.7    65   48.5    53.4  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   82.9   145.  175  
## 5 tv       Numer…   748   8.45  8.11 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.0  3.65   3.14  117.  11.7    34.7  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[5]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.91 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   146.  175. 
## 5 tv       Numer…   748   8.21  8.02 3      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[6]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.05 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.7   5.89 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   83.0   146.  175  
## 5 tv       Numer…   748   8.44  8.24 3      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[7]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.04 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.88 33.7  33.7    65   48.2    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.5   146.  175  
## 5 tv       Numer…   748   8.47  8.15 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[8]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.04 14.54 11.8    31.7 15.9    17.4  19.4
## 3 hc       Numer…   748  51.6   5.85 33.7  33.7    65   48.2    53.3  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   83.0   146.  175  
## 5 tv       Numer…   748   8.36  8.06 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[9]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.05 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.90 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.9   146.  175  
## 5 tv       Numer…   748   8.57  8.25 1      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[10]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.04 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.89 33.7  33.7    65   48.3    53.1  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   146.  175  
## 5 tv       Numer…   748   8.49  8.18 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of purrr::map(), called map2():

imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))

map2() applies a function, say f(), to 2 lists sequentially: f(x_1, y_1), then f(x_2, y_2), etc… So here I map mutate() to create a new column, imp_id in each dataset. Now let’s bind the rows and take a look at the data:

imp_boys_purrr <- bind_rows(imp_boys_purrr)

imp_boys_purrr %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())
## # A tibble: 10 x 13
##    variable type     n_missing  nobs   mean    sd mode     min   max   q25
##    <chr>    <chr>        <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
##  1 age      Numeric          0  7480   9.16  6.89 0.035  0.035  21.2  1.58
##  2 bmi      Numeric          0  7480  18.0   3.04 14.54 11.8    31.7 15.9 
##  3 hc       Numeric          0  7480  51.6   5.89 33.7  33.7    65   48.3 
##  4 hgt      Numeric          0  7480 131.   46.5  50.1  50     198   83   
##  5 tv       Numeric          0  7480   8.42  8.11 3      1      25    2   
##  6 wgt      Numeric          0  7480  37.1  26.0  3.65   3.14  117.  11.7 
##  7 imp_id   Charact…         0  7480  NA    NA    1     NA      NA   NA   
##  8 gen      Factor           0  7480  NA    NA    G1    NA      NA   NA   
##  9 phb      Factor           0  7480  NA    NA    P1    NA      NA   NA   
## 10 reg      Factor           0  7480  NA    NA    south NA      NA   NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using purrr::map() and simply use furrr::future_map() to parallelize the imputation process:

library(furrr)
## Loading required package: future
plan(multiprocess)

start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)
## Time difference of 33.73772 secs

Boooom! Much faster! And simply by loading {furrr}, then using plan(multiprocess) to run the code in parallel (if you forget that, the code will run on a single core) and using future_map() instead of map().

Let’s take a look at the data:

imp_boys_future_complete <- map(imp_boys_future, mice::complete)

imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))

imp_boys_future <- bind_rows(imp_boys_future)

imp_boys_future %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())
## # A tibble: 10 x 13
##    variable type     n_missing  nobs   mean    sd mode     min   max   q25
##    <chr>    <chr>        <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
##  1 age      Numeric          0  7480   9.16  6.89 0.035  0.035  21.2  1.58
##  2 bmi      Numeric          0  7480  18.0   3.04 14.54 11.8    31.7 15.9 
##  3 hc       Numeric          0  7480  51.6   5.89 33.7  33.7    65   48.4 
##  4 hgt      Numeric          0  7480 131.   46.5  50.1  50     198   83   
##  5 tv       Numeric          0  7480   8.35  8.09 3      1      25    2   
##  6 wgt      Numeric          0  7480  37.1  26.0  3.65   3.14  117.  11.7 
##  7 imp_id   Charact…         0  7480  NA    NA    1     NA      NA   NA   
##  8 gen      Factor           0  7480  NA    NA    G1    NA      NA   NA   
##  9 phb      Factor           0  7480  NA    NA    P1    NA      NA   NA   
## 10 reg      Factor           0  7480  NA    NA    south NA      NA   NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with {furrr} you must install it from Github, as it is not yet available on CRAN:

devtools::install_github("DavisVaughan/furrr")

If you are not comfortable with map() (and thus future_map()) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it like so:

devtools::install_github("b-rodrigues/parlMICE")

If you found this blog post useful, you might want to follow me on twitter for blog post updates.