```Econometrics and Free Software by Bruno Rodrigues.
Check out my package that adds logging to R functions, {chronicler}.
Or read my free ebooks, to learn some R and build reproducible analytical pipelines..
You can also watch my youtube channel or find the slides to the talks I've given here.
Buy me a coffee, my kids don't let me sleep.
```

# Imputing missing values in parallel using {furrr}

R

Today I saw this tweet on my timeline:

and as a heavy `{purrr}` user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out `{furrr}`. For those unfamiliar with `{purrr}`, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: `{purrr}` contains so-called higher order functions, which are functions that take other functions as argument. One such function is `map()`. Consider the following simple example:

``numbers <- seq(1, 10)``

If you want the square root of this numbers, you can of course simply use the `sqrt()` function, because it is vectorized:

``sqrt(numbers)``
``````##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278``````

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if `sqrt()` was not vectorized:

``````sqrt_numbers <- rep(0, 10)

for(i in length(numbers)){
sqrt_numbers[i] <- sqrt(numbers[i])
}``````

First, you need to initialize a container, and then you have to populate the `sqrt_numbers` list with the results. Using, `{purrr}` is way easier:

``````library(tidyverse)
map(numbers, sqrt)``````
``````## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 2.645751
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 3
##
## [[10]]
## [1] 3.162278``````

`map()` is only one of the nice functions that are bundled inside `{purrr}`. Mastering `{purrr}` can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the `{mice}` package. `{mice}` comes with an example dataset called `boys`, let’s take a look at it:

``````library(mice)

data(boys)

brotools::describe(boys) %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 9 x 13
##   variable type    n_missing  nobs   mean    sd mode     min   max   q25
##   <chr>    <chr>       <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
## 1 age      Numeric         0   748   9.16  6.89 0.035  0.035  21.2  1.58
## 2 bmi      Numeric        21   748  18.1   3.05 14.54 11.8    31.7 15.9
## 3 hc       Numeric        46   748  51.5   5.91 33.7  33.7    65   48.1
## 4 hgt      Numeric        20   748 132.   46.5  50.1  50     198   84.9
## 5 tv       Numeric       522   748  11.9   7.99 <NA>   1      25    4
## 6 wgt      Numeric         4   748  37.2  26.0  3.65   3.14  117.  11.7
## 7 gen      Factor        503   748  NA    NA    <NA>  NA      NA   NA
## 8 phb      Factor        503   748  NA    NA    <NA>  NA      NA   NA
## 9 reg      Factor          3   748  NA    NA    south NA      NA   NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>``````

In the code above I use the `describe()` function from my personal package to get some summary statistics of the `boys` dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the `mice` function, it is very easy to impute them:

``````start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start

print(end)``````
``## Time difference of 3.290611 mins``

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase `maxit`, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The `mice()` function returns a `mids` object. If you want to look at the data, you have to use the `complete()` function (careful, there is also a `complete()` function in the `{tidyr}` package, so to avoid problems, I suggest you explicitely call `mice::complete()`):

``````imp_boys <- mice::complete(imp_boys, "long")

brotools::describe(imp_boys) %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 11 x 13
##    variable type   n_missing  nobs   mean     sd mode     min   max    q25
##    <chr>    <chr>      <int> <int>  <dbl>  <dbl> <chr>  <dbl> <dbl>  <dbl>
##  1 .id      Numer…         0  7480 374.   216.   1      1     748   188.
##  2 .imp     Numer…         0  7480   5.5    2.87 1      1      10     3
##  3 age      Numer…         0  7480   9.16   6.89 0.035  0.035  21.2   1.58
##  4 bmi      Numer…         0  7480  18.0    3.03 14.54 11.8    31.7  15.9
##  5 hc       Numer…         0  7480  51.6    5.89 33.7  33.7    65    48.3
##  6 hgt      Numer…         0  7480 131.    46.5  50.1  50     198    83
##  7 tv       Numer…         0  7480   8.39   8.09 2      1      25     2
##  8 wgt      Numer…         0  7480  37.1   26.0  3.65   3.14  117.   11.7
##  9 gen      Factor         0  7480  NA     NA    G1    NA      NA    NA
## 10 phb      Factor         0  7480  NA     NA    P1    NA      NA    NA
## 11 reg      Factor         0  7480  NA     NA    south NA      NA    NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>``````

As expected, no more missing values. The “long” argument inside `mice::complete()` is needed if you want the `complete()` function to return a long dataset. Doing the above “manually” using `{purrr}` is possible with the following code:

``````start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)``````
``## Time difference of 3.393966 mins``

What this does is map the function `~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)` to a list of `1`s, and creates 10 imputed data sets. `m = .` means that `m` will be equal to whatever is inside the list we are mapping our function over, so `1`, then `1` then another `1` etc…. It took around the same amount of time as using `mice()` directly.

`imp_boys_purrr` is now a list of 10 `mids` objects. We thus need to map `mice::complete()` to `imp_boys_purrr` to get the data:

``imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)``

Now, `imp_boys_purrr_complete` is a list of 10 datasets. Let’s map `brotools::describe()` to it:

``map(imp_boys_purrr_complete, brotools::describe)``
``````## [[1]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.7   5.90 33.7  33.7    65   48.3    53.1  56
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   84     146.  175.
## 5 tv       Numer…   748   8.35  8.00 3      1      25    2       3    15
## 6 wgt      Numer…   748  37.2  26.0  3.65   3.14  117.  11.7    34.7  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[2]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.88 33.7  33.7    65   48.3    53.2  56
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.5   145.  175
## 5 tv       Numer…   748   8.37  8.02 1      1      25    2       3    15
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.9    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P2    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[3]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.04 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.87 33.7  33.7    65   48.5    53.3  56
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   145.  175
## 5 tv       Numer…   748   8.46  8.14 2      1      25    2       3    15
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[4]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.02 14.54 11.8    31.7 15.9    17.5  19.4
## 3 hc       Numer…   748  51.7   5.93 33.7  33.7    65   48.5    53.4  56
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   82.9   145.  175
## 5 tv       Numer…   748   8.45  8.11 2      1      25    2       3    15
## 6 wgt      Numer…   748  37.2  26.0  3.65   3.14  117.  11.7    34.7  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[5]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.91 33.7  33.7    65   48.3    53.2  56
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   146.  175.
## 5 tv       Numer…   748   8.21  8.02 3      1      25    2       3    15
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[6]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.05 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.7   5.89 33.7  33.7    65   48.3    53.2  56
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   83.0   146.  175
## 5 tv       Numer…   748   8.44  8.24 3      1      25    2       3    15
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[7]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.04 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.88 33.7  33.7    65   48.2    53.2  56
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.5   146.  175
## 5 tv       Numer…   748   8.47  8.15 2      1      25    2       3    15
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[8]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.04 14.54 11.8    31.7 15.9    17.4  19.4
## 3 hc       Numer…   748  51.6   5.85 33.7  33.7    65   48.2    53.3  56
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   83.0   146.  175
## 5 tv       Numer…   748   8.36  8.06 2      1      25    2       3    15
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[9]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.05 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.90 33.7  33.7    65   48.3    53.2  56
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.9   146.  175
## 5 tv       Numer…   748   8.57  8.25 1      1      25    2       3    15
## 6 wgt      Numer…   748  37.1  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[10]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.04 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.89 33.7  33.7    65   48.3    53.1  56
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   146.  175
## 5 tv       Numer…   748   8.49  8.18 2      1      25    2       3    15
## 6 wgt      Numer…   748  37.1  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>``````

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of `purrr::map()`, called `map2()`:

``imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))``

`map2()` applies a function, say `f()`, to 2 lists sequentially: `f(x_1, y_1)`, then `f(x_2, y_2)`, etc… So here I map `mutate()` to create a new column, `imp_id` in each dataset. Now let’s bind the rows and take a look at the data:

``````imp_boys_purrr <- bind_rows(imp_boys_purrr)

imp_boys_purrr %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 10 x 13
##    variable type     n_missing  nobs   mean    sd mode     min   max   q25
##    <chr>    <chr>        <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
##  1 age      Numeric          0  7480   9.16  6.89 0.035  0.035  21.2  1.58
##  2 bmi      Numeric          0  7480  18.0   3.04 14.54 11.8    31.7 15.9
##  3 hc       Numeric          0  7480  51.6   5.89 33.7  33.7    65   48.3
##  4 hgt      Numeric          0  7480 131.   46.5  50.1  50     198   83
##  5 tv       Numeric          0  7480   8.42  8.11 3      1      25    2
##  6 wgt      Numeric          0  7480  37.1  26.0  3.65   3.14  117.  11.7
##  7 imp_id   Charact…         0  7480  NA    NA    1     NA      NA   NA
##  8 gen      Factor           0  7480  NA    NA    G1    NA      NA   NA
##  9 phb      Factor           0  7480  NA    NA    P1    NA      NA   NA
## 10 reg      Factor           0  7480  NA    NA    south NA      NA   NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>``````

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using `purrr::map()` and simply use `furrr::future_map()` to parallelize the imputation process:

``library(furrr)``
``## Loading required package: future``
``````plan(multiprocess)

start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)``````
``## Time difference of 33.73772 secs``

Boooom! Much faster! And simply by loading `{furrr}`, then using `plan(multiprocess)` to run the code in parallel (if you forget that, the code will run on a single core) and using `future_map()` instead of `map()`.

Let’s take a look at the data:

``````imp_boys_future_complete <- map(imp_boys_future, mice::complete)

imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))

imp_boys_future <- bind_rows(imp_boys_future)

imp_boys_future %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 10 x 13
##    variable type     n_missing  nobs   mean    sd mode     min   max   q25
##    <chr>    <chr>        <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
##  1 age      Numeric          0  7480   9.16  6.89 0.035  0.035  21.2  1.58
##  2 bmi      Numeric          0  7480  18.0   3.04 14.54 11.8    31.7 15.9
##  3 hc       Numeric          0  7480  51.6   5.89 33.7  33.7    65   48.4
##  4 hgt      Numeric          0  7480 131.   46.5  50.1  50     198   83
##  5 tv       Numeric          0  7480   8.35  8.09 3      1      25    2
##  6 wgt      Numeric          0  7480  37.1  26.0  3.65   3.14  117.  11.7
##  7 imp_id   Charact…         0  7480  NA    NA    1     NA      NA   NA
##  8 gen      Factor           0  7480  NA    NA    G1    NA      NA   NA
##  9 phb      Factor           0  7480  NA    NA    P1    NA      NA   NA
## 10 reg      Factor           0  7480  NA    NA    south NA      NA   NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>``````

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with `{furrr}` you must install it from Github, as it is not yet available on CRAN:

``devtools::install_github("DavisVaughan/furrr")``

If you are not comfortable with `map()` (and thus `future_map()`) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it like so:

``devtools::install_github("b-rodrigues/parlMICE")``

If you found this blog post useful, you might want to follow me on twitter for blog post updates.