Econometrics and Free Software by Bruno Rodrigues.
RSS feed for blog post updates.
Follow me on Mastodon, twitter, or check out my Github.
Check out my package that adds logging to R functions, {chronicler}.
Or read my free ebooks, to learn some R and build reproducible analytical pipelines..
You can also watch my youtube channel or find the slides to the talks I've given here.
Buy me a coffee, my kids don't let me sleep.

Functional programming explains why containerization is needed for reproducibility


I’ve had some discussions online and in the real world about this blog post and I’d like to restate why containerization is needed for reproducibility, and do so from the lens of functional programming.

When setting up a pipeline, wether you’re a functional programming enthusiast or not, you’re aiming at setting it up in a way that this pipeline is the composition of (potentially) many referentially transparent and pure functions.

As a reminder:

  • referentially transparent functions are functions that always return the same output for the same given input. So for example f(x, y):=x+y is referentially transparent, but h(x):=x+y is not. Because y is not an input of h, h will look for y in the global environment. Depending on the value of y, h(1) might equal 10 one day, but 100 the next. Let’s say that f(1, 10) is always equal to 11. Because this is true, you could replace f(1, 10) everywhere it appears with 11. But consider the following example of a function that is not referentially transparent, rnorm(). Try rnorm(1) several times… It will always give a different result! This is because rnorm() looks for the seed in the global environment and uses that to generate a random number.

  • pure functions are functions without side effects. So a function just does its thing, and does not interact with anything else; doesn’t change anything in the global environment, doesn’t print anything on screen, doesn’t write anything to disk. Basically, pure functions are functions that do nothing else but computing stuff. Now this may seem limiting, and to some extent it is, so we will need to relax this a bit: we’ll be ok with functions that output stuff, but only the very last function in the pipeline will be allowed to do it.

To be pure, a function needs to be referentially transparent.

Ok so now that we know what referentially transparent and pure functions are, let’s explain why we want a pipeline to be a composition of such functions. Function composition is an operation that takes two functions g and f and returns a new function h such that h(x) = g(f(x)). Formally:

h = g ∘ f such that h(x) = g(f(x))

is the composition operator. You can read g ∘ f as g after f. In R, you can compose functions very easily, simply by using |> or %>%:

h <- f |> g

f |> g can be read as f then g, which is equivalent to g after f (ok, using |> is chaining rather than composing functions, but the net effect is the same).

So h would be our complete pipeline, which would be the composition, or chaining, of as many functions as needed:

h <- a |> b |> c |> d ... |> z

If all the functions are pure (and referentially transparent) then we’re assured that h will always produce the same outputs for the same inputs. As stated above, z will be allowed to not be pure an actually output something (like a rendered Quarto document) to disk. Ok so that’s great, and all, but why does the title of this blog post say that containerization is needed?

The problem is that all the functions we use have “hidden” inputs, and are never truly referentially transparent. These inputs are the following:

  • Version of R (or whatever programming language you’re using)
  • Versions of the packages you’re using
  • Operating system and its version (and all the different operating system dependencies that get used at run- or compile time)

For example, let’s take a look at this function:

f <- function(x){
  if (c(TRUE, FALSE)) x 

which will return the following on R 4.1 (which was released on May 2021):

[1] 1
Warning message:
In if (c(TRUE, FALSE)) 1 :
  the condition has length > 1 and only the first element will be used

So a result 1 and a warning. On R 4.2.2 (the current version as of writing), the exact same call returns:

Error in if (c(TRUE, FALSE)) 1 : the condition has length > 1

These types of breaking changes are rare in R, at least to my knowledge (I’m actually looking into this in greater detail, 2023 will likely be the year I show my findings), but in this case it illustrates my point: code that was behaving in a certain way started behaving in another way, even though nothing changed. What changed was the version of R, even though the function itself was pure. This wouldn’t be so surprising if instead of f(x), the function was something like f(x, r_version). In this case, the calls above would be something like:

f(1, r_version = "4.1")

and this would always return:

[1] 1
Warning message:
In if (c(TRUE, FALSE)) 1 :
  the condition has length > 1 and only the first element will be used

but changing the call to this:

f(1, r_version = "4.2.2")

would return the error:

Error in if (c(TRUE, FALSE)) 1 : the condition has length > 1

regardless of the version of R we’re running, so our function would be referentially transparent.

Alas, this is not possible, at least not like this.

Hence why tools like Docker, Podman (a Docker alternative) or Guix (which I learned about recently but never used, yet, and as far as I know, not a containerization solution, but a solution actually based on functional programming) are crucial to ensure that your pipeline is truly reproducible. Basically, using Docker you turn the hidden inputs defined before (versions of tools and OS) explicit. Take a look at this Dockerfile:

FROM rocker/r-ver:4.1.0

RUN R -e "f <- function(x){if (c(TRUE, FALSE)) x};f(1)"

CMD ["R"]

here’s what happens when you build it:

➤ docker build -t my_pipeline .
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM rocker/r-ver:4.1.0
4.1.0: Pulling from rocker/r-ver

eaead16dc43b: Already exists 
35eac095fa03: Pulling fs layer
c0088a79f8ab: Pulling fs layer
28e8d0ade0c0: Pulling fs layer
Digest: sha256:860c56970de1d37e9c376ca390617d50a127b58c56fbb807152c2e976ce02002
Status: Downloaded newer image for rocker/r-ver:4.1.0
 ---> d83268fb6cda
Step 2/3 : RUN R -e "f <- function(x){if (c(TRUE, FALSE)) x};f(1)"
 ---> Running in a158e4ab474f

R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> f <- function(x){if (c(TRUE, FALSE)) x};f(1)
[1] 1
Warning message:
In if (c(TRUE, FALSE)) x :> 

  the condition has length > 1 and only the first element will be used
Removing intermediate container a158e4ab474f
 ---> 49e2eb20a535
Step 3/3 : CMD ["R"]
 ---> Running in ccda657c4d95
Removing intermediate container ccda657c4d95
 ---> 5a432adbe6ff
Successfully built 5a432adbe6ff
Successfully tagged my_package:latest

as you can read from above, this starts the container with R version 4.1.0 and runs the code in it. We get back our result with the warning (it should be noted that in practice, you would structure your Dockerfile differently for running an actual pipeline).

This Dockerfile starts by using rocker/r-ver:4.1 as a basis. You can find this image in the versioned repository from the Rocker Project. This base image starts off from Ubuntu Focal Fossa so (Ubuntu version 20.04), uses R version 4.1.0 and even uses frozen CRAN repository as of 2021-08-09. It then runs our pipeline (or in this case, our simple function) in this, fixed environment. Our function essentially became f(x, os_version, r_version, packages_version) instead of just f(x). By changing the first statement of the Dockerfile:

FROM rocker/r-ver:4.1.0

to this:

FROM rocker/r-ver:3.5.0

we can even do some archaeology and run the pipeline on R version 3.5.0! This has great potential and hopefully one day Docker or similar solution will become just another tool in scientists/analysts toolbox.

If you want to start using Docker for your projects, I’ve written this tutorial and even a whole ebook.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on Mastodon or twitter for blog post updates and buy me an espresso or, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!

Buy me an EspressoBuy me an Espresso