Reproducibility with Docker and Github Actions for the average R enjoyer

2022/11/19 R

This blog post is a summary of Chapters 9 and 10 of this ebook I wrote for a course

The goal is the following: we want to write a pipeline that produces some plots. We want the code to be executed inside a Docker container for reproducibility, and we want this container to get executed on Github Actions. Github Actions is a Continuous Integration and Continuous Delivery service from Github that allows you to execute arbitrary code on events (like pushing code to a repo). It’s pretty neat. For example, you could be writing a paper using Latex and get the pdf compiled on Github Actions each time you push, without needing to have to do it yourself. Or if you are developing an R package, unit tests could get executed each time you push code, so you don’t have to do it manually.

This blog post will assume that you are familiar with R and are comfortable with it, as well as Git and Github.

It will also assume that you’ve at least heard of Docker and have it already installed on your computer, but ideally, you’ve already played a bit around with Docker. If you’re a total Docker beginner, this tutorial might be a bit too esoteric.

Let’s start by writing a pipeline that works on our machines using the {targets} package.

Getting something working on your machine

So, let’s say that you got some nice code that you need to rerun every month, week, day, or even hour. Or let’s say that you’re a researcher that is concerned with reproducibility. Let’s also say that you want to make sure that this code always produces the same result (let’s say it’s some plots that need to get remade once some data is refreshed).

Ok, so first of all, you really want your workflow to be defined using the {targets} package. If you’re not familiar with {targets}, this will serve as a micro introduction, but you really should read the {targets} manual, at least the walkthrough (watch the 4 minute video). {targets} is a build automation tool that you should definitely add to your toolbox.

Let’s define a workflow that does the following: data gets read, data gets filtered, data gets plotted. What’s the data about? Unemployment in Luxembourg. Luxembourg is a little Western European country that looks like a shoe and is about the size of .98 Rhode Islands from which yours truly hails from. Did you know that Luxembourg was a monarchy, and the last Grand-Duchy in the World? I bet you did not know that. Also, what you should know to understand the script below is that the country of Luxembourg is divided into Cantons, and each Cantons into Communes. Basically, if Luxembourg was the USA, Cantons would be States and Communes would be Counties (or Parishes or Boroughs). What’s confusing is that “Luxembourg” is also the name of a Canton, and of a Commune (which also has the status of a city).

Anyways, here’s how my script looks like:

library(targets)
library(dplyr)
library(ggplot2)
source("functions.R")


list(
    tar_target(
        unemp_data,
        get_data()
    ),

    tar_target(
        lux_data,
        clean_unemp(unemp_data,
                    place_name_of_interest = "Luxembourg",
                    level_of_interest = "Country",
                    col_of_interest = active_population)
    ),

    tar_target(
        canton_data,
        clean_unemp(unemp_data,
                    level_of_interest = "Canton",
                    col_of_interest = active_population)
    ),

    tar_target(
        commune_data,
        clean_unemp(unemp_data,
                    place_name_of_interest = c("Luxembourg",
                                               "Dippach",
                                               "Wiltz",
                                               "Esch/Alzette",
                                               "Mersch"),
                    col_of_interest = active_population)
    ),

    tar_target(
        lux_plot,
        make_plot(lux_data)
    ),

    tar_target(
        canton_plot,
        make_plot(canton_data)
    ),

    tar_target(
        commune_plot,
        make_plot(commune_data)
    ),

    tar_target(
        luxembourg_saved_plot,
        save_plot("fig/luxembourg.png", lux_plot),
        format = "file"
    ),

    tar_target(
        canton_saved_plot,
        save_plot("fig/canton.png", canton_plot),
        format = "file"
    ),

    tar_target(
        commune_saved_plot,
        save_plot("fig/commune.png", commune_plot),
        format = "file"
    )


)

Because this is a {targets} script, this needs to be saved inside a file called _targets.R. Each tar_target() object defines a target that will get built once we run the pipeline. The first element of tar_target() is the name of the target, the second line a call to a function that returns the first element and in the last three targets format = "file" is used to indicate that this target saves an output to disk (as a file).

The fourth line of the script sources a script called functions.R. This script should be placed next to the _targets.R script and should look like this:

# clean_unemp() is a function inside a package I made. Because I don't want you to install
# the package if you're following along, I'm simply sourcing it:

source("https://raw.githubusercontent.com/b-rodrigues/myPackage/main/R/functions.R")

# The cleaned data is also available in that same package. But again, because I don't want you
# to install a package just for a blog post, here is the script to clean it.
# Don't waste time trying to understand it, it's very specific to the data I'm using
# to illustrate the concept of reproducible analytical pipelines. Just accept this data 
# as given.

# This is a helper function to clean the data
clean_data <- function(x){
  x %>%
    janitor::clean_names() %>%
    mutate(level = case_when(
             grepl("Grand-D.*", commune) ~ "Country",
             grepl("Canton", commune) ~ "Canton",
             !grepl("(Canton|Grand-D.*)", commune) ~ "Commune"
           ),
           commune = ifelse(grepl("Canton", commune),
                            stringr::str_remove_all(commune, "Canton "),
                            commune),
           commune = ifelse(grepl("Grand-D.*", commune),
                            stringr::str_remove_all(commune, "Grand-Duche de "),
                            commune),
           ) %>%
    select(year,
           place_name = commune,
           level,
           everything())
}

# This reads in the data.
get_data <- function(){
  list(
    "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2013.csv",
    "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2014.csv",
    "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2015.csv",
    "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/unemp_2016.csv",
  ) |>
    purrr::map_dfr(readr::read_csv) %>%
    purrr::map_dfr(clean_data)
}

# This plots the data
make_plot <- function(data){
  ggplot(data) +
    geom_col(
      aes(
        y = active_population,
        x = year,
        fill = place_name
      )
    ) +
    theme(legend.position = "bottom",
          legend.title = element_blank())
}

# This saves plots to disk
save_plot <- function(save_path, plot){
  ggsave(save_path, plot)
  save_path
}

What you could do instead of having a functions.R script that you source like this, is put everything inside a package that you then host on Github. But that’s outside the scope of this blog post. Put these scripts inside a folder, open an R session inside that folder, and run the pipeline using targets::tar_make():

targets::tar_make()

/
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

• start target unemp_data
• built target unemp_data [1.826 seconds]
• start target canton_data
• built target canton_data [0.038 seconds]
• start target lux_data
• built target lux_data [0.034 seconds]
• start target commune_data
• built target commune_data [0.043 seconds]
• start target canton_plot
• built target canton_plot [0.007 seconds]
• start target lux_plot
• built target lux_plot [0.006 seconds]
• start target commune_plot
• built target commune_plot [0.003 seconds]
• start target canton_saved_plot
Saving 7 x 7 in image
• built target canton_saved_plot [0.425 seconds]
• start target luxembourg_saved_plot
Saving 7 x 7 in image
• built target luxembourg_saved_plot [0.285 seconds]
• start target commune_saved_plot
Saving 7 x 7 in image
• built target commune_saved_plot [0.291 seconds]
• end pipeline [3.128 seconds]

You can now see a fig/ folder in the root of your project with the plots. Sweet.

Making sure this is reproducible

Now what we would like to do is make sure that this pipeline will, for the same inputs, returns the same outputs FOREVER. If I’m running this in 10 years on R version 6.9, I want the exact same plots back. So the idea is to actually never run this on whatever version of R will be available in 10 years, but keep rerunning it, ad vitam æternam on whatever environment I’m using now to type this blog post. So for this, I’m going to use Docker.

(If, like me, you’re an average functional programming enjoyer, then this means getting rid of the hidden state of our pipeline. The hidden global state is the version of R and packages used to run the pipeline.)

What’s Docker? Docker is a way to run a Linux computer inside your computer (Linux or not). That computer is not real, but real enough for our purposes. Ever heard of virtual machines? Basically the same thing, but without the overhead of actually setting up and running a virtual machine.

You can write a simple text file that defines what your machine is, and what it should run. Thankfully, we don’t need to start from scratch and can use the amazing Rocker project that provides many, many, images for us to start playing with Docker. What’s a Docker image? A definition of a computer/machine. Which is a text file. Don’t ask why it’s called an image. Turns out the Rocker project has a page specifically on reproducibility. Their advice can be summarised as follows: if you’re aiming at setting up a reproducible pipeline, use a version-stable image. This means that if you start from such an image, the exact same R version will always be used to run your pipeline. Plus, the RStudio Public Package Manager (RSPM), frozen at a specific date, will be used to fetch the packages needed for your pipeline. So, not only is the R version frozen, but the exact same packages will always get installed (as long as the RSPM exists, hopefully for a long time).

Now, I’ve been talking about a script that defines an image for some time. This script is called a Dockerfile, and you can find the versioned Dockerfiles here. As you can see there are many Dockerfiles, each defining a Linux machine and with several things pre-installed. Let’s take a look at the image r-ver_4.2.1.Dockerfile. What’s interesting here are the following lines (let’s ignore the others):

8 ENV R_VERSION=4.2.1

16 ENV CRAN=https://packagemanager.rstudio.com/cran/__linux__/focal/2022-10-28

The last characters of that link are a date. This means that if you use this for your project, packages will be downloaded as they were on the October 28th, 2022, and the R version used will always be version 4.2.1.

Ok so, how do we use this?

Let’s add a Dockerfile to our project. Simply create a text file called Dockerfile and add the following lines in it:

FROM rocker/r-ver:4.2.1

RUN R -e "install.packages(c('dplyr', 'purrr', 'readr', 'stringr', 'ggplot2', 'janitor', 'targets'))"

RUN mkdir /home/fig

COPY _targets.R /_targets.R

COPY functions.R /functions.R

CMD R -e "targets::tar_make()"

Before continuing, I should explain what the first line does:

FROM rocker/r-ver:4.2.1

This simply means that we are using the image from before as a base. This image is itself based on Ubuntu Focal, see its first line:

FROM ubuntu:focal

Ubuntu is a very popular, likely the most popular, Linux distribution. So the versioned image is built on top of Ubuntu 20.04 codenamed Focal Fossa (which is a long term support release), and our image is built on top of that. To make sense of all this, you can take a look at the table here.

So now that we’ve written this Dockerfile, we need to build the image. This can be done inside a terminal with the following line:

docker build -t my_pipeline .

This tells Docker to build an image called my_pipeline using the Dockerfile in the current directory (hence the .).

But, here’s what happens when we try to run the pipeline (I’ll be showing the command to run the pipeline below):

> targets::tar_make()
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/local/lib/R/site-library/igraph/libs/igraph.so':
  libxml2.so.2: cannot open shared object file: No such file or directory
Calls: loadNamespace ... asNamespace -> loadNamespace -> library.dynam -> dyn.load
Execution halted

We get a nasty error message; apparently some library, libxml2.so cannot be found. So we need to change our Dockerfile, and add the following lines:

FROM rocker/r-ver:4.2.1

RUN apt-get update && apt-get install -y \
    libxml2-dev \
    libglpk-dev \
    libxt-dev

RUN R -e "install.packages(c('dplyr', 'purrr', 'readr', 'stringr', 'ggplot2', 'janitor', 'targets'))"

RUN mkdir /home/fig

COPY _targets.R /_targets.R

COPY functions.R /functions.R

CMD R -e "targets::tar_make()"

I’ve added these lines:

RUN apt-get update && apt-get install -y \
    libxml2-dev \
    libglpk-dev \
    libxt-dev

this runs the apt-get update and apt-get install commands. Aptitude is Ubuntu’s package manager and is used to install software. The three pieces of software I installed will avoid further issues. libxml2-dev is for the error message I’ve pasted here, while the other two avoid further error messages. One last thing before we rebuild th image: we actually need to change the _targets.R file a bit. Let’s take a look at our Dockerfile again, there’s three lines I haven’t commented:

RUN mkdir /home/fig

COPY _targets.R /_targets.R

COPY functions.R /functions.R

The first line creates the fig/ folder in the home/ directory, and the COPY statements copy the files into the Docker image, so that they’re actually available inside the Docker. I also need to tell _targets to save the figures into the home/fig folder. So simply change the last three targets from this:

tar_target(
        luxembourg_saved_plot,
        save_plot("fig/luxembourg.png", lux_plot),
        format = "file"
    ),

    tar_target(
        canton_saved_plot,
        save_plot("fig/canton.png", canton_plot),
        format = "file"
    ),

    tar_target(
        commune_saved_plot,
        save_plot("fig/commune.png", commune_plot),
        format = "file"
    )

to this:

tar_target(
        luxembourg_saved_plot,
        save_plot("/home/fig/luxembourg.png", lux_plot),
        format = "file"
    ),

    tar_target(
        canton_saved_plot,
        save_plot("/home/fig/canton.png", canton_plot),
        format = "file"
    ),

    tar_target(
        commune_saved_plot,
        save_plot("/home/fig/commune.png", commune_plot),
        format = "file"
    )

Ok, so now we’re ready to rebuild the image:

docker build -t my_pipeline .

and we can now run it:

docker run --rm --name my_pipeline_container -v /path/to/fig:/home/fig my_pipeline

docker run runs a container based on the image you defined. --rm means that the container should be removed once it stops, --name gives it a name, here my_pipeline_container (this is not really needed here, because the container stops and gets removed once it’s done running), and -v mounts a volume, which is a fancy way of saying that the folder /path/to/fig/, which is a real folder on your computer, is a portal to the folder /home/fig/ (which we created in the Dockerfile). This means that whatever gets saved inside home/fig/ inside the Docker container gets also saved inside /path/to/fig on your computer. The last argument my_pipeline is simply the Docker image you built before. You should see the three plots magically appearing in /path/to/fig once the container is done running. The other neat thing is that you can upload this image to Docker Hub, for free (to know how to do this, check out this section of the course I teach on this). This way, if other people want to run it, they could do so by running the same command as above, but replacing my_pipeline by your_username_on_docker_hub/image_name_on_docker_hub. People could even create new images based on this image, by using FROM your_username_on_docker_hub/image_name_on_docker_hub at the beginning of their Dockerfiles. If you want an example of a pipeline that starts off from such an image, you can check out this repository. This repository tells you how can run a reproducible pipeline by simply cloning it, building the image (which only takes a few seconds because all software is already installed in the image that I start from) and then running it.

Running this on Github Actions

Ok, so now, let’s suppose that we got an image on Docker Hub that contains all the dependencies required for our pipeline, and let’s say that we create a Github repository containing a Dockerfile that pulls from this image, as well as the required scripts for our pipeline. Basically, this is what I did here (the same repository that I linked above already). If you take a look at the first line of the Dockerfile in it, you will see this:

FROM brodriguesco/r421_rap:version1

This means that the image that gets built from this Dockerfile starts off from this image I’ve uploaded on Docker Hub, this way each time the image gets rebuilt, because the dependencies are already installed, it’s going to be fast. Ok, so now what I want is the following: each time I change a file, be it the Dockerfile, or the _targets.R script, commit my changes and push them, I want Github Actions to rebuild the image, run the container, and give me the plots back.

This means that I can focus on coding, Github Actions will take care of the boring stuff.

To do this, start by creating a .github/ directory on the root of your Github repo, and inside of it, add a .workflows directory, and add a file in it called something like docker-build-run.yml. What matters is that this file ends in .yml. This is what the file I use to define the actions I’ve described above looks like:

name: Docker Image CI

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:

  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Build the Docker image
      run: docker build -t my-image-name .
    - name: Docker Run Action
      run: docker run --rm --name my_pipeline_container -v /github/workspace/fig/:/home/graphs/:rw my-image-name
    - uses: actions/upload-artifact@v3
      with:
        name: my-figures
        path: /github/workspace/fig/

The first line defines the name of the job, here Docker Image CI. The lines state when this should get executed: whenever there’s a push on or pull request on main. The job itself runs on an Ubuntu VM (so Github Actions starts an Ubuntu VM that will pull a Docker image itself running Ubuntu…). Then, there’s the steps statement. For now, let’s focus on the run statements inside steps, because these should be familiar:

run: docker build -t my-image-name .

and:

run: docker run --rm --name my_pipeline_container -v /github/workspace/fig/:/home/graphs/:rw my-image-name

The only new thing here, is that the path “on our machine” has been changed to /github/workspace/. This is the home directory of your repository, so to speak. Now there’s the uses keyword that’s new:

uses: actions/checkout@v3

This action checkouts your repository inside the VM, so the files in the repo are available inside the VM. Then, there’s this action here:

- uses: actions/upload-artifact@v3
  with:
    name: my-figures
    path: /github/workspace/fig/

This action takes what’s inside /github/workspace/fig/ (which will be the output of our pipeline) and makes the contents available as so-called “artifacts”. Artifacts are the outputs of your workflow, and will be made available as zip files for download. In our case, as stated, the output of the pipeline. It is thus possible to rerun our workflow in the cloud. This has the advantage that we can now focus on simply changing the code, and not have to bother with useless manual steps. For example, let’s change this target in the _targets.R file:

tar_target(
    commune_data,
    clean_unemp(unemp_data,
                place_name_of_interest = c("Luxembourg", "Dippach", 
                                           "Wiltz", "Esch/Alzette", 
                                           "Mersch", "Dudelange"),
                col_of_interest = active_population)
)

I’ve added “Dudelange” to the list of communes to plot. Let me push this change to the repo now, and let’s take a look at the artifacts. The video below summarises the process:

As you can see in the video, the _targets.R script was changed, and the changes pushed to Github. This triggered the action we’ve defined before. The plots (artifacts) get refreshed, and we can download them. We see then that Dudelange was added in the communes.png plot!

If you enjoyed this blog post and want more of this, I wrote a whole ebook on it.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on Mastodon or twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!

Buy me an Espresso