
No excuse not to be a Bayesian anymore

My first encounter with Bayesian statistics was around 10 years ago, when I was doing my econometrics master’s degree. I was immediately very interested in the Bayesian approach to fitting econometric models because, when you read about Bayesian approaches, it all sounds so easy and natural. You have a model, you might have some prior beliefs about the model’s parameters, and Bayes’ rule tells you how your beliefs should change when you confront the model with data (evidence). It is really appealing, and I also liked the interpretation of the results, which felt just as natural. Once your software is done estimating/training the model, you don’t just get a single vector of point values for the parameters. You get whole distributions for each parameter, the so-called posterior distributions. You can then make statements like “there’s a 95% probability that this parameter lies between 0.12 and 0.21”, which is a statement you cannot make in a frequentist/classical framework.

However, while this was very appealing to me at the time, there is no free lunch, as they say. Back then, and it was not that long ago, doing Bayesian statistics was not as straightforward as it is now, as I will show you in this blog post. At the time, the BUGS language was still the standard way to describe a Bayesian model and the actual estimation procedure. However, using the BUGS language was tricky; there was WinBUGS, a Windows tool that had already been discontinued in favour of OpenBUGS, which was what I was using. The professor teaching this class, who eventually became one of my PhD advisors, was using WinBUGS to teach, but I was already using Linux at the time, so I went with OpenBUGS, which worked with WINE, I think (WINE is a compatibility layer that allows running some Windows programs on Linux. It is quite amazing what WINE is able to do, so much so that Valve forked it to create Proton, which enables running Windows games on Linux on their popular Steam platform). Plus, I found out that there was an R package to call OpenBUGS from R and get the results back into R seamlessly! I seem to remember there was one for WinBUGS as well, but also that I could not get it to work. Anyway, after some digging, I found an even better alternative called JAGS. JAGS is open source and natively available for Linux. It is also able to run chains in parallel, which is really useful for Bayesian inference. In any case, these were separate languages/programs from R, and as such the user had to learn how to use them. The way they worked was that users needed to write the Bayesian model in a separate text file. I will illustrate this with an example that I worked on during my Master’s, back in 2011. The example is taken from Ioannis Ntzoufras’ book Bayesian Modeling Using WinBUGS; it can be found in chapter 7, section 7.4.2. The dataset is from Montgomery et al. (2006) and refers to the damage sustained by aircraft in 30 strike missions during the Vietnam War.

Here’s a description of the data, taken from Ntzoufras’ book:

  • damage: the number of damaged locations of the aircraft
  • type: binary variable which indicates the type of the plane (0 for A4, 1 for A6)
  • bombload: the aircraft bomb load in tons
  • airexp: the total months of aircrew experience.

The goal is to find a model for damage: which variables explain the amount of damage on the aircraft? Now, something quite interesting here is that we only have 30 observations, and getting more observations is very costly. So what are you to do with this?

First, let’s write the model down:

\[ \begin{align} \text{damage}_i & \sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) & = \beta_1 + \beta_2*\text{type}_i + \beta_3 * \text{bombload}_i + \beta_4*\text{airexp}_i \end{align} \]

where \(i = 1, 2, 3, ..., 30\). Since damage is a count variable, we go with a Poisson distribution, which only has one parameter, \(\lambda\). \(\lambda_i\) is defined on the second line. These two lines together form the likelihood, which should be familiar to any statistician. In JAGS and BUGS, this likelihood had to be written in a separate text file, as I mentioned above, with a syntax that was very similar to R’s (at least for JAGS, because if memory serves, BUGS’s syntax differed a bit more):

model{
  for (i in 1:30){
    # likelihood: Poisson counts with a log link
    damage[i] ~ dpois(lambda[i])
    log(lambda[i]) <- b1 + b2 * type[i] + b3 * bombload[i] + b4 * airexp[i]
  }
  # diffuse normal priors (mean 0, precision 1.0E-4) on the coefficients
  b1 ~ dnorm(0, 1.0E-4)
  b2 ~ dnorm(0, 1.0E-4)
  b3 ~ dnorm(0, 1.0E-4)
  b4 ~ dnorm(0, 1.0E-4)
}

The four dnorm() lines at the end are the prior distributions on the parameters. This is something that does not exist in frequentist statistics. Frequentists would maximise the above likelihood and call it a day. The Bayesian framework, however, allows the practitioner to add prior knowledge to the model. This prior knowledge can come from domain expertise or from the literature. If the practitioner does not have a clue about good priors, then diffuse priors can be used. Diffuse priors carry very little information, if any. The priors above are diffuse; they’re normally distributed, centered around 0 with a very small precision (in BUGS and JAGS, the normal distribution is parameterised with two parameters, \(\mu\) and the precision \(\tau\), where \(\tau = \dfrac{1}{\sigma^2}\)). But since my student years I have learned that using such priors is actually not a very good idea, and that better alternatives exist (priors that at least provide some regularisation, for instance).

The Bayesian approach to statistics is often criticized for this, because priors are not objective: if you’re not using diffuse priors, then you’re using priors that carry some information, this information is subjective, and subjectivity is a big No-No. But should subjectivity be a No-No? After all, if you can defend your priors, either because of domain knowledge or because of past studies that provide some clue, why not use this information? Especially when there is very little data, like in this example. You can also perform a sensitivity analysis, and show how the posterior distributions of the parameters change when your priors change. What is important is to be fully transparent about the priors you’re using, and to have a clear justification for them. If these conditions are met, I don’t see why you should not use prior information in your model. Plus, even in frequentist statistics prior knowledge is used as well, for instance by pre-processing the data in a certain way, or by constraining the values the parameters are allowed to take in the optimisation routine (I’m looking at you, L-BFGS-B).
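To get a feel for just how diffuse dnorm(0, 1.0E-4) is, here is a small sketch in plain R (nothing here is specific to JAGS, and the range of values is arbitrary): a precision of \(10^{-4}\) corresponds to a standard deviation of 100, so the prior density is essentially flat over any plausible range for the coefficients.

# a precision tau of 1e-4 corresponds to a standard deviation of 1/sqrt(1e-4) = 100
prior_sd <- 1 / sqrt(1e-4)
prior_sd

# plot the prior density over a (much narrower) range of plausible coefficient values
curve(dnorm(x, mean = 0, sd = prior_sd), from = -10, to = 10)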

Now, let’s continue with the data. To load the data, I had to manually create each variable (but maybe JAGS now accepts data frames) and pass them to jags():

# We define the data as plain vectors, since jags() only takes numerical vectors,
# matrices, or a list of the objects' names as input
damage <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 2, 1, 1, 1, 1, 2, 3, 1, 1, 1, 2, 0, 1, 1, 2, 5, 1, 1, 5, 5, 7) 
type <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) 
bombload <- c(4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 7, 7, 7, 10, 10, 10, 12, 12, 12, 8, 8, 8, 14, 14, 14)
airexp <- c(91.5, 84, 76.5, 69, 61.5, 80, 72.5, 65, 57.5, 50, 103, 95.5, 88, 80.5, 73, 116.1, 100.6, 85, 69.4, 53.9, 
112.3, 96.7, 81.1, 65.6, 50, 120, 104.4, 88.9, 73.7, 57.8)

Now that the data is loaded, we can fit the model: after loading {R2jags}, which provides jags(), we define a vector with the names of the parameters to monitor, and a list with the names of the data objects:

library(R2jags)

parameters <- c("b1", "b2", "b3", "b4")
data <- list("damage", "type", "bombload", "airexp")

# We don't give inits to jags() since it can generate appropriate initial values

# Use this on single core machines and/or Windows machines
model_fit <- jags(data, inits = NULL, parameters, n.iter = 50000,
                  model.file = "bugsjags.txt", n.chains = 4, DIC = TRUE)

Notice that jags() has an argument called model.file, which points to the text file I showed above. Below is the code to take a look at the results:

# Let's see the results
model_fit$BUGSoutput

# the convergence diagnostics below come from the {coda} package
# ({lattice} provides the xyplot() generic)
library(coda)
library(lattice)

model.mcmc <- as.mcmc(model_fit)

traceplot(model.mcmc)

xyplot(model.mcmc)

heidel.diag(model.mcmc)

par(mfrow = c(2, 3))
autocorr.plot(model.mcmc[1], auto.layout = FALSE, ask = FALSE)

geweke.plot(model.mcmc)

We’re not actually going to look at this output, because I’m not running the code; I only wanted to show you how this was done back then. But why? Because now I can show you how it’s done nowadays with {rstanarm}.
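The stan_glm() call below expects the data as a data frame; a minimal sketch, assuming we simply bundle the vectors defined earlier into a data frame called bombs (the original code for this step is not shown here):

bombs <- data.frame(damage, type, bombload, airexp)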

library(rstanarm)

model_fit_stan <- stan_glm(damage ~ ., data = bombs, family = poisson)
## 
## SAMPLING FOR MODEL 'count' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 1.7e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.04083 seconds (Warm-up)
## Chain 1:                0.043647 seconds (Sampling)
## Chain 1:                0.084477 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'count' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 5e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.037986 seconds (Warm-up)
## Chain 2:                0.041253 seconds (Sampling)
## Chain 2:                0.079239 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'count' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 5e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.041033 seconds (Warm-up)
## Chain 3:                0.042982 seconds (Sampling)
## Chain 3:                0.084015 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'count' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 5e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.036928 seconds (Warm-up)
## Chain 4:                0.041124 seconds (Sampling)
## Chain 4:                0.078052 seconds (Total)
## Chain 4:

There is a lot of output, but the input was a single line that should look very familiar to practitioners used to the glm() function. I only used the default options, as well as the default priors. Specifying different priors is quite simple, but I won’t discuss it here, because this blog post, while it might look like a tutorial, is not a tutorial. What I wanted to show you is the difference that 9 years make in software development. {rstan} is an R package, first released in 2012 and developed ever since, that interfaces with Stan, a language for Bayesian statistics. Just like with JAGS and BUGS, users can write external files with very detailed models. But for smaller, or more standard, problems, {rstanarm} makes Bayesian inference very easy and familiar to the traditional way of doing things, and, as its name implies, uses Stan under the hood.
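Just to illustrate (a minimal sketch only; the normal priors and scales below are arbitrary choices, not recommendations), priors can be passed directly to stan_glm():

# weakly informative normal priors on the coefficients and the intercept
model_fit_stan_priors <- stan_glm(damage ~ ., data = bombs, family = poisson,
                                  prior = normal(0, 2.5),
                                  prior_intercept = normal(0, 5))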

Now let’s continue a little bit and take a look at the model summary:

summary(model_fit_stan)
## 
## Model Info:
##  function:     stan_glm
##  family:       poisson [log]
##  formula:      damage ~ .
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 30
##  predictors:   4
## 
## Estimates:
##               mean   sd   10%   50%   90%
## (Intercept) -0.5    0.9 -1.6  -0.5   0.7 
## type         0.6    0.5 -0.1   0.6   1.2 
## bombload     0.2    0.1  0.1   0.2   0.3 
## airexp       0.0    0.0  0.0   0.0   0.0 
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 1.5    0.3  1.1   1.5   1.9  
## 
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##               mcse Rhat n_eff
## (Intercept)   0.0  1.0  2504 
## type          0.0  1.0  2412 
## bombload      0.0  1.0  2262 
## airexp        0.0  1.0  2652 
## mean_PPD      0.0  1.0  3813 
## log-posterior 0.0  1.0  1902 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Just like in frequentist statistics, we get estimates for our parameters. But wait! In the intro of this blog post, I said that in Bayesian statistics we estimate whole parameter distributions. So why are we getting point estimates? Well, these are summary statistics of the posterior distributions: the mean, the standard deviation and some quantiles.
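For example, a credible interval for each parameter can be computed directly from the posterior draws; a quick sketch using {rstanarm}’s posterior_interval() (the 95% level is just an illustration):

# 95% credible intervals from the posterior draws
posterior_interval(model_fit_stan, prob = 0.95)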

To explore the results, I like to use {bayestestR}:

library(bayestestR)

describe_posterior(model_fit_stan)
## # Description of Posterior Distributions
## 
## Parameter   | Median | CI | CI_low | CI_high |    pd | ROPE_CI | ROPE_low | ROPE_high | ROPE_Percentage |  Rhat |  ESS
## ----------------------------------------------------------------------------------------------------------------------
## (Intercept) | -0.473 | 89 | -1.898 |   0.929 | 0.711 |      89 |   -0.100 |     0.100 |           0.082 | 0.999 | 2504
## type        |  0.577 | 89 | -0.230 |   1.365 | 0.869 |      89 |   -0.100 |     0.100 |           0.099 | 1.002 | 2412
## bombload    |  0.169 | 89 |  0.065 |   0.275 | 0.994 |      89 |   -0.100 |     0.100 |           0.108 | 1.001 | 2262
## airexp      | -0.014 | 89 | -0.028 |  -0.001 | 0.953 |      89 |   -0.100 |     0.100 |           1.000 | 1.000 | 2652

Let’s also actually see the posterior of, say, \(\beta_3\):

library(insight)
## Warning: package 'insight' was built under R version 3.6.2
library(ggplot2)

posteriors <- get_parameters(model_fit_stan)

ggplot(posteriors, aes(x = bombload)) +
  geom_density(fill = "cyan") +
  brotools::theme_blog()

I won’t go into much more detail, because you can read the very detailed vignettes on {bayestestR}’s website, Get started with Bayesian Analysis and Describing the posterior, which explain all of this much better than I ever could. The code I’m showing here is basically a copy-paste of these vignettes, so if I piqued your interest, go read them! I also highly recommend reading {rstanarm}’s vignettes, and grabbing the second edition of Statistical Rethinking by Richard McElreath; it is a great intro to Bayesian statistics with Stan.

Now, as the title of this blog post says, there is no excuse not to use Bayesian statistics anymore; from a software point of view, it has never been simpler. And by the way, Stan models are also supported in the {tidymodels} ecosystem through the {parsnip} package, which makes things even easier!
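As a quick illustration (a sketch, assuming {parsnip} and {rstanarm} are installed; the formula and data here are just placeholders), here is what a Bayesian linear regression looks like with {parsnip}’s "stan" engine:

library(parsnip)

# a model specification using the "stan" engine, which calls {rstanarm} behind the scenes
spec <- set_engine(linear_reg(), "stan")
fit(spec, mpg ~ wt + hp, data = mtcars)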

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates, buy me an espresso or donate via paypal.me, or buy my ebook on Leanpub.
