
Some fun with {gganimate}

In this short blog post I show you how you can use the {gganimate} package to create animations from {ggplot2} graphs with data from UNU-WIDER.

WIID data

Just before Christmas, UNU-WIDER released a new edition of their World Income Inequality Database:

The data is available in Excel and STATA formats, and I thought it was a great opportunity to release it as an R package. You can install it with:

devtools::install_github("b-rodrigues/wiid4")

Here is a short description of the data, taken from UNU-WIDER’s website:

"The World Income Inequality Database (WIID) presents information on income inequality for developed, developing, and transition countries. It provides the most comprehensive set of income inequality statistics available and can be downloaded for free.

WIID4, released in December 2018, covers 189 countries (including historical entities), with over 11,000 data points in total. With the current version, the latest observations now reach the year 2017."

It was also a good opportunity to play around with the {gganimate} package. This package makes it possible to create animations and is an extension to {ggplot2}. Read more about it here.

Preparing the data

To create a smooth animation, I need a balanced panel data set, meaning that for each country in the data set there are no missing years. I also chose to focus on certain variables only: net income, the whole population of the country (instead of just the economically active, for instance) and the whole country (and not just the rural areas). On this link you can find a codebook (pdf warning), so you can better understand the filters I defined below.

Let’s first load the packages and the data, and perform the necessary transformations:

library(wiid4)
library(tidyverse)
library(ggrepel)
library(gganimate)
library(brotools)

small_wiid4 <- wiid4 %>%
    # Recode the EU dummy into readable labels for the legend
    mutate(eu = as.character(eu)) %>%
    mutate(eu = case_when(eu == "1" ~ "EU member state",
                          eu == "0" ~ "Non-EU member state")) %>%
    # See the codebook for the meaning of these codes
    filter(resource == 1, popcovr == 1, areacovr == 1, scale == 2) %>%
    # For each country and year, keep the best quality, most comparable source
    group_by(country, year) %>%
    filter(quality_score == max(quality_score)) %>%
    filter(source == min(source)) %>%
    filter(!is.na(bottom5)) %>%
    # Keep only the countries that have every year from 2004 to 2016
    group_by(country) %>%
    mutate(flag = ifelse(all(seq(2004, 2016) %in% year), 1, 0)) %>%
    filter(flag == 1, year > 2003) %>%
    mutate(year = lubridate::ymd(paste0(year, "-01-01")))

For some countries and some years, there are several sources of data with varying quality. I only keep the highest quality sources with:

    group_by(country, year) %>%
    filter(quality_score == max(quality_score)) %>%

If there are different sources of equal quality, I give priority to the sources that are the most comparable across countries (Luxembourg Income Study, LIS data) over less comparable sources (at least that’s my understanding of the source variable) with:

    filter(source == min(source)) %>%
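
To see what these two filters do together, here is a minimal sketch on a made-up data set. The countries, scores and source codes are invented for illustration, and I assume, as above, that a lower source code means a more comparable source:

```r
library(dplyr)

# Toy data (invented): one year, several competing observations per country
toy <- data.frame(
  country       = c("A", "A", "A", "B", "B"),
  year          = c(2004, 2004, 2004, 2004, 2004),
  quality_score = c(10, 13, 13, 12, 8),
  source        = c(2, 3, 1, 2, 1),
  bottom5       = c(1.1, 1.2, 1.3, 0.9, 1.0)
)

kept <- toy %>%
  group_by(country, year) %>%
  # First keep the rows with the highest quality score in each country-year...
  filter(quality_score == max(quality_score)) %>%
  # ...then break ties by keeping the lowest source code among those rows
  filter(source == min(source)) %>%
  ungroup()

kept
```

For country A, two observations share the best quality score, so the source filter decides between them. For country B, the best quality observation is kept even though another observation has a lower source code, because quality is filtered first.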

I then remove missing data with:

    filter(!is.na(bottom5)) %>%

bottom5 and top5 give the share of income that is controlled by the bottom 5% and top 5% respectively. These are the variables that I want to plot.

Finally, I keep the years 2004 to 2016, without any interruption, with the following lines:

    mutate(flag = ifelse(all(seq(2004, 2016) %in% year), 1, 0)) %>%
    filter(flag == 1, year > 2003) %>%

ifelse(all(seq(2004, 2016) %in% year), 1, 0) creates a flag that equals 1 only if all the years from 2004 to 2016 are present for a country, without any interruption. Then I only keep the data from 2004 on, and only where the flag variable equals 1.
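
Here is a small sketch of how the flag behaves, on an invented panel with a shorter range of years (2004 to 2006) to keep it readable:

```r
library(dplyr)

# Toy panel (invented): "A" has every year from 2004 to 2006, "B" lacks 2005
toy <- data.frame(
  country = c("A", "A", "A", "A", "B", "B"),
  year    = c(2003, 2004, 2005, 2006, 2004, 2006)
)

balanced <- toy %>%
  group_by(country) %>%
  # all() gives a single TRUE/FALSE per country, recycled to each of its rows
  mutate(flag = ifelse(all(seq(2004, 2006) %in% year), 1, 0)) %>%
  # Drop the incomplete countries, and the years before the window
  filter(flag == 1, year > 2003) %>%
  ungroup()

balanced
```

Country B is dropped entirely because of its missing year, and the 2003 observation of country A is dropped by the second condition of the filter.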

In the end, I was left with European countries only. It would have been interesting to have countries from other continents, but apparently only European countries provide data on an annual basis.

Creating the animation

To create the animation I started by creating a static ggplot showing what I wanted: a scatter plot of the income share of the bottom 5% against the income share of the top 5%. The size of the bubbles should reflect the GDP per capita of the country, in logs (another variable provided in the data). Once the plot looked how I wanted, I added the lines that are specific to {gganimate}:

    labs(title = 'Year: {frame_time}', x = 'Top 5', y = 'Bottom 5') +
    transition_time(year) +
    ease_aes('linear')

I took this from {gganimate}’s README.

animation <- ggplot(small_wiid4) +
    geom_point(aes(y = bottom5, x = top5, colour = eu, size = log(gdp_ppp_pc_usd2011))) +
    xlim(c(10, 20)) +
    # Push the labels to the right-hand side of the plot
    geom_label_repel(aes(y = bottom5, x = top5, label = country), hjust = 1, nudge_x = 20) +
    theme_blog() +
    # Set after theme_blog() so that the theme cannot override it
    theme(legend.position = "bottom") +
    scale_color_blog() +
    # {frame_time} is replaced by the date of the current frame
    labs(title = 'Year: {frame_time}', x = 'Top 5', y = 'Bottom 5') +
    transition_time(year) +
    ease_aes('linear')

I use geom_label_repel to place the countries’ labels on the right of the plot. If I didn’t do this, the labels would float around with the points and the animation would be unreadable.

I then spent some time trying to render a nice webm instead of a gif. It took some trial and error and I am still not entirely satisfied with the result, but here is the code to render the animation:

animate(animation, renderer = ffmpeg_renderer(options = list(s = "864x480", 
                                                             vcodec = "libvpx-vp9",
                                                             crf = "15",
                                                             b = "1600k", 
                                                             vf = "setpts=5*PTS")))

The option vf = "setpts=5*PTS" is important because it slows the video down, so we can actually see something. crf = "15" is the quality of the video (lower is better), b = "1600k" is the bitrate, and vcodec = "libvpx-vp9" is the codec I use. The video you saw at the top of this post is the result. You can also find the video here, and here’s a gif if all else fails:
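
As a back-of-the-envelope check of what setpts=5*PTS does, here is a small sketch. It assumes that animate() is called with its defaults of 100 frames at 10 frames per second, and that the filter only stretches the timestamps of the existing frames without creating new ones; the variable names are mine:

```r
# Assumed defaults of animate(): 100 frames played at 10 frames per second
nframes  <- 100
fps      <- 10
slowdown <- 5  # the factor in vf = "setpts=5*PTS"

original_duration <- nframes / fps                 # length in seconds
slowed_duration   <- original_duration * slowdown  # length after the filter
distinct_fps      <- nframes / slowed_duration     # distinct images per second

c(original = original_duration, slowed = slowed_duration, distinct_fps = distinct_fps)
```

Under these assumptions the video goes from 10 to 50 seconds, but it still only contains 100 distinct images, so the slowdown comes at the cost of smoothness.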

I would have preferred if the video was smoother, which should be possible by creating more frames. I did not find such an option in {gganimate}, and perhaps there is none, at least for now.

In any case {gganimate} is pretty nice to play with, and I’ll definitely use it more!

Update

Silly me! It turns out that the animate() function has arguments that control the number of frames and the duration, without needing to pass options to the renderer. I was only looking at the options for the renderer, without having read the documentation of the animate() function. For example, here is how you can make a GIF that lasts for 20 seconds, running at 20 frames per second, pausing for 5 frames at the end and then rewinding:

animate(animation, nframes = 400, duration = 20, fps = 20, end_pause = 5, rewind = TRUE)

I guess that you should only pass options to the renderer if you really need fine-grained control.

This took around 2 minutes to finish. You can use the same options with the ffmpeg renderer too. Here is what the gif looks like:

Much, much smoother!
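
The numbers passed to animate() above are consistent with each other, as this quick sketch shows. The comparison with the defaults assumes that animate() renders 100 frames when nothing else is specified:

```r
# Values used in the animate() call above
duration <- 20  # seconds
fps      <- 20  # frames per second

# Number of frames needed to fill the duration at that frame rate
nframes <- duration * fps

# Assumed default number of frames of animate()
default_nframes <- 100
frames_ratio    <- nframes / default_nframes

c(nframes = nframes, frames_ratio = frames_ratio)
```

So the smoother result comes from rendering four times as many distinct frames, instead of stretching the same frames over a longer video.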

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.
