About Me Blog
A tutorial on tidy cross-validation with R Analyzing NetHack data, part 1: What kills the players Analyzing NetHack data, part 2: What players kill the most Building a shiny app to explore historical newspapers: a step-by-step guide Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1 Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2 Curly-Curly, the successor of Bang-Bang Dealing with heteroskedasticity; regression with robust standard errors using R Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer Fast food, causality and R packages, part 1 Fast food, causality and R packages, part 2 For posterity: install {xml2} on GNU/Linux distros Forecasting my weight with R From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick} Getting data from pdfs using the pdftools package Getting the data from the Luxembourguish elections out of Excel Going from a human readable Excel file to a machine-readable csv with {tidyxl} Historical newspaper scraping with {tesseract} and R How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data Imputing missing values in parallel using {furrr} Intermittent demand, Croston and Die Hard Looking into 19th century ads from a Luxembourguish newspaper with R Making sense of the METS and ALTO XML standards Manipulate dates easily with {lubridate} Manipulating strings with the {stringr} package Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data Missing data imputation and instrumental variables regression: the tidy approach Modern R with the tidyverse is available on Leanpub Objects types and some useful R functions for beginners Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()` R or Python? Why not both? Using Anaconda Python within R with {reticulate} Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach Some fun with {gganimate} Split-apply-combine for Maximum Likelihood Estimation of a linear model Statistical matching, or when one single data source is not enough The best way to visit Luxembourguish castles is doing data science + combinatorial optimization The never-ending editor war (?) The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century Using a genetic algorithm for the hyperparameter optimization of a SARIMA model Using cosine similarity to find matching documents: a tutorial using Seneca's letters to his friend Lucilius Using linear models with binary dependent variables, a simulation study Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods What hyper-parameters are, and what to do with them; an illustration with ridge regression {disk.frame} is epic {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

Manipulate dates easily with {lubridate}

This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for free here. This is taken from Chapter 5, which presents the {tidyverse} packages and how to use them to compute descriptive statistics and manipulate data. In the text below, I scrape a table from Wikipedia, which shows when African countries gained independence from other countries. Then, using {lubridate} functions I show you how you can answers questions such as Which countries gained independence before 1960?.

Set-up: scraping some data from Wikipedia

{lubridate} is yet another tidyverse package, that makes dealing with dates or duration data (and intervals) as painless as possible. I do not use every function contained in the package daily, and as such will only focus on some of the functions. However, if you have to deal with dates often, you might want to explore the package thoroughly.

Let’s get some data from a Wikipedia table:

library(tidyverse)
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/Decolonisation_of_Africa")

independence <- page %>%
    html_node(".wikitable") %>%
    html_table(fill = TRUE)

independence <- independence %>%
    select(-Rank) %>%
    map_df(~str_remove_all(., "\\[.*\\]")) %>%
    rename(country = `Country[a]`,
           colonial_name = `Colonial name`,
           colonial_power = `Colonial power[b]`,
           independence_date = `Independence date[c]`,
           first_head_of_state = `First head of state[d]`,
           independence_won_through = `Independence won through`)

This dataset was scraped from the following Wikipedia table. It shows when African countries gained independence from which colonial powers. In Chapter 11, I will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents of the dataset:

independence
## # A tibble: 54 x 6
##    country colonial_name colonial_power independence_da… first_head_of_s…
##    <chr>   <chr>         <chr>          <chr>            <chr>           
##  1 Liberia Liberia       United States  26 July 1847     Joseph Jenkins …
##  2 South … Cape Colony … United Kingdom 31 May 1910      Louis Botha     
##  3 Egypt   Sultanate of… United Kingdom 28 February 1922 Fuad I          
##  4 Eritrea Italian Erit… Italy          10 February 1947 Haile Selassie  
##  5 Libya   British Mili… United Kingdo… 24 December 1951 Idris           
##  6 Sudan   Anglo-Egypti… United Kingdo… 1 January 1956   Ismail al-Azhari
##  7 Tunisia French Prote… France         20 March 1956    Muhammad VIII a…
##  8 Morocco French Prote… France Spain   2 March 19567 A… Mohammed V      
##  9 Ghana   Gold Coast    United Kingdom 6 March 1957     Kwame Nkrumah   
## 10 Guinea  French West … France         2 October 1958   Ahmed Sékou Tou…
## # … with 44 more rows, and 1 more variable: independence_won_through <chr>

as you can see, the date of independence is in a format that might make it difficult to answer questions such as Which African countries gained independence before 1960 ? for two reasons. First of all, the date uses the name of the month instead of the number of the month (well, this is not such a big deal, but still), and second of all the type of the independence day column is character and not “date”. So our first task is to correctly define the column as being of type date, while making sure that R understands that January is supposed to be “01”, and so on.

Using {lubridate}

There are several helpful functions included in {lubridate} to convert columns to dates. For instance if the column you want to convert is of the form “2012-11-21”, then you would use the function ymd(), for “year-month-day”. If, however the column is “2012-21-11”, then you would use ydm(). There’s a few of these helper functions, and they can handle a lot of different formats for dates. In our case, having the name of the month instead of the number might seem quite problematic, but it turns out that this is a case that {lubridate} handles painfully:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
independence <- independence %>%
  mutate(independence_date = dmy(independence_date))
## Warning: 5 failed to parse.

Some dates failed to parse, for instance for Morocco. This is because these countries have several independence dates; this means that the string to convert looks like:

"2 March 1956
7 April 1956
10 April 1958
4 January 1969"

which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for simplicity’s sake.

Let’s take a look at the data now:

independence
## # A tibble: 54 x 6
##    country colonial_name colonial_power independence_da… first_head_of_s…
##    <chr>   <chr>         <chr>          <date>           <chr>           
##  1 Liberia Liberia       United States  1847-07-26       Joseph Jenkins …
##  2 South … Cape Colony … United Kingdom 1910-05-31       Louis Botha     
##  3 Egypt   Sultanate of… United Kingdom 1922-02-28       Fuad I          
##  4 Eritrea Italian Erit… Italy          1947-02-10       Haile Selassie  
##  5 Libya   British Mili… United Kingdo… 1951-12-24       Idris           
##  6 Sudan   Anglo-Egypti… United Kingdo… 1956-01-01       Ismail al-Azhari
##  7 Tunisia French Prote… France         1956-03-20       Muhammad VIII a…
##  8 Morocco French Prote… France Spain   NA               Mohammed V      
##  9 Ghana   Gold Coast    United Kingdom 1957-03-06       Kwame Nkrumah   
## 10 Guinea  French West … France         1958-10-02       Ahmed Sékou Tou…
## # … with 44 more rows, and 1 more variable: independence_won_through <chr>

As you can see, we now have a date column in the right format. We can now answer questions such as Which countries gained independence before 1960? quite easily, by using the functions year(), month() and day(). Let’s see which countries gained independence before 1960:

independence %>%
  filter(year(independence_date) <= 1960) %>%
  pull(country)
##  [1] "Liberia"                          "South Africa"                    
##  [3] "Egypt"                            "Eritrea"                         
##  [5] "Libya"                            "Sudan"                           
##  [7] "Tunisia"                          "Ghana"                           
##  [9] "Guinea"                           "Cameroon"                        
## [11] "Togo"                             "Mali"                            
## [13] "Madagascar"                       "Democratic Republic of the Congo"
## [15] "Benin"                            "Niger"                           
## [17] "Burkina Faso"                     "Ivory Coast"                     
## [19] "Chad"                             "Central African Republic"        
## [21] "Republic of the Congo"            "Gabon"                           
## [23] "Mauritania"

You guessed it, year() extracts the year of the date column and converts it as a numeric so that we can work on it. This is the same for month() or day(). Let’s try to see if countries gained their independence on Christmas Eve:

independence %>%
  filter(month(independence_date) == 12,
         day(independence_date) == 24) %>%
  pull(country)
## [1] "Libya"

Seems like Libya was the only one! You can also operate on dates. For instance, let’s compute the difference between two dates, using the interval() column:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since)
## # A tibble: 54 x 2
##    country      independent_since             
##    <chr>        <S4: Interval>                
##  1 Liberia      1847-07-26 UTC--2019-02-10 UTC
##  2 South Africa 1910-05-31 UTC--2019-02-10 UTC
##  3 Egypt        1922-02-28 UTC--2019-02-10 UTC
##  4 Eritrea      1947-02-10 UTC--2019-02-10 UTC
##  5 Libya        1951-12-24 UTC--2019-02-10 UTC
##  6 Sudan        1956-01-01 UTC--2019-02-10 UTC
##  7 Tunisia      1956-03-20 UTC--2019-02-10 UTC
##  8 Morocco      NA--NA                        
##  9 Ghana        1957-03-06 UTC--2019-02-10 UTC
## 10 Guinea       1958-10-02 UTC--2019-02-10 UTC
## # … with 44 more rows

The independent_since column now contains an interval object that we can convert to years:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since) %>%
  mutate(years_independent = as.numeric(independent_since, "years"))
## # A tibble: 54 x 3
##    country      independent_since              years_independent
##    <chr>        <S4: Interval>                             <dbl>
##  1 Liberia      1847-07-26 UTC--2019-02-10 UTC             172. 
##  2 South Africa 1910-05-31 UTC--2019-02-10 UTC             109. 
##  3 Egypt        1922-02-28 UTC--2019-02-10 UTC              97.0
##  4 Eritrea      1947-02-10 UTC--2019-02-10 UTC              72  
##  5 Libya        1951-12-24 UTC--2019-02-10 UTC              67.1
##  6 Sudan        1956-01-01 UTC--2019-02-10 UTC              63.1
##  7 Tunisia      1956-03-20 UTC--2019-02-10 UTC              62.9
##  8 Morocco      NA--NA                                      NA  
##  9 Ghana        1957-03-06 UTC--2019-02-10 UTC              61.9
## 10 Guinea       1958-10-02 UTC--2019-02-10 UTC              60.4
## # … with 44 more rows

We can now see for how long the last country to gain independence has been independent. Because the data is not tidy (in some cases, an African country was colonized by two powers, see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:

independence %>%
  filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  mutate(years_independent = as.numeric(independent_since, "years")) %>%
  group_by(colonial_power) %>%
  summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))
## # A tibble: 4 x 2
##   colonial_power last_colony_independent_for
##   <chr>                                <dbl>
## 1 Belgium                               56.6
## 2 France                                41.6
## 3 Portugal                              43.2
## 4 United Kingdom                        42.6

{lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate} is a package that you have to master.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.

Buy me an EspressoBuy me an Espresso