About Me Blog
A tutorial on tidy cross-validation with R Analyzing NetHack data, part 1: What kills the players Analyzing NetHack data, part 2: What players kill the most Building a shiny app to explore historical newspapers: a step-by-step guide Dealing with heteroskedasticity; regression with robust standard errors using R Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer Forecasting my weight with R From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack Getting data from pdfs using the pdftools package Getting the data from the Luxembourguish elections out of Excel Going from a human readable Excel file to a machine-readable csv with {tidyxl} How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data Imputing missing values in parallel using {furrr} Looking into 19th century ads from a Luxembourguish newspaper with R Making sense of the METS and ALTO XML standards Manipulate dates easily with {lubridate} Manipulating strings with the {stringr} package Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data Missing data imputation and instrumental variables regression: the tidy approach Objects types and some useful R functions for beginners R or Python? Why not both? Using Anaconda Python within R with {reticulate} Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach Some fun with {gganimate} The best way to visit Luxembourguish castles is doing data science + combinatorial optimization The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century Using a genetic algorithm for the hyperparameter optimization of a SARIMA model Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods What hyper-parameters are, and what to do with them; an illustration with ridge regression {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport

In this blog post, I will show you how you can quickly and easily forecast a univariate time series. I am going to use data from the EU Open Data Portal on air passenger transport. You can find the data here. I downloaded the data in the TSV format for Luxembourg Airport, but you could repeat the analysis for any airport.

Once you have the data, load some of the package we are going to need:


and define the following function:

ihs <- function(x){
    log(x + sqrt(x**2 + 1))

This function, the inverse hyperbolic sine, is useful to transform data in a manner that is very close to logging it, but that allows for 0’s. The data from Eurostat is not complete for some reason, so there are some 0 sometimes. To avoid having to log 0, which in R yields -Inf, I use this transformation.

Now, let’s load the data:

avia <- read_tsv("avia_par_lu.tsv")
## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.

Let’s take a look at the data:

## # A tibble: 6 x 238
##   `unit,tra_meas,… `2018Q1` `2018M03` `2018M02` `2018M01` `2017Q4` `2017Q3`
##   <chr>            <chr>    <chr>     <chr>     <chr>     <chr>    <chr>   
## 1 FLIGHT,CAF_PAS,… 511      172       161       178       502      475     
## 2 FLIGHT,CAF_PAS,… :        :         :         :         :        :       
## 3 FLIGHT,CAF_PAS,… :        :         :         :         399      306     
## 4 FLIGHT,CAF_PAS,… 485      167       151       167       493      497     
## 5 FLIGHT,CAF_PAS,… 834      293       267       274       790      728     
## 6 FLIGHT,CAF_PAS,… :        :         :         :         :        :       
## # … with 231 more variables: `2017Q2` <chr>, `2017Q1` <chr>,
## #   `2017M12` <chr>, `2017M11` <chr>, `2017M10` <chr>, `2017M09` <chr>,
## #   `2017M08` <chr>, `2017M07` <chr>, `2017M06` <chr>, `2017M05` <chr>,
## #   `2017M04` <chr>, `2017M03` <chr>, `2017M02` <chr>, `2017M01` <chr>,
## #   `2017` <chr>, `2016Q4` <chr>, `2016Q3` <chr>, `2016Q2` <chr>,
## #   `2016Q1` <chr>, `2016M12` <chr>, `2016M11` <chr>, `2016M10` <chr>,
## #   `2016M09` <chr>, `2016M08` <chr>, `2016M07` <chr>, `2016M06` <chr>,
## #   `2016M05` <chr>, `2016M04` <chr>, `2016M03` <chr>, `2016M02` <chr>,
## #   `2016M01` <chr>, `2016` <chr>, `2015Q4` <chr>, `2015Q3` <chr>,
## #   `2015Q2` <chr>, `2015Q1` <chr>, `2015M12` <chr>, `2015M11` <chr>,
## #   `2015M10` <chr>, `2015M09` <chr>, `2015M08` <chr>, `2015M07` <chr>,
## #   `2015M06` <chr>, `2015M05` <chr>, `2015M04` <chr>, `2015M03` <chr>,
## #   `2015M02` <chr>, `2015M01` <chr>, `2015` <chr>, `2014Q4` <chr>,
## #   `2014Q3` <chr>, `2014Q2` <chr>, `2014Q1` <chr>, `2014M12` <chr>,
## #   `2014M11` <chr>, `2014M10` <chr>, `2014M09` <chr>, `2014M08` <chr>,
## #   `2014M07` <chr>, `2014M06` <chr>, `2014M05` <chr>, `2014M04` <chr>,
## #   `2014M03` <chr>, `2014M02` <chr>, `2014M01` <chr>, `2014` <chr>,
## #   `2013Q4` <chr>, `2013Q3` <chr>, `2013Q2` <chr>, `2013Q1` <chr>,
## #   `2013M12` <chr>, `2013M11` <chr>, `2013M10` <chr>, `2013M09` <chr>,
## #   `2013M08` <chr>, `2013M07` <chr>, `2013M06` <chr>, `2013M05` <chr>,
## #   `2013M04` <chr>, `2013M03` <chr>, `2013M02` <chr>, `2013M01` <chr>,
## #   `2013` <chr>, `2012Q4` <chr>, `2012Q3` <chr>, `2012Q2` <chr>,
## #   `2012Q1` <chr>, `2012M12` <chr>, `2012M11` <chr>, `2012M10` <chr>,
## #   `2012M09` <chr>, `2012M08` <chr>, `2012M07` <chr>, `2012M06` <chr>,
## #   `2012M05` <chr>, `2012M04` <chr>, `2012M03` <chr>, `2012M02` <chr>,
## #   `2012M01` <chr>, `2012` <chr>, …

So yeah, useless in that state. The first column actually is composed of 3 columns, merged together, and instead of having one column with the date, and another with the value, we have one column per date. Some cleaning is necessary before using this data.

Let’s start with going from a wide to a long data set:

avia %>%
    select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
    gather(date, passengers, -`unit,tra_meas,airp_pr\\time`)

The first line makes it possible to only select the columns that contain the string “20”, so selecting columns from 2000 onward. Then, using gather, I go from long to wide. The data looks like this now:

## # A tibble: 117,070 x 3
##    `unit,tra_meas,airp_pr\\time`  date   passengers
##    <chr>                          <chr>  <chr>     
##  1 FLIGHT,CAF_PAS,LU_ELLX_AT_LOWW 2018Q1 511       
##  2 FLIGHT,CAF_PAS,LU_ELLX_BE_EBBR 2018Q1 :         
##  3 FLIGHT,CAF_PAS,LU_ELLX_CH_LSGG 2018Q1 :         
##  4 FLIGHT,CAF_PAS,LU_ELLX_CH_LSZH 2018Q1 485       
##  5 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDF 2018Q1 834       
##  6 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDI 2018Q1 :         
##  7 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDM 2018Q1 1095      
##  8 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDR 2018Q1 :         
##  9 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDT 2018Q1 :         
## 10 FLIGHT,CAF_PAS,LU_ELLX_DK_EKCH 2018Q1 :         
## # … with 117,060 more rows

Now, let’s separate the first column into 3 columns:

avia %>%
    select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
    gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>%
     separate(col = `unit,tra_meas,airp_pr\\time`, into = c("unit", "tra_meas", "air_pr\\time"), sep = ",")

This separates the first column into 3 new columns, “unit”, “tra_meas” and “air_pr\time”. This step is not necessary for the rest of the analysis, but might as well do it. The data looks like this now:

## # A tibble: 117,070 x 5
##    unit   tra_meas `air_pr\\time`  date   passengers
##    <chr>  <chr>    <chr>           <chr>  <chr>     
##  1 FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2018Q1 511       
##  2 FLIGHT CAF_PAS  LU_ELLX_BE_EBBR 2018Q1 :         
##  3 FLIGHT CAF_PAS  LU_ELLX_CH_LSGG 2018Q1 :         
##  4 FLIGHT CAF_PAS  LU_ELLX_CH_LSZH 2018Q1 485       
##  5 FLIGHT CAF_PAS  LU_ELLX_DE_EDDF 2018Q1 834       
##  6 FLIGHT CAF_PAS  LU_ELLX_DE_EDDI 2018Q1 :         
##  7 FLIGHT CAF_PAS  LU_ELLX_DE_EDDM 2018Q1 1095      
##  8 FLIGHT CAF_PAS  LU_ELLX_DE_EDDR 2018Q1 :         
##  9 FLIGHT CAF_PAS  LU_ELLX_DE_EDDT 2018Q1 :         
## 10 FLIGHT CAF_PAS  LU_ELLX_DK_EKCH 2018Q1 :         
## # … with 117,060 more rows

The next steps are simple renamings. I have copy-pasted the information from the Eurostat page where you can view the data online. If you click here:

you will be able to select the variables you want displayed in the table, as well as the dictionary of the variables. I simply copy pasted it and recoded the variables. You can take a look at the whole cleaning workflow by clicking “Click to expand” below:

Click here to take a look at the whole cleaning workflow

avia_clean <- avia %>%
    select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
    gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>%
    separate(col = `unit,tra_meas,airp_pr\\time`, into = c("unit", "tra_meas", "air_pr\\time"), sep = ",") %>%
    mutate(tra_meas = fct_recode(tra_meas,
         `Passengers on board` = "PAS_BRD",
         `Passengers on board (arrivals)` = "PAS_BRD_ARR",
         `Passengers on board (departures)` = "PAS_BRD_DEP",
         `Passengers carried` = "PAS_CRD",
         `Passengers carried (arrival)` = "PAS_CRD_ARR",
         `Passengers carried (departures)` = "PAS_CRD_DEP",
         `Passengers seats available` = "ST_PAS",
         `Passengers seats available (arrivals)` = "ST_PAS_ARR",
         `Passengers seats available (departures)` = "ST_PAS_DEP",
         `Commercial passenger air flights` = "CAF_PAS",
         `Commercial passenger air flights (arrivals)` = "CAF_PAS_ARR",
         `Commercial passenger air flights (departures)` = "CAF_PAS_DEP")) %>%
    mutate(unit = fct_recode(unit,
                             Passenger = "PAS",
                             Flight = "FLIGHT",
                             `Seats and berths` = "SEAT")) %>%
    mutate(destination = fct_recode(`air_pr\\time`,
                                     `WIEN-SCHWECHAT` = "LU_ELLX_AT_LOWW",
                                     `BRUSSELS` = "LU_ELLX_BE_EBBR",
                                     `GENEVA` = "LU_ELLX_CH_LSGG",
                                     `ZURICH` = "LU_ELLX_CH_LSZH",
                                     `FRANKFURT/MAIN` = "LU_ELLX_DE_EDDF",
                                     `HAMBURG` = "LU_ELLX_DE_EDDH",
                                     `BERLIN-TEMPELHOF` = "LU_ELLX_DE_EDDI",
                                     `MUENCHEN` = "LU_ELLX_DE_EDDM",
                                     `SAARBRUECKEN` = "LU_ELLX_DE_EDDR",
                                     `BERLIN-TEGEL` = "LU_ELLX_DE_EDDT",
                                     `KOBENHAVN/KASTRUP` = "LU_ELLX_DK_EKCH",
                                     `HURGHADA / INTL` = "LU_ELLX_EG_HEGN",
                                     `IRAKLION/NIKOS KAZANTZAKIS` = "LU_ELLX_EL_LGIR",
                                     `FUERTEVENTURA` = "LU_ELLX_ES_GCFV",
                                     `GRAN CANARIA` = "LU_ELLX_ES_GCLP",
                                     `LANZAROTE` = "LU_ELLX_ES_GCRR",
                                     `TENERIFE SUR/REINA SOFIA` = "LU_ELLX_ES_GCTS",
                                     `BARCELONA/EL PRAT` = "LU_ELLX_ES_LEBL",
                                     `ADOLFO SUAREZ MADRID-BARAJAS` = "LU_ELLX_ES_LEMD",
                                     `MALAGA/COSTA DEL SOL` = "LU_ELLX_ES_LEMG",
                                     `PALMA DE MALLORCA` = "LU_ELLX_ES_LEPA",
                                     `SYSTEM - PARIS` = "LU_ELLX_FR_LF90",
                                     `NICE-COTE D'AZUR` = "LU_ELLX_FR_LFMN",
                                     `PARIS-CHARLES DE GAULLE` = "LU_ELLX_FR_LFPG",
                                     `STRASBOURG-ENTZHEIM` = "LU_ELLX_FR_LFST",
                                     `KEFLAVIK` = "LU_ELLX_IS_BIKF",
                                     `MILANO/MALPENSA` = "LU_ELLX_IT_LIMC",
                                     `BERGAMO/ORIO AL SERIO` = "LU_ELLX_IT_LIME",
                                     `ROMA/FIUMICINO` = "LU_ELLX_IT_LIRF",
                                     `AGADIR/AL MASSIRA` = "LU_ELLX_MA_GMAD",
                                     `AMSTERDAM/SCHIPHOL` = "LU_ELLX_NL_EHAM",
                                     `WARSZAWA/CHOPINA` = "LU_ELLX_PL_EPWA",
                                     `PORTO` = "LU_ELLX_PT_LPPR",
                                     `LISBOA` = "LU_ELLX_PT_LPPT",
                                     `STOCKHOLM/ARLANDA` = "LU_ELLX_SE_ESSA",
                                     `MONASTIR/HABIB BOURGUIBA` = "LU_ELLX_TN_DTMB",
                                     `ENFIDHA-HAMMAMET INTERNATIONAL` = "LU_ELLX_TN_DTNH",
                                     `ENFIDHA ZINE EL ABIDINE BEN ALI` = "LU_ELLX_TN_DTNZ",
                                     `DJERBA/ZARZIS` = "LU_ELLX_TN_DTTJ",
                                     `ANTALYA (MIL-CIV)` = "LU_ELLX_TR_LTAI",
                                     `ISTANBUL/ATATURK` = "LU_ELLX_TR_LTBA",
                                     `SYSTEM - LONDON` = "LU_ELLX_UK_EG90",
                                     `MANCHESTER` = "LU_ELLX_UK_EGCC",
                                     `LONDON GATWICK` = "LU_ELLX_UK_EGKK",
                                     `LONDON/CITY` = "LU_ELLX_UK_EGLC",
                                     `LONDON HEATHROW` = "LU_ELLX_UK_EGLL",
                                     `LONDON STANSTED` = "LU_ELLX_UK_EGSS",
                                     `NEWARK LIBERTY INTERNATIONAL, NJ.` = "LU_ELLX_US_KEWR",
                                     `O.R TAMBO INTERNATIONAL` = "LU_ELLX_ZA_FAJS")) %>%
    mutate(passengers = as.numeric(passengers)) %>%
    select(unit, tra_meas, destination, date, passengers)
## Warning in evalq(as.numeric(passengers), <environment>): NAs introduced by
## coercion

There is quarterly data and monthly data. Let’s separate the two:

avia_clean_quarterly <- avia_clean %>%
    filter(tra_meas == "Passengers on board (arrivals)",
           !is.na(passengers)) %>%
    filter(str_detect(date, "Q")) %>%
    mutate(date = yq(date))

In the “date” column, I detect the observations with “Q” in their name, indicating that it is quarterly data. I do the same for monthly data, but I have to add the string “01” to the dates. This transforms a date that looks like this “2018M1” to this “2018M101”. “2018M101” can then be converted into a date by using the ymd() function from lubridate. yq() was used for the quarterly data.

avia_clean_monthly <- avia_clean %>%
    filter(tra_meas == "Passengers on board (arrivals)",
           !is.na(passengers)) %>%
    filter(str_detect(date, "M")) %>%
    mutate(date = paste0(date, "01")) %>%
    mutate(date = ymd(date)) %>%
    select(destination, date, passengers)

Time for some plots. Let’s start with the raw data:

avia_clean_monthly %>%
    group_by(date) %>%
    summarise(total = sum(passengers)) %>%
    ggplot() +
    ggtitle("Raw data") +
    geom_line(aes(y = total, x = date), colour = "#82518c") +
    scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") + 

And now with the logged data (or rather, the data transformed using the inverted hyperbolic sine transformation):

avia_clean_monthly %>%
    group_by(date) %>%
    summarise(total = sum(passengers)) %>%
    mutate(total_ihs = ihs(total)) %>%
    ggplot() +
    ggtitle("Logged data") +
    geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
    scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") + 

We clearly see a seasonal pattern in the data. There is also an upward trend. We will have to deal with these two problems if we want to do some forecasting. For this, let’s limit ourselves to data from before 2015, and convert the “passengers” column from the data to a time series object, using the ts() function:

avia_clean_train <- avia_clean_monthly %>%
    select(date, passengers) %>%
    filter(year(date) < 2015) %>%
    group_by(date) %>%
    summarise(total_passengers = sum(passengers)) %>%
    pull(total_passengers) %>%
    ts(., frequency = 12, start = c(2005, 1))

We will try to pseudo-forecast the data from 2015 to the last point available, March 2018. First, let’s tranform the data:

logged_data <- ihs(avia_clean_train)

Taking the log, or ihs of the data deals with stabilizing the variance of the time series.

There might also be a need to difference the data. Computing the differences between consecutive observations makes the time-series stationary. This will be taken care of by the auto.arima() function, if needed. The auto.arima() function returns the best ARIMA model according to different statistical criterions, such as the AIC, AICc or BIC.

(model_fit <- auto.arima(logged_data))
## Series: logged_data 
## ARIMA(2,1,1)(2,1,0)[12] 
## Coefficients:
##           ar1      ar2      ma1     sar1     sar2
##       -0.4061  -0.2431  -0.3562  -0.5590  -0.3282
## s.e.   0.2003   0.1432   0.1994   0.0911   0.0871
## sigma^2 estimated as 0.004503:  log likelihood=137.11
## AIC=-262.21   AICc=-261.37   BIC=-246.17

auto.arima() found that the best model would be an \(ARIMA(2, 1, 1)(2, 1, 0)_{12}\). This is an seasonal autoregressive model, with p = 2, d = 1, q = 1, P = 2 and D = 1.

model_forecast <- forecast(model_fit, h = 39)

I can now forecast the model for the next 39 months (which correspond to the data available).

To plot the forecast, one could do a simple call to the plot function. But the resulting plot is not very aesthetic. To plot my own, I have to grab the data that was forecast, and do some munging again:

point_estimate <- model_forecast$mean %>%
    as_tsibble() %>%
    rename(point_estimate = value,
           date = index)

upper <- model_forecast$upper %>%
    as_tsibble() %>%
    spread(key, value) %>%
    rename(date = index,
           upper80 = `80%`,
           upper95 = `95%`)

lower <- model_forecast$lower %>%
    as_tsibble() %>%
    spread(key, value) %>%
    rename(date = index,
           lower80 = `80%`,
           lower95 = `95%`)

estimated_data <- reduce(list(point_estimate, upper, lower), full_join, by = "date")

as_tsibble() is a function from the {tsibble} package that converts objects that are time-series aware to time-aware tibbles. If you are not familiar with ts_tibble(), I urge you to run the above lines one by one, and especially to compare as_tsibble() with the standard as_tibble() from the {tibble} package.

This is how estimated_data looks:

## # A tsibble: 6 x 6 [1M]
##       date point_estimate upper80 upper95 lower80 lower95
##      <mth>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 2015 Jan           11.9    12.0    12.1    11.8    11.8
## 2 2015 Feb           11.9    12.0    12.0    11.8    11.7
## 3 2015 Mar           12.1    12.2    12.3    12.0    12.0
## 4 2015 Apr           12.2    12.3    12.4    12.1    12.1
## 5 2015 May           12.3    12.4    12.4    12.2    12.1
## 6 2015 Jun           12.3    12.4    12.5    12.2    12.1

We can now plot the data, with the forecast, and with the 95% confidence interval:

avia_clean_monthly %>%
    group_by(date) %>%
    summarise(total = sum(passengers)) %>%
    mutate(total_ihs = ihs(total)) %>%
    ggplot() +
    ggtitle("Logged data") +
    geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
    scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
    geom_ribbon(data = estimated_data, aes(x = date, ymin = lower95, ymax = upper95), fill = "#666018", alpha = 0.2) +
    geom_line(data = estimated_data, aes(x = date, y = point_estimate), linetype = 2, colour = "#8e9d98") +

The pseudo-forecast (the dashed line) is not very far from the truth, only overestimating the seasonal peaks, but the true line is within the 95% confidence interval, which is good!

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates or buy me an espresso.

Buy me an EspressoBuy me an Espresso