About Me Blog
A tutorial on tidy cross-validation with R Analyzing NetHack data, part 1: What kills the players Analyzing NetHack data, part 2: What players kill the most Building a shiny app to explore historical newspapers: a step-by-step guide Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1 Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2 Curly-Curly, the successor of Bang-Bang Dealing with heteroskedasticity; regression with robust standard errors using R Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer Fast food, causality and R packages, part 1 Fast food, causality and R packages, part 2 For posterity: install {xml2} on GNU/Linux distros Forecasting my weight with R From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick} Getting data from pdfs using the pdftools package Getting the data from the Luxembourguish elections out of Excel Going from a human readable Excel file to a machine-readable csv with {tidyxl} Historical newspaper scraping with {tesseract} and R How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data Imputing missing values in parallel using {furrr} Intermittent demand, Croston and Die Hard Looking into 19th century ads from a Luxembourguish newspaper with R Making sense of the METS and ALTO XML standards Manipulate dates easily with {lubridate} Manipulating strings with the {stringr} package Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data Missing data imputation and instrumental variables regression: the tidy approach Modern R with the tidyverse is available on Leanpub Objects types and some useful R functions for beginners Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()` R or Python? Why not both? Using Anaconda Python within R with {reticulate} Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach Some fun with {gganimate} Split-apply-combine for Maximum Likelihood Estimation of a linear model Statistical matching, or when one single data source is not enough The best way to visit Luxembourguish castles is doing data science + combinatorial optimization The never-ending editor war (?) The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century Using a genetic algorithm for the hyperparameter optimization of a SARIMA model Using cosine similarity to find matching documents: a tutorial using Seneca's letters to his friend Lucilius Using linear models with binary dependent variables, a simulation study Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods What hyper-parameters are, and what to do with them; an illustration with ridge regression {disk.frame} is epic {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

Cluster multiple time series using K-means

I have been recently confronted to the issue of finding similarities among time-series and though about using k-means to cluster them. To illustrate the method, I’ll be using data from the Penn World Tables, readily available in R (inside the {pwt9} package):

library(tidyverse)
library(lubridate)
library(pwt9)
library(brotools)

First, of all, let’s only select the needed columns:

pwt <- pwt9.0 %>%
select(country, year, avh)

avh contains the average worked hours for a given country and year. The data looks like this:

head(pwt)
##          country year avh
## ABW-1950   Aruba 1950  NA
## ABW-1951   Aruba 1951  NA
## ABW-1952   Aruba 1952  NA
## ABW-1953   Aruba 1953  NA
## ABW-1954   Aruba 1954  NA
## ABW-1955   Aruba 1955  NA

For each country, there’s yearly data on the avh variable. The goal here is to cluster the different countries by looking at how similar they are on the avh variable. Let’s do some further cleaning. The k-means implementation in R expects a wide data frame (currently my data frame is in the long format) and no missing values. These could potentially be imputed, but I can’t be bothered:

pwt_wide <- pwt %>%
  pivot_wider(names_from = year, values_from = avh)  %>%
  filter(!is.na(`1950`)) %>%
  mutate_at(vars(-country), as.numeric)

To convert my data frame from long to wide, I use the fresh pivot_wider() function, instead of the less intuitive spread() function.

We’re ready to use the k-means algorithm. To know how many clusters I should aim for, I’ll be using the elbow method (if you’re not familiar with this method, click on the image at the very top of this post):

wss <- map_dbl(1:5, ~{kmeans(select(pwt_wide, -country), ., nstart=50,iter.max = 15 )$tot.withinss})

n_clust <- 1:5

elbow_df <- as.data.frame(cbind("n_clust" = n_clust, "wss" = wss))


ggplot(elbow_df) +
geom_line(aes(y = wss, x = n_clust), colour = "#82518c") +
theme_blog()

Looks like 3 clusters is a good choice. Let’s now run the kmeans algorithm:

clusters <- kmeans(select(pwt_wide, -country), centers = 3)

clusters is a list with several interesting items. The item centers contains the “average” time series:

(centers <- rownames_to_column(as.data.frame(clusters$centers), "cluster"))
##   cluster     1950     1951     1952     1953     1954     1955     1956
## 1       1 2110.440 2101.273 2088.947 2074.273 2066.617 2053.391 2034.926
## 2       2 2086.509 2088.571 2084.433 2081.939 2078.756 2078.710 2074.175
## 3       3 2363.600 2350.774 2338.032 2325.375 2319.011 2312.083 2308.483
##       1957     1958     1959     1960     1961     1962     1963     1964
## 1 2021.855 2007.221 1995.038 1985.904 1978.024 1971.618 1963.780 1962.983
## 2 2068.807 2062.021 2063.687 2060.176 2052.070 2044.812 2038.939 2037.488
## 3 2301.355 2294.556 2287.556 2279.773 2272.899 2262.781 2255.690 2253.431
##       1965     1966     1967     1968     1969     1970     1971     1972
## 1 1952.945 1946.961 1928.445 1908.354 1887.624 1872.864 1855.165 1825.759
## 2 2027.958 2021.615 2015.523 2007.176 2001.289 1981.906 1967.323 1961.269
## 3 2242.775 2237.216 2228.943 2217.717 2207.037 2190.452 2178.955 2167.124
##       1973     1974     1975     1976     1977     1978     1979     1980
## 1 1801.370 1770.484 1737.071 1738.214 1713.395 1693.575 1684.215 1676.703
## 2 1956.755 1951.066 1933.527 1926.508 1920.668 1911.488 1904.316 1897.103
## 3 2156.304 2137.286 2125.298 2118.138 2104.382 2089.717 2083.036 2069.678
##       1981     1982     1983     1984     1985     1986     1987     1988
## 1 1658.894 1644.019 1636.909 1632.371 1623.901 1615.320 1603.383 1604.331
## 2 1883.376 1874.730 1867.266 1861.386 1856.947 1849.568 1848.748 1847.690
## 3 2055.658 2045.501 2041.428 2030.095 2040.210 2033.289 2028.345 2029.290
##       1989     1990     1991     1992     1993     1994     1995     1996
## 1 1593.225 1586.975 1573.084 1576.331 1569.725 1567.599 1567.113 1558.274
## 2 1842.079 1831.907 1823.552 1815.864 1823.824 1830.623 1831.815 1831.648
## 3 2031.741 2029.786 1991.807 1974.954 1973.737 1975.667 1980.278 1988.728
##       1997     1998     1999     2000     2001     2002     2003     2004
## 1 1555.079 1555.071 1557.103 1545.349 1530.207 1514.251 1509.647 1522.389
## 2 1835.372 1836.030 1839.857 1827.264 1813.477 1781.696 1786.047 1781.858
## 3 1985.076 1961.219 1966.310 1959.219 1946.954 1940.110 1924.799 1917.130
##       2005     2006     2007     2008     2009     2010     2011     2012
## 1 1514.492 1512.872 1515.299 1514.055 1493.875 1499.563 1503.049 1493.862
## 2 1775.167 1776.759 1773.587 1771.648 1734.559 1736.098 1742.143 1735.396
## 3 1923.496 1912.956 1902.156 1897.550 1858.657 1861.875 1861.608 1850.802
##       2013     2014
## 1 1485.589 1486.991
## 2 1729.973 1729.543
## 3 1848.158 1851.829

clusters also contains the cluster item, which tells me to which cluster the different countries belong to. I can easily add this to the original data frame:

pwt_wide <- pwt_wide %>% 
  mutate(cluster = clusters$cluster)

Now, let’s prepare the data for visualisation. I have to go back to a long data frame for this:

pwt_long <- pwt_wide %>%
  pivot_longer(cols=c(-country, -cluster), names_to = "year", values_to = "avh") %>%
  mutate(year = ymd(paste0(year, "-01-01")))

centers_long <- centers %>%
  pivot_longer(cols = -cluster, names_to = "year", values_to = "avh") %>%  
  mutate(year = ymd(paste0(year, "-01-01")))

And I can now plot the different time series, by cluster and highlight the “average” time series for each cluster as well (yellow line):

ggplot() +
  geom_line(data = pwt_long, aes(y = avh, x = year, group = country), colour = "#82518c") +
  facet_wrap(~cluster, nrow = 1) + 
  geom_line(data = centers_long, aes(y = avh, x = year, group = cluster), col = "#b58900", size = 2) +
  theme_blog() +
  labs(title = "Average hours worked in several countries", 
       caption = "The different time series have been clustered using k-means.
                 Cluster 1: Belgium, Switzerland, Germany, Denmark, France, Luxembourg, Netherlands,
                 Norway, Sweden.\nCluster 2: Australia, Colombia, Ireland, Iceland, Japan, Mexico,
                 Portugal, Turkey.\nCluster 3: Argentina, Austria, Brazil, Canada, Cyprus, Spain, Finland,
                 UK, Italy, New Zealand, Peru, USA, Venezuela") +
  theme(plot.caption = element_text(colour = "white"))

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub

Buy me an EspressoBuy me an Espresso