About Me Blog
A tutorial on tidy cross-validation with R Analyzing NetHack data, part 1: What kills the players Analyzing NetHack data, part 2: What players kill the most Building a shiny app to explore historical newspapers: a step-by-step guide Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1 Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2 Dealing with heteroskedasticity; regression with robust standard errors using R Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer Fast food, causality and R packages, part 1 Fast food, causality and R packages, part 2 For posterity: install {xml2} on GNU/Linux distros Forecasting my weight with R From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick} Getting data from pdfs using the pdftools package Getting the data from the Luxembourguish elections out of Excel Going from a human readable Excel file to a machine-readable csv with {tidyxl} Historical newspaper scraping with {tesseract} and R How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data Imputing missing values in parallel using {furrr} Looking into 19th century ads from a Luxembourguish newspaper with R Making sense of the METS and ALTO XML standards Manipulate dates easily with {lubridate} Manipulating strings with the {stringr} package Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data Missing data imputation and instrumental variables regression: the tidy approach Objects types and some useful R functions for beginners Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()` R or Python? Why not both? Using Anaconda Python within R with {reticulate} Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach Some fun with {gganimate} The best way to visit Luxembourguish castles is doing data science + combinatorial optimization The never-ending editor war (?) The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century Using a genetic algorithm for the hyperparameter optimization of a SARIMA model Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods What hyper-parameters are, and what to do with them; an illustration with ridge regression {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1

Can I get enough of historical newspapers data? Seems like I don’t. I already wrote four (1, 2, 3 and 4) blog posts, but there’s still a lot to explore. This blog post uses a new batch of data announced on twitter:

and this data could not have arrived at a better moment, since something else got announced via Twitter recently:

I wanted to try using Vowpal Wabbit for a couple of weeks now because it seems to be the perfect tool for when you’re dealing with what I call big-ish data: data that is not big data, and might fit in your RAM, but is still a PITA to deal with. It can be data that is large enough to take 30 seconds to be imported into R, and then every operation on it lasts for minutes, and estimating/training a model on it might eat up all your RAM. Vowpal Wabbit avoids all this because it’s an online-learning system. Vowpal Wabbit is capable of training a model with data that it sees on the fly, which means VW can be used for real-time machine learning, but also for when the training data is very large. Each row of the data gets streamed into VW which updates the estimated parameters of the model (or weights) in real time. So no need to first import all the data into R!

The goal of this blog post is to get started with VW, and build a very simple logistic model to classify documents using the historical newspapers data from the National Library of Luxembourg, which you can download here (scroll down and download the Text Analysis Pack). The goal is not to build the best model, but a model. Several steps are needed for this: prepare the data, install VW and train a model using {RVowpalWabbit}.

Step 1: Preparing the data

The data is in a neat .xml format, and extracting what I need will be easy. However, the input format for VW is a bit unusual; it resembles .psv files (Pipe Separated Values) but allows for more flexibility. I will not dwell much into it, but for our purposes, the file must look like this:

1 | this is the first observation, which in our case will be free text
2 | this is another observation, its label, or class, equals 2
4 | this is another observation, of class 4

The first column, before the “|” is the target class we want to predict, and the second column contains free text.

The raw data looks like this:

Click if you want to see the raw data

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2019-02-28T11:13:01</responseDate>
<request>http://www.eluxemburgensia.lu/OAI</request>
<ListRecords>
<record>
<header>
<identifier>digitool-publish:3026998-DTL45</identifier>
<datestamp>2019-02-28T11:13:01Z</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcterms="http://purl.org/dc/terms/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:identifier>
https://persist.lu/ark:/70795/6gq1q1/articles/DTL45
</dc:identifier>
<dc:source>newspaper/indeplux/1871-12-29_01</dc:source>
<dcterms:isPartOf>L'indépendance luxembourgeoise</dcterms:isPartOf>
<dcterms:isReferencedBy>
issue:newspaper/indeplux/1871-12-29_01/article:DTL45
</dcterms:isReferencedBy>
<dc:date>1871-12-29</dc:date>
<dc:publisher>Jean Joris</dc:publisher>
<dc:relation>3026998</dc:relation>
<dcterms:hasVersion>
http://www.eluxemburgensia.lu/webclient/DeliveryManager?pid=3026998#panel:pp|issue:3026998|article:DTL45
</dcterms:hasVersion>
<dc:description>
CONSEIL COMMUNAL de la ville de Luxembourg. Séance du 23 décembre 1871. (Suite.) Art. 6. Glacière communale. M. le Bourgmcstr ¦ . Le collège échevinal propose un autro mode de se procurer de la glace. Nous avons dépensé 250 fr. cha- que année pour distribuer 30 kilos do glace; c’est une trop forte somme pour un résultat si minime. Nous aurions voulu nous aboucher avec des fabricants de bière ou autres industriels qui nous auraient fourni de la glace en cas de besoin. L’architecte qui été chargé de passer un contrat, a été trouver des négociants, mais ses démarches n’ont pas abouti. 
</dc:description>
<dc:title>
CONSEIL COMMUNAL de la ville de Luxembourg. Séance du 23 décembre 1871. (Suite.)
</dc:title>
<dc:type>ARTICLE</dc:type>
<dc:language>fr</dc:language>
<dcterms:extent>863</dcterms:extent>
</oai_dc:dc>
</metadata>
</record>
</ListRecords>
</OAI-PMH>

I need several things from this file:

  • The title of the newspaper: <dcterms:isPartOf>L'indépendance luxembourgeoise</dcterms:isPartOf>
  • The type of the article: <dc:type>ARTICLE</dc:type>. Can be Article, Advertisement, Issue, Section or Other.
  • The contents: <dc:description>CONSEIL COMMUNAL de la ville de Luxembourg. Séance du ....</dc:description>

I will only focus on newspapers in French, even though newspapers in German also had articles in French. This is because the tag <dc:language>fr</dc:language> is not always available. If it were, I could simply look for it and extract all the content in French easily, but unfortunately this is not the case.

First of all, let’s get the data into R:

library("tidyverse")
library("xml2")
library("furrr")

files <- list.files(path = "export01-newspapers1841-1878/", all.files = TRUE, recursive = TRUE)

This results in a character vector with the path to all the files:

head(files)
[1] "000/1400000/1400000-ADVERTISEMENT-DTL78.xml"   "000/1400000/1400000-ADVERTISEMENT-DTL79.xml"  
[3] "000/1400000/1400000-ADVERTISEMENT-DTL80.xml"   "000/1400000/1400000-ADVERTISEMENT-DTL81.xml"  
[5] "000/1400000/1400000-MODSMD_ARTICLE1-DTL34.xml" "000/1400000/1400000-MODSMD_ARTICLE2-DTL35.xml"

Now I write a function that does the needed data preparation steps. I describe what the function does in the comments inside:

to_vw <- function(xml_file){

    # read in the xml file
    file <- read_xml(paste0("export01-newspapers1841-1878/", xml_file))

    # Get the newspaper
    newspaper <- xml_find_all(file, ".//dcterms:isPartOf") %>% xml_text()

    # Only keep the newspapers written in French
    if(!(newspaper %in% c("L'UNION.",
                          "L'indépendance luxembourgeoise",
                          "COURRIER DU GRAND-DUCHÉ DE LUXEMBOURG.",
                          "JOURNAL DE LUXEMBOURG.",
                          "L'AVENIR",
                          "L’Arlequin",
                          "La Gazette du Grand-Duché de Luxembourg",
                          "L'AVENIR DE LUXEMBOURG",
                          "L'AVENIR DU GRAND-DUCHE DE LUXEMBOURG.",
                          "L'AVENIR DU GRAND-DUCHÉ DE LUXEMBOURG.",
                          "Le gratis luxembourgeois",
                          "Luxemburger Zeitung – Journal de Luxembourg",
                          "Recueil des mémoires et des travaux publiés par la Société de Botanique du Grand-Duché de Luxembourg"))){
        return(NULL)
    } else {
        # Get the type of the content. Can be article, advert, issue, section or other
        type <- xml_find_all(file, ".//dc:type") %>% xml_text()

        type <- case_when(type == "ARTICLE" ~ "1",
                          type == "ADVERTISEMENT" ~ "2",
                          type == "ISSUE" ~ "3",
                          type == "SECTION" ~ "4",
                          TRUE ~ "5"
        )

        # Get the content itself. Only keep alphanumeric characters, and remove any line returns or 
        # carriage returns
        description <- xml_find_all(file, ".//dc:description") %>%
            xml_text() %>%
            str_replace_all(pattern = "[^[:alnum:][:space:]]", "") %>%
            str_to_lower() %>%
            str_replace_all("\r?\n|\r|\n", " ")

        # Return the final object: one line that looks like this
        # 1 | bla bla
        paste(type, "|", description)
    }

}

I can now run this code to parse all the files, and I do so in parallel, thanks to the {furrr} package:

plan(multiprocess, workers = 12)

text_fr <- files %>%
    future_map(to_vw)

text_fr <- text_fr %>%
    discard(is.null)

write_lines(text_fr, "text_fr.txt")

Step 2: Install Vowpal Wabbit

To easiest way to install VW must be using Anaconda, and more specifically the conda package manager. Anaconda is a Python (and R) distribution for scientific computing and it comes with a package manager called conda which makes installing Python (or R) packages very easy. While VW is a standalone piece of software, it can also be installed by conda or pip. Instead of installing the full Anaconda distribution, you can install Miniconda, which only comes with the bare minimum: a Python executable and the conda package manager. You can find Miniconda here and once it’s installed, you can install VW with:

conda install -c gwerbin vowpal-wabbit 

It is also possible to install VW with pip, as detailed here, but in my experience, managing Python packages with pip is not super. It is better to manage your Python distribution through conda, because it creates environments in your home folder which are independent of the system’s Python installation, which is often out-of-date.

Step 3: Building a model

Vowpal Wabbit can be used from the command line, but there are interfaces for Python and since a few weeks, for R. The R interface is quite crude for now, as it’s still in very early stages. I’m sure it will evolve, and perhaps a Vowpal Wabbit engine will be added to {parsnip}, which would make modeling with VW really easy.

For now, let’s only use 10000 lines for prototyping purposes before running the model on the whole file. Because the data is quite large, I do not want to import it into R. So I use command line tools to manipulate this data directly from my hard drive:

# Prepare data
system2("shuf", args = "-n 10000 text_fr.txt > small.txt")

shuf is a Unix command, and as such the above code should work on GNU/Linux systems, and most likely macOS too. shuf generates random permutations of a given file to standard output. I use > to direct this output to another file, which I called small.txt. The -n 10000 options simply means that I want 10000 lines.

I then split this small file into a training and a testing set:

# Adapted from http://bitsearch.blogspot.com/2009/03/bash-script-to-split-train-and-test.html

# The command below counts the lines in small.txt. This is not really needed, since I know that the 
# file only has 10000 lines, but I kept it here for future reference
# notice the stdout = TRUE option. This is needed because the output simply gets shown in R's
# command line and does get saved into a variable.
nb_lines <- system2("cat", args = "small.txt | wc -l", stdout = TRUE)

system2("split", args = paste0("-l", as.numeric(nb_lines)*0.99, " small.txt data_split/"))

split is the Unix command that does the splitting. I keep 99% of the lines in the training set and 1% in the test set. This creates two files, aa and ab. I rename them using the mv Unix command:

system2("mv", args = "data_split/aa data_split/small_train.txt")
system2("mv", args = "data_split/ab data_split/small_test.txt")

Ok, now let’s run a model using the VW command line utility from R, using system2():

oaa_fit <- system2("~/miniconda3/bin/vw", args = "--oaa 5 -d data_split/small_train.txt -f small_oaa.model", stderr = TRUE)

I need to point system2() to the vw executable, and then add some options. --oaa stands for one against all and is a way of doing multiclass classification; first, one class gets classified by a logistic classifier against all the others, then the other class against all the others, then the other…. The 5 in the option means that there are 5 classes.

-d data_split/train.txt specifies the path to the training data. -f means “final regressor” and specifies where you want to save the trained model.

This is the output that get’s captured and saved into oaa_fit:

 [1] "final_regressor = oaa.model"                                             
 [2] "Num weight bits = 18"                                                    
 [3] "learning rate = 0.5"                                                     
 [4] "initial_t = 0"                                                           
 [5] "power_t = 0.5"                                                           
 [6] "using no cache"                                                          
 [7] "Reading datafile = data_split/train.txt"                                 
 [8] "num sources = 1"                                                         
 [9] "average  since         example        example  current  current  current"
[10] "loss     last          counter         weight    label  predict features"
[11] "1.000000 1.000000            1            1.0        3        1       87"
[12] "1.000000 1.000000            2            2.0        1        3     2951"
[13] "1.000000 1.000000            4            4.0        1        3      506"
[14] "0.625000 0.250000            8            8.0        1        1      262"
[15] "0.625000 0.625000           16           16.0        1        2      926"
[16] "0.500000 0.375000           32           32.0        4        1        3"
[17] "0.375000 0.250000           64           64.0        1        1      436"
[18] "0.296875 0.218750          128          128.0        2        2      277"
[19] "0.238281 0.179688          256          256.0        2        2      118"
[20] "0.158203 0.078125          512          512.0        2        2       61"
[21] "0.125000 0.091797         1024         1024.0        2        2      258"
[22] "0.096191 0.067383         2048         2048.0        1        1       45"
[23] "0.085205 0.074219         4096         4096.0        1        1      318"
[24] "0.076172 0.067139         8192         8192.0        2        1      523"
[25] ""                                                                        
[26] "finished run"                                                            
[27] "number of examples = 9900"                                               
[28] "weighted example sum = 9900.000000"                                      
[29] "weighted label sum = 0.000000"                                           
[30] "average loss = 0.073434"                                                 
[31] "total feature number = 4456798"  

Now, when I try to run the same model using RVowpalWabbit::vw() I get the following error:

oaa_class <- c("--oaa", "5",
               "-d", "data_split/small_train.txt",
               "-f", "vw_models/small_oaa.model")

result <- vw(oaa_class)
Error in Rvw(args) : unrecognised option '--oaa'

I think the problem might be because I installed Vowpal Wabbit using conda, and the package cannot find the executable. I’ll open an issue with reproducible code and we’ll see.

In any case, that’s it for now! In the next blog post, we’ll see how to get the accuracy of this very simple model, and see how to improve it!

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.

Buy me an EspressoBuy me an Espresso