
Using cosine similarity to find matching documents: a tutorial using Seneca's letters to his friend Lucilius

Lately I’ve been interested in clustering documents and finding similar documents based on their contents. In this blog post, I will use Seneca’s Moral letters to Lucilius and compute the pairwise cosine similarity of his 124 letters. The cosine similarity between two vectors measures how similar they are: a cosine similarity of 1 means that the angle between the two vectors is 0, and thus both vectors point in the same direction. Seneca’s Moral letters to Lucilius deal mostly with philosophical topics, as Seneca was, among many other things, a philosopher of the stoic school. The stoic school of philosophy is quite interesting, but it has unfortunately been misunderstood, especially in modern times. There is now a renewed interest in this school; see Modern Stoicism.
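To make this concrete, here is a minimal sketch in base R of what the cosine similarity computes, on two made-up vectors (this is just an illustration, not part of the analysis below):

a <- c(1, 2, 3)
b <- c(2, 4, 6) # b is a rescaled version of a, so it points in the same direction

# dot product of the vectors divided by the product of their Euclidean norms
cosine_similarity <- function(x, y){
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_similarity(a, b)
## [1] 1
cosine_similarity(a, c(3, -1, 0))
## [1] 0.08451543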

The first step is to scrape the letters. The code below scrapes the letters and saves them into a list. I start by writing a function that gets the raw text. Note the xpath argument of the html_nodes() function: I obtained this complex expression by using the SelectorGadget extension for Google Chrome and then selecting the right element of the web page. See this screenshot if my description was not very clear.

Then, the extract_text() function extracts the text from the letter. The only line that might be a bit complex is discard(~`==`(., "")), which removes every empty line; it is simply the == operator called in prefix form inside a purrr-style lambda.
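Here is that idiom in isolation, on a toy character vector:

library(purrr)

discard(c("first line", "", "second line"), ~`==`(., ""))
## [1] "first line"  "second line"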

Finally, there’s the get_letter() function that actually gets a letter by calling the first two functions. In the last line, I get all the letters into a list by mapping get_letter() over the base url and the letter numbers with map2().

library(tidyverse)
library(rvest)

base_url <- "https://en.wikisource.org/wiki/Moral_letters_to_Lucilius/Letter_"

letter_numbers <- seq(1, 124)

letter_urls <- paste0(base_url, letter_numbers)

get_raw_text <- function(base_url, letter_number){
  # build the letter's url, download the page, and extract the main
  # content node as one block of raw text
  paste0(base_url, letter_number) %>%
    read_html() %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "mw-parser-output", " " ))]') %>%
    html_text()
}


extract_text <- function(raw_text, letter_number){
  # split the raw text into lines and remove the empty ones
  raw_text <- raw_text %>%
    str_split("\n") %>%
    flatten_chr() %>%
    discard(~`==`(., ""))

  # the letter's body starts at the fifth line; the lines before it
  # contain the page's header material
  start <- 5

  # the body ends right before the "Footnotes" section
  end <- str_which(raw_text, "Footnotes*")

  # keep only the body and strip footnote markers such as [1],
  # as well as the [edit] links added by the wiki
  raw_text[start:(end-1)] %>%
    str_remove_all("\\[\\d{1,}\\]") %>%
    str_remove_all("\\[edit\\]")
}

get_letter <- function(base_url, letter_number){
  # download a letter and return its cleaned-up text
  raw_text <- get_raw_text(base_url, letter_number)

  extract_text(raw_text, letter_number)
}

letters_to_lucilius <- map2(base_url, letter_numbers, get_letter)

Now that we have the letters saved in a list, we need to process the text a little bit. In order to compute the cosine similarity between the letters, I need to somehow represent them as vectors. There are several ways of doing this, and I am going to compute the tf-idf of each letter. The tf-idf gives me a vector for each letter, with zero and non-zero values. A word that appears in every letter gets a tf-idf of zero, because such a word does not help distinguish one letter from another; a word that is absent from a given letter also contributes a zero to that letter's vector. Non-zero values correspond to words that appear in the letter at hand but not in all the others. I expect that letters that discuss death, for example, will contain the word death, and letters that do not discuss death will not. The word death thus has what I call predictive power, in that it helps us distinguish the letters discussing death from the others. The same reasoning applies to any topic.
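Here is a minimal sketch of this idea on a made-up two-letter corpus (the titles, words and counts are invented for illustration; bind_tf_idf() is used for real further below):

library(tidytext)

toy_counts <- tribble(
  ~title,     ~word,    ~n,
  "letter_a", "death",  3,
  "letter_a", "virtue", 1,
  "letter_b", "virtue", 2
)

toy_counts %>%
  bind_tf_idf(word, title, n)

Since "virtue" appears in both letters, its idf is log(2/2) = 0 and so its tf-idf is 0; "death" appears only in letter_a, so its idf is log(2/1) and its tf-idf is positive.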

So, to get the tf-idf of each letter, I first need to put the letters in a tidy dataset. I will use the {tidytext} package for this. First, I load the required packages, convert each letter to a dataframe with a single column that contains the text, and save the letters' titles into another list:

library(tidytext)
library(SnowballC)
library(stopwords)
library(text2vec)

letters_to_lucilius_df <- map(letters_to_lucilius, ~tibble("text" = .))

# the first line of each letter is its title; pull() extracts it
# from each one-row, one-column tibble as a character string
letter_titles <- letters_to_lucilius_df %>%
  map(~slice(., 1)) %>%
  map(pull)

Now, I add each title to its dataframe as a new column called title, and drop the title row from the text itself:

letters_to_lucilius_df <-  map2(.x = letters_to_lucilius_df, .y = letter_titles,
                                ~mutate(.x, title = .y)) %>%
  map(~slice(., -1))

I can now use unnest_tokens() to transform the datasets. Before, the whole text of each letter was in a single column; after using unnest_tokens(), I have a dataset with one row per word. This makes it easy to compute frequencies by letter, or, what I am interested in, the tf-idf of each letter:

tokenized_letters <- letters_to_lucilius_df %>%
  bind_rows() %>%
  group_by(title) %>%
  unnest_tokens(word, text)
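To see the reshaping on a toy example (made-up data, not from the letters):

tibble(title = "some letter", text = "an example sentence") %>%
  unnest_tokens(word, text)
## # A tibble: 3 x 2
##   title       word
##   <chr>       <chr>
## 1 some letter an
## 2 some letter example
## 3 some letter sentence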

I can now remove stopwords, using the data contained in the {stopwords} package, and also drop any token that contains digits:

stopwords_en <- tibble("word" = stopwords("en", source  = "smart"))

tokenized_letters <- tokenized_letters %>%
  anti_join(stopwords_en) %>%
  filter(!str_detect(word, "\\d{1,}"))
## Joining, by = "word"

Next step: word stemming, meaning going from "dogs" to "dog". If you do not do word stemming, "dogs" and "dog" will be counted as two different words, even though they refer to the same thing. Note that stemming only chops off word endings; mapping "was" to "be" would be lemmatization, which is a different (and more involved) technique. wordStem() is a function from {SnowballC}.

tokenized_letters <- tokenized_letters %>%
  mutate(word = wordStem(word, language = "en"))
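A quick illustration of what the stemmer does, on a toy vector:

wordStem(c("dogs", "running"), language = "en")
## [1] "dog" "run"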

Finally, I can compute the tf-idf of each letter and cast the term frequencies into a sparse matrix:

tfidf_letters <- tokenized_letters %>%
  count(title, word, sort  = TRUE) %>%
  bind_tf_idf(word, title, n)

sparse_matrix <- tfidf_letters %>%
  cast_sparse(title, word, tf)
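Note that I cast the tf column here, so the letter vectors contain term frequencies. If you would rather fill the matrix with the tf-idf values themselves, the same call with the tf_idf column should work (an untested variant, shown only as a pointer):

sparse_matrix_tfidf <- tfidf_letters %>%
  cast_sparse(title, word, tf_idf)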

Let’s take a look at the sparse matrix:

sparse_matrix[1:10, 1:4]
## 10 x 4 sparse Matrix of class "dgCMatrix"
##                                                                   thing
## CXIII. On the Vitality of the Soul and Its Attributes       0.084835631
## LXVI. On Various Aspects of Virtue                          0.017079890
## LXXXVII. Some Arguments in Favour of the Simple Life        0.014534884
## CXVII. On Real Ethics as Superior to Syllogistic Subtleties 0.025919732
## LXXVI. On Learning Wisdom in Old Age                        0.021588946
## CII. On the Intimations of Our Immortality                  0.014662757
## CXXIV. On the True Good as Attained by Reason               0.010139417
## XCIV. On the Value of Advice                                0.009266409
## LXXXI. On Benefits                                          0.007705479
## LXXXV. On Some Vain Syllogisms                              0.013254786
##                                                                     live
## CXIII. On the Vitality of the Soul and Its Attributes       0.0837751856
## LXVI. On Various Aspects of Virtue                          .           
## LXXXVII. Some Arguments in Favour of the Simple Life        0.0007267442
## CXVII. On Real Ethics as Superior to Syllogistic Subtleties 0.0050167224
## LXXVI. On Learning Wisdom in Old Age                        0.0025906736
## CII. On the Intimations of Our Immortality                  0.0019550342
## CXXIV. On the True Good as Attained by Reason               .           
## XCIV. On the Value of Advice                                0.0023166023
## LXXXI. On Benefits                                          0.0008561644
## LXXXV. On Some Vain Syllogisms                              0.0022091311
##                                                                   good
## CXIII. On the Vitality of the Soul and Its Attributes       0.01166490
## LXVI. On Various Aspects of Virtue                          0.04132231
## LXXXVII. Some Arguments in Favour of the Simple Life        0.04578488
## CXVII. On Real Ethics as Superior to Syllogistic Subtleties 0.04849498
## LXXVI. On Learning Wisdom in Old Age                        0.04663212
## CII. On the Intimations of Our Immortality                  0.05180841
## CXXIV. On the True Good as Attained by Reason               0.06717364
## XCIV. On the Value of Advice                                0.01081081
## LXXXI. On Benefits                                          0.01626712
## LXXXV. On Some Vain Syllogisms                              0.01472754
##                                                                 precept
## CXIII. On the Vitality of the Soul and Its Attributes       .          
## LXVI. On Various Aspects of Virtue                          .          
## LXXXVII. Some Arguments in Favour of the Simple Life        .          
## CXVII. On Real Ethics as Superior to Syllogistic Subtleties .          
## LXXVI. On Learning Wisdom in Old Age                        .          
## CII. On the Intimations of Our Immortality                  .          
## CXXIV. On the True Good as Attained by Reason               0.001267427
## XCIV. On the Value of Advice                                0.020463320
## LXXXI. On Benefits                                          .          
## LXXXV. On Some Vain Syllogisms                              .

We can consider each row of this matrix as the vector representing a letter, and thus compute the cosine similarity between letters. For this, I am using the sim2() function from the {text2vec} package. I then create the get_similar_letters() function that returns similar letters for a given reference letter:

similarities <- sim2(sparse_matrix, method = "cosine", norm = "l2") 

get_similar_letters <- function(similarities, reference_letter, n_recommendations = 3){
  # sort the reference letter's row of the similarity matrix in decreasing
  # order; the first entry is the reference letter itself (similarity of 1)
  sort(similarities[reference_letter, ], decreasing = TRUE)[1:(2 + n_recommendations)]
}
get_similar_letters(similarities, 19)
##          XXX. On Conquering the Conqueror 
##                                 1.0000000 
##                  XXIV. On Despising Death 
##                                 0.6781600 
##      LXXXII. On the Natural Fear of Death 
##                                 0.6639736 
## LXX. On the Proper Time to Slip the Cable 
##                                 0.5981706 
## LXXVIII. On the Healing Power of the Mind 
##                                 0.4709679
get_similar_letters(similarities, 99)
##                              LXI. On Meeting Death Cheerfully 
##                                                     1.0000000 
##                     LXX. On the Proper Time to Slip the Cable 
##                                                     0.5005015 
## XCIII. On the Quality, as Contrasted with the Length, of Life 
##                                                     0.4631796 
##                         CI. On the Futility of Planning Ahead 
##                                                     0.4503093 
##                              LXXVII. On Taking One's Own Life 
##                                                     0.4147019
get_similar_letters(similarities, 32)
##                                    LIX. On Pleasure and Joy 
##                                                   1.0000000 
##          XXIII. On the True Joy which Comes from Philosophy 
##                                                   0.4743672 
##                          CIX. On the Fellowship of Wise Men 
##                                                   0.4526835 
## XC. On the Part Played by Philosophy in the Progress of Man 
##                                                   0.4498278 
##         CXXIII. On the Conflict between Pleasure and Virtue 
##                                                   0.4469312
get_similar_letters(similarities, 101)
##                    X. On Living to Oneself 
##                                  1.0000000 
##          LXXIII. On Philosophers and Kings 
##                                  0.3842292 
##                  XLI. On the God within Us 
##                                  0.3465457 
##                       XXXI. On Siren Songs 
##                                  0.3451388 
## XCV. On the Usefulness of Basic Principles 
##                                  0.3302794
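As a sanity check, we can recompute one of the similarities above by hand: the cosine similarity is just the dot product of two rows of the sparse matrix divided by the product of their Euclidean norms. This is a quick verification sketch (the row name is copied from the output above, and the helper is the same as at the top of this post):

cosine_similarity <- function(x, y){
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_similarity(sparse_matrix[19, ],
                  sparse_matrix["XXIV. On Despising Death", ])
## should match (up to rounding) the 0.6781600 printed above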

As we can see from these examples, this seems to be working quite well: the first title is the title of the reference letter itself, while the following ones are the suggested letters. The problem is that my matrix is not in the right order, and thus reference letter 19 does not correspond to letter 19 of Seneca… I have to correct that, but not today.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.
