
Manipulating strings with the {stringr} package

This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for free here. This is taken from Chapter 4, in which I introduce the {stringr} package.

Manipulate strings with {stringr}

{stringr} contains functions to manipulate strings. In Chapter 10, I will teach you about regular expressions, but the functions contained in {stringr} allow you to already do a lot of work on strings, without needing to be a regular expression expert.

I will discuss the most common string operations: detecting, locating, matching, searching and replacing, and extracting/removing strings.

To introduce these operations, let us use an ALTO file of an issue of The Winchester News from October 31, 1910, which you can find at this link (to see what the newspaper looked like, click here). I re-hosted the file in a public gist for archiving purposes. While working on the book, the original site went down several times…

ALTO is an XML schema for the description of text OCR and layout information of pages for digitized material, such as newspapers (source: ALTO Wikipedia page). For more details, you can read my blog post on the matter, but for our current purposes, it is enough to know that the file contains the text of newspaper articles. The file looks like this:

<TextLine HEIGHT="138.0" WIDTH="2434.0" HPOS="4056.0" VPOS="5814.0">
<String STYLEREFS="ID7" HEIGHT="108.0" WIDTH="393.0" HPOS="4056.0" VPOS="5838.0" CONTENT="timore" WC="0.82539684">
<ALTERNATIVE>timole</ALTERNATIVE>
<ALTERNATIVE>tlnldre</ALTERNATIVE>
<ALTERNATIVE>timor</ALTERNATIVE>
<ALTERNATIVE>insole</ALTERNATIVE>
<ALTERNATIVE>landed</ALTERNATIVE>
</String>
<SP WIDTH="74.0" HPOS="4449.0" VPOS="5838.0"/>
<String STYLEREFS="ID7" HEIGHT="105.0" WIDTH="432.0" HPOS="4524.0" VPOS="5847.0" CONTENT="market" WC="0.95238096"/>
<SP WIDTH="116.0" HPOS="4956.0" VPOS="5847.0"/>
<String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="138.0" HPOS="5073.0" VPOS="5883.0" CONTENT="as" WC="0.96825397"/>
<SP WIDTH="74.0" HPOS="5211.0" VPOS="5883.0"/>
<String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="285.0" HPOS="5286.0" VPOS="5877.0" CONTENT="were" WC="1.0">
<ALTERNATIVE>verc</ALTERNATIVE>
<ALTERNATIVE>veer</ALTERNATIVE>
</String>
<SP WIDTH="68.0" HPOS="5571.0" VPOS="5877.0"/>
<String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="147.0" HPOS="5640.0" VPOS="5838.0" CONTENT="all" WC="1.0"/>
<SP WIDTH="83.0" HPOS="5787.0" VPOS="5838.0"/>
<String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="183.0" HPOS="5871.0" VPOS="5835.0" CONTENT="the" WC="0.95238096">
<ALTERNATIVE>tll</ALTERNATIVE>
<ALTERNATIVE>Cu</ALTERNATIVE>
<ALTERNATIVE>tall</ALTERNATIVE>
</String>
<SP WIDTH="75.0" HPOS="6054.0" VPOS="5835.0"/>
<String STYLEREFS="ID3" HEIGHT="132.0" WIDTH="351.0" HPOS="6129.0" VPOS="5814.0" CONTENT="cattle" WC="0.95238096"/>
</TextLine>

We are interested in the strings after CONTENT=, and we are going to use functions from the {stringr} package to get them. In Chapter 10, we are going to explore this file again, using more complex regular expressions to get all the content in one go.

Getting text data into RStudio

First of all, let us read in the file:

winchester <- read_lines("https://gist.githubusercontent.com/b-rodrigues/5139560e7d0f2ecebe5da1df3629e015/raw/e3031d894ffb97217ddbad1ade1b307c9937d2c8/gistfile1.txt")

Even though the file is an XML file, I read it in with read_lines() and not read_xml() from the {xml2} package. This is for the purposes of the current exercise, and also because I always have trouble with XML files, so I prefer to treat them as simple text files and use regular expressions to get what I need.
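For completeness, here is a sketch of how {xml2} could pull the CONTENT attributes out of a clean ALTO document directly. The snippet below is made up, and real ALTO files declare an XML namespace that the XPath query would have to take into account:

```r
library(xml2)

# A tiny, made-up ALTO-like snippet (real ALTO files use a namespace,
# which xml_find_all() would need to handle, e.g. via xml_ns())
alto <- '<TextLine><String CONTENT="timore"/><String CONTENT="market"/></TextLine>'

doc <- read_xml(alto)

# Find every <String> node and read its CONTENT attribute
xml_attr(xml_find_all(doc, "//String"), "CONTENT")
```

In our case, though, the re-hosted file is a Google cache of the page rather than clean XML, which is one more reason to fall back on plain text and regular expressions.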

Now that the ALTO file is read in and saved in the winchester variable, you might want to print the whole thing in the console. Before that, take a look at the structure:

str(winchester)
##  chr [1:43] "" ...

So the winchester variable is a character atomic vector with 43 elements. First, we need to understand what these elements are. Let’s start with the first one:

winchester[1]
## [1] ""

Ok, so it seems like the first element is part of the header of the file. What about the second one?

winchester[2]
## [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"><base href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f5f5f5;font:13px arial,sans-serif;text-align:left;color:#202020;border:0;margin:0;border-bottom:1px solid #cecece;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#609}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style><div id=\"bN015htcoyT__google-cache-hdr\"><div><span>This is Google's cache of <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml</a>.</span>&nbsp;<span>It is a snapshot of the page as it appeared on 21 Jan 2019 05:18:18 GMT.</span>&nbsp;<span>The <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">current page</a> could have changed in the meantime.</span>&nbsp;<a href=\"http://support.google.com/websearch/bin/answer.py?hl=en&amp;p=cached&amp;answer=1687222\"><span>Learn more</span>.</a></div><div><span style=\"display:inline-block;margin-top:8px;margin-right:104px;white-space:nowrap\"><span style=\"margin-right:28px\"><span style=\"font-weight:bold\">Full version</span></span><span style=\"margin-right:28px\"><a 
href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=1&amp;vwsrc=0\"><span>Text-only version</span></a></span><span style=\"margin-right:28px\"><a href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=0&amp;vwsrc=1\"><span>View source</span></a></span></span></div><span style=\"display:inline-block;margin-top:8px;color:#717171\"><span>Tip: To quickly find your search term on this page, press <b>Ctrl+F</b> or <b>⌘-F</b> (Mac) and use the find bar.</span></span></div><div style=\"position:relative;\"><?xml version=\"1.0\" encoding=\"UTF-8\"?>"

Same. So where is the content? The file is very large, so if you print it in the console, it will take quite some time, and you will not really be able to make anything out. The best way is to detect the string CONTENT and work from there.

Detecting, getting the position and locating strings

When confronted with an atomic vector of strings, you might want to know inside which elements you can find certain strings. For example, to know which elements of winchester contain the string CONTENT, use str_detect():

winchester %>%
  str_detect("CONTENT")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

This returns a boolean atomic vector of the same length as winchester. If the string CONTENT is nowhere to be found in an element, the result equals FALSE; otherwise it equals TRUE. Here it is easy to see that the last element contains the string CONTENT. But what if, instead of having 43 elements, the vector had 24192 elements, hundreds of which contained the string CONTENT? It would be easier to have the indices of the elements where one can find the word CONTENT. This is possible with str_which():

winchester %>%
  str_which("CONTENT")
## [1] 43

Here, the result is 43, meaning that the 43rd element of winchester contains the string CONTENT somewhere. If we need more precision, we can use str_locate() and str_locate_all(). To explain how both these functions work, let’s create a very small example:

ancient_philosophers <- c("aristotle", "plato", "epictetus", "seneca the younger", "epicurus", "marcus aurelius")

Now suppose I am interested in philosophers whose name ends in us. Let us use str_locate() first:

ancient_philosophers %>%
  str_locate("us")
##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]     8   9
## [4,]    NA  NA
## [5,]     7   8
## [6,]     5   6

You can interpret the result as follows: each row corresponds to an element of the vector, so the 3rd, 5th and 6th philosophers have us somewhere in their name. The result also has two columns, start and end, which give the position of the string. So the string us can be found starting at position 8 of the 3rd element of the vector, and ends at position 9. The same goes for the other philosophers. However, consider Marcus Aurelius. He has two names, both ending with us, yet str_locate() only shows the position of the us in Marcus.
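The start and end positions returned by str_locate() can be fed to str_sub() to pull the matched substring back out; a quick sketch:

```r
library(stringr)

# Locate "us" inside "epicurus": it starts at position 7 and ends at 8
pos <- str_locate("epicurus", "us")

# Extract the substring using those positions
str_sub("epicurus", pos[1, "start"], pos[1, "end"])
# [1] "us"
```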

To get both us strings, you need to use str_locate_all():

ancient_philosophers %>%
  str_locate_all("us")
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]     8   9
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## [1,]     7   8
## 
## [[6]]
##      start end
## [1,]     5   6
## [2,]    14  15

Now we get the positions of the two us in Marcus Aurelius. Doing this on the winchester vector would give us the positions of the CONTENT strings, but this is not really important right now. What matters is that you know how str_locate() and str_locate_all() work.

So now that we know what interests us in the 43rd element of winchester, let’s take a closer look at it:

winchester[43]

As you can see, it’s a mess:

<TextLine HEIGHT=\"126.0\" WIDTH=\"1731.0\" HPOS=\"17160.0\" VPOS=\"21252.0\"><String HEIGHT=\"114.0\" WIDTH=\"354.0\" HPOS=\"17160.0\" VPOS=\"21264.0\" CONTENT=\"0tV\" WC=\"0.8095238\"/><SP WIDTH=\"131.0\" HPOS=\"17514.0\" VPOS=\"21264.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"111.0\" WIDTH=\"474.0\" HPOS=\"17646.0\" VPOS=\"21258.0\" CONTENT=\"BATES\" WC=\"1.0\"/><SP WIDTH=\"140.0\" HPOS=\"18120.0\" VPOS=\"21258.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"114.0\" WIDTH=\"630.0\" HPOS=\"18261.0\" VPOS=\"21252.0\" CONTENT=\"President\" WC=\"1.0\"><ALTERNATIVE>Prcideht</ALTERNATIVE><ALTERNATIVE>Pride</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"153.0\" WIDTH=\"1689.0\" HPOS=\"17145.0\" VPOS=\"21417.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"258.0\" HPOS=\"17145.0\" VPOS=\"21439.0\" CONTENT=\"WM\" WC=\"0.82539684\"><TextLine HEIGHT=\"120.0\" WIDTH=\"2211.0\" HPOS=\"16788.0\" VPOS=\"21870.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"102.0\" HPOS=\"16788.0\" VPOS=\"21894.0\" CONTENT=\"It\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"16890.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"93.0\" HPOS=\"16962.0\" VPOS=\"21885.0\" CONTENT=\"is\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17055.0\" VPOS=\"21885.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"102.0\" WIDTH=\"417.0\" HPOS=\"17136.0\" VPOS=\"21879.0\" CONTENT=\"seldom\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17553.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"267.0\" HPOS=\"17634.0\" VPOS=\"21873.0\" CONTENT=\"hard\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"17901.0\" VPOS=\"21873.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"87.0\" WIDTH=\"111.0\" HPOS=\"17982.0\" VPOS=\"21879.0\" CONTENT=\"to\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"18093.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"219.0\" HPOS=\"18174.0\" VPOS=\"21870.0\" CONTENT=\"find\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18393.0\" VPOS=\"21870.0\"/><String STYLEREFS=\"ID7\" 
HEIGHT=\"69.0\" WIDTH=\"66.0\" HPOS=\"18471.0\" VPOS=\"21894.0\" CONTENT=\"a\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18537.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"78.0\" WIDTH=\"384.0\" HPOS=\"18615.0\" VPOS=\"21888.0\" CONTENT=\"succes\" WC=\"0.82539684\"><ALTERNATIVE>success</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"126.0\" WIDTH=\"2316.0\" HPOS=\"16662.0\" VPOS=\"22008.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"75.0\" WIDTH=\"183.0\" HPOS=\"16662.0\" VPOS=\"22059.0\" CONTENT=\"sor\" WC=\"1.0\"><ALTERNATIVE>soar</ALTERNATIVE></String><SP WIDTH=\"72.0\" HPOS=\"16845.0\" VPOS=\"22059.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"90.0\" WIDTH=\"168.0\" HPOS=\"16917.0\" VPOS=\"22035.0\" CONTENT=\"for\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"17085.0\" VPOS=\"22035.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"69.0\" WIDTH=\"267.0\" HPOS=\"17157.0\" VPOS=\"22050.0\" CONTENT=\"even\" WC=\"1.0\"><ALTERNATIVE>cen</ALTERNATIVE><ALTERNATIVE>cent</ALTERNATIVE></String><SP WIDTH=\"77.0\" HPOS=\"17434.0\" VPOS=\"22050.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"66.0\" WIDTH=\"63.0\" HPOS=\"17502.0\" VPOS=\"22044.0\"

The file was imported without any newlines. So we need to insert them ourselves, by splitting the string in a clever way.

Splitting strings

There are two functions included in {stringr} to split strings, str_split() and str_split_fixed(). Let’s go back to our ancient philosophers. Two of them, Seneca the Younger and Marcus Aurelius have something else in common than both being Roman Stoic philosophers. Their names are composed of several words. If we want to split their names at the space character, we can use str_split() like this:

ancient_philosophers %>%
  str_split(" ")
## [[1]]
## [1] "aristotle"
## 
## [[2]]
## [1] "plato"
## 
## [[3]]
## [1] "epictetus"
## 
## [[4]]
## [1] "seneca"  "the"     "younger"
## 
## [[5]]
## [1] "epicurus"
## 
## [[6]]
## [1] "marcus"   "aurelius"

str_split() also has a simplify = TRUE option:

ancient_philosophers %>%
  str_split(" ", simplify = TRUE)
##      [,1]        [,2]       [,3]     
## [1,] "aristotle" ""         ""       
## [2,] "plato"     ""         ""       
## [3,] "epictetus" ""         ""       
## [4,] "seneca"    "the"      "younger"
## [5,] "epicurus"  ""         ""       
## [6,] "marcus"    "aurelius" ""

This time, the returned object is a matrix.

What about str_split_fixed()? The difference is that here you can specify the number of pieces to return. For example, you could consider “aurelius” to be the middle name of Marcus Aurelius, and “the younger” to be the middle name of Seneca the Younger. This means that you would want to split the name only at the first space character, and not at all of them. This is easily achieved with str_split_fixed():

ancient_philosophers %>%
  str_split_fixed(" ", 2)
##      [,1]        [,2]         
## [1,] "aristotle" ""           
## [2,] "plato"     ""           
## [3,] "epictetus" ""           
## [4,] "seneca"    "the younger"
## [5,] "epicurus"  ""           
## [6,] "marcus"    "aurelius"

This gives the expected result.

So how does this help in our case? Well, if you look at the ALTO file at the beginning of this section, you will notice that every line ends with the “>” character. So let’s split at that character!

winchester_text <- winchester[43] %>%
  str_split(">")

Let’s take a closer look at winchester_text:

str(winchester_text)
## List of 1
##  $ : chr [1:19706] "</processingStepSettings" "<processingSoftware" "<softwareCreator" "iArchives</softwareCreator" ...

So this is a list of length one, and the first, and only, element of that list is an atomic vector with 19706 elements. Since this is a list of only one element, we can simplify it by saving the atomic vector in a variable:

winchester_text <- winchester_text[[1]]

Let’s now look at some lines:

winchester_text[1232:1245]
##  [1] "<SP WIDTH=\"66.0\" HPOS=\"5763.0\" VPOS=\"9696.0\"/"                                                                         
##  [2] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"612.0\" HPOS=\"5829.0\" VPOS=\"9693.0\" CONTENT=\"Louisville\" WC=\"1.0\""
##  [3] "<ALTERNATIVE"                                                                                                                
##  [4] "Loniile</ALTERNATIVE"                                                                                                        
##  [5] "<ALTERNATIVE"                                                                                                                
##  [6] "Lenities</ALTERNATIVE"                                                                                                       
##  [7] "</String"                                                                                                                    
##  [8] "</TextLine"                                                                                                                  
##  [9] "<TextLine HEIGHT=\"150.0\" WIDTH=\"2520.0\" HPOS=\"4032.0\" VPOS=\"9849.0\""                                                 
## [10] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"510.0\" HPOS=\"4032.0\" VPOS=\"9861.0\" CONTENT=\"Tobacco\" WC=\"1.0\"/"  
## [11] "<SP WIDTH=\"113.0\" HPOS=\"4542.0\" VPOS=\"9861.0\"/"                                                                        
## [12] "<String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"696.0\" HPOS=\"4656.0\" VPOS=\"9861.0\" CONTENT=\"Warehouse\" WC=\"1.0\"" 
## [13] "<ALTERNATIVE"                                                                                                                
## [14] "WHrchons</ALTERNATIVE"

This now looks easier to handle. We can narrow it down to only the lines that contain the string we are interested in, “CONTENT”. First, let’s get the indices:

content_winchester_index <- winchester_text %>%
  str_which("CONTENT")

How many lines contain the string “CONTENT”?

length(content_winchester_index)
## [1] 4462

As you can see, this reduces the amount of data we have to work with. Let us save this in a new variable:

content_winchester <- winchester_text[content_winchester_index]

Matching strings

Matching strings is useful, but only in combination with regular expressions. As stated at the beginning of this section, we are going to learn about regular expressions in Chapter 10, but in order to make this section useful, we are going to learn the easiest, but perhaps the most useful regular expression: .*.

Let’s go back to our ancient philosophers, and use str_match() and see what happens. Let’s match the “us” string:

ancient_philosophers %>%
  str_match("us")
##      [,1]
## [1,] NA  
## [2,] NA  
## [3,] "us"
## [4,] NA  
## [5,] "us"
## [6,] "us"

Not very useful, but what about the regular expression .*? How could it help?

ancient_philosophers %>%
  str_match(".*us")
##      [,1]             
## [1,] NA               
## [2,] NA               
## [3,] "epictetus"      
## [4,] NA               
## [5,] "epicurus"       
## [6,] "marcus aurelius"

That’s already very interesting! So how does .* work? To understand, let’s start by using . alone:

ancient_philosophers %>%
  str_match(".us")
##      [,1] 
## [1,] NA   
## [2,] NA   
## [3,] "tus"
## [4,] NA   
## [5,] "rus"
## [6,] "cus"

This also matched whatever symbol comes just before the “u” from “us”. What if we use two . instead?

ancient_philosophers %>%
  str_match("..us")
##      [,1]  
## [1,] NA    
## [2,] NA    
## [3,] "etus"
## [4,] NA    
## [5,] "urus"
## [6,] "rcus"

This time, we get the two symbols that immediately precede “us”. Instead of continuing like this, we can use *, which matches the preceding symbol zero or more times. So by combining . and *, we can match any symbol repeatedly, until there is nothing more to match. Note that there is also +, which works like *, but matches one or more symbols.
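A quick way to see the difference between * and +:

```r
library(stringr)

# .* also matches zero characters before "us", so "us" itself matches
str_match(c("us", "cus"), ".*us")
#      [,1]
# [1,] "us"
# [2,] "cus"

# .+ requires at least one character before "us", so "us" alone does not match
str_match(c("us", "cus"), ".+us")
#      [,1]
# [1,] NA
# [2,] "cus"
```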

There is also a str_match_all():

ancient_philosophers %>%
  str_match_all(".*us")
## [[1]]
##      [,1]
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]       
## [1,] "epictetus"
## 
## [[4]]
##      [,1]
## 
## [[5]]
##      [,1]      
## [1,] "epicurus"
## 
## [[6]]
##      [,1]             
## [1,] "marcus aurelius"

In this particular case it does not change the end result, but keep it in mind for cases like this one:

c("haha", "huhu") %>%
  str_match("ha")
##      [,1]
## [1,] "ha"
## [2,] NA

and:

c("haha", "huhu") %>%
  str_match_all("ha")
## [[1]]
##      [,1]
## [1,] "ha"
## [2,] "ha"
## 
## [[2]]
##      [,1]

What if we want to match names containing the letter “t”? Easy:

ancient_philosophers %>%
  str_match(".*t.*")
##      [,1]                
## [1,] "aristotle"         
## [2,] "plato"             
## [3,] "epictetus"         
## [4,] "seneca the younger"
## [5,] NA                  
## [6,] NA

So how does this help us with our historical newspaper? Let’s try to get the strings that come after “CONTENT”:

winchester_content <- winchester_text %>%
  str_match("CONTENT.*")

Let’s use our faithful str() function to take a look:

winchester_content %>%
  str
##  chr [1:19706, 1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...

Hum, there’s a lot of NA values! This is because a lot of the lines from the file did not contain the string “CONTENT”, so there is no match possible. Let us remove all these NAs. Because the result is a matrix, we cannot use the filter() function from {dplyr}, so we need to convert it to a tibble first:

winchester_content <- winchester_content %>%
  as_tibble() %>% # as.tibble() is deprecated in favour of as_tibble()
  filter(!is.na(V1))

Because matrix columns do not have names, when a matrix gets converted into a tibble, the first column automatically gets called V1. This is why I filter on this column. Let’s take a look at the data:

head(winchester_content)
## # A tibble: 6 x 1
##   V1                                  
##   <chr>                               
## 1 "CONTENT=\"J\" WC=\"0.8095238\"/"   
## 2 "CONTENT=\"a\" WC=\"0.8095238\"/"   
## 3 "CONTENT=\"Ira\" WC=\"0.95238096\"/"
## 4 "CONTENT=\"mj\" WC=\"0.8095238\"/"  
## 5 "CONTENT=\"iI\" WC=\"0.8095238\"/"  
## 6 "CONTENT=\"tE1r\" WC=\"0.8095238\"/"
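As an aside, the tibble conversion is not strictly necessary just to drop the NAs: since str_match() returns a one-column character matrix, base subsetting works too. A minimal sketch on made-up lines:

```r
library(stringr)

lines <- c('<String CONTENT="cattle" WC="1.0"/', '<SP WIDTH="75.0"/')

m <- str_match(lines, "CONTENT.*")

# Keep only the rows where a match was found
m[!is.na(m[, 1]), 1]
# [1] "CONTENT=\"cattle\" WC=\"1.0\"/"
```

Sticking with a tibble, as above, keeps the rest of the pipeline inside {dplyr}, which is why I do it that way here.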

Searching and replacing strings

We are getting close to the final result. We still need to do some cleaning however. Since our data is inside a nice tibble, we might as well stick with it. So let’s first rename the column and change all the strings to lowercase:

winchester_content <- winchester_content %>% 
  mutate(content = tolower(V1)) %>% 
  select(-V1)

Let’s take a look at the result:

head(winchester_content)
## # A tibble: 6 x 1
##   content                             
##   <chr>                               
## 1 "content=\"j\" wc=\"0.8095238\"/"   
## 2 "content=\"a\" wc=\"0.8095238\"/"   
## 3 "content=\"ira\" wc=\"0.95238096\"/"
## 4 "content=\"mj\" wc=\"0.8095238\"/"  
## 5 "content=\"ii\" wc=\"0.8095238\"/"  
## 6 "content=\"te1r\" wc=\"0.8095238\"/"

The second part of the string, “wc=….” is not really interesting. Let’s search and replace this with an empty string, using str_replace():

winchester_content <- winchester_content %>% 
  mutate(content = str_replace(content, "wc.*", ""))

head(winchester_content)
## # A tibble: 6 x 1
##   content            
##   <chr>              
## 1 "content=\"j\" "   
## 2 "content=\"a\" "   
## 3 "content=\"ira\" " 
## 4 "content=\"mj\" "  
## 5 "content=\"ii\" "  
## 6 "content=\"te1r\" "

Here we used the regular expression from before to replace “wc” and every character that follows it. The same approach can be used to remove “content=”:

winchester_content <- winchester_content %>% 
  mutate(content = str_replace(content, "content=", ""))

head(winchester_content)
## # A tibble: 6 x 1
##   content    
##   <chr>      
## 1 "\"j\" "   
## 2 "\"a\" "   
## 3 "\"ira\" " 
## 4 "\"mj\" "  
## 5 "\"ii\" "  
## 6 "\"te1r\" "

We are almost done, but some cleaning is still necessary:

Extracting or removing strings

Now, because I know the ALTO spec, I know how to find words that are split between two lines:

winchester_content %>% 
  filter(str_detect(content, "hyppart"))
## # A tibble: 64 x 1
##    content                                                               
##    <chr>                                                                 
##  1 "\"aver\" subs_type=\"hyppart1\" subs_content=\"average\" "           
##  2 "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" "            
##  3 "\"considera\" subs_type=\"hyppart1\" subs_content=\"consideration\" "
##  4 "\"tion\" subs_type=\"hyppart2\" subs_content=\"consideration\" "     
##  5 "\"re\" subs_type=\"hyppart1\" subs_content=\"resigned\" "            
##  6 "\"signed\" subs_type=\"hyppart2\" subs_content=\"resigned\" "        
##  7 "\"install\" subs_type=\"hyppart1\" subs_content=\"installed\" "      
##  8 "\"ed\" subs_type=\"hyppart2\" subs_content=\"installed\" "           
##  9 "\"be\" subs_type=\"hyppart1\" subs_content=\"before\" "              
## 10 "\"fore\" subs_type=\"hyppart2\" subs_content=\"before\" "            
## # … with 54 more rows

For instance, the word “average” was split over two lines: the first part of the word, “aver”, on the first line, and the second part, “age”, on the second line. We want to keep what comes after “subs_content”. However, because only some words were split between two lines, we first need to detect where the string “hyppart1” is located, and only then extract what comes after “subs_content”. Thus, we need to combine str_detect(), to first detect the string, with str_extract_all(), to extract everything from “content=” onwards:

winchester_content <- winchester_content %>% 
  mutate(content = if_else(str_detect(content, "hyppart1"), 
                           str_extract_all(content, "content=.*", simplify = TRUE), 
                           content))

Let’s take a look at the result:

winchester_content %>% 
  filter(str_detect(content, "content"))
## # A tibble: 64 x 1
##    content                                                          
##    <chr>                                                            
##  1 "content=\"average\" "                                           
##  2 "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" "       
##  3 "content=\"consideration\" "                                     
##  4 "\"tion\" subs_type=\"hyppart2\" subs_content=\"consideration\" "
##  5 "content=\"resigned\" "                                          
##  6 "\"signed\" subs_type=\"hyppart2\" subs_content=\"resigned\" "   
##  7 "content=\"installed\" "                                         
##  8 "\"ed\" subs_type=\"hyppart2\" subs_content=\"installed\" "      
##  9 "content=\"before\" "                                            
## 10 "\"fore\" subs_type=\"hyppart2\" subs_content=\"before\" "       
## # … with 54 more rows

We still need to get rid of the string “content=” and then of all the strings that contain “hyppart2”, which are not needed now:

winchester_content <- winchester_content %>% 
  mutate(content = str_replace(content, "content=", "")) %>% 
  mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content))

head(winchester_content)
## # A tibble: 6 x 1
##   content    
##   <chr>      
## 1 "\"j\" "   
## 2 "\"a\" "   
## 3 "\"ira\" " 
## 4 "\"mj\" "  
## 5 "\"ii\" "  
## 6 "\"te1r\" "

Almost done! We only need to remove the " characters:

winchester_content <- winchester_content %>% 
  mutate(content = str_replace_all(content, "\"", "")) 

head(winchester_content)
## # A tibble: 6 x 1
##   content
##   <chr>  
## 1 "j "   
## 2 "a "   
## 3 "ira " 
## 4 "mj "  
## 5 "ii "  
## 6 "te1r "

Let’s remove leading and trailing whitespace with str_trim():

winchester_content <- winchester_content %>% 
  mutate(content = str_trim(content)) 

head(winchester_content)
## # A tibble: 6 x 1
##   content
##   <chr>  
## 1 j      
## 2 a      
## 3 ira    
## 4 mj     
## 5 ii     
## 6 te1r

To finish off this section, let’s remove stop words (words that do not add any meaning to a sentence, such as “as”, “and”…) and words composed of three characters or fewer. You can find a dataset with stopwords inside the {stopwords} package:

library(stopwords)

data(data_stopwords_stopwordsiso)

eng_stopwords <- tibble("content" = data_stopwords_stopwordsiso$en)

winchester_content <- winchester_content %>% 
  anti_join(eng_stopwords) %>% 
  filter(nchar(content) > 3)
## Joining, by = "content"
head(winchester_content)
## # A tibble: 6 x 1
##   content   
##   <chr>     
## 1 te1r      
## 2 jilas     
## 3 edition   
## 4 winchester
## 5 news      
## 6 injuries

That’s it for this section! You now know how to work with strings, but in Chapter 10 we are going one step further by learning about regular expressions, which offer much more power.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.
