Text Manipulation

Statistical Computing, 36-350

Friday - June 5, 2019

Last time: Indexing and iteration

Part I

String basics

What are strings?

The simplest distinction: anything written between quotation marks is a string, and strings have class "character"

class("r")
## [1] "character"
class("Ben")
## [1] "character"

Why do we care about strings?

Whitespaces

Whitespaces count as characters and can be included in strings:

my_str <- "Dear Mr. Carnegie,\n\tThanks for the great school!\n\nSincerely, Ben"
my_str
## [1] "Dear Mr. Carnegie,\n\tThanks for the great school!\n\nSincerely, Ben"

Use cat() to print strings to the console, displaying whitespaces properly

cat(my_str)
## Dear Mr. Carnegie,
##  Thanks for the great school!
## 
## Sincerely, Ben

Vectors/matrices of strings

The character is a basic data type in R (like numeric or logical), so we can make vectors or matrices of strings, just as we would with numbers

str_vec <- c("Statistical", "Computing", "isn't that bad") # Collect 3 strings
str_vec # All elements of the vector
## [1] "Statistical"    "Computing"      "isn't that bad"
str_vec[3] # The 3rd element
## [1] "isn't that bad"
str_vec[-(1:2)] # All but the 1st and 2nd
## [1] "isn't that bad"
str_mat <- matrix("", 2, 3) # Build an empty 2 x 3 matrix
str_mat[1,] <- str_vec # Fill the 1st row with str_vec
str_mat[2,1:2] <- str_vec[1:2] # Fill the 2nd row, only entries 1 and 2, with
                              # those of str_vec
str_mat[2,3] <- "isn't a fad" # Fill the 2nd row, 3rd entry, with a new string
str_mat # All elements of the matrix
##      [,1]          [,2]        [,3]            
## [1,] "Statistical" "Computing" "isn't that bad"
## [2,] "Statistical" "Computing" "isn't a fad"
t(str_mat) # Transpose of the matrix
##      [,1]             [,2]         
## [1,] "Statistical"    "Statistical"
## [2,] "Computing"      "Computing"  
## [3,] "isn't that bad" "isn't a fad"

Converting other data types to strings

Easy! Make things into strings with as.character()

as.character(0.8)
## [1] "0.8"
as.character(0.8e+10)
## [1] "8e+09"
as.character(1:5)
## [1] "1" "2" "3" "4" "5"
as.character(TRUE)
## [1] "TRUE"

Converting strings to other data types

Not as easy! Depends on the given string, of course

as.numeric("0.5")
## [1] 0.5
as.numeric("0.5 ")
## [1] 0.5
as.numeric("0.5e-10")
## [1] 5e-11
as.numeric("Hi!")
## Warning: NAs introduced by coercion
## [1] NA
as.logical("True")
## [1] TRUE
as.logical("TRU")
## [1] NA

Number of characters

Use nchar() to count the number of characters in a string

nchar("coffee")
## [1] 6
nchar("code monkey")
## [1] 11
length("code monkey")
## [1] 1
length(c("coffee", "code monkey"))
## [1] 2
nchar(c("coffee", "code monkey")) # Vectorization!
## [1]  6 11

Part II

Substrings, splitting and combining strings

Getting a substring

Use substr() to grab a subsequence of characters from a string, called a substring

phrase <- "Give me a break"
substr(phrase, 1, 4)
## [1] "Give"
substr(phrase, nchar(phrase)-4, nchar(phrase))
## [1] "break"
substr(phrase, nchar(phrase)+1, nchar(phrase)+10)
## [1] ""

substr() vectorizes

Just like nchar(), and many other string functions

presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
substr(presidents, 1, 2) # Grab the first 2 letters from each
## [1] "Cl" "Bu" "Re" "Ca" "Fo"
substr(presidents, 1:5, 1:5) # Grab the 1st letter of the 1st, the 2nd of the 2nd, etc.
## [1] "C" "u" "a" "t" ""
substr(presidents, 1, 1:5) # Grab the first, first 2, first 3, etc.
## [1] "C"    "Bu"   "Rea"  "Cart" "Ford"
substr(presidents, nchar(presidents)-1, nchar(presidents)) # Grab the last 2 letters from each
## [1] "on" "sh" "an" "er" "rd"

Replace a substring

Can also use substr() to replace a character, or a substring

phrase
## [1] "Give me a break"
substr(phrase, 1, 1) <- "L"
phrase # "G" changed to "L"
## [1] "Live me a break"
substr(phrase, 1000, 1001) <- "R"
phrase # Nothing happened
## [1] "Live me a break"
substr(phrase, 1, 4) <- "Show"
phrase # "Live" changed to "Show"
## [1] "Show me a break"

Splitting a string

Use the strsplit() function to split a string into pieces, based on a keyword (the separator)

ingredients <- "chickpeas, tahini, olive oil, garlic, salt"
split_obj <- strsplit(ingredients, split=",")
split_obj
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"
class(split_obj)
## [1] "list"
length(split_obj)
## [1] 1

Note that the output is actually a list! (With just one element, which is a vector of strings)
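
Since we usually want the vector itself, a natural next step (a small sketch, not on the original slide) is to pull it out of the list with [[1]] and strip the leftover leading spaces with trimws():

split_vec <- split_obj[[1]] # Extract the character vector from the one-element list
trimws(split_vec) # Remove the leading/trailing whitespace left by splitting on ","
## [1] "chickpeas" "tahini"    "olive oil" "garlic"    "salt"

(Splitting with split=", " would avoid the leading spaces in the first place)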

strsplit() vectorizes

Just like nchar(), substr(), and many others

great_profs <- "Nugent, Tibshirani, Genovese, Rinaldo, Shalizi, Ventura"
favorite_cats <- "tiger, leopard, jaguar, lion"
split_list <- strsplit(c(ingredients, great_profs, favorite_cats), split=",")
split_list
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "Nugent"      " Tibshirani" " Genovese"   " Rinaldo"    " Shalizi"   
## [6] " Ventura"   
## 
## [[3]]
## [1] "tiger"    " leopard" " jaguar"  " lion"

Splitting character-by-character

The finest splitting you can do is character-by-character: use strsplit() with split=""

split_chars <- strsplit(ingredients, split = "")[[1]]
split_chars
##  [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i"
## [18] "," " " "o" "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l"
## [35] "i" "c" "," " " "s" "a" "l" "t"
length(split_chars)
## [1] 42
nchar(ingredients) # Matches the previous count
## [1] 42

Combining strings

Use the paste() function to join two (or more) strings into one, separated by a keyword

paste("Spider", "Man") # Default is to separate by " "
## [1] "Spider Man"
paste("Spider", "Man", sep = "-")
## [1] "Spider-Man"
paste("Spider", "Man", "does whatever", sep = ", ")
## [1] "Spider, Man, does whatever"

paste() vectorizes

Just like nchar(), substr(), strsplit(), etc. Seeing a theme yet?

presidents
## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"
paste(presidents, c("D", "R", "R", "D", "R"))
## [1] "Clinton D" "Bush R"    "Reagan R"  "Carter D"  "Ford R"
paste(presidents, c("D", "R")) # Notice the recycling (not historically accurate!)
## [1] "Clinton D" "Bush R"    "Reagan D"  "Carter R"  "Ford D"
paste(presidents, " (", 42:38, ")", sep="")
## [1] "Clinton (42)" "Bush (41)"    "Reagan (40)"  "Carter (39)" 
## [5] "Ford (38)"

Condensing a vector of strings

Can condense a vector of strings into one big string by using paste() with the collapse argument

presidents
## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"
paste(presidents, collapse = "; ")
## [1] "Clinton; Bush; Reagan; Carter; Ford"
paste(presidents, " (", 42:38, ")", sep="", collapse = "; ")
## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"
paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep = "", 
      collapse = "; ")
## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"
paste(presidents, collapse=NULL) # No condensing, the default
## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"

Part III

Reading in text, summarizing text

Text from the outside

How do we get text from an external source into R? Use the readLines() function

endgame_lines <- readLines("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt")
class(endgame_lines) # We have a character vector
## [1] "character"
length(endgame_lines) # Many lines (elements)!
## [1] 3242
endgame_lines[1:6] # First 6 lines
## [1] "."                                                                                                                                                                                                                                                                  
## [2] " [The screen first panels up to an arrow being nocked into a bow. The archer behind firmly grips it tight as it was aiming towards the target. The camera reveals Clint Barton holding up a few arrows while mentoring his daughter, Lila Barton, on shooting one.]"
## [3] ""                                                                                                                                                                                                                                                                   
## [4] "CLINT BARTON : Okay, hold on. Don't shoot. You see where you're going?"                                                                                                                                                                                             
## [5] ""                                                                                                                                                                                                                                                                   
## [6] "LILA BARTON : Mhm. "

Reading from a local file

We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from my (Ben’s) computer:

endgame_lines_2 <- readLines("~/Documents/CMU/third_summer/36-350-summer-data/Week1/endgame.txt")

This will cause an error for you, unless your folders are set up exactly like those on the instructor's laptop! So using web links is more robust
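
If you do want to read from a local file, one defensive pattern (a small sketch, not from the slides; the path below is just a placeholder) is to check that the file exists first, and fall back to the web copy otherwise:

# Hypothetical local path -- change this to wherever the file lives on your machine
local_path <- "~/Documents/endgame.txt"
if (file.exists(local_path)) {
  endgame_lines_2 <- readLines(local_path)
} else {
  # Fall back to the copy hosted on GitHub
  endgame_lines_2 <- readLines("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt")
}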

Reconstitution

Fancy word, but all it means: make one long string, then split the words

endgame_text <- paste(endgame_lines, collapse = " ")
endgame_words <- strsplit(endgame_text, split = " ")[[1]]

# Sanity check
substr(endgame_text, 4, 200)
## [1] "[The screen first panels up to an arrow being nocked into a bow. The archer behind firmly grips it tight as it was aiming towards the target. The camera reveals Clint Barton holding up a few arrows"
endgame_words[3:20]
##  [1] "[The"   "screen" "first"  "panels" "up"     "to"     "an"    
##  [8] "arrow"  "being"  "nocked" "into"   "a"      "bow."   "The"   
## [15] "archer" "behind" "firmly" "grips"

Basic preprocessing

The most basic steps we'll do are:

- Convert words to lower case
- Remove numbers and punctuation
- Remove "empty" strings

# Convert to lower case
endgame_words <- tolower(endgame_words)

# Removing numbers and punctuation (Don't worry if this doesn't make sense)
endgame_words <- gsub("[[:punct:]]|[[:digit:]]", "", endgame_words)
endgame_words <- gsub("[[:space:]]", "", endgame_words) # removing spaces

# Remove empty strings
endgame_words <- endgame_words[endgame_words != ""]
endgame_words[1:20]
##  [1] "the"    "screen" "first"  "panels" "up"     "to"     "an"    
##  [8] "arrow"  "being"  "nocked" "into"   "a"      "bow"    "the"   
## [15] "archer" "behind" "firmly" "grips"  "it"     "tight"

Notice that, before this preprocessing, punctuation mattered: "bow." and "bow" would have counted as different words. That's not ideal for us; we'll learn just a little bit about how to fix this in lab/homework, using regular expressions
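
As a tiny preview (an illustrative sketch, not part of the original slides): the pattern [[:punct:]] used above matches any punctuation character, so gsub() can strip it out

c("bow.", "bow") == "bow" # Without cleaning, these are different strings
## [1] FALSE  TRUE
gsub("[[:punct:]]", "", c("bow.", "it's", "Hulk!")) # Strip all punctuation
## [1] "bow"  "its"  "Hulk"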

Counting words

Our most basic tool for summarizing text: word counts, retrieved using table()

endgame_wordtab <- table(endgame_words)
class(endgame_wordtab)
## [1] "table"
length(endgame_wordtab)
## [1] 3594
endgame_wordtab[1:10]
## endgame_words
##              a       aaaaahhh      abandoned           able         aboard 
##            493              1              4              2              1 
##          about          above         abrupt absentmindedly       absolute 
##             51              1              1              2              1

What did we get? Alphabetically sorted unique words, and their counts = number of appearances

The names are words, the entries are counts

Note: this is actually a vector of numbers, and the words are the names of the vector

endgame_wordtab[1:5]
## endgame_words
##         a  aaaaahhh abandoned      able    aboard 
##       493         1         4         2         1
endgame_wordtab[3] == 4
## abandoned 
##      TRUE
names(endgame_wordtab)[3] == "abandoned"
## [1] TRUE

So with named indexing, we can now use this to look up whatever words we want

endgame_wordtab["avengers"] 
## avengers 
##       33
endgame_wordtab["thanos"] 
## thanos 
##    106
endgame_wordtab["happy"]
## happy 
##    11
endgame_wordtab["sad"] # I guess we should classify this movie as a joyful one (at least less sad than Infinity War)
## <NA> 
##   NA
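
If we just want a yes/no answer about whether a word shows up at all, one option (a small sketch, not on the original slide) is to check against the names of the table:

"sad" %in% names(endgame_wordtab) # Never appears, hence the NA above
## [1] FALSE
"thanos" %in% names(endgame_wordtab)
## [1] TRUE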

Most frequent words

Let’s sort in decreasing order, to get the most frequent words

endgame_wordtab_sorted <- sort(endgame_wordtab, decreasing = TRUE)
length(endgame_wordtab_sorted)
## [1] 3594
head(endgame_wordtab_sorted, 20) # First 20
## endgame_words
##   the    to     a   and   you     i    of    in  tony    it    we    on 
##   891   573   493   454   417   289   285   275   260   242   196   194 
##    is   his  that steve    he stark scott bruce 
##   184   178   173   170   167   160   143   142
tail(endgame_wordtab_sorted, 20) # Last 20
## endgame_words
##                                                                   young 
##                                                                       1 
##                                                                 younger 
##                                                                       1 
##                                                                   yours 
##                                                                       1 
##                                                              youseveral 
##                                                                       1 
##                                                                 youwell 
##                                                                       1 
##                                                                youyoure 
##                                                                       1 
##                                                                     zip 
##                                                                       1 
##                                                                     zol 
##                                                                       1 
##                                                                    zola 
##                                                                       1 
##                                                                    zone 
##                                                                       1 
##                                                                   zooms 
##                                                                       1 
##                          あなたは生き残った惑星の半分はそうではなかった 
##                                                                       1 
##                                                           くれromanized 
##                                                                       1 
##                                                       ことですromanized 
##                                                                       1 
##                                                      それはあきひこさん 
##                                                                       1 
##                                                                たすけて 
##                                                                       1 
## なぜあなたはこれをやっている私たちはあなたに何もしませんでしたromanized 
##                                                                       1 
##                                                            分かってるね 
##                                                                       1 
##                                                彼らはサノスを手に入れた 
##                                                                       1 
##                                                                    死ね 
##                                                                       1
endgame_lines[1068] # sanity check
## [1] "AKIHIKO : たすけて くれ! (Romanized: Tasukete kure! (English: Wait! Help me! I'll give you anything! What do you want?)"

Visualizing frequencies

Let’s use a plot to visualize frequencies

nw <- length(endgame_wordtab_sorted)
plot(1:nw, as.numeric(endgame_wordtab_sorted), type="l",
     xlab="Rank", ylab="Frequency")

A pretty drastic looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)
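
One quick sanity check for a power law (a small sketch, not in the original slides): if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\), then \(\log(\mathrm{Frequency})\) is roughly linear in \(\log(\mathrm{Rank})\) with slope \(-a\), so the curve should look close to a straight line on log-log axes

# Power laws show up as (roughly) straight lines on a log-log scale
plot(1:nw, as.numeric(endgame_wordtab_sorted), type = "l", log = "xy",
     xlab = "Rank (log scale)", ylab = "Frequency (log scale)")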

Zipf’s law

This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law

For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=1012\) and \(a=0.73\)

C <- 1012; a <- 0.73
endgame_wordtab_zipf <- C*(1/1:nw)^a
cbind(endgame_wordtab_sorted[1:10], endgame_wordtab_zipf[1:10])
##      [,1]      [,2]
## the   891 1012.0000
## to    573  610.1388
## a     493  453.8183
## and   454  367.8550
## you   417  312.5593
## i     289  273.6088
## of    285  244.4888
## in    275  221.7812
## tony  260  203.5089
## it    242  188.4432

Not perfect, but not bad. We can also plot the original sorted word counts, with those estimated by our power-law formula on top

plot(1:nw, as.numeric(endgame_wordtab_sorted), type = "l",
     xlab = "Rank", ylab = "Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)
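
Where might \(C=1012\) and \(a=0.73\) come from? One simple way to get rough values (a sketch, not necessarily how the slide's numbers were chosen) is a linear regression on the log-log scale, since \(\log(\mathrm{Frequency}) \approx \log C - a \log(\mathrm{Rank})\)

log_rank <- log(1:nw)
log_freq <- log(as.numeric(endgame_wordtab_sorted))
fit <- lm(log_freq ~ log_rank) # Fit log(Frequency) on log(Rank)
exp(coef(fit)[1]) # Intercept on the log scale gives an estimate of C
-coef(fit)[2] # Negative slope gives an estimate of a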

We’ll learn about plotting tools in detail a bit later

Summary