Statistical Computing, 36-350
Friday - June 5, 2019
[ ] and [[ ]]
if(), else if(), else: standard conditionals
ifelse(): shortcut for using if() and else in combination
switch(): shortcut for using if(), else if(), and else in combination
for(), while(), repeat: standard loop constructs
for() loops, vectorization is your friend!
apply(): can also be very useful (we’ll see them later)
String basics
The simplest distinction:
Character: a symbol in a written language, like letters, numerals, punctuation, space, etc.
String: a sequence of characters bound together
## [1] "character"
## [1] "character"
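The code that produced the two outputs above is not shown; a minimal sketch, assuming two illustrative literals (the values and variable names are placeholders), shows that a single character and a longer string have the same class in R:

```r
# A single character and a multi-character string share one data type.
# Both literals are illustrative assumptions, not taken from the slides.
single_char <- "r"
a_string <- "Ben"
class(single_char)
class(a_string)
```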
Why do we care about strings?
Whitespaces count as characters and can be included in strings:
" " for space
"\n" for newline
"\t" for tab
## [1] "Dear Mr. Carnegie,\n\tThanks for the great school!\n\nSincerely, Ben"
Use cat() to print strings to the console, displaying whitespaces properly
## Dear Mr. Carnegie,
## Thanks for the great school!
##
## Sincerely, Ben
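The difference between printing and cat() can be sketched as follows. The string is the one shown in the output above; the variable name my_str is an assumption.

```r
# my_str is a hypothetical name; the string itself comes from the output above
my_str <- "Dear Mr. Carnegie,\n\tThanks for the great school!\n\nSincerely, Ben"
my_str       # Printing shows the escape sequences literally
cat(my_str)  # cat() renders the newlines and the tab
```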
The character is a basic data type in R (like numeric, or logical), so we can make vectors or matrices out of them, just like we would with numbers
str_vec <- c("Statistical", "Computing", "isn't that bad") # Collect 3 strings
str_vec # All elements of the vector
## [1] "Statistical" "Computing" "isn't that bad"
## [1] "isn't that bad"
## [1] "isn't that bad"
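The indexing calls behind the last two outputs are not shown. One plausible pair, offered as a sketch rather than the original code, uses a positive index and a negative index:

```r
str_vec <- c("Statistical", "Computing", "isn't that bad")
third <- str_vec[3]            # Just the 3rd element
all_but <- str_vec[-(1:2)]     # Everything except the 1st and 2nd elements
third
all_but
```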
str_mat <- matrix("", 2, 3) # Build an empty 2 x 3 matrix
str_mat[1,] <- str_vec # Fill the 1st row with str_vec
str_mat[2,1:2] <- str_vec[1:2] # Fill the 2nd row, only entries 1 and 2, with
# those of str_vec
str_mat[2,3] <- "isn't a fad" # Fill the 2nd row, 3rd entry, with a new string
str_mat # All elements of the matrix
## [,1] [,2] [,3]
## [1,] "Statistical" "Computing" "isn't that bad"
## [2,] "Statistical" "Computing" "isn't a fad"
## [,1] [,2]
## [1,] "Statistical" "Statistical"
## [2,] "Computing" "Computing"
## [3,] "isn't that bad" "isn't a fad"
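The 3 x 2 output above looks like the transpose of str_mat. A self-contained sketch, assuming that is what was computed:

```r
# Rebuild the 2 x 3 character matrix from above
str_vec <- c("Statistical", "Computing", "isn't that bad")
str_mat <- matrix("", 2, 3)
str_mat[1, ] <- str_vec
str_mat[2, ] <- c(str_vec[1:2], "isn't a fad")
str_mat_t <- t(str_mat)  # Transpose works on character matrices too
str_mat_t
```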
Easy! Make things into strings with as.character()
## [1] "0.8"
## [1] "8e+09"
## [1] "1" "2" "3" "4" "5"
## [1] "TRUE"
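The calls that produced these outputs are missing. The sketch below shows conversions of the same kind; the specific inputs are assumptions chosen to resemble the outputs.

```r
# Converting numbers, integer vectors, and logicals to strings
num_str <- as.character(0.8)
big_str <- as.character(0.8e+10)  # Large numbers may come out in scientific notation
vec_str <- as.character(1:5)      # Vectorized: one string per element
lgl_str <- as.character(TRUE)
num_str
big_str
vec_str
lgl_str
```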
Not as easy! Depends on the given string, of course
## [1] 0.5
## [1] 0.5
## [1] 5e-11
## Warning: NAs introduced by coercion
## [1] NA
## [1] TRUE
## [1] NA
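Going the other way, from strings to numbers or logicals, only works when the string has a recognizable form. The inputs below are assumptions picked to illustrate outcomes like those above, including the coercion warning.

```r
# Strings that parse cleanly convert; others become NA
n1 <- as.numeric("0.5")
n2 <- as.numeric("0.5 ")     # Trailing whitespace is tolerated
n3 <- as.numeric("0.5e-10")  # Scientific notation is understood
n4 <- as.numeric("Hi!")      # Not a number: NA, with a warning
b1 <- as.logical("True")     # A recognized spelling of TRUE
b2 <- as.logical("TRU")      # Not recognized: NA
c(n1, n2, n3, n4)
c(b1, b2)
```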
Use nchar() to count the number of characters in a string
## [1] 6
## [1] 11
## [1] 1
## [1] 2
## [1] 6 11
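A short sketch of nchar() versus length(), using two example strings that are assumptions chosen to have 6 and 11 characters:

```r
# nchar() counts characters; length() counts vector elements
s1 <- "coffee"
s2 <- "code monkey"
nchar(s1)
nchar(s2)
length(s2)          # One string, so the vector has length 1
length(c(s1, s2))   # Two strings
nchar(c(s1, s2))    # nchar() is vectorized
```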
Substrings, splitting and combining strings
Use substr() to grab a subsequence of characters from a string, called a substring
## [1] "Give"
## [1] "break"
## [1] ""
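The calls for these outputs are not shown. A sketch that yields the same three kinds of result, assuming the phrase "Give me a break" that appears later in the slides:

```r
phrase <- "Give me a break"
first4 <- substr(phrase, 1, 4)                               # First 4 characters
last5 <- substr(phrase, nchar(phrase) - 4, nchar(phrase))    # Last 5 characters
beyond <- substr(phrase, nchar(phrase) + 1, nchar(phrase) + 10)  # Past the end
first4
last5
beyond
```

Asking for positions past the end of the string returns an empty string rather than an error.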
substr() vectorizes
Just like nchar(), and many other string functions
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
substr(presidents, 1, 2) # Grab the first 2 letters from each
## [1] "Cl" "Bu" "Re" "Ca" "Fo"
## [1] "C" "u" "a" "t" ""
## [1] "C" "Bu" "Rea" "Cart" "Ford"
## [1] "on" "sh" "an" "er" "rd"
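The last three outputs can be reproduced by vectorizing over the start and stop arguments as well. The calls below are a reconstruction under that assumption, not necessarily the original code.

```r
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
diag_chars <- substr(presidents, 1:5, 1:5)  # i-th letter of the i-th name
prefixes <- substr(presidents, 1, 1:5)      # First i letters of the i-th name
last2 <- substr(presidents, nchar(presidents) - 1, nchar(presidents))  # Last 2 letters
diag_chars
prefixes
last2
```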
Can also use substr() to replace a character, or a substring
## [1] "Give me a break"
## [1] "Live me a break"
## [1] "Live me a break"
## [1] "Show me a break"
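Replacement uses substr() on the left-hand side of an assignment. The sequence below is a sketch that would give outputs like those above; the out-of-range step is an assumption about why the third output is unchanged.

```r
phrase <- "Give me a break"
substr(phrase, 1, 1) <- "L"        # Replace the first character
step1 <- phrase
substr(phrase, 1000, 1001) <- "R"  # Positions past the end: nothing changes
step2 <- phrase
substr(phrase, 1, 4) <- "Show"     # Replace a 4-character substring
step3 <- phrase
c(step1, step2, step3)
```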
Use the strsplit() function to split based on a keyword
ingredients <- "chickpeas, tahini, olive oil, garlic, salt"
split_obj <- strsplit(ingredients, split=",")
split_obj
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
## [1] "list"
## [1] 1
Note that the output is actually a list! (With just one element, which is a vector of strings)
strsplit() vectorizes
Just like nchar(), substr(), and the many others
great_profs <- "Nugent, Tibshirani, Genovese, Rinaldo, Shalizi, Ventura"
favorite_cats <- "tiger, leopard, jaguar, lion"
split_list <- strsplit(c(ingredients, great_profs, favorite_cats), split=",")
split_list
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
##
## [[2]]
## [1] "Nugent" " Tibshirani" " Genovese" " Rinaldo" " Shalizi"
## [6] " Ventura"
##
## [[3]]
## [1] "tiger" " leopard" " jaguar" " lion"
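Do you see why strsplit() needs to return a list now? Each input string can split into a different number of pieces, so the results cannot be packed into a single vector or matrix.
The finest splitting you can do is character-by-character: use strsplit() with split=""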
## [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i"
## [18] "," " " "o" "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l"
## [35] "i" "c" "," " " "s" "a" "l" "t"
## [1] 42
## [1] 42
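A sketch of the character-by-character split, assuming the ingredients string defined earlier. The two counts of 42 above are consistent with comparing the number of pieces to nchar().

```r
ingredients <- "chickpeas, tahini, olive oil, garlic, salt"
split_chars <- strsplit(ingredients, split = "")[[1]]  # [[1]] pulls the vector out of the list
split_chars
length(split_chars)
nchar(ingredients)   # Same number: one piece per character
```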
Use the paste() function to join two (or more) strings into one, separated by a keyword
## [1] "Spider Man"
## [1] "Spider-Man"
## [1] "Spider, Man, does whatever"
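The separator is controlled by the sep argument, which defaults to a single space. The calls below are a reconstruction that matches the outputs above:

```r
p1 <- paste("Spider", "Man")                               # Default separator is " "
p2 <- paste("Spider", "Man", sep = "-")
p3 <- paste("Spider", "Man", "does whatever", sep = ", ")  # More than two strings
c(p1, p2, p3)
```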
paste() vectorizes
Just like nchar(), substr(), strsplit(), etc. Seeing a theme yet?
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
## [1] "Clinton D" "Bush R" "Reagan R" "Carter D" "Ford R"
## [1] "Clinton D" "Bush R" "Reagan D" "Carter R" "Ford D"
## [1] "Clinton (42)" "Bush (41)" "Reagan (40)" "Carter (39)"
## [5] "Ford (38)"
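These outputs are consistent with pasting vectors element by element, with shorter vectors recycled. A sketch under that assumption (the party and number vectors are inferred from the outputs):

```r
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
with_party <- paste(presidents, c("D", "R", "R", "D", "R"))  # Element-wise
recycled <- paste(presidents, c("D", "R"))                   # Shorter vector is recycled
with_num <- paste(presidents, " (", 42:38, ")", sep = "")    # Numbers are converted to strings
with_party
recycled
with_num
```

Note that recycling happens silently, so the third call pairs Reagan with "D": a reminder to check vector lengths.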
Can condense a vector of strings into one big string by using paste() with the collapse argument
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
## [1] "Clinton; Bush; Reagan; Carter; Ford"
## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"
## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
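A sketch of collapse in action, reconstructed to match the outputs above (the party vector is inferred from them). sep joins the arguments within each element; collapse then joins the elements.

```r
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
one_string <- paste(presidents, collapse = "; ")
with_num <- paste(presidents, " (", 42:38, ")", sep = "", collapse = "; ")
with_both <- paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")",
                   sep = "", collapse = "; ")
no_collapse <- paste(presidents, collapse = NULL)  # NULL (the default) leaves the vector as is
one_string
with_num
with_both
no_collapse
```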
Reading in text, summarizing text
How to get text, from an external source, into R? Use the readLines() function
endgame_lines <- readLines("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt")
class(endgame_lines) # We have a character vector
## [1] "character"
## [1] 3242
## [1] "."
## [2] " [The screen first panels up to an arrow being nocked into a bow. The archer behind firmly grips it tight as it was aiming towards the target. The camera reveals Clint Barton holding up a few arrows while mentoring his daughter, Lila Barton, on shooting one.]"
## [3] ""
## [4] "CLINT BARTON : Okay, hold on. Don't shoot. You see where you're going?"
## [5] ""
## [6] "LILA BARTON : Mhm. "
We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from my (Ben’s) computer:
This will cause an error for you, unless your folder is set up exactly like the instructor’s laptop! So using web links is more robust
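The instructor's file path is not reproduced here. As a stand-in, the sketch below writes a small hypothetical file to a temporary location and reads it back, so it runs on any machine; the file contents and variable names are assumptions.

```r
# Hypothetical local file, created in a temporary location for illustration
tmp_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "", "third line"), tmp_path)
local_lines <- readLines(tmp_path)  # Same function, given a file path instead of a URL
class(local_lines)
length(local_lines)                 # One element per line, blank lines included
unlink(tmp_path)                    # Clean up the temporary file
```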
Fancy word, but all it means: make one long string, then split the words
endgame_text <- paste(endgame_lines, collapse = " ")
endgame_words <- strsplit(endgame_text, split = " ")[[1]]
# Sanity check
substr(endgame_text, 4, 200)
## [1] "[The screen first panels up to an arrow being nocked into a bow. The archer behind firmly grips it tight as it was aiming towards the target. The camera reveals Clint Barton holding up a few arrows"
## [1] "[The" "screen" "first" "panels" "up" "to" "an"
## [8] "arrow" "being" "nocked" "into" "a" "bow." "The"
## [15] "archer" "behind" "firmly" "grips"
The most basic steps we’ll do are
- Convert words to lower case
- Remove numbers and punctuation
- Remove “empty” strings
# Convert to lower case
endgame_words <- tolower(endgame_words)
# Removing numbers and punctuation (Don't worry if this doesn't make sense)
endgame_words <- gsub("[[:punct:]]|[[:digit:]]", "", endgame_words)
endgame_words <- gsub("[[:space:]]", "", endgame_words) # removing spaces
# Remove empty strings
endgame_words <- endgame_words[endgame_words != ""]
endgame_words[1:20]
## [1] "the" "screen" "first" "panels" "up" "to" "an"
## [8] "arrow" "being" "nocked" "into" "a" "bow" "the"
## [15] "archer" "behind" "firmly" "grips" "it" "tight"
Notice that beforehand, punctuation mattered. This is not ideal for us—we’ll learn just a little bit about how to fix this on lab/homework, using regular expressions
Our most basic tool for summarizing text: word counts, retrieved using table()
## [1] "table"
## [1] 3594
## endgame_words
## a aaaaahhh abandoned able aboard
## 493 1 4 2 1
## about above abrupt absentmindedly absolute
## 51 1 1 2 1
What did we get? Alphabetically sorted unique words, and their counts = number of appearances
Note: this is actually a vector of numbers, and the words are the names of the vector
## endgame_words
## a aaaaahhh abandoned able aboard
## 493 1 4 2 1
## abandoned
## TRUE
## [1] TRUE
So with named indexing, we can now use this to look up whatever words we want
## avengers
## 33
## thanos
## 106
## happy
## 11
## <NA>
## NA
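The same ideas can be tried on a toy word vector (an assumption for illustration, not the Endgame data): table() returns counts whose names are the unique words, and indexing by name looks up a count.

```r
# Toy data standing in for endgame_words
toy_words <- c("the", "arrow", "the", "bow", "arrow", "the")
toy_tab <- table(toy_words)
toy_tab               # Alphabetically sorted unique words with counts
names(toy_tab)        # The words are the names
toy_tab["the"]        # Look up a word by name
toy_tab["shield"]     # A word that never appears gives NA
```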
Let’s sort in decreasing order, to get the most frequent words
## [1] 3594
## endgame_words
## the to a and you i of in tony it we on
## 891 573 493 454 417 289 285 275 260 242 196 194
## is his that steve he stark scott bruce
## 184 178 173 170 167 160 143 142
## endgame_words
## young
## 1
## younger
## 1
## yours
## 1
## youseveral
## 1
## youwell
## 1
## youyoure
## 1
## zip
## 1
## zol
## 1
## zola
## 1
## zone
## 1
## zooms
## 1
## あなたは生き残った惑星の半分はそうではなかった
## 1
## くれromanized
## 1
## ことですromanized
## 1
## それはあきひこさん
## 1
## たすけて
## 1
## なぜあなたはこれをやっている私たちはあなたに何もしませんでしたromanized
## 1
## 分かってるね
## 1
## 彼らはサノスを手に入れた
## 1
## 死ね
## 1
## [1] "AKIHIKO : たすけて くれ! (Romanized: Tasukete kure! (English: Wait! Help me! I'll give you anything! What do you want?)"
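The sorting code is not shown above. A minimal sketch on toy data (an assumption, not the Endgame words) uses sort() with decreasing=TRUE, then head() and tail() to inspect the most and least frequent words:

```r
# Toy data standing in for endgame_words
toy_words <- c("the", "arrow", "the", "bow", "arrow", "the")
toy_sorted <- sort(table(toy_words), decreasing = TRUE)
length(toy_sorted)     # Sorting does not change the number of unique words
head(toy_sorted, 2)    # Most frequent words
tail(toy_sorted, 1)    # Least frequent word
```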
Let’s use a plot to visualize frequencies
nw <- length(endgame_wordtab_sorted)
plot(1:nw, as.numeric(endgame_wordtab_sorted), type="l",
xlab="Rank", ylab="Frequency")
A pretty drastic looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)
This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law
For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=1012\) and \(a=0.73\)
C <- 1012; a <- 0.73
endgame_wordtab_zipf <- C*(1/1:nw)^a
cbind(endgame_wordtab_sorted[1:10], endgame_wordtab_zipf[1:10])
## [,1] [,2]
## the 891 1012.0000
## to 573 610.1388
## a 493 453.8183
## and 454 367.8550
## you 417 312.5593
## i 289 273.6088
## of 285 244.4888
## in 275 221.7812
## tony 260 203.5089
## it 242 188.4432
Not perfect, but not bad. We can also plot the original sorted word counts, and those estimated by our formula on top
plot(1:nw, as.numeric(endgame_wordtab_sorted), type = "l",
xlab = "Rank", ylab = "Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)
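The slides do not say how C and a were obtained. One common approach, shown here only as a sketch on synthetic data, is to take logs: since \(\log \mathrm{Frequency} \approx \log C - a \log \mathrm{Rank}\), a linear regression of log frequency on log rank estimates both. The toy values below are assumptions, not the Endgame counts.

```r
# Synthetic ranks and frequencies following an exact power law (toy data)
toy_rank <- 1:200
toy_freq <- 1000 * (1 / toy_rank)^0.7
# Linear fit on the log-log scale: intercept = log(C), slope = -a
zipf_fit <- lm(log(toy_freq) ~ log(toy_rank))
C_hat <- exp(coef(zipf_fit)[[1]])
a_hat <- -coef(zipf_fit)[[2]]
c(C = C_hat, a = a_hat)
```

On real word counts the fit will not be exact, and the many words with a count of 1 can pull the estimate, so the result is best treated as a rough guide.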
We’ll learn about plotting tools in detail a bit later
nchar(), substr(): functions for substring extractions and replacements
strsplit(), paste(): functions for splitting and combining strings
table(): function to get word counts, useful way of summarizing text data