[ ] and [[ ]])
- if(), else if(), else: standard conditionals
- ifelse(): shortcut for using if() and else in combination
- switch(): shortcut for using if(), else if(), and else in combination
- for(), while(), repeat: standard loop constructs
- Don’t overuse for() loops, vectorization is your friend!
- apply() family: can also be very useful (we’ll see them later)

String basics
The simplest distinction:
Character: a symbol in a written language, like letters, numerals, punctuation, space, etc.
String: a sequence of characters bound together
class("r")
## [1] "character"
class("Ben")
## [1] "character"
Why do we care about strings?
Whitespaces count as characters and can be included in strings:
- " " for space
- "\n" for newline
- "\t" for tab
my_str <- "Dear Mr. Carnegie,\n\tThanks for the great school!\n\nSincerely, Ben"
my_str
## [1] "Dear Mr. Carnegie,\n\tThanks for the great school!\n\nSincerely, Ben"
Use cat() to print strings to the console, displaying whitespaces properly
cat(my_str)
## Dear Mr. Carnegie,
##     Thanks for the great school!
##
## Sincerely, Ben
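As a small check that escape sequences really are single characters, the sketch below counts characters with nchar() (covered a little further on) and contrasts print() with cat(). The string demo_str is made up for illustration.

```r
# demo_str is a made-up example string, not from the lecture
demo_str <- "a\tb\nc"

# Each escape sequence is stored as ONE character, so there are 5 in total:
# "a", tab, "b", newline, "c"
nchar(demo_str)

# print() shows the escapes literally; cat() interprets them
print(demo_str)
cat(demo_str, "\n")
```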
The character is a basic data type in R (like numeric or logical), so we can make vectors or matrices out of them, just like we would with numbers
str_vec <- c("Statistical", "Computing", "isn't that bad") # Collect 3 strings
str_vec # All elements of the vector
## [1] "Statistical" "Computing" "isn't that bad"
str_vec[3] # The 3rd element
## [1] "isn't that bad"
str_vec[-(1:2)] # All but the 1st and 2nd
## [1] "isn't that bad"
str_mat <- matrix("", 2, 3) # Build an empty 2 x 3 matrix
str_mat[1,] <- str_vec # Fill the 1st row with str_vec
str_mat[2,1:2] <- str_vec[1:2] # Fill the 2nd row, only entries 1 and 2, with
# those of str_vec
str_mat[2,3] <- "isn't a fad" # Fill the 2nd row, 3rd entry, with a new string
str_mat # All elements of the matrix
##      [,1]          [,2]        [,3]
## [1,] "Statistical" "Computing" "isn't that bad"
## [2,] "Statistical" "Computing" "isn't a fad"
t(str_mat) # Transpose of the matrix
##      [,1]             [,2]
## [1,] "Statistical"    "Statistical"
## [2,] "Computing"      "Computing"
## [3,] "isn't that bad" "isn't a fad"
Easy! Make things into strings with as.character()
as.character(0.8)
## [1] "0.8"
as.character(0.8e+10)
## [1] "8e+09"
as.character(1:5)
## [1] "1" "2" "3" "4" "5"
as.character(TRUE)
## [1] "TRUE"
Converting strings to other data types is not as easy! It depends on the given string, of course
as.numeric("0.5")
## [1] 0.5
as.numeric("0.5 ")
## [1] 0.5
as.numeric("0.5e-10")
## [1] 5e-11
as.numeric("Hi!")
## Warning: NAs introduced by coercion
## [1] NA
as.logical("True")
## [1] TRUE
as.logical("TRU")
## [1] NA
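When converting many strings at once, it can help to check which ones failed instead of relying on the warning alone. A minimal sketch, assuming a made-up input vector raw_vals; note that any NA already present in the input would be flagged as a failure too.

```r
# raw_vals is a made-up example vector
raw_vals <- c("0.5", "0.5 ", "5e-11", "Hi!", "")

# suppressWarnings() hides the "NAs introduced by coercion" warning;
# we then inspect the NAs ourselves
num_vals <- suppressWarnings(as.numeric(raw_vals))

failed <- is.na(num_vals)
raw_vals[failed]    # the strings that could not be converted
num_vals[!failed]   # the successfully converted numbers
```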
Use nchar() to count the number of characters in a string
nchar("coffee")
## [1] 6
nchar("code monkey")
## [1] 11
length("code monkey")
## [1] 1
length(c("coffee", "code monkey"))
## [1] 2
nchar(c("coffee", "code monkey")) # Vectorization!
## [1] 6 11
Substrings, splitting and combining strings
Use substr() to grab a subsequence of characters from a string, called a substring
phrase <- "Give me a break"
substr(phrase, 1, 4)
## [1] "Give"
substr(phrase, nchar(phrase)-4, nchar(phrase))
## [1] "break"
substr(phrase, nchar(phrase)+1, nchar(phrase)+10)
## [1] ""
substr() vectorizes
Just like nchar(), and many other string functions
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
substr(presidents, 1, 2) # Grab the first 2 letters from each
## [1] "Cl" "Bu" "Re" "Ca" "Fo"
substr(presidents, 1:5, 1:5) # Grab the first, 2nd, 3rd, etc.
## [1] "C" "u" "a" "t" ""
substr(presidents, 1, 1:5) # Grab the first, first 2, first 3, etc.
## [1] "C" "Bu" "Rea" "Cart" "Ford"
substr(presidents, nchar(presidents)-1, nchar(presidents)) # Grab the last 2
                                                           # letters from each
## [1] "on" "sh" "an" "er" "rd"
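The "last 2 letters" pattern generalizes to the last n letters. The sketch below wraps it in a helper; last_n() is a hypothetical name introduced here, not a base R function.

```r
# last_n() is a hypothetical helper, not part of base R
last_n <- function(x, n) {
  len <- nchar(x)
  # substr() treats start positions below 1 as 1, so short strings
  # are simply returned whole
  substr(x, len - n + 1, len)
}

presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")
last_n(presidents, 2)
last_n(presidents, 5)  # "Bush" and "Ford" have only 4 letters
```

Because nchar() and substr() both vectorize, the helper works on a whole character vector with no loop.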
Can also use substr() to replace a character, or a substring
phrase
## [1] "Give me a break"
substr(phrase, 1, 1) <- "L"
phrase # "G" changed to "L"
## [1] "Live me a break"
substr(phrase, 1000, 1001) <- "R"
phrase # Nothing happened
## [1] "Live me a break"
substr(phrase, 1, 4) <- "Show"
phrase # "Live" changed to "Show"
## [1] "Show me a break"
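The replacements above all swap in a string of exactly the same length as the span being replaced. The sketch below shows what happens when the lengths differ: substr() never changes the length of the string. The names short_fix and long_fix are made up for illustration.

```r
phrase <- "Give me a break"

# Replacement shorter than the span: only 2 characters are overwritten
short_fix <- phrase
substr(short_fix, 1, 4) <- "Hi"
short_fix

# Replacement longer than the span: only its first 2 characters are used
long_fix <- phrase
substr(long_fix, 1, 2) <- "Show"
long_fix

# The number of characters is the same in all three strings
nchar(c(phrase, short_fix, long_fix))
```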
Use the strsplit() function to split based on a keyword
ingredients <- "chickpeas, tahini, olive oil, garlic, salt"
split_obj <- strsplit(ingredients, split=",")
split_obj
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
class(split_obj)
## [1] "list"
length(split_obj)
## [1] 1
Note that the output is actually a list! (With just one element, which is a vector of strings)
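Notice the leading spaces in " tahini", " olive oil", and so on: splitting on "," keeps the space that followed each comma. Two possible ways to avoid this, sketched under the assumption that items are always separated by a comma and a single space:

```r
ingredients <- "chickpeas, tahini, olive oil, garlic, salt"

# Option 1: include the space in the separator
split_a <- strsplit(ingredients, split = ", ")[[1]]
split_a

# Option 2: split on the comma, then trim surrounding whitespace
split_b <- trimws(strsplit(ingredients, split = ",")[[1]])
split_b
```

Option 2 is the more forgiving choice when the spacing after commas is inconsistent.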
strsplit() vectorizes
Just like nchar(), substr(), and the many others
great_profs <- "Nugent, Tibshirani, Genovese, Rinaldo, Shalizi, Ventura"
favorite_cats <- "tiger, leopard, jaguar, lion"
split_list <- strsplit(c(ingredients, great_profs, favorite_cats), split=",")
split_list
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
##
## [[2]]
## [1] "Nugent" " Tibshirani" " Genovese" " Rinaldo" " Shalizi"
## [6] " Ventura"
##
## [[3]]
## [1] "tiger" " leopard" " jaguar" " lion"
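To work with a result like this, it is handy to ask how many pieces each input string produced, or to pull the same position out of every element. A small sketch with made-up strings (toy_list is a hypothetical name):

```r
# toy_list is built from made-up strings
toy_list <- strsplit(c("a, b, c", "d, e", "f"), split = ", ")

# Number of pieces per input string
piece_counts <- lengths(toy_list)
piece_counts

# First piece of each element
first_pieces <- sapply(toy_list, `[`, 1)
first_pieces
```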
Do you see why strsplit() needs to return a list now?
The finest splitting you can do is character-by-character: use strsplit() with split=""
split_chars <- strsplit(ingredients, split = "")[[1]]
split_chars
## [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i"
## [18] "," " " "o" "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l"
## [35] "i" "c" "," " " "s" "a" "l" "t"
length(split_chars)
## [1] 42
nchar(ingredients) # Matches the previous count
## [1] 42
Use the paste() function to join two (or more) strings into one, separated by a keyword
paste("Spider", "Man") # Default is to separate by " "
## [1] "Spider Man"
paste("Spider", "Man", sep = "-")
## [1] "Spider-Man"
paste("Spider", "Man", "does whatever", sep = ", ")
## [1] "Spider, Man, does whatever"
paste() vectorizes
Just like nchar(), substr(), strsplit(), etc. Seeing a theme yet?
presidents
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
paste(presidents, c("D", "R", "R", "D", "R"))
## [1] "Clinton D" "Bush R" "Reagan R" "Carter D" "Ford R"
paste(presidents, c("D", "R")) # Notice the recycling (not historically accurate!)
## [1] "Clinton D" "Bush R" "Reagan D" "Carter R" "Ford D"
paste(presidents, " (", 42:38, ")", sep="")
## [1] "Clinton (42)" "Bush (41)" "Reagan (40)" "Carter (39)"
## [5] "Ford (38)"
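Since sep="" comes up so often, base R also provides paste0(), which behaves like paste() with an empty separator. A quick sketch comparing the two (with_sep and with_paste0 are made-up names):

```r
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")

with_sep    <- paste(presidents, " (", 42:38, ")", sep = "")
with_paste0 <- paste0(presidents, " (", 42:38, ")")

identical(with_sep, with_paste0)
```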
Can condense a vector of strings into one big string by using paste() with the collapse argument
presidents
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
paste(presidents, collapse = "; ")
## [1] "Clinton; Bush; Reagan; Carter; Ford"
paste(presidents, " (", 42:38, ")", sep="", collapse = "; ")
## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"
paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep = "",
collapse= "; ")
## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"
paste(presidents, collapse=NULL) # No condensing, the default
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
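Collapsing with paste() and splitting with strsplit() are roughly inverse operations. The sketch below checks a round trip; it assumes the separator "; " does not occur inside any of the original strings, otherwise the pieces would not match.

```r
presidents <- c("Clinton", "Bush", "Reagan", "Carter", "Ford")

# Vector -> one string -> vector again
joined    <- paste(presidents, collapse = "; ")
recovered <- strsplit(joined, split = "; ")[[1]]

identical(recovered, presidents)
```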
Reading in text, summarizing text
How do we get text from an external source into R? Use the readLines() function
endgame_lines <- readLines("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt")
class(endgame_lines) # We have a character vector
## [1] "character"
length(endgame_lines) # Many lines (elements)!
## [1] 3242
endgame_lines[1:6] # First 6 lines
## [1] "."
## [2] " [The screen first panels up to an arrow being nocked into a bow. The archer behind firmly grips it tight as it was aiming towards the target. The camera reveals Clint Barton holding up a few arrows while mentoring his daughter, Lila Barton, on shooting one.]"
## [3] ""
## [4] "CLINT BARTON : Okay, hold on. Don't shoot. You see where you're going?"
## [5] ""
## [6] "LILA BARTON : Mhm. "
We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from my (Ben’s) computer:
endgame_lines_2 <- readLines("~/Documents/CMU/third_summer/36-350-summer-data/Week1/endgame.txt")
This will cause an error for you, unless your folder is set up exactly like the instructor’s laptop! So using web links is more robust
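To try readLines() on a local file without depending on anyone's folder layout, one option is to write a small temporary file first. This is only a sketch: the three lines written below are made-up stand-ins, and tmp_path and demo_lines are hypothetical names.

```r
# tempfile() returns a path that is valid on any machine
tmp_path <- tempfile(fileext = ".txt")
writeLines(c("CLINT BARTON : Okay, hold on.", "", "LILA BARTON : Mhm."),
           tmp_path)

# Guard against a missing file before reading
if (file.exists(tmp_path)) {
  demo_lines <- readLines(tmp_path)
}

length(demo_lines)   # one element per line, including the empty one
demo_lines[1]

unlink(tmp_path)     # clean up the temporary file
```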
Reconstitution: a fancy word, but all it means is to make one long string, then split it into words
endgame_text <- paste(endgame_lines, collapse = " ")
endgame_words <- strsplit(endgame_text, split = " ")[[1]]
# Sanity check
substr(endgame_text, 4, 200)
## [1] "[The screen first panels up to an arrow being nocked into a bow. The archer behind firmly grips it tight as it was aiming towards the target. The camera reveals Clint Barton holding up a few arrows"
endgame_words[3:20]
## [1] "[The" "screen" "first" "panels" "up" "to" "an"
## [8] "arrow" "being" "nocked" "into" "a" "bow." "The"
## [15] "archer" "behind" "firmly" "grips"
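The combine-then-split step above can leave empty strings behind, because empty lines and doubled spaces produce consecutive separators. A toy sketch with made-up lines (toy_lines, toy_text and toy_words are hypothetical names):

```r
# Made-up lines, including an empty line and a double space
toy_lines <- c("Hello there.", "", "General  Kenobi!")

# Step 1: one long string
toy_text <- paste(toy_lines, collapse = " ")

# Step 2: split on single spaces
toy_words <- strsplit(toy_text, split = " ")[[1]]
toy_words

# How many "words" are actually empty strings?
sum(toy_words == "")
```

This is why a later cleaning step that drops empty strings is needed.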
The most basic steps we’ll do are:
- Convert words to lower case
- Remove numbers and punctuation
- Remove “empty” strings
# Convert to lower case
endgame_words <- tolower(endgame_words)
# Removing numbers and punctuation (Don't worry if this doesn't make sense)
endgame_words <- gsub("[[:punct:]]|[[:digit:]]", "", endgame_words)
endgame_words <- gsub("[[:space:]]", "", endgame_words) # removing spaces
# Remove empty strings
endgame_words <- endgame_words[endgame_words != ""]
endgame_words[1:20]
## [1] "the" "screen" "first" "panels" "up" "to" "an"
## [8] "arrow" "being" "nocked" "into" "a" "bow" "the"
## [15] "archer" "behind" "firmly" "grips" "it" "tight"
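The same cleaning steps can be bundled into one function so they are easy to reuse on other text. This is a sketch only: clean_words() is a hypothetical name, and the input vector is made up.

```r
# clean_words() is a hypothetical helper wrapping the steps shown above
clean_words <- function(words) {
  words <- tolower(words)                               # lower case
  words <- gsub("[[:punct:]]|[[:digit:]]", "", words)   # drop punctuation, digits
  words <- gsub("[[:space:]]", "", words)               # drop stray whitespace
  words[words != ""]                                    # drop empty strings
}

cleaned <- clean_words(c("Bow.", "bow", "", "3000", "Don't", "[The"))
cleaned
```

Note that "Don't" becomes "dont": removing punctuation also removes apostrophes, which may or may not be what you want.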
Notice that beforehand, punctuation mattered: "bow." and "bow" would have been counted as different words. This is not ideal for us. We’ll learn just a little bit about how to fix this in lab/homework, using regular expressions
Our most basic tool for summarizing text: word counts, retrieved using table()
endgame_wordtab <- table(endgame_words)
class(endgame_wordtab)
## [1] "table"
length(endgame_wordtab)
## [1] 3594
endgame_wordtab[1:10]
## endgame_words
##              a       aaaaahhh      abandoned           able         aboard
##            493              1              4              2              1
##          about          above         abrupt absentmindedly       absolute
##             51              1              1              2              1
What did we get? Alphabetically sorted unique words, and their counts = number of appearances
Note: this is actually a vector of numbers, and the words are the names of the vector
endgame_wordtab[1:5]
## endgame_words
##         a  aaaaahhh abandoned      able    aboard
##       493         1         4         2         1
endgame_wordtab[3] == 4
## abandoned
## TRUE
names(endgame_wordtab)[3] == "abandoned"
## [1] TRUE
So with named indexing, we can now use this to look up whatever words we want
endgame_wordtab["avengers"]
## avengers
## 33
endgame_wordtab["thanos"]
## thanos
## 106
endgame_wordtab["happy"]
## happy
## 11
endgame_wordtab["sad"] # I guess we should classify this movie as a joyful one (at least less sad than Infinity War)
## <NA>
## NA
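Looking up a word that never appears gives NA, as with "sad" above. If a count of 0 is more convenient, a small wrapper can handle it. The sketch below uses a toy table; word_count(), toy_words and toy_tab are hypothetical names.

```r
# Toy data standing in for a real word table
toy_words <- c("the", "cat", "the", "hat")
toy_tab   <- table(toy_words)

# word_count() is a hypothetical helper: returns 0 for absent words
word_count <- function(tab, words) {
  idx    <- match(words, names(tab))   # NA where the word is absent
  counts <- as.integer(tab)[idx]
  counts[is.na(counts)] <- 0L
  names(counts) <- words
  counts
}

toy_counts <- word_count(toy_tab, c("the", "cat", "sad"))
toy_counts
```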
Let’s sort in decreasing order, to get the most frequent words
endgame_wordtab_sorted <- sort(endgame_wordtab, decreasing = TRUE)
length(endgame_wordtab_sorted)
## [1] 3594
head(endgame_wordtab_sorted, 20) # First 20
## endgame_words
##   the    to     a   and   you     i    of    in  tony    it    we    on
##   891   573   493   454   417   289   285   275   260   242   196   194
##    is   his  that steve    he stark scott bruce
##   184   178   173   170   167   160   143   142
tail(endgame_wordtab_sorted, 20) # Last 20
## endgame_words
## young
## 1
## younger
## 1
## yours
## 1
## youseveral
## 1
## youwell
## 1
## youyoure
## 1
## zip
## 1
## zol
## 1
## zola
## 1
## zone
## 1
## zooms
## 1
## あなたは生き残った惑星の半分はそうではなかった
## 1
## くれromanized
## 1
## ことですromanized
## 1
## それはあきひこさん
## 1
## たすけて
## 1
## なぜあなたはこれをやっている私たちはあなたに何もしませんでしたromanized
## 1
## 分かってるね
## 1
## 彼らはサノスを手に入れた
## 1
## 死ね
## 1
endgame_lines[1068] # sanity check
## [1] "AKIHIKO : たすけて くれ! (Romanized: Tasukete kure! (English: Wait! Help me! I'll give you anything! What do you want?)"
Let’s use a plot to visualize frequencies
nw <- length(endgame_wordtab_sorted)
plot(1:nw, as.numeric(endgame_wordtab_sorted), type="l",
xlab="Rank", ylab="Frequency")
A pretty drastic looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)
This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law
For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=1012\) and \(a=0.73\)
C <- 1012; a <- 0.73
endgame_wordtab_zipf <- C*(1/1:nw)^a
cbind(endgame_wordtab_sorted[1:10], endgame_wordtab_zipf[1:10])
##      [,1]      [,2]
## the   891 1012.0000
## to    573  610.1388
## a     493  453.8183
## and   454  367.8550
## you   417  312.5593
## i     289  273.6088
## of    285  244.4888
## in    275  221.7812
## tony  260  203.5089
## it    242  188.4432
Not perfect, but not bad. We can also plot the original sorted word counts, with those estimated by our Zipf’s law formula on top
plot(1:nw, as.numeric(endgame_wordtab_sorted), type = "l",
xlab = "Rank", ylab = "Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)
We’ll learn about plotting tools in detail a bit later
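The values C = 1012 and a = 0.73 are stated above without showing how they were found. One common approach (an assumption here, not necessarily what was used for these slides) is least squares on the log-log scale, since log(Frequency) = log(C) - a log(Rank) is a straight line in log(Rank). A minimal sketch on synthetic counts; true_C, true_a, toy_rank, toy_freq, C_hat and a_hat are all made-up names and values.

```r
# Synthetic counts that follow an exact power law (made-up parameters)
true_C <- 1000
true_a <- 0.7
toy_rank <- 1:50
toy_freq <- true_C * (1 / toy_rank)^true_a

# Fit a straight line on the log-log scale:
#   log(freq) = log(C) - a * log(rank)
fit <- lm(log(toy_freq) ~ log(toy_rank))

C_hat <- exp(coef(fit)[[1]])   # intercept is log(C)
a_hat <- -coef(fit)[[2]]       # slope is -a
c(C = C_hat, a = a_hat)
```

On real word counts the fit will not be exact, and the many words tied at count 1 in the tail can pull the estimates around, so the result is best treated as a rough guide.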
- nchar(), substr(): functions for substring extractions and replacements
- strsplit(), paste(): functions for splitting and combining strings
- table(): function to get word counts, useful way of summarizing text data