Last week: Visualization…

Visualization

Coding Style

Part I

Function basics

Why do we need functions?

Remember those commands you typed over and over?

Recall from previous lectures and assignments on text manipulation and regexes:

endgame_lines <- readLines(
  paste0("https://raw.githubusercontent.com/benjaminleroy/",
         "36-350-summer-data/master/Week1/endgame.txt"))
infinitywar_lines <- readLines(
  paste0("https://raw.githubusercontent.com/benjaminleroy/",
         "36-350-summer-data/master/Week1/infinitywar.txt"))


endgame_text <- paste(endgame_lines, collapse = " ")
endgame_words <- strsplit(endgame_text, split = "[[:space:]]|[[:punct:]]")[[1]]
endgame_words <- endgame_words[endgame_words != ""]
endgame_wordtab <- table(endgame_words)

Creating your own function

Call function() to create your own function. Document your function with comments

# Get a word table from text on the web
# 
# Arguments:
# ----------
# str_url : string, specifying URL of a web page 
#
# Returns:
# --------
# wordtable : a word table, i.e., vector with counts as entries 
# and associated words as names
#
# Example:
# --------
# endgame_wordtable <- get_wordtab_from_url(
#  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#  "36-350-summer-data/master/Week1/endgame.txt"))

get_wordtab_from_url <- function(str_url) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
  words <- words[words != ""]
  wordtab <- table(words)
  return(wordtab)
}

Creating your own function (different documentation)

Roxygen2 style (which we’ll be using for the rest of the lecture)

#' Get a word table from text on the web
#'
#' @param str_url string, specifying URL of a web page 
#'
#' @return \code{wordtable}, a word table, i.e., vector with counts as entries 
#' and associated words as names
#'
#' @examples
#' endgame_wordtable <- get_wordtab_from_url(
#'  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#'  "36-350-summer-data/master/Week1/endgame.txt"))
get_wordtab_from_url <- function(str_url) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
  words <- words[words != ""]
  wordtab <- table(words)
  return(wordtab)
}

Function structure

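The structure of a function has three basic parts: the inputs (or arguments), listed inside the parentheses of function(); the body, which is the code inside the braces; and the output (or return value), given by return()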

R doesn’t let your function have multiple outputs, but you can return a list
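As a small illustration of these parts, here is a minimal sketch with the inputs, body, and output labeled in comments. The function count_long_words() and its argument min_chars are hypothetical names made up for this example; they are not part of the lecture code.

```r
# Hypothetical example function (not from the lecture)
count_long_words <- function(words, min_chars) {  # inputs: inside function()
  # body: everything inside the braces
  is_long <- nchar(words) >= min_chars
  n_long <- sum(is_long)
  return(n_long)                                  # output: passed to return()
}

n_long <- count_long_words(c("I", "am", "Iron", "Man"), min_chars = 3)
n_long
```

Only "Iron" and "Man" have at least 3 characters, so the call should count two words.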

Using your created function

Our created functions can be used just like the built-in ones

# Using our function
endgame_wordtab_new <- get_wordtab_from_url(
  paste0("https://raw.githubusercontent.com/benjaminleroy/",
         "36-350-summer-data/master/Week1/endgame.txt"))
all(endgame_wordtab_new == endgame_wordtab)
## [1] TRUE
# Revealing our function's definition
get_wordtab_from_url
## function(str_url) {
##   lines <- readLines(str_url)
##   text <- paste(lines, collapse = " ")
##   words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
##   words <- words[words != ""]
##   wordtab <- table(words)
##   return(wordtab)
## }

Default return value

With no explicit return() statement, the default is just to return whatever is on the last line. So the following is equivalent to what we had before

get_wordtab_from_url <- function(str_url) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
  words <- words[words != ""]
  table(words)
}
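The same equivalence can be checked offline on a toy function. This is a sketch with made-up function names (square_explicit, square_implicit, quiet_square). It also shows one thing to watch for: when the last line is an assignment, the value is still returned, but it is not printed at the console.

```r
# Two hypothetical versions of the same function
square_explicit <- function(x) {
  return(x^2)
}
square_implicit <- function(x) {
  x^2  # last line of the body is the return value
}
same <- identical(square_explicit(1:4), square_implicit(1:4))
same

# If the last line is an assignment, the value is returned invisibly
quiet_square <- function(x) {
  y <- x^2
}
quiet_square(3)            # nothing is printed here
result <- quiet_square(3)  # but the value can still be captured
result
```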

Multiple inputs

Our function can take more than one input

#' Get a word table from text on the web
#'
#' @param str_url string, specifying URL of a web page 
#' @param split string, specifying what to split on
#'
#' @return \code{wordtable}, a word table, i.e., vector with counts as entries 
#' and associated words as names
#'
#' @examples
#' endgame_wordtable <- get_wordtab_from_url(
#'  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#'  "36-350-summer-data/master/Week1/endgame.txt"),
#'  split = "[[:space:]]|[[:punct:]]")
get_wordtab_from_url <- function(str_url, split) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = split)[[1]]
  words <- words[words != ""]
  table(words)
}

Default inputs

Our function can also specify default values for the inputs (if the user doesn’t specify an input in the function call, then the default value is used)

#' Get a word table from text on the web
#'
#' @param str_url string, specifying URL of a web page 
#' @param split string, specifying what to split on. Default is the regex 
#' pattern "[[:space:]]|[[:punct:]]"
#' @param tolower Boolean, TRUE if words should be converted to lower case 
#' before the word table is computed. Default is TRUE
#'
#' @return \code{wordtable}, a word table, i.e., vector with counts as entries 
#' and associated words as names
#'
#' @examples
#' endgame_wordtable <- get_wordtab_from_url(
#'  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#'  "36-350-summer-data/master/Week1/endgame.txt"))
get_wordtab_from_url <- function(str_url, split = "[[:space:]]|[[:punct:]]", 
                                 tolower = TRUE) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = split)[[1]]
  words <- words[words != ""]
  
  # Convert to lower case, if we're asked to
  if (tolower) {
    words <- tolower(words)
  }
  
  table(words)
}

Examples of function calls

# Inputs can be called by name, or without names
endgame_wordtable1 <- get_wordtab_from_url(
  str_url = "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  split = "[[:space:]]|[[:punct:]]", tolower = TRUE)

endgame_wordtable2 <- get_wordtab_from_url(
  "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  "[[:space:]]|[[:punct:]]", TRUE)
all(endgame_wordtable2 == endgame_wordtable1)
## [1] TRUE
# Inputs can be called by partial names (if uniquely identifying)
endgame_wordtable3 <- get_wordtab_from_url(
  str = "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  spl = "[[:space:]]|[[:punct:]]", tolower = TRUE)
all(endgame_wordtable3 == endgame_wordtable1)
## [1] TRUE

# When inputs aren't specified, default values are used
endgame_wordtable4 <- get_wordtab_from_url(
  str_url="https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  split="[[:space:]]|[[:punct:]]")
all(endgame_wordtable4 == endgame_wordtable1)
## [1] TRUE
# Named inputs can go in any order
endgame_wordtable5 <- get_wordtab_from_url(
  tolower = TRUE, split = "[[:space:]]|[[:punct:]]",
  str_url = "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt")
all(endgame_wordtable5 == endgame_wordtable1)
## [1] TRUE

The dangers of using inputs without names

While named inputs can go in any order, unnamed inputs must go in the proper order (as they are specified in the function’s definition). E.g., the following code would throw an error:

endgame_wordtable6 <- get_wordtab_from_url("[[:space:]]|[[:punct:]]",
  "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  tolower=FALSE)
## Warning in file(con, "r"): cannot open file '[[:space:]]|[[:punct:]]': No
## such file or directory
## Error in file(con, "r"): cannot open the connection

because our function would try to open up “[[:space:]]|[[:punct:]]” as the URL of a web page

When calling a function with multiple arguments, use input names for safety, unless you’re absolutely certain of the right order for (some) inputs
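The error above is the lucky case, because R at least tells us something went wrong. A minimal sketch of the unlucky case, using a hypothetical function raise() that is not part of the lecture code: when both inputs are numbers, swapping them runs without any error or warning and quietly gives a different answer.

```r
# Hypothetical function with two numeric inputs
raise <- function(base, exponent = 2) {
  base^exponent
}

intended  <- raise(base = 2, exponent = 3)  # 2^3
reordered <- raise(exponent = 3, base = 2)  # names make the order irrelevant
swapped   <- raise(3, 2)                    # unnamed and swapped: 3^2

c(intended = intended, reordered = reordered, swapped = swapped)
```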

Part II

Return values and side effects

Returning more than one thing

When creating a function in R, though you cannot return more than one output, you can return a list. This (by definition) can contain an arbitrary number of arbitrary objects

#' Get a word table from text on the web
#'
#' @param str_url string, specifying URL of a web page 
#' @param split string, specifying what to split on. Default is the regex 
#' pattern "[[:space:]]|[[:punct:]]"
#' @param tolower Boolean, TRUE if words should be converted to lower case 
#' before the word table is computed. Default is TRUE
#' @param keep_nums Boolean, TRUE if words containing numbers should be kept in 
#' the word table. Default is FALSE
#'
#' @return list, containing word table, and then some basic numeric summaries
#'
#' @examples
#' endgame_speech_list <- get_wordtab_from_url(
#'  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#'  "36-350-summer-data/master/Week1/endgame.txt"))
get_wordtab_from_url <- function(str_url, split = "[[:space:]]|[[:punct:]]", 
                                 tolower = TRUE,
                                 keep_nums = FALSE) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = split)[[1]]
  words <- words[words != ""]
  
  # Convert to lower case, if we're asked to
  if (tolower) {
    words <- tolower(words)
  }
  # Get rid of words with numbers, if we're asked to
  if (!keep_nums) {
    words <- grep("[0-9]", words, invert = TRUE, value = TRUE)
  }
  
  wordtab <- table(words)
  
  return(list(wordtab = wordtab,
              number_unique_words = length(wordtab),
              number_total_words = sum(wordtab),
              longest_word = words[which.max(nchar(words))]))
}

# Endgame's script
endgame_wordtable <- get_wordtab_from_url(
  "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt")
lapply(endgame_wordtable, head)
## $wordtab
## words
##         a  aaaaahhh abandoned      able    aboard     about 
##       514         1         4         2         1        52 
## 
## $number_unique_words
## [1] 3071
## 
## $number_total_words
## [1] 22912
## 
## $longest_word
## [1] "私たちはあなたに何もしませんでした"
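Once a function returns a list, the individual pieces can be pulled out by name. The sketch below uses a hypothetical offline stand-in, word_summary(), that takes a string instead of a URL so it runs without a network connection; the element names mirror the ones used above.

```r
# Hypothetical, offline stand-in for the function above (takes text, not a URL)
word_summary <- function(text) {
  words <- strsplit(tolower(text), split = "[[:space:]]|[[:punct:]]")[[1]]
  words <- words[words != ""]
  wordtab <- table(words)
  list(wordtab = wordtab,
       number_unique_words = length(wordtab),
       number_total_words = sum(wordtab))
}

res <- word_summary("I am Iron Man. I am inevitable.")
names(res)                    # what the list contains
res$number_total_words        # extract one element with $
res[["number_unique_words"]]  # or with [[ ]]
res$wordtab[["am"]]           # then index into that element as usual
```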

Side effects

A side effect of a function is something that happens as a result of the function’s body, but is not returned. Examples: printing something to the console, drawing a plot, or saving a file to disk
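Before the full example, here is a minimal sketch of the distinction, using a hypothetical function report_count() that is not part of the lecture code. The printed line is a side effect; only the character total is returned.

```r
# Hypothetical function: prints a message (side effect), returns a number
report_count <- function(words) {
  cat("Number of words:", length(words), "\n")  # side effect: console output
  return(sum(nchar(words)))                     # return value: total characters
}

total_chars <- report_count(c("avengers", "assemble"))
total_chars  # holds only the returned value, not the printed message
```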


library(ggplot2)
#' Get a word table from text on the web
#'
#' @param str_url string, specifying URL of a web page 
#' @param split string, specifying what to split on. Default is the regex 
#' pattern "[[:space:]]|[[:punct:]]"
#' @param tolower Boolean, TRUE if words should be converted to lower case 
#' before the word table is computed. Default is TRUE
#' @param keep_nums Boolean, TRUE if words containing numbers should be kept in 
#' the word table. Default is FALSE
#' @param hist Boolean, TRUE if a histogram of word lengths should be plotted as 
#' a side effect. Default is FALSE
#'
#' @return list, containing word table, and then some basic numeric summaries
#'
#' @examples
#' endgame_speech_list <- get_wordtab_from_url(
#'  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#'  "36-350-summer-data/master/Week1/endgame.txt"))
get_wordtab_from_url <- function(str_url, split = "[[:space:]]|[[:punct:]]", 
                                 tolower = TRUE,
                                 keep_nums = FALSE,
                                 hist = FALSE) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = split)[[1]]
  words <- words[words != ""]
  
  # Convert to lower case, if we're asked to
  if (tolower) {
    words <- tolower(words)
  }
  
  # Get rid of words with numbers, if we're asked to
  if (!keep_nums) {
    words <- grep("[0-9]", words, invert = TRUE, value = TRUE)
  }
  
  # Plot the histogram of the word lengths, if we're asked to
  data_vis <- data.frame(word = words, nchar = nchar(words))
  if (hist) {
    my_plot <- ggplot(data_vis, aes(x = nchar)) +
      geom_histogram(fill = "lightblue",
                     color = "black", binwidth = 1) +
      labs(x = "Word Length")
    
    plot(my_plot) # how to force visualization of a ggplot object
  }
  
  # Compute the word table
  wordtab <- table(words)
  
  return(list(wordtab = wordtab,
              number_unique_words = length(wordtab),
              number_total_words = sum(wordtab),
              longest_word = words[which.max(nchar(words))],
              data_vis = data_vis))
}

# Endgame's transcript
endgame_wordtable <- get_wordtab_from_url(
  str_url="https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  hist=TRUE)

lapply(endgame_wordtable, head)
## $wordtab
## words
##         a  aaaaahhh abandoned      able    aboard     about 
##       514         1         4         2         1        52 
## 
## $number_unique_words
## [1] 3071
## 
## $number_total_words
## [1] 22912
## 
## $longest_word
## [1] "私たちはあなたに何もしませんでした"
## 
## $data_vis
##     word nchar
## 1    the     3
## 2 screen     6
## 3  first     5
## 4 panels     6
## 5     up     2
## 6     to     2

Part III

Environments and design

Environment: what the function can see and do

Environment examples

x <- 7
y <- c("A", "C", "G", "T", "U")
adder <- function(y) {
  x <- x + y
  x
}
adder(1)
## [1] 8
x
## [1] 7
y
## [1] "A" "C" "G" "T" "U"

circle_area <- function(r) { pi*r^2 }
circle_area(1:3)
## [1]  3.141593 12.566371 28.274334
true.pi <- pi
pi <- 3 # Valid in 1800s Indiana
circle_area(1:3)
## [1]  3 12 27
pi <- true.pi # Restore sanity
circle_area(1:3)
## [1]  3.141593 12.566371 28.274334

Relying on variables outside of the function’s environment
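A minimal sketch of what this can look like, in the spirit of the circle_area() example. The names rate, scale_implicit(), and scale_explicit() are hypothetical. The first function silently depends on a variable defined outside of it, so its answer changes when that variable changes; the second takes the value as an input instead.

```r
rate <- 0.5

# Relies on `rate` being defined somewhere outside the function
scale_implicit <- function(x) {
  x * rate
}

# Takes the value as an input (with a default) instead
scale_explicit <- function(x, rate = 0.5) {
  x * rate
}

before <- scale_implicit(10)
rate <- 2                     # someone changes the outside variable
after <- scale_implicit(10)   # same call, different answer
stable <- scale_explicit(10)  # unaffected by the outside change

c(before = before, after = after, stable = stable)
```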

Bad side effects

Not all side effects are desirable. One particularly bad side effect is if the function’s body changes the value of some variable outside of the function’s environment
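One way this can happen in R is the superassignment operator <<-, which assigns to a variable in an enclosing environment rather than creating a local one. The sketch below uses hypothetical names (counter, bump_bad(), bump_good()) and contrasts it with the safer pattern of returning the new value and letting the caller assign it.

```r
counter <- 0

# Bad: reaches outside the function and changes `counter`
bump_bad <- function() {
  counter <<- counter + 1
  invisible(NULL)
}
bump_bad()
after_bad <- counter  # the outside variable changed behind the caller's back

# Better: take the value as input, return the new value
bump_good <- function(count) {
  count + 1
}
counter <- bump_good(counter)  # the change is visible at the call site
counter
```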

Top-down function design

  1. Start with the big-picture view of the task
  2. Break the task into a few big parts
  3. Figure out how to fit the parts together
  4. Repeat this for each part

Start off with a code sketch

You can write top-level code, right away, for your function’s design:

# Not actual code
big_job <- function(lots_of_arguments) {
  first_result <- first_step(some_of_the_args)
  second_result <- second_step(first_result, more_of_the_args)
  final_result <- third_step(second_result, rest_of_the_args)
  return(final_result)
}

After you write down your design, go ahead and write the sub-functions (here first_step(), second_step(), third_step()). The process may be iterative, in that you may write these sub-functions, then go back and change the design a bit, etc.
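To make the sketch concrete, here is one possible way to apply it to the word table task from earlier. This is an illustration under assumptions, not the lecture's own design: the sub-function names (split_into_words(), normalize_words(), count_words()) and the top-level get_wordtab_from_text() are hypothetical, and the function takes a string rather than a URL so that it runs offline.

```r
# Hypothetical sub-functions, one per step of the design
split_into_words <- function(text, split = "[[:space:]]|[[:punct:]]") {
  words <- strsplit(text, split = split)[[1]]
  words[words != ""]
}

normalize_words <- function(words, to_lower = TRUE) {
  if (to_lower) {
    words <- tolower(words)
  }
  words
}

count_words <- function(words) {
  table(words)
}

# Top-level function: reads like the code sketch
get_wordtab_from_text <- function(text, split = "[[:space:]]|[[:punct:]]",
                                  to_lower = TRUE) {
  words <- split_into_words(text, split = split)
  words <- normalize_words(words, to_lower = to_lower)
  count_words(words)
}

demo_tab <- get_wordtab_from_text("We are Groot. We ARE.")
demo_tab
```

Each sub-function can be tested on its own, and the top-level function stays short enough to read at a glance.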

With practice, this design strategy should become natural

Summary

Extra

endgame_wordtable <- get_wordtab_from_url(
  str_url = "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt",
  hist = FALSE)

infinitywar_wordtable <- get_wordtab_from_url(
  str_url = "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/infinitywar.txt",
  hist = FALSE)

both_all_words <- rbind(cbind(endgame_wordtable$data_vis, film = "endgame"),
                        cbind(infinitywar_wordtable$data_vis, film = "infinity war"))

# after_stat() needs ggplot2 >= 3.3.0; it replaces the deprecated ..density..
ggplot(both_all_words, aes(x = nchar, fill = film, y = after_stat(density))) +
  geom_histogram(position = "identity", binwidth = 1, alpha = .5)