Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Tuesday 10pm, this week.

This week’s agenda: creating and updating functions; understanding argument and return structures; revisiting Shakespeare’s plays; code refactoring.

Huber loss function

The Huber loss function (or just Huber function, for short) is defined as: \[ \psi(x) = \begin{cases} x^2 & \text{if $|x| \leq 1$} \\ 2|x| - 1 & \text{if $|x| > 1$} \end{cases} \] This function is quadratic on the interval [-1,1], and linear outside of this interval. It transitions from quadratic to linear “smoothly”: the two pieces agree in both value and slope at \(|x| = 1\).

[Figure: plot of the Huber function.]
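As a minimal sketch, the piecewise definition above can be written as a vectorized R function. The name `huber_sketch` is hypothetical (it is not defined anywhere in the lab), and this version hard-codes the cutoff at 1 exactly as in the formula.

```r
# Hypothetical helper name; implements psi(x) from the definition above.
# ifelse() makes it vectorized, so x can be a single number or a vector.
huber_sketch <- function(x) {
  ifelse(abs(x) <= 1, x^2, 2 * abs(x) - 1)
}

# Evaluate at a few points on both sides of the cutoff |x| = 1
huber_vals <- huber_sketch(c(-2, -1, 0, 0.5, 1, 3))
huber_vals  # 3.00 1.00 0.00 0.25 1.00 5.00

# Uncomment to draw the function on [-3, 3]:
# curve(huber_sketch(x), from = -3, to = 3)
```

Using `ifelse()` rather than `if`/`else` is a design choice here: it lets the same function be applied to a whole vector of inputs at once.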

It is often used in place of the usual squared error loss for robust estimation. Given a sample \(X_1,\ldots,X_n\), the sample average \(\bar{X}\) minimizes the squared error loss \(\sum_{i=1}^n (X_i-m)^2\) over all choices of \(m\), but it can be inaccurate as an estimate of \(\mathbb{E}(X)\) if the distribution of \(X\) is heavy-tailed. In such cases, minimizing the Huber loss \(\sum_{i=1}^n \psi(X_i-m)\) over \(m\) can give a better estimate. (Interested in hearing more? Come ask Tudor or me!)
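To make the robustness claim concrete, here is a small sketch that minimizes both losses numerically on a made-up sample containing one extreme value. The sample, the helper name `huber_sketch`, and the use of `optimize()` are all illustrative assumptions, not part of the lab.

```r
# Hypothetical helper implementing psi(x) with the cutoff at 1
huber_sketch <- function(x) {
  ifelse(abs(x) <= 1, x^2, 2 * abs(x) - 1)
}

# Made-up sample: five typical values and one extreme value
x_sample <- c(-1.2, -0.5, 0.1, 0.4, 0.9, 50)

# Total loss as a function of the candidate estimate m
squared_loss <- function(m) sum((x_sample - m)^2)
huber_loss <- function(m) sum(huber_sketch(x_sample - m))

# Minimize each loss over m, searching within the range of the data
m_squared <- optimize(squared_loss, interval = range(x_sample))$minimum
m_huber <- optimize(huber_loss, interval = range(x_sample))$minimum

mean(x_sample)  # about 8.28
m_squared       # numerically close to the sample average
m_huber         # about 0.225, close to the bulk of the data
```

The squared error minimizer is dragged toward the extreme value, while the Huber minimizer stays near the other five points because large residuals are only penalized linearly.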

Some simple function tasks

oops_df <- data.frame(
  x_vals = seq(0, 5, length.out = 21),
  huber_vals = c(0.0000, 0.0625, 0.2500, 0.5625, 1.0000, 1.5625, 2.2500,
                 3.0625, 4.0000, 5.0625, 6.2500, 7.5625, 9.0000, 10.5000,
                 12.0000, 13.5000, 15.0000, 16.5000, 18.0000, 19.5000, 
                 21.0000))

Shakespeare’s complete works

Recall, as in lab/hw from Week 1, that the complete works of William Shakespeare are available freely from Project Gutenberg. We’ve put this text file up at https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/shakespeare.txt.

Getting lines of text play-by-play

#' Get a word table from text on the web
#'
#' @param str_url string, specifying URL of a web page 
#' @param split string, specifying what to split on. Default is the regex 
#' pattern "[[:space:]]|[[:punct:]]"
#' @param tolower Boolean, TRUE if words should be converted to lower case 
#' before the word table is computed. Default is TRUE
#' @param keep_nums Boolean, TRUE if words containing numbers should be kept in 
#' the word table. Default is FALSE
#'
#' @return list, containing the word table, the number of unique words, the 
#' total number of words, and the longest word
#'
#' @examples
#' endgame_speech_list <- get_wordtab_from_url(
#'  paste0("https://raw.githubusercontent.com/benjaminleroy/",
#'  "36-350-summer-data/master/Week1/endgame.txt"))
get_wordtab_from_url <- function(str_url, split = "[[:space:]]|[[:punct:]]", 
                                 tolower = TRUE,
                                 keep_nums = FALSE) {
  lines <- readLines(str_url)
  text <- paste(lines, collapse = " ")
  words <- strsplit(text, split = split)[[1]]
  words <- words[words != ""]
  
  # Convert to lower case, if we're asked to
  if (tolower) {
    words <- tolower(words)
  }
  
  # Get rid of words with numbers, if we're asked to
  if (!keep_nums) {
    words <- grep("[0-9]", words, invert = TRUE, value = TRUE)
  }

  # Compute the word table
  wordtab <- table(words)
  
  return(list(wordtab = wordtab,
              number_unique_words = length(wordtab),
              number_total_words = sum(wordtab),
              longest_word = words[which.max(nchar(words))]))
}
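
To see what each step inside the function does without downloading anything, the same pipeline can be traced on a short in-memory string. The example sentence and the variable names `example_text` and `example_words` are made up for illustration; the splitting pattern and the steps mirror the function body above with its default arguments.

```r
# Made-up input standing in for the text read from a URL
example_text <- "To be, or not to be: that is the question. Act 3, Scene 1"

# Split on whitespace or punctuation, then drop the empty strings that
# appear wherever two delimiters are adjacent (e.g. a comma then a space)
example_words <- strsplit(example_text, split = "[[:space:]]|[[:punct:]]")[[1]]
example_words <- example_words[example_words != ""]

# Defaults of the function: lower-case everything, drop words with digits
example_words <- tolower(example_words)
example_words <- grep("[0-9]", example_words, invert = TRUE, value = TRUE)

# Word table and the same summaries the function returns
example_wordtab <- table(example_words)
length(example_wordtab)                              # 10 unique words
sum(example_wordtab)                                 # 12 words in total
example_words[which.max(nchar(example_words))]       # "question"
```

Note that "3" and "1" are removed by the digit filter, and "To" and "to" are counted together because of the lower-casing step.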

Getting word tables play-by-play

Refactoring the word table functions