Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas by 10pm next Monday (July 9).

## For reproducibility --- don't change this!

Some R basics

x_list <- list(rnorm(6), letters, sample(c(TRUE,FALSE), size=4, replace=TRUE))

Prostate cancer data set
===

OK, moving along to more interesting things! We’re going to look again, as in lab, at the prostate cancer data set: 9 variables measured on 97 men who have prostate cancer (from the book The Elements of Statistical Learning):

  1. lpsa: log PSA score
  2. lcavol: log cancer volume
  3. lweight: log prostate weight
  4. age: age of patient
  5. lbph: log of the amount of benign prostatic hyperplasia
  6. svi: seminal vesicle invasion
  7. lcp: log of capsular penetration
  8. gleason: Gleason score
  9. pgg45: percent of Gleason scores 4 or 5

The following loads this prostate cancer data set into your R session and stores it as a matrix pros_data:

pros_data <-

Computing standard deviations using iteration

pros_data_svi_sd <- vector(length = ncol(pros_data))
i <- 1
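
The two lines above set up a preallocate-then-fill pattern. Here is a minimal sketch of that pattern, assuming the goal is one standard deviation per column among the rows with svi equal to 1 (the name pros_data_svi_sd suggests this, but it is an assumption). The matrix toy_data and everything derived from it are made up for illustration and stand in for pros_data.

```r
# Hypothetical stand-in for pros_data: 6 rows, 3 columns, with a 0/1 svi column
toy_data <- cbind(lcavol = c(1.2, 0.5, 2.1, 1.8, 0.9, 2.4),
                  age    = c(50, 61, 58, 65, 70, 63),
                  svi    = c(0, 0, 1, 1, 0, 1))

# Keep only the rows where svi is 1
toy_svi <- toy_data[toy_data[, "svi"] == 1, ]

# Preallocate one numeric slot per column, then fill the slots in a loop
toy_svi_sd <- vector(mode = "numeric", length = ncol(toy_data))
for (i in seq_len(ncol(toy_data))) {
  toy_svi_sd[i] <- sd(toy_svi[, i])
}
names(toy_svi_sd) <- colnames(toy_data)
```

Using seq_len(ncol(...)) rather than 1:ncol(...) keeps the loop from running when the matrix has zero columns. The same loop could also be written as a while loop that starts at i <- 1 and increments i at the end of each pass.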

Computing t-tests using vectorization

magic_denom <- c(0.19092077, 0.08803179, 1.91148819, 0.34076326, 0.00000000,
                0.25730390, 0.15441770, 6.30903678, 0.23021447)
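
As a rough sketch of computing a two-sample t-statistic for every column at once, the code below assumes the Welch form: the difference in group means divided by sqrt(var1/n1 + var0/n0). Whether magic_denom is that denominator is an assumption, since the handout does not say how it was obtained. The objects toy_data, toy_svi_flag, toy_denom, and toy_t are hypothetical and are not part of the assignment data.

```r
# Hypothetical data: two measurement columns and a separate 0/1 group indicator
toy_data <- cbind(lcavol = c(1.2, 0.5, 2.1, 1.8, 0.9, 2.4),
                  age    = c(50, 61, 58, 65, 70, 63))
toy_svi_flag <- c(0, 0, 1, 1, 0, 1)

grp1 <- toy_data[toy_svi_flag == 1, ]
grp0 <- toy_data[toy_svi_flag == 0, ]

# Per-column denominator (assumed Welch form), computed without an explicit loop
toy_denom <- sqrt(apply(grp1, 2, var) / nrow(grp1) +
                  apply(grp0, 2, var) / nrow(grp0))

# One t-statistic per column: elementwise arithmetic on whole vectors
toy_t <- (colMeans(grp1) - colMeans(grp0)) / toy_denom
```

Each entry of toy_t can be checked against t.test() on the matching column, because t.test() uses the Welch statistic by default.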

Computing t-tests using iteration

Shakespeare’s complete works

On to the more fun stuff! As in lab, we’re going to look at William Shakespeare’s complete works, taken from Project Gutenberg. The Shakespeare data file is up on our course website; the following loads it into your R session as a string vector called shakespeare_lines:

shakespeare_lines <-

Where are Shakespeare’s plays, in this massive text?

Extracting and analysing a couple of plays

  1. Because we are testing multiple hypotheses, we use a Bonferroni correction on the false positive rate of 0.05. (This will be covered in other classes, but feel free to ask me about it.)
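
To make the correction concrete: with m tests and an overall false positive rate of 0.05, Bonferroni compares each p-value with 0.05 / m. The sketch below uses made-up p-values (toy_p is hypothetical, not a result from this assignment) and shows two equivalent ways to apply it.

```r
alpha <- 0.05
toy_p <- c(0.001, 0.012, 0.030, 0.200)  # hypothetical p-values, for illustration only
m <- length(toy_p)

# Option 1: shrink the threshold and compare each raw p-value with alpha / m
reject_by_threshold <- toy_p < alpha / m

# Option 2: inflate the p-values (capped at 1) and compare with alpha
toy_p_adj <- p.adjust(toy_p, method = "bonferroni")
reject_by_adjust <- toy_p_adj < alpha
```

Both options flag the same tests. Which one to report is a matter of presentation.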