Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Tuesday 10pm, this week.

This week’s agenda: understanding training and testing errors, implementing sample-splitting and cross-validation (optional), and trying a bunch of statistical prediction methods (also optional).

Practice with training and test errors

The code below generates and plots training and test data from a simple univariate linear model, as in lecture. (You don’t need to do anything yet.)

library(tidyverse)
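set.seed(1)
n <- 30
# Training data: x uniform on [-3, 3], y linear in x plus Gaussian noise
x <- sort(runif(n, -3, 3))
y <- 2*x + 2*rnorm(n)
# Test data: an independent draw from the same model
x0 <- sort(runif(n, -3, 3))
y0 <- 2*x0 + 2*rnorm(n)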

data_vis <- data.frame(x = c(x, x0), 
                       y = c(y, y0),
                       tt = factor(rep(c("Training data", "Test data"), 
                                       each = n),
                                   levels = c("Training data", "Test data")))

ggplot(data_vis) +
  geom_point(aes(x = x, y = y)) +
  facet_grid(~tt)
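
To make the two kinds of error concrete, here is a minimal sketch of how training and test error might be computed for this simulation. It assumes mean squared error as the error measure and a simple linear regression fit with `lm()`; the names `lm_fit`, `train_err`, `y0_hat`, and `test_err` are illustrative and not part of the lab. The simulation is repeated so the block runs on its own.

```r
# Regenerate the simulated data so this block is self-contained
set.seed(1)
n <- 30
x <- sort(runif(n, -3, 3))
y <- 2*x + 2*rnorm(n)
x0 <- sort(runif(n, -3, 3))
y0 <- 2*x0 + 2*rnorm(n)

# Fit a linear model using the training data only
lm_fit <- lm(y ~ x)

# Training error: mean squared error on the points used for fitting
train_err <- mean((y - fitted(lm_fit))^2)

# Test error: mean squared error on the held-out test points
y0_hat <- predict(lm_fit, newdata = data.frame(x = x0))
test_err <- mean((y0 - y0_hat)^2)

c(train = train_err, test = test_err)
```

Because least squares chooses the line that minimizes squared error on the training points, the training error tends to be an optimistic estimate of how the model does on new data; the test error is computed on points the fit never saw.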

Sample-splitting with the prostate cancer data

Below, we read in data on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). (You don’t need to do anything yet.)

pros_df <- read.table(
  "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
dim(pros_df)
## [1] 97 10
head(pros_df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE
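
The `train` column shown above marks which rows belong to a training set, which suggests one way sample-splitting could be carried out. The sketch below is one possible approach under stated assumptions, not the lab's required solution: `split_errors()` is a hypothetical helper, it assumes a logical `train` column and mean squared error, and `toy_df` is simulated stand-in data so the block runs without downloading anything.

```r
# Hypothetical helper (not part of the lab): fit a linear model on rows where
# train is TRUE, then report mean squared error on both halves of the split
split_errors <- function(df, formula, response) {
  df_train <- df[df$train, ]
  df_test <- df[!df$train, ]
  fit <- lm(formula, data = df_train)
  train_err <- mean((df_train[[response]] - predict(fit, newdata = df_train))^2)
  test_err <- mean((df_test[[response]] - predict(fit, newdata = df_test))^2)
  c(train = train_err, test = test_err)
}

# Simulated stand-in data with a logical train column (an assumption made
# purely for illustration)
set.seed(2)
toy_df <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy_df$y <- 1 + toy_df$x1 - toy_df$x2 + rnorm(40)
toy_df$train <- rep(c(TRUE, FALSE), times = c(30, 10))

errs <- split_errors(toy_df, y ~ x1 + x2, "y")
errs
```

With the prostate data, a call might look like `split_errors(pros_df, lpsa ~ lcavol + lweight, "lpsa")`; the choice of response and predictors there is only an example.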