Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Tuesday 10pm, this week.

This week’s agenda: investigating the differences between data frames and matrices; practicing how to use the apply family of functions.

Crime data set

We’re going to look at a data set containing the number of arrests for assault, murder, and rape per 100,000 residents, in each of the 50 US states in 1973, along with the percentage of each state’s population living in urban areas. This comes from a built-in data frame called USArrests. We’ll copy it into a new data frame called crime_df and append a column that gives the region for each state, from the built-in vector state.region. You can learn more about this crime data set by typing ?USArrests into your R console.

crime_df <- data.frame(USArrests, Region = state.region)
# or crime_df <- cbind(data.frame(USArrests), Region = state.region)
# or (will learn later) crime_df <- data.frame(USArrests) %>% mutate(Region = state.region)
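
Before going further, it can help to confirm that the new data frame looks the way we expect. The following is a minimal sketch using only base R functions. The check that the row names of USArrests line up with the built-in vector state.name is our own addition; it is there because state.region is matched to the rows purely by position.

```r
# Rebuild the data frame so this block runs on its own
crime_df <- data.frame(USArrests, Region = state.region)

# Size and column names: one row per state, four original columns plus Region
dim(crime_df)
names(crime_df)

# Region is stored as a factor, since state.region is a factor
class(crime_df$Region)

# Sanity check (our addition): the states in USArrests appear in the same
# order as in state.name, so state.region lines up row by row
all(rownames(USArrests) == state.name)

# Peek at the first few rows
head(crime_df, 3)
```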

Data frame basics

Apply functions versus for() loops

The purpose of the next several questions is to help you internalize how the apply functions (specifically apply(), sapply(), lapply(), and tapply()) are essentially convenient ways to write for() loops.

Here’s an example to get us started. Consider the following list, called lis, which contains 4 vectors of 5 randomly generated numbers.

set.seed(10)
lis <- list(rnorm(5), rnorm(5), rnorm(5), rnorm(5))
lis
## [[1]]
## [1]  0.01874617 -0.18425254 -1.37133055 -0.59916772  0.29454513
## 
## [[2]]
## [1]  0.3897943 -1.2080762 -0.3636760 -1.6266727 -0.2564784
## 
## [[3]]
## [1]  1.1017795  0.7557815 -0.2382336  0.9874447  0.7413901
## 
## [[4]]
## [1]  0.08934727 -0.95494386 -0.19515038  0.92552126  0.48297852

Suppose we wanted to compute the mean of each vector (so we’re looking for 4 numbers). We could do this using a for() loop in the following way, storing the results in mean_vector.

mean_vector <- vector(length = length(lis), mode = "numeric")
for (i in seq_along(lis)) {
  mean_vector[i] <- mean(lis[[i]])
}
mean_vector
## [1] -0.36829190 -0.61302179  0.66963246  0.06955056

We could also do this more simply with a call to sapply(), storing the result as mean_vector2. This gives us exactly the same answer.

mean_vector2 <- sapply(lis, mean)
all.equal(mean_vector, mean_vector2)
## [1] TRUE

We’re going to ask you to emulate this for each of 3 other apply functions (lapply(), apply() and tapply()) in the next 3 questions. Your goal will be to compute something using both one of the apply functions and a for() loop, and show that the two results are the same. The tricky part here will be formatting the for() loop properly so that its result exactly matches the apply function’s output.
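
To see why the formatting matters, here is a small sketch of one way the two approaches can disagree even when the numbers are the same. The list named_lis is a hypothetical example made up for illustration; it is not part of the lab data. When the input list has names, sapply() carries them over to its result, while a plain for() loop does not unless we add them ourselves.

```r
# Hypothetical named list, for illustration only
named_lis <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# sapply() keeps the names of the list on its output
apply_means <- sapply(named_lis, mean)

# The for() loop fills an unnamed numeric vector
loop_means <- numeric(length(named_lis))
for (i in seq_along(named_lis)) {
  loop_means[i] <- mean(named_lis[[i]])
}

# Same values, but the names attribute differs
identical(apply_means, loop_means)
## [1] FALSE

# Copying the names over makes the two results match exactly
names(loop_means) <- names(named_lis)
identical(apply_means, loop_means)
## [1] TRUE
```

The same idea applies to the questions that follow: check the type, names, and dimensions of what the apply function returns, and set up the container for the for() loop to match.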