Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas by 10pm next Monday (July 15).

States data set

Below we construct a data frame, of 50 states x 10 variables. The first 8 variables are numeric and the last 2 are factors. The numeric variables here come from the built-in state.x77 matrix, which records various demographic factors on 50 US states, measured in the 1970s. You can learn more about this state data set by typing ?state.x77 into your R console.

state_df <- data.frame(state.x77, Region = state.region, 
                       Division = state.division)
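As a quick check of the description above, the sketch below rebuilds state_df and confirms its dimensions and column types. It uses only built-in data sets; the name col_classes is introduced here purely for illustration.

```r
# Rebuild the data frame from the built-in state data sets
state_df <- data.frame(state.x77, Region = state.region,
                       Division = state.division)

# Dimensions: 50 states (rows) by 10 variables (columns)
dim(state_df)

# Class of each column (col_classes is a hypothetical name)
col_classes <- sapply(state_df, class)
table(col_classes)

# First few rows; note that data.frame() turns "Life Exp" into "Life.Exp"
head(state_df, 3)
```

Checking the column classes first is a cheap way to see which columns can go into numeric summaries and which should be treated as grouping variables.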

Basic data frame manipulations

Prostate cancer data set

Let’s return to the prostate cancer data set that we looked at in the lab/homework from Week 2 (taken from the book The Elements of Statistical Learning). Below we read in a data frame of 97 men x 9 variables. You can remind yourself about what’s been measured by looking back at the lab/homework (or by visiting the URL linked above in your web browser, clicking on “Data” on the left-hand menu, and clicking “Info” under “Prostate”).

pros_data <- 

Practice with the apply family

# Welch two-sample t-test comparing x where ind == 0 against x where ind == 1
t_test_by_ind <- function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  return(t.test(x[ind == 0], x[ind == 1]))
}
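To illustrate how a function like this can be combined with the apply family, here is a minimal sketch on simulated data. The matrix sim_mat, its column names, and the shift of 2 are assumptions made up for this example; they are not part of the homework data.

```r
# Same function as above, repeated so that this sketch runs on its own
t_test_by_ind <- function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  return(t.test(x[ind == 0], x[ind == 1]))
}

set.seed(1)                                # simulated data, for illustration only
ind <- rep(c(0, 1), each = 20)             # 20 observations per group
sim_mat <- cbind(shifted = rnorm(40) + 2 * ind,  # group 1 shifted up by 2
                 noise_a = rnorm(40),
                 noise_b = rnorm(40))

# A single test on one column
res <- t_test_by_ind(sim_mat[, "shifted"], ind)
res$p.value

# One test per column: apply() hands each column to the anonymous function,
# which runs the test and keeps only the p-value
p_values <- apply(sim_mat, 2, function(col) t_test_by_ind(col, ind)$p.value)
p_values
```

Extracting the p-value inside the anonymous function means apply() returns a plain named numeric vector rather than a list of test objects, which is usually easier to inspect.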

Rio Olympics data set

We’re going to examine data from the 2016 Summer Olympics in Rio de Janeiro, taken from (itself put together by scraping the official Summer Olympics website for information about the athletes). Below we read in the data and store it as rio.

rio <- read.csv("")

More practice with data frames and apply

Some advanced practice with apply

Plotting tools

Below, we read in a data set, as in lab, that is related to shark attacks. The data is taken from Kaggle and was originally compiled by the Global Shark Attack File. More information is available here.

shark_attacks <- read.csv("", = TRUE)

One can alternatively use the round function, as suggested in the prompt. Because a value ending in 5 lies exactly halfway between two multiples of ten, rounding it is ambiguous, so the following line assigns a few observations to decades different from the ones above.
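A minimal sketch of how the two approaches can disagree, using hypothetical years rather than the shark attack data. The names decade_floor and decade_round, and the particular round-based formula, are assumptions for illustration; the prompt's exact expression may differ. How R resolves exact ties is described in ?round, so the sketch prints the comparison rather than stating the tie results.

```r
# Hypothetical years, chosen to include values ending in 0 and in 5
years <- c(1970, 1971, 1975, 1979, 1980, 1985, 1990)

# Floor-based decade: every year from 1970 to 1979 maps to 1970
decade_floor <- floor(years / 10) * 10

# One possible round-based variant (an assumption, not necessarily the prompt's):
# shift by 5, then round to the nearest ten
decade_round <- round(years - 5, -1)

# Side-by-side comparison; any disagreement can only occur for years ending
# in 0, because those become exact ties (ending in 5) after the shift
data.frame(years, decade_floor, decade_round,
           differs = decade_floor != decade_round)
```

The floor-based version has no ties to resolve, which makes it the more predictable choice when the decade boundaries matter.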

More plots

Below you’ll find a data set that contains a subset of the passengers on the Titanic, with information about each of them and whether they survived. The data set was taken from Kaggle, and you can find more information about it here:

library(ggplot2)  # provides ggplot(), aes(), and geom_bar()
titanic <- read.csv("")
ggplot(data = titanic,
       aes(x = Pclass, fill = factor(Survived))) + geom_bar()

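# Reshape the volcano matrix to long format: one row per (row, column) cell
my_volcano <- tidyr::gather(data.frame(row = 1:nrow(volcano), volcano), 
                            key = "column", value = "value", -row)
# Column names arrive as "X1", "X2", ...; drop the "X" and convert to numeric
my_volcano$column <- as.numeric(substr(my_volcano$column, 2, nchar(my_volcano$column)))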

Shakespeare and overlaid histograms (challenge)

Notes: 1. The default y for geom_histogram is “..count..”. 2. geom_histogram(position="identity") overlays the histograms instead of stacking them.
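The second note can be sketched as follows, assuming ggplot2 is installed. The data frame sim_df and its two groups are simulated for illustration and are not the Shakespeare data; the alpha and bins values are arbitrary choices.

```r
library(ggplot2)

set.seed(1)
# Simulated values for two hypothetical groups (not the Shakespeare data)
sim_df <- data.frame(
  value = c(rnorm(200, mean = 0), rnorm(200, mean = 1.5)),
  group = rep(c("A", "B"), each = 200)
)

# y is left unspecified, so geom_histogram() plots counts by default.
# position = "identity" draws each group from the baseline (overlaid),
# and alpha < 1 keeps the overlapping bars visible.
p <- ggplot(sim_df, aes(x = value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30)
p
```

Without position = "identity", the bars for group B would be stacked on top of those for group A, so the height of each bar would show the combined count rather than each group's own count.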

Reading (challenge 2)

Go to the Homework 2 assignment page and download and read “Tufte Chapter 9”. This chapter is about making better graphics. Have you ever visualized something that would have been better presented as a table of a few values? How about the “Unfriendly” graphic attributes on page 183? If you remember such a graphic, how might you update it using the recommendations from class and this chapter? This book was written a while ago: which ideas hold up (in your opinion) and which don’t? That is, what seems less important to you than you expect Tufte would have thought (or vice versa)?