Name:
Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas by 10pm next Monday (July 15).

# States data set

Below we construct a data frame, of 50 states x 10 variables. The first 8 variables are numeric and the last 2 are factors. The numeric variables here come from the built-in state.x77 matrix, which records various demographic factors on 50 US states, measured in the 1970s. You can learn more about this state data set by typing ?state.x77 into your R console.

state_df <- data.frame(state.x77, Region = state.region,
Division = state.division)

# Basic data frame manipulations

• 1a. Add a column to state_df, containing the state abbreviations that are stored in the built-in vector state.abb. Name this column Abbr. You can do this in (at least) two ways: by using a call to data.frame(), or by directly defining state_df$Abbr. Display the first 3 rows and all 11 columns of the new state_df.

• 1b. Remove the Region column from state_df. You can do this in (at least) two ways: by using negative indexing, or by directly setting state_df$Region to be NULL. Display the first 3 rows and all 10 columns of state_df.

• 1c. Add two columns to state_df, containing the x and y coordinates (longitude and latitude, respectively) of the center of the states, that are stored in the (existing) list state.center. Hint: take a look at this list in the console, to see what its elements are named. Name these two columns Center_x and Center_y. Display the first 3 rows and all 12 columns of state_df.

• 1d. Make a new data.frame which contains only those states whose longitude is less than -100. Do this in two different ways and check that they are equal to each other, using an appropriate function call.
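As a reminder of the mechanics involved in Questions 1a to 1d, here is a minimal sketch on a small made-up data frame. The data frame toy, its columns, and its values are hypothetical and have nothing to do with the states data; the point is only to show column addition, column removal, row filtering, and an equality check.

```r
# Hypothetical toy data (not the states data).
toy <- data.frame(id = 1:4, x = c(-120, -95, -110, -80))

# Two ways to add a column: direct assignment, or a call to data.frame().
toy$label <- c("a", "b", "c", "d")
toy2 <- data.frame(toy, flag = toy$x < -100)

# Two ways to remove a column: negative indexing, or setting it to NULL.
toy_a <- toy2[, -which(names(toy2) == "label")]
toy_b <- toy2
toy_b$label <- NULL

# Two ways to keep only some rows: logical indexing, or subset().
west1 <- toy2[toy2$x < -100, ]
west2 <- subset(toy2, x < -100)

# all.equal() compares two objects up to a small numeric tolerance.
same_cols <- all.equal(toy_a, toy_b)
same_rows <- all.equal(west1, west2)
```

identical() is a stricter alternative to all.equal(): it requires the two objects to match exactly, attributes included.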

# Prostate cancer data set

Let’s return to the prostate cancer data set that we looked at in the lab/homework from Week 2 (taken from the book The Elements of Statistical Learning). Below we read in a data frame of 97 men x 9 variables. You can remind yourself about what’s been measured by looking back at the lab/homework (or by visiting the URL linked above in your web browser, clicking on “Data” on the left-hand menu, and clicking “Info” under “Prostate”).

pros_data <-
read.table("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/pros.dat")

# Practice with the apply family

• 2a. Using sapply(), calculate the mean of each variable. Also, calculate the standard deviation of each variable. Each should require just one line of code. Display your results.

• 2b. Now, use lapply() to perform t-tests for each variable in the data set, between SVI and non-SVI groups. To be precise, you will perform a t-test for each variable excluding the SVI variable itself. For convenience, we’ve defined a function t_test_by_ind() below, which takes a numeric variable x, and then an indicator variable ind (of 0s and 1s) that defines the groups. Run this function on the columns of pros_data, and save the result as tests. What kind of data structure is tests? Print it to the console.

t_test_by_ind <- function(x, ind) {
stopifnot(all(ind %in% c(0, 1)))
return(t.test(x[ind == 0], x[ind == 1]))
}
• Challenge. Using an appropriate apply function again, extract the p-values from the tests object you created in the last question, with just a single line of code. Hint: run the command "[["(pros_data, "lcavol") in your console—what does this do?
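To illustrate the apply-family patterns used in this section, here is a sketch on the built-in mtcars data rather than on pros_data. The choice of columns (mpg, hp, wt) and the use of the 0/1 column am as the grouping indicator are assumptions made purely for illustration.

```r
# Three numeric columns from the built-in mtcars data (illustrative choice).
num_vars <- mtcars[, c("mpg", "hp", "wt")]

# sapply() applies a function to each column and simplifies to a named vector.
means <- sapply(num_vars, mean)
sds <- sapply(num_vars, sd)

# lapply() always returns a list; extra arguments are passed after the function.
two_group_test <- function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  t.test(x[ind == 0], x[ind == 1])
}
tests <- lapply(num_vars, two_group_test, ind = mtcars$am)

# `[[` is itself a function, so it can be handed to sapply() to pull
# the same named element out of every list entry.
stats <- sapply(tests, `[[`, "statistic")
```

Each element of tests is an object of class "htest", which is a list underneath; names(tests[[1]]) shows which components can be extracted this way.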

# Rio Olympics data set

We’re going to examine data from the 2016 Summer Olympics in Rio de Janeiro, taken from https://github.com/flother/rio2016 (itself put together by scraping the official Summer Olympics website for information about the athletes). Below we read in the data and store it as rio.

rio <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week2/rio.csv")

# More practice with data frames and apply

• 3a. Call summary() on rio and display the result. Is there any missing data?

• 3b. Use rio to answer the following questions. How many athletes competed in the 2016 Summer Olympics? How many countries were represented? What were these countries, and how many athletes competed for each one? Which country brought the most athletes, and how many was this?

• 3c. How many medals of each type—gold, silver, bronze—were awarded at this Olympics? Is this result surprising, and can you explain what you are seeing?

• 3d. Create a column called total which adds the number of gold, silver, and bronze medals for each athlete, and add this column to rio. Which athlete had the most medals, and how many was this? Which athlete had the most silver medals, and how many was this? (Ouch! So close, so many times …) In the case of ties here, display all the relevant athletes.

• 3e. Using tapply(), calculate the total medal count for each country. Save the result as total_by_nat, and print it to the console. Which country had the most medals, and how many was this? How many countries had zero medals? Challenge: among the countries that had zero medals, which had the most athletes, and how many athletes was this? (Ouch!)
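The counting and grouping operations in this section can be sketched on a tiny made-up table. Everything in toy below (names, teams, medal counts) is hypothetical and does not come from the rio data; it only shows how table(), tapply(), and a comparison against max() behave, including when there are ties.

```r
# Hypothetical toy data (not the rio data).
toy <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  team = c("X", "X", "Y", "Y", "Z"),
  gold = c(1, 0, 2, 0, 0),
  silver = c(0, 1, 0, 2, 0)
)

# table() counts how many rows fall in each group.
team_sizes <- table(toy$team)

# A derived column is just a sum of existing columns.
toy$total <- toy$gold + toy$silver

# Comparing against max() keeps every tied row, unlike which.max().
top_rows <- toy[toy$total == max(toy$total), ]

# tapply(values, groups, function) summarizes values within each group.
by_team <- tapply(toy$total, toy$team, sum)
best_team <- names(by_team)[by_team == max(by_team)]
n_zero <- sum(by_team == 0)
```

Here top_rows contains both C and D, since they tie at a total of 2, which is why the comparison form is used rather than which.max().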

# Some advanced practice with apply

• 4a. The variable date_of_birth contains strings of the date of birth of each athlete. Use text processing commands to extract the year of birth, and create a new numeric variable called age, equal to 2016 - (the year of birth). (Here we’re ignoring days and months for simplicity.) Add the age variable to the rio data frame. Who is the oldest athlete, and how old is he/she? Who is the youngest athlete, and how old is he/she? In the case of ties here, display all the relevant athletes. Challenge: Answer the same questions, but now only among athletes who won a medal.

• 4b. Create a new data.frame called sports, which we’ll populate with information about each sporting event at the Summer Olympics. Initially, define sports to contain a single variable called sport which contains the names of the sporting events in alphabetical order. Then, add a column called n_participants which contains the number of participants in each sport. Use one of the apply functions to determine the number of gold medals given out for each sport, and add this as a column called n_gold. Using your newly created sports data frame, calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest ratio? Which has the lowest?
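The two building blocks of this section, pulling a year out of a date string and assembling a per-group summary data frame, can be sketched as follows. The date strings are assumed to start with a four-digit year (for example "1990-05-17"); check the actual format of date_of_birth before relying on that. All values below are made up.

```r
# Hypothetical date strings, assumed to begin with a four-digit year.
dob <- c("1990-05-17", "1985-12-01", "2000-01-30")
birth_year <- as.numeric(substr(dob, 1, 4))
age <- 2016 - birth_year

# Hypothetical per-athlete records: one sport and one gold count each.
sport <- c("rowing", "judo", "rowing", "golf")
gold <- c(1, 0, 1, 0)

# table() sorts group names alphabetically, so the summary starts out ordered.
counts <- table(sport)
sports_toy <- data.frame(
  sport = names(counts),
  n_participants = as.vector(counts)
)

# tapply() orders its groups the same way, so the columns line up.
sports_toy$n_gold <- as.vector(tapply(gold, sport, sum))
sports_toy$ratio <- sports_toy$n_gold / sports_toy$n_participants
```

Because table() and tapply() both order groups by the sorted unique values of sport, the new columns can be attached without an explicit merge.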

# Plotting tools

Below, we read in a data set, as in lab, that is related to shark attacks. The data is taken from Kaggle and was originally compiled by the Global Shark Attack File. More information is available here.

shark_attacks <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week2/shark-attacks.csv", as.is = TRUE)

• 5a. As we did in Lab (2.2, Question 3a), define attack_time to be the Date column of shark_attacks. Define a character vector attack_year to contain the first 4 characters of each entry of attack_time. Hint: hopefully you haven’t forgotten … use substr() here. Finally, convert attack_year into a numeric vector. Do a similar thing for attack_month. Then append these onto the shark_attacks data frame (as year and month). We will use these added columns later.

• 5b. Additionally, add a new column to the shark_attacks data frame named decade that contains the decade in which the attack happened (try the floor function). Make a new data.frame called shark_attacks_above which only contains individuals with ages not equal to -5. Using one of the apply functions and shark_attacks_above, compute the average age of the victim for each decade and store it in average_age_by_decade.

One can alternatively use the round function instead of floor. Because of ambiguity in how values ending in 5 are rounded, doing so yields a few decades that differ from the ones obtained with floor.

• 5c. Make a bar chart to inform yourself of the number of reported attacks per decade (use the shark_attacks_above data frame). With this visual in mind, do you have any comments about the values in average_age_by_decade?

• 5d. In lab, we examined the distribution of Age with a histogram, and conditional on gender in a scatter plot. Use geom_boxplot to examine the distribution of Age conditional on gender (continue using the shark_attacks_above data frame). Try adding coord_flip to your figure as well. How could we make the figure show just one boxplot? (Plot this as well.)

• 5e. Remember that in class we showed how to label all parts of a plot using + labs(). Update the plots in this problem with quality labels (in their respective locations, not in 5e).
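The pieces used across Questions 5b to 5e (binning years into decades, a per-decade average, a bar chart, a flipped boxplot, and labels) can be sketched on a few made-up records. The data frame toy and its values are hypothetical, and the label text is only a placeholder; this assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical toy records; none of these values come from the shark attack data.
toy <- data.frame(
  year = c(1961, 1968, 1975, 1979, 1983, 1984),
  age = c(20, 30, 25, 35, 40, 18),
  sex = c("M", "F", "M", "M", "F", "M")
)

# floor() always rounds down, so 1975 and 1979 both land in 1970.
toy$decade <- floor(toy$year / 10) * 10

# Mean age within each decade.
avg_age <- tapply(toy$age, toy$decade, mean)

# Bar chart of the number of records per decade, with labels via labs().
p_bar <- ggplot(toy, aes(x = factor(decade))) +
  geom_bar() +
  labs(title = "Records per decade (toy data)",
       x = "Decade", y = "Number of records")

# Boxplots of age by group; coord_flip() swaps the two axes.
p_box <- ggplot(toy, aes(x = sex, y = age)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Age by sex (toy data)", x = "Sex", y = "Age")
```

Printing p_bar or p_box (or typing the name at the console) draws the figure. Note that after coord_flip() the x label appears on the vertical axis.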

# More plots

Below you’ll find a data set that contains a subset of the passengers on the Titanic, with information about each of them and whether they survived. The data set was taken from Kaggle, and you can find more information about it here: https://www.kaggle.com/c/titanic/data.

titanic <- read.csv("https://raw.githubusercontent.com/benjaminleroy/stat315summer_data/master/assignments/assignment03/titanic.csv")

• 6a. Does class (Pclass) affect whether you survived or not? Below is a stacked bar chart that tries to answer this question. Replace factor(Survived) with just Survived. What happened, and why do you think this happened?

library(ggplot2)
ggplot(data = titanic,
aes(x = Pclass, fill = factor(Survived))) + geom_bar()

• 6b. In geom_bar you can also set position = 'fill' or position = 'dodge'. Do so for the above graphic and describe what these 2 variants do. Which do you think best answers the question in part 6a, that is, “Does class (Pclass) affect whether you survived or not?”

• 6c. During the time of the Titanic, there was a code of conduct called Women and Children First. As seen in lecture, facet (hint: facet_grid) on the gender of the individual to see how gender relates to the interaction of class and survival (please use the default position). Specifically, try both ~Sex and Sex~. and select the one that makes the relationship clearer. We can also add marginals by doing + facet_grid(..., margins = TRUE); try it.

• 6f. The volcano object in R is a matrix of dimension 87 x 61. It is a digitized version of a topographic map of the Maungawhau volcano in Auckland, New Zealand. To visualize it in ggplot, we need to do a slight transformation of the object into a different format (we’ll learn how tidyr::gather works next week). Examine both volcano and the data.frame my_volcano created below. Express in words what data transformation occurred. Next, using geom_tile, create an image of the volcano (hint: fill = value). Note that the first row of the volcano matrix is actually drawn as the row at the bottom of the displayed image; make a minor correction to the aesthetic for the row to rotate the image to the correct orientation.

library(tidyr)
my_volcano <- tidyr::gather(data.frame(row = 1:nrow(volcano), volcano),
key = "column", value = "value", -row)
my_volcano$column <- as.numeric(substr(my_volcano$column, 2, nchar(my_volcano$column)))
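For Questions 6a to 6c, the bar position settings and faceting can be tried out first on a data set other than titanic. The sketch below uses the built-in mtcars data; the choice of cyl, am, and gear as stand-ins for class, outcome, and facet variable is an assumption made only for illustration, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Convert the grouping columns to factors up front so ggplot treats them
# as discrete categories.
cars <- data.frame(
  cyl = factor(mtcars$cyl),
  am = factor(mtcars$am),
  gear = factor(mtcars$gear)
)

# One base plot, three bar position settings; print each one to compare.
base <- ggplot(cars, aes(x = cyl, fill = am))
p_stack <- base + geom_bar()                   # default position
p_fill <- base + geom_bar(position = "fill")
p_dodge <- base + geom_bar(position = "dodge")

# facet_grid() takes a rows ~ columns formula; "." means no split on that side.
# margins = TRUE adds an extra panel that pools all facet levels.
p_cols <- p_stack + facet_grid(. ~ gear, margins = TRUE)
p_rows <- p_stack + facet_grid(gear ~ .)
```

Printing the five plot objects side by side is a quick way to see how each setting changes the same underlying counts.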

# Shakespeare and overlaid histograms (challenge)

• 7. Return to the Shakespeare data set from lab/homework in Week 1 (taken from Project Gutenberg). Following the commands you worked out in lab/homework, extract the text for at least two of Shakespeare’s plays. Then, using table(), compute counts of the word lengths, separately, for each play you are considering. Produce a plot that displays histograms of the word lengths, i.e., one histogram for each play, overlaid. In the geom, use aes(..., y = ..density..) (so that all histograms are on the probability scale, rather than the frequency scale). Set the title and axes labels appropriately. Use transparent colors. Describe any differences/similarities that you are seeing between plays, according to Shakespeare’s word length usage.

Notes: 1. The default for geom_histogram’s y is “..count..”. 2. geom_histogram(position = "identity") makes stacked histograms non-stacked.
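The overlaid-histogram mechanics can be sketched on two short made-up sentences instead of full plays. The strings, the splitting on single spaces, and the label text are all assumptions for illustration. The sketch writes the density mapping as after_stat(density), which is the newer ggplot2 spelling of ..density.. (available from ggplot2 3.3.0); either form expresses the same idea.

```r
library(ggplot2)

# Two hypothetical snippets of text standing in for two plays.
text_a <- "the quick brown fox jumps over the lazy dog"
text_b <- "a considerably longer vocabulary produces different distributions"

# Split on single spaces; real text needs more careful tokenization.
words_a <- strsplit(text_a, " ")[[1]]
words_b <- strsplit(text_b, " ")[[1]]

# One row per word, recording its length and which text it came from.
len_df <- data.frame(
  length = c(nchar(words_a), nchar(words_b)),
  source = rep(c("A", "B"), times = c(length(words_a), length(words_b)))
)

# Word-length counts, separately for each text.
len_counts <- table(len_df$length, len_df$source)

# Overlaid (not stacked) histograms on the density scale, with transparency.
p_hist <- ggplot(len_df, aes(x = length, fill = source)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1,
                 position = "identity", alpha = 0.5) +
  labs(title = "Word lengths by text (toy data)",
       x = "Word length (characters)", y = "Density", fill = "Text")
```

The alpha argument controls transparency, which is what lets both histograms remain visible where they overlap.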