This week’s agenda: learning to master pipes and dplyr.
# Load the tidyverse!
library(tidyverse)
For each of the following code blocks, which are written with pipes, write equivalent code in base R (to do the same thing).
letters %>%
toupper %>%
paste(collapse="+")
## [1] "A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+U+V+W+X+Y+Z"
" Ceci n'est pas une pipe " %>%
gsub("une", "un", .) %>%
trimws
## [1] "Ceci n'est pas un pipe"
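As a warm-up for translating between the two styles, here is a minimal sketch on made-up inputs (not taken from the lab). It assumes only that the magrittr pipe is available, which library(tidyverse) also provides.

```r
library(magrittr)  # provides %>%; also attached by library(tidyverse)

x <- c(1, 4, 9)  # made-up input for illustration

# Piped form: read left to right
piped <- x %>% sqrt %>% sum

# Equivalent base R: read inside out
nested <- sum(sqrt(x))

# When the input is not the first argument, the dot marks where it goes
piped_dot  <- "a-b-c" %>% gsub("-", "+", .)
nested_dot <- gsub("-", "+", "a-b-c")

identical(piped, nested)
identical(piped_dot, nested_dot)
```

Both comparisons should come out TRUE: the pipe only changes how the calls are written, not what they compute.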
rnorm(1000) %>%
data.frame(x = .) %>%
ggplot(.) + # I'm giving you a hint here
geom_histogram(aes(x = x, y = ..density..)) +
labs(title = "N(0,1) draws")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
rnorm(1000) %>%
hist(breaks=30, plot=FALSE) %>% # use ?hist to figure out what this returns when plot=FALSE
.[["density"]] %>%
max
## [1] 0.405
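The .[["density"]] step above works because the dot stands for whatever came down the pipe, so it can be indexed like any other object. A small sketch with a hypothetical list (the names counts and label are invented for illustration):

```r
library(magrittr)

# Hypothetical list standing in for any object with named components
res <- list(counts = c(2, 5, 3), label = "toy")

# Pipe form: extract one component with the dot, then pass it on
piped_max <- res %>% .[["counts"]] %>% max

# Base R equivalent
base_max <- max(res[["counts"]])

identical(piped_max, base_max)
```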
For each of the following code blocks, which are written in base R, write equivalent code with pipes (to do the same thing).
Hint: use the dot ., as seen above in Q1b, or in the lecture notes.
paste("Your grade is", sample(c("A","B","C","D","R"), size = 1))
## [1] "Your grade is D"
Hint: use the dot . again, in order to index state.name directly in the last pipe command.
state.name[which.max(state.x77[,"Illiteracy"])]
## [1] "Louisiana"
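The dot can also appear inside square brackets, which lets a piped value serve as an index into some other vector. The sketch below uses invented rainfall numbers and the built-in month.name vector rather than the state data.

```r
library(magrittr)

# Made-up values for January, February, March
rainfall <- c(3.1, 9.4, 4.0)

# Pipe form: the position of the maximum flows into the index slot
wettest <- rainfall %>% which.max %>% month.name[.]

# Base R equivalent
wettest_base <- month.name[which.max(rainfall)]

identical(wettest, wettest_base)
```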
Hint: if x is a list of length 1, then x[[1]] is the same as unlist(x).
str_url <-
paste0("https://raw.githubusercontent.com/benjaminleroy/",
"36-350-summer-data/master/Week1/endgame.txt")
# Base R:
lines <- readLines(str_url)
text <- paste(lines, collapse = " ")
words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
wordtab <- table(words)
wordtab <- sort(wordtab, decreasing = TRUE)
head(wordtab, 10)
## words
##         the    to     I     a   and   you     s    of    it
## 10146   780   553   478   466   408   375   374   282   261
The code below adds the line words = words[words != ""]. This line is a bit tricky to do with pipes: use the dot . once more, and manipulate it as if it were a variable name.
# Base R:
lines <- readLines(str_url)
text <- paste(lines, collapse = " ")
words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
words <- words[words != ""]
wordtab <- table(words)
wordtab <- sort(wordtab, decreasing = TRUE)
head(wordtab, 10)
## words
## the to I a and you s of it in
## 780 553 478 466 408 375 374 282 261 251
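Treating the dot as a variable name means it may appear more than once in a single step, for example both as the object being subset and inside the condition. A minimal sketch on a made-up numeric vector:

```r
library(magrittr)

v <- c(5, -2, 7, -1)  # made-up input for illustration

# The first dot is the vector being subset; the second builds the logical mask
positives <- v %>% .[. > 0]

# Base R equivalent
positives_base <- v[v > 0]

identical(positives, positives_base)
```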
With your vectorized function in hand, we’ll start exploring the dplyr tools we saw in lecture.
4a. Quick clean up. Convert the Time column in shark_attacks to numeric using time_factor_to_numeric_vectorized. Hint: use mutate_at(), and reassign shark_attacks to be the output.
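For reference, this is roughly how mutate_at() applies a function to a named column. The data frame toy and the helper to_num are hypothetical stand-ins (they are not the lab's shark_attacks or time_factor_to_numeric_vectorized). In dplyr 1.0 and later, mutate_at() is superseded by mutate(across(...)), but it still works.

```r
library(dplyr)

# Hypothetical data: a factor column that should be numeric
toy <- tibble(id = 1:3, Time = factor(c("10", "14", "9")))

# Hypothetical vectorized converter (stand-in for the lab's own function)
to_num <- function(x) as.numeric(as.character(x))

# Apply the converter to the Time column and reassign the result
toy <- toy %>% mutate_at("Time", to_num)

class(toy$Time)
```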
4b. Using dplyr’s mutate function, create 3 new numeric columns year, month, and day for the shark_attacks data.frame. Before you do, check the class of the Date column using dplyr pipes and the pull function.
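A sketch of the two steps on a hypothetical data frame whose Date column is already of class Date. That is an assumption: the lab's column may be stored differently, in which case it would need converting first.

```r
library(dplyr)

# Hypothetical data; assumes the column is already a Date
toy <- tibble(Date = as.Date(c("2015-07-04", "2018-01-23")))

# pull() extracts a single column as a vector, so class() can inspect it
date_class <- toy %>% pull(Date) %>% class

# format() returns character strings, so wrap each piece in as.numeric()
toy <- toy %>%
  mutate(year  = as.numeric(format(Date, "%Y")),
         month = as.numeric(format(Date, "%m")),
         day   = as.numeric(format(Date, "%d")))
```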
4c. Suppose you’re writing a story about fatal shark attacks in the United States, and you want to track down the families of some of the victims. For journalistic purposes, it’s probably best to select families based on how recently the attack happened. First, create a new data.frame fatal_shark_attacks_usa that contains information on all attacks in the US that were fatal.
4d. Using fatal_shark_attacks_usa, produce a table of the individuals who meet the rest of the conditions above, and show the top 3 individuals’ Name, Age, Area of attack, and year of attack. Hint: this will use arrange, select, and slice. I expect this will take some thinking; it may help to pull out a piece of paper, write down the goals, and decide how they relate to dplyr functions.
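The arrange, select, slice pattern from the hint can be sketched on invented data as follows. The column names mirror the question, but the rows are made up.

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(Name = c("A", "B", "C", "D"),
              Age  = c(30, 25, 41, 19),
              year = c(2001, 2015, 1998, 2010))

top2 <- toy %>%
  arrange(desc(year)) %>%   # most recent first
  select(Name, year) %>%    # keep only the columns of interest
  slice(1:2)                # take the first two rows after sorting

top2
```

The order of the steps matters: slicing before arranging would keep the first rows of the original table rather than the most recent ones.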
4e. Remember that ggplot does some tabulation internally in certain geoms. Use dplyr piping and functions to explore the following question: are there particular months of the year in which there are fewer fatal attacks than others (all countries)?
4f. That seems weird. Facet the above graphic by country. Describe what you see. What is the month with the most fatal attacks in the US?
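One possible way to do the tabulation explicitly in dplyr and then hand the counts to ggplot is sketched below on invented data. The country labels and month values are made up, and count() followed by geom_col() is only one of several reasonable approaches.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration
toy <- tibble(country = c("X", "X", "Y", "Y", "Y"),
              month   = c(1, 1, 2, 7, 7))

# Tabulate explicitly: one row per (country, month) with its count in column n
monthly <- toy %>% count(country, month)

# geom_col() plots the counts as given; facet_wrap() makes one panel per country
p <- monthly %>%
  ggplot(aes(x = month, y = n)) +
  geom_col() +
  facet_wrap(~ country)
```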
4g. Challenge: Use the built-in data vector month.abb to create a new column month_abb in shark_attacks: an ordered factor variable with the correct month associated with each attack. Update the 4e and 4f figures with this information. Also add scale_x_discrete(drop = FALSE) to your plots. What does this do?
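A minimal sketch of building an ordered month factor from month numbers, using a made-up vector month_num rather than the lab's data:

```r
# Made-up month numbers for illustration
month_num <- c(7, 1, 7, 12)

# Index month.abb by number, and fix the levels to all 12 months in calendar order
month_abb <- factor(month.abb[month_num], levels = month.abb, ordered = TRUE)

levels(month_abb)  # all 12 abbreviations, including months that never occur
table(month_abb)   # unused months are still listed
```

Supplying levels = month.abb explicitly is what keeps the months in calendar order; without it, factor() would sort the labels alphabetically.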