This week’s agenda: learning to master pipes and dplyr.
# Load the tidyverse!
library(tidyverse)
For each of the following code blocks, which are written with pipes, write equivalent code in base R (to do the same thing).
letters %>%
toupper %>%
paste(collapse="+")
## [1] "A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+U+V+W+X+Y+Z"
" Ceci n'est pas une pipe " %>%
gsub("une", "un", .) %>%
trimws
## [1] "Ceci n'est pas un pipe"
rnorm(1000) %>%
data.frame(x = .) %>%
ggplot(.) + # I'm giving you a hint here
geom_histogram(aes(x = x, y = ..density..)) +
labs(title = "N(0,1) draws")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
rnorm(1000) %>%
hist(breaks=30, plot=FALSE) %>% # use ?hist to figure out what this returns when plot=FALSE
.[["density"]] %>%
max
## [1] 0.405
For each of the following code blocks, which are written in base R, write equivalent code with pipes (to do the same thing).
Hint: use the dot ., as seen above in Q1b, or in the lecture notes.
paste("Your grade is", sample(c("A","B","C","D","R"), size = 1))
## [1] "Your grade is D"
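As a small illustration of the dot hint above (a sketch on a made-up string, not a solution to the exercise): by default %>% passes the left-hand side as the first argument of the next call, while a bare . used as an argument places it at that position instead. The strings and variable names here are invented for the example.

```r
library(magrittr)  # provides %>% (dplyr and other tidyverse packages re-export it)

# Default: the left-hand side becomes the FIRST argument,
# so this is the same as paste("world", "Hello")
first_arg <- "world" %>% paste("Hello")

# With the dot: the left-hand side goes where the dot is,
# so this is the same as paste("Hello", "world")
dot_arg <- "world" %>% paste("Hello", .)

print(first_arg)
print(dot_arg)
```

The only difference between the two calls is the dot, which is what lets a piped value land somewhere other than the first argument.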
Hint: use the dot . again, in order to index state.name directly in the last pipe command.
state.name[which.max(state.x77[,"Illiteracy"])]
## [1] "Louisiana"
Hint: if x is a list of length 1, then x[[1]] is the same as unlist(x).
str_url <-
paste0("https://raw.githubusercontent.com/benjaminleroy/",
"36-350-summer-data/master/Week1/endgame.txt")
# Base R:
lines <- readLines(str_url)
text <- paste(lines, collapse = " ")
words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
wordtab <- table(words)
wordtab <- sort(wordtab, decreasing = TRUE)
head(wordtab, 10)
## words
##         the    to     I     a   and   you     s    of    it
## 10146   780   553   478   466   408   375   374   282   261
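The hint about lists of length 1 can be checked on a toy input. This is a minimal sketch with a made-up sentence; it only shows why unlist() can stand in for [[1]] when the list holds a single plain vector, which is convenient because unlist() is an ordinary function call.

```r
# strsplit() always returns a list, with one element per input string
x <- strsplit("to be or not", split = " ")

n_elements <- length(x)  # one input string, so a list of length 1
first <- x[[1]]          # extract the single element
flat  <- unlist(x)       # flatten the list into a plain vector

# Both routes give the same character vector
same <- identical(first, flat)
print(same)
```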
Note the extra line words = words[words != ""] in the code below, which drops the empty strings. This is a slightly tricky line to do with pipes: use the dot ., once more, and manipulate it as if it were a variable name.
# Base R:
lines <- readLines(str_url)
text <- paste(lines, collapse = " ")
words <- strsplit(text, split = "[[:space:]]|[[:punct:]]")[[1]]
words <- words[words != ""]
wordtab <- table(words)
wordtab <- sort(wordtab, decreasing = TRUE)
head(wordtab, 10)
## words
## the  to   I   a and you   s  of  it  in
## 780 553 478 466 408 375 374 282 261 251
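To see what "manipulate the dot as if it were a variable name" can look like, here is a sketch on a made-up numeric vector (deliberately not the word data). The names v, base_result, and pipe_result are invented for the example.

```r
library(magrittr)  # provides %>%

v <- c(3, -1, 5, 0)

# Base R: the vector is named twice, once to index and once in the condition
base_result <- v[v > 0]

# Piped: the dot stands in for the vector in both places
pipe_result <- v %>% .[. > 0]

print(pipe_result)
```

Because the dot appears as a direct argument of the indexing call, the pipe does not also insert the vector as an extra first argument.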
With your vectorized function, we’ll start exploring the dplyr tools we saw in lecture.
4a. Quick clean up. Convert the Time column in shark_attacks to numeric using time_factor_to_numeric_vectorized. Hint: use mutate_at(), and reassign shark_attacks to be the output.
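If the shape of a mutate_at() call is unfamiliar, the following sketch may help. It runs on a hypothetical toy data frame, and to_number is a made-up stand-in for a vectorized conversion function, so it is not the lab's own function or data. Note that mutate_at() still works but has been superseded by across() since dplyr 1.0.

```r
library(dplyr)

# Hypothetical toy data: a numeric column stored as text
toy <- data.frame(id = 1:3,
                  Time = c("10", "14", "9"),
                  stringsAsFactors = FALSE)

# Hypothetical stand-in for a vectorized conversion function
to_number <- function(x) as.numeric(x)

# Apply the function to the named column only, then reassign the result
toy <- toy %>% mutate_at("Time", to_number)

print(class(toy$Time))
```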
4b. Using dplyr’s mutate function, create 3 new numeric columns, year, month, and day, for the shark_attacks data.frame. Before you do, check the class of the Date column using dplyr pipes and the pull function.
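A rough sketch of both steps on hypothetical toy data follows. It assumes the column is already of class Date, which may not hold for shark_attacks (that is exactly what the class check is for); a factor or character column would need converting first. It uses base R's format() rather than any particular date package.

```r
library(dplyr)

# Hypothetical toy data; ASSUMES the column is of class Date
toy <- data.frame(Date = as.Date(c("2016-07-04", "1998-12-25")))

# pull() extracts one column as a vector, so class() sees the column itself
date_class <- toy %>% pull(Date) %>% class

# format() returns text, so each piece is wrapped in as.numeric()
toy <- toy %>%
  mutate(year  = as.numeric(format(Date, "%Y")),
         month = as.numeric(format(Date, "%m")),
         day   = as.numeric(format(Date, "%d")))

print(date_class)
print(toy)
```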
4c. Suppose you’re writing a story about fatal shark attacks in the United States. You’re interested in tracking down the families of some of these individuals. Moreover, for journalistic purposes, it’s probably best to select families based on how recently the attack happened. First, create a new data.frame, fatal_shark_attacks_usa, that contains information on all attacks in the US that were fatal.
4d. Using fatal_shark_attacks_usa, produce a table that contains individuals that meet the rest of the conditions above, and show the top 3 individuals’ Name, Age, Area of attack, and year of attack. Hint: this will use arrange, select, and slice. I expect this will take some thinking; maybe pull out a piece of paper, write down the goals, and try to decide how they relate to dplyr functions.
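For reference, here is how the three verbs in the hint chain together on a made-up table that has nothing to do with sharks. The data frame toy and its columns are invented; the point is only the order of operations (sort, keep columns, keep rows).

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(name  = c("Ann", "Bo", "Cy", "Di"),
                  score = c(70, 95, 88, 60),
                  extra = c("w", "x", "y", "z"),
                  stringsAsFactors = FALSE)

top2 <- toy %>%
  arrange(desc(score)) %>%  # sort rows, largest score first
  select(name, score) %>%   # keep only the columns of interest
  slice(1:2)                # keep the first two rows of the sorted table

print(top2)
```

Sorting before slicing matters: slice() picks rows by position, so it only returns the "top" rows once the table is in the intended order.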
4e. Remembering that ggplot does some tabulation internally in certain geoms: use dplyr piping and functions to explore the following question. Are there particular months of the year in which there are fewer fatal attacks than in others (all countries)?
4f. That seems weird. Facet the above graphic on country. Describe what you see. What is the month with the most fatal attacks in the US?
4g. Challenge: Use the built-in data vector month.abb to create a new column in shark_attacks, month_abb, that is an ordered factor variable with the correct month associated with each attack. Update the 4e and 4f figures with this information. Also add scale_x_discrete(drop = FALSE) to your plots. What does this do?