Statistical Computing, 36-350
Friday July 19, 2019
%>% operator)dplyr is a package for data wrangling, with several key verbs (functions)slice() and filter(): subset rows based on numbers or conditionsselect() and pull(): select columns or a single column as a vectorarrange(): order rows by one or multiple columnsrename() and mutate(): rename or create columnsmutate_at(): apply a function to given columns%>%)Tidyverse functions are at their best when composed together using the pipe (%>%) operator!
library(tidyverse)
mtcars %>%
filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>%
head(2)## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Shortcut: use ctrl + shift + m in RStudio as a shortcut for typing %>%
Mastering the tidyr verbs
tidyr verbsOur tidyr journey starts of with learning the following verbs (functions):
gather(): make “wide” data longerspread(): make “long” data widerseparate(): split a single column into multiple columnsunite(): combine multiple columns into a single columnKey takeaway: as with dplyr, think of data frames as nouns and tidyr verbs as actions that you apply to manipulate them—especially natural when using pipes
gather()Use gather() to make “wide” data longer:
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases %>%
head(3)## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
EDAWR::cases %>%
gather(key = "year", value = "n", 2:4) %>%
head(5)## country year n
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
year columnntidyr did all the heavy lifting of the transposing work# different approach to do the same thing
EDAWR::cases %>%
gather("year", "n", -country) %>%
# also do gather("year", "n", `2011`, `2012`, `2013`) %>%
# or gather("year", "n", `2011`:`2013`) %>%
head(5)## country year n
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
spread()Use spread() to make “long” data wider:
EDAWR::pollution %>%
head(5)## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
spread(key = size, value = amount)## city large small
## 1 Beijing 121 56
## 2 London 22 16
## 3 New York 23 14
tidyr did all the heavy lifting againspread() and gather() are inversesggplot visualizations (remember you need to map a column to an aesethic) (I used spread() for the my_volcano example)gather())separate()EDAWR::storms %>%
head(3)## storm wind pressure date
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
storms2 <- EDAWR::storms %>%
separate(date, c("y", "m", "d")) # sep = "-"unite()Use unite() to combine multiple columns into a single column:
storms2 %>%
unite(date, y, m, d, sep = "-")## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
Note that unite() and separate() are inverse operations
dplyr: group_by() and summarize()
dplyr and SQLdplyr you should find SQL very natural, and vice versa!group_by() and summarize(), which are used to aggregate data (now)left_join() and inner_join() verbs (shortly)group_by()Use group_by() to define a grouping of rows based on a column:
mtcars %>%
group_by(cyl) %>%
head(4)## # A tibble: 4 x 11
## # Groups: cyl [2]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
mtcars %>%
group_by(cyl) %>%
head(4) %>% class## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
dplyr verbs actsummarize()Use summarise() (or summarize() for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:
# Ungrouped
mtcars %>%
summarize(mpg = mean(mpg),
hp = mean(hp))## mpg hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg),
hp = mean(hp))## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
mtcars %>%
group_by(cyl) %>%
summarize(mpg_mean = mean(mpg),
mpg_max = max(mpg),
hp_mean = mean(hp),
hp_max = max(hp))## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()Use ungroup() to remove groupings structure from a data frame:
mtcars %>%
group_by(cyl) %>%
ungroup() %>%
summarize(hp = mean(hp),
mpg = mean(mpg))## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
In last week’s lab we encountered the following example:
str_url <- "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt"
str_url %>%
readLines %>%
paste(collapse = " ") %>%
strsplit(split = "[[:space:]]|[[:punct:]]") %>%
unlist() %>%
.[. != ""] %>%
table() %>%
sort(decreasing = TRUE) %>%
head()## .
## the to I a and you
## 780 553 478 466 408 375
This required a tricky indexing operation using %>% on vectors: .[. != ""]
Note: I actually scraped the the endgame screenplay with code that looks somewhat like this see my R script.
We can perform the same operations using dplyr if we convert to a data frame:
str_url %>%
readLines() %>%
paste(collapse=" ") %>%
strsplit(split="[[:space:]]|[[:punct:]]") %>%
unlist() %>%
data.frame(value = .) %>% # Convert to a data frame (dplyr version)
rename(word_list = value) %>% # Rename the column to `word_list`
filter(word_list != "") %>% # Now indexing is easy!
group_by(word_list) %>%
summarize(word_count = n()) %>%
arrange(desc(word_count)) %>%
slice(1:6)## # A tibble: 6 x 2
## word_list word_count
## <fct> <int>
## 1 the 780
## 2 to 553
## 3 I 478
## 4 a 466
## 5 and 408
## 6 you 375
dplyr: left_join() and inner_join()
A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:
Column values that cannot be filled in are assigned NA values
It helps to visualize the join types:
has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"),
children = c(3, 1, 3),
stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
age = c(38, 54, 34),
stringsAsFactors = FALSE)
has_kids_tab1## name children
## 1 Robert Downey, Jr 3
## 2 Scarlett Johansson 1
## 3 Chris Hemsworth 3
americans_tab2## name age
## 1 Chris Evans 38
## 2 Robert Downey, Jr 54
## 3 Scarlett Johansson 34
inner_join()Suppose we want to join tab1 and tab2 by name, but keep only actors in intersection (aka in both tables):
inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
left_join()Suppose we want to join tab1 and tab2 by name, but keep all actors from tab1:
left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
right_join()Suppose we want to join tab1 and tab2 by name, but keep all actors from tab2:
right_join(x = has_kids_tab1, y = americans_tab2, by = "name")## name children age
## 1 Chris Evans NA 38
## 2 Robert Downey, Jr 3 54
## 3 Scarlett Johansson 1 34
full_join()Finally, suppose we want to join tab1 and tab2 by name, and keep all actors from both:
full_join(x = has_kids_tab1, y = americans_tab2, by = "name")## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
## 4 Chris Evans NA 38
my_peeps %T>% print() %>% dim()## pol gender state IQ
## 1 R F AZ 101
## 2 R M AZ 108
## 3 D M AZ 100
## 4 R F PA 112
## 5 D M PA 101
## 6 D F PA 111
## 7 D F AZ 107
## 8 R M AZ 93
## 9 D M PA 101
## 10 D F AZ 100
## [1] 10 4
politics %T>% print() %>% dim()## senator pol gender STATE
## 1 Kyrsten Sinema D F AZ
## 2 Martha McSally R F AZ
## 3 Pat Toomey R M PA
## 4 Boy Casey Jr. D M PA
## [1] 4 4
%T>% is a special pipe that passes my_peeps into print() as a “side-effect” and then also passes my_peeps onto the rest of the chain (which in this case is just dim())my_peeps %>% left_join(politics,by = c("state" = "STATE",
"pol" = "pol")) %>% head(6)## pol gender.x state IQ senator gender.y
## 1 R F AZ 101 Martha McSally F
## 2 R M AZ 108 Martha McSally F
## 3 D M AZ 100 Kyrsten Sinema F
## 4 R F PA 112 Pat Toomey M
## 5 D M PA 101 Boy Casey Jr. M
## 6 D F PA 111 Boy Casey Jr. M
Notice:
gender get gender.x and gender.y indicating columns for the first (x) and second (y) data framestidyr is a package for manipulating the structure of data framesgather(): make wide data longerspread(): make long data widerunite() and separate(): combine or split columnsdplyr has advanced functionality that mirrors SQLgroup_by(): create groups of rows according to a conditionsummarize(): apply computations across groups of rows*_join() where * = inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates how many rows to keep