Statistical Computing, 36-350
Friday July 19, 2019
%>%
operator)dplyr
is a package for data wrangling, with several key verbs (functions)slice()
and filter()
: subset rows based on numbers or conditionsselect()
and pull()
: select columns or a single column as a vectorarrange()
: order rows by one or multiple columnsrename()
and mutate()
: rename or create columnsmutate_at()
: apply a function to given columns%>%
)Tidyverse functions are at their best when composed together using the pipe (%>%
) operator!
library(tidyverse)
mtcars %>%
filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>%
head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Shortcut: use ctrl + shift + m
in RStudio as a shortcut for typing %>%
Mastering the tidyr
verbs
tidyr
verbsOur tidyr
journey starts of with learning the following verbs (functions):
gather()
: make “wide” data longerspread()
: make “long” data widerseparate()
: split a single column into multiple columnsunite()
: combine multiple columns into a single columnKey takeaway: as with dplyr
, think of data frames as nouns and tidyr
verbs as actions that you apply to manipulate them—especially natural when using pipes
gather()
Use gather()
to make “wide” data longer:
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases %>%
head(3)
## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
EDAWR::cases %>%
gather(key = "year", value = "n", 2:4) %>%
head(5)
## country year n
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
year
columnn
tidyr
did all the heavy lifting of the transposing work# different approach to do the same thing
EDAWR::cases %>%
gather("year", "n", -country) %>%
# also do gather("year", "n", `2011`, `2012`, `2013`) %>%
# or gather("year", "n", `2011`:`2013`) %>%
head(5)
## country year n
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
spread()
Use spread()
to make “long” data wider:
EDAWR::pollution %>%
head(5)
## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
spread(key = size, value = amount)
## city large small
## 1 Beijing 121 56
## 2 London 22 16
## 3 New York 23 14
tidyr
did all the heavy lifting againspread()
and gather()
are inversesggplot
visualizations (remember you need to map a column to an aesethic) (I used spread()
for the my_volcano
example)gather()
)separate()
EDAWR::storms %>%
head(3)
## storm wind pressure date
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
storms2 <- EDAWR::storms %>%
separate(date, c("y", "m", "d")) # sep = "-"
unite()
Use unite()
to combine multiple columns into a single column:
storms2 %>%
unite(date, y, m, d, sep = "-")
## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
Note that unite()
and separate()
are inverse operations
dplyr
: group_by()
and summarize()
dplyr
and SQLdplyr
you should find SQL very natural, and vice versa!group_by()
and summarize()
, which are used to aggregate data (now)left_join()
and inner_join()
verbs (shortly)group_by()
Use group_by()
to define a grouping of rows based on a column:
mtcars %>%
group_by(cyl) %>%
head(4)
## # A tibble: 4 x 11
## # Groups: cyl [2]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
mtcars %>%
group_by(cyl) %>%
head(4) %>% class
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
dplyr
verbs actsummarize()
Use summarise()
(or summarize()
for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:
# Ungrouped
mtcars %>%
summarize(mpg = mean(mpg),
hp = mean(hp))
## mpg hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg),
hp = mean(hp))
## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
mtcars %>%
group_by(cyl) %>%
summarize(mpg_mean = mean(mpg),
mpg_max = max(mpg),
hp_mean = mean(hp),
hp_max = max(hp))
## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()
Use ungroup()
to remove groupings structure from a data frame:
mtcars %>%
group_by(cyl) %>%
ungroup() %>%
summarize(hp = mean(hp),
mpg = mean(mpg))
## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
In last week’s lab we encountered the following example:
str_url <- "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt"
str_url %>%
readLines %>%
paste(collapse = " ") %>%
strsplit(split = "[[:space:]]|[[:punct:]]") %>%
unlist() %>%
.[. != ""] %>%
table() %>%
sort(decreasing = TRUE) %>%
head()
## .
## the to I a and you
## 780 553 478 466 408 375
This required a tricky indexing operation using %>%
on vectors: .[. != ""]
Note: I actually scraped the the endgame screenplay with code that looks somewhat like this see my R script.
We can perform the same operations using dplyr
if we convert to a data frame:
str_url %>%
readLines() %>%
paste(collapse=" ") %>%
strsplit(split="[[:space:]]|[[:punct:]]") %>%
unlist() %>%
data.frame(value = .) %>% # Convert to a data frame (dplyr version)
rename(word_list = value) %>% # Rename the column to `word_list`
filter(word_list != "") %>% # Now indexing is easy!
group_by(word_list) %>%
summarize(word_count = n()) %>%
arrange(desc(word_count)) %>%
slice(1:6)
## # A tibble: 6 x 2
## word_list word_count
## <fct> <int>
## 1 the 780
## 2 to 553
## 3 I 478
## 4 a 466
## 5 and 408
## 6 you 375
dplyr
: left_join()
and inner_join()
A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:
Column values that cannot be filled in are assigned NA
values
It helps to visualize the join types:
has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"),
children = c(3, 1, 3),
stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
age = c(38, 54, 34),
stringsAsFactors = FALSE)
has_kids_tab1
## name children
## 1 Robert Downey, Jr 3
## 2 Scarlett Johansson 1
## 3 Chris Hemsworth 3
americans_tab2
## name age
## 1 Chris Evans 38
## 2 Robert Downey, Jr 54
## 3 Scarlett Johansson 34
inner_join()
Suppose we want to join tab1
and tab2
by name, but keep only actors in intersection (aka in both tables):
inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
left_join()
Suppose we want to join tab1
and tab2
by name, but keep all actors from tab1
:
left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
right_join()
Suppose we want to join tab1
and tab2
by name, but keep all actors from tab2
:
right_join(x = has_kids_tab1, y = americans_tab2, by = "name")
## name children age
## 1 Chris Evans NA 38
## 2 Robert Downey, Jr 3 54
## 3 Scarlett Johansson 1 34
full_join()
Finally, suppose we want to join tab1
and tab2
by name, and keep all actors from both:
full_join(x = has_kids_tab1, y = americans_tab2, by = "name")
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
## 4 Chris Evans NA 38
my_peeps %T>% print() %>% dim()
## pol gender state IQ
## 1 R F AZ 101
## 2 R M AZ 108
## 3 D M AZ 100
## 4 R F PA 112
## 5 D M PA 101
## 6 D F PA 111
## 7 D F AZ 107
## 8 R M AZ 93
## 9 D M PA 101
## 10 D F AZ 100
## [1] 10 4
politics %T>% print() %>% dim()
## senator pol gender STATE
## 1 Kyrsten Sinema D F AZ
## 2 Martha McSally R F AZ
## 3 Pat Toomey R M PA
## 4 Boy Casey Jr. D M PA
## [1] 4 4
%T>%
is a special pipe that passes my_peeps
into print()
as a “side-effect” and then also passes my_peeps
onto the rest of the chain (which in this case is just dim()
)my_peeps %>% left_join(politics,by = c("state" = "STATE",
"pol" = "pol")) %>% head(6)
## pol gender.x state IQ senator gender.y
## 1 R F AZ 101 Martha McSally F
## 2 R M AZ 108 Martha McSally F
## 3 D M AZ 100 Kyrsten Sinema F
## 4 R F PA 112 Pat Toomey M
## 5 D M PA 101 Boy Casey Jr. M
## 6 D F PA 111 Boy Casey Jr. M
Notice:
gender
get gender.x
and gender.y
indicating columns for the first (x
) and second (y
) data framestidyr
is a package for manipulating the structure of data framesgather()
: make wide data longerspread()
: make long data widerunite()
and separate()
: combine or split columnsdplyr
has advanced functionality that mirrors SQLgroup_by()
: create groups of rows according to a conditionsummarize()
: apply computations across groups of rows*_join()
where *
= inner
, left
, right
, or full
: join two data frames together according to common values in certain columns, and *
indicates how many rows to keep