Tidyverse II: Tidyr and Advanced Dplyr

Statistical Computing, 36-350

Friday July 19, 2019

Last week: Pipes and dplyr

Recall: using pipes (%>%)

Tidyverse functions are at their best when composed together using the pipe (%>%) operator!

library(tidyverse) 
mtcars %>% 
  filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>% 
  head(2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Shortcut: use ctrl + shift + m in RStudio as a shortcut for typing %>%

Part I

Mastering the tidyr verbs

tidyr verbs

Our tidyr journey starts of with learning the following verbs (functions):

Key takeaway: as with dplyr, think of data frames as nouns and tidyr verbs as actions that you apply to manipulate them—especially natural when using pipes

gather()

Use gather() to make “wide” data longer:

# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets

EDAWR::cases %>% 
  head(3)
##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000
EDAWR::cases %>% 
  gather(key = "year", value = "n", 2:4) %>% 
  head(5)
##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
# different approach to do the same thing
EDAWR::cases %>% 
  gather("year", "n", -country) %>%  
  # also do gather("year", "n", `2011`, `2012`, `2013`) %>%
  # or gather("year", "n", `2011`:`2013`) %>%
  head(5)
##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000

spread()

Use spread() to make “long” data wider:

EDAWR::pollution %>% 
  head(5)
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
EDAWR::pollution %>% 
  spread(key = size, value = amount)
##       city large small
## 1  Beijing   121    56
## 2   London    22    16
## 3 New York    23    14

When could I used these?

separate()

EDAWR::storms %>% 
  head(3)
##     storm wind pressure       date
## 1 Alberto  110     1007 2000-08-03
## 2    Alex   45     1009 1998-07-27
## 3 Allison   65     1005 1995-06-03
storms2 <- EDAWR::storms %>% 
  separate(date, c("y", "m", "d")) # sep = "-"

unite()

Use unite() to combine multiple columns into a single column:

storms2 %>%
  unite(date, y, m, d, sep = "-")
## # A tibble: 6 x 4
##   storm    wind pressure date      
##   <chr>   <int>    <int> <chr>     
## 1 Alberto   110     1007 2000-08-03
## 2 Alex       45     1009 1998-07-27
## 3 Allison    65     1005 1995-06-03
## 4 Ana        40     1013 1997-06-30
## 5 Arlene     50     1010 1999-06-11
## 6 Arthur     45     1010 1996-06-17

Note that unite() and separate() are inverse operations

Part II

dplyr: group_by() and summarize()

Recall: dplyr and SQL

group_by()

Use group_by() to define a grouping of rows based on a column:

mtcars %>% 
  group_by(cyl) %>%
  head(4)
## # A tibble: 4 x 11
## # Groups:   cyl [2]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
mtcars %>% 
  group_by(cyl) %>%
  head(4) %>% class
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

summarize()

Use summarise() (or summarize() for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:

# Ungrouped
mtcars %>% 
  summarize(mpg = mean(mpg),
            hp = mean(hp))
##        mpg       hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg),
            hp = mean(hp))
## # A tibble: 3 x 3
##     cyl   mpg    hp
##   <dbl> <dbl> <dbl>
## 1     4  26.7  82.6
## 2     6  19.7 122. 
## 3     8  15.1 209.
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg_mean = mean(mpg),
            mpg_max = max(mpg),
            hp_mean = mean(hp),
            hp_max = max(hp))
## # A tibble: 3 x 5
##     cyl mpg_mean mpg_max hp_mean hp_max
##   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
## 1     4     26.7    33.9    82.6    113
## 2     6     19.7    21.4   122.     175
## 3     8     15.1    19.2   209.     335

ungroup()

Use ungroup() to remove groupings structure from a data frame:

mtcars %>%
  group_by(cyl) %>%
  ungroup() %>%
  summarize(hp = mean(hp),
            mpg = mean(mpg))
## # A tibble: 1 x 2
##      hp   mpg
##   <dbl> <dbl>
## 1  147.  20.1

Simplifying indexing

In last week’s lab we encountered the following example:

str_url <- "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt" 
str_url %>% 
  readLines %>% 
  paste(collapse = " ") %>%
  strsplit(split = "[[:space:]]|[[:punct:]]") %>%
  unlist() %>%
  .[. != ""] %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  head()
## .
## the  to   I   a and you 
## 780 553 478 466 408 375

This required a tricky indexing operation using %>% on vectors: .[. != ""]

Note: I actually scraped the the endgame screenplay with code that looks somewhat like this see my R script.

We can perform the same operations using dplyr if we convert to a data frame:

str_url %>%
  readLines() %>%
  paste(collapse=" ") %>%
  strsplit(split="[[:space:]]|[[:punct:]]") %>%
  unlist() %>%
  data.frame(value = .) %>%      # Convert to a data frame (dplyr version)
  rename(word_list = value) %>%  # Rename the column to `word_list`
  filter(word_list != "") %>%    # Now indexing is easy!
  group_by(word_list) %>% 
  summarize(word_count = n()) %>%
  arrange(desc(word_count)) %>% 
  slice(1:6)
## # A tibble: 6 x 2
##   word_list word_count
##   <fct>          <int>
## 1 the              780
## 2 to               553
## 3 I                478
## 4 a                466
## 5 and              408
## 6 you              375

Part III

dplyr: left_join() and inner_join()

Join operations

A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:

Column values that cannot be filled in are assigned NA values

It helps to visualize the join types:

Two toy data frames

has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"), 
                  children = c(3, 1, 3),
                  stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
                  age = c(38, 54, 34),
                  stringsAsFactors = FALSE)
has_kids_tab1
##                 name children
## 1  Robert Downey, Jr        3
## 2 Scarlett Johansson        1
## 3    Chris Hemsworth        3
americans_tab2
##                 name age
## 1        Chris Evans  38
## 2  Robert Downey, Jr  54
## 3 Scarlett Johansson  34

inner_join()

Suppose we want to join tab1 and tab2 by name, but keep only actors in intersection (aka in both tables):

inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")
##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34

left_join()

Suppose we want to join tab1 and tab2 by name, but keep all actors from tab1:

left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))
##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34
## 3    Chris Hemsworth        3  NA

right_join()

Suppose we want to join tab1 and tab2 by name, but keep all actors from tab2:

right_join(x = has_kids_tab1, y = americans_tab2, by = "name")
##                 name children age
## 1        Chris Evans       NA  38
## 2  Robert Downey, Jr        3  54
## 3 Scarlett Johansson        1  34

full_join()

Finally, suppose we want to join tab1 and tab2 by name, and keep all actors from both:

full_join(x = has_kids_tab1, y = americans_tab2, by = "name")
##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34
## 3    Chris Hemsworth        3  NA
## 4        Chris Evans       NA  38

More nuanced structure

my_peeps %T>% print() %>% dim()
##    pol gender state  IQ
## 1    R      F    AZ 101
## 2    R      M    AZ 108
## 3    D      M    AZ 100
## 4    R      F    PA 112
## 5    D      M    PA 101
## 6    D      F    PA 111
## 7    D      F    AZ 107
## 8    R      M    AZ  93
## 9    D      M    PA 101
## 10   D      F    AZ 100
## [1] 10  4
politics %T>% print() %>% dim()
##          senator pol gender STATE
## 1 Kyrsten Sinema   D      F    AZ
## 2 Martha McSally   R      F    AZ
## 3     Pat Toomey   R      M    PA
## 4  Boy Casey Jr.   D      M    PA
## [1] 4 4
my_peeps %>% left_join(politics,by = c("state" = "STATE",
                                       "pol" = "pol")) %>% head(6)
##   pol gender.x state  IQ        senator gender.y
## 1   R        F    AZ 101 Martha McSally        F
## 2   R        M    AZ 108 Martha McSally        F
## 3   D        M    AZ 100 Kyrsten Sinema        F
## 4   R        F    PA 112     Pat Toomey        M
## 5   D        M    PA 101  Boy Casey Jr.        M
## 6   D        F    PA 111  Boy Casey Jr.        M

Notice:

Summary

References