Last week: Pipes and dplyr

The tidyverse is a collection of packages for common data science tasks
Tidyverse functionality is greatly enhanced using pipes (%>% operator)
Pipes allow you to string together commands to get a flow of results
dplyr is a package for data wrangling, with several key verbs (functions)
slice() and filter(): subset rows based on numbers or conditions
select() and pull(): select columns or a single column as a vector
arrange(): order rows by one or multiple columns
rename() and mutate(): rename or create columns
mutate_at(): apply a function to given columns

Recall: using pipes (`%>%`)

Tidyverse functions are at their best when composed together using the pipe (%>%) operator!

library(tidyverse) 
mtcars %>% 
  filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>% 
  head(2)

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Shortcut: use ctrl + shift + m in RStudio as a shortcut for typing %>%

Part I

Mastering the tidyr verbs

`tidyr` verbs

Our tidyr journey starts of with learning the following verbs (functions):

gather(): make “wide” data longer
spread(): make “long” data wider
separate(): split a single column into multiple columns
unite(): combine multiple columns into a single column

Key takeaway: as with dplyr, think of data frames as nouns and tidyr verbs as actions that you apply to manipulate them—especially natural when using pipes

`gather()`

Use gather() to make “wide” data longer:

# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets

EDAWR::cases %>% 
  head(3)

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

EDAWR::cases %>% 
  gather(key = "year", value = "n", 2:4) %>% 
  head(5)

##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000

Here we transposed columns 2:4 into a year column
We put the corresponding count values into a column called n
Note tidyr did all the heavy lifting of the transposing work
We just had to declaratively specify the output

# different approach to do the same thing
EDAWR::cases %>% 
  gather("year", "n", -country) %>%  
  # also do gather("year", "n", `2011`, `2012`, `2013`) %>%
  # or gather("year", "n", `2011`:`2013`) %>%
  head(5)

##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000

`spread()`

Use spread() to make “long” data wider:

EDAWR::pollution %>% 
  head(5)

##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121

EDAWR::pollution %>% 
  spread(key = size, value = amount)

##       city large small
## 1  Beijing   121    56
## 2   London    22    16
## 3 New York    23    14

Here we transposed to a wide format by size
We tabulated the corresponding amount for each size
Note tidyr did all the heavy lifting again
We just had to declaratively specify the output
Note that spread() and gather() are inverses

When could I used these?

in ggplot visualizations (remember you need to map a column to an aesethic) (I used spread() for the my_volcano example)
to make prettier / more readable “tables” (gather())

`separate()`

EDAWR::storms %>% 
  head(3)

##     storm wind pressure       date
## 1 Alberto  110     1007 2000-08-03
## 2    Alex   45     1009 1998-07-27
## 3 Allison   65     1005 1995-06-03

storms2 <- EDAWR::storms %>% 
  separate(date, c("y", "m", "d")) # sep = "-"

`unite()`

Use unite() to combine multiple columns into a single column:

storms2 %>%
  unite(date, y, m, d, sep = "-")

## # A tibble: 6 x 4
##   storm    wind pressure date      
##   <chr>   <int>    <int> <chr>     
## 1 Alberto   110     1007 2000-08-03
## 2 Alex       45     1009 1998-07-27
## 3 Allison    65     1005 1995-06-03
## 4 Ana        40     1013 1997-06-30
## 5 Arlene     50     1010 1999-06-11
## 6 Arthur     45     1010 1996-06-17

Note that unite() and separate() are inverse operations

Part II

dplyr: group_by() and summarize()

Recall: `dplyr` and SQL

Once you learn dplyr you should find SQL very natural, and vice versa!
This will make it much easier for tasks that require using both R and SQL to munge data and build statistical models
One major link is through powerful verbs like group_by() and summarize(), which are used to aggregate data (now)
Another major link to SQL is through merging/joining data frames, via left_join() and inner_join() verbs (shortly)

`group_by()`

Use group_by() to define a grouping of rows based on a column:

mtcars %>% 
  group_by(cyl) %>%
  head(4)

## # A tibble: 4 x 11
## # Groups:   cyl [2]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1

mtcars %>% 
  group_by(cyl) %>%
  head(4) %>% class

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

This doesn’t actually change anything about the way the df looks
Only difference is that when it prints, we’re told about the groups
But it will play a big role in how other dplyr verbs act

`summarize()`

Use summarise() (or summarize() for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:

# Ungrouped
mtcars %>% 
  summarize(mpg = mean(mpg),
            hp = mean(hp))

##        mpg       hp
## 1 20.09062 146.6875

# Grouped by number of cylinders
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg),
            hp = mean(hp))

## # A tibble: 3 x 3
##     cyl   mpg    hp
##   <dbl> <dbl> <dbl>
## 1     4  26.7  82.6
## 2     6  19.7 122. 
## 3     8  15.1 209.

mtcars %>%
  group_by(cyl) %>%
  summarize(mpg_mean = mean(mpg),
            mpg_max = max(mpg),
            hp_mean = mean(hp),
            hp_max = max(hp))

## # A tibble: 3 x 5
##     cyl mpg_mean mpg_max hp_mean hp_max
##   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
## 1     4     26.7    33.9    82.6    113
## 2     6     19.7    21.4   122.     175
## 3     8     15.1    19.2   209.     335

`ungroup()`

Use ungroup() to remove groupings structure from a data frame:

mtcars %>%
  group_by(cyl) %>%
  ungroup() %>%
  summarize(hp = mean(hp),
            mpg = mean(mpg))

## # A tibble: 1 x 2
##      hp   mpg
##   <dbl> <dbl>
## 1  147.  20.1

Simplifying indexing

In last week’s lab we encountered the following example:

str_url <- "https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week1/endgame.txt" 
str_url %>% 
  readLines %>% 
  paste(collapse = " ") %>%
  strsplit(split = "[[:space:]]|[[:punct:]]") %>%
  unlist() %>%
  .[. != ""] %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  head()

## .
## the  to   I   a and you 
## 780 553 478 466 408 375

This required a tricky indexing operation using %>% on vectors: .[. != ""]

Note: I actually scraped the the endgame screenplay with code that looks somewhat like this see my R script.

We can perform the same operations using dplyr if we convert to a data frame:

str_url %>%
  readLines() %>%
  paste(collapse=" ") %>%
  strsplit(split="[[:space:]]|[[:punct:]]") %>%
  unlist() %>%
  data.frame(value = .) %>%      # Convert to a data frame (dplyr version)
  rename(word_list = value) %>%  # Rename the column to `word_list`
  filter(word_list != "") %>%    # Now indexing is easy!
  group_by(word_list) %>% 
  summarize(word_count = n()) %>%
  arrange(desc(word_count)) %>% 
  slice(1:6)

## # A tibble: 6 x 2
##   word_list word_count
##   <fct>          <int>
## 1 the              780
## 2 to               553
## 3 I                478
## 4 a                466
## 5 and              408
## 6 you              375

Part III

dplyr: left_join() and inner_join()

Join operations

A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:

Inner join (or just join): retain just the rows each table that match the condition
Left outer join (or just left join): retain all rows in the first table, and just the rows in the second table that match the condition
Right outer join (or just right join): retain just the rows in the first table that match the condition, and all rows in the second table
Full outer join (or just full join): retain all rows in both tables

Column values that cannot be filled in are assigned NA values

It helps to visualize the join types:

Two toy data frames

has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"), 
                  children = c(3, 1, 3),
                  stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
                  age = c(38, 54, 34),
                  stringsAsFactors = FALSE)
has_kids_tab1

##                 name children
## 1  Robert Downey, Jr        3
## 2 Scarlett Johansson        1
## 3    Chris Hemsworth        3

americans_tab2

##                 name age
## 1        Chris Evans  38
## 2  Robert Downey, Jr  54
## 3 Scarlett Johansson  34

`inner_join()`

Suppose we want to join tab1 and tab2 by name, but keep only actors in intersection (aka in both tables):

inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")

##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34

`left_join()`

Suppose we want to join tab1 and tab2 by name, but keep all actors from tab1:

left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))

##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34
## 3    Chris Hemsworth        3  NA

`right_join()`

Suppose we want to join tab1 and tab2 by name, but keep all actors from tab2:

right_join(x = has_kids_tab1, y = americans_tab2, by = "name")

##                 name children age
## 1        Chris Evans       NA  38
## 2  Robert Downey, Jr        3  54
## 3 Scarlett Johansson        1  34

`full_join()`

Finally, suppose we want to join tab1 and tab2 by name, and keep all actors from both:

full_join(x = has_kids_tab1, y = americans_tab2, by = "name")

##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34
## 3    Chris Hemsworth        3  NA
## 4        Chris Evans       NA  38

More nuanced structure

my_peeps %T>% print() %>% dim()

##    pol gender state  IQ
## 1    R      F    AZ 101
## 2    R      M    AZ 108
## 3    D      M    AZ 100
## 4    R      F    PA 112
## 5    D      M    PA 101
## 6    D      F    PA 111
## 7    D      F    AZ 107
## 8    R      M    AZ  93
## 9    D      M    PA 101
## 10   D      F    AZ 100

## [1] 10  4

politics %T>% print() %>% dim()

##          senator pol gender STATE
## 1 Kyrsten Sinema   D      F    AZ
## 2 Martha McSally   R      F    AZ
## 3     Pat Toomey   R      M    PA
## 4  Boy Casey Jr.   D      M    PA

## [1] 4 4

%T>% is a special pipe that passes my_peeps into print() as a “side-effect” and then also passes my_peeps onto the rest of the chain (which in this case is just dim())

my_peeps %>% left_join(politics,by = c("state" = "STATE",
                                       "pol" = "pol")) %>% head(6)

##   pol gender.x state  IQ        senator gender.y
## 1   R        F    AZ 101 Martha McSally        F
## 2   R        M    AZ 108 Martha McSally        F
## 3   D        M    AZ 100 Kyrsten Sinema        F
## 4   R        F    PA 112     Pat Toomey        M
## 5   D        M    PA 101  Boy Casey Jr.        M
## 6   D        F    PA 111  Boy Casey Jr.        M

Notice:

you can join on multiple columns
multiple columns with not used in “by” gender get gender.x and gender.y indicating columns for the first (x) and second (y) data frames

Summary

tidyr is a package for manipulating the structure of data frames
gather(): make wide data longer
spread(): make long data wider
unite() and separate(): combine or split columns
dplyr has advanced functionality that mirrors SQL
group_by(): create groups of rows according to a condition
summarize(): apply computations across groups of rows
*_join() where * = inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates how many rows to keep

Tidyverse II: Tidyr and Advanced Dplyr

Last week: Pipes and dplyr

Recall: using pipes (`%>%`)

Part I

`tidyr` verbs

`gather()`

`spread()`

When could I used these?

`separate()`

`unite()`

Part II

Recall: `dplyr` and SQL

`group_by()`

`summarize()`

`ungroup()`

Simplifying indexing

Part III

Join operations

Two toy data frames

`inner_join()`

`left_join()`

`right_join()`

`full_join()`

More nuanced structure

Summary

References

Tidyverse II: Tidyr and Advanced Dplyr

Last week: Pipes and dplyr

Recall: using pipes (%>%)

Part I

tidyr verbs

gather()

spread()

When could I used these?

separate()

unite()

Part II

Recall: dplyr and SQL

group_by()

summarize()

ungroup()

Simplifying indexing

Part III

Join operations

Two toy data frames

inner_join()

left_join()

right_join()

full_join()

More nuanced structure

Summary

References

Recall: using pipes (`%>%`)

`tidyr` verbs

`gather()`

`spread()`

`separate()`

`unite()`

Recall: `dplyr` and SQL

`group_by()`

`summarize()`

`ungroup()`

`inner_join()`

`left_join()`

`right_join()`

`full_join()`