Tidyverse I: Pipes and Dplyr

Statistical Computing, 36-350

Wednesday July 17, 2019

Logistics

On Monday: Functions

Summary:

Part I

Welcome to the tidyverse

What is the tidyverse?

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2

Why the tidyverse?

Data wrangling the tidy way

Part II

Mastering the pipe

All behold the glorius pipe

ls -l | grep tidy | wc -l
##       11

How to read pipes: single arguments

Passing a single argument through pipes, we interpret something like:

x %>% f %>% g %>% h

as h(g(f(x)))

Key takeaway: in your mind, when you see %>%, read this as “and then”

Example

We can write exp(1) with pipes as 1 %>% exp(), and log(exp(1)) as 1 %>% exp() %>% log()

exp(1)
## [1] 2.718282
1 %>% exp()
## [1] 2.718282
1 %>% exp() %>% log()
## [1] 1

How to read pipes: multiple arguments

Now for multi-arguments functions, we interpret something like:

x %>% f(y) 

as f(x,y)

Example

mtcars %>% # data.frame
  head(4) 

And what’s the “old school” (base R) way?

head(mtcars, 4)

Notice that, with pipes: - Your code is more readable (arguably) - You can run partial commands more easily

The dot

The command x %>% f(y) can be equivalently written in dot notation as:

x %>% f(., y)

What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:

x %>% f(y, .)

which is equivalent to f(y,x)

Example

Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:

state_df <- data.frame(state.x77)
state.region %>% 
  tolower %>%
  tapply(state_df$Income, ., summary)

A more complicated example:

x <- "Ben really loves piping"
x %>% 
  strsplit(split = " ") %>%
  .[[1]] %>% # subsetting
  nchar() %>% 
  max # note the lack of paretheses 
## [1] 6

Part III

Mastering the dyplr verbs

dplyr verbs

Our dplyr journey starts of with learning the following verbs (functions):

Key takeaway: think of data frames as nouns and dplyr verbs as actions that you apply to manipulate them—especially natural when using pipes

slice()

Use slice() when you want to indicate certain row numbers need to be kept:

mtcars %>% 
  slice(c(7,8,14:15))
##    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## 1 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## 2 24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## 3 15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
## 4 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4
# Base R:
mtcars[c(7,8,14:15),]
##                     mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Duster 360         14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 240D          24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## Merc 450SLC        15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4

We can also do negative slicing:

mtcars %>% 
  slice(-c(1:2,19:23)) %>% 
  nrow()
## [1] 25
# Base R:
nrow(mtcars[-c(1:2,19:23),])
## [1] 25

filter()

Use filter() when you want to subset rows based on logical conditions:

mtcars %>% 
  filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>% 
  head(2) # note rownames are silenced
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Base R:
head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3),], 2)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

select()

Use select() when you want to pick out certain columns:

mtcars %>% 
  select(cyl, disp, hp) %>% 
  head(2)
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110
# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110

Handy select() helpers

Very handy selections using dplyr helper functions:

mtcars %>% 
  select(starts_with("d")) %>% 
  head(2)
##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9
# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9

We can do many other things as well:

mtcars %>% select(ends_with('t')) %>% head(2)
##               drat    wt
## Mazda RX4      3.9 2.620
## Mazda RX4 Wag  3.9 2.875
mtcars %>% select(ends_with('yl')) %>% head(2)
##               cyl
## Mazda RX4       6
## Mazda RX4 Wag   6
mtcars %>% select(contains('ar')) %>% head(2)
##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4

See documentation (either ?select and then go to the “Useful functions” section or to the website with the same info.)

Additional, less important function: pull

mtcars %>% pull(mpg)
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
# same as: mtcars$mpg
mtcars %>% select(mpg)
##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4

arrange()

Use arrange() to order rows by values of a column:

mtcars %>% 
  arrange(desc(disp)) %>% 
  select(mpg, disp, drat) %>%
  head(2)
##    mpg disp drat
## 1 10.4  472 2.93
## 2 10.4  460 3.00
# Base R:
drat_inds <- order(mtcars$drat, decreasing = TRUE)
head(mtcars[drat_inds, c("mpg", "disp", "drat")], 2)
##                mpg  disp drat
## Honda Civic   30.4  75.7 4.93
## Porsche 914-2 26.0 120.3 4.43

We can order by multiple columns too:

mtcars %>% 
  arrange(desc(gear), desc(hp)) %>%
  select(gear, hp, everything()) %>%
  head(8)
##   gear  hp  mpg cyl  disp drat    wt  qsec vs am carb
## 1    5 335 15.0   8 301.0 3.54 3.570 14.60  0  1    8
## 2    5 264 15.8   8 351.0 4.22 3.170 14.50  0  1    4
## 3    5 175 19.7   6 145.0 3.62 2.770 15.50  0  1    6
## 4    5 113 30.4   4  95.1 3.77 1.513 16.90  1  1    2
## 5    5  91 26.0   4 120.3 4.43 2.140 16.70  0  1    2
## 6    4 123 19.2   6 167.6 3.92 3.440 18.30  1  0    4
## 7    4 123 17.8   6 167.6 3.92 3.440 18.90  1  0    4
## 8    4 110 21.0   6 160.0 3.90 2.620 16.46  0  1    4

mutate()

Use mutate() when you want to create one or several columns:

mtcars <- mtcars %>% 
  mutate(hp_wt = hp/wt, 
         mpg_wt = mpg/wt) 

# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wt
mtcars <- mtcars %>% 
  mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)
##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt   mpg_wt
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     1 8.015267
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4     1 7.304348
# base R
mtcars$hp_wt <- 1
mtcars <- mtcars %>% 
  mutate(hp_wt_correct = hp/wt,
         hp_wt_cyl = hp_wt_correct/cyl) 
mtcars %>% head(2)
##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt   mpg_wt
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     1 8.015267
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4     1 7.304348
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812
# base R
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cyl

mutate_at()

Use mutate_at() when you want to apply a function to one or several columns:

# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)

mtcars <- mtcars %>%
  mutate_at(c("hp_wt", "mpg_wt"), log) 

# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)

Note: again, calling mutate_at() outputs a new data frame, it does not alter the given data frame, so to keep the column transformations, we have to reassign mtcars to be the output of the pipe!

rename()

Use rename() to easily rename columns:

mtcars %>% 
  rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
  head(2)
##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt_log mpg_wt_log
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4  1.318365  0.7330158
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4  1.293199  0.6873654
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812
# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)
##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt_log mpg_wt_log
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4  1.318365  0.7330158
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4  1.293199  0.6873654
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812

Important note

Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame

So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate() and mutate_at())

dplyr and SQL

Summary

References