Summary:
Welcome to the tidyverse
tidyverse
is a coherent collection of packages for data scienceggplot2
, plyr
)dplyr
, tidyr
, tibble
, readr
ggplot2
purrr
tidyverse
package, we can see what’s in there:library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
%>%
operator) to fluidly glue functionality togetherdplyr
and tidyr
are going to be our main workhorses for data wrangling%>%
dplyr
verbsdplyr
functions are analogous to SQL counterparts, so learn dplyr
and get SQL for free!R
data science communityMastering the pipe
%>%
. shortcut: use ctrl + shift + m
as RStudio shortcutmagrittr
package (automatically included in tidyverse
)Piping at its most basic level:
Take one return value and automatically feed it in as an input to another function, to form a flow of results
In unix and related systems, we also have pipes, as in:
ls -l | grep tidy | wc -l
## 11
Passing a single argument through pipes, we interpret something like:
x %>% f %>% g %>% h
as h(g(f(x)))
Key takeaway: in your mind, when you see %>%
, read this as “and then”
We can write exp(1)
with pipes as 1 %>% exp()
, and log(exp(1))
as 1 %>% exp() %>% log()
exp(1)
## [1] 2.718282
1 %>% exp()
## [1] 2.718282
1 %>% exp() %>% log()
## [1] 1
Now for multi-arguments functions, we interpret something like:
x %>% f(y)
as f(x,y)
mtcars %>% # data.frame
head(4)
And what’s the “old school” (base R) way?
head(mtcars, 4)
Notice that, with pipes: - Your code is more readable (arguably) - You can run partial commands more easily
The command x %>% f(y)
can be equivalently written in dot notation as:
x %>% f(., y)
What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:
x %>% f(y, .)
which is equivalent to f(y,x)
Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:
state_df <- data.frame(state.x77)
state.region %>%
tolower %>%
tapply(state_df$Income, ., summary)
A more complicated example:
x <- "Ben really loves piping"
x %>%
strsplit(split = " ") %>%
.[[1]] %>% # subsetting
nchar() %>%
max # note the lack of paretheses
## [1] 6
Mastering the dyplr
verbs
dplyr
verbsOur dplyr
journey starts of with learning the following verbs (functions):
slice()
: subset rows based on integer indexingfilter()
: subset rows based on logical criteriaselect()
: select certain columnspull()
: pull out a individual columnarrange()
: order rows by value of a columnrename()
: rename columnsmutate()
: create new columnsmutate_at()
: apply a function to given columnsKey takeaway: think of data frames as nouns and dplyr
verbs as actions that you apply to manipulate them—especially natural when using pipes
slice()
Use slice()
when you want to indicate certain row numbers need to be kept:
mtcars %>%
slice(c(7,8,14:15))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## 2 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## 3 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## 4 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
# Base R:
mtcars[c(7,8,14:15),]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
We can also do negative slicing:
mtcars %>%
slice(-c(1:2,19:23)) %>%
nrow()
## [1] 25
# Base R:
nrow(mtcars[-c(1:2,19:23),])
## [1] 25
filter()
Use filter()
when you want to subset rows based on logical conditions:
mtcars %>%
filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>%
head(2) # note rownames are silenced
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Base R:
head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3),], 2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
select()
Use select()
when you want to pick out certain columns:
mtcars %>%
select(cyl, disp, hp) %>%
head(2)
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
select()
helpersVery handy selections using dplyr
helper functions:
mtcars %>%
select(starts_with("d")) %>%
head(2)
## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9
# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9
We can do many other things as well:
mtcars %>% select(ends_with('t')) %>% head(2)
## drat wt
## Mazda RX4 3.9 2.620
## Mazda RX4 Wag 3.9 2.875
mtcars %>% select(ends_with('yl')) %>% head(2)
## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6
mtcars %>% select(contains('ar')) %>% head(2)
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
See documentation (either ?select
and then go to the “Useful functions” section or to the website with the same info.)
pull
pull
select
preserves column structure even with a single columnmtcars %>% pull(mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
# same as: mtcars$mpg
mtcars %>% select(mpg)
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
arrange()
Use arrange()
to order rows by values of a column:
mtcars %>%
arrange(desc(disp)) %>%
select(mpg, disp, drat) %>%
head(2)
## mpg disp drat
## 1 10.4 472 2.93
## 2 10.4 460 3.00
# Base R:
drat_inds <- order(mtcars$drat, decreasing = TRUE)
head(mtcars[drat_inds, c("mpg", "disp", "drat")], 2)
## mpg disp drat
## Honda Civic 30.4 75.7 4.93
## Porsche 914-2 26.0 120.3 4.43
We can order by multiple columns too:
mtcars %>%
arrange(desc(gear), desc(hp)) %>%
select(gear, hp, everything()) %>%
head(8)
## gear hp mpg cyl disp drat wt qsec vs am carb
## 1 5 335 15.0 8 301.0 3.54 3.570 14.60 0 1 8
## 2 5 264 15.8 8 351.0 4.22 3.170 14.50 0 1 4
## 3 5 175 19.7 6 145.0 3.62 2.770 15.50 0 1 6
## 4 5 113 30.4 4 95.1 3.77 1.513 16.90 1 1 2
## 5 5 91 26.0 4 120.3 4.43 2.140 16.70 0 1 2
## 6 4 123 19.2 6 167.6 3.92 3.440 18.30 1 0 4
## 7 4 123 17.8 6 167.6 3.92 3.440 18.90 1 0 4
## 8 4 110 21.0 6 160.0 3.90 2.620 16.46 0 1 4
mutate()
Use mutate()
when you want to create one or several columns:
mtcars <- mtcars %>%
mutate(hp_wt = hp/wt,
mpg_wt = mpg/wt)
# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wt
data.frame
with updated/ added columnsmtcars <- mtcars %>%
mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
# base R
mtcars$hp_wt <- 1
mtcars <- mtcars %>%
mutate(hp_wt_correct = hp/wt,
hp_wt_cyl = hp_wt_correct/cyl)
mtcars %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
## hp_wt_correct hp_wt_cyl
## 1 41.98473 6.997455
## 2 38.26087 6.376812
# base R
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cyl
mutate_at()
Use mutate_at()
when you want to apply a function to one or several columns:
# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)
mtcars <- mtcars %>%
mutate_at(c("hp_wt", "mpg_wt"), log)
# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)
Note: again, calling mutate_at()
outputs a new data frame, it does not alter the given data frame, so to keep the column transformations, we have to reassign mtcars
to be the output of the pipe!
rename()
Use rename()
to easily rename columns:
mtcars %>%
rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log mpg_wt_log
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365 0.7330158
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199 0.6873654
## hp_wt_correct hp_wt_cyl
## 1 41.98473 6.997455
## 2 38.26087 6.376812
# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log mpg_wt_log
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365 0.7330158
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199 0.6873654
## hp_wt_correct hp_wt_cyl
## 1 41.98473 6.997455
## 2 38.26087 6.376812
Calling dplyr
verbs always outputs a new data frame, it does not alter the existing data frame
So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate()
and mutate_at()
)
dplyr
and SQLdplyr
you should find SQL very natural, and vice versa!select
is SELECT
, filter
is WHERE
, arrange
is ORDER BY
etc.group_by()
and summarize()
, which are used to aggregate data (next lecture)left_join()
and inner_join()
verbs (also next lecture)%>%
operator)dplyr
is a package for data wrangling, with several key verbs (functions)slice()
and filter()
: subset rows based on numbers or conditionsselect()
and pull
: select columns or pull out as single column vectorarrange()
: order rows by one or multiple columnsrename()
and mutate()
: rename or create columnsmutate_at()
: apply a function to given columns