Statistical Computing, 36-350
Wednesday, August 7, 2019
Advanced\(^2\) Tidyverse: Split-Apply-Combine
Recall from last time: `dplyr` and `tidyr` (and the `%>%` operator).

- `dplyr` has advanced functionality that mirrors SQL:
  - `slice()` and `filter()`: subset rows based on numbers or conditions
  - `select()` and `mutate()`: select columns or create/override columns
  - `group_by()`: create groups of rows according to a condition
  - `summarize()`: apply computations across groups of rows
- `tidyr` is a package for manipulating the structure of data frames:
  - `gather()`: make wide data longer
  - `spread()`: make long data wider

Today we will learn a general strategy that can be summarized in three conceptual steps:

- Split the data into meaningful chunks
- Apply a function to each chunk
- Combine the results
These are conceptual steps; often the apply and combine steps can be performed in multiple smaller steps (in the `tidyverse`'s readable fashion), and the combine step can also involve reformatting (leveraging `gather()` and `spread()`).
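As a quick reminder of these reshaping verbs, here is a minimal sketch on (hypothetical) toy data:

library(tidyr)
# toy data in "long" format: one row per (id, key) pair
long_df <- data.frame(id = c(1, 1, 2, 2),
                      key = c("a", "b", "a", "b"),
                      value = c(10, 20, 30, 40))
wide_df <- spread(long_df, key, value)  # wider: one row per id, with columns `a` and `b`
gather(wide_df, key, value, -id)        # and back to long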
Why split-apply-combine?

- Compared to `for()` loops, it often requires far less code and is more syntactically easy to read
- In the `tidyverse`, it goes much beyond the single-number `summarize()` that we've seen before

Data set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:
- `country`, `year`: country and year of data collection
- `strike.volume`: days on strike per 1000 workers
- `unemployment`: unemployment rate
- `inflation`: inflation rate
- `left.parliament`: leftwing share of the government
- `centralization`: centralization of unions
- `density`: density of unions

library(tidyverse)
strikes_df <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week6/strikes.csv")
dim(strikes_df) # Since 18 × 35 = 630, some years missing from some countries
## [1] 625 8
head(strikes_df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?
How could we approach this? We could write a `for()` loop, where we loop over countries; instead, we'll use the split-apply-combine strategy.

Step 0: design a function that will work on one chunk of data.
So let’s write code to do regression on the data from (say) just Italy
strikes_df_italy <- strikes_df %>% filter(country == "Italy") # Data for Italy
dim(strikes_df_italy)
## [1] 35 8
head(strikes_df_italy)
## country year strike.volume unemployment inflation left.parliament
## 1 Italy 1951 437 8.8 14.3 37.5
## 2 Italy 1952 337 9.5 1.9 37.5
## 3 Italy 1953 545 10.0 1.4 40.2
## 4 Italy 1954 493 8.7 2.4 40.2
## 5 Italy 1955 511 7.5 2.3 40.2
## 6 Italy 1956 372 9.3 3.4 40.2
## centralization density
## 1 0.2513799 NA
## 2 0.2489860 NA
## 3 0.2482739 NA
## 4 0.2466577 NA
## 5 0.2540366 NA
## 6 0.2457069 NA
italy_lm <- lm(strike.volume ~ left.parliament, data = strikes_df_italy)
summary(italy_lm)
##
## Call:
## lm(formula = strike.volume ~ left.parliament, data = strikes_df_italy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -930.2 -411.6 -137.3 387.2 1901.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -738.75 1200.62 -0.615 0.543
## left.parliament 40.29 27.76 1.451 0.156
##
## Residual standard error: 583.3 on 33 degrees of freedom
## Multiple R-squared: 0.05999, Adjusted R-squared: 0.0315
## F-statistic: 2.106 on 1 and 33 DF, p-value: 0.1562
ggplot(strikes_df_italy,
aes(x = left.parliament, y = strike.volume)) +
geom_jitter() +
geom_abline(intercept = coef(italy_lm)[1],
slope = coef(italy_lm)[2]) +
labs(title = "Italy strike volume versus leftwing alignment",
y = "Strike volume", x = "Leftwing alignment")
Now let’s turn this into a function
coef(italy_lm) # Old way
## (Intercept) left.parliament
## -738.74531 40.29109
Recalling that the `tidyverse` really likes data frames, we introduce a new package, `broom`:

- `broom` "tidies" up different objects (like models) into `data.frame`s using the `tidy()` function

my_strike_lm <- function(data) {
coef_table <- lm(strike.volume ~ left.parliament, data = data) %>%
summary() %>%
broom::tidy()
  # return a data frame (to preserve the term/estimate name structure)
  return(coef_table %>% select(term, estimate))
}
my_strike_lm(strikes_df_italy) # New way, same results in different format
## # A tibble: 2 x 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) -739.
## 2 left.parliament 40.3
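As a side note, applying `broom::tidy` to the model summary directly returns the full coefficient table (estimates, standard errors, t statistics, p-values); `my_strike_lm()` just selects two of its columns:

broom::tidy(summary(italy_lm)) # full table; our function keeps only `term` and `estimate`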
Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function `nest()` is often helpful: `df %>% group_by(my_factor) %>% nest()` splits the data frame `df` into several smaller data frames, defined by constant levels of the factor `my_factor`, and "nests" these data frames inside a list-column of the resulting tibble called `data`.

So we want to split `strikes_df` into 18 smaller data frames, each of which has the data for just one country:
strikes_by_country <-
strikes_df %>% group_by(country) %>%
nest()
class(strikes_by_country) # It's a tibble (just a fancy data.frame)
## [1] "tbl_df" "tbl" "data.frame"
names(strikes_by_country) # It has a new column called `data`
## [1] "country" "data"
head(strikes_by_country, 3)
## # A tibble: 3 x 2
## country data
## <fct> <list>
## 1 Australia <tibble [35 × 7]>
## 2 Austria <tibble [35 × 7]>
## 3 Belgium <tibble [30 × 7]>
head(strikes_by_country$data[strikes_by_country$country == "Italy"]) # Same as what we saw before
## [[1]]
## # A tibble: 35 x 7
## year strike.volume unemployment inflation left.parliament
## <int> <int> <dbl> <dbl> <dbl>
## 1 1951 437 8.8 14.3 37.5
## 2 1952 337 9.5 1.9 37.5
## 3 1953 545 10 1.4 40.2
## 4 1954 493 8.7 2.4 40.2
## 5 1955 511 7.5 2.3 40.2
## 6 1956 372 9.3 3.4 40.2
## 7 1957 407 7.4 1.1 40.2
## 8 1958 363 6.4 3.2 41.4
## 9 1959 792 5.4 0 41.4
## 10 1960 483 4 2.1 41.4
## # … with 25 more rows, and 2 more variables: centralization <dbl>,
## # density <dbl>
Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, we'll apply our function using `map` from `purrr`.

So we want to apply `my_strike_lm()` to each subset of the data in `strikes_by_country`. Think about what the output will be from each function call: a 2 x 2 tibble (one row each for the intercept and the slope):
nested <- strikes_df %>%
group_by(country) %>%
nest() %>%
mutate(coef = map(data, my_strike_lm))
head(nested, 3)
## # A tibble: 3 x 3
## country data coef
## <fct> <list> <list>
## 1 Australia <tibble [35 × 7]> <tibble [2 × 2]>
## 2 Austria <tibble [35 × 7]> <tibble [2 × 2]>
## 3 Belgium <tibble [30 × 7]> <tibble [2 × 2]>
Now, we’re really looking for a data frame that looks like the following:
country | (Intercept) | left.parliament |
---|---|---|
Austrialia | 415 | -0.86 |
Austria | 423 | -8.21 |
So we first just grab the columns of interest (country
and coef
) and then “unnest” the nested columns (which in this case is only coef
).
examine_coef <- nested %>% select(country, coef) %>%
unnest()
head(examine_coef, 4)
## # A tibble: 4 x 3
## country term estimate
## <fct> <chr> <dbl>
## 1 Australia (Intercept) 415.
## 2 Australia left.parliament -0.864
## 3 Austria (Intercept) 423.
## 4 Austria left.parliament -8.21
This is slightly different from our desired output, but `tidyr::spread` can help us out:
strikes_coef <- examine_coef %>% spread(term, estimate)
head(strikes_coef, 2)
## # A tibble: 2 x 3
## country `(Intercept)` left.parliament
## <fct> <dbl> <dbl>
## 1 Australia 415. -0.864
## 2 Austria 423. -8.21
Now that we’ve create the coefficient matrix, let’s visual the effect of a leftwing leaning parliment on the number of strikes:
ggplot(strikes_coef,
aes(x = forcats::fct_reorder(country, left.parliament),
#^ forcats is a great tool to reorder levels of factors
y = left.parliament)) +
geom_point() +
geom_hline(yintercept = 0, color = "grey") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "Country", y = "Leftwing alignment Coefficient")
As another example of the same pattern, let's summarize `left.parliament` within each country, using `summary()` (and `broom::tidy` to convert it to a `data.frame`).

# my inner function
summarize_left_parliament <- function(df) {
  broom::tidy(summary(df$left.parliament))
}

# with the nested data
summary_lp_nest <- nested %>%
mutate(sum_lp = map(data, summarize_left_parliament))
summary_lp_nest %>% head(2)
## # A tibble: 2 x 4
## country data coef sum_lp
## <fct> <list> <list> <list>
## 1 Australia <tibble [35 × 7]> <tibble [2 × 2]> <tibble [1 × 6]>
## 2 Austria <tibble [35 × 7]> <tibble [2 × 2]> <tibble [1 × 6]>
# creating the desired data frame.
summary_lp_unnested <- summary_lp_nest %>% select(country, sum_lp) %>%
unnest
summary_lp_unnested %>% head(5)
## # A tibble: 5 x 7
## country minimum q1 median mean q3 maximum
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 28.3 36.9 41 41.9 47.2 60
## 2 Austria 43.6 46.7 47.9 48.4 50.8 51.9
## 3 Belgium 29.3 30.2 33 35.6 41.6 42.5
## 4 Canada 21.5 51.3 59 55.3 67 78.7
## 5 Denmark 36 44 46.9 46.4 48.8 50.9
Other ways to implement split-apply-combine:

- Use `split` to put data frames into a list and try doing things like `lapply` (see lectures like Ryan Tibshirani's - search "split-apply-combine"); a minimal sketch follows this list
- Use `for`-loops (and think about how you need to save your data)
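Here is that minimal sketch of the base-R `split`/`lapply` version of the same pipeline, reusing `my_strike_lm()` from above:

strikes_list <- split(strikes_df, strikes_df$country) # split: one data frame per country
coef_list <- lapply(strikes_list, my_strike_lm)       # apply: fit the regression on each chunk
coef_df <- do.call(rbind, coef_list)                  # combine: stack the per-country results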
Parallel Computing

Good candidates for parallelization are `for` loops whose iterations do not require the previous iteration to be computed before the next step.

Much R code runs fast and fine on a single processor. But at times, computations can be:

- cpu-bound: they take too much CPU time
- memory-bound: they take too much memory
To help with cpu-bound computations, one can take advantage of modern processor architectures that provide multiple cores on a single processor, and thereby enable multiple computations to take place at the same time. In addition, some machines ship with multiple processors, allowing large computations to occur across the entire cluster of those computers. Plus, these machines also have large amounts of memory to avoid memory-bound computing jobs.
Amdahl’s Law: where the speedup of the computation (y-axis) is a function of both the number of cores (x-axis) and the proportion of the computation that can be parallelized (see colored lines):
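Formally, if a fraction \(p\) of the computation can be parallelized across \(n\) cores, Amdahl's Law gives the speedup as \(S(n) = \frac{1}{(1-p) + p/n}\). For example, with \(p = 0.75\) and \(n = 4\) we get \(S(4) = 1/(0.25 + 0.1875) \approx 2.3\), and no matter how many cores we add, the speedup is capped at \(1/(1-p) = 4\).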
The `parallel` library (`mclapply`)

The first package we will look at is `parallel`, which includes the function `mclapply()`, which can be used very similarly to `lapply()`:
library(parallel)
numCores <- detectCores()
numCores # number of cores on my computer :)
## [1] 12
set.seed(1)
my_k <- rep(1:14, each = 20)
my_simulation <- function(k) {
x <- runif(30,-3, 3)
y <- 2*x + 2* rnorm(n = 30)
x0 <- runif(10000,-3, 3)
y0 <- 2*x0 + 2*rnorm(n = 10000)
my_lm <- lm(y ~ poly(x,k))
y0_hat <- predict(my_lm, data.frame(x = x0))
return(mean((y0-y0_hat)^2))
}
# without parallelizing :
system.time({
results <- lapply(my_k, my_simulation)
my_data <- data.frame(k = my_k,
values = unlist(results))
})
## user system elapsed
## 1.837 0.647 2.484
# with:
system.time({
results <- mclapply(my_k, my_simulation, mc.cores = numCores)
my_data <- data.frame(k = my_k,
values = unlist(results))
})
## user system elapsed
## 2.823 1.890 0.563
ggplot(my_data, aes(x = k, y = values)) +
geom_point(alpha = .1) +
labs(x = "Polynomial Fit",
y = "Test Error") +
geom_smooth() +
ylim(0, 40)
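One caveat: `mclapply` relies on forking, which is unavailable on Windows (there `mc.cores` must be 1). A sketch of a Windows-friendly alternative from the same `parallel` package, using a socket ("PSOCK") cluster:

cl <- makeCluster(numCores)                   # launch a cluster of worker processes
results <- parLapply(cl, my_k, my_simulation) # lapply-style apply across the workers
stopCluster(cl)                               # shut the workers down when finished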
Resource: the `parallel` package manual

The `foreach` and `doParallel` libraries (`foreach() %dopar%`)

`foreach` and `doParallel` provide parallelization that mirrors `for` loops:
#base for-loop
system.time({
storage_f <- c()
for (k in my_k) {
storage_f <- c(storage_f, my_simulation(k))
}
})
## user system elapsed
## 1.809 0.651 2.463
#single-core `foreach`
library(foreach)
system.time({
storage_fe <- foreach (k=my_k, .combine=c) %do% {
my_simulation(k)
}
})
## user system elapsed
## 1.874 0.640 2.516
#parallel `foreach`
library(doParallel)
registerDoParallel(numCores) # use multicore, set to the number of our cores
system.time({
storage_p <- foreach (k=my_k, .combine=c) %dopar% {
my_simulation(k)
}
})
## user system elapsed
## 3.484 2.502 0.610
# to stop implicitly using the clusters
stopImplicitCluster()
`foreach` has a specific parameter `.combine` which combines results in a set of different ways (you see `.combine=c` above, but it can also be `.combine=rbind` or `.combine=cbind`). By default, the results are returned as a list.
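For instance, a small sketch of `.combine=rbind` on a toy computation, where each iteration returns a named vector and the results stack into a matrix:

library(foreach)
res <- foreach (k = 1:3, .combine = rbind) %do% {
  c(k = k, k_squared = k^2) # one named vector per iteration
}
res # a 3 x 2 matrix, one row per iteration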
Resource: doParallel
+ foreach
vignette
Recap of the two approaches:

- the `parallel` library (`mclapply()`, similar to `lapply()`)
- the `foreach` library + `doParallel` (`foreach() %dopar%`, similar to `for` loops)

Deep Learning in R
Common activation functions:

Name | Mathematics
---|---
logistic | \(f(x) = 1/(1+e^{-x})\)
reLU | \(f(x) = x \, \mathbb{I}(x > 0)\)
softmax | \(f(\overset{\to}{x})_i = \frac{e^{x_i}}{\sum_{j=1}^J e^{x_j}}\)
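To make the formulas concrete, here is a minimal base-R sketch of each activation (the function names are just for illustration):

logistic <- function(x) 1 / (1 + exp(-x))
relu     <- function(x) x * (x > 0)            # zero out negative inputs
softmax  <- function(x) exp(x) / sum(exp(x))   # a vector of probabilities summing to 1
logistic(0)          # 0.5
relu(c(-2, 3))       # 0 3
softmax(c(1, 1, 2))  # the largest input gets the largest probability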
We'll use the `keras` package (the useR group just had a session from Dick's Sporting Goods about how they use it):

library(keras)
# install_keras() # installs backend to Tensorflow
*I’m basically going to walk you through a basic example (that is found at https://keras.rstudio.com/)*
Motivation: MNIST
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
head(y_train, 4); length(y_train)
## [1] 5 0 4 1
## [1] 60000
library(ggDiagnose) # just for quick viz
library(gridExtra)
grid.arrange(grobs = lapply(1:4,
function(idx) {ggDiagnose.matrix(
t(x_train[idx,seq(28,1),]),
return = T,
show.plot = F,
type = "image")$ggout +
theme_void() + theme(legend.position = "none") +
labs(x = "", y = "") +
scale_fill_gradient(low = "white", high = "black")}),
nrow = 1)
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale
x_train <- x_train / 255
x_test <- x_test / 255
# to categorical (create dummy columns)
y_train_c <- to_categorical(y_train, 10)
y_test_c <- to_categorical(y_test, 10)
head(y_train_c, 2); dim(y_train_c)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 1 0 0 0 0
## [2,] 1 0 0 0 0 0 0 0 0 0
## [1] 60000 10
Defining the structure
model <- keras_model_sequential() # a standard feed-forward NN (as opposed to, e.g., an RNN or CNN)
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 10, activation = 'softmax')
summary(model)
## Model: "sequential"
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense (Dense) (None, 256) 200960
## ___________________________________________________________________________
## dropout (Dropout) (None, 256) 0
## ___________________________________________________________________________
## dense_1 (Dense) (None, 128) 32896
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 10) 1290
## ===========================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ___________________________________________________________________________
Compiling model
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
history <- model %>% fit(
x_train, y_train_c,
epochs = 30, batch_size = 128,
validation_split = 0.2, verbose = T)
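To inspect training, the returned `history` object can be plotted; `keras` provides a `plot` method showing loss and accuracy per epoch, for both the training and validation splits:

plot(history)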
Predict labels
yhat <- model %>% predict_classes(x_test)
# accuracy
mean(y_test == yhat)
## [1] 0.9799
Interpreting layers

This model doesn't have really descriptive layers, but see "Visualizing intermediate activation in CNN with Keras".