Last week: Tidyverse II: tidyr

Part I

Simulation basics

Why simulate?

R gives us access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

Sampling from a given vector

To sample from a given vector, use sample()

sample(x = letters, size = 10) # Without replacement, the default
##  [1] "f" "t" "q" "p" "a" "z" "g" "j" "w" "n"
sample(x = c(0,1), size = 10, replace = TRUE) # With replacement
##  [1] 0 1 0 0 1 0 0 0 1 0
sample(x = 10) # Arguments set as x=1:10, size=10, replace=FALSE
##  [1]  5  8  4  7  2  3  1  6  9 10
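
sample() also accepts a prob argument, which lets us sample with unequal probabilities (a small extra example beyond those above):

sample(x = c(0, 1), size = 10, replace = TRUE, prob = c(0.9, 0.1)) # 0s should dominate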

Random number generation

To sample from a normal distribution, we have the following utility functions:

rnorm(): generate normal random variables
dnorm(): evaluate the normal density function
pnorm(): evaluate the normal distribution function (CDF)
qnorm(): evaluate the normal quantile function (inverse CDF)

Replace “norm” with the name of another distribution and all the same functions apply, e.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.
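
For example, for the standard normal (a small illustration we add here; values noted in the comments are approximate):

rnorm(5)     # 5 random draws from N(0, 1)
dnorm(0)     # Density at 0, equal to 1/sqrt(2*pi), about 0.399
pnorm(1.96)  # P(Z <= 1.96), about 0.975
qnorm(0.975) # The 0.975 quantile, about 1.96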

Random number examples

Standard normal random variables (mean 0 and variance 1)

n <- 200
z <- rnorm(n, mean = 0, sd = 1) # These are the defaults for mean, sd
mean(z)  # Check: sample mean is approximately 0
## [1] 0.006089724
var(z)   # Check: sample variance is approximately 1
## [1] 1.092156
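
In the same spirit, we can compare other sample quantities to their theoretical counterparts (a small extra check, not part of the original chunk):

mean(z <= 1.64) # Sample proportion of draws at or below 1.64
pnorm(1.64)     # True probability P(Z <= 1.64), about 0.95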

Estimated distribution function

To compute the empirical cumulative distribution function (ECDF), the standard estimator of the cumulative distribution function, use ecdf()

library(tidyverse)
ecdf_fun <- ecdf(z) # Create the ECDF
class(ecdf_fun) # It's a function!
## [1] "ecdf"     "stepfun"  "function"
ecdf_fun(0)
## [1] 0.495
# We can plot it 

gaussian_data <- data.frame(z = sort(z)) %>%
  mutate(ecdf = ecdf_fun(z),
         true = pnorm(z)) 

gaussian_data_lines <- gaussian_data %>%
  gather("line", "value", 
         ecdf, true, -z) %>%
  mutate(line = forcats::fct_recode(line, 
                                    Empirical = "ecdf", 
                                    Actual = "true"))

ggplot(gaussian_data_lines, aes(x = z, y = value, color = line)) +
  geom_line() +
  labs(y = "CDF",
       title = "ECDF",
       color = "")

Estimated density function

To compute a histogram, a basic estimator of the density based on binning, use hist()

hist_obj <- hist(z, breaks = 30, plot = FALSE) # base R function
class(hist_obj) # It's a list
## [1] "histogram"
hist_obj$breaks # These are the break points that were used
##  [1] -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6
## [15] -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2
## [29]  2.4  2.6  2.8
hist_obj$density # These are the estimated density values (probability per unit length, not probabilities)
##  [1] 0.025 0.025 0.000 0.025 0.000 0.075 0.100 0.050 0.175 0.150 0.150
## [12] 0.250 0.275 0.400 0.425 0.350 0.375 0.350 0.425 0.400 0.150 0.200
## [23] 0.175 0.100 0.125 0.075 0.025 0.050 0.025 0.050
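
Since these are density values rather than probabilities, they integrate to 1 across the bins (a quick sanity check we add here):

sum(hist_obj$density * diff(hist_obj$breaks)) # Total area of the histogram, should be (essentially) 1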
# We can plot it
gaussian_data <- gaussian_data %>%
  mutate(PDF = dnorm(z))

ggplot(gaussian_data, aes(x = z, y = ..density..)) + 
  geom_histogram(bins = 30, 
                 fill = "lightblue",
                 col = "black") +
  geom_line(aes(y = PDF)) +
  labs(title = "True vs Empirical Distribution")


hist_data <- data.frame(breaks = hist_obj$breaks[1:(length(hist_obj$breaks) - 1)] + 0.1, # left endpoints + half the bin width = bin midpoints
                        density = hist_obj$density)

ggplot(hist_data, aes(x = breaks, y = density)) +
  geom_col() # geom_col() plots precomputed heights directly, avoiding the stat = "identity" warning from geom_histogram
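
Beyond binning, R also offers a smoother estimator of the density, the kernel density estimate, via density() (a brief sketch; this goes slightly beyond the histogram example above):

dens_obj <- density(z)                   # kernel density estimate with the default bandwidth
plot(dens_obj, main = "Kernel density estimate of z")
curve(dnorm(x), add = TRUE, col = "red") # overlay the true N(0, 1) density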

Tangent on estimation (did you know?)

Interesting statistical facts:

Part II

Pseudorandomness and seeds

Same function call, different results

Not surprisingly, we get different draws each time we call rnorm()

mean(rnorm(n))
## [1] -0.06128671
mean(rnorm(n))
## [1] 0.06798459
mean(rnorm(n))
## [1] 0.0256657
mean(rnorm(n))
## [1] -0.07122027

Is it really random?

Random numbers generated in R (as in any programming language) are not “truly” random; they are what we call pseudorandom numbers
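
We can ask R which pseudorandom generator it is currently using (a quick aside; in recent versions of R the default for uniform draws is the Mersenne Twister):

RNGkind() # Reports the generators used for uniform draws, normal draws, and discrete sampling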

Setting the random seed

All pseudorandom number generators depend on what is called a seed value. Setting the seed to the same value with set.seed() reproduces the exact same stream of “random” numbers, which is what makes simulations reproducible; the examples below demonstrate this

Seed examples

# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414

# Different seeds, different numbers
set.seed(1); rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(2); rnorm(5)
## [1] -0.89691455  0.18484918  1.58784533 -1.13037567 -0.08025176
set.seed(3); rnorm(5)
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828

# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995

Part III

Iteration and simulation

Drug effect model

Reading off the simulation code below: we model the reduction in tumor size for a patient in the drug group as 100 times an exponential random variable with mean mu_drug = 2, and for a patient in the no drug group as 100 times an exponential with mean mu_nodrug, where mu_nodrug is itself drawn uniformly from [0, 1] for each patient.

What would you do?

What would you do if you had such a model, and your scientist collaborators asked you: how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?

So, let’s simulate!

# Simulate, supposing 60 subjects in each group 
set.seed(0)
n <- 60 
mu_drug <- 2
mu_nodrug <- runif(n, min = 0, max = 1)
x_drug <- 100 * rexp(n, rate = 1/mu_drug) 
x_nodrug <- 100 * rexp(n, rate = 1/mu_nodrug)

sim_data <- data.frame(x = c(x_drug, x_nodrug),
                       class = factor(rep(c("Drug", "No Drug"), each = n)))
                          
sim_data %>%
  ggplot(aes(x = x, y = ..density.., color = class)) +
  geom_histogram(aes(fill = class), 
                 alpha = .5, color = "black",
                 position = "identity") +
  geom_density(aes(color = class))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
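
To move toward answering the sample-size question, we would repeat this simulation many times and record how often the drug group looks better. A minimal sketch of that idea (the helper one_trial() and the simple rule of comparing group means are our own illustration, not necessarily what comes next in the lecture):

one_trial <- function(n, mu_drug = 2) {
  # Simulate one trial with n subjects per group; return TRUE if the drug
  # group's average reduction exceeds the no drug group's average reduction
  mu_nodrug <- runif(n, min = 0, max = 1)
  x_drug <- 100 * rexp(n, rate = 1 / mu_drug)
  x_nodrug <- 100 * rexp(n, rate = 1 / mu_nodrug)
  mean(x_drug) > mean(x_nodrug)
}

set.seed(0)
mean(replicate(1000, one_trial(n = 60))) # Proportion of simulated trials where the drug group wins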