Last week: Tidyverse II: tidyr

Part I

Simulation basics

Why simulate?

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

Sampling from a given vector

To sample from a given vector, use sample()

sample(x = letters, size = 10) # Without replacement, the default
##  [1] "f" "t" "q" "p" "a" "z" "g" "j" "w" "n"
sample(x = c(0,1), size = 10, replace = TRUE) # With replacement
##  [1] 0 1 0 0 1 0 0 0 1 0
sample(x = 10) # Arguments set as x=1:10, size=10, replace=FALSE
##  [1]  5  8  4  7  2  3  1  6  9 10

Random number generation

To sample from a normal distribution, we have the utility functions:

Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.

Random number examples

Standard normal random variables (mean 0 and variance 1)

n <- 200
z <- rnorm(n, mean = 0, sd = 1) # These are the defaults for mean, sd
mean(z)  # Check: sample mean is approximately 0
## [1] 0.006089724
var(z)   # Check: sample variance is approximately 1
## [1] 1.092156

Estimated distribution function

To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function—use ecdf()

library(tidyverse)
ecdf_fun <- ecdf(z) # Create the ECDF
class(ecdf_fun) # It's a function!
## [1] "ecdf"     "stepfun"  "function"
ecdf_fun(0)
## [1] 0.495
# We can plot it 

gaussian_data <- data.frame(z = sort(z)) %>%
  mutate(ecdf = ecdf_fun(z),
         true = pnorm(z)) 

gaussian_data_lines <- gaussian_data %>%
  gather("line", "value", 
         ecdf, true, -z) %>%
  mutate(line = forcats::fct_recode(line, 
                                    Empirical = "ecdf", 
                                    Actual = "true"))

ggplot(gaussian_data_lines, aes(x = z, y = value, color = line)) +
  geom_line() +
  labs(y = "CDF",
       title = "ECDF",
       color = "")

Estimated density function

To compute histogram —a basic estimator of the density based on binning—use hist()

hist_obj <- hist(z, breaks = 30, plot = FALSE) # base R function
class(hist_obj) # It's a list
## [1] "histogram"
hist_obj$breaks # These are the break points that were used
##  [1] -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6
## [15] -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2
## [29]  2.4  2.6  2.8
hist_obj$density # These are the estimated probabilities
##  [1] 0.025 0.025 0.000 0.025 0.000 0.075 0.100 0.050 0.175 0.150 0.150
## [12] 0.250 0.275 0.400 0.425 0.350 0.375 0.350 0.425 0.400 0.150 0.200
## [23] 0.175 0.100 0.125 0.075 0.025 0.050 0.025 0.050
# We can plot it
gaussian_data <- gaussian_data %>%
  mutate(PDF = dnorm(z))

ggplot(gaussian_data, aes(x = z, y = ..density..)) + 
  geom_histogram(bins = 30, 
                 fill = "lightblue",
                 col = "black") +
  geom_line(aes(y = PDF)) +
  labs(title = "True vs Empirical Distribution")


hist_data <- data.frame(breaks = hist_obj$breaks[1:(length(hist_obj$breaks) - 1)] + .1,
                        density = hist_obj$density)

ggplot(hist_data, aes(x = breaks, y = density)) +
  geom_histogram(stat = "identity")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Tangent on estimation (did you know?)

Interesting statistical facts:

Part II

Pseudorandomness and seeds

Same function call, different results

Not surprisingly, we get different draws each time we call rnorm()

mean(rnorm(n))
## [1] -0.06128671
mean(rnorm(n))
## [1] 0.06798459
mean(rnorm(n))
## [1] 0.0256657
mean(rnorm(n))
## [1] -0.07122027

Is it really random?

Random numbers generated in R (in any language) are not “truly” random; they are what we call pseudorandom

Setting the random seed

All pseudorandom number generators depend on what is called a seed value

Seed examples

# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414

# Different seeds, different numbers
set.seed(1); rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(2); rnorm(5)
## [1] -0.89691455  0.18484918  1.58784533 -1.13037567 -0.08025176
set.seed(3); rnorm(5)
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828

# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995

Part III

Iteration and simulation

Drug effect model

What would you do?

What would you do if you had such a model, and your scientist collaborators asked you: how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?

So, let’s simulate!

# Simulate, supposing 60 subjects in each group 
set.seed(0)
n <- 60 
mu_drug <- 2
mu_nodrug <- runif(n, min = 0, max = 1)
x_drug <- 100 * rexp(n, rate = 1/mu_drug) 
x_nodrug <- 100 * rexp(n, rate = 1/mu_nodrug)

sim_data <- data.frame(x = c(x_drug, x_nodrug),
                       class = factor(rep(c("Drug","No Drug"), each = 60)))
                          
sim_data %>%
  ggplot(aes(x = x, y = ..density.., color = class)) +
  geom_histogram(aes(fill = class), 
                 alpha = .5, color = "black",
                 position = "identity") +
  geom_density(aes(color = class))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Kernel Density Estimates


If we just want to calculate the a density function we can change the “spread” by changing the bandwidth (bw). For ggplot we can access this by using the stat_density instead of geom_density:

library(gridExtra)

gg_vis <- list()
for (bw in c(5,10,30)) {
  gg_vis[[as.character(bw)]] <- sim_data %>%
  ggplot(aes(x = x, color = class)) +
  geom_density(aes(color = class), bw = bw) +
  geom_rug() +
  labs(title = bw)
}

grid.arrange(grobs = gg_vis, ncol = 1)

Repetition and reproducibility

Iteration and simulation (and functions): good friends

Code sketch

Consider the code below for a generic simulation. Think about how you would frame this for the drug effect example, which you’ll revisit in lab

# Function to do one simulation run
one_sim <- function(param1, param2 = value2, param3 = value3) {
  # Possibly intricate simulation code goes here
}

# Function to do repeated simulation runs
rep_sim <- function(nreps, param1, param2 = value2, 
                   param3 = value3, seed = NULL) {
  # Set the seed, if we need to
  if(!is.null(seed)) set.seed(seed)
  
  # Run the simulation over and over
  sim_objs <- vector(length = nreps, mode = "list")
  for (r in 1:nreps) {
    sim_objs[r] <- one_sim(param1, param2, param3)
  }
  
  # Aggregate the results somehow, and then return something
}

Part IV

Reading in and writing out data

Saving results

Sometimes simulations take a long time to run, and we want to save intermediate or final output, for quick reference later

There two different ways of saving things from R (there are more than two, but here are two useful ones):

Note: there is a big difference between how these two treat variable names

Loading results

Corresponding to the two different ways of saving, we have two ways of loading things into R:

Commentary on R based saving and loading

Advantage: these can be a lot more memory efficient than other tools

Disadvantage: they’re limited in scope in the sense that they can only communicate with R

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:

This week we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively

read.table(), read.csv()

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:

The function read.csv() is just a shortcut for using read.table() with sep=",". (But note: these two actually differ on some of the default inputs!)

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to TRUE in read.table() but FALSE in read.csv())

Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table()

Example of reading in data

shark_attacks <- read.table(
  file="https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week2/shark-attacks-clean.csv", 
  sep = ",", header = TRUE)
head(shark_attacks)
##   X  Case.Number       Date       Type   Country              Area
## 1 1 2016.09.18.c 2016-09-18 Unprovoked       USA           Florida
## 2 2 2016.09.18.b 2016-09-18 Unprovoked       USA           Florida
## 3 3 2016.09.18.a 2016-09-18 Unprovoked       USA           Florida
## 4 4   2016.09.17 2016-09-17 Unprovoked AUSTRALIA          Victoria
## 5 5   2016.09.15 2016-09-16 Unprovoked AUSTRALIA          Victoria
## 6 6 2016.09.15.R 2016-09-15       Boat AUSTRALIA Western Australia
##                           Location Activity                   Name Sex Age
## 1 New Smyrna Beach, Volusia County  Surfing                   male   M  24
## 2 New Smyrna Beach, Volusia County  Surfing         Chucky Luciano   M  74
## 3 New Smyrna Beach, Volusia County  Surfing                   male   M  85
## 4                 Thirteenth Beach  Surfing        Rory Angiolella   M  -5
## 5                      Bells Beach  Surfing                   male   M  -5
## 6                          Bunbury  Fishing Occupant: Ben Stratton   U  -5
##                                     Injury  Time   Species
## 1                    Minor injury to thigh 13h00      <NA>
## 2                     Lacerations to hands 11h00      <NA>
## 3                 Lacerations to lower leg 10h43      <NA>
## 4             Struck by fin on chest & leg  <NA>      <NA>
## 5    No injury: Knocked off board by shark  <NA> 2 m shark
## 6 Shark rammed boat. No injury to occupant  <NA>      <NA>
##        Investigator.or.Source Fatal SharkLength
## 1 Orlando Sentinel, 9/19/2016     N          10
## 2 Orlando Sentinel, 9/19/2016     N           9
## 3 Orlando Sentinel, 9/19/2016     N           5
## 4          The Age, 9/18/2016     N           8
## 5          The Age, 9/16/2016     N           9
## 6  West Australian, 9/15/2016     N          10
sapply(shark_attacks, class)
##                      X            Case.Number                   Date 
##              "integer"               "factor"               "factor" 
##                   Type                Country                   Area 
##               "factor"               "factor"               "factor" 
##               Location               Activity                   Name 
##               "factor"               "factor"               "factor" 
##                    Sex                    Age                 Injury 
##               "factor"              "integer"               "factor" 
##                   Time                Species Investigator.or.Source 
##               "factor"               "factor"               "factor" 
##                  Fatal            SharkLength 
##               "factor"              "integer"

write.table(), write.csv()

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:

Note that quote = FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote = TRUE). Also, setting row.names = FALSE and col.names = FALSE will disable the printing of row and column names (defaults are row.names = TRUE and col.names = TRUE)

Summary