`tidyr`

`tidyr`

is a package for manipulating the structure of data frames`gather()`

: make wide data longer`spread()`

: make long data wider`unite()`

and`separate()`

: combine or split columns`dplyr`

has advanced functionality that mirrors SQL`group_by()`

: create groups of rows according to a condition`summarize()`

: apply computations across groups of rows`*_join()`

where`*`

=`inner`

,`left`

,`right`

, or`full`

: join two data frames together according to common values in certain columns, and`*`

indicates how many rows to keep

*Simulation basics*

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

- Often, simulations can be easier than hand calculations
- Often, simulations can be made more realistic than hand calculations

To sample from a given vector, use `sample()`

`sample(x = letters, size = 10) # Without replacement, the default`

`## [1] "f" "t" "q" "p" "a" "z" "g" "j" "w" "n"`

`sample(x = c(0,1), size = 10, replace = TRUE) # With replacement`

`## [1] 0 1 0 0 1 0 0 0 1 0`

`sample(x = 10) # Arguments set as x=1:10, size=10, replace=FALSE`

`## [1] 5 8 4 7 2 3 1 6 9 10`

To sample from a normal distribution, we have the utility functions:

`rnorm()`

: generate normal random variables`pnorm()`

: normal distribution function, \(\Phi(x)=P(Z \leq x)\)`dnorm()`

: normal density function, \(\phi(x)= \Phi'(x)\)`qnorm()`

: normal quantile function, \(q(y)=\Phi^{-1}(y)\), i.e., \(\Phi(q(y))=y\)

Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.

Standard normal random variables (mean 0 and variance 1)

```
n <- 200
z <- rnorm(n, mean = 0, sd = 1) # These are the defaults for mean, sd
mean(z) # Check: sample mean is approximately 0
```

`## [1] 0.006089724`

`var(z) # Check: sample variance is approximately 1`

`## [1] 1.092156`

To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function—use `ecdf()`

```
library(tidyverse)
ecdf_fun <- ecdf(z) # Create the ECDF
class(ecdf_fun) # It's a function!
```

`## [1] "ecdf" "stepfun" "function"`

`ecdf_fun(0)`

`## [1] 0.495`

```
# We can plot it
gaussian_data <- data.frame(z = sort(z)) %>%
mutate(ecdf = ecdf_fun(z),
true = pnorm(z))
gaussian_data_lines <- gaussian_data %>%
gather("line", "value",
ecdf, true, -z) %>%
mutate(line = forcats::fct_recode(line,
Empirical = "ecdf",
Actual = "true"))
ggplot(gaussian_data_lines, aes(x = z, y = value, color = line)) +
geom_line() +
labs(y = "CDF",
title = "ECDF",
color = "")
```

To compute histogram —a basic estimator of the density based on binning—use `hist()`

```
hist_obj <- hist(z, breaks = 30, plot = FALSE) # base R function
class(hist_obj) # It's a list
```

`## [1] "histogram"`

`hist_obj$breaks # These are the break points that were used`

```
## [1] -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6
## [15] -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
## [29] 2.4 2.6 2.8
```

`hist_obj$density # These are the estimated probabilities`

```
## [1] 0.025 0.025 0.000 0.025 0.000 0.075 0.100 0.050 0.175 0.150 0.150
## [12] 0.250 0.275 0.400 0.425 0.350 0.375 0.350 0.425 0.400 0.150 0.200
## [23] 0.175 0.100 0.125 0.075 0.025 0.050 0.025 0.050
```

```
# We can plot it
gaussian_data <- gaussian_data %>%
mutate(PDF = dnorm(z))
ggplot(gaussian_data, aes(x = z, y = ..density..)) +
geom_histogram(bins = 30,
fill = "lightblue",
col = "black") +
geom_line(aes(y = PDF)) +
labs(title = "True vs Empirical Distribution")
```

```
hist_data <- data.frame(breaks = hist_obj$breaks[1:(length(hist_obj$breaks) - 1)] + .1,
density = hist_obj$density)
ggplot(hist_data, aes(x = breaks, y = density)) +
geom_histogram(stat = "identity")
```

`## Warning: Ignoring unknown parameters: binwidth, bins, pad`

Interesting statistical facts:

- When you have lots of data, the empirical distribution function will always be close to the actual distribution function (Glivenko-Cantelli, “Fundamental Theorem of Statistics”)
- When you have lots of data, the histogram estimator is not always close to the density function, only when the underlying density is quite smooth
- In general, density estimation is tricky, there are many estimators beyond histograms worth studying/knowing about; e.g.,
`density()`

uses a kernel density estimator (gives a function) and you’ve already seen`geom_density`

*Pseudorandomness and seeds*

Not surprisingly, we get different draws each time we call `rnorm()`

`mean(rnorm(n))`

`## [1] -0.06128671`

`mean(rnorm(n))`

`## [1] 0.06798459`

`mean(rnorm(n))`

`## [1] 0.0256657`

`mean(rnorm(n))`

`## [1] -0.07122027`

Random numbers generated in R (in any language) are not “truly” random; they are what we call **pseudorandom**

- These are numbers generated by computer algorithms that very closely mimick “truly” random numbers
- The study of such algorithms is an interesting research area in its own right!
- The default algorithm in R (and in nearly all software languages) is called the “Mersenne Twister”
- Type
`?Random`

in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way)

All pseudorandom number generators depend on what is called a **seed** value

- This puts the random number generator in a well-defined “state”, so that the numbers it generates, from then on, will be reproducible
- The seed is just an integer, and can be set with
`set.seed()`

- The reason we set it: so that when someone else runs our simulation code, they can see the same—albeit, still random—results that we do

```
# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
```

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

`set.seed(0); rnorm(5)`

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

`set.seed(0); rnorm(5)`

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

```
# Different seeds, different numbers
set.seed(1); rnorm(5)
```

`## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078`

`set.seed(2); rnorm(5)`

`## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176`

`set.seed(3); rnorm(5)`

`## [1] -0.9619334 -0.2925257 0.2587882 -1.1521319 0.1957828`

```
# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
```

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

`set.seed(0); rnorm(3); rnorm(2); rnorm(1)`

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

`set.seed(0); rnorm(3); rnorm(2); rnorm(1)`

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

*Iteration and simulation*

- Let’s start with a motivating example: suppose we had a model for the way a drug affected certain patients
- All patients will undergo chemotherapy. We believe those who aren’t given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1) \]
- And those who were given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2) \] (Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution)

What would you do if you had such a model, and your scientist collaborators asked you: *how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?*

- Answer used to be: get out your pen and paper, make some approximations
- Answer is now: simulate from the model, no approximations required!

```
# Simulate, supposing 60 subjects in each group
set.seed(0)
n <- 60
mu_drug <- 2
mu_nodrug <- runif(n, min = 0, max = 1)
x_drug <- 100 * rexp(n, rate = 1/mu_drug)
x_nodrug <- 100 * rexp(n, rate = 1/mu_nodrug)
```

```
sim_data <- data.frame(x = c(x_drug, x_nodrug),
class = factor(rep(c("Drug","No Drug"), each = 60)))
sim_data %>%
ggplot(aes(x = x, y = ..density.., color = class)) +
geom_histogram(aes(fill = class),
alpha = .5, color = "black",
position = "identity") +
geom_density(aes(color = class))
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`