Name:

Andrew ID:

Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit **your own** lab as an knitted HTML file on Canvas, by Friday 10pm, this week.

**This week’s agenda**: practice writing functions, creating simulations, using ``replicate()’’

We are going to continue the drug effect model that was discussed in the “Simulation” lecture. That is, we will simulate the effects of using a drug and not using a drug to see hypothetically. This will allow us to investigate how different parameters of our model affect the number of subjects needed to observe a significant difference without calculating complicated math.

Suppose that there is a new drug that can be optionally given before chemotherapy. We follow the setup given in the “Simulation” lecture. We believe those who aren’t given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1), \] whereas those who were given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2). \] Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution. Now consider the following scenario. In the following questions, we will set up a way to simulate this model. Throughout this question, we call all subjects who are given the drug experience as “cases” and those who are not given the drug experience as “controls”.

**1a.**The first function we write will generate the simulation data. Write a function called`simulate_data()`

that produces measurements in the drug and no drug groups. Your function should take two inputs:`n`

, the sample size (i.e., number of cases and controls), with a default value of 200; and`mu_drug`

, the mean for the exponential distribution that defines the drug tumor reduction measurements, with a default value of 2. Your function should return a list with two vectors called`no_drug`

and`drug`

. Each of these two vectors should have length`n`

, containing the percentage reduction in tumor size under the appropriate condition (not taking the drug or taking the drug). (Hint: This function should use`rexp`

appropriately. You can use code snippets from the slides to help you out. You’ll need to recall some properties of the Exponential distribution to make sense of the code snippets in the slides.)**1b.**We will now use`simulate_data()`

for different parameter settings to see if we are properly generating data. Run`simulate_data()`

without any arguments (hence, relying on the default values of`n`

and`mu_drug`

), and store the results in`data_1`

. Create a visualization of the distribution of the percent change in cancer size between the 2 groups (those given and those not given the drug).**(Hint: since ggplot needs to take in a data frame, try doing as.data.frame to the output of**Try to make a good visualization given that later we’ll be interesting in comparing the means of the change in size between groups. Additional for both the`simulate_data`

and then using a`tidyr`

function to convert the data frame into something useable for ggplot.)`data_1$no_drug`

and`data_1$drug`

compute the mean and standard deviation of both vectors. Now, run`simulate_data()`

again, and store its value in`data_2`

. Again, visualize the difference in both the`data_2$no_drug`

and`data_2$drug`

vectors, and the mean and standard deviation once again. We have effectively simulated two hypothetical datasets. Hence, the values in each of these 4 vectors should all be different.**1c.**In the next section to come, we will be generating many hypothetical datasets to see how many subjects we need to observe a difference between taking the drug and not taking the drug. To do this, we will write a function called`simulate_difference()`

, which takes in the same two arguments as`simulate_data()`

, which are`n`

and`mu_drug`

, both of which use the same default parameters as`simulate_data()`

. This function will generate a new dataset using`simulate_data()`

using the appropriate inputs. If the mean value of`drug`

is larger than the mean value of`no_drug`

by 100, return “Positive” (the string, to denote the drug is useful). If the mean value of`drug`

is less than the mean value of`no_drug`

by 100, return “Negative”. Otherwise, if neither of the above cases hold, return “Inconclusive”.

Run this function four times. The first two times, run this function twice with no arguments (hence, using the default parameters) to see that your function is returning different numbers. The third time, run your function with with `n = 10`

and `mu_drug = 0.5`

. Print out all 3 return values.

With your simulation set up, we can now investigate how the parameters of our simulation (namely, `n`

and `mu_drug`

) affect the outcomes. While the relationship between `n`

, `mu_drug`

and the outcome of `simulate_difference()`

are not too hard to mathematically derive in this particular lab, you can imagine much more complicated models where it’s easier to simulate the model instead of mathematically deriving the answer.

The next few questions will work with this hypothetical: suppose we work for a drug company that wants to put this new drug out on the market. In order to get FDA approval, your company must demonstrate that the drug is positive. Suppose this is defined as the patients who had the drug had **on average** a reduction in tumor size **at least 100 percent greater than** those who didn’t receive the drug, or in math:

\[ \overline{X}_{\mathrm{drug}} - \overline{X}_{\mathrm{no\,drug}} \geq 100. \]

This is what we coded in Question 1f above. Your drug company wants to spend as little money as possible. They want the smallest number `n`

such that, if they were to run a clinical trial with `n`

cases and `n`

controls, they would likely succeed in demonstrating that the effect size (as above) is at least 100. Of course, the result of a clinical trial is random; your drug company is willing to take “likely” to mean **successful with probability 0.95**, i.e., successful in 190 of 200 hypothetical clinical trials (though only 1 will be run in reality).

**2a.**Following the code sketch provided in the “Simulation” lecture (Slide 24), write a function called`rep_sim`

. This function takes in 4 arguments,`nreps`

(the number of repetitions, with default value of 200),`n`

and`mu_drug`

(the values needed for`simulate_difference()`

, with the same default values) and`seed`

(with default value`NULL`

). This function should run`simulate_difference()`

`nreps`

number of replications, and then return the number of times your drug was considered effective, i.e., the number of replications that the output of`simulate_difference()`

exceeds 100.

Demonstrate your function works by using it on `mu_drug = 1.5`

. (Note: While you could use a for-loop (shown in the slides) or one of the *apply functions, for this question, you can also use the `replicate()`

function. Be sure to check the documentation for `replicate()`

if you are unfamiliar with it. Essentially, `replicate()`

takes in two arguments, the number of replications you want to perform and the function you are replicating.)

**2b.**We will now investiagate the effect of`n`

(the number of cases or controls), where`mu_drug = 2`

. For each value of the input`n`

in between 5 and 200 (inclusive), run your function`rep_sim`

. You can do this using a for-loop or one of the *apply functions, and store the number of success in a vector. So to be clear, for each value of`n`

in between 5 and 200, you should have a corresponding number of positives. Plot the number of positives versus the sample size, and label the axes appropriately. Based on your simulation, what is the smallest number of`n`

for which the number of positives is 190 or more?

In these questions, we will be modifying the simulation setup and see how it changes the results we observe. Recall, in practice, we (the drug company) wants to collect as few subjects as needed to save money. We will investigate a “stopping rule”. That is, we include more subjects into our study slowly. If our stopping rule decides that we’ve included enough subjects that we can decide a “Positive” or “Negative”, we stop collecting subjects. If we are “Inconclusive”, we collect more subjects. And then we stop when we reach a maximum number of subjects. In these last few questions, we will not provide you with the step-by-step details of how to explicitly change the setup.

**3a.**Suppose we start with`n = 5`

subjects (as before, this means 5 cases and 5 controls). We compute the difference in means between using the drug and not using the drug (just like before). If this difference is equal to or larger than 100, we declare that the the drug is “Positive” and we stop collecting data. If this difference is equal to or less than -100, we declare the drug is “Negative”. Here is the change: If the result was not “Positive” or “Negative”, we collect 5 new cases and 5 new controls and try again. We keep incrementing by 5 new cases and 5 new controls until we have a total of`n`

cases and`n`

controls. If we*still*do not conclude “Positive” or “Negative”, then we declare “Inconclusive”.

Change the function `simulate_difference()`

to accommodate this new scheme, keeping the default values. Then, similar to Question 2a, run this simulation using `rep_sim()`

with `nreps = 200`

and `mu_drug = 1.5`

, and print out how many positives there were (this is possible since we’ve redefined `simulate_difference()`

). How does this number compare with the result in Question 2a? (Hint: Implementing this can be tricky. We suggest you to simulate the entire dataset with `n`

cases and `n`

controls first, using `simulate_data()`

. And then using a for-loop, sequentially see if the first 5 (or first 10, or first 15, etc.) cases and controls yield a “Positive” or “Negative”.)

**3b.**In these last two questions, we will allow`simulate_difference()`

to take an arbitrary stopping rule. These questions are very similar to Question 2d. We can design a function`stopping_rule_threshold()`

that takes as input two vectors as input,`drug_vector`

and`no_drug_vector`

and a numeric`threshold`

, and outputs either`TRUE`

or`FALSE`

. If the output is`TRUE`

, we stop collecting subjects. If`FALSE`

, we continue collecting subjects. For example, in Questions 2b to 2d, we have effectively been using a stopping rule like the following.

```
stopping_rule_threshold <- function(drug_vector, no_drug_vector,
threshold = 100){
mean(results$drug) - mean(results$no_drug) > threshold
}
```

You will design a different stopping rule. Write a function called `stopping_rule_ttest()`

that takes as input `drug_vector`

and `no_drug_vector`

(same as before) and a numeric `alpha`

with default value of `0.05`

. This function will use use a one-sided t-test to test if the p-value when testing the mean between `drug_vector`

and `no_drug_vector`

is less than `alpha`

. If it is, `stopping_rule_ttest()`

returns `TRUE`

. If not, `stopping_rule_ttest()`

returns `FALSE`

. (Hint: The function to perform a t-test is `t.test()`

. Be sure to set the `alternative`

argument appropriately, where our alternative is that the mean of `drug_vector`

is greater than that of `no_drug_vector`

. This function returns a list, in which the element called `p.value`

contains the p-value.)

**3c.**Lastly, change`simulate_difference()`

one last time. Now, it takes in as input`n`

,`mu_drug`

(same as before) as well as`stopping_rule`

, which will be an function that determines whether or not to collect more subjects. By default, set`stopping_rule = stopping_rule_ttest`

. You will need to make sure you understand how to pass a function into another function for this question. When you use`stopping_rule()`

and the result is`TRUE`

: your function should return “Positive” if the mean of the currently used cases is larger than the mean of the currently used controls. Otherwise, return “Negative”. However, if`stopping_rule()`

returns`FALSE`

, include more subjects in your study until you have`n`

cases and controls.

Similar to Question 2a, run this simulation using `rep_sim()`

with `nreps = 200`

and `mu_drug = 1.5`

, and print out how many positives there were. How does these numbers of positives compare with the result in Question 2a and 3a?

**3d:***Let’s pause for a moment.*The set-up of the output of our`simulate_data`

returning a list was very useful for the function`stopping_rule_ttest`

. At the same time, it probably was a little annoying to visualize the output using ggplot. Using the new definition of`simulate_data`

(`simulate_data_tidy`

) below (remove the`eval = F`

) write a function`simulate_difference`

as`simulate_difference_tidy`

that operates the exact same as the function defined in*3a*but uses`simulate_data_tidy`

(if possibly try using`slice`

after a`group_by`

). Which function was easier / cleaner to write? If you knew you were to be using the`stopping_rule_ttest`

would you use the`tidyverse`

version or the`list`

/ base`R`

version? What am I trying to teach you here (give me your best guess)?

```
simulate_data_tidy <- function(n = 200, mu_drug = 2){
output <- simulate_data(n, mu_drug)
tidy_output <- output %>% data.frame() %>%
rename(`no drug` = no_drug) %>%
gather(key = "class", value = "value") %>%
mutate(class = factor(class))
return(tidy_output)
}
```