Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Friday 10pm, this week.

This week’s agenda: practice writing functions, creating simulations, using `replicate()`

Setting up your simulation

We are going to continue with the drug effect model that was discussed in the “Simulation” lecture. That is, we will simulate hypothetical outcomes for subjects who receive the drug and subjects who do not. This will allow us to investigate how different parameters of our model affect the number of subjects needed to observe a significant difference, without doing any complicated math.

Suppose that there is a new drug that can be optionally given before chemotherapy. We follow the setup given in the “Simulation” lecture. We believe that those who are not given the drug experience a percentage reduction in tumor size of \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1), \] whereas those who are given the drug experience a percentage reduction in tumor size of \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2). \] Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution. In the following questions, we will set up a way to simulate this model. Throughout this question, we call subjects who are given the drug “cases” and those who are not given the drug “controls”.
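To make the model concrete, here is a minimal sketch of what the data-generating function could look like. The name simulate_data() and its output (components drug and no_drug) are taken from how the function is used later in this lab; drawing one value of R per control subject is an assumption about the lecture’s setup.

# A possible simulate_data(), assuming one R ~ Unif(0, 1) is drawn per control
# subject, and recalling that rexp() is parameterized by rate = 1 / mean.
simulate_data <- function(n = 200, mu_drug = 2){
  R <- runif(n, min = 0, max = 1)               # assumed: one mean per control
  no_drug <- 100 * rexp(n, rate = 1 / R)        # controls: Exp(mean = R), scaled by 100
  drug    <- 100 * rexp(n, rate = 1 / mu_drug)  # cases: Exp(mean = mu_drug), scaled by 100
  list(drug = drug, no_drug = no_drug)
}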

Run this function three times. The first two times, run it with no arguments (hence, using the default parameters) to check that your function returns different numbers. The third time, run your function with n = 10 and mu_drug = 0.5. Print out all 3 return values.
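The function being run here is not shown in this excerpt; as a hedge, here is one plausible sketch, assuming it is the simulate_difference() function used later in the lab and that it returns the difference in mean percentage reduction (cases minus controls) from one simulated dataset.

# Hypothetical sketch of simulate_difference(): draw one dataset and return
# the difference in mean percentage reduction (cases minus controls).
simulate_difference <- function(n = 200, mu_drug = 2){
  dat <- simulate_data(n = n, mu_drug = mu_drug)
  mean(dat$drug) - mean(dat$no_drug)
}

simulate_difference()                        # default parameters
simulate_difference()                        # run again; the value should differ
simulate_difference(n = 10, mu_drug = 0.5)   # third run: n = 10, mu_drug = 0.5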

Investigating your simulation

With your simulation set up, we can now investigate how the parameters of our simulation (namely, n and mu_drug) affect the outcomes. While the relationship between n, mu_drug, and the outcome of simulate_difference() is not too hard to derive mathematically in this particular lab, you can imagine much more complicated models where it is easier to simulate the model than to derive the answer mathematically.

The next few questions will work with this hypothetical: suppose we work for a drug company that wants to put this new drug on the market. In order to get FDA approval, your company must demonstrate that the drug has a positive effect. Suppose this is defined as follows: the patients who received the drug had, on average, a reduction in tumor size at least 100 percentage points greater than those who did not receive the drug, or in math:

\[ \overline{X}_{\mathrm{drug}} - \overline{X}_{\mathrm{no\,drug}} \geq 100. \]

This is what we coded in Question 1f above. Your drug company wants to spend as little money as possible. They want the smallest number n such that, if they were to run a clinical trial with n cases and n controls, they would likely succeed in demonstrating that the effect size (as above) is at least 100. Of course, the result of a clinical trial is random; your drug company is willing to take “likely” to mean successful with probability 0.95, i.e., successful in 190 of 200 hypothetical clinical trials (though only 1 will be run in reality).

Demonstrate that your function works by using it with mu_drug = 1.5. (Note: while you could use a for loop (shown in the slides) or one of the *apply() functions, for this question you can also use the replicate() function. Be sure to check the documentation for replicate() if you are unfamiliar with it. Essentially, replicate() takes two arguments: the number of replications you want to perform and the expression (e.g., a function call) you are replicating.)
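Here is one way rep_sim() might be written with replicate(). This is only a sketch under two assumptions: that simulate_difference() returns a numeric difference, and that a trial is “successful” when that difference is at least 100.

# A possible rep_sim(): replicate the simulation nreps times and count how
# many replications show an effect size of at least 100 (assumed criterion).
rep_sim <- function(nreps = 200, n = 200, mu_drug = 2){
  differences <- replicate(nreps, simulate_difference(n = n, mu_drug = mu_drug))
  sum(differences >= 100)
}

rep_sim(nreps = 200, mu_drug = 1.5)   # e.g., number of successes out of 200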

Changing your simulation setup

In these questions, we will modify the simulation setup and see how it changes the results we observe. Recall that, in practice, we (the drug company) want to collect as few subjects as needed, to save money. We will investigate a “stopping rule”. That is, we include subjects into our study gradually. If our stopping rule decides that we have included enough subjects to declare a “Positive” or “Negative”, we stop collecting subjects. If the result is “Inconclusive”, we collect more subjects, stopping once we reach a maximum number of subjects. In these last few questions, we will not provide you with step-by-step details of how to change the setup.

Change the function simulate_difference() to accommodate this new scheme, keeping the default values. Then, similarly to Question 2a, run this simulation using rep_sim() with nreps = 200 and mu_drug = 1.5, and print out how many positives there were (this is possible since we have redefined simulate_difference()). How does this number compare with the result in Question 2a? (Hint: implementing this can be tricky. We suggest you simulate the entire dataset with n cases and n controls first, using simulate_data(), and then use a for loop to sequentially check whether the first 5 (or first 10, or first 15, etc.) cases and controls yield a “Positive” or “Negative”. A sketch of one possible implementation is given after the stopping rule function below.)

stopping_rule_threshold <- function(drug_vector, no_drug_vector, 
                                    threshold = 100){
  # stop when the observed difference in mean reduction exceeds the threshold
  mean(drug_vector) - mean(no_drug_vector) > threshold
}
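The following is only a sketch of one way to modify simulate_difference() along the lines of the hint above; the step size of 5, the stopping_rule argument, and the “Positive”/“Negative” return labels are assumptions, not the official solution. Note that rep_sim() would then need to count how many replications return “Positive”, rather than comparing numeric differences.

# Sketch: simulate the full dataset once, then check the first 5, 10, 15, ...
# cases and controls; declare "Positive" as soon as the stopping rule fires,
# otherwise declare "Negative" after all n subjects have been used.
simulate_difference <- function(n = 200, mu_drug = 2,
                                stopping_rule = stopping_rule_threshold){
  dat <- simulate_data(n = n, mu_drug = mu_drug)
  for(i in seq(5, n, by = 5)){
    if(stopping_rule(dat$drug[1:i], dat$no_drug[1:i])){
      return("Positive")
    }
  }
  return("Negative")
}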

You will now design a different stopping rule. Write a function called stopping_rule_ttest() that takes as input drug_vector and no_drug_vector (same as before) and a numeric alpha with a default value of 0.05. This function uses a one-sided t-test comparing the means of drug_vector and no_drug_vector, and checks whether the resulting p-value is less than alpha. If it is, stopping_rule_ttest() returns TRUE; if not, it returns FALSE. (Hint: the function to perform a t-test is t.test(). Be sure to set the alternative argument appropriately, where our alternative is that the mean of drug_vector is greater than that of no_drug_vector. This function returns a list, in which the element called p.value contains the p-value.)
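A minimal sketch consistent with this description (the exact call details are up to you):

# One-sided two-sample t-test: the alternative is that the mean of drug_vector
# exceeds the mean of no_drug_vector; return TRUE if the p-value is below alpha.
stopping_rule_ttest <- function(drug_vector, no_drug_vector, alpha = 0.05){
  test_result <- t.test(drug_vector, no_drug_vector, alternative = "greater")
  test_result$p.value < alpha
}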

Similarly to Question 2a, run this simulation using rep_sim() with nreps = 200 and mu_drug = 1.5, and print out how many positives there were. How does this number of positives compare with the results in Questions 2a and 3a?
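One hedged example of what this call might look like, assuming rep_sim() passes a stopping_rule argument through to the redefined simulate_difference() (this pass-through is an assumption about your implementation, not part of the lab’s given code):

# Hypothetical: count positives under the t-test stopping rule.
rep_sim(nreps = 200, mu_drug = 1.5, stopping_rule = stopping_rule_ttest)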

# simulate_data_tidy(): reshape the output of simulate_data() into a "tidy"
# (long-format) data frame with one row per subject and columns class/value.
library(dplyr)   # for %>%, rename(), mutate() (if not already loaded)
library(tidyr)   # for gather()

simulate_data_tidy <- function(n = 200, mu_drug = 2){
  output <- simulate_data(n, mu_drug)
  tidy_output <- output %>% data.frame() %>%
    rename(`no drug` = no_drug) %>%               # nicer label for plotting
    gather(key = "class", value = "value") %>%    # wide -> long format
    mutate(class = factor(class))
  return(tidy_output)
}
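For a quick sanity check (assuming simulate_data() is defined as sketched earlier), you might inspect a few rows of the tidy output:

# Peek at the long-format data: one row per subject, columns class and value.
head(simulate_data_tidy(n = 5, mu_drug = 2))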