Name:
Andrew ID:
Collaborated with:
On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas, by Monday 10pm, next week.
In the spirit of the Women’s World Cup (and just summer sports in general), let’s simulate some tournaments. One model of competition is the Bradley-Terry model. Each team is assigned an ability score (team \(i\) has score \(z_{i}\), team \(j\) has score \(z_{j}\), etc.). Then, for each game, the probability that team \(i\) defeats team \(j\) is \(\frac{z_{i}}{z_{i} + z_{j}}\).
Write a function bt_compete(z1, z2) which takes as its arguments the scores z1 and z2 of two teams. Your function then simulates a game between the two teams (with probabilities of victory as described above) and returns 1 if the first team won and 2 if the second team won. Hint: look at the documentation for the prob argument of sample(). To test your function, run it 10,000 times for several values of z1 and z2 and check that the proportion of times each team wins is roughly equal to the corresponding probability (as per the Law of Large Numbers).
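A minimal sketch of one possible implementation (not necessarily the intended solution), using the prob argument of sample(), together with the Law of Large Numbers check:

```r
set.seed(42)

# One possible bt_compete(): sample 1 or 2 with Bradley-Terry probabilities.
bt_compete <- function(z1, z2) {
  sample(c(1, 2), size = 1, prob = c(z1, z2))  # sample() normalizes prob for us
}

# LLN check: team 1 should win about z1 / (z1 + z2) of the time.
wins <- replicate(10000, bt_compete(3, 1))
mean(wins == 1)  # should be close to 3 / (3 + 1) = 0.75
```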
Next, write a function bt_tournament(scores, names), where scores is a vector of 2^m scores and names is the vector of team names. We start by simulating a round of games: team 2k-1 plays team 2k, with the loser being eliminated. Thus, if we start with 2n teams, after a single round we have n teams left. We then take all of the winners (in the same order as they started) and continue this process until only a single team remains. Our function returns the name of the winning team. Fill in the following function skeleton to finish this function:
#' Simulates a single-elimination tournament
#'
#' @param scores a vector of Bradley-Terry scores. Length must be a power of
#' two.
#' @param names a vector of identifiers for teams. Length must be the same as
#' scores
#'
#' @return The name of the victorious team.
bt_tournament <- function(scores, names = seq_along(scores)) { # look at this structure
  n_left <- length(scores)
  # Make sure the length is a power of two
  power_left <- log(n_left) / log(2)
  if (as.integer(power_left) != power_left) {
    stop("scores need to be of length 2^k, for some integer k")
  }
  if (length(scores) != length(names)) {
    stop("scores and names need to be vectors of the same length")
  }
  while (n_left != 1) { # iterate until we only have one team remaining
    for (round_idx in seq_len(n_left / 2)) {
      winner <- bt_compete(scores[2 * round_idx - 1], scores[2 * round_idx]) # returns 1 or 2
      # Move winner to round_idx-th place for next round; we can reuse the
      # array since we don't access element round_idx until the next round.
      ## winner_index <- ??
      scores[round_idx] <- scores[winner_index]
      names[round_idx] <- names[winner_index] # keep names in sync with scores
    }
    n_left <- n_left / 2
  }
  ## return(names[??])
}
Now create 32 teams with scores initialized as Gamma(5, 10) random variables, and then run 10,000 tournaments (with the scores held constant). Report how many times the best team won. Plot a scatterplot of score vs. the number of times each team wins the tournament.
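A minimal sketch of generating the 32 scores (here Gamma(5, 10) is taken to mean shape 5 and rate 10, the usual reading of rgamma()'s first two parameters):

```r
set.seed(0)
scores <- rgamma(32, shape = 5, rate = 10)  # 32 team abilities; mean is 5/10 = 0.5
summary(scores)
best_team <- which.max(scores)  # the team we would expect to win most often
```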
The following function will provide the usual seeding for single-elimination tournament (the NCAA is a bit different).
tournament_order <- function(scores) {
n <- length(scores)
power <- log(n)/log(2)
if (as.integer(power) != power) {
stop("scores need to be of length 2^k, for some integer k")
}
seeding <- c(1, 2)
while(length(seeding) != n) {
new_seeding <- rep(NA, length(seeding) * 2)
new_seeding[c(TRUE, FALSE)] <- seeding
new_seeding[c(FALSE, TRUE)] <- length(new_seeding) + 1 - seeding
seeding <- new_seeding
}
return(order(scores, decreasing = TRUE)[seeding])
}
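To make the seeding concrete, here is tournament_order() applied to eight teams whose scores are already sorted in decreasing order (the function definition is repeated so the chunk is self-contained); in that case the output is the seeding pattern itself:

```r
tournament_order <- function(scores) {
  n <- length(scores)
  power <- log(n) / log(2)
  if (as.integer(power) != power) {
    stop("scores need to be of length 2^k, for some integer k")
  }
  seeding <- c(1, 2)
  while (length(seeding) != n) {
    new_seeding <- rep(NA, length(seeding) * 2)
    new_seeding[c(TRUE, FALSE)] <- seeding
    new_seeding[c(FALSE, TRUE)] <- length(new_seeding) + 1 - seeding
    seeding <- new_seeding
  }
  return(order(scores, decreasing = TRUE)[seeding])
}

tournament_order(8:1)
# 1 8 4 5 2 7 3 6: seed 1 plays seed 8, seed 4 plays seed 5, and so on,
# so the two strongest teams can only meet in the final round.
```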
We’ll then compare against a “worst” case seeding determined by the following function.
worst_order <- function(scores) {
return(order(scores, decreasing = TRUE))
}
First, explain how these functions work and why they’re named as they are (i.e., document them with Roxygen). Next, using the same scores as before, run 10,000 tournaments with each of these two seedings. Record the number of times the best team wins under each seeding. Is there a difference between the two? Should we conclude anything about whether seeding matters?
The questions in this section build off of Lab 3.2. We will continue our investigation of an important statistical topic called “stopping rules”. These questions will have you write and use functions to get a grasp of what this topic is about; students who are curious for more details can read about it by Googling “stopping rules clinical trials”.
To reiterate, the idea in this field is: it’s sometimes expensive to collect more data, so if you have determined that the effect of a drug is already substantial, you want to stop the clinical trial early (and save some money!). However, the exact procedure used to determine when to stop collecting data can dramatically impact the conclusions you reach. Luckily, it’s possible to study the impact of different stopping rules in the safety of a simulation, which is what we will do.
Recall what we ended with in Lab 3.2. We have quite a few functions: simulate_data(), which generates the data; stopping_rule_ttest(), which is a stopping rule using a t-test; simulate_difference(), which generates data and computes the difference according to a stopping rule; and rep_sim(), which replicates simulate_difference() over many repetitions. We will make our setup much more general (by changing the inputs), but this generality will make our functions increasingly more complicated.
2a. (2 points) We will be modifying simulate_data() so we can pass a function as an input to change how the data are simulated. Redefine simulate_data() to take as input: n, the number of subjects who take the drug or do not take the drug, with default value 200; drug_distr(), the function that generates random numbers for subjects who take the drug; and no_drug_distr(), the function that generates random numbers for subjects who do not take the drug. Set the default for drug_distr() to be a function that takes n as input and simulates data from the following distribution, \[
X_{\mathrm{drug}} \sim \mathrm{Normal}(\mathrm{mean}=150, \mathrm{sd} = 10).
\] Similarly, set the default for no_drug_distr() to be a function that takes n as input and simulates from the following distribution, \[
X_{\mathrm{no\,drug}} \sim \mathrm{Normal}(\mathrm{mean}=100, \mathrm{sd} = 10).
\] As in Lab 3.2, simulate_data() should return a list with two vectors called no_drug and drug. Each of these two vectors should have length n. Using these default values, store the result of simulate_data() in a variable called data_1. Compute the mean and standard deviation of data_1$drug and data_1$no_drug to ensure you properly simulated the data.
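As a sanity check, here is a minimal sketch of one possible implementation matching the defaults described above (not necessarily the lab’s own solution):

```r
set.seed(1)
# Data generator taking the sample size and two distribution functions as input.
simulate_data <- function(n = 200,
                          drug_distr = function(n) rnorm(n, mean = 150, sd = 10),
                          no_drug_distr = function(n) rnorm(n, mean = 100, sd = 10)) {
  list(drug = drug_distr(n), no_drug = no_drug_distr(n))
}

data_1 <- simulate_data()
c(mean(data_1$drug), sd(data_1$drug))        # roughly 150 and 10
c(mean(data_1$no_drug), sd(data_1$no_drug))  # roughly 100 and 10
```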
2b. (2 points) Now we want to design our stopping rules. First, copy-and-paste the stopping rule stopping_rule_ttest() from Lab 4.1 (once again, it takes two vectors drug_vector and no_drug_vector and a numeric alpha with default value 0.05 as input, and outputs TRUE or FALSE). Then, write a new stopping rule called stopping_rule_adaptive(). Like before, it uses t.test(), but here’s the difference: the threshold for the p-value will vary with the length of drug_vector and no_drug_vector. Let m denote the length of drug_vector (assumed to be the same as the length of no_drug_vector). stopping_rule_adaptive() should output TRUE if the p-value is less than
\[ \frac{\exp(m/25)}{60000}, \]
and return FALSE otherwise. After writing stopping_rule_ttest() and stopping_rule_adaptive(), use both stopping rules on data_1 (where drug_vector is data_1$drug, and no_drug_vector is data_1$no_drug) to demonstrate that your functions work.
2c. Although we have generalized simulate_data() to allow for any distributions drug_distr() and no_drug_distr(), we have not yet changed simulate_difference() or rep_sim() to accommodate this. In this question, you’ll redefine simulate_difference() accordingly. We also want to return the number of subjects we effectively used in our simulation. Specifically, simulate_difference() should take in 4 arguments: n, the number of subjects (same as before); drug_distr() and no_drug_distr() (same as in Question 2a); and stopping_rule() (same as in Lab 4.2 Question 3c). This function should generate data using simulate_data() according to drug_distr() and no_drug_distr(), and then (just like in Lab 4.1 Question 3c) apply stopping_rule() starting from 5 cases and controls. If stopping_rule() returns TRUE, stop considering more subjects. Otherwise, apply stopping_rule() to 5 additional cases and controls, until all n cases and controls have been considered. Use the same default values for each of the 4 arguments as before, where the default stopping rule is stopping_rule_ttest(). simulate_difference() returns a list (note: the output has changed from the Lab) with two elements: result, a string that is either “Positive”, “Negative” or “Inconclusive”, and samples, a numeric that says how many cases were effectively considered (i.e., a number between 5 and n). Print out the result of simulate_difference() using the default values to demonstrate that your function works.
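The “check every 5 subjects” loop can be sketched as follows. This is a self-contained illustration: simulate_data() and stopping_rule_ttest() below are simplified stand-ins, and the generator-function arguments are omitted for brevity.

```r
set.seed(1)
# Stand-in data generator and stopping rule (assumptions for illustration only).
simulate_data <- function(n = 200) {
  list(drug = rnorm(n, 150, 10), no_drug = rnorm(n, 100, 10))
}
stopping_rule_ttest <- function(drug_vector, no_drug_vector, alpha = 0.05) {
  t.test(drug_vector, no_drug_vector)$p.value < alpha
}

simulate_difference <- function(n = 200) {
  data <- simulate_data(n)
  for (m in seq(5, n, by = 5)) {  # consider 5 more cases and controls each step
    if (stopping_rule_ttest(data$drug[1:m], data$no_drug[1:m])) {
      diff <- mean(data$drug[1:m]) - mean(data$no_drug[1:m])
      return(list(result = if (diff > 0) "Positive" else "Negative",
                  samples = m))
    }
  }
  list(result = "Inconclusive", samples = n)
}

res <- simulate_difference()
res  # with such a large true effect, the rule stops almost immediately
```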
2d. (3 points) We need to modify rep_sim() as well before we can run our experiments. Modify rep_sim() (originally defined in Lab 4.1) to accommodate different stopping rules and distributions. Your new rep_sim() function will take in the following 6 inputs: nreps, n, drug_distr(), no_drug_distr(), stopping_rule() and seed. Use the default values that we’ve been using thus far. Internally, this function produces a vector of length nreps containing “Positive”, “Negative” and “Inconclusive” strings. rep_sim() should appropriately use all 6 inputs (you will have to use your judgement to assess how to do this; it is not intended to be hard, but it might take some thinking to understand how the 6 inputs are used). rep_sim() returns a list (note: the output has changed from the Lab) with two elements: table (a vector that counts how many instances of “Positive”, “Negative” and “Inconclusive” results there were among the nreps repetitions), and mean_samples (the mean number of cases considered when the result was NOT “Inconclusive”). Print out the result of rep_sim() using the default values to demonstrate that your function works.
2e. (2 points) Now we are finally ready to use our generalized simulation setup. We will run rep_sim() four times, each time with slightly different arguments. In the first run, use all default arguments. In the second run, set stopping_rule() to stopping_rule_adaptive(). In the third run, set drug_distr() to be exactly the same as no_drug_distr() (i.e., both cases and controls are drawn from the same distribution, meaning there is no difference). In the last run, set stopping_rule() to stopping_rule_adaptive() and drug_distr() to be exactly the same as no_drug_distr(). Each of these four runs should require exactly one line of code, where you change only the inputs to rep_sim(). Print all four results.
Challenge: This question is more about statistics than coding. Provide a short (at most 5 sentences) explanation interpreting your results in Question 2e. In particular, think about the following things: 1) Our goal in this entire section was to design stopping rules so that, instead of collecting all n = 200 cases and controls, we only needed to collect a smaller number of subjects. Was this achieved in the simulation? 2) In some of the simulations in Question 2e, drug_distr and no_drug_distr were exactly the same. What do you hope simulate_difference() would have returned in this case, and what actually happened based on your simulation results? 3) How exactly are stopping_rule_ttest() and stopping_rule_adaptive() different? What does this difference mean statistically?
Using read.csv(), read in the wage data (linked below) and save the resulting data frame as wage_df. Check that wage_df has the right dimensions, and display its first 3 rows. Hint: the first several lines of the linked file just explain the nature of the data; open up the file (either directly in your web browser or after you download it to your computer), count how many lines must be skipped before getting to the data, and then use an appropriate setting for the skip argument to read.csv().
wage_df <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/wage.csv",
                    skip = 16, stringsAsFactors = TRUE)
dim(wage_df)
## [1] 3000 11
head(wage_df, 3)
## year age sex maritl race education
## 231655 2006 18 1. Male 1. Never Married 1. White 1. < HS Grad
## 86582 2004 24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003 45 1. Male 2. Married 1. White 3. Some College
## region jobclass health health_ins
## 231655 2. Middle Atlantic 1. Industrial 1. <=Good 2. No
## 86582 2. Middle Atlantic 2. Information 2. >=Very Good 2. No
## 161300 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes
## wage
## 231655 75.04315
## 86582 70.47602
## 161300 130.98218
For each of the factor variables in wage_df, set up a plotting grid of appropriate dimensions, and then plot each of these factor variables, with appropriate titles (don’t use geom_histogram(); these are discrete variables). What do you notice about the distributions? (You may want to add + theme(axis.text.x = element_text(angle = 15, size = 5)).)
4a. (2 points) Fit a linear regression model, using lm(), with response variable wage and predictor variables year and age, using the wage_df data frame. Call the result wage_lm. Display the coefficient estimates, using coef(), for year and age. Do they have the signs you would expect, i.e., can you explain their signs? Display a summary of this linear model using summary(). Report the standard errors and p-values associated with the coefficient estimates for year and age. Do both of these predictors appear to be significant, based on their p-values?
4b. (2 points) Save the standard errors of year and age into a vector called wage_se, and print it out to the console. Don’t just type in the values you see from summary(); you need to determine these values programmatically. Hint: define wage_sum to be the result of calling summary() on wage_lm; then figure out what kind of R object wage_sum is, and how you can extract the standard errors.
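The key fact is that summary() on an lm object returns a list whose coefficients element is a matrix, with the standard errors in its "Std. Error" column. A sketch on a small synthetic data set (wage_df itself is not rebuilt here; the toy columns and coefficients are made up):

```r
set.seed(1)
toy_df <- data.frame(year = sample(2003:2009, 100, replace = TRUE),
                     age  = sample(18:80, 100, replace = TRUE))
toy_df$wage <- 50 + 1.2 * (toy_df$year - 2003) + 0.3 * toy_df$age + rnorm(100, sd = 5)

toy_lm  <- lm(wage ~ year + age, data = toy_df)
toy_sum <- summary(toy_lm)  # a list; inspect with class(toy_sum) and names(toy_sum)
toy_sum$coefficients        # matrix: Estimate, Std. Error, t value, Pr(>|t|)
wage_se <- toy_sum$coefficients[c("year", "age"), "Std. Error"]
wage_se
```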
4c. (3 points) Plot diagnostics of the linear model fit in the previous question by calling autoplot() (from library(ggfortify)) on wage_lm. Look at the “Residuals vs Fitted”, “Scale-Location”, and “Residuals vs Leverage” plots: are there any groups of points away from the main bulk of points along the x-axis? Look at the “Normal Q-Q” plot: do the standardized residuals lie along the line \(y = x\)? Note: don’t worry too much if you’re generally unsure how to interpret these diagnostic plots; you’ll learn a lot more in your Modern Regression 36-401 course; for now, you can just answer the questions we asked. Challenge: what is causing the discrepancies you are (or should be) seeing in these plots? Hint: look back at the histogram of the wage column you plotted above.
4d. (3 points) Use your fitted linear model wage_lm to predict: (a) what a 30 year old person should be making this year; (b) what Bill Gates should be making this year; (c) what you should be making 5 years from now. Comment on the results: which do you think is the most accurate prediction?
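Predictions like these go through predict() with a newdata data frame whose columns match the predictor names. A sketch, again on synthetic data (the year and age values are just illustrative):

```r
set.seed(2)
toy_df <- data.frame(year = rep(2003:2009, length.out = 105),
                     age  = rep(18:80, length.out = 105))
toy_df$wage <- 50 + 1.2 * (toy_df$year - 2003) + 0.3 * toy_df$age + rnorm(105, sd = 5)
toy_lm <- lm(wage ~ year + age, data = toy_df)

# newdata must have one column per predictor, with the same names as the formula.
pred <- predict(toy_lm, newdata = data.frame(year = 2019, age = 30))
pred
```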
5a. (3 points) Fit a logistic regression model, using glm() with family = "binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year and age. Call the result wage_glm. Note: you can set this up in two different ways: (i) you can manually define a new column (say) wage_high in the wage_df data frame to be the indicator that the wage column is larger than 250; or (ii) you can define the indicator variable “on-the-fly” in the call to glm() with an appropriate usage of I(). Display a summary, reporting the coefficient estimates for year and age, their standard errors, and the associated p-values. Are the predictors year and age both significant?
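Option (ii) can be sketched as follows, on a synthetic data set (the toy columns and coefficients are made up; only the I() idiom is the point):

```r
set.seed(3)
toy_df <- data.frame(year = rep(2003:2009, length.out = 200),
                     age  = rep(18:80, length.out = 200))
toy_df$wage <- 150 + 2 * toy_df$age + rnorm(200, sd = 60)

# I() builds the 0/1 response on the fly inside the formula.
toy_glm <- glm(I(wage > 250) ~ year + age, data = toy_df, family = "binomial")
summary(toy_glm)$coefficients
```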
5b. (2 points) Refit a logistic regression model with the same response variable as in the last question, but now with predictors year, age, and education. Note that the third predictor is stored as a factor variable, which we call a categorical variable (rather than a continuous variable, like the first two predictors) in the context of regression modeling. Display a summary. What do you notice about the predictor education: how many coefficients are associated with it in the end? Challenge: can you explain why the number of coefficients associated with education makes sense?
5c. (3 points) In general, one must be careful fitting a logistic regression model on categorical predictors. In order for logistic regression to make sense, for each level of the categorical predictor we should have observations at that level for which the response is 0, and observations at that level for which the response is 1. In the context of our problem, this means that for each level of the education variable, we should have people at that education level with a wage less than or equal to 250, and also people at that education level with a wage above 250. Which levels of education fail to meet this criterion? Let’s call these levels “incomplete”, and the other levels “complete”. (Try using the tidyverse, and make the resulting data frame as readable as possible.)
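The question asks for a tidyverse solution; as a hedge, here is the same criterion checked with base R’s table(), on a toy factor whose levels are made up for illustration (not the real education data):

```r
set.seed(4)
toy_df <- data.frame(
  education = factor(sample(c("HS", "College", "Advanced"), 300, replace = TRUE)),
  wage = runif(300, 20, 200)
)
# Give only the "Advanced" level wages on both sides of 250.
adv <- toy_df$education == "Advanced"
toy_df$wage[adv] <- runif(sum(adv), 200, 320)

# Cross-tabulate each level against the indicator wage > 250.
tab <- table(toy_df$education, toy_df$wage > 250)
tab
# "Complete" levels have nonzero counts in BOTH columns.
complete <- rownames(tab)[tab[, "FALSE"] > 0 & tab[, "TRUE"] > 0]
complete
```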
5d. (3 points) Refit the logistic regression model as in Q5b, with the same response and predictors, but now throwing out all data in wage_df that corresponds to the incomplete education levels (equivalently, using only the data from the complete education levels). Display a summary, and comment on the differences from the summary of the logistic regression model fitted in Q5b. Did any predictors become more significant, according to their p-values?
In this problem you will be extending the class we saw in lecture on Friday. To get access to the code we saw in class, run the following line of code once in your R console:
devtools::install_github("benjaminleroy/coin")
After installing the package from GitHub, load the library using library(coin), examine the documentation of coin, and create a single my_coin <- coin() object. What does it print out? Run unclass(my_coin): what do you see? Run flip(my_coin, times = 5) and then try flip(c(0,1), times = 5) (make sure to put error = TRUE in the {r} chunk options). Type ?flip.coin in your console: what does it say in the “Usage” area (i.e., is it a regular function)?
In this homework problem you will be completing the implementation of coin’s functionality by creating a new object that could help understand basic coin flipping. Specifically, in class we talked about how we’d be interested in knowing information about the number of flips.
6b. (5 points) Create a new object toss (using a function) that takes in a coin object and a times parameter (default is 1). Use the flip() function from the coin package to generate a sequence of flips (let’s say we call the vector of these flips flips). Then define a class object that looks like the following:
list(coin   = coin,  # the coin instance, not the function / object class coin
     tosses = flips,
     total  = length(flips),
     heads  = sum(flips == coin[1]))
Document your class function. Then create a biased coin (prob: .25, .75) named my_coin, create a toss object (my_toss) associated with it, and flip the coin 10 times (show what my_toss looks like).
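A hedged sketch of such a constructor. Since the coin package’s internals aren’t reproduced here, the flip() below is a stand-in (a plain sample() call) and the coin is just a character vector of sides; only the structure of the returned object is the point:

```r
set.seed(5)
# Stand-in for coin::flip(): sample sides of the coin with given probabilities.
flip <- function(coin, times = 1, prob = c(0.5, 0.5)) {
  sample(coin, size = times, replace = TRUE, prob = prob)
}

toss <- function(coin, times = 1) {
  flips <- flip(coin, times)
  structure(
    list(coin   = coin,                      # the coin itself
         tosses = flips,                     # the sequence of outcomes
         total  = length(flips),             # number of tosses
         heads  = sum(flips == coin[1])),    # count of first-side outcomes
    class = "toss"
  )
}

my_coin <- c("heads", "tails")
my_toss <- toss(my_coin, times = 10)
my_toss$total  # 10
```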
6c. (2 points) Write a helper function, valid_times(), to validate the times input parameter of the toss function. Make sure it errors (use some if statements with stop()s) if times is invalid, and returns TRUE otherwise. Specifically, make sure it correctly flags times = 5.5, times = "abc", and times = -5 as invalid (run it with these parameters, using {r error = TRUE}). Update the class function to include this validation.
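One possible validator (the exact error messages are up to you):

```r
valid_times <- function(times) {
  if (!is.numeric(times))  stop("times must be numeric")
  if (length(times) != 1)  stop("times must be a single value")
  if (times < 0)           stop("times must be non-negative")
  if (times %% 1 != 0)     stop("times must be a whole number")
  TRUE
}

valid_times(5)  # TRUE
# Each of the following stops with an informative error:
tryCatch(valid_times(5.5),   error = function(e) conditionMessage(e))
tryCatch(valid_times("abc"), error = function(e) conditionMessage(e))
tryCatch(valid_times(-5),    error = function(e) conditionMessage(e))
```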
6d. (3 points) Slide 14 of the lecture on OOP (the only slide with a figure) shows a figure that visualizes the cumulative number of heads over the course of a simulation. Create a method for the toss object to create such a figure. Note: you do not need to initialize the plot generic; just define plot.toss (please create the plot in ggplot). Apply this function to my_toss2 (the same as my_toss but with 1000 tosses) using plot.toss’s generic name.
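The data preparation behind such a method can be sketched as below (assuming the toss object stores its outcomes as "heads"/"tails" strings; the ggplot call is guarded so the chunk also runs where ggplot2 is not installed):

```r
set.seed(6)
flips <- sample(c("heads", "tails"), 1000, replace = TRUE)

# The data behind the figure: cumulative number of heads after each toss.
cum_df <- data.frame(toss      = seq_along(flips),
                     cum_heads = cumsum(flips == "heads"))

# A plot.toss method would wrap this data frame in a ggplot line chart.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(cum_df, aes(x = toss, y = cum_heads)) + geom_line()
}
```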
Challenge: Beyond the standard generics we’ve already seen (plot, summary, predict, etc.), we can also define functions like indexing and adding. Define the addition function for a toss object as "+.toss" <- function(toss, times), such that we toss the coin an additional times times (update all attributes that should be updated, and make sure the function returns this new object). Use the valid_times() function you wrote above to make sure the times parameter meets expectations. Try defining a new object my_toss3 <- my_toss2 + 1000, and plot this object.