Name:
Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as an knitted HTML file on Canvas, by Monday 10pm, next week.

Bradley-Terry Model

In spirit of the Women’s world cup (and just summer sports in general), let’s simulate some tournaments. One model of competition is the Bradley-Terry model. Each team is assigned an ability score (team \(i\) has score \(z_{i}\), team \(j\) has score \(z_{j}\), etc). Then for each game the probability that team \(i\) defeats team \(j\) is \(\frac{z_{i}}{z_{i} + z_{j}}\).

To test your function run it 10,000 times for several values of z1 and z2 and check that the proportion of times each team wins is roughly equal to the probability (as per the Law of Large Numbers).

#' Simulates a single-elimination tournament
#'
#' @param scores a vector of Bradley-Terry scores. Length must be a power of 
#' two.
#' @param names a vector of identifiers for teams. Length must be the same as 
#' scores
#'
#' @return The name of the victorious team.
bt_tournament <- function(scores, names = seq_along(scores)) { # look at this structure
  n_left <- length(scores)
  # Make sure it has length equal to a power of two
  power_left <- log(n_left)/log(2)
  if (as.integer(power_left) != power_left) {
    stop("scores need to be of length length 2^k, for some integer k")
  }
  if (length(scores) != length(names)) {
    stop("scores and names need to be vectors of the same length") 
  }

  while (n_left != 1) { # iterate until we only have one team remaining
    for (round_idx in seq_len(n_left / 2)) {
      winner <- bt_compete(scores[2 * round_idx - 1], scores[2 * round_idx]) # returns 1 or 2

      # Move winner to round_idx-th place for next round; we can reuse the
      # array since we don't access element round_idx until the next round.
      ## winner_index <- ??
      scores[round_idx] <- scores[winner_index]
      names[round_idx] <- names[winner_index] # keep names in sync with scores
    }
    n_left <- n_left / 2
  }

  ## return(names[??])
}

Now create 32 teams with scores initialized as Gamma(5, 10) random variables and then run 10000 tournaments (with the scores held constant). Report how many times the best team won. Plot the scatterplot of score vs the number of times a team wins the tournament.

The following function will provide the usual seeding for single-elimination tournament (the NCAA is a bit different).

tournament_order <- function(scores) {
  n <- length(scores)
  power <- log(n)/log(2)

  if (as.integer(power) != power) {
    stop("scores need to be of length length 2^k, for some integer k")
  }
  
  seeding = c(1, 2)
  while(length(seeding) != n) {
    new_seeding <- rep(NA, length(seeding) * 2)
    new_seeding[c(TRUE, FALSE)] <- seeding
    new_seeding[c(FALSE, TRUE)] <- length(new_seeding) + 1 - seeding
    seeding <- new_seeding
  }
  return(order(scores, decreasing = TRUE)[seeding])
}

We’ll then compare against a “worst” case seeding determined by the following function.

worst_order <- function(scores) {
  return(order(scores, decreasing = TRUE))
}

First, explain how these functions work and why they’re named as they are (aka document them with Roxygen). Next, using the same scores as before run 10,000 tournaments with these two seedings. Record the number of times the /best/ team wins under each seeding. Is there a difference between the two? Should we conclude anything about whether seeding matters?

AB Testing

The questions in this section build off of Lab 3.2. We will continue our investigation of an important statistical topic called the “Stopping rules”. These questions will have you write and use functions to get a grasp of what this topic is about, and students who are curious for more details can read about it by Googling “stopping rules clinical trials”.

To reiterate, in this field, the idea is: it’s sometimes expensive to collect more data, so if you have determined that the effect of a drug is already substantial, you want to stop the clinical trial early (and save some money!). However, the exact procedure to determine when to stop collecting data can dramatically impact the conclusions you find. Luckily, it’s possible to study the impact of different stopping rules at the safety of a simulation, which is what we will do.

Recall what we ended with in Lab 3.2. We have quite a few functions: simulate_data() which generates the data, stopping_rule_ttest() which is a stopping rule using a t-test, simulate_difference() which generates data and computes the difference according to a stopping rule, and rep_sim() which replicates simulate_difference() over many replications. We will make our setup much more general (by changing the inputs) but this generality will make our functions increasingly more complicated.

\[ \frac{\exp(m/25)}{60000}, \]

and return FALSE otherwise. After writing stopping_rule_ttest() and stopping_rule_adaptive(), use both stopping rules on data_1 (where drug_vector is data_1$drug, and no_drug_vector is data_1$no_drug) to demonstrate your function works.

Specifically, simulate_difference() should take in 4 arguments, n as the number of subjects (same as before), drug_distr() and no_drug_distr() (same as in Question 3a) and stopping_rule() (same as in Lab 4.2 Question 3c). This function should generate data using simulate_data() according to drug_distr() and no_drug_distr(), and then (just like in Lab 4.1 Question 3c), apply stopping_rule() starting from 5 cases and controls. If stopping_rule() returns TRUE, stop considering more subjects. Otherwise, apply stopping_rule() to 5 additional cases and controls, until all n cases and controls have been considered. Use the same default values for each of the 4 arguments as before, where the default stopping rule is stopping_rule_ttest(). simulate_difference() returns a list (Note, the output has changed from the Lab) with two elements: result which is a string that says either “Positive”, “Negative” or “Inconclusive”, and samples which is a numeric that says how many cases were effectively considered (i.e., a number between 5 and n). Print out the result of simulate_difference() using the default values to demonstrate your function works.

Reading in, exploring wage data

wage_df <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/wage.csv", 
                   skip = 16, stringsAsFactors = T)
dim(wage_df)
## [1] 3000   11
head(wage_df, 3)
##        year age     sex           maritl     race       education
## 231655 2006  18 1. Male 1. Never Married 1. White    1. < HS Grad
## 86582  2004  24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003  45 1. Male       2. Married 1. White 3. Some College
##                    region       jobclass         health health_ins
## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
##             wage
## 231655  75.04315
## 86582   70.47602
## 161300 130.98218

Linear regression modeling

Logistic regression modeling

Object Oriented Programming

In this problem you will be extending the class we saw in lecture on Friday. To get access to the code we saw in class run the following line of code once in your terminal:

devtools::install_github("benjaminleroy/coin")

In this homework problem you will be completing the implimentation of coin’s functionality by creating a new object that could help understand basic coin flipping. Specifically, in class we talked about how we’d be interested in knowing information about the number of flips.

-6b. (5 points) Create a new object toss (using a function) that takes in a coin object, and a times parameter (default is 1). Use the flip function from the coin package to generate a sequence of flips (let’s say we called the vector of these flips: flips). And then define a class object that looks like the following

list(coin = coin, # the coin instance not the function / object class coin
     tosses = flips,
     total = length(flips),
     heads = sum(flips == coin[1])
)

Document your class function. Then create a bias coin (prob: .25, .75) named my_coin and create a toss object (my_toss) associated with it and flip the coin 10 times (show what my_toss looks like).