Fitting Models to Data

Statistical Computing, 36-350

Wednesday July 24, 2019

Logistics

Last week: Simulation

Why fit statistical (regression) models?

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called regression model for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate parameters. Why?

Part I

Exploratory data analysis

Crime by State Data

For our data analysis we will use the an extended version of crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997).

It has 51 observations (all states + DC).

library(tidyverse)
crime <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/crime.csv") %>% select(-X)
dim(crime)
## [1] 51 10
head(crime)
##   state crime murder pctmetro pctwhite pcths poverty single region income
## 1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3      4   6315
## 2    al   780   11.6     67.4     73.5  66.9    17.4   11.5      2   3624
## 3    ar   593   10.2     44.7     82.9  66.3    20.0   10.7      2   3378
## 4    az   715    8.6     84.7     88.6  78.7    15.4   12.1      4   4530
## 5    ca  1078   13.1     96.7     79.3  76.2    18.2   12.5      4   5114
## 6    co   567    5.8     81.8     92.5  84.4     9.9   12.1      4   4884

Some example questions we might be interested in:

Exploratory data analysis

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:

Distributions of state crime variables

library(gridExtra)
colnames(crime) # These are the variables
##  [1] "state"    "crime"    "murder"   "pctmetro" "pctwhite" "pcths"   
##  [7] "poverty"  "single"   "region"   "income"
gghist <- list()
for (col_var in colnames(crime)[-1]) { # excluding state abb
  gghist[[col_var]] <- ggplot(crime) +
    geom_histogram(aes_string(x = col_var),
       fill = "lightblue", color= "black", bins = 20) +
    labs(x = paste("Histogram of",col_var))
}
grid.arrange(grobs = gghist)

What did we learn? A bunch of things! E.g.,

After asking our resident demographer some questions, we learn:

crime %>% arrange(desc(single)) %>% head
##   state crime murder pctmetro pctwhite pcths poverty single region income
## 1    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1      2   5299
## 2    la  1062   20.3     75.0     66.7  68.3    26.4   14.9      2   3545
## 3    ms   434   13.5     30.7     63.3  64.3    24.7   14.7      2   3098
## 4    ak   761    9.0     41.8     75.2  86.6     9.1   14.3      4   6315
## 5    nm   930    8.0     56.0     87.1  75.1    17.4   13.8      4   3601
## 6    ga   723   11.4     67.7     70.8  70.9    13.5   13.0      2   4091

Let’s find the second biggest correlation (in absolute value):

state_cor_sorted[2]
## [1] 0.8589106
vars_big_cor <- arrayInd(which(abs(state_cor) == state_cor_sorted[2]), 
                        dim(state_cor))
colnames(state_cor)[vars_big_cor] 
## [1] "murder" "single"

This is more interesting! If we wanted to predict murder from the other variables, then it seems like we should at least include single as a predictor

Visualizing relationships among variables, with ggpairs()

Can easily look at multiple scatter plots at once, using the ggpairs() function.

library(GGally)
crime %>% select(crime, single, murder, pcths, pctmetro) %>%
  ggpairs()

Inspecting relationships over a subset of the observations

DC seems to really be messing up our visuals. Let’s get rid of it and see what the relationships look like

crime_subset <- crime %>% filter(state != "dc", pctmetro > 60) 
nrow(crime_subset) # only the high metropolitan cities
## [1] 33
crime_subset %>% select(crime, single, murder, pcths) %>%
  ggpairs()

Testing means between two different groups

Recall that region contains which region an state comes from

crime %>% pull(region) %>% table
## .
##  1  2  3  4 
##  9 17 12 13
# table(crime$region)

Does the region that state is in relate to rates of single individual? rates of metropoltian area?

Let’s do some plotting first:

crime$region_fac <- factor(crime$region,
                           levels = 1:4,
                           labels = c("Northeast",
                                      "South",
                                      "North Central",
                                      "West"))

ggvis_compare <- list()
for (col_var in c("single", "pctmetro")) {
  ggvis_compare[[col_var]] <- ggplot(crime) +
    geom_boxplot(aes_string(y = col_var,
                            x = "region_fac")) +
      # notes quotes for x making as well
    labs(x = "Region",
         y = c("single" = "Proportion of Single Individuals",
               "pctmetro" = "Proportion of Population Living in Metro Area")[col_var],
         title = paste(col_var, "verse region"))
}

grid.arrange(grobs = ggvis_compare, ncol = 2)