# Fitting Models to Data

Wednesday July 24, 2019

# Logistics

• Additional Office Hour: Friday 4:30 - 5:30 pm (Wean 3715)
• Extra Lecture, Thursday (1:45 - 2:30): dplyr and ggplot (will be mostly focused on review and building up better intuition - definitely not required / will not you impact on homework if you don’t go)
• Reminders:
• Review notes from Tudor on Assignments
• Look at solutions for examples of what I expect the solution to look like
• Next week: Tudor will be covering Monday and Wednesday Lectures

# Last week: Simulation

• Running simulations is an integral part of being a statistician in the 21st century
• R provides us with a utility functions for simulations from a wide variety of distributions
• To make your simulation results reproducible, you must set the seed, using set.seed()
• There is a natural connection between iteration, functions, and simulations
• Saving and loading results can be done in two formats: rds and rdata formats
• Read in data from a previous R session with readRDS(), load()
• Read in data from the outside with read.table(), read.csv()
• Can sometimes be tricky to get arguments right in read.table(), read.csv(); helps sometimes take a look at the original data files to see their structure

# Why fit statistical (regression) models?

You have some data $$X_1,\ldots,X_p,Y$$: the variables $$X_1,\ldots,X_p$$ are called predictors, and $$Y$$ is called a response. You’re interested in the relationship that governs them

So you posit that $$Y|X_1,\ldots,X_p \sim P_\theta$$, where $$\theta$$ represents some unknown parameters. This is called regression model for $$Y$$ given $$X_1,\ldots,X_p$$. Goal is to estimate parameters. Why?

• To assess model validity, predictor importance (inference)
• To predict future $$Y$$’s from future $$X_1,\ldots,X_p$$’s (prediction)

# Part I

Exploratory data analysis

# Crime by State Data

For our data analysis we will use the an extended version of crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997).

• sid: state id
• state: state abbreviation
• crime: violent crimes per 100,000 people
• murder: murders per 1,000,000
• pctmetro: the percent of the population living in metropolitan areas
• pctwhite: the percent of the population that is white
• pcths: percent of population with a high school education or above
• poverty: percent of population living under poverty line
• single: and percent of population that are single parents
• region: region the state is in
• income: median income of the state

It has 51 observations (all states + DC).

library(tidyverse)
dim(crime)
## [1] 51 10
head(crime)
##   state crime murder pctmetro pctwhite pcths poverty single region income
## 1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3      4   6315
## 2    al   780   11.6     67.4     73.5  66.9    17.4   11.5      2   3624
## 3    ar   593   10.2     44.7     82.9  66.3    20.0   10.7      2   3378
## 4    az   715    8.6     84.7     88.6  78.7    15.4   12.1      4   4530
## 5    ca  1078   13.1     96.7     79.3  76.2    18.2   12.5      4   5114
## 6    co   567    5.8     81.8     92.5  84.4     9.9   12.1      4   4884

Some example questions we might be interested in:

• What is the relationship between single and crime?
• What is the relationship between region and murder, poverty?
• Can we predict crime from the other variables?
• Can we predict whether crime is high or low, from other variables?

# Exploratory data analysis

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:

• What are the distributions of the variables?
• Are there distinct subgroups of samples?
• Are there any noticeable outliers?
• Are there interesting relationship/trends to model?

# Distributions of state crime variables

library(gridExtra)
colnames(crime) # These are the variables
##  [1] "state"    "crime"    "murder"   "pctmetro" "pctwhite" "pcths"
##  [7] "poverty"  "single"   "region"   "income"
gghist <- list()
for (col_var in colnames(crime)[-1]) { # excluding state abb
gghist[[col_var]] <- ggplot(crime) +
geom_histogram(aes_string(x = col_var),
fill = "lightblue", color= "black", bins = 20) +
labs(x = paste("Histogram of",col_var))
}
grid.arrange(grobs = gghist)

What did we learn? A bunch of things! E.g.,

• region, is a variable with 4 levels
• We see a single outlier in rates of crime, murder and proportion single
• We see two outliers in pctwhite
• We see that the distributution of poverty is skewed left
• and that the pctmetro is pretty uniformly distributed.

After asking our resident demographer some questions, we learn:

• region: has levels 1: Northeast, 2: South, 3: North Central 4: West
• DC is the extreme outlier in most of our graphics (which makes sense)
crime %>% arrange(desc(single)) %>% head
##   state crime murder pctmetro pctwhite pcths poverty single region income
## 1    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1      2   5299
## 2    la  1062   20.3     75.0     66.7  68.3    26.4   14.9      2   3545
## 3    ms   434   13.5     30.7     63.3  64.3    24.7   14.7      2   3098
## 4    ak   761    9.0     41.8     75.2  86.6     9.1   14.3      4   6315
## 5    nm   930    8.0     56.0     87.1  75.1    17.4   13.8      4   3601
## 6    ga   723   11.4     67.7     70.8  70.9    13.5   13.0      2   4091

Let’s find the second biggest correlation (in absolute value):

state_cor_sorted[2]
## [1] 0.8589106
vars_big_cor <- arrayInd(which(abs(state_cor) == state_cor_sorted[2]),
dim(state_cor))
colnames(state_cor)[vars_big_cor] 
## [1] "murder" "single"

This is more interesting! If we wanted to predict murder from the other variables, then it seems like we should at least include single as a predictor

# Visualizing relationships among variables, with ggpairs()

Can easily look at multiple scatter plots at once, using the ggpairs() function.

library(GGally)
crime %>% select(crime, single, murder, pcths, pctmetro) %>%
ggpairs()

# Inspecting relationships over a subset of the observations

DC seems to really be messing up our visuals. Let’s get rid of it and see what the relationships look like

crime_subset <- crime %>% filter(state != "dc", pctmetro > 60)
nrow(crime_subset) # only the high metropolitan cities
## [1] 33
crime_subset %>% select(crime, single, murder, pcths) %>%
ggpairs()

# Testing means between two different groups

Recall that region contains which region an state comes from

crime %>% pull(region) %>% table
## .
##  1  2  3  4
##  9 17 12 13
# table(crime$region) Does the region that state is in relate to rates of single individual? rates of metropoltian area? Let’s do some plotting first: crime$region_fac <- factor(crime\$region,
levels = 1:4,
labels = c("Northeast",
"South",
"North Central",
"West"))

ggvis_compare <- list()
for (col_var in c("single", "pctmetro")) {
ggvis_compare[[col_var]] <- ggplot(crime) +
geom_boxplot(aes_string(y = col_var,
x = "region_fac")) +
# notes quotes for x making as well
labs(x = "Region",
y = c("single" = "Proportion of Single Individuals",
"pctmetro" = "Proportion of Population Living in Metro Area")[col_var],
title = paste(col_var, "verse region"))
}

grid.arrange(grobs = ggvis_compare, ncol = 2)