Statistical Computing, 36-350
Wednesday July 24, 2019
From last time:

- set.seed()
- readRDS(), load()
- read.table(), read.csv(); it sometimes helps to take a look at the original data files to see their structure

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You're interested in the relationship that governs them
So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a regression model for \(Y\) given \(X_1,\ldots,X_p\). The goal is to estimate the parameters. Why?
Exploratory data analysis
For our data analysis we will use an extended version of the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997).
- sid: state id
- state: state abbreviation
- crime: violent crimes per 100,000 people
- murder: murders per 1,000,000
- pctmetro: the percent of the population living in metropolitan areas
- pctwhite: the percent of the population that is white
- pcths: percent of population with a high school education or above
- poverty: percent of population living under the poverty line
- single: percent of population that are single parents
- region: region the state is in
- income: median income of the state

It has 51 observations (all states + DC).
library(tidyverse)
crime <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/crime.csv") %>% select(-X)
dim(crime)
## [1] 51 10
## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5 2 3624
## 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7 2 3378
## 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1 4 4530
## 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5 4 5114
## 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1 4 4884
Some example questions we might be interested in:
- Is there a relationship between single and crime?
- Does region relate to murder and poverty?
- Can we predict crime from the other variables?
- Can we predict whether crime is high or low, from other variables?

Before pursuing a specific model, it's generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:
colnames(crime)
## [1] "state" "crime" "murder" "pctmetro" "pctwhite" "pcths"
## [7] "poverty" "single" "region" "income"
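The skewness observations below would usually come from marginal summaries or histograms of each variable; here is one illustrative call (not necessarily the exact one used in the lecture):

summary(crime %>% select(crime, murder, single, pctwhite, poverty, pctmetro)) # marginal summaries of a few variables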
What did we learn? A bunch of things! E.g.,
- region is a variable with 4 levels
- crime, murder, and proportion single are skewed right
- pctwhite is skewed left
- poverty is also skewed
- pctmetro is pretty uniformly distributed.

After asking our resident demographer some questions, we learn:
region: has levels 1: Northeast, 2: South, 3: North Central, 4: West

## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2 5299
## 2 la 1062 20.3 75.0 66.7 68.3 26.4 14.9 2 3545
## 3 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 2 3098
## 4 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 5 nm 930 8.0 56.0 87.1 75.1 17.4 13.8 4 3601
## 6 ga 723 11.4 67.7 70.8 70.9 13.5 13.0 2 4091
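The correlation matrix printed next is stored as state_cor, which we reuse below; a minimal sketch of how it could be computed over the numeric columns (this exact call is an assumption, not taken verbatim from the lecture):

state_cor <- cor(crime %>% select(-state, -region)) # correlations among the numeric variables
round(state_cor, 3) # rounded for display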
## crime murder pctmetro pctwhite pcths poverty single income
## crime 1.000 0.886 0.544 -0.677 -0.256 0.510 0.839 0.239
## murder 0.886 1.000 0.316 -0.706 -0.286 0.566 0.859 0.124
## pctmetro 0.544 0.316 1.000 -0.337 -0.004 -0.061 0.260 0.456
## pctwhite -0.677 -0.706 -0.337 1.000 0.339 -0.389 -0.656 -0.127
## pcths -0.256 -0.286 -0.004 0.339 1.000 -0.744 -0.220 0.613
## poverty 0.510 0.566 -0.061 -0.389 -0.744 1.000 0.549 -0.524
## single 0.839 0.859 0.260 -0.656 -0.220 0.549 1.000 0.087
## income 0.239 0.124 0.456 -0.127 0.613 -0.524 0.087 1.000
Some strong correlations! Let’s find the biggest (in absolute value):
state_cor[lower.tri(state_cor, diag = TRUE)] <- 0 # Why only upper tri part?
state_cor_sorted <- sort(abs(state_cor), decreasing = T)
state_cor_sorted[1]
## [1] 0.8861963
vars_big_cor <- arrayInd(which(abs(state_cor)==state_cor_sorted[1]),
dim(state_cor)) # Note: arrayInd() is useful
colnames(state_cor)[vars_big_cor]
## [1] "crime" "murder"
This is not surprising, given that murders are just a subset of violent crimes.
Let’s find the second biggest correlation (in absolute value):
state_cor_sorted[2]
## [1] 0.8589106
vars_big_cor <- arrayInd(which(abs(state_cor) == state_cor_sorted[2]),
dim(state_cor))
colnames(state_cor)[vars_big_cor]
## [1] "murder" "single"
This is more interesting! If we wanted to predict murder
from the other variables, then it seems like we should at least include single
as a predictor
ggpairs()
Can easily look at multiple scatter plots at once, using the ggpairs()
function.
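ggpairs() comes from the GGally package; here is a minimal sketch (the particular variables chosen are just for illustration):

library(GGally)
# pairwise scatter plots, correlations, and marginal densities
ggpairs(crime %>% select(murder, single, poverty, pctmetro))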
DC seems to really be messing up our visuals. Let’s get rid of it and see what the relationships look like
crime_subset <- crime %>% filter(state != "dc", pctmetro > 60)
nrow(crime_subset) # only the states with a high percent metropolitan (and no DC)
## [1] 33
Recall that region
contains which region a state comes from
## .
## 1 2 3 4
## 9 17 12 13
Does the region that a state is in relate to its rate of single individuals? Its percent of population in metropolitan areas?
Let’s do some plotting first:
crime$region_fac <- factor(crime$region,
levels = 1:4,
labels = c("Northeast",
"South",
"North Central",
"West"))
ggvis_compare <- list()
for (col_var in c("single", "pctmetro")) {
ggvis_compare[[col_var]] <- ggplot(crime) +
geom_boxplot(aes_string(y = col_var,
x = "region_fac")) +
# note the quotes around the x mapping as well
labs(x = "Region",
y = c("single" = "Proportion of Single Individuals",
"pctmetro" = "Proportion of Population Living in Metro Area")[col_var],
title = paste(col_var, "versus region"))
}
library(gridExtra) # grid.arrange() comes from the gridExtra package
grid.arrange(grobs = ggvis_compare, ncol = 2)
Visually, single
looks like it has a big difference, but pctmetro
perhaps does not. Specifically, let's compare the North Central and West distributions.
Now let's try simple two-sample t-tests:

t.test(crime$single[crime$region_fac == "West"],
crime$single[crime$region_fac == "North Central"])
##
## Welch Two Sample t-test
##
## data: crime$single[crime$region_fac == "West"] and crime$single[crime$region_fac == "North Central"]
## t = 2.1899, df = 22.722, p-value = 0.03907
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0676175 2.4041781
## sample estimates:
## mean of x mean of y
## 11.56923 10.33333
t.test(crime$pctmetro[crime$region_fac == "West"],
crime$pctmetro[crime$region_fac == "North Central"])
##
## Welch Two Sample t-test
##
## data: crime$pctmetro[crime$region_fac == "West"] and crime$pctmetro[crime$region_fac == "North Central"]
## t = 0.21503, df = 21.458, p-value = 0.8318
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.86899 19.53437
## sample estimates:
## mean of x mean of y
## 64.20769 62.37500
Confirms what we saw visually
Linear models
The linear model is arguably the most widely used statistical model, and has a place in nearly every application domain of statistics
Given response \(Y\) and predictors \(X_1,\ldots,X_p\), in a linear regression model, we posit:
\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, \quad \text{where $\epsilon \sim N(0,\sigma^2)$} \]
Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)
lm()
We can use lm()
to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp
, where Y
is the response and X1
, …, Xp
are the predictors. These refer to column names of variables in a data frame, that we pass in through the data
argument
E.g., for the crime data, to regress the response variable murder
(murder rate) onto the predictors variables single
and poverty
:
crime_all <- crime
crime <- crime %>% filter(state != "dc")
crime_lm <- lm(murder ~ single + poverty, data = crime)
class(crime_lm) # Really, a specialized list
## [1] "lm"
names(crime_lm)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
crime_lm
##
## Call:
## lm(formula = murder ~ single + poverty, data = crime)
##
## Coefficients:
## (Intercept) single poverty
## -14.5587 1.5166 0.3597
Linear models in R come with a bunch of utility functions, such as coef()
, fitted()
, residuals()
, summary()
, plot()
, predict()
, for retrieving coefficients, fitted values, residuals, producing summaries, producing diagnostic plots, making predictions, respectively
These tasks can also be done manually, by extracting the components of the returned object from lm()
, and manipulating them appropriately. But this is discouraged, because:
- these utility functions work not just for lm(), but also for models fit by glm(), gam(), and many others

coef()
So, what were the regression coefficients that we estimated? Use the coef()
function to retrieve them:

coef(crime_lm)
## (Intercept) single poverty
## -14.5586635 1.5166224 0.3596596
What does a linear regression coefficient mean, i.e., how do you interpret it? Note, from our linear model:
\[ \mathbb{E}(Y|X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, increases the expected response by \(\beta_j\)
E.g., increasing single
by one unit, while holding poverty
fixed, increases the expected value of murder
by \(\approx 1.517\)
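A quick numeric check of that interpretation (the two rows below are made-up values, used only for illustration):

new_rows <- data.frame(single = c(11, 12), poverty = c(15, 15)) # differ by 1 in single only
diff(predict(crime_lm, newdata = new_rows)) # approximately 1.517, the coefficient on single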
fitted()
What does our model predict for the murder
rates of the 50 states in question? And how do these compare to the actual murder
rates? Use the fitted()
function, then plot the actual values versus the fitted ones:
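A minimal sketch of that comparison (the plotting choices here are one reasonable option, not necessarily what was shown in the lecture):

murder_fitted <- fitted(crime_lm) # fitted values, one per state
ggplot(data.frame(fitted = murder_fitted, actual = crime$murder),
       aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") + # y = x reference line
  labs(x = "Fitted murder rate", y = "Actual murder rate")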
summary()
The function summary()
gives us a nice summary of the linear model we fit:

summary(crime_lm)
##
## Call:
## lm(formula = murder ~ single + poverty, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6288 -1.5164 -0.6263 1.9587 5.5705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.55866 2.60373 -5.591 1.11e-06 ***
## single 1.51662 0.25741 5.892 3.92e-07 ***
## poverty 0.35966 0.08858 4.060 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.399 on 47 degrees of freedom
## Multiple R-squared: 0.6521, Adjusted R-squared: 0.6373
## F-statistic: 44.05 on 2 and 47 DF, p-value: 1.674e-11
This tells us, among other things: the coefficient estimates along with their standard errors, t statistics, and p-values; the residual standard error; and the multiple and adjusted R-squared.
plot()
We can use the plot()
function to produce a series of diagnostic plots for our regression:
For the ggplot
universe we live in, currently we use the library ggfortify
to create similar visuals with autoplot
. (But in this case it’s good to know that plot
also does the job.)
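A minimal sketch with ggfortify (assuming the package is installed):

library(ggfortify)
autoplot(crime_lm) # ggplot versions of the standard lm diagnostic plots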
The results are pretty good.
There is a science (and an art?) to interpreting these; you’ll learn a lot more in the Modern Regression 36-401 course
predict()
Suppose we had a new observation (say a “state” in Canada or a US territory or even “DC”?) whose proportion of single
people is 22.1, and proportion of citizens in poverty
is 26.4. What would our linear model estimate that region's murder
rate would be? Use predict()
:
crime_new <- data.frame(single = 22.1, poverty = 26.4) # Must set up a new data frame
crime_pred <- predict(crime_lm, newdata = crime_new) # Now call predict with new df
crime_pred
## 1
## 28.4537
We’ll learn much more about making/evaluating statistical predictions later in the course
Here are some handy shortcuts for fitting linear models with lm()
(there are also many others):
No intercept (no \(\beta_0\) in the mathematical model): use 0 +
on the right-hand side of the formula, as in:

summary(lm(murder ~ 0 + poverty + single, data = crime))
##
## Call:
## lm(formula = murder ~ 0 + poverty + single, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8237 -2.5799 -0.6941 2.1534 7.4166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## poverty 0.3576 0.1131 3.161 0.00272 **
## single 0.2311 0.1478 1.563 0.12455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.064 on 48 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8646
## F-statistic: 160.6 on 2 and 48 DF, p-value: < 2.2e-16
Include all predictors: use .
on the right-hand side of the formula, as in:

summary(lm(murder ~ . - state, data = crime))
##
## Call:
## lm(formula = murder ~ . - state, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2621 -0.9004 0.0924 1.0768 2.7025
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1711673 8.8863844 0.019 0.984731
## crime 0.0035844 0.0016947 2.115 0.040862 *
## pctmetro 0.0141729 0.0179422 0.790 0.434351
## pctwhite -0.0626888 0.0292551 -2.143 0.038424 *
## pcths -0.0935941 0.1022886 -0.915 0.365814
## poverty 0.2493987 0.1223689 2.038 0.048364 *
## single 0.9183154 0.2562210 3.584 0.000928 ***
## region -0.1742092 0.2978270 -0.585 0.561961
## income 0.0006446 0.0007058 0.913 0.366664
## region_facSouth 0.9267710 0.8778056 1.056 0.297565
## region_facNorth Central 1.0745759 0.6633345 1.620 0.113300
## region_facWest NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 39 degrees of freedom
## Multiple R-squared: 0.8575, Adjusted R-squared: 0.8209
## F-statistic: 23.47 on 10 and 39 DF, p-value: 1.716e-13
Include interaction terms: use :
between two predictors of interest, to include the interaction between them as a predictor, as in:

summary(lm(murder ~ poverty + single + poverty:single, data = crime))
##
## Call:
## lm(formula = murder ~ poverty + single + poverty:single, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.636 -1.489 -0.607 1.927 5.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.327446 7.489674 -1.779 0.0818 .
## poverty 0.276377 0.482702 0.573 0.5697
## single 1.410676 0.657086 2.147 0.0371 *
## poverty:single 0.007028 0.040028 0.176 0.8614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.424 on 46 degrees of freedom
## Multiple R-squared: 0.6523, Adjusted R-squared: 0.6297
## F-statistic: 28.77 on 3 and 46 DF, p-value: 1.255e-10
You can use "-" to remove terms as well, e.g. "- 1" (which drops the intercept), "- single", etc.
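A couple of quick illustrations of the "-" shorthand (these particular calls are just for illustration):

coef(lm(murder ~ single + poverty - 1, data = crime)) # "- 1" drops the intercept, same as 0 +
coef(lm(murder ~ . - state - region_fac, data = crime)) # all predictors except state and region_fac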
Beyond linear models
Linear regression models, as we’ve said, are useful and ubiquitous. But there’s a lot else out there. What else?
Today we’ll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they’re quite different, and you’ll learn a lot more about them in the Advanced Methods for Data Analysis 36-402 course (or the Data Mining 36-462 course)
Now suppose the response \(Y \in \{0,1\}\) is a binary outcome, with predictors \(X_1,\ldots,X_p\). In a logistic regression model, we posit the relationship:
\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
(where \(Y|X\) is shorthand for \(Y|X_1,\ldots,X_p\)). Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)
glm()
We can use glm()
to fit a logistic regression model. The arguments are very similar to lm()
The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp
, where Y
is the response and X1
, …, Xp
are the predictors. These refer to column names of variables in a data frame, that we pass in through the data
argument. We must also specify family = "binomial"
to get logistic regression
E.g., for the crime data, suppose we add a column murder_high
to our data frame crime
, as the indicator of whether the murder
variable is larger than 9. To regress the binary response variable murder_high
onto the predictor variables single
and poverty
:
crime$murder_high <- as.numeric(crime$murder > 9) # New binary outcome
table(crime$murder_high) # There are 34 states with murder rates below this
##
## 0 1
## 34 16
crime_glm <- glm(murder_high ~ single + poverty, data = crime, family = "binomial")
class(crime_glm) # Really, a specialized list
## [1] "glm" "lm"
crime_glm
##
## Call: glm(formula = murder_high ~ single + poverty, family = "binomial",
## data = crime)
##
## Coefficients:
## (Intercept) single poverty
## -16.4218 1.0499 0.2598
##
## Degrees of Freedom: 49 Total (i.e. Null); 47 Residual
## Null Deviance: 62.69
## Residual Deviance: 40.13 AIC: 46.13
For retrieving coefficients, fitted values, residuals, summarizing, plotting, making predictions, the utility functions coef()
, fitted()
, residuals()
, summary()
, plot()
, predict()
work pretty much just as with lm()
. E.g.,
coef(crime_glm)
## (Intercept) single poverty
## -16.4217523 1.0498861 0.2598245
summary(crime_glm)
##
## Call:
## glm(formula = murder_high ~ single + poverty, family = "binomial",
## data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3066 -0.5873 -0.3019 0.6728 1.7568
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.4218 4.9898 -3.291 0.000998 ***
## single 1.0499 0.3783 2.775 0.005513 **
## poverty 0.2598 0.1088 2.389 0.016905 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 40.125 on 47 degrees of freedom
## AIC: 46.125
##
## Number of Fisher Scoring iterations: 5
p_hat <- fitted(crime_glm) # These are probabilities! Not binary outcomes
y_hat <- round(p_hat) # This is one way we'd compute fitted outcomes
table(y_hat, y_true = crime$murder_high) # This is a 2 x 2 "confusion matrix"
## y_true
## y_hat 0 1
## 0 29 5
## 1 5 11
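From this confusion matrix we can also read off, e.g., the in-sample classification accuracy (an extra computation, just for illustration):

mean(y_hat == crime$murder_high) # (29 + 11) / 50 = 0.8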
How do you interpret a logistic regression coefficient? Note, from our logistic model:
\[ \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p) \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, multiplies the odds by \(e^{\beta_j}\). E.g.,

coef(crime_glm)
## (Intercept) single poverty
## -16.4217523 1.0498861 0.2598245
So, increasing single
(proportion of single individuals in a state) by one unit, while holding poverty
(the poverty rate) fixed, multiplies the odds of murder_high
(murder rate over 9) by \(\approx e^{1.05} \approx 2.86\)
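One way to get all of these odds multipliers at once is to exponentiate the coefficients (a small extra computation, not from the lecture):

exp(coef(crime_glm)) # per-unit odds multipliers; the entry for single is about 2.86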
We can easily create a binary variable “on-the-fly” by using the I()
function inside a call to glm()
:
crime_glm <- glm(I(murder > 9) ~ single + poverty, data = crime,
family = "binomial")
summary(crime_glm) # Same as before
##
## Call:
## glm(formula = I(murder > 9) ~ single + poverty, family = "binomial",
## data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3066 -0.5873 -0.3019 0.6728 1.7568
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.4218 4.9898 -3.291 0.000998 ***
## single 1.0499 0.3783 2.775 0.005513 **
## poverty 0.2598 0.1088 2.389 0.016905 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 40.125 on 47 degrees of freedom
## AIC: 46.125
##
## Number of Fisher Scoring iterations: 5
Generalized additive models allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response \(Y\), our model is:
\[ \mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]
and the goal is to estimate \(\beta_0,f_1,\ldots,f_p\). For a binary response \(Y\), our model is:
\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]
and the goal is again to estimate \(\beta_0,f_1,\ldots,f_p\)
gam()
We can use the gam()
function, from the gam
package, to fit a generalized additive model. The arguments are similar to glm()
(and to lm()
), with a key distinction
The formula is now of the form Y ~ s(X1) + X2 + ... + s(Xp)
, where Y
is the response and X1
, …, Xp
are the predictors. The notation s()
is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect
So, e.g., to fit the model
\[ \mathbb{E}(\mathrm{murder}\,|\,\mathrm{poverty},\mathrm{single}) = \beta_0 + f_1(\mathrm{poverty}) + \beta_2 \mathrm{single} \]
we use:
library(gam)
crime_gam <- gam(murder ~ s(poverty) + single, data = crime)
class(crime_gam) # Again, a specialized list
## [1] "Gam" "glm" "lm"
crime_gam
## Call:
## gam(formula = murder ~ s(poverty) + single, data = crime)
##
## Degrees of Freedom: 49 total; 43.99992 Residual
## Residual Deviance: 237.5786
Most of our utility functions work just as before. E.g.,
summary(crime_gam)
##
## Call: gam(formula = murder ~ s(poverty) + single, data = crime)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8828 -1.4846 -0.6256 1.8517 5.4802
##
## (Dispersion Parameter for gaussian family taken to be 5.3995)
##
## Null Deviance: 777.7488 on 49 degrees of freedom
## Residual Deviance: 237.5786 on 43.9999 degrees of freedom
## AIC: 233.8178
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(poverty) 1 307.34 307.34 56.920 1.834e-09 ***
## single 1 167.30 167.30 30.984 1.458e-06 ***
## Residuals 44 237.58 5.40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(poverty) 3 2.0364 0.1226
## single
But now, plot()
, instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used s()
):
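A minimal call, using the crime_gam fit from above (the se option here is just one choice):

plot(crime_gam, se = TRUE) # one panel per term: a smooth curve for poverty, a straight line for single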
(Note this isn't currently implemented in ggfortify.)
Summary

- lm() allows you to fit a linear model by specifying a formula, in terms of column names of a given data frame
- coef(), fitted(), residuals(), summary(), plot()/autoplot(), predict() are very handy and should be used over manual access tricks
- Logistic regression: glm() with family = "binomial" and all the same utility functions
- Generalized additive models: gam() and its utility functions