Statistical Computing, 36-350
Wednesday July 24, 2019
From last time:

- set.seed()
- readRDS(), load()
- read.table(), read.csv(); it sometimes helps to take a look at the original data files to see their structure

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You're interested in the relationship that governs them
So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a regression model for \(Y\) given \(X_1,\ldots,X_p\). The goal is to estimate the parameters. Why?
Exploratory data analysis
For our data analysis we will use an extended version of the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997).
- sid: state id
- state: state abbreviation
- crime: violent crimes per 100,000 people
- murder: murders per 1,000,000
- pctmetro: the percent of the population living in metropolitan areas
- pctwhite: the percent of the population that is white
- pcths: percent of population with a high school education or above
- poverty: percent of population living under the poverty line
- single: percent of population that are single parents
- region: region the state is in
- income: median income of the state

It has 51 observations (all states + DC).
library(tidyverse)
crime <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/crime.csv") %>% select(-X)
dim(crime)
## [1] 51 10
## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5 2 3624
## 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7 2 3378
## 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1 4 4530
## 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5 4 5114
## 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1 4 4884
Some example questions we might be interested in:
- Is there a relationship between single and crime?
- Does region relate to murder and poverty?
- Can we predict crime from the other variables?
- Can we predict whether crime is high or low, from other variables?

Before pursuing a specific model, it's generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:
colnames(crime)
## [1] "state" "crime" "murder" "pctmetro" "pctwhite" "pcths"
## [7] "poverty" "single" "region" "income"
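The skewness observations below would usually come from marginal summaries or histograms of each variable; here is one illustrative call (not necessarily the exact one used in the lecture):

summary(crime %>% select(crime, murder, single, pctwhite, poverty, pctmetro)) # marginal summaries of a few variables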
What did we learn? A bunch of things! E.g.,
- region is a variable with 4 levels
- crime, murder, and proportion single are skewed right
- pctwhite is skewed left
- poverty is also skewed
- pctmetro is pretty uniformly distributed.

After asking our resident demographer some questions, we learn:
region: has levels 1: Northeast, 2: South, 3: North Central, 4: West

## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2 5299
## 2 la 1062 20.3 75.0 66.7 68.3 26.4 14.9 2 3545
## 3 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 2 3098
## 4 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 5 nm 930 8.0 56.0 87.1 75.1 17.4 13.8 4 3601
## 6 ga 723 11.4 67.7 70.8 70.9 13.5 13.0 2 4091
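The correlation matrix printed next is stored as state_cor, which we reuse below; a minimal sketch of how it could be computed over the numeric columns (this exact call is an assumption, not taken verbatim from the lecture):

state_cor <- cor(crime %>% select(-state, -region)) # correlations among the numeric variables
round(state_cor, 3) # rounded for display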
## crime murder pctmetro pctwhite pcths poverty single income
## crime 1.000 0.886 0.544 -0.677 -0.256 0.510 0.839 0.239
## murder 0.886 1.000 0.316 -0.706 -0.286 0.566 0.859 0.124
## pctmetro 0.544 0.316 1.000 -0.337 -0.004 -0.061 0.260 0.456
## pctwhite -0.677 -0.706 -0.337 1.000 0.339 -0.389 -0.656 -0.127
## pcths -0.256 -0.286 -0.004 0.339 1.000 -0.744 -0.220 0.613
## poverty 0.510 0.566 -0.061 -0.389 -0.744 1.000 0.549 -0.524
## single 0.839 0.859 0.260 -0.656 -0.220 0.549 1.000 0.087
## income 0.239 0.124 0.456 -0.127 0.613 -0.524 0.087 1.000
Some strong correlations! Let’s find the biggest (in absolute value):
state_cor[lower.tri(state_cor, diag = TRUE)] <- 0 # Why only upper tri part?
state_cor_sorted <- sort(abs(state_cor), decreasing = T)
state_cor_sorted[1]
## [1] 0.8861963
vars_big_cor <- arrayInd(which(abs(state_cor)==state_cor_sorted[1]),
dim(state_cor)) # Note: arrayInd() is useful
colnames(state_cor)[vars_big_cor]
## [1] "crime" "murder"
This is not surprising, given that murders are just a subset of violent crimes.
Let’s find the second biggest correlation (in absolute value):
state_cor_sorted[2]
## [1] 0.8589106
vars_big_cor <- arrayInd(which(abs(state_cor) == state_cor_sorted[2]),
dim(state_cor))
colnames(state_cor)[vars_big_cor]
## [1] "murder" "single"
This is more interesting! If we wanted to predict murder
from the other variables, then it seems like we should at least include single
as a predictor
ggpairs()
Can easily look at multiple scatter plots at once, using the ggpairs()
function.
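ggpairs() comes from the GGally package; here is a minimal sketch (the particular variables chosen are just for illustration):

library(GGally)
# pairwise scatter plots, correlations, and marginal densities
ggpairs(crime %>% select(murder, single, poverty, pctmetro))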
DC seems to really be messing up our visuals. Let’s get rid of it and see what the relationships look like
crime_subset <- crime %>% filter(state != "dc", pctmetro > 60)
nrow(crime_subset) # only the states with a high percent metropolitan (and no DC)
## [1] 33
Recall that region
contains which region a state comes from
## .
## 1 2 3 4
## 9 17 12 13
Does the region that a state is in relate to its rate of single individuals? Its percent of population in metropolitan areas?
Let’s do some plotting first:
crime$region_fac <- factor(crime$region,
levels = 1:4,
labels = c("Northeast",
"South",
"North Central",
"West"))
ggvis_compare <- list()
for (col_var in c("single", "pctmetro")) {
ggvis_compare[[col_var]] <- ggplot(crime) +
geom_boxplot(aes_string(y = col_var,
x = "region_fac")) +
# note the quotes around the x mapping as well
labs(x = "Region",
y = c("single" = "Proportion of Single Individuals",
"pctmetro" = "Proportion of Population Living in Metro Area")[col_var],
title = paste(col_var, "versus region"))
}
library(gridExtra) # grid.arrange() comes from the gridExtra package
grid.arrange(grobs = ggvis_compare, ncol = 2)
Visually, single
looks like it has a big difference, but pctmetro
perhaps does not. Specifically, let's compare the North Central and West distributions.
Now let's try simple two-sample t-tests:

t.test(crime$single[crime$region_fac == "West"],
crime$single[crime$region_fac == "North Central"])
##
## Welch Two Sample t-test
##
## data: crime$single[crime$region_fac == "West"] and crime$single[crime$region_fac == "North Central"]
## t = 2.1899, df = 22.722, p-value = 0.03907
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0676175 2.4041781
## sample estimates:
## mean of x mean of y
## 11.56923 10.33333
t.test(crime$pctmetro[crime$region_fac == "West"],
crime$pctmetro[crime$region_fac == "North Central"])
##
## Welch Two Sample t-test
##
## data: crime$pctmetro[crime$region_fac == "West"] and crime$pctmetro[crime$region_fac == "North Central"]
## t = 0.21503, df = 21.458, p-value = 0.8318
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.86899 19.53437
## sample estimates:
## mean of x mean of y
## 64.20769 62.37500
Confirms what we saw visually
Linear models
The linear model is arguably the most widely used statistical model, and has a place in nearly every application domain of statistics
Given response \(Y\) and predictors \(X_1,\ldots,X_p\), in a linear regression model, we posit:
\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, \quad \text{where $\epsilon \sim N(0,\sigma^2)$} \]
Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)
lm()
We can use lm()
to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp
, where Y
is the response and X1
, …, Xp
are the predictors. These refer to column names of variables in a data frame, that we pass in through the data
argument
E.g., for the crime data, to regress the response variable murder
(murder rate) onto the predictors variables single
and poverty
:
crime_all <- crime
crime <- crime %>% filter(state != "dc")
crime_lm <- lm(murder ~ single + poverty, data = crime)
class(crime_lm) # Really, a specialized list
## [1] "lm"
names(crime_lm)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
crime_lm
##
## Call:
## lm(formula = murder ~ single + poverty, data = crime)
##
## Coefficients:
## (Intercept) single poverty
## -14.5587 1.5166 0.3597
Linear models in R come with a bunch of utility functions, such as coef()
, fitted()
, residuals()
, summary()
, plot()
, predict()
, for retrieving coefficients, fitted values, residuals, producing summaries, producing diagnostic plots, making predictions, respectively
These tasks can also be done manually, by extracting the components of the returned object from lm()
, and manipulating them appropriately. But this is discouraged, because:
- these utility functions work not just for lm(), but also for models fit by glm(), gam(), and many others

coef()
So, what were the regression coefficients that we estimated? Use the coef()
function to retrieve them:

coef(crime_lm)
## (Intercept) single poverty
## -14.5586635 1.5166224 0.3596596
What does a linear regression coefficient mean, i.e., how do you interpret it? Note, from our linear model:
\[ \mathbb{E}(Y|X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, increases the expected response by \(\beta_j\)
E.g., increasing single
by one unit, while holding poverty
fixed, increases the expected value of murder
by \(\approx 1.517\)
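A quick numeric check of that interpretation (the two rows below are made-up values, used only for illustration):

new_rows <- data.frame(single = c(11, 12), poverty = c(15, 15)) # differ by 1 in single only
diff(predict(crime_lm, newdata = new_rows)) # approximately 1.517, the coefficient on single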
fitted()
What does our model predict for the murder
rates of the 50 states in question? And how do these compare to the actual murder
rates? Use the fitted()
function, then plot the actual values versus the fitted ones:
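A minimal sketch of that comparison (the plotting choices here are one reasonable option, not necessarily what was shown in the lecture):

murder_fitted <- fitted(crime_lm) # fitted values, one per state
ggplot(data.frame(fitted = murder_fitted, actual = crime$murder),
       aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") + # y = x reference line
  labs(x = "Fitted murder rate", y = "Actual murder rate")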
summary()
The function summary()
gives us a nice summary of the linear model we fit:

summary(crime_lm)
##
## Call:
## lm(formula = murder ~ single + poverty, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6288 -1.5164 -0.6263 1.9587 5.5705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.55866 2.60373 -5.591 1.11e-06 ***
## single 1.51662 0.25741 5.892 3.92e-07 ***
## poverty 0.35966 0.08858 4.060 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.399 on 47 degrees of freedom
## Multiple R-squared: 0.6521, Adjusted R-squared: 0.6373
## F-statistic: 44.05 on 2 and 47 DF, p-value: 1.674e-11
This tells us, among other things: the coefficient estimates along with their standard errors, t statistics, and p-values; the residual standard error; and the multiple and adjusted R-squared.
plot()
We can use the plot()
function to produce a series of diagnostic plots for our regression:
For the ggplot
universe we live in, currently we use the library ggfortify
to create similar visuals with autoplot
. (But in this case it’s good to know that plot
also does the job.)
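A minimal sketch with ggfortify (assuming the package is installed):

library(ggfortify)
autoplot(crime_lm) # ggplot versions of the standard lm diagnostic plots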
The results are pretty good.
There is a science (and an art?) to interpreting these; you’ll learn a lot more in the Modern Regression 36-401 course
predict()
Suppose we had a new observation (say a “state” in Canada or a US territory or even “DC”?) whose proportion of single
people is 22.1, and proportion of citizens in poverty
is 26.4. What would our linear model estimate that region's murder
rate would be? Use predict()
:
crime_new <- data.frame(single = 22.1, poverty = 26.4) # Must set up a new data frame
crime_pred <- predict(crime_lm, newdata = crime_new) # Now call predict with new df
crime_pred
## 1
## 28.4537
We’ll learn much more about making/evaluating statistical predictions later in the course
Here are some handy shortcuts for fitting linear models with lm()
(there are also many others):
No intercept (no \(\beta_0\) in the mathematical model): use 0 +
on the right-hand side of the formula, as in:

summary(lm(murder ~ 0 + poverty + single, data = crime))
##
## Call:
## lm(formula = murder ~ 0 + poverty + single, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8237 -2.5799 -0.6941 2.1534 7.4166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## poverty 0.3576 0.1131 3.161 0.00272 **
## single 0.2311 0.1478 1.563 0.12455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.064 on 48 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8646
## F-statistic: 160.6 on 2 and 48 DF, p-value: < 2.2e-16
Include all predictors: use .
on the right-hand side of the formula, as in:

summary(lm(murder ~ . - state, data = crime))
##
## Call:
## lm(formula = murder ~ . - state, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2621 -0.9004 0.0924 1.0768 2.7025
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1711673 8.8863844 0.019 0.984731
## crime 0.0035844 0.0016947 2.115 0.040862 *
## pctmetro 0.0141729 0.0179422 0.790 0.434351
## pctwhite -0.0626888 0.0292551 -2.143 0.038424 *
## pcths -0.0935941 0.1022886 -0.915 0.365814
## poverty 0.2493987 0.1223689 2.038 0.048364 *
## single 0.9183154 0.2562210 3.584 0.000928 ***
## region -0.1742092 0.2978270 -0.585 0.561961
## income 0.0006446 0.0007058 0.913 0.366664
## region_facSouth 0.9267710 0.8778056 1.056 0.297565
## region_facNorth Central 1.0745759 0.6633345 1.620 0.113300
## region_facWest NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 39 degrees of freedom
## Multiple R-squared: 0.8575, Adjusted R-squared: 0.8209
## F-statistic: 23.47 on 10 and 39 DF, p-value: 1.716e-13
Include interaction terms: use :
between two predictors of interest, to include the interaction between them as a predictor, as in:

summary(lm(murder ~ poverty + single + poverty:single, data = crime))
##
## Call:
## lm(formula = murder ~ poverty + single + poverty:single, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.636 -1.489 -0.607 1.927 5.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.327446 7.489674 -1.779 0.0818 .
## poverty 0.276377 0.482702 0.573 0.5697
## single 1.410676 0.657086 2.147 0.0371 *
## poverty:single 0.007028 0.040028 0.176 0.8614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.424 on 46 degrees of freedom
## Multiple R-squared: 0.6523, Adjusted R-squared: 0.6297
## F-statistic: 28.77 on 3 and 46 DF, p-value: 1.255e-10
You can use "-" to remove terms as well, e.g. "- 1" (which drops the intercept), "- single", etc.
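A couple of quick illustrations of the "-" shorthand (these particular calls are just for illustration):

coef(lm(murder ~ single + poverty - 1, data = crime)) # "- 1" drops the intercept, same as 0 +
coef(lm(murder ~ . - state - region_fac, data = crime)) # all predictors except state and region_fac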
Beyond linear models
Linear regression models, as we’ve said, are useful and ubiquitous. But there’s a lot else out there. What else?
Today we’ll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they’re quite different, and you’ll learn a lot more about them in the Advanced Methods for Data Analysis 36-402 course (or the Data Mining 36-462 course)
Now suppose the response \(Y \in \{0,1\}\) is a binary outcome, with predictors \(X_1,\ldots,X_p\). In a logistic regression model, we posit the relationship:
\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
(where \(Y|X\) is shorthand for \(Y|X_1,\ldots,X_p\)). Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)
glm()
We can use glm()
to fit a logistic regression model. The arguments are very similar to lm()
The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp
, where Y
is the response and X1
, …, Xp
are the predictors. These refer to column names of variables in a data frame, that we pass in through the data
argument. We must also specify family = "binomial"
to get logistic regression
E.g., for the crime data, suppose we add a column murder_high
to our data frame crime
, as the indicator of whether the murder
variable is larger than 9. To regress the binary response variable murder_high
onto the predictor variables single
and poverty
:
crime$murder_high <- as.numeric(crime$murder > 9) # New binary outcome
table(crime$murder_high) # There are 34 states with murder rates below this
##
## 0 1
## 34 16
crime_glm <- glm(murder_high ~ single + poverty, data = crime, family = "binomial")
class(crime_glm) # Really, a specialized list
## [1] "glm" "lm"
crime_glm
##
## Call: glm(formula = murder_high ~ single + poverty, family = "binomial",
## data = crime)
##
## Coefficients:
## (Intercept) single poverty
## -16.4218 1.0499 0.2598
##
## Degrees of Freedom: 49 Total (i.e. Null); 47 Residual
## Null Deviance: 62.69
## Residual Deviance: 40.13 AIC: 46.13
For retrieving coefficients, fitted values, residuals, summarizing, plotting, making predictions, the utility functions coef()
, fitted()
, residuals()
, summary()
, plot()
, predict()
work pretty much just as with lm()
. E.g.,
coef(crime_glm)
## (Intercept) single poverty
## -16.4217523 1.0498861 0.2598245
summary(crime_glm)
##
## Call:
## glm(formula = murder_high ~ single + poverty, family = "binomial",
## data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3066 -0.5873 -0.3019 0.6728 1.7568
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.4218 4.9898 -3.291 0.000998 ***
## single 1.0499 0.3783 2.775 0.005513 **
## poverty 0.2598 0.1088 2.389 0.016905 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 40.125 on 47 degrees of freedom
## AIC: 46.125
##
## Number of Fisher Scoring iterations: 5
p_hat <- fitted(crime_glm) # These are probabilities! Not binary outcomes
y_hat <- round(p_hat) # This is one way we'd compute fitted outcomes
table(y_hat, y_true = crime$murder_high) # This is a 2 x 2 "confusion matrix"
## y_true
## y_hat 0 1
## 0 29 5
## 1 5 11
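From this confusion matrix we can also read off, e.g., the in-sample classification accuracy (an extra computation, just for illustration):

mean(y_hat == crime$murder_high) # (29 + 11) / 50 = 0.8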
How do you interpret a logistic regression coefficient? Note, from our logistic model:
\[ \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p) \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, multiplies the odds by \(e^{\beta_j}\). E.g.,

coef(crime_glm)
## (Intercept) single poverty
## -16.4217523 1.0498861 0.2598245
So, increasing single
(proportion of single individuals in a state) by one unit, while holding poverty
(the poverty rate) fixed, multiplies the odds of murder_high
(murder rate over 9) by \(\approx e^{1.05} \approx 2.86\)
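One way to get all of these odds multipliers at once is to exponentiate the coefficients (a small extra computation, not from the lecture):

exp(coef(crime_glm)) # per-unit odds multipliers; the entry for single is about 2.86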
We can easily create a binary variable “on-the-fly” by using the I()
function inside a call to glm()
:
crime_glm <- glm(I(murder > 9) ~ single + poverty, data = crime,
family = "binomial")
summary(crime_glm) # Same as before
##
## Call:
## glm(formula = I(murder > 9) ~ single + poverty, family = "binomial",
## data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3066 -0.5873 -0.3019 0.6728 1.7568
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.4218 4.9898 -3.291 0.000998 ***
## single 1.0499 0.3783 2.775 0.005513 **
## poverty 0.2598 0.1088 2.389 0.016905 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 40.125 on 47 degrees of freedom
## AIC: 46.125
##
## Number of Fisher Scoring iterations: 5
Generalized additive models allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response \(Y\), our model is:
\[ \mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]
and the goal is to estimate \(\beta_0,f_1,\ldots,f_p\). For a binary response \(Y\), our model is:
\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]
and the goal is again to estimate \(\beta_0,f_1,\ldots,f_p\)
gam()
We can use the gam()
function, from the gam
package, to fit a generalized additive model. The arguments are similar to glm()
(and to lm()
), with a key distinction
The formula is now of the form Y ~ s(X1) + X2 + ... + s(Xp)
, where Y
is the response and X1
, …, Xp
are the predictors. The notation s()
is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect
So, e.g., to fit the model
\[ \mathbb{E}(\mathrm{murder}\,|\,\mathrm{poverty},\mathrm{single}) = \beta_0 + f_1(\mathrm{poverty}) + \beta_2 \mathrm{single} \]
we use:
library(gam)
crime_gam <- gam(murder ~ s(poverty) + single, data = crime)
class(crime_gam) # Again, a specialized list
## [1] "Gam" "glm" "lm"
crime_gam
## Call:
## gam(formula = murder ~ s(poverty) + single, data = crime)
##
## Degrees of Freedom: 49 total; 43.99992 Residual
## Residual Deviance: 237.5786
Most of our utility functions work just as before. E.g.,
summary(crime_gam)
##
## Call: gam(formula = murder ~ s(poverty) + single, data = crime)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8828 -1.4846 -0.6256 1.8517 5.4802
##
## (Dispersion Parameter for gaussian family taken to be 5.3995)
##
## Null Deviance: 777.7488 on 49 degrees of freedom
## Residual Deviance: 237.5786 on 43.9999 degrees of freedom
## AIC: 233.8178
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(poverty) 1 307.34 307.34 56.920 1.834e-09 ***
## single 1 167.30 167.30 30.984 1.458e-06 ***
## Residuals 44 237.58 5.40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(poverty) 3 2.0364 0.1226
## single
But now, plot()
, instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used s()
):
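A minimal call, using the crime_gam fit from above (the se option here is just one choice):

plot(crime_gam, se = TRUE) # one panel per term: a smooth curve for poverty, a straight line for single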
(Note this isn't currently implemented in ggfortify.)
Summary

- lm() allows you to fit a linear model by specifying a formula, in terms of column names of a given data frame
- coef(), fitted(), residuals(), summary(), plot()/autoplot(), predict() are very handy and should be used over manual access tricks
- Logistic regression: glm() with family = "binomial" and all the same utility functions
- Generalized additive models: gam() and its utility functions