I decided to post this old lecture (with some clean up) online, as I think it really well captures a lot of things students might want to know about logistic regression while using R
. Note: this document was written for summer students in CMU Statistics and Data Sciences’ SURE program, and was presented to them live.
Introduction
This lecture focuses on introducing a set of tools associated with poisson regression (and general additive models).
This data comes from Princeton’s Database, but is a very common dataset. They had an error and ugly data so I cleaned it up a bit.
library(tidyverse)
library(forcats)
theme_set(theme_minimal())
smoking <- read_csv(paste0("https://raw.githubusercontent.com/benjaminleroy/",
"stat315summer_data/master/",
"data_sure/smoking_data.csv")) %>%
mutate(smoke_binary = ifelse(smoke == "no", 0, 1),
age = factor(age)) %>%
select(-X1)
The data we will be using today (explorable below) contains information from a Canadian study that focusing the relationship between smoking and death (related to different age groups).
DT::datatable(smoking)
Exploratory Data Analysis
Quick EDA before any regression:
library(GGally)
ggpairs(smoking, columns = c( "smoke", "smoke_binary", "pop", "dead"))
# Create grid of 1d visuals of variable distributions in data frame
# Args:
# data: data frame you'd like to explore
# ncol: number of columns for the visual output
#
# Returns:
# list of graphics
# and creates the correct visualization
gg1d <- function(data, ncol = 3){
require(gridExtra)
require(ggplot2)
columns <- names(data)
types <- sapply(data, class)
num_types <- c("numeric", "integer")
gglist <- list()
for (i in 1:length(columns)) {
if (types[i] %in% num_types) {
gglist[[i]] <- ggplot(data, aes_string(x = columns[i])) +
geom_histogram()
}else{
gglist[[i]] <- ggplot(data, aes_string(x = columns[i])) +
geom_bar()
}
}
grid.arrange(grobs = gglist, ncol = ncol, top = "1d EDA")
return(gglist)
}
. <- gg1d(smoking, ncol = 2)
Poisson Regression:
Poisson regression is useful when we are dealing with counts, for example the number of deaths of out of population of people (our example), terrorist attacks per year per region, etc.
Additionally, poisson regression is useful when events occur rarely (otherwise one might jump to linear regression first. e.g. population per country). Though for events that occur too rarely (lots of 0s say - you should try using a slight update - see Resources below).
What is the poisson distribution again?
\[ \mathbb{P}(X = x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, x \in \{0,1,2,...\} \quad \mathbb{E}(X) = \lambda, \quad \mathbb{V}(X) = \lambda \]
Motivation for the Link function
Given that we has small amounts of counts (which tend also to be though of as rates - generally less than 1, a log
transform would make the values stretch to the full real values \(\mathbb{R}\).
Inference
Model
In our dataset we have the ability to explore the relationship between death and smoking. We should make sure to think about this problem in the poisson frame with respect to \(\lambda\) (a rate), specifically the rate of death per individual. Note that our data just gives counts of death per group population.
So we have 2 things:
- the counts of deaths per group
- and \(\lambda\) the rate of death per person
They obviously relate with a rate = count of deaths / total individuals.
If we are looking at the regression
\[\begin{align*} \log(\lambda | X) &= X\beta \\ \log(count/total |X) &= X\beta \\ \Leftrightarrow \log(count |X) &= X\beta + \log(total) \end{align*}\]
In the final transformation we call \(log(total)\) the offset. There are cases when we don’t use offsets (e.g. number of papers a graduate student has published), but offsets can be like our example or more complicated, like area of region for growing mushrooms (vs number of mushrooms) etc. Also notice that the \(\log(total)\) does not have a \(\beta\) value associated with it.
Additionally, in the graduate student example (where the population is 1 individual), you can also think of the offset being \(\log(1) = 0\).
As such, in R
we’d do the following:
poisson_glm <- glm(dead ~ age_int + smoke, offset = log(pop), family = poisson,
data = smoking)
The summary looks decently similar to our other glm
s.
summary(poisson_glm)
##
## Call:
## glm(formula = dead ~ age_int + smoke, family = poisson, data = smoking,
## offset = log(pop))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7322 -1.3737 -0.1642 0.7407 2.7613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.249136 0.088890 -70.302 < 2e-16 ***
## age_int 0.068141 0.001154 59.073 < 2e-16 ***
## smokecigarretteOnly 0.390484 0.037330 10.460 < 2e-16 ***
## smokecigarrettePlus 0.189927 0.035892 5.292 1.21e-07 ***
## smokeno -0.039428 0.046880 -0.841 0.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4055.984 on 35 degrees of freedom
## Residual deviance: 67.681 on 31 degrees of freedom
## AIC: 317.7
##
## Number of Fisher Scoring iterations: 4
Hmmmmmm… if we look at those levels of smoking it isn’t really as we expected… we really want the first level to “no” so that all those \(\beta\) values compare against not smoking…
smoking <- smoking %>% mutate(smoke = fct_relevel(factor(smoke), "no"))
poisson_glm2 <- glm(dead ~ age_int + smoke, offset = log(pop), family = poisson,
data = smoking)
poisson_glm2 %>% summary
##
## Call:
## glm(formula = dead ~ age_int + smoke, family = poisson, data = smoking,
## offset = log(pop))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7322 -1.3737 -0.1642 0.7407 2.7613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.288564 0.086484 -72.713 < 2e-16 ***
## age_int 0.068141 0.001154 59.073 < 2e-16 ***
## smokecigarPipeOnly 0.039428 0.046880 0.841 0.4
## smokecigarretteOnly 0.429912 0.039753 10.815 < 2e-16 ***
## smokecigarrettePlus 0.229356 0.038565 5.947 2.73e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4055.984 on 35 degrees of freedom
## Residual deviance: 67.681 on 31 degrees of freedom
## AIC: 317.7
##
## Number of Fisher Scoring iterations: 4
Dispersion correction
Before we talk about inference of the \(\beta\) values and how to interpret them, let’s look at dispersion - which we talked about needing to do the get good confidence intervals.
Recall that since the variance and mean of the poisson are directly linked dispersion tells us if that “linking” is actually correct, we can caculate an estimate of \(\phi\) in the equation:
\[ Var(X) = \phi \lambda \]
when \(X\) is poisson.
We can test whether the dispersion is different than 1 with a \(\chi^2\) test:
dispersion <- poisson_glm$deviance/poisson_glm$df.residual
pchisq(poisson_glm$deviance,df = poisson_glm$df.residual,
lower.tail = FALSE)
## [1] 0.0001527248
Looks like \(\phi\) isn’t 1, that’s ok we can use a quasipoisson
model.
qpoisson_glm <- glm(dead ~ age_int + smoke, offset = log(pop),
family = quasipoisson, data = smoking)
summary(qpoisson_glm)
##
## Call:
## glm(formula = dead ~ age_int + smoke, family = quasipoisson,
## data = smoking, offset = log(pop))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7322 -1.3737 -0.1642 0.7407 2.7613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.288564 0.125073 -50.279 < 2e-16 ***
## age_int 0.068141 0.001668 40.847 < 2e-16 ***
## smokecigarPipeOnly 0.039428 0.067797 0.582 0.565063
## smokecigarretteOnly 0.429912 0.057490 7.478 2e-08 ***
## smokecigarrettePlus 0.229356 0.055772 4.112 0.000267 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.09146)
##
## Null deviance: 4055.984 on 35 degrees of freedom
## Residual deviance: 67.681 on 31 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Question: How does the raw \(\beta\) estimates change between the Quasi-Poisson model (that estimates \(\phi\)) and the standard Poisson model?
Interpretation of \(\beta\)s.
Again, if we directly looked at \(\beta\) values we’d be looking at the linear change the in the \(X\) variable related to \(log(y)\). If we wanted to discuss it more like a change of rate we could look at
\[ e^{log(\lambda)} = e^{\beta_0 + \beta_1 * age\_int + ...} \]
Suppose we wanted to look at the likelihood/ rate of death between those that are 50 years old and a smoker vs a 55 year old and a non-smoker.
Question: What \(\beta\) values are we comparing?
We can’t really actually answer that question unless we assume that that all the types of smoke are the same, but let’s just assume that the first smoker only smoked cigarrettes, then we can examine
exp(sum(coef(qpoisson_glm)*c(0,5,-1,0,0)))
. So, comparing a 50 year-old-cigarrette smoker verse a 55-year-old non-smoker, the later is only 135% as likely to die compared to the former.
Intervals
To estimate confidence intervals we can still use confint
, but we
confint(qpoisson_glm,parm = "age_int")
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.06487952 0.07141881
Should probably get a better CI for this since the data is small (if you’re doing it - ask your TA).
Prediction
Unlike logistic regression we can start using the glm
version of residuals (deviance) for logistic regression, and it makes more sense.
We can look at the standard diagonistics (this time with deviance residuals not residuals).
par(mfrow = c(2, 2))
plot(qpoisson_glm)
Predicting
Predicting new values (ok we’re just getting the fitted values in the code below, but same idea), is similar to Logistic regression, if you just leave type
alone you’ll get the log value. Also note that the model we made also will take in the offset
.
predict_log <- predict(object = qpoisson_glm, newdata = smoking)
predict <- predict(object = qpoisson_glm, newdata = smoking,
type = "response")
predict %>% head
## 1 2 3 4 5 6
## 21.31690 16.40137 15.99375 57.07336 135.47095 160.11778
GAM
Now, let’s look at some GAMs (general additive models). From lecture you learned about nonlinear regression and that we can change a glm to a gam like so:
\[\begin{align*} g(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2... \\ g(y) = \beta_0 + f(X_1) + g(X_2) + ... \end{align*}\]
Specifically, we generally have \(f,g\) be loess (local polynomial regression) or smoothing splines. Using the library(gam)
, we can use gam
very similarly to what we do with glm
, the only difference is we use lo(X1)
or s(X1)
to create a 1d Loess or Smoothing spline with that variable. After this lecture I learned that library(mgcv)
provides better toos for GAM - specifically, it allows the user to have interaction terms between features.
library(gam)
data(iris)
lm_gam <- gam(Sepal.Length ~ lo(Sepal.Width) + s(Petal.Length) + Species,data = iris, family = gaussian)
summary(lm_gam)
##
## Call: gam(formula = Sepal.Length ~ lo(Sepal.Width) + s(Petal.Length) +
## Species, family = gaussian, data = iris)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.78898 -0.22747 0.02056 0.19209 0.65905
##
## (Dispersion Parameter for gaussian family taken to be 0.0927)
##
## Null Deviance: 102.1683 on 149 degrees of freedom
## Residual Deviance: 12.9049 on 139.1878 degrees of freedom
## AIC: 81.3515
##
## Number of Local Scoring Iterations: 5
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## lo(Sepal.Width) 1.00 3.433 3.433 37.028 1.066e-08 ***
## s(Petal.Length) 1.00 80.383 80.383 866.985 < 2.2e-16 ***
## Species 2.00 2.120 1.060 11.433 2.528e-05 ***
## Residuals 139.19 12.905 0.093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## lo(Sepal.Width) 2.8 1.8159 0.1507
## s(Petal.Length) 3.0 1.7429 0.1610
## Species
Note that the statistics have been changed to \(F\) statistics. For the nonparametric effects, it’s really wise to look at the ANOVA \(F\)-statistics, and we could run following to see if the additional of a smoothing spline with lo(Petal.Length)
is useful.
anova(lm_gam,
gam(Sepal.Length ~ s(Petal.Length) + Species,data = iris, family = gaussian))
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
## Analysis of Deviance Table
##
## Model 1: Sepal.Length ~ lo(Sepal.Width) + s(Petal.Length) + Species
## Model 2: Sepal.Length ~ s(Petal.Length) + Species
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 139.19 12.905
## 2 143.00 16.314 -3.8121 -3.4093 1.576e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Partial Dependence Plots
To examine the relationship between the response and a single variable (even in logistic regression), we can do the following (using the built-in plot
function). This creates a Partial Dependence Plot, basically think of them as averages of the estimated value of mean of \(Y\) holding the variable of interest fixed.
par(mfrow = c(2,2))
plot(lm_gam)
Additionally, you can still can use the basic plots from whatever glm
function you’re using (in lm we would do the following graphics) - this time in ggplot):
diag_data <- data.frame(y_train = lm_gam$y,
fitted = lm_gam$fitted.values,
residuals = lm_gam$residuals,
st_resid = rstandard(lm_gam),
leverage = hatvalues(lm_gam),
cooks = cooks.distance(lm_gam)) %>%
mutate(sqrt_st_res = sqrt(abs(st_resid)))
fit_vs_y <- ggplot(diag_data, aes(x = fitted, y = y_train)) + geom_point() +
geom_smooth()
res_vs_fit <- ggplot(diag_data, aes(x = fitted, y = residuals)) + geom_point() +
geom_smooth(se = F)
qq_norm <- ggplot(diag_data, aes(sample = residuals)) +
geom_qq(distribution = stats::qnorm) +
geom_qq_line()
sqrt_st_res_vs_fit <- ggplot(diag_data, aes(x = fitted, y = sqrt_st_res)) +
geom_point() + geom_smooth(se = F)
st_resid_vs_leverage <- ggplot(diag_data, aes(x = leverage, y = st_resid,
color = cooks)) +
geom_point() +
geom_smooth(se = F)
grid.arrange(fit_vs_y, ggplot(),
res_vs_fit, qq_norm,
sqrt_st_res_vs_fit, st_resid_vs_leverage, nrow = 3)
Additional Resources
Some more examples of GLMs and GAMs: Cosma’s Textbook, Chapter 13
R approaches for Poisson Regression Cran Package vignette: Deals with dispersion, zero-inflation (lots of 0 values as counts), other link functions, etc.