set.seed()
readRDS()
, load()
read.table()
, read.csv()
read.table()
, read.csv()
; helps sometimes take a look at the original data files to see their structureYou have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them
So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called regression model for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate parameters. Why?
Exploratory data analysis
For our data analysis we will use the an extended version of crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997).
sid
: state idstate
: state abbreviationcrime
: violent crimes per 100,000 peoplemurder
: murders per 1,000,000pctmetro
: the percent of the population living in metropolitan areaspctwhite
: the percent of the population that is whitepcths
: percent of population with a high school education or abovepoverty
: percent of population living under poverty linesingle
: and percent of population that are single parentsregion
: region the state is inincome
: median income of the stateIt has 51 observations (all states + DC).
library(tidyverse)
crime <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/crime.csv") %>% select(-X)
dim(crime)
## [1] 51 10
head(crime)
## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5 2 3624
## 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7 2 3378
## 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1 4 4530
## 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5 4 5114
## 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1 4 4884
Some example questions we might be interested in:
single
and crime
?region
and murder
, poverty
?crime
from the other variables?crime
is high or low, from other variables?Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:
library(gridExtra)
colnames(crime) # These are the variables
## [1] "state" "crime" "murder" "pctmetro" "pctwhite" "pcths"
## [7] "poverty" "single" "region" "income"
gghist <- list()
for (col_var in colnames(crime)[-1]) { # excluding state abb
gghist[[col_var]] <- ggplot(crime) +
geom_histogram(aes_string(x = col_var),
fill = "lightblue", color= "black", bins = 20) +
labs(x = paste("Histogram of",col_var))
}
grid.arrange(grobs = gghist)
What did we learn? A bunch of things! E.g.,
region
, is a variable with 4 levelscrime
, murder
and proportion single
pctwhite
poverty
is skewed leftpctmetro
is pretty uniformly distributed.After asking our resident demographer some questions, we learn:
region
: has levels 1: Northeast, 2: South, 3: North Central 4: Westcrime %>% arrange(desc(single)) %>% head
## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2 5299
## 2 la 1062 20.3 75.0 66.7 68.3 26.4 14.9 2 3545
## 3 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 2 3098
## 4 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 5 nm 930 8.0 56.0 87.1 75.1 17.4 13.8 4 3601
## 6 ga 723 11.4 67.7 70.8 70.9 13.5 13.0 2 4091
ggpairs()
Can easily look at multiple scatter plots at once, using the ggpairs()
function.
library(GGally)
crime %>% select(crime, single, murder, pcths) %>%
ggpairs()
DC seems to really be messing up our visuals. Let’s get rid of it and see what the relationships look like
crime_subset <- crime %>% filter(state != "dc")
nrow(crime_subset) # only the states now
## [1] 50
crime_subset %>% select(crime, single, murder, pcths) %>%
ggpairs()