- Additional Office Hour: Friday 4:30 - 5:30 pm (Wean 3715)
- Extra Lecture, Thursday (1:45 - 2:30): dplyr and ggplot (will be mostly focused on review and building up better intuition - definitely not required / will not you impact on homework if you donâ€™t go)
- Reminders:
- Review notes from Tudor on Assignments
- Look at solutions for examples of what I expect the solution to look like

- Next week: Tudor will be covering Monday and Wednesday Lectures

- Running simulations is an integral part of being a statistician in the 21st century
- R provides us with a utility functions for simulations from a wide variety of distributions
- To make your simulation results reproducible, you must set the seed, using
`set.seed()`

- There is a natural connection between iteration, functions, and simulations
- Saving and loading results can be done in two formats: rds and rdata formats
- Read in data from a previous R session with
`readRDS()`

,`load()`

- Read in data from the outside with
`read.table()`

,`read.csv()`

- Can sometimes be tricky to get arguments right in
`read.table()`

,`read.csv()`

; helps sometimes take a look at the original data files to see their structure

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. Youâ€™re interested in the relationship that governs them

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called **regression model** for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate parameters. Why?

- To assess model validity, predictor importance (
**inference**) - To predict future \(Y\)â€™s from future \(X_1,\ldots,X_p\)â€™s (
**prediction**)

*Exploratory data analysis*

For our data analysis we will use the an extended version of crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997).

`sid`

: state id`state`

: state abbreviation`crime`

: violent crimes per 100,000 people`murder`

: murders per 1,000,000`pctmetro`

: the percent of the population living in metropolitan areas`pctwhite`

: the percent of the population that is white`pcths`

: percent of population with a high school education or above

`poverty`

: percent of population living under poverty line`single`

: and percent of population that are single parents`region`

: region the state is in`income`

: median income of the state

It has 51 observations (all states + DC).

```
library(tidyverse)
crime <- read.csv("https://raw.githubusercontent.com/benjaminleroy/36-350-summer-data/master/Week4/crime.csv") %>% select(-X)
dim(crime)
```

`## [1] 51 10`

`head(crime)`

```
## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5 2 3624
## 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7 2 3378
## 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1 4 4530
## 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5 4 5114
## 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1 4 4884
```

Some example questions we might be interested in:

- What is the relationship between
`single`

and`crime`

? - What is the relationship between
`region`

and`murder`

,`poverty`

? - Can we predict
`crime`

from the other variables? - Can we predict whether
`crime`

is high or low, from other variables?

Before pursuing a specific model, itâ€™s generally a good idea to look at your data. When done in a structured way, this is called **exploratory data analysis**. E.g., you might investigate:

- What are the distributions of the variables?
- Are there distinct subgroups of samples?
- Are there any noticeable outliers?
- Are there interesting relationship/trends to model?

```
library(gridExtra)
colnames(crime) # These are the variables
```

```
## [1] "state" "crime" "murder" "pctmetro" "pctwhite" "pcths"
## [7] "poverty" "single" "region" "income"
```

```
gghist <- list()
for (col_var in colnames(crime)[-1]) { # excluding state abb
gghist[[col_var]] <- ggplot(crime) +
geom_histogram(aes_string(x = col_var),
fill = "lightblue", color= "black", bins = 20) +
labs(x = paste("Histogram of",col_var))
}
grid.arrange(grobs = gghist)
```

What did we learn? A bunch of things! E.g.,

`region`

, is a variable with 4 levels- We see a single outlier in rates of
`crime`

,`murder`

and proportion`single`

- We see two outliers in
`pctwhite`

- We see that the distributution of
`poverty`

is skewed left - and that the
`pctmetro`

is pretty uniformly distributed.

After asking our resident demographer some questions, we learn:

`region`

: has levels 1: Northeast, 2: South, 3: North Central 4: West- DC is the extreme outlier in most of our graphics (which makes ense)

`crime %>% arrange(desc(single)) %>% head`

```
## state crime murder pctmetro pctwhite pcths poverty single region income
## 1 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2 5299
## 2 la 1062 20.3 75.0 66.7 68.3 26.4 14.9 2 3545
## 3 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 2 3098
## 4 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 4 6315
## 5 nm 930 8.0 56.0 87.1 75.1 17.4 13.8 4 3601
## 6 ga 723 11.4 67.7 70.8 70.9 13.5 13.0 2 4091
```

`ggpairs()`

Can easily look at multiple scatter plots at once, using the `ggpairs()`

function.

```
library(GGally)
crime %>% select(crime, single, murder, pcths) %>%
ggpairs()
```

DC seems to really be messing up our visuals. Letâ€™s get rid of it and see what the relationships look like

```
crime_subset <- crime %>% filter(state != "dc")
nrow(crime_subset) # only the states now
```

`## [1] 50`

```
crime_subset %>% select(crime, single, murder, pcths) %>%
ggpairs()
```