nchar(), substr(): functions for substring extractions and replacements
strsplit(), paste(): functions for splitting and combining strings
table(): function to get word counts, useful way of summarizing text data

Data frames
The format for the “classic” data table in statistics: data frame. Lots of the “really-statistical” parts of the R programming language presume data frames
Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)
Use data.frame(), similar to how we create lists
my_df <- data.frame(nums = seq(0.1,0.6, by = 0.1), chars = letters[1:6],
bools = sample(c(TRUE,FALSE), 6, replace = TRUE))
my_df
## nums chars bools
## 1 0.1 a TRUE
## 2 0.2 b TRUE
## 3 0.3 c TRUE
## 4 0.4 d TRUE
## 5 0.5 e TRUE
## 6 0.6 f TRUE
# Note, a list can have different lengths for different elements!
my_list <- list(nums = seq(0.1,0.6,by=0.1), chars = letters[1:12],
bools = sample(c(TRUE,FALSE), 6, replace = TRUE))
my_list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE TRUE FALSE TRUE TRUE TRUE
my_df[,1] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my_df[,"nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my_df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my_df$chars # Note: a factor here, since data.frame() converted strings to factors by default before R 4.0.0
## [1] a b c d e f
## Levels: a b c d e f
as.character(my_df$chars) # Converting it back to a character data type
## [1] "a" "b" "c" "d" "e" "f"
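Whether a character column comes back as a factor depends on the R version: before R 4.0.0, data.frame() converted strings to factors by default, and from 4.0.0 on it leaves them as characters. A minimal sketch that makes the choice explicit with the stringsAsFactors argument (df_fac and df_chr are names made up for this illustration):

```r
# Same kind of columns as my_df, with the string handling spelled out
df_fac <- data.frame(nums = 1:3, chars = letters[1:3], stringsAsFactors = TRUE)
df_chr <- data.frame(nums = 1:3, chars = letters[1:3], stringsAsFactors = FALSE)

class(df_fac$chars)  # "factor"
class(df_chr$chars)  # "character"
```

Setting the argument explicitly means the code behaves the same way regardless of which R version runs it.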
Factors are slightly different than the data types we’ve seen before, but are most similar to strings.
f_pol <- factor(sample(c("R", "D"), size = 10, replace = TRUE))
f_pol
## [1] R D R R D D D D D D
## Levels: D R
f_salsa <- factor(sample(c("mild", "medium", "hot"), size = 4, replace = TRUE),
                  levels = c("mild", "medium", "hot"), ordered = TRUE)
f_salsa
## [1] hot hot hot medium
## Levels: mild < medium < hot
as.numeric(f_salsa)
## [1] 3 3 3 2
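Because the levels of an ordered factor have a ranking, comparisons work on them. A small sketch with fixed (non-random) values, so the results are reproducible; the variable name heat is made up for this example:

```r
heat <- factor(c("hot", "mild", "medium", "hot"),
               levels = c("mild", "medium", "hot"), ordered = TRUE)

heat < "hot"       # FALSE TRUE TRUE FALSE: compares by level order
max(heat)          # hot: the highest level that occurs
as.integer(heat)   # 3 1 2 3: the underlying level codes
```

Note that the integer codes follow the order given in levels, not alphabetical order.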
Factors will be useful for visualization (tomorrow), and more. Unlike strings, a factor only accepts values that are among its levels; assigning anything else produces NA.
f_pol[2] <- "I"
## Warning in `[<-.factor`(`*tmp*`, 2, value = "I"): invalid factor level, NA
## generated
f_pol
## [1] R <NA> R R D D D D D D
## Levels: D R
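One way around the warning is to extend the set of levels before assigning the new value. A minimal sketch on a small fixed factor (the name f is just for illustration):

```r
f <- factor(c("R", "D", "R"))
levels(f)                        # "D" "R"

levels(f) <- c(levels(f), "I")   # add "I" to the allowed levels first
f[2] <- "I"                      # now the assignment works, no NA generated
f
levels(f)                        # "D" "R" "I"
```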
Oftentimes it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame
class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
## [1] "matrix"
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
class(state.region) # Factor of regions for the 50 states
## [1] "factor"
head(state.region)
## [1] South West West South West West
## Levels: Northeast South North Central West
class(state.division) # Factor of divisions for the 50 states
## [1] "factor"
head(state.division)
## [1] East South Central Pacific Mountain
## [4] West South Central Pacific Mountain
## 9 Levels: New England Middle Atlantic ... Pacific
# Combine these into a data frame with 50 rows and 10 columns
state_df <- data.frame(state.x77, Region = state.region,
Division = state.division)
class(state_df)
## [1] "data.frame"
head(state_df) # Note that the first 8 column names carried over from state.x77 (with spaces replaced by dots)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area Region Division
## Alabama 50708 South East South Central
## Alaska 566432 West Pacific
## Arizona 113417 West Mountain
## Arkansas 51945 South West South Central
## California 156361 West Pacific
## Colorado 103766 West Mountain
To add columns: we can either use data.frame(), or directly define a new named column
# First way: use data.frame() to concatenate on a new column
state_df <- data.frame(state_df, Cool = sample(c(TRUE, FALSE), nrow(state_df), replace = TRUE))
head(state_df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central TRUE
## Alaska West Pacific TRUE
## Arizona West Mountain TRUE
## Arkansas South West South Central TRUE
# Second way: just directly define a new named column
state_df$Score <- sample(1:100, nrow(state_df), replace = TRUE)
head(state_df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool Score
## Alabama South East South Central TRUE 14
## Alaska West Pacific TRUE 82
## Arizona West Mountain TRUE 19
## Arkansas South West South Central TRUE 82
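New columns are often computed from existing ones rather than sampled at random. A sketch under that assumption; the Density column is hypothetical and not part of the original data (Population is in thousands, Area in square miles):

```r
df <- data.frame(state.x77)              # names become Life.Exp, HS.Grad
df$Density <- df$Population / df$Area    # thousands of people per square mile
df <- cbind(df, Region = state.region)   # cbind() also appends a column

dim(df)                                  # 50 rows, 10 columns
head(df[, c("Population", "Area", "Density", "Region")], 3)
```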
To delete columns: we can either use negative integer indexing, or set a column to NULL
# First way: use negative integer indexing
state_df <- state_df[,-ncol(state_df)]
head(state_df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central TRUE
## Alaska West Pacific TRUE
## Arizona West Mountain TRUE
## Arkansas South West South Central TRUE
# Second way: just directly set a column to NULL
state_df$Cool <- NULL
head(state_df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division
## Alabama South East South Central
## Alaska West Pacific
## Arizona West Mountain
## Arkansas South West South Central
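Two practical notes on deleting columns, sketched on a toy data frame (all names here are made up for illustration): columns can be removed by name instead of position, and indexing down to a single column silently returns a vector unless drop = FALSE is given.

```r
df <- data.frame(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))

# Remove a column by name rather than by position
df_no_b <- df[, setdiff(names(df), "b")]
names(df_no_b)                        # "a" "c"

# Pitfall: if only one column is left, the result collapses to a vector
class(df_no_b[, -1])                  # "logical"

# drop = FALSE keeps the data frame structure
class(df_no_b[, -1, drop = FALSE])    # "data.frame"
```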
With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing
# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state_df[state_df$Division == "New England", "Frost"])
## [1] 145.3333
mean(state_df[state_df$Division == "Pacific", "Frost"]) # Home sweet home!
## [1] 49.6
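Conditions can be combined with & ("and") and | ("or") before indexing. A short sketch; the thresholds 100 and 30 are arbitrary choices for illustration:

```r
df <- data.frame(state.x77, Region = state.region)

# Each comparison gives one TRUE/FALSE per row; & and | combine them elementwise
cold_west     <- df[df$Region == "West" & df$Frost > 100, ]
south_or_mild <- df[df$Region == "South" | df$Frost < 30, ]

rownames(cold_west)
nrow(south_or_mild)
```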
subset()
The subset() function provides a convenient alternative way of accessing rows for data frames
# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state_df_ne_1 <- subset(state_df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state_df_ne_2 <- state_df[state_df$Division == "New England", ]
all(state_df_ne_1 == state_df_ne_2)
## [1] TRUE
# Same calculation as in the last slide, using subset()
mean(subset(state_df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state_df, Division == "Pacific")$Frost) # Home sweet home!
## [1] 49.6
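subset() can also pick columns at the same time, through its select argument. A small sketch; the cutoff of 120 frost days is an arbitrary value chosen for the example:

```r
df <- data.frame(state.x77, Division = state.division)

# Rows: New England states with more than 120 frost days
# Columns: only Population and Frost
ne_cold <- subset(df, Division == "New England" & Frost > 120,
                  select = c(Population, Frost))
ne_cold
```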
apply()
R offers a family of apply functions, which allow you to apply a function across different chunks of data. They offer an alternative to explicit iteration using a for() loop; this can be simpler and faster, though not always. Summary of functions:
apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

apply(), rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:
apply(x, MARGIN=1, FUN=my_fun), to apply my_fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my_fun), to apply my_fun() across columns of a matrix or data frame x
apply(state.x77, MARGIN = 2, FUN = min) # Minimum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 365.00 3098.00 0.50 67.96 1.40 37.80
## Frost Area
## 0.00 1049.00
apply(state.x77, MARGIN = 2, FUN = max) # Maximum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 21198.0 6315.0 2.8 73.6 15.1 67.3
## Frost Area
## 188.0 566432.0
apply(state.x77, MARGIN = 2, FUN = which.max) # Index of the max in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 5 2 18 11 1 44
## Frost Area
## 28 2
apply(state.x77, MARGIN = 2, FUN = summary) # Summary of each col, get back matrix!
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Min. 365.00 3098.00 0.500 67.9600 1.400 37.800 0.00
## 1st Qu. 1079.50 3992.75 0.625 70.1175 4.350 48.050 66.25
## Median 2838.50 4519.00 0.950 70.6750 6.850 53.250 114.50
## Mean 4246.42 4435.80 1.170 70.8786 7.378 53.108 104.46
## 3rd Qu. 4968.50 4813.50 1.575 71.8925 10.675 59.150 139.75
## Max. 21198.00 6315.00 2.800 73.6000 15.100 67.300 188.00
## Area
## Min. 1049.00
## 1st Qu. 36985.25
## Median 54277.00
## Mean 70735.88
## 3rd Qu. 81162.50
## Max. 566432.00
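The examples above all use MARGIN = 2. To see both directions side by side, here is a sketch on a tiny matrix (m is a made-up example) where the results are easy to check by hand:

```r
m <- matrix(1:6, nrow = 2)   # 2 x 3 matrix, filled column by column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

apply(m, MARGIN = 1, FUN = sum)   # one result per row:    9 12
apply(m, MARGIN = 2, FUN = sum)   # one result per column: 3 7 11
```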
For a custom function, we can just define it beforehand, and then use apply() as usual
# Our custom function: trimmed mean
trimmed_mean <- function(v) {
  q1 <- quantile(v, probs = 0.1)
  q2 <- quantile(v, probs = 0.9)
  return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN = 2, FUN = trimmed_mean)
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
We’ll learn more about functions later (don’t worry too much at this point about the details of the function definition).
Instead of defining a custom function beforehand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN = 2, FUN = function(v) {
  q1 <- quantile(v, probs = 0.1)
  q2 <- quantile(v, probs = 0.9)
  return(mean(v[q1 <= v & v <= q2]))
})
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
We can tell apply() to pass extra arguments to the function in question. E.g., we can use: apply(x, MARGIN=1, FUN=my_fun, extra_arg_1, extra_arg_2), for two extra arguments extra_arg_1, extra_arg_2 to be passed to my_fun()
# Our custom function: trimmed mean, with user-specified percentiles
trimmed_mean <- function(v, p1, p2) {
  q1 <- quantile(v, probs = p1)
  q2 <- quantile(v, probs = p2)
  return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN = 2, FUN = trimmed_mean, p1 = 0.01, p2 = 0.99)
## Population Income Illiteracy Life Exp Murder
## 3974.125000 4424.520833 1.136735 70.882708 7.341667
## HS Grad Frost Area
## 53.131250 104.895833 61860.687500
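The same mechanism works with built-in functions. As a sketch, the trim argument of mean() can be forwarded through apply(). Note that mean() trims a fraction of the observations from each end, which is close to, but not the same rule as, the quantile-based trimmed_mean() defined above, so the numbers need not match.

```r
# Anything named after FUN is forwarded to FUN on every call
trimmed <- apply(state.x77, MARGIN = 2, FUN = mean, trim = 0.1)
trimmed

# Same thing done by hand for a single column
mean(state.x77[, "Income"], trim = 0.1)
```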
What kind of data type will apply() give us? Depends on what function we pass. Summary, say, with FUN=my_fun:
If my_fun() returns a single value, then apply() will return a vector
If my_fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN = 1 or MARGIN = 2)
If my_fun() returns different length outputs for different inputs, then apply() will return a list
If my_fun() returns a list, then apply() will return a list
We’ll grapple with this on the lab/hw.
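These cases can be checked on a tiny matrix. A minimal sketch; three_stats() is a helper made up for this illustration, not something from the course code:

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# Hypothetical helper that always returns k = 3 values
three_stats <- function(v) c(min = min(v), mean = mean(v), max = max(v))

r1 <- apply(m, 2, sum)                     # single value each -> vector
r2 <- apply(m, 2, three_stats)             # 3 values per column -> 3 x 3 matrix
r3 <- apply(m, 1, three_stats)             # 3 values per row -> 3 x 2 matrix
r4 <- apply(m, 2, function(v) v[v > 2])    # lengths 0, 2, 2 -> list

length(r1)   # 3
dim(r2)      # 3 3
dim(r3)      # 3 2
class(r4)    # "list"
```

In r3 the results for the rows of m end up as columns of the output, which is why the output has k rows for either value of MARGIN.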
Don’t overuse the apply paradigm! There are lots of special functions that are optimized and will be both simpler and faster than using apply(). E.g.,
rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the maximum position in each row of a matrix
Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x <- matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN = 1, function(v) { return(sum(v > 0)) })
## [1] 3 1 2
# Do this instead (much faster, simpler)
rowSums(x > 0)
## [1] 3 1 2
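The other special functions give the same answers as their apply() counterparts. A quick check, using a small hand-made matrix (x_small, a name chosen for this example) so the result of max.col() can be read off directly:

```r
# Column means two ways
by_apply    <- apply(state.x77, MARGIN = 2, FUN = mean)
by_colMeans <- colMeans(state.x77)
all.equal(by_apply, by_colMeans)          # TRUE

# Position of the largest entry in each row (no ties in this matrix)
x_small <- matrix(c(1, 9, 3,
                    7, 2, 5), nrow = 2, byrow = TRUE)
max.col(x_small)                          # 2 1
apply(x_small, MARGIN = 1, FUN = which.max)   # 2 1
```

One caveat: by default max.col() breaks ties at random, so the two approaches can differ when a row has repeated maxima.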
lapply(), sapply(), tapply()

lapply(), elements of a list or vector

The lapply() function takes inputs as in: lapply(x, FUN = my_fun), to apply my_fun() across elements of a list or vector x. The output is always a list
my_list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE TRUE FALSE TRUE TRUE TRUE
lapply(my_list, FUN = mean) # Get a warning: mean() can't be applied to chars
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $nums
## [1] 0.35
##
## $chars
## [1] NA
##
## $bools
## [1] 0.6666667
lapply(my_list, FUN = summary)
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 2 4
sapply(), elements of a list or vector

The sapply() function works just like lapply(), but tries to simplify the return value whenever possible. E.g., most common is the conversion from a list to a vector
sapply(my_list, FUN = mean) # Simplifies the result, now a vector
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## nums chars bools
## 0.3500000 NA 0.6666667
sapply(my_list, FUN = summary) # Can't simplify, so still a list
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 2 4
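When every result has the same length greater than one, sapply() simplifies to a matrix instead. And since the simplification depends on the data, vapply() is a stricter alternative that asks for the expected result type up front. A sketch on a made-up list num_list:

```r
num_list <- list(a = 1:5, b = c(2, 4, 6), c = 10)

# Every range() result has length 2, so we get a 2 x 3 matrix
rng <- sapply(num_list, FUN = range)
rng

# vapply() errors out if a result is not a single number
avg <- vapply(num_list, FUN = mean, FUN.VALUE = numeric(1))
avg   # a = 3, b = 4, c = 10
```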
tapply(), levels of a factor vector

The function tapply() takes inputs as in: tapply(x, INDEX = my_index, FUN = my_fun), to apply my_fun() to subsets of entries in x that share a common level in my_index
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX = state.region, FUN = mean)
## Northeast South North Central West
## 132.7778 64.6250 138.8333 102.1538
tapply(state.x77[,"Frost"], INDEX = state.region, FUN = sd)
## Northeast South North Central West
## 30.89408 31.30682 23.89307 68.87652
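One way to think about tapply(): it groups the entries of x by the factor, then applies the function to each group. The sketch below reproduces the regional means by doing the grouping by hand with split() (which also works on plain vectors) and then using sapply():

```r
frost <- state.x77[, "Frost"]

by_tapply <- tapply(frost, INDEX = state.region, FUN = mean)
by_split  <- sapply(split(frost, state.region), FUN = mean)

# Same numbers, in the same order of levels
all.equal(as.vector(by_tapply), as.vector(by_split))   # TRUE
```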
split(), split by levels of a factor

The function split() splits up the rows of a data frame by levels of a factor, as in: split(x, f=my_index) to split a data frame x according to levels of my_index
# Split up the state.x77 matrix according to region
state_by_reg <- split(data.frame(state.x77), f = state.region)
class(state_by_reg) # The result is a list
## [1] "list"
names(state_by_reg) # This has 4 elements for the 4 regions
## [1] "Northeast" "South" "North Central" "West"
class(state_by_reg[[1]]) # Each element is a data frame
## [1] "data.frame"
# For each region, display the first 3 rows of the data frame
lapply(state_by_reg, FUN = head, 3)
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## Maine 1058 3694 0.7 70.39 2.7 54.7 161
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103
## Area
## Connecticut 4862
## Maine 30920
## Massachusetts 7826
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Area
## Alaska 566432
## Arizona 113417
## California 156361
# For each region, average each of the 8 numeric variables
lapply(state_by_reg, FUN = function(df) {
  return(apply(df, MARGIN = 2, mean))
})
## $Northeast
## Population Income Illiteracy Life.Exp Murder
## 5495.111111 4570.222222 1.000000 71.264444 4.722222
## HS.Grad Frost Area
## 53.966667 132.777778 18141.000000
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375
## Frost Area
## 64.62500 54605.12500
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667
## Frost Area
## 138.83333 62652.00000
##
## $West
## Population Income Illiteracy Life.Exp Murder
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00
## HS.Grad Frost Area
## 6.200000e+01 1.021538e+02 1.344630e+05
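Since every region yields a vector of the same 8 means, sapply() can collect them into a single matrix, and colMeans() can stand in for the inner apply(). A sketch of that variation (reg_means is a name chosen for this example):

```r
state_by_reg <- split(data.frame(state.x77), f = state.region)

# One column per region, one row per variable
reg_means <- sapply(state_by_reg, FUN = colMeans)
dim(reg_means)                    # 8 4
round(reg_means["Frost", ], 2)    # regional Frost averages
```

The matrix form makes it easy to compare regions on one variable by reading across a row.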
subset(): function for extracting rows of a data frame meeting a condition
split(): function for splitting up rows of a data frame, according to a factor variable
apply(): function for applying a given routine to rows or columns of a matrix or data frame
lapply(): similar, but used for applying a routine to elements of a vector or list
sapply(): similar, but will try to simplify the return type, in comparison to lapply()
tapply(): function for applying a given routine to groups of elements in a vector or list, according to a factor variable