Last week: Text manipulation

Strings are, simply put, sequences of characters bound together
Text data occurs frequently “in the wild”, so you should learn how to deal with it!
nchar(), substr(): functions for substring extractions and replacements
strsplit(), paste(): functions for splitting and combining strings
Reconstitution: take lines of text, combine into one long string, then split to get the words
table(): function to get word counts, useful way of summarizing text data
Zipf’s law: word frequency tends to be inversely proportional to (a power of) rank

Part I

Data frames

Data frames

The format for the “classic” data table in statistics: data frame. Lots of the “really-statistical” parts of the R programming language presume data frames

Think of each row as an observation/case
Think of each columns as a variable/feature
Not just a matrix because variables can have different types
Both rows and columns can be assigned names

Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)

Creating a data frame

Use data.frame(), similar to how we create lists

my_df <- data.frame(nums = seq(0.1,0.6, by = 0.1), chars = letters[1:6], 
                   bools = sample(c(TRUE,FALSE), 6, replace = TRUE))
my_df

##   nums chars bools
## 1  0.1     a FALSE
## 2  0.2     b FALSE
## 3  0.3     c FALSE
## 4  0.4     d  TRUE
## 5  0.5     e  TRUE
## 6  0.6     f FALSE

# Note, a list can have different lengths for different elements!
my_list <- list(nums = seq(0.1,0.6,by=0.1), chars = letters[1:12], 
               bools = sample(c(TRUE,FALSE), 6, replace = TRUE))
my_list

## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## 
## $chars
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
## 
## $bools
## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

Indexing a data frame (Base R)

By rows/columns: similar to how we index matrices
By columns only: similar to how we index lists

my_df[,1] # Also works for a matrix

## [1] 0.1 0.2 0.3 0.4 0.5 0.6

my_df[,"nums"] # Also works for a matrix

## [1] 0.1 0.2 0.3 0.4 0.5 0.6

my_df$nums # Doesn't work for a matrix, but works for a list

## [1] 0.1 0.2 0.3 0.4 0.5 0.6

my_df$chars # Note: this one has been converted into a factor data type

## [1] a b c d e f
## Levels: a b c d e f

as.character(my_df$chars) # Converting it back to a character data type

## [1] "a" "b" "c" "d" "e" "f"

(A pause from Data Frames) Factors

Factors are slightly different that the data types we’ve seen before -but are most similar to strings.

Factors have levels, e.g. regions of states, Political Parties

f_pol <- factor(sample(c("R", "D"), size = 10, replace = T))
f_pol

##  [1] R R D D R R R R D R
## Levels: D R

Factors can be ordered, e.g. salsa spice levels

f_salsa <- factor(sample(c("mild", "medium","hot"), size = 4, replace = T),
                  levels = c("mild", "medium","hot"), ordered = T)
f_salsa

## [1] hot    mild   medium hot   
## Levels: mild < medium < hot

as.numeric(f_salsa)

## [1] 3 1 2 3

Factors will be useful to visualization (tomorrow), and more. Factors are slightly different than other data types.

cannot include new levels

f_pol[2] <- "I"

## Warning in `[<-.factor`(`*tmp*`, 2, value = "I"): invalid factor level, NA
## generated

f_pol

##  [1] R    <NA> D    D    R    R    R    R    D    R   
## Levels: D R

(Data Frames continued) Creating a data frame from a matrix

Often times it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame

class(state.x77) # Built-in matrix of states data, 50 states x 8 variables

## [1] "matrix"

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

class(state.region) # Factor of regions for the 50 states

## [1] "factor"

head(state.region)

## [1] South West  West  South West  West 
## Levels: Northeast South North Central West

class(state.division) # Factor of divisions for the 50 states

## [1] "factor"

head(state.division)

## [1] East South Central Pacific            Mountain          
## [4] West South Central Pacific            Mountain          
## 9 Levels: New England Middle Atlantic ... Pacific

# Combine these into a data frame with 50 rows and 10 columns
state_df <- data.frame(state.x77, Region = state.region, 
                       Division = state.division)
class(state_df)

## [1] "data.frame"

head(state_df) # Note that the first 8 columns name carried over from state.x77

##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area Region           Division
## Alabama     50708  South East South Central
## Alaska     566432   West            Pacific
## Arizona    113417   West           Mountain
## Arkansas    51945  South West South Central
## California 156361   West            Pacific
## Colorado   103766   West           Mountain

Adding columns to a data frame

To add columns: we can either use data.frame(), or directly define a new named column

# First way: use data.frame() to concatenate on a new column
state_df <- data.frame(state_df, Cool = sample(c(T,F), nrow(state_df), rep = TRUE))
head(state_df, 4)

##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division  Cool
## Alabama   South East South Central  TRUE
## Alaska     West            Pacific FALSE
## Arizona    West           Mountain FALSE
## Arkansas  South West South Central  TRUE

# Second way: just directly define a new named column
state_df$Score <- sample(1:100, nrow(state_df), replace = TRUE)
head(state_df, 4)

##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division  Cool Score
## Alabama   South East South Central  TRUE    28
## Alaska     West            Pacific FALSE    45
## Arizona    West           Mountain FALSE    41
## Arkansas  South West South Central  TRUE    39

Deleting columns from a data frame

To delete columns: we can either use negative integer indexing, or set a column to NULL

# First way: use negative integer indexing
state_df <- state_df[,-ncol(state_df)]
head(state_df, 4)

##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division  Cool
## Alabama   South East South Central  TRUE
## Alaska     West            Pacific FALSE
## Arizona    West           Mountain FALSE
## Arkansas  South West South Central  TRUE

# Second way: just directly set a column to NULL
state_df$Cool <- NULL
head(state_df, 4)

##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division
## Alabama   South East South Central
## Alaska     West            Pacific
## Arizona    West           Mountain
## Arkansas  South West South Central

Reminder: Boolean indexing

With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing

# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state_df[state_df$Division == "New England", "Frost"])

## [1] 145.3333

mean(state_df[state_df$Division == "Pacific", "Frost"]) # Home sweet home!

## [1] 49.6

`subset()`

The subset() function provides a convenient alternative way of accessing rows for data frames

# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state_df_ne_1 <- subset(state_df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state_df_ne_2 <- state_df[state_df$Division == "New England", ]
all(state_df_ne_1 == state_df_ne_2)

## [1] TRUE

# Same calculation as in the last slide, using subset()
mean(subset(state_df, Division == "New England")$Frost)

## [1] 145.3333

mean(subset(state_df, Division == "Pacific")$Frost) # Home sweet home!

## [1] 49.6

Part II

apply()

The apply family

R offers a family of apply functions, which allow you to apply a function across different chunks of data. Offers an alternative to explicit iteration using for() loop; can be simpler and faster, though not always. Summary of functions:

apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

`apply()`, rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:

apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x

apply(state.x77, MARGIN = 2, FUN = min) # Minimum entry in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##     365.00    3098.00       0.50      67.96       1.40      37.80 
##      Frost       Area 
##       0.00    1049.00

apply(state.x77, MARGIN = 2, FUN = max) # Maximum entry in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##    21198.0     6315.0        2.8       73.6       15.1       67.3 
##      Frost       Area 
##      188.0   566432.0

apply(state.x77, MARGIN = 2, FUN = which.max) # Index of the max in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##          5          2         18         11          1         44 
##      Frost       Area 
##         28          2

apply(state.x77, MARGIN = 2, FUN = summary) # Summary of each col, get back matrix!

##         Population  Income Illiteracy Life Exp Murder HS Grad  Frost
## Min.        365.00 3098.00      0.500  67.9600  1.400  37.800   0.00
## 1st Qu.    1079.50 3992.75      0.625  70.1175  4.350  48.050  66.25
## Median     2838.50 4519.00      0.950  70.6750  6.850  53.250 114.50
## Mean       4246.42 4435.80      1.170  70.8786  7.378  53.108 104.46
## 3rd Qu.    4968.50 4813.50      1.575  71.8925 10.675  59.150 139.75
## Max.      21198.00 6315.00      2.800  73.6000 15.100  67.300 188.00
##              Area
## Min.      1049.00
## 1st Qu.  36985.25
## Median   54277.00
## Mean     70735.88
## 3rd Qu.  81162.50
## Max.    566432.00

Applying a custom function

For a custom function, we can just define it before hand, and the use apply() as usual

# Our custom function: trimmed mean
trimmed_mean <- function(v) {  
  q1 <- quantile(v, prob = 0.1)
  q2 <- quantile(v, prob = 0.9)
  return(mean(v[q1 <= v & v <= q2]))
}

apply(state.x77, MARGIN = 2, FUN = trimmed_mean)

##  Population      Income  Illiteracy    Life Exp      Murder     HS Grad 
##  3384.27500  4430.07500     1.07381    70.91775     7.29750    53.33750 
##       Frost        Area 
##   104.68293 56575.72500

We’ll learn more about functions later (don’t worry too much at this point about the details of the function definition)

Applying a custom function “on-the-fly”

Instead of defining a custom function before hand, we can just define it “on-the-fly”. Sometimes this is more convenient

# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN = 2, FUN = function(v) { 
  q1 <- quantile(v, prob = 0.1)
  q2 <- quantile(v, prob = 0.9)
  return(mean(v[q1 <= v & v <= q2]))
})

##  Population      Income  Illiteracy    Life Exp      Murder     HS Grad 
##  3384.27500  4430.07500     1.07381    70.91775     7.29750    53.33750 
##       Frost        Area 
##   104.68293 56575.72500

Applying a function that takes extra arguments

Can tell apply() to pass extra arguments to the function in question. E.g., can use: apply(x, MARGIN=1, FUN=my_fun, extra_arg_1, extra_arg_2), for two extra arguments extra_arg_1, extra.arg.2 to be passed to my_fun()

# Our custom function: trimmed mean, with user-specified percentiles
trimmed_mean <- function(v, p1, p2) {
  q1 <- quantile(v, prob = p1)
  q2 <- quantile(v, prob = p2)
  return(mean(v[q1 <= v & v <= q2]))
}

apply(state.x77, MARGIN = 2, FUN = trimmed_mean, p1 = 0.01, p2 = 0.99)

##   Population       Income   Illiteracy     Life Exp       Murder 
##  3974.125000  4424.520833     1.136735    70.882708     7.341667 
##      HS Grad        Frost         Area 
##    53.131250   104.895833 61860.687500

What’s the return argument?

What kind of data type will apply() give us? Depends on what function we pass. Summary, say, with FUN=my_fun():

If my_fun() returns a single value, then apply() will return a vector
If my_fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN = 1 or MARGIN = 2)
If my_fun() returns different length outputs for different inputs, then apply() will return a list
If my_fun() returns a list, then apply() will return a list

We’ll grapple with this on the lab/hw.

Optimized functions for special tasks

Don’t overuse the apply paradigm! There’s lots of special functions that optimized are will be both simpler and faster than using apply(). E.g.,

rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the maximum position in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?

x <- matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN = 1, function(v) { return(sum(v > 0)) })

## [1] 1 1 2

# Do this instead (much faster, simpler)
rowSums(x > 0)

## [1] 1 1 2

Part III

lapply(), sapply(), tapply()

`lapply()`, elements of a list or vector

The lapply() function takes inputs as in: lapply(x, FUN = my_fun), to apply my_fun() across elements of a list or vector x. The output is always a list

my_list

## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## 
## $chars
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
## 
## $bools
## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

lapply(my_list, FUN = mean) # Get a warning: mean() can't be applied to chars

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## $nums
## [1] 0.35
## 
## $chars
## [1] NA
## 
## $bools
## [1] 0.5

lapply(my_list, FUN = summary)

## $nums
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.225   0.350   0.350   0.475   0.600 
## 
## $chars
##    Length     Class      Mode 
##        12 character character 
## 
## $bools
##    Mode   FALSE    TRUE 
## logical       3       3

`sapply()`, elements of a list or vector

The sapply() function works just like lapply(), but tries to simplify the return value whenever possible. E.g., most common is the conversion from a list to a vector

sapply(my_list, FUN = mean) # Simplifies the result, now a vector

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##  nums chars bools 
##  0.35    NA  0.50

sapply(my_list, FUN = summary) # Can't simplify, so still a list

## $nums
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.225   0.350   0.350   0.475   0.600 
## 
## $chars
##    Length     Class      Mode 
##        12 character character 
## 
## $bools
##    Mode   FALSE    TRUE 
## logical       3       3

`tapply()`, levels of a factor vector

The function tapply() takes inputs as in: tapply(x, INDEX = my_index, FUN = my_fun), to apply my.fun() to subsets of entries in x that share a common level in my.index

# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX = state.region, FUN = mean)

##     Northeast         South North Central          West 
##      132.7778       64.6250      138.8333      102.1538

tapply(state.x77[,"Frost"], INDEX = state.region, FUN = sd)

##     Northeast         South North Central          West 
##      30.89408      31.30682      23.89307      68.87652

`split()`, split by levels of a factor

The function split() split up the rows of a data frame by levels of a factor, as in: split(x, f=my.index) to split a data frame x according to levels of my.index

# Split up the state.x77 matrix according to region
state_by_reg <- split(data.frame(state.x77), f = state.region)
class(state_by_reg) # The result is a list

## [1] "list"

names(state_by_reg) # This has 4 elements for the 4 regions

## [1] "Northeast"     "South"         "North Central" "West"

class(state_by_reg[[1]]) # Each element is a data frame

## [1] "data.frame"

# For each region, display the first 3 rows of the data frame
lapply(state_by_reg, FUN = head, 3)

## $Northeast
##               Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Connecticut         3100   5348        1.1    72.48    3.1    56.0   139
## Maine               1058   3694        0.7    70.39    2.7    54.7   161
## Massachusetts       5814   4755        1.1    71.83    3.3    58.5   103
##                Area
## Connecticut    4862
## Maine         30920
## Massachusetts  7826
## 
## $South
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20 50708
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65 51945
## Delaware        579   4809        0.9    70.06    6.2    54.6   103  1982
## 
## $`North Central`
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Illinois      11197   5107        0.9    70.14   10.3    52.6   127 55748
## Indiana        5313   4458        0.7    70.88    7.1    52.9   122 36097
## Iowa           2861   4628        0.5    72.56    2.3    59.0   140 55941
## 
## $West
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## California      21198   5114        1.1    71.71   10.3    62.6    20
##              Area
## Alaska     566432
## Arizona    113417
## California 156361

# For each region, average each of the 8 numeric variables
lapply(state_by_reg, FUN = function(df) { 
  return(apply(df, MARGIN = 2, mean)) 
})

## $Northeast
##   Population       Income   Illiteracy     Life.Exp       Murder 
##  5495.111111  4570.222222     1.000000    71.264444     4.722222 
##      HS.Grad        Frost         Area 
##    53.966667   132.777778 18141.000000 
## 
## $South
##  Population      Income  Illiteracy    Life.Exp      Murder     HS.Grad 
##  4208.12500  4011.93750     1.73750    69.70625    10.58125    44.34375 
##       Frost        Area 
##    64.62500 54605.12500 
## 
## $`North Central`
##  Population      Income  Illiteracy    Life.Exp      Murder     HS.Grad 
##  4803.00000  4611.08333     0.70000    71.76667     5.27500    54.51667 
##       Frost        Area 
##   138.83333 62652.00000 
## 
## $West
##   Population       Income   Illiteracy     Life.Exp       Murder 
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 
##      HS.Grad        Frost         Area 
## 6.200000e+01 1.021538e+02 1.344630e+05

Summary

Data frames are a representation of the “classic” data table in R: rows are observations/cases, columns are variables/features
Each column can be a different data type (but must be the same length)
subset(): function for extracting rows of a data frame meeting a condition
split(): function for splitting up rows of a data frame, according to a factor variable
apply(): function for applying a given routine to rows or columns of a matrix or data frame
lapply(): similar, but used for applying a routine to elements of a vector or list
sapply(): similar, but will try to simplify the return type, in comparison to lapply()
tapply(): function for applying a given routine to groups of elements in a vector or list, according to a factor variable

Data Frames and Apply