Why statisticians learn to program

How this class will work




Communicating with the instructor/TA

Late days

R, R Studio, R Markdown

All 3 are free, and all 3 will be used extensively in this course

Laptop

Copying/cheating

Read the syllabus

It’s on the course website. Please read it (actually read it)

Last minute things

Part I

Data types, operators, variables

This class in a nutshell: functional programming

Two basic types of things/objects: data and functions

A function is a machine which turns input objects, or arguments, into an output object, or a return value (possibly with side effects), according to a definite rule


The trick to good programming is to take a big transformation and break it down into smaller ones, and then break those down, until you come to tasks which are easy (using built-in functions)
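
As a rough illustration of breaking a transformation into smaller ones, the sketch below standardizes a vector in two steps. The function names center, rescale and standardize are hypothetical, chosen only for this example; writing functions is covered properly later.

```r
# Hypothetical helper: subtract the mean from every element
center <- function(v) v - mean(v)

# Hypothetical helper: divide every element by the standard deviation
rescale <- function(v) v / sd(v)

# The "big" transformation is just the two small ones chained together,
# and each small one relies on a built-in function (mean, sd)
standardize <- function(v) rescale(center(v))

z <- standardize(c(7, 8, 10, 45))
mean(z)  # approximately 0
sd(z)    # approximately 1
```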

Before functions, data

At base level, all data can be represented in binary format, by bits (i.e., TRUE/FALSE, YES/NO, 1/0). Basic data types:
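
A minimal sketch of how R reports basic types, assuming the usual base R set (logical, integer, double, character, plus the special values NA and NaN). The particular values are chosen only for illustration.

```r
typeof(TRUE)      # "logical": a Boolean
typeof(7L)        # "integer": the L suffix requests an integer
typeof(7)         # "double": numbers are floating point by default
typeof("seven")   # "character": a string
typeof(NA)        # "logical": NA marks a missing value
is.nan(0/0)       # TRUE: NaN marks an ill-defined result
```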

Operators

-7
## [1] -7
7 + 5
## [1] 12
7 - 5
## [1] 2

7 * 5
## [1] 35
7 ^ 5
## [1] 16807
7 / 5
## [1] 1.4
7 %% 5
## [1] 2

R console

Comparison operators

These are also binary operators; they take two objects, and give back a Boolean

7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE

7 <= 5
## [1] FALSE
7 == 5
## [1] FALSE
7 != 5
## [1] TRUE

Warning: == is a comparison operator, = is not!
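
A short sketch of the difference, using a throwaway variable a (a hypothetical name): a single = assigns a value, while == asks a question and returns a Boolean.

```r
a <- 5      # assignment with <-
a == 5      # comparison: TRUE
a = 6       # a single = assigns; a is now 6
a == 5      # comparison again: FALSE
```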

Logical operators

The basic ones are & (and) and | (or)

(5 > 7) & (6 * 7 == 42)
## [1] FALSE
(5 > 7) | (6 * 7 == 42)
## [1] TRUE
(5 > 7) | (6 * 7 == 42) & (0 != 0)
## [1] FALSE
(5 > 7) | (6 * 7 == 42) & (0 != 0) | (9 - 8 >= 0)
## [1] TRUE

Note: The double forms && and || are different! We’ll see them later

More types

typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.na(7)
## [1] FALSE
is.na(7/0)
## [1] FALSE
is.na(0/0)
## [1] TRUE

is.character(7)
## [1] FALSE
is.character("7")
## [1] TRUE
is.character("seven")
## [1] TRUE
is.na("seven")
## [1] FALSE

as.character(5/6)
## [1] "0.833333333333333"
as.numeric(as.character(5/6))
## [1] 0.8333333
6 * as.numeric(as.character(5/6))
## [1] 5
5/6 == as.numeric(as.character(5/6))
## [1] FALSE
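
The last comparison fails because the character form keeps only a limited number of significant digits. A hedged sketch of how one might inspect the discrepancy and compare with a tolerance instead (x and y are hypothetical names):

```r
x <- 5/6
y <- as.numeric(as.character(x))  # round trip through text, as above

print(x, digits = 17)     # shows more digits than the default 7
print(y, digits = 17)
abs(x - y) < 1e-14        # TRUE: any difference is tiny
isTRUE(all.equal(x, y))   # TRUE: equal up to a small numerical tolerance
```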

Data can have names

We can give names to data objects; these give us variables. Some variables are built-in:

pi
## [1] 3.141593

Variables can be arguments to functions or operators, just like constants:

pi * 10
## [1] 31.41593
cos(pi)
## [1] -1

We create variables with the assignment operator, <- or =

approx_pi <- 22/7
approx_pi
## [1] 3.142857
diameter <- 10
approx_pi * diameter
## [1] 31.42857

The assignment operator also changes values:

circumference <- approx_pi * diameter
circumference
## [1] 31.42857
circumference <- 30
circumference
## [1] 30

R workspace

What variables have you defined?

ls()
## [1] "approx_pi"     "circumference" "diameter"

Getting rid of variables:

rm("circumference")
ls()
## [1] "approx_pi" "diameter"
rm(list=ls()) # Be warned! This erases everything
ls()
## character(0)

Part II

Data structures

First data structure: vectors

x <- c(7, 8, 10, 45)
x
## [1]  7  8 10 45
is.vector(x)
## [1] TRUE

vector(length = n) returns a vector of length n, filled with FALSE by default; helpful for filling things up later

weekly_hours <- vector(length = 5)
weekly_hours
## [1] FALSE FALSE FALSE FALSE FALSE
weekly_hours[5] <- 8
weekly_hours
## [1] 0 0 0 0 8
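
The FALSE values turned into zeros because assigning a number converted the whole vector to numeric. As a sketch of an alternative, the mode can be requested up front:

```r
weekly_hours <- vector(mode = "numeric", length = 5)  # explicit mode
weekly_hours                # 0 0 0 0 0
numeric(5)                  # shorthand for the same thing
character(2)                # two empty strings
typeof(vector(length = 5))  # "logical": the default mode
```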

Vector arithmetic

Arithmetic operators apply to vectors in a “componentwise” fashion

y <- c(-7, -8, -10, -45)
x + y
## [1] 0 0 0 0
x * y
## [1]   -49   -64  -100 -2025

Recycling

Recycling repeats the elements of the shorter vector when it is combined with a longer one

x + c(-7,-8)
## [1]  0  0  3 37
x^c(1,0,-1,0.5)
## [1] 7.000000 1.000000 0.100000 6.708204

Single numbers are vectors of length 1 for purposes of recycling:

2 * x
## [1] 14 16 20 90
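
One caveat worth sketching: when the longer length is not a multiple of the shorter one, R still recycles but issues a warning. Here tryCatch() is used only to capture the warning text; msg is a hypothetical name.

```r
x <- c(7, 8, 10, 45)
x + c(100, 200)   # lengths 4 and 2: recycled silently, 107 208 110 245

# Lengths 4 and 3: recycling is incomplete, so R warns.
msg <- tryCatch(x + c(1, 2, 3), warning = function(w) conditionMessage(w))
msg               # the text of the warning
```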

Can do componentwise comparisons with vectors:

x > 9
## [1] FALSE FALSE  TRUE  TRUE

Logical operators also work elementwise:

(x > 9) & (x < 20)
## [1] FALSE FALSE  TRUE FALSE

To compare whole vectors, best to use identical() or all.equal():

x == -y
## [1] TRUE TRUE TRUE TRUE
identical(x, -y)
## [1] TRUE
identical(c(0.5-0.3,0.3-0.1), c(0.3-0.1,0.5-0.3))
## [1] FALSE
all.equal(c(0.5-0.3,0.3-0.1), c(0.3-0.1,0.5-0.3))
## [1] TRUE

Note: these functions are slightly different; we’ll see more later
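
A brief sketch of one practical difference: identical() always returns a single TRUE or FALSE, while all.equal() returns either TRUE or a character description of the mismatch, so it is usually wrapped in isTRUE() when a Boolean is needed. The names a and b are hypothetical.

```r
a <- c(0.5 - 0.3, 0.3 - 0.1)
b <- c(0.3 - 0.1, 0.5 - 0.3)

identical(a, b)            # exact comparison: FALSE, as above
all.equal(a, b)            # comparison within a tolerance: TRUE
all.equal(1, 1.1)          # a character description of the difference, not FALSE
isTRUE(all.equal(1, 1.1))  # FALSE: safe to use inside a condition
```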

Functions on vectors

Many functions can take vectors as arguments:
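
A few illustrative base R functions applied to the vector x from earlier. This is a sample chosen for illustration, not an exhaustive list.

```r
x <- c(7, 8, 10, 45)
length(x)                   # 4
sum(x)                      # 70
mean(x)                     # 17.5
median(x)                   # 9
max(x)                      # 45
sort(x, decreasing = TRUE)  # 45 10 8 7
any(x > 40)                 # TRUE: at least one element exceeds 40
all(x > 40)                 # FALSE: not every element does
```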

Indexing vectors

Vector of indices:

x[c(2,4)]
## [1]  8 45

Vector of negative indices:

x[c(-1,-3)]
## [1]  8 45

Boolean vector:

x[x > 9]
## [1] 10 45
y[x > 9]
## [1] -10 -45

which() gives the indices of the elements of a Boolean vector that are TRUE:

places <- which(x > 9)
places
## [1] 3 4
y[places]
## [1] -10 -45

Named components

We can give names to elements/components of vectors, and index vectors accordingly

names(x) <- c("v1","v2","v3","fred")
names(x)
## [1] "v1"   "v2"   "v3"   "fred"
x[c("fred","v1")]
## fred   v1 
##   45    7

Note: here R is printing the labels; these are not additional components of x


names() returns another vector (of characters):

names(y) <- names(x)
sort(names(x))
## [1] "fred" "v1"   "v2"   "v3"
which(names(x) == "fred")
## [1] 4

Second data structure: matrices

A matrix is a specialization of a 2d array

z_mat <- matrix(c(40,1,60,3), nrow = 2)
z_mat
##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3
dim(z_mat)
## [1] 2 2
is.array(z_mat)
## [1] TRUE
is.matrix(z_mat)
## [1] TRUE

Indexing matrices

Can access a matrix either by pairs of indices or by the underlying vector (column-major order):

z_mat[1,2]
## [1] 60
z_mat[3]
## [1] 60
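
To make column-major order concrete, here is a small sketch with two hypothetical matrices, m_col and m_row. The byrow argument changes how the values are laid out when the matrix is built, but the underlying vector is always read down the columns.

```r
m_col <- matrix(1:6, nrow = 2)                # filled down columns (default)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled across rows

m_col[1, 2]       # 3
m_col[3]          # 3: third element of the underlying vector
m_row[1, 2]       # 2
m_row[3]          # 2: still column-major, so [3] is row 1, column 2
as.vector(m_row)  # 1 4 2 5 3 6
```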

Omitting an index means “all of it”:

z_mat[c(1,2),2]
## [1] 60  3
z_mat[,2]
## [1] 60  3
z_mat[,2, drop = FALSE]
##      [,1]
## [1,]   60
## [2,]    3

Functions on matrices

Many functions applied to an array will just boil things down to the underlying vector:

which(z_mat > 9)
## [1] 1 3

This happens unless the function is set up to handle arrays specifically
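
For instance, which() can be asked to respect the array structure through its arr.ind argument. A minimal sketch (idx is a hypothetical name):

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
which(z_mat > 9)                         # 1 3: positions in the underlying vector
idx <- which(z_mat > 9, arr.ind = TRUE)  # row/column positions instead
idx                                      # one row per match, columns "row" and "col"
```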


And there are several functions/operators that do preserve array structure:

y_mat <- matrix(1:4, nrow = 2)
y_mat + z_mat
##      [,1] [,2]
## [1,]   41   63
## [2,]    3    7

Others specifically act on each row or column of the array separately:

rowSums(z_mat)
## [1] 100   4
colSums(z_mat)
## [1] 41 63

Matrix multiplication

Has its own special operator, written %*%:

six_sevens <- matrix(rep(7,6), ncol = 3)
six_sevens
##      [,1] [,2] [,3]
## [1,]    7    7    7
## [2,]    7    7    7
z_mat %*% six_sevens # [2x2] * [2x3]
##      [,1] [,2] [,3]
## [1,]  700  700  700
## [2,]   28   28   28
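
A common slip is to use * where %*% was intended. The sketch below contrasts the two on z_mat: * multiplies entry by entry, while %*% forms row-by-column sums.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)

elementwise <- z_mat * z_mat     # each entry squared
elementwise                      # rows: 1600 3600 / 1 9

product <- z_mat %*% z_mat       # true matrix product
product                          # rows: 1660 2580 / 43 69
```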

Multiplying matrices and vectors

Numeric vectors can act like column or row vectors, as needed:

a_vec <- c(10,20)
z_mat %*% a_vec
##      [,1]
## [1,] 1600
## [2,]   70
a_vec %*% z_mat
##      [,1] [,2]
## [1,]  420  660

Matrix operators

Transpose:

t(z_mat)
##      [,1] [,2]
## [1,]   40    1
## [2,]   60    3

Determinant:

det(z_mat)
## [1] 60
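
Two further base R operators that fit here, shown as a hedged sketch: diag() extracts the diagonal, and solve() inverts a matrix or solves a linear system. The name z_inv is hypothetical.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)

diag(z_mat)                          # 40 3: the diagonal entries
z_inv <- solve(z_mat)                # matrix inverse (exists since det is nonzero)
z_inv %*% z_mat                      # the 2x2 identity, up to rounding error
all.equal(z_inv %*% z_mat, diag(2))  # TRUE
solve(z_mat, c(10, 20))              # solves z_mat %*% v = c(10, 20) for v
```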

Names in matrices
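
A minimal sketch of naming rows and columns with the base R functions rownames() and colnames(). The labels r1, r2, c1, c2 are hypothetical.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
rownames(z_mat) <- c("r1", "r2")   # hypothetical row labels
colnames(z_mat) <- c("c1", "c2")   # hypothetical column labels
z_mat
z_mat["r1", "c2"]   # 60: index by name instead of position
z_mat[, "c1"]       # a named vector: r1 = 40, r2 = 1
```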

Third data structure: lists

A list is a sequence of values, not necessarily all of the same type

my_list <- list("exponential", 7, FALSE)
my_list
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE

Most of what you can do with vectors you can also do with lists

Accessing pieces of lists

my_list[2]
## [[1]]
## [1] 7
my_list[[2]]
## [1] 7
my_list[[2]]^2
## [1] 49
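
The difference between the two bracket forms, sketched with class(): single brackets return a smaller list, while double brackets return the element itself, which is why only the latter can be squared.

```r
my_list <- list("exponential", 7, FALSE)

class(my_list[2])     # "list": single brackets keep the list wrapper
class(my_list[[2]])   # "numeric": double brackets extract the element
my_list[c(1, 3)]      # single brackets accept several indices
# my_list[2]^2        # would be an error: arithmetic is not defined on a list
```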

Expanding and contracting lists

Add to lists with c() (also works with vectors):

my_list <- c(my_list, 9)
my_list
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 9

Chop off the end of a list by setting the length to something smaller (also works with vectors):

length(my_list)
## [1] 4
length(my_list) <- 3
my_list
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE

Pluck out all but one piece of a list (also works with vectors):

my_list[-2]
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] FALSE

Names in lists

We can name some or all of the elements of a list:

names(my_list) <- c("family","mean","is.symmetric")
my_list
## $family
## [1] "exponential"
## 
## $mean
## [1] 7
## 
## $is.symmetric
## [1] FALSE
my_list[["family"]]
## [1] "exponential"
my_list["family"]
## $family
## [1] "exponential"

Lists have a special shortcut way of using names, with $:

my_list[["family"]]
## [1] "exponential"
my_list$family
## [1] "exponential"

Creating a list with names:

another_list <- list(family = "gaussian",
                     mean = 7, sd = 1, is_symmetric = TRUE)

Adding named elements:

my_list$was_estimated <- FALSE
my_list[["last_updated"]] <- "2015-09-01"

Removing a named list element, by assigning it the value NULL:

my_list$was_estimated <- NULL

Key-value pairs
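
A named list behaves like a set of key-value pairs: values are looked up by name, regardless of position. A hedged sketch with two hypothetical lists, dist_a and dist_b:

```r
dist_a <- list(family = "exponential", mean = 7)
dist_b <- list(mean = 0, sd = 1, family = "gaussian")

dist_a$family            # "exponential": first position in this list
dist_b[["family"]]       # "gaussian": third position, found by the same key
"sd" %in% names(dist_a)  # FALSE: check whether a key exists
dist_a$sd                # NULL: looking up a missing key does not raise an error
```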

Data frames

a_mat <- matrix(c(35,8,10,4), nrow = 2)
colnames(a_mat) <- c("v1","v2")
a_mat
##      v1 v2
## [1,] 35 10
## [2,]  8  4
a_mat[,"v1"]  # Try a_mat$v1 and see what happens
## [1] 35  8

a_df <- data.frame(a_mat,logicals = c(TRUE,FALSE))
a_df
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
a_df$v1
## [1] 35  8
a_df[,"v1"]
## [1] 35  8
a_df[1,]
##   v1 v2 logicals
## 1 35 10     TRUE
colMeans(a_df)
##       v1       v2 logicals 
##     21.5      7.0      0.5

Adding rows and columns

We can add rows or columns to an array or data frame with rbind() and cbind(), but be careful about forced type conversions

rbind(a_df, list(v1 = -3,v2 = -5,logicals = TRUE))
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
## 3 -3 -5     TRUE
rbind(a_df, c(3,4,6))
##   v1 v2 logicals
## 1 35 10        1
## 2  8  4        0
## 3  3  4        6
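
Forced conversions are more drastic for matrices, where every entry must share one type. The sketch below, with hypothetical names num_mat, chr_mat and b_df, contrasts this with a data frame, where each column keeps its own type.

```r
num_mat <- matrix(1:4, nrow = 2)
typeof(num_mat)                         # "integer"
chr_mat <- cbind(num_mat, c("a", "b"))  # add a character column
typeof(chr_mat)                         # "character": every entry was converted
chr_mat[1, 1]                           # "1", no longer a number

a_df <- data.frame(v1 = c(35, 8), logicals = c(TRUE, FALSE))
b_df <- cbind(a_df, label = c("a", "b"))  # existing columns are left alone
sapply(b_df, class)                       # one class per column
```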

Much more on data frames a bit later in the course …

Structures of structures

So far, every list element has been a single data value. List elements can be other data structures, e.g., vectors and matrices, even other lists:

my_list2 <- list(z_mat = z_mat, my_lucky_num = 13, my_list = my_list)
my_list2
## $z_mat
##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3
## 
## $my_lucky_num
## [1] 13
## 
## $my_list
## $my_list$family
## [1] "exponential"
## 
## $my_list$mean
## [1] 7
## 
## $my_list$is.symmetric
## [1] FALSE
## 
## $my_list$last_updated
## [1] "2015-09-01"
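
Nested pieces can be reached by chaining the access operators. A self-contained sketch, using a shortened version of my_list:

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
my_list <- list(family = "exponential", mean = 7)
my_list2 <- list(z_mat = z_mat, my_lucky_num = 13, my_list = my_list)

my_list2$my_list$family          # "exponential"
my_list2[["my_list"]][["mean"]]  # 7
my_list2$z_mat[1, 2]             # 60
str(my_list2)                    # compact overview of the nested structure
```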

Summary