# Introduction and R Basics

Monday - July 1, 2019

# Why statisticians learn to program

• Independence: otherwise, you rely on someone else giving you exactly the right tool
• Honesty: otherwise, you end up distorting your problem to match the tools you have
• Clarity: often, turning your ideas into something a machine can do refines your thinking
• Fun: these were the best of times (the worst of times)

# How this class will work

• Instructor: Benjamin (Ben) LeRoy
• TA: Tudor Manole
• Not much programming knowledge assumed
• Some statistics knowledge assumed
• Focus is almost entirely on R
• Class will be cumulative, so keep up with the material and assignments!
• Monday/Friday: lecture for ~50 minutes, then lab for ~30 minutes
• Tuesday/Thursday class: lab for 80 minutes
• Wednesday class: Homework review (first 10 minutes). Lecture for ~40 minutes, then lab for ~30 minutes
• Labs due at 10pm each Tuesday and Thursday, submitted through Canvas
• Graded 40% on attendance, 60% on completion
• Homework due at 10pm on each Monday, also through Canvas
• No midterm; the final is a take-home programming assignment
• Very hands-on class
• Most of class time is in lab, where you learn how to code by coding and asking questions
• Labs: 30%
• Homework: 50%
• Final project: 20%
• Canvas: course information/website, used to collect submissions, and keep track of grades
• We will make all announcements via Canvas, so make sure you enable email notifications
• Piazza group: for discussions

# Communicating with the instructor/TA

• Email the TA with any questions about the course (mainly requesting an extension, an excused absence, or a regrade). Put “[36-350]” in the subject of the email
• You should not need to email the instructor for anything (unless the instructor asks you to)
• Questions about assignment content should be on Piazza
• Read the syllabus for Piazza guidelines. Piazza is not a question-answer hotline; it is meant to be a collaborative forum for students to help each other out, guided by the instructor/TA.

# Late days

• Each student is given 5 late days for the semester, for both homeworks and labs
• You can only use a maximum of 3 late days per assignment (to avoid problems with releasing the solution too late)
• Late days are automatically used (you do not need to email the TA beforehand)
• Each late day is counted as up to 24 hours after the assignment is due
• Assignments submitted after all 5 late days are used receive a 0
• Email the TA for extreme circumstances (read the syllabus)

# R, R Studio, R Markdown

• R is a programming language for statistical computing
• R Studio is an integrated development environment for R programming
• R Markdown is a markup language for combining R code with text

All 3 are free, and all 3 will be used extensively in this course

# Laptop

• You will need a laptop in every class for the lab
• If you do not have a laptop, please let the instructor/TA know. We will be able to make arrangements to lend you a laptop for the session
• As a courtesy to other students, and to help your own understanding, do not use your laptop while the instructor is lecturing

# Copying/cheating

• Don’t do it. Refer to the syllabus if you’re unclear about anything (and if you’re still unclear, come see me)
• If you are struggling with the material, let me or Tudor know immediately. (The summer session goes by fast!)

# Last minute things

• Complete and submit lab by Wednesday 10pm (note different time)
• Questions?

# Part I

Data types, operators, variables

# This class in a nutshell: functional programming

Two basic types of things/objects: data and functions

• Data: things like 7, “seven”, $$7.000$$, and $$\left[ \begin{array}{ccc} 7 & 7 & 7 \\ 7 & 7 & 7\end{array}\right]$$
• Functions: things like log, + (takes two arguments), < (two), %% (two), and mean (one)

A function is a machine which turns input objects, or arguments, into an output object, or a return value (possibly with side effects), according to a definite rule

• Programming is writing functions to transform inputs into outputs
• Good programming ensures the transformation is done easily and correctly
• Machines are made out of machines; functions are made out of functions, like $$f(a,b) = a^2 + b^2$$

The trick to good programming is to take a big transformation and break it down into smaller ones, and then break those down, until you come to tasks which are easy (using built-in functions)
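As a small sketch of "functions are made out of functions", the formula $$f(a,b) = a^2 + b^2$$ can be built from a helper that squares a number plus the built-in `+`. The names `square` and `sum_of_squares` are illustrative choices, not part of the course material.

```r
# Hypothetical helper: squares its input
square <- function(v) {
  v^2
}

# f(a, b) = a^2 + b^2, composed from square() and the built-in `+`
sum_of_squares <- function(a, b) {
  square(a) + square(b)
}

sum_of_squares(3, 4)
## [1] 25
```

Each piece is small enough to check by hand, which is the point of breaking a transformation down.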

# Before functions, data

At base level, all data can be represented in binary format, by bits (i.e., TRUE/FALSE, YES/NO, 1/0). Basic data types:

• Booleans: Direct binary values: TRUE or FALSE in R
• Integers: whole numbers (positive, negative or zero), represented by a fixed-length block of bits
• Characters: fixed-length blocks of bits, with special coding; strings: sequences of characters
• Floating point numbers: a significand that holds the number’s digits (a negative significand represents a negative number), together with an exponent that says where the decimal (or binary) point is placed relative to the beginning of the significand. E.g., $$3 \times 10^6$$ or $$1.01 \times 10^{-1}$$ or $$1 \times 2^6$$
• Missing or ill-defined values: NA, NaN, etc.
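A quick way to see these types in R is to ask for the type of a few literal values. This is only a sketch; it uses typeof(), which these notes introduce a little later, and the `L` suffix, which is R's way of writing an integer literal.

```r
typeof(TRUE)      # Boolean
## [1] "logical"
typeof(7L)        # integer (the L suffix requests an integer)
## [1] "integer"
typeof(7)         # floating point number; R's default for numbers
## [1] "double"
typeof("seven")   # character string
## [1] "character"
0/0               # ill-defined value
## [1] NaN
NA                # missing value
## [1] NA
```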

# Operators

• Unary: take just one argument. E.g., - for arithmetic negation, ! for Boolean negation
• Binary: take two arguments. E.g., +, -, *, and / (though this is only a partial operator). Also, %% (for mod), and ^ (again partial)
-7
## [1] -7
7 + 5
## [1] 12
7 - 5
## [1] 2
7 * 5
## [1] 35
7 ^ 5
## [1] 16807
7 / 5
## [1] 1.4
7 %% 5
## [1] 2

# R console

• Basic interaction with R is by typing in the console, i.e., terminal, or command line
• You type in commands, R gives back answers (or errors)
• Menus and other graphical interfaces are extras built on top of the console

# Comparison operators

These are also binary operators; they take two objects, and give back a Boolean

7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE
7 <= 5
## [1] FALSE
7 == 5
## [1] FALSE
7 != 5
## [1] TRUE

Warning: == is a comparison operator, = is not!

# Logical operators

The basic ones are & (and) and | (or)

(5 > 7) & (6 * 7 == 42)
## [1] FALSE
(5 > 7) | (6 * 7 == 42)
## [1] TRUE
(5 > 7) | (6 * 7 == 42) & (0 != 0)
## [1] FALSE
(5 > 7) | (6 * 7 == 42) & (0 != 0) | (9 - 8 >= 0)
## [1] TRUE
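The last two expressions above depend on operator precedence: R evaluates & before |, much as multiplication comes before addition. A minimal illustration with constants, where adding parentheses changes the answer:

```r
TRUE | FALSE & FALSE      # read as TRUE | (FALSE & FALSE)
## [1] TRUE
(TRUE | FALSE) & FALSE    # parentheses force the | to happen first
## [1] FALSE
```

When in doubt, explicit parentheses make the intended grouping clear to readers as well as to R.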

Note: The double forms && and || are different! We’ll see them later

# More types

• The typeof() function returns the data type
• is.foo() functions return Booleans for whether the argument is of type foo
• as.foo() (tries to) “cast” its argument to type foo, to translate it sensibly into such a value
typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.na(7)
## [1] FALSE
is.na(7/0)
## [1] FALSE
is.na(0/0)
## [1] TRUE
is.character(7)
## [1] FALSE
is.character("7")
## [1] TRUE
is.character("seven")
## [1] TRUE
is.na("seven")
## [1] FALSE
as.character(5/6)
## [1] "0.833333333333333"
as.numeric(as.character(5/6))
## [1] 0.8333333
6 * as.numeric(as.character(5/6))
## [1] 5
5/6 == as.numeric(as.character(5/6))
## [1] FALSE
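The final comparison above fails because floating point numbers are stored with finite precision, so tiny rounding differences can break ==. The same effect shows up in plain arithmetic. One common remedy, sketched here, is to compare up to a small tolerance with all.equal():

```r
0.1 + 0.2 == 0.3                     # exact comparison fails
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # equal up to a small numerical tolerance
## [1] TRUE
```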

# Data can have names

We can give names to data objects; these give us variables. Some variables are built-in:

pi
## [1] 3.141593

Variables can be arguments to functions or operators, just like constants:

pi * 10
## [1] 31.41593
cos(pi)
## [1] -1

We create variables with the assignment operator, <- or =

approx_pi <- 22/7
approx_pi
## [1] 3.142857
diameter <- 10
approx_pi * diameter
## [1] 31.42857

The assignment operator also changes values:

circumference <- approx_pi * diameter
circumference
## [1] 31.42857
circumference <- 30
circumference
## [1] 30
• The code you write will be made of variables, with descriptive names
• Easier to design, easier to debug, easier to improve, and easier for others to read
• Avoid “magic constants”; instead use named variables
• Named variables are a first step towards abstraction
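To make the point about magic constants concrete, here is a small sketch. The variable names and values are invented for illustration.

```r
# With magic constants: what do 20 and 0.05 mean?
20 * (1 + 0.05)
## [1] 21

# With named variables (hypothetical names and values)
item_price <- 20
sales_tax_rate <- 0.05
total_price <- item_price * (1 + sales_tax_rate)
total_price
## [1] 21
```

Both versions compute the same number, but only the second says what the calculation is about, and changing the tax rate means editing one clearly labeled line.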

# R workspace

What variables have you defined?

ls()
## [1] "approx_pi"     "circumference" "diameter"

Getting rid of variables:

rm("circumference")
ls()
## [1] "approx_pi" "diameter"
rm(list=ls()) # Be warned! This erases everything
ls()
## character(0)

# Part II

Data structures

# First data structure: vectors

• A data structure is a grouping of related data values into an object
• A vector is a sequence of values, all of the same type
x <- c(7, 8, 10, 45)
x
## [1]  7  8 10 45
is.vector(x)
## [1] TRUE
• The c() function returns a vector containing all its arguments in specified order
• 1:5 is shorthand for c(1,2,3,4,5), and so on
• x[1] would be the first element, x[4] the fourth element, and x[-4] is a vector containing all but the fourth element
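The indexing rules in the bullets above can be tried on any vector. The sketch below uses a made-up vector called `scores` so that it does not disturb x.

```r
scores <- c(90, 85, 70, 100, 65)   # hypothetical data
scores[1]      # first element
## [1] 90
scores[-4]     # everything except the fourth element
## [1] 90 85 70 65
scores[2:4]    # 2:4 selects positions 2, 3, and 4
## [1]  85  70 100
```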

vector(length = n) returns an empty vector of length n; helpful for filling things up later

weekly_hours <- vector(length = 5)
weekly_hours
## [1] FALSE FALSE FALSE FALSE FALSE
weekly_hours[5] <- 8
weekly_hours
## [1] 0 0 0 0 8

# Vector arithmetic

Arithmetic operators apply to vectors in a “componentwise” fashion

y <- c(-7, -8, -10, -45)
x + y
## [1] 0 0 0 0
x * y
## [1]   -49   -64  -100 -2025

# Recycling

Recycling repeats the elements of the shorter vector when it is combined with a longer one

x + c(-7,-8)
## [1]  0  0  3 37
x^c(1,0,-1,0.5)
## [1] 7.000000 1.000000 0.100000 6.708204

Single numbers are vectors of length 1 for purposes of recycling:

2 * x
## [1] 14 16 20 90
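One case worth knowing about: when the longer length is not a multiple of the shorter one, R still recycles but issues a warning. The sketch below uses invented vector names and wraps the sum in suppressWarnings() only so the block runs quietly.

```r
long_vec <- c(7, 8, 10, 45)
short_vec <- c(1, 2, 3)     # length 3 does not divide length 4

# short_vec is recycled as (1, 2, 3, 1); without suppressWarnings(), R warns:
# "longer object length is not a multiple of shorter object length"
result <- suppressWarnings(long_vec + short_vec)
result
## [1]  8 10 13 46
```

Such a warning usually signals a length mismatch that was not intended.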

Can do componentwise comparisons with vectors:

x > 9
## [1] FALSE FALSE  TRUE  TRUE

Logical operators also work elementwise:

(x > 9) & (x < 20)
## [1] FALSE FALSE  TRUE FALSE

To compare whole vectors, best to use identical() or all.equal():

x == -y
## [1] TRUE TRUE TRUE TRUE
identical(x, -y)
## [1] TRUE
identical(c(0.5-0.3,0.3-0.1), c(0.3-0.1,0.5-0.3))
## [1] FALSE
all.equal(c(0.5-0.3,0.3-0.1), c(0.3-0.1,0.5-0.3))
## [1] TRUE

Note: these functions are slightly different; we’ll see more later

# Functions on vectors

Many functions can take vectors as arguments:

• mean(), median(), sd(), var(), max(), min(), length(), and sum() return single numbers
• sort() returns a new vector
• hist() takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plot
• ecdf() similarly produces an empirical cumulative distribution function object
• summary() gives a five-number summary (plus the mean) of numerical vectors
• any() and all() are useful on Boolean vectors
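A few of these functions in action, as a sketch. The vector x is redefined here with the same values used earlier in these notes so that the block runs on its own.

```r
x <- c(7, 8, 10, 45)
mean(x)
## [1] 17.5
median(x)
## [1] 9
sort(x, decreasing = TRUE)   # returns a new vector; x itself is unchanged
## [1] 45 10  8  7
any(x > 40)                  # is at least one element above 40?
## [1] TRUE
all(x > 40)                  # are all elements above 40?
## [1] FALSE
```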

# Indexing vectors

Vector of indices:

x[c(2,4)]
## [1]  8 45

Vector of negative indices:

x[c(-1,-3)]
## [1]  8 45

Boolean vector:

x[x > 9]
## [1] 10 45
y[x > 9]
## [1] -10 -45

which() gives the indices of the elements of a Boolean vector that are TRUE:

places <- which(x > 9)
places
## [1] 3 4
y[places]
## [1] -10 -45

# Named components

We can give names to elements/components of vectors, and index vectors accordingly

names(x) <- c("v1","v2","v3","fred")
names(x)
## [1] "v1"   "v2"   "v3"   "fred"
x[c("fred","v1")]
## fred   v1
##   45    7

Note: here R is printing the labels; these are not additional components of x

names() returns another vector (of characters):

names(y) <- names(x)
sort(names(x))
## [1] "fred" "v1"   "v2"   "v3"
which(names(x) == "fred")
## [1] 4

# Second data structure: matrices

A matrix is a specialization of a 2d array

z_mat <- matrix(c(40,1,60,3), nrow = 2)
z_mat
##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3
dim(z_mat)
## [1] 2 2
is.array(z_mat)
## [1] TRUE
is.matrix(z_mat)
## [1] TRUE
• dim says how many rows and columns; filled by columns by default
• Could also specify ncol for the number of columns
• To fill by rows, use byrow = TRUE
• Elementwise operations with the usual arithmetic and comparison operators (e.g., z_mat/3)
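The difference between the default column-wise fill and byrow = TRUE is easiest to see side by side. The matrix names below are illustrative.

```r
by_col <- matrix(1:6, nrow = 2)                  # filled column by column
by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
by_row <- matrix(1:6, ncol = 3, byrow = TRUE)    # filled row by row
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```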

# Indexing matrices

Can access a matrix either by pairs of indices or by the underlying vector (column-major order):

z_mat[1,2]
## [1] 60
z_mat[3]
## [1] 60

Omitting an index means “all of it”:

z_mat[c(1,2),2]
## [1] 60  3
z_mat[,2]
## [1] 60  3
z_mat[,2, drop = FALSE]
##      [,1]
## [1,]   60
## [2,]    3

# Functions on matrices

Many functions applied to an array will just boil things down to the underlying vector:

which(z_mat > 9)
## [1] 1 3

This happens unless the function is set up to handle arrays specifically

And there are several functions/operators that do preserve array structure:

y_mat <- matrix(1:4, nrow = 2)
y_mat + z_mat
##      [,1] [,2]
## [1,]   41   63
## [2,]    3    7

Others specifically act on each row or column of the array separately:

rowSums(z_mat)
## [1] 100   4
colSums(z_mat)
## [1] 41 63

# Matrix multiplication

Has its own special operator, written %*%:

six_sevens <- matrix(rep(7,6), ncol = 3)
six_sevens
##      [,1] [,2] [,3]
## [1,]    7    7    7
## [2,]    7    7    7
z_mat %*% six_sevens # [2x2] * [2x3]
##      [,1] [,2] [,3]
## [1,]  700  700  700
## [2,]   28   28   28

# Multiplying matrices and vectors

Numeric vectors can act like column or row vectors, as needed:

a_vec <- c(10,20)
z_mat %*% a_vec
##      [,1]
## [1,] 1600
## [2,]   70
a_vec %*% z_mat
##      [,1] [,2]
## [1,]  420  660

# Matrix operators

Transpose:

t(z_mat)
##      [,1] [,2]
## [1,]   40    1
## [2,]   60    3

Determinant:

det(z_mat)
## [1] 60

# Names in matrices

• We can name either rows or columns or both, with rownames() and colnames()
• These are just character vectors, and we use them just like we do names() for vectors
• Names help us understand what we’re working with
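As a brief sketch of naming rows and columns, the block below builds a separate matrix called `named_mat` (so that z_mat stays untouched) and gives it made-up labels.

```r
named_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
rownames(named_mat) <- c("r1", "r2")   # hypothetical labels
colnames(named_mat) <- c("c1", "c2")
named_mat
##    c1 c2
## r1 40 60
## r2  1  3
named_mat["r1", "c2"]                  # index by name instead of position
## [1] 60
```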

# Third data structure: lists

A list is a sequence of values, but not necessarily all of the same type

my_list <- list("exponential", 7, FALSE)
my_list
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] FALSE

Most of what you can do with vectors you can also do with lists

# Accessing pieces of lists

• Can use [ ] as with vectors
• Or use [[ ]], but only with a single index
• [[ ]] drops names and structures, [ ] does not
my_list[2]
## [[1]]
## [1] 7
my_list[[2]]
## [1] 7
my_list[[2]]^2
## [1] 49
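The practical difference is what comes back: [ ] returns a (shorter) list, while [[ ]] returns the element itself. A minimal check, using a separate list named `demo_list` for illustration:

```r
demo_list <- list("exponential", 7, FALSE)
is.list(demo_list[2])         # [ ] gives a list of length 1
## [1] TRUE
is.numeric(demo_list[[2]])    # [[ ]] gives the number stored inside
## [1] TRUE
# demo_list[2]^2 would fail, because arithmetic is not defined for a list
```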

# Expanding and contracting lists

Add to lists with c() (also works with vectors):

my_list <- c(my_list, 9)
my_list
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [1] 9

Chop off the end of a list by setting the length to something smaller (also works with vectors):

length(my_list)
## [1] 4
length(my_list) <- 3
my_list
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] FALSE

Pluck out all but one piece of a list (also works with vectors):

my_list[-2]
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] FALSE

# Names in lists

We can name some or all of the elements of a list:

names(my_list) <- c("family","mean","is.symmetric")
my_list
## $family
## [1] "exponential"
##
## $mean
## [1] 7
##
## $is.symmetric
## [1] FALSE
my_list[["family"]]
## [1] "exponential"
my_list["family"]
## $family
## [1] "exponential"

Lists have a special shortcut way of using names, with $:

my_list[["family"]]
## [1] "exponential"
my_list$family
## [1] "exponential"

Creating a list with names:

another_list <- list(family = "gaussian",
mean = 7, sd = 1, is_symmetric = TRUE)

Adding named elements:

my_list$was_estimated <- FALSE
my_list[["last_updated"]] <- "2015-09-01"

Removing a named list element, by assigning it the value NULL:

my_list$was_estimated <- NULL

# Key-value pairs

• Lists give us a natural way to store and look up data by name, rather than by position
• A really useful programming concept with many names: key-value pairs, i.e., dictionaries, or associative arrays
• If all our distributions have components named family, we can look that up by name, without caring where it is (in what position it lies) in the list
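A small sketch of lookup by name: the two lists below are hypothetical distributions that store their components in different orders, yet the same lookup works on both.

```r
# Hypothetical distributions, with components in different positions
dist_a <- list(family = "gaussian", mean = 7, sd = 1)
dist_b <- list(mean = 2, family = "exponential")

dist_a$family
## [1] "gaussian"
dist_b$family    # second position here, but the lookup is identical
## [1] "exponential"
```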

# Data frames

• The classic data table, $$n$$ rows for cases, $$p$$ columns for variables
• Lots of the really-statistical parts of R presume data frames
• Not just a matrix because columns can have different types
• Many matrix functions also work for data frames (e.g., rowSums(), summary(), apply())
a_mat <- matrix(c(35,8,10,4), nrow = 2)
colnames(a_mat) <- c("v1","v2")
a_mat
##      v1 v2
## [1,] 35 10
## [2,]  8  4
a_mat[,"v1"]  # Try a_mat$v1 and see what happens
## [1] 35  8
a_df <- data.frame(a_mat,logicals = c(TRUE,FALSE))
a_df
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
a_df$v1
## [1] 35  8
a_df[,"v1"]
## [1] 35  8
a_df[1,]
##   v1 v2 logicals
## 1 35 10     TRUE
colMeans(a_df)
##       v1       v2 logicals
##     21.5      7.0      0.5

We can add rows or columns to an array or data frame with rbind() and cbind(), but be careful about forced type conversions

rbind(a_df, list(v1 = -3,v2 = -5,logicals = TRUE))
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
## 3 -3 -5     TRUE
rbind(a_df, c(3,4,6))
##   v1 v2 logicals
## 1 35 10        1
## 2  8  4        0
## 3  3  4        6
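Forced conversions are even more drastic for matrices, since a matrix can hold only one type. In the sketch below (with invented names), binding a row of strings onto a numeric matrix turns every entry into a string.

```r
num_mat <- matrix(c(35, 8, 10, 4), nrow = 2)
typeof(num_mat)
## [1] "double"
mixed_mat <- rbind(num_mat, c("a", "b"))   # add a row of characters
typeof(mixed_mat)                          # the numbers were converted too
## [1] "character"
mixed_mat[1, 1]
## [1] "35"
```

Checking typeof() or str() after binding is a cheap way to catch this kind of silent conversion.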

Much more on data frames a bit later in the course …

# Structures of structures

So far, every list element has been a single data value. List elements can be other data structures, e.g., vectors and matrices, even other lists:

my_list2 <- list(z_mat = z_mat, my_lucky_num = 13, my_list = my_list)
my_list2
## $z_mat
##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3
##
## $my_lucky_num
## [1] 13
##
## $my_list
## $my_list$family
## [1] "exponential"
##
## $my_list$mean
## [1] 7
##
## $my_list$is.symmetric
## [1] FALSE
##
## $my_list$last_updated
## [1] "2015-09-01"

# Summary

• We write programs by composing functions to manipulate data
• The basic data types let us represent Booleans, numbers, and characters
• Data structures let us group together related values
• Vectors let us group values of the same type
• Arrays add multi-dimensional structure to vectors
• Matrices act like you’d hope they would
• Lists let us combine different types of data
• Data frames are hybrids of matrices and lists, allowing each column to have a different data type