Statistical Computing, 36-350

Monday - July 1, 2019

- **Independence**: otherwise, you rely on someone else giving you exactly the right tool
- **Honesty**: otherwise, you end up distorting your problem to match the tools you have
- **Clarity**: often, turning your ideas into something a machine can do refines your thinking
- **Fun**: these were the best of times (the worst of times)

- Instructor: Benjamin (Ben) LeRoy
- TA: Tudor Manole
- Not much programming knowledge assumed
- Some statistics knowledge assumed
- Focus is almost entirely on R
- Class will be cumulative, so keep up with the material and assignments!

- Monday/Friday: lecture for ~50 minutes, then lab for ~30 minutes
- Tuesday/Thursday class: lab for 80 minutes
- Wednesday class: Homework review (first 10 minutes). Lecture for ~40 minutes, then lab for ~30 minutes
- Labs due at 10pm each Tuesday and Thursday, submitted through Canvas
- Graded 40% on attendance, 60% on completion
- Homework due at 10pm on each Monday, also through Canvas
- No midterm, final take-home programming assignment

- Very hands-on class
- Most of class time is in lab, where you learn how to code by coding and asking questions

- Grading breakdown:
- Labs: 30%
- Homework: 50%
- Final project: 20%

- Canvas: course information/website, used to collect submissions, and keep track of grades
- We will make all announcements via Canvas, so make sure you enable email notifications

- Piazza group: for discussions

- Email the TA for any questions about the course (mainly, asking for extension, excusing absence in class, asking for regrade). Put “[36-350]” in the subject of the email
- You should not need to email the instructor for anything (unless the instructor asks you to)
- Questions about assignment content should be on Piazza
- Read syllabus for Piazza guidelines. It is **not** a question-answer hotline. Piazza is meant to be a collaborative forum for students to help each other out, guided by the instructor/TA.

- Each student is given 5 late days for the semester, for both homeworks and labs
- You can only use a maximum of 3 late days per assignment (to avoid problems with releasing the solution too late)
- Late days are automatically used (you do not need to email the TA beforehand)
- Each late day is counted as up to 24 hours after the assignment is due
- Assignments submitted after all 5 late days are used receive a 0
- Email the TA for extreme circumstances (read the syllabus)

- R is a programming language for statistical computing
- RStudio is an integrated development environment for R programming

- R Markdown is a markup language for combining R code with text

All 3 are free, and all 3 will be used extensively in this course

- You will need a laptop for the lab portion of each class
- If you do not have a laptop, please let the instructor/TA know. We will be able to make arrangements to lend you a laptop for the session
- As a courtesy to other students, and to help your own understanding, **you may not use your laptop while the instructor is lecturing**

- Don’t do it; refer to the syllabus if you’re unclear about anything (and if you’re still unclear, come see me)

- If you are struggling with the material, let me or Tudor know immediately. (The summer session goes by fast!)

The syllabus is on the course website; please read it (actually read it)

- Complete and submit lab by Wednesday 10pm *(note different time)*
- Questions?

*Data types, operators, variables*

Two basic types of things/objects: **data** and **functions**

- **Data**: things like 7, “seven”, \(7.000\), and \(\left[ \begin{array}{ccc} 7 & 7 & 7 \\ 7 & 7 & 7\end{array}\right]\)
- **Functions**: things like `log`, `+` (takes two arguments), `<` (two), `%%` (two), and `mean` (one)

A function is a machine which turns input objects, or **arguments**, into an output object, or a **return value** (possibly with side effects), according to a definite rule

- Programming is writing functions to transform inputs into outputs
- Good programming ensures the transformation is done easily and correctly
- Machines are made out of machines; functions are made out of functions, like \(f(a,b) = a^2 + b^2\)

The trick to good programming is to take a big transformation and **break it down** into smaller ones, and then break those down, until you come to tasks which are easy (using built-in functions)
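As a minimal sketch of this idea, the function \(f(a,b) = a^2 + b^2\) can be built from a smaller function. The names `square` and `sum_of_squares` are hypothetical, chosen only for illustration.

```r
# Small building block: squares its input (works on single numbers or vectors)
square <- function(v) {
  v^2
}

# Bigger function made out of the smaller one: f(a, b) = a^2 + b^2
sum_of_squares <- function(a, b) {
  square(a) + square(b)
}

sum_of_squares(3, 4)   # 9 + 16 = 25
```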

At base level, all data can be represented in binary format, by **bits** (i.e., TRUE/FALSE, YES/NO, 1/0). Basic data types:

- **Booleans**: Direct binary values: `TRUE` or `FALSE` in R
- **Integers**: whole numbers (positive, negative or zero), represented by a fixed-length block of bits
- **Characters**: fixed-length blocks of bits, with special coding; **strings**: sequences of characters
- **Floating point numbers**: A significand (real number) that contains the number’s digits. Negative significands represent negative numbers. *And* an exponent that says where the decimal (or binary) point is placed relative to the beginning of the significand. E.g., as in \(3 \times 10^6\) or \(1.01 \times 10^{-1}\) or \(1 \times 2^6\)
- **Missing or ill-defined values**: `NA`, `NaN`, etc.

- **Unary**: take just one argument. E.g., `-` for arithmetic negation, `!` for Boolean negation
- **Binary**: take two arguments. E.g., `+`, `-`, `*`, and `/` (though this is only a partial operator). Also, `%%` (for mod), and `^` (again partial)
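The code behind the outputs printed next is not shown; a minimal sketch of inputs consistent with them is below. The first three lines are an assumption chosen to match those outputs, and the remaining lines are extra illustrations of the operators listed above.

```r
-7        # unary minus: arithmetic negation
7 + 5     # binary addition
7 - 5     # binary subtraction

7 %% 5    # remainder after dividing 7 by 5, which is 2
7 ^ 2     # exponentiation, 49
!TRUE     # unary Boolean negation, FALSE
```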

`## [1] -7`

`## [1] 12`

`## [1] 2`

- Basic interaction with R is by typing in the **console**, i.e., **terminal**, or **command line**
- You type in commands, R gives back answers (or errors)
- Menus and other graphical interfaces are extras built on top of the console

Comparison operators are also binary operators; they take two objects, and give back a Boolean
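A sketch of comparisons that would give the six outputs printed next; the specific numbers are an assumption, since the original inputs are not shown.

```r
7 > 5     # TRUE
7 < 5     # FALSE
7 >= 7    # TRUE
7 <= 5    # FALSE
7 == 5    # FALSE: tests equality
7 != 5    # TRUE: tests inequality
```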

`## [1] TRUE`

`## [1] FALSE`

`## [1] TRUE`

`## [1] FALSE`

`## [1] FALSE`

`## [1] TRUE`

Warning: `==` is a comparison operator, `=` is not!

The basic logical operators are `&` (and) and `|` (or)
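A sketch of logical expressions consistent with the four outputs printed next (the exact inputs are an assumption). Note that `&` binds more tightly than `|`, which matters in the last two lines.

```r
(5 > 7) & (6 * 7 == 42)                             # FALSE & TRUE  -> FALSE
(5 > 7) | (6 * 7 == 42)                             # FALSE | TRUE  -> TRUE
(5 > 7) | (6 * 7 == 42) & (0 != 0)                  # FALSE | (TRUE & FALSE) -> FALSE
(5 > 7) | (6 * 7 == 42) & (0 != 0) | (9 - 8 >= 0)   # FALSE | FALSE | TRUE  -> TRUE
```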

`## [1] FALSE`

`## [1] TRUE`

`## [1] FALSE`

`## [1] TRUE`

Note: The double forms `&&` and `||` are different! We’ll see them later

- The `typeof()` function returns the data type
- `is.foo()` functions return Booleans for whether the argument is of type *foo*
- `as.foo()` (tries to) “cast” its argument to type *foo*, to translate it sensibly into such a value
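A sketch of calls in this family. The inputs are an assumption; they were picked to be consistent with the outputs printed next, whose original code is not shown.

```r
typeof(7)            # "double": plain numbers are floating point by default
is.numeric(7)        # TRUE
is.na(7)             # FALSE
is.na(7/0)           # FALSE: Inf is not a missing value
is.na(0/0)           # TRUE: NaN counts as NA
is.character(7)      # FALSE
is.character("7")    # TRUE
is.character("seven")
is.na("seven")

# Casting to a string and back keeps only about 15 significant digits,
# so the round trip is close to 5/6 but need not be exactly equal to it
as.character(5/6)
as.numeric(as.character(5/6))
6 * as.numeric(as.character(5/6))
5/6 == as.numeric(as.character(5/6))
```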

`## [1] "double"`

`## [1] TRUE`

`## [1] FALSE`

`## [1] FALSE`

`## [1] TRUE`

`## [1] FALSE`

`## [1] TRUE`

`## [1] TRUE`

`## [1] FALSE`

`## [1] "0.833333333333333"`

`## [1] 0.8333333`

`## [1] 5`

`## [1] FALSE`

We can give names to data objects; these give us **variables**. Some variables are built-in:

`## [1] 3.141593`

Variables can be arguments to functions or operators, just like constants:
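For instance, with the built-in `pi` (a sketch; these two inputs are an assumption consistent with the outputs printed next):

```r
pi * 10    # a variable as an argument to an operator
cos(pi)    # a variable as an argument to a function
```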

`## [1] 31.41593`

`## [1] -1`

We create variables with the **assignment operator**, `<-` or `=`
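A sketch of assignment. The variable names come from the `ls()` output shown later in these notes; the values 22/7 and 10 are an assumption consistent with the outputs printed next.

```r
approx_pi <- 22/7       # create a variable with <-
diameter = 10           # = also assigns when used at the top level
approx_pi               # prints the stored value
approx_pi * diameter    # variables used in an expression
```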

`## [1] 3.142857`

`## [1] 31.42857`

The assignment operator also changes values:
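A sketch of reassignment, assuming the same values as before (22/7 and 10):

```r
circumference <- (22/7) * 10
circumference          # about 31.43
circumference <- 30    # assigning again overwrites the old value
circumference          # now 30
```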

`## [1] 31.42857`

`## [1] 30`

- The code you write will be made of variables, with descriptive names
- Easier to design, easier to debug, easier to improve, and easier for others to read
- Avoid “magic constants”; instead use named variables
- Named variables are a first step towards **abstraction**

What variables have you defined?

`## [1] "approx_pi" "circumference" "diameter"`

Getting rid of variables:

`## [1] "approx_pi" "diameter"`

`## character(0)`
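At the console, the outputs above would come from plain calls such as `ls()`, `rm("circumference")`, and `rm(list = ls())`. The sketch below does the same thing inside a scratch environment (the name `scratch` is hypothetical), an assumption made so that running it does not delete anything from your own workspace.

```r
# A scratch environment stands in for the workspace
scratch <- new.env()
assign("approx_pi", 22/7, envir = scratch)
assign("diameter", 10, envir = scratch)
assign("circumference", (22/7) * 10, envir = scratch)

defined <- ls(envir = scratch)                     # names of everything defined
rm("circumference", envir = scratch)               # remove one variable by name
after_rm <- ls(envir = scratch)
rm(list = ls(envir = scratch), envir = scratch)    # remove everything
after_clear <- ls(envir = scratch)                 # empty character vector
```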

*Data structures*

- A **data structure** is a grouping of related data values into an object
- A **vector** is a sequence of values, all of the same type

`## [1] 7 8 10 45`

`## [1] TRUE`

- The `c()` function returns a vector containing all its arguments in specified order
- `1:5` is shorthand for `c(1,2,3,4,5)`, and so on
- `x[1]` would be the first element, `x[4]` the fourth element, and `x[-4]` is a vector containing *all but* the fourth element
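A short sketch of these three points, using the vector `x` whose values appear in the output above:

```r
x <- c(7, 8, 10, 45)   # build a vector from its arguments, in order
is.vector(x)           # TRUE

1:5                    # same values as c(1, 2, 3, 4, 5)

x[1]                   # first element: 7
x[4]                   # fourth element: 45
x[-4]                  # everything except the fourth element: 7 8 10
```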

`vector(length = n)` returns an empty vector of length *n*; helpful for filling things up later
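A sketch consistent with the two outputs printed next. The name `weekly_hours` is hypothetical; the point is that the "empty" vector starts out as all `FALSE` and is converted to numbers once a number is stored in it.

```r
weekly_hours <- vector(length = 5)   # default mode is logical, so five FALSE values
weekly_hours
weekly_hours[5] <- 8                 # storing a number converts FALSE to 0
weekly_hours                         # 0 0 0 0 8
```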

`## [1] FALSE FALSE FALSE FALSE FALSE`

`## [1] 0 0 0 0 8`

Arithmetic operators apply to vectors in a “componentwise” fashion
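A sketch consistent with the outputs printed next, assuming `y` holds the negatives of `x`:

```r
x <- c(7, 8, 10, 45)
y <- c(-7, -8, -10, -45)
x + y    # each element of x plus the matching element of y: all zeros
x * y    # elementwise products: -49 -64 -100 -2025
```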

`## [1] 0 0 0 0`

`## [1] -49 -64 -100 -2025`

**Recycling** repeats the elements of the shorter vector when it is combined with a longer one
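A sketch consistent with the outputs printed next (the inputs are an assumption). In the first line the two-element vector is reused as `-7, -8, -7, -8`.

```r
x <- c(7, 8, 10, 45)
x + c(-7, -8)          # 7-7, 8-8, 10-7, 45-8  ->  0 0 3 37
x^c(1, 0, -1, 0.5)     # 7^1, 8^0, 10^-1, sqrt(45)
```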

`## [1] 0 0 3 37`

`## [1] 7.000000 1.000000 0.100000 6.708204`

Single numbers are vectors of length 1 for purposes of recycling:

`## [1] 14 16 20 90`

Can do componentwise comparisons with vectors:

`## [1] FALSE FALSE TRUE TRUE`

Logical operators also work elementwise:
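A sketch of both elementwise comparison and elementwise logic, with thresholds assumed so as to match the printed outputs:

```r
x <- c(7, 8, 10, 45)
x > 9                  # FALSE FALSE TRUE TRUE
(x > 9) & (x < 20)     # FALSE FALSE TRUE FALSE
```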

`## [1] FALSE FALSE TRUE FALSE`

To compare whole vectors, best to use `identical()` or `all.equal()`:
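A sketch consistent with the four outputs printed next (the inputs are an assumption). `identical()` demands exact equality, while `all.equal()` allows for tiny floating point differences.

```r
x <- c(7, 8, 10, 45)
y <- c(-7, -8, -10, -45)

x == -y              # elementwise: one TRUE per element
identical(x, -y)     # one answer for the whole vector: TRUE

# 0.5 - 0.3 and 0.3 - 0.1 are both "0.2" on paper, but not bit-for-bit equal
identical(c(0.5 - 0.3, 0.3 - 0.1), c(0.3 - 0.1, 0.5 - 0.3))   # FALSE
all.equal(c(0.5 - 0.3, 0.3 - 0.1), c(0.3 - 0.1, 0.5 - 0.3))   # TRUE
```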

`## [1] TRUE TRUE TRUE TRUE`

`## [1] TRUE`

`## [1] FALSE`

`## [1] TRUE`

Note: these functions are slightly different; we’ll see more later

Many functions can take vectors as arguments:

- `mean()`, `median()`, `sd()`, `var()`, `max()`, `min()`, `length()`, and `sum()` return single numbers
- `sort()` returns a new vector
- `hist()` takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plot
- `ecdf()` similarly produces an empirical cumulative-distribution-function object
- `summary()` gives a five-number summary (plus the mean) of numerical vectors
- `any()` and `all()` are useful on Boolean vectors
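A brief sketch of a few of these on the vector `x` used earlier (plotting functions are left out):

```r
x <- c(7, 8, 10, 45)

mean(x)                       # 17.5
median(x)                     # 9
sum(x)                        # 70
length(x)                     # 4
sort(x, decreasing = TRUE)    # a new vector: 45 10 8 7
summary(x)                    # min, quartiles, median, mean, max

any(x > 40)                   # TRUE: at least one element exceeds 40
all(x > 40)                   # FALSE: not every element does
```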

Vector of indices:

`## [1] 8 45`

Vector of negative indices:

`## [1] 8 45`

Boolean vector:

`## [1] 10 45`

`## [1] -10 -45`

`which()` gives the indices of the elements of a Boolean vector that are `TRUE`:

`## [1] 3 4`

`## [1] -10 -45`
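The indexing outputs above are consistent with the following sketch. The inputs are an assumption, and `places` is a hypothetical variable name.

```r
x <- c(7, 8, 10, 45)
y <- c(-7, -8, -10, -45)

x[c(2, 4)]       # vector of indices: keep elements 2 and 4
x[c(-1, -3)]     # negative indices: drop elements 1 and 3
x[x > 9]         # Boolean vector: keep elements where the test is TRUE
y[x > 9]         # a test on x can pick out elements of y

places <- which(x > 9)   # positions where the test is TRUE: 3 4
y[places]                # same result as y[x > 9]
```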

We can give names to elements/components of vectors, and index vectors accordingly

`## [1] "v1" "v2" "v3" "fred"`

```
## fred   v1
##   45    7
```

Note: here R is printing the labels, these are not additional components of `x`

`names()` returns another vector (of characters):

`## [1] "fred" "v1" "v2" "v3"`

`## [1] 4`
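A sketch of named-vector code consistent with the outputs above (the inputs are an assumption):

```r
x <- c(7, 8, 10, 45)
names(x) <- c("v1", "v2", "v3", "fred")   # attach a name to each element

names(x)                     # the names, as a character vector
x[c("fred", "v1")]           # index by name: 45 and 7, printed with labels
sort(names(x))               # names are ordinary strings, so they can be sorted
which(names(x) == "fred")    # position of the element named "fred": 4
```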

A **matrix** is a specialization of a 2d array

```
##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3
```

`## [1] 2 2`

`## [1] TRUE`

`## [1] TRUE`

- `dim` says how many rows and columns; filled by columns by default
- Could also specify `ncol` for the number of columns
- To fill by rows, use `byrow = TRUE`
- Elementwise operations with the usual arithmetic and comparison operators (e.g., `z_mat/3`)
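A sketch of how `z_mat` could have been built, assuming a call to `matrix()` (the original code is not shown). `by_row` is a hypothetical name for the same matrix filled row by row.

```r
# Filled by columns: first column is 40, 1; second column is 60, 3
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
dim(z_mat)          # 2 2
is.array(z_mat)     # TRUE
is.matrix(z_mat)    # TRUE

# Same matrix, giving the values row by row instead
by_row <- matrix(c(40, 60, 1, 3), ncol = 2, byrow = TRUE)
identical(z_mat, by_row)    # TRUE

z_mat / 3           # elementwise division
z_mat > 9           # elementwise comparison
```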

Can access a matrix either by pairs of indices or by the underlying vector (column-major order):

`## [1] 60`

`## [1] 60`

Omitting an index means “all of it”:

`## [1] 60 3`

`## [1] 60 3`

```
##      [,1]
## [1,]   60
## [2,]    3
```
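The indexing outputs above are consistent with this sketch; the use of `drop = FALSE` for the last one is an assumption based on the printed single-column matrix.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)

z_mat[1, 2]                # row 1, column 2: 60
z_mat[3]                   # third element of the underlying vector 40, 1, 60, 3: 60

z_mat[c(1, 2), 2]          # both rows of column 2, as a plain vector
z_mat[, 2]                 # omitting the row index means "all rows"
z_mat[, 2, drop = FALSE]   # keep the result as a 2 x 1 matrix
```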

Many functions applied to an array will just boil things down to the underlying vector:

`## [1] 1 3`

This happens unless the function is set up to handle arrays specifically

And there are several functions/operators that *do* preserve array structure:

```
##      [,1] [,2]
## [1,]   41   63
## [2,]    3    7
```

Others specifically act on each row or column of the array separately:

`## [1] 100 4`

`## [1] 41 63`
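A sketch consistent with the outputs above. The second matrix in the addition is not shown in the source; `matrix(1:4, nrow = 2)` is an assumption that happens to reproduce the printed sum.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)

which(z_mat > 9)                 # treats z_mat as the vector 40, 1, 60, 3: positions 1 and 3

z_mat + matrix(1:4, nrow = 2)    # addition keeps the 2 x 2 structure

rowSums(z_mat)                   # one sum per row: 100 4
colSums(z_mat)                   # one sum per column: 41 63
```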

Matrix multiplication has its own special operator, written `%*%`:

```
##      [,1] [,2] [,3]
## [1,]    7    7    7
## [2,]    7    7    7
```

```
##      [,1] [,2] [,3]
## [1,]  700  700  700
## [2,]   28   28   28
```

Numeric vectors can act like column or row vectors, as needed:

```
##      [,1]
## [1,] 1600
## [2,]   70
```

```
##      [,1] [,2]
## [1,]  420  660
```
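The products printed above are consistent with the sketch below. The names `six_sevens` and `a_vec`, and the values in `a_vec`, are assumptions chosen to reproduce those results.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
six_sevens <- matrix(rep(7, 6), ncol = 3)   # 2 x 3 matrix of sevens

z_mat %*% six_sevens    # (2 x 2) times (2 x 3) gives 2 x 3; rows of 700 and 28

a_vec <- c(10, 20)
z_mat %*% a_vec         # a_vec treated as a column: 1600, 70
a_vec %*% z_mat         # a_vec treated as a row: 420, 660
```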

Transpose:

```
##      [,1] [,2]
## [1,]   40    1
## [2,]   60    3
```

Determinant:

`## [1] 60`
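Both results above come from built-in functions; a minimal sketch:

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)

t(z_mat)      # transpose: rows become columns
det(z_mat)    # determinant: 40 * 3 - 60 * 1 = 60
```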

- We can name either rows or columns or both, with `rownames()` and `colnames()`
- These are just character vectors, and we use them just like we do `names()` for vectors
- Names help us understand what we’re working with
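A minimal sketch of naming and then indexing by name. The row and column labels here are hypothetical, picked only for illustration.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
rownames(z_mat) <- c("r1", "r2")
colnames(z_mat) <- c("a", "b")

z_mat                # printed with the labels instead of [1,] and [,1]
z_mat["r1", "b"]     # index by name: 60
colnames(z_mat)      # the names are an ordinary character vector
```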

A **list** is a sequence of values, but not necessarily all of the same type

```
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] FALSE
```

Most of what you can do with vectors you can also do with lists

- Can use `[ ]` as with vectors
- Or use `[[ ]]`, but only with a single index
- `[[ ]]` drops names and structures, `[ ]` does not

```
## [[1]]
## [1] 7
```

`## [1] 7`

`## [1] 49`
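A sketch consistent with the list outputs above. The name `my_list` is taken from the nested-list output near the end of these notes; the exact calls are an assumption.

```r
my_list <- list("exponential", 7, FALSE)   # a string, a number, and a Boolean

my_list[2]        # single brackets: still a list, of length 1
my_list[[2]]      # double brackets: the number 7 itself
my_list[[2]]^2    # so arithmetic works: 49
```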

Add to lists with `c()` (also works with vectors):

```
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [1] 9
```

Chop off the end of a list by setting the length to something smaller (also works with vectors):

`## [1] 4`

```
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] FALSE
```

Pluck out all but one piece of a list (also works with vectors):

```
## [[1]]
## [1] "exponential"
##
## [[2]]
## [1] FALSE
```
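The three list manipulations above (adding, chopping, plucking) can be sketched as follows. The value 9 matches the printed output; `longer` and `n_after_add` are hypothetical names.

```r
my_list <- list("exponential", 7, FALSE)

longer <- c(my_list, 9)        # add a fourth element
n_after_add <- length(longer)  # 4

length(longer) <- 3            # chop off the end by shrinking the length
longer                         # back to the original three elements

longer[-2]                     # all but the second element
```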

We can name some or all of the elements of a list:

```
## $family
## [1] "exponential"
##
## $mean
## [1] 7
##
## $is.symmetric
## [1] FALSE
```

`## [1] "exponential"`

```
## $family
## [1] "exponential"
```

Lists have a special shortcut way of using names, with `$`:

`## [1] "exponential"`

`## [1] "exponential"`
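A sketch of naming list elements and the three ways of looking one up. The element names match the printed output above; the calls themselves are an assumption.

```r
my_list <- list("exponential", 7, FALSE)
names(my_list) <- c("family", "mean", "is.symmetric")

my_list[["family"]]    # the string itself
my_list["family"]      # a one-element list that keeps its name
my_list$family         # shortcut for my_list[["family"]]
```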

Creating a list with names:

Adding named elements:

Removing a named list element, by assigning it the value `NULL`:
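The code for these three steps is not shown in the source; one possible sketch is below. The element `last_updated` and its value appear in the nested-list output later in these notes, while `was_estimated` is a hypothetical name used only to show removal.

```r
# Creating a list with names
my_list <- list(family = "exponential", mean = 7, is.symmetric = FALSE)

# Adding named elements (either syntax works)
my_list$was_estimated <- FALSE
my_list[["last_updated"]] <- "2015-09-01"

# Removing a named element by assigning NULL
my_list$was_estimated <- NULL

names(my_list)    # "family" "mean" "is.symmetric" "last_updated"
```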

- Lists give us a natural way to store and look up data by *name*, rather than by *position*
- A really useful programming concept with many names: **key-value pairs**, i.e., **dictionaries**, or **associative arrays**
- If all our distributions have components named `family`, we can look that up by name, without caring where it is (in what position it lies) in the list

- The classic data table, \(n\) rows for cases, \(p\) columns for variables
- Lots of the really-statistical parts of R presume data frames

- Not just a matrix because *columns can have different types*
- Many matrix functions also work for data frames (e.g., `rowSums()`, `summary()`, `apply()`)
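A sketch consistent with the outputs printed next. The names `a_mat` and `a_df` are hypothetical; the column names and values are taken from the printed tables.

```r
a_mat <- matrix(c(35, 8, 10, 4), nrow = 2)
colnames(a_mat) <- c("v1", "v2")
a_mat
a_mat[, "v1"]          # a matrix column, looked up by name

# Add a Boolean column; a matrix could not hold this without conversion
a_df <- data.frame(a_mat, logicals = c(TRUE, FALSE))
a_df
a_df$v1                # columns can be accessed like list elements
a_df[, "v1"]           # or like matrix columns
a_df[1, ]              # first row, still a data frame
colMeans(a_df)         # TRUE/FALSE count as 1/0 in the mean
```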

```
##      v1 v2
## [1,] 35 10
## [2,]  8  4
```

`## [1] 35 8`

```
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
```

`## [1] 35 8`

`## [1] 35 8`

```
##   v1 v2 logicals
## 1 35 10     TRUE
```

```
##       v1       v2 logicals
##     21.5      7.0      0.5
```

We can add rows or columns to an array or data frame with `rbind()` and `cbind()`, but be careful about forced type conversions
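A sketch consistent with the two tables printed next (the calls are an assumption; `kept_types` and `forced_types` are hypothetical names). Adding a row as a list keeps each column's type, while adding a plain numeric vector forces the Boolean column to become numeric.

```r
a_df <- data.frame(v1 = c(35, 8), v2 = c(10, 4), logicals = c(TRUE, FALSE))

# New row given as a list: each value can keep its own type
kept_types <- rbind(a_df, list(v1 = -3, v2 = -5, logicals = TRUE))
kept_types

# New row given as a numeric vector: 6 is not a Boolean, so the
# whole logicals column is converted to numbers
forced_types <- rbind(a_df, c(3, 4, 6))
forced_types
```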

```
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
## 3 -3 -5     TRUE
```

```
##   v1 v2 logicals
## 1 35 10        1
## 2  8  4        0
## 3  3  4        6
```

Much more on data frames a bit later in the course …

So far, every list element has been a single data value. List elements can be other data structures, e.g., vectors and matrices, even other lists:
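A sketch of a nested list consistent with the output printed next. The inner names come from that output; the outer name `rich_list` is hypothetical.

```r
z_mat <- matrix(c(40, 1, 60, 3), nrow = 2)
my_list <- list(family = "exponential", mean = 7, is.symmetric = FALSE,
                last_updated = "2015-09-01")

# A list holding a matrix, a number, and another list
rich_list <- list(z_mat = z_mat, my_lucky_num = 13, my_list = my_list)
rich_list

rich_list$my_list$family    # chain $ to reach into the inner list
rich_list$z_mat[1, 2]       # index the matrix stored inside the list
```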

```
## $z_mat
##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3
##
## $my_lucky_num
## [1] 13
##
## $my_list
## $my_list$family
## [1] "exponential"
##
## $my_list$mean
## [1] 7
##
## $my_list$is.symmetric
## [1] FALSE
##
## $my_list$last_updated
## [1] "2015-09-01"
```

- We write programs by composing functions to manipulate data
- The basic data types let us represent Booleans, numbers, and characters
- Data structures let us group together related values
- Vectors let us group values of the same type
- Arrays add multi-dimensional structure to vectors
- Matrices act like you’d hope they would
- Lists let us combine different types of data
- Data frames are hybrids of matrices and lists, allowing each column to have a different data type