Last time: `ggplot`

Summary:

ggplot is a plotting package for the future of R computing
ggplot acts like a grammar - we are able to extend the layers iteratively
ggplot() defines global information for all following elements (relative to data and aes mappings)
aes allows you to define how to map columns of a data.frame to aesthetics of the graphic
geom_... define geometric attributes of the graphic
facets allow one to divide the graphic up conditional on a factor variable
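As a quick refresher, the pieces above fit together roughly as follows. This is a minimal sketch on the built-in mtcars data; the columns chosen (wt, mpg, cyl) are only for illustration.

```r
library(ggplot2)

# ggplot() sets the global data and aesthetic mappings for all layers
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # geom_point() adds a layer of points that inherits the global mappings
  geom_point() +
  # facet_wrap() splits the graphic into one panel per value of cyl
  facet_wrap(~ cyl)

p
```

Each `+` adds one more layer or component, which is the "iterative" part of the grammar.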

Part I

Visualization Theory

This part of the lecture combines a blog post I wrote a long time ago (which has lots of spelling errors) with a compressed version of common theory that is sometimes taught in 315.

Graphics

Why Graphics

non-trivial patterns
complex data

When graphics (Maybe better to ask “when not graphics”)

summaries and tables can be useful for a “dataset of 20 numbers or less”

~Tufte, The Visual Display of Quantitative Information, pg 20 and pg 178 (see extra reading for Hw 2)

Overview

Rules to make Good Graphics:

Avoid misleading visuals
Avoid cluttered graphics (make sure your message comes through)
Leverage graphics for complex understanding

Human Perception

General dos & don’ts for graphics

1. Represent the data as truthfully as possible

1. Truthful Representation

preserve the size of effect / differences in the data
Tufte suggests keeping the Lie Factor close to 1: \[\mathrm{Lie\ Factor} = \frac{\mathrm{size\ of\ the\ effect\ in\ the\ graph}}{\mathrm{size\ of\ the\ effect\ in\ the\ data}}\]
don’t manipulate axes to trick the reader (scaling / truncation is ok if it’s pointed out)
have standard units
- $ in fixed units across time
- rates instead of totals (e.g. GDP vs GDP per cap)
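The Lie Factor can be computed by hand for any pair of values. The sketch below assumes that "size of the effect" means relative change between two values; the function names and the numbers are hypothetical, chosen only for illustration.

```r
# Relative change between two values, e.g. 10 -> 15 is a change of 0.5
relative_change <- function(from, to) {
  (to - from) / abs(from)
}

# Lie factor: effect shown in the graph divided by the effect in the data
lie_factor <- function(graph_from, graph_to, data_from, data_to) {
  relative_change(graph_from, graph_to) / relative_change(data_from, data_to)
}

# Hypothetical example: the data grow from 10 to 15 (+50%),
# but the bars drawn for them grow from 1 cm to 4 cm (+300%)
lie_factor(graph_from = 1, graph_to = 4, data_from = 10, data_to = 15)
```

A graphic whose drawn sizes are proportional to the data gives a value of 1; here the graphic exaggerates the change.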

1. Don’t abuse commonly held assumptions

Above example: double scaling

Other examples: Rose Diagrams, etc

“The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.” ~ Tufte, pg 77

1. DO avoid being misleading with labels

e.g.

tell them your range of values for axes
emphasize important (assumption-related) values
make legends and labels clear to indicate information that might condition a conclusion

2. Avoid cluttered graphics

make sure your message comes through

2. Decluttering (Data ink)

2. “Decorating” and Data-Ink

Graphics should not draw the viewer’s attention away from the data. Extras get in the way.

Note: Decoration does not refer to appropriate graph labeling. Labels should always be clear, detailed, and thorough. Label key parts of the data. Add text explanations if necessary.

Data Ink should primarily present information about the data: the non-erasable, non-redundant core of a graphic

Tufte suggests (within reason) maximizing the data-ink ratio: \[\mathrm{Data\ ink \ ratio} = \frac{\mathrm{ink\ used\ to\ describe\ data}}{\mathrm{total\ ink\ in\ the\ graph}}\]
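One way to move in this direction in ggplot is to strip ink that carries no information about the data. This is only a sketch of the idea on the built-in mtcars data, not a measurement of the ratio itself; which elements count as non-data ink is a judgment call.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# theme_minimal() drops the grey panel background;
# the extra theme() call also removes the minor grid lines
p_clean <- p +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

p_clean
```

Axis titles and tick labels are kept on purpose, since clear labeling is not decoration.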

2. Impact / Truth in today’s world

“… at least a few computer graphics only evoke the response ‘Isn’t it remarkable that the computer can be programmed to draw like that?’ instead of ‘My, what interesting data.’” (Tufte, pg 120)

Today’s graphics tools (including ggplot) can automatically generate:

an excessive number of axis markings (even where no data lies)
a temptation to create pretty graphics when a table would be better

3. Showcasing Complex ideas

“More information is better than less information, especially when the marginal costs of handling and interpreting additional information are low, as they are for most graphics.” (Tufte, pg 168)

3. Data Density

Maximize (within reason): \[\text{data density of a graphic } = \frac{\text{number of entries in data matrix}}{\text{area of data graphic}}\]

When graphics have low “data density”, the reader is often left with questions about “what has been left out?” or “what if this trend was caused by other variables?”.
Visuals with a high “data density” must be designed with care, as elements like color and shapes have the potential to overwhelm readers if not used correctly.
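The data density formula is simple arithmetic. The sketch below assumes a scatterplot of two mtcars columns drawn at a hypothetical 6 x 4 inch size; the function name and the dimensions are made up for illustration.

```r
# Data density = number of entries in the data matrix / area of the graphic
data_density <- function(data, width, height) {
  (nrow(data) * ncol(data)) / (width * height)
}

# Only the columns actually shown in the graphic count as entries
shown <- mtcars[, c("wt", "mpg")]

# 32 rows x 2 columns displayed on 6 x 4 = 24 square inches
data_density(shown, width = 6, height = 4)
```

Adding a third variable through color or facets raises the numerator without changing the area.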

3. Faceting / Conditional Plots

Facets are a great example of increasing the complexity of a visualization

Graphic Integrity (Summary)

represent the data values as close to the truth as possible (be aware of viewer assumptions)
- show data variation, not design variation
- make sure data is in standard units (e.g. real $)
- Avoid having more dimensions of the graphic than dimensions of data you have
leverage labeling to avoid distortion and confusion
use graphics to showcase complex ideas (in statistical graphics, and in this day and age in general, assume your reader can handle it)
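Putting data in standard units usually takes a line or two before plotting. The sketch below uses made-up yearly figures; the column names and the price index are hypothetical.

```r
# Hypothetical yearly figures, made up for illustration only
economy <- data.frame(
  year        = c(2000, 2010, 2020),
  gdp_nominal = c(100, 150, 210),    # in current-year dollars
  price_index = c(1.00, 1.25, 1.50), # price level relative to 2000
  population  = c(10, 11, 12)
)

# Express dollars in fixed (year-2000) units
economy$gdp_real <- economy$gdp_nominal / economy$price_index

# Report a rate (per person) rather than a total
economy$gdp_real_per_cap <- economy$gdp_real / economy$population

economy
```

Plotting gdp_real_per_cap instead of gdp_nominal keeps inflation and population growth from being read as economic growth.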

Part II

Human Visual Perception

Cleveland: Human visual perception

Cleveland and McGill (1984)

Cleveland: quantitative comparisons experiment

To really understand Cleveland’s work - let’s go through an in-class experiment. I’m going to show you 4 sets of 4 images, and I want you to write down the size of each later object relative to the first object.

	A	B	C	D
Positions	1	?	?	?
Lengths	1	?	?	?
Angles	1	?	?	?
Areas	1	?	?	?

Quantitative perceptual tasks: position, aligned

Quantitative perceptual tasks: length

Quantitative perceptual tasks: angle

Quantitative perceptual tasks: area

Quantitative perceptual tasks: answers

	A	B	C	D
Positions	1	3/4	1/4	2/4
Lengths	1	2/4	3/4	1/4
Angles	1	2/3	1/3	4/3
Areas	1	2/4	1/4	3/4

Cleveland and McGill (1984)

Cleveland, The Elements of Graphing Data

Ordering of perceptual tasks

Cleveland and McGill’s ordering

Ordering of perceptual tasks

“generic comparisons” = less accuracy

Quantitative perceptual tasks & ranking

Lessons:

Best to show quantitative variables with position or length
Bars encode length, so start bars at 0; to zoom in, use dotplots (position) instead
If possible, avoid stacked bars (not aligned); use dots or lines (aligned baselines) instead
Avoid pies, area, and volume entirely
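The first two lessons can be sketched in ggplot as follows. The data frame is hypothetical, with values that differ only slightly so that zooming in is tempting.

```r
library(ggplot2)

# Hypothetical values that differ only slightly
scores <- data.frame(
  group = c("A", "B", "C"),
  value = c(96, 98, 101)
)

# Bars encode length, so the bars are drawn from a baseline of 0
bar_plot <- ggplot(scores, aes(x = group, y = value)) +
  geom_col()

# Dots encode position, so zooming the axis in does not distort the encoding
dot_plot <- ggplot(scores, aes(x = value, y = group)) +
  geom_point() +
  coord_cartesian(xlim = c(95, 102))
```

In the bar version the differences look small because they are small relative to 0; the dot version shows the same differences on a zoomed scale without implying anything about length.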

Graphics Dos and Don’ts: More on Human Perception

Ranking: alphabetical

Ranking: informative

Consistency

Comparing weights of newborns: Which age group weighs the least?

Consistency

Give all small multiples the same structure, usually including axis limits, to make comparisons easier and reduce cognitive load
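In ggplot, facets share axis limits by default. A minimal sketch on the built-in mtcars data, contrasting the two settings of the scales argument of facet_wrap():

```r
library(ggplot2)

# scales = "fixed" (the default) gives every panel the same axis limits,
# so positions can be compared directly across panels
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, scales = "fixed")

# scales = "free" lets each panel pick its own limits, which makes
# cross-panel comparison harder
p_free <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, scales = "free")
```

Free scales are sometimes useful when panels live on very different ranges, but then the reader has to check every axis.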

Consistency

Ensure design changes are meaningful (tied to data changes)

Consistency

More consistent redesign, Stephen Few

Semantic associations

Orange vs blue crab species: real graphic from a talk
(crabs dataset in MASS package)

Ordering, consistency and semantic associations

Lessons:

Order your dots/bars meaningfully: ranked by a variable, not alphabetical
Use consistent mappings (colors and shapes, axis limits) across graphs
Avoid meaningless variety in design
Use meaningful mappings: orange vs blue crab species = orange and blue symbols
Use conventional mappings: blue = cold, red = hot
“More = more”: deeper saturation or larger size = higher value of variable
Choose and order hues sensibly; use Color Brewer
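Two of these lessons, ordering by a variable and using a meaningful color mapping, can be sketched as below. The data frame is made up for illustration (it is not the crabs dataset), and the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical measurements, made up for illustration
crab_sizes <- data.frame(
  site    = c("north", "south", "east", "west"),
  species = c("blue", "orange", "blue", "orange"),
  width   = c(36, 30, 45, 41)
)

# Order sites by the plotted value rather than alphabetically
crab_sizes$site <- reorder(factor(crab_sizes$site), crab_sizes$width)

# Meaningful mapping: each species is drawn in its own color
species_colours <- c(blue = "blue", orange = "orange")

p <- ggplot(crab_sizes, aes(x = width, y = site, colour = species)) +
  geom_point(size = 3) +
  scale_colour_manual(values = species_colours)
```

Reusing the same named color vector in every plot of a report also keeps the mapping consistent across graphs.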

Part III

Coding Style

What and Why is Good Coding

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. …The most important thing about a style guide is that it provides consistency, making code easier to write because you need to make fewer decisions”

Hadley Wickham

“The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify. The rules below were designed in collaboration with the entire R user community at Google.”

Google Style Guide

“Because there is no comprehensive official R style manual, students and package writers seem to think that there is no style whatsoever to be followed. While it may be true that “ugly code runs,” it is also 1) difficult to read and 2) frustrating to extend, and 3) tiring to debug. Code is a language, a medium of communication, and one should not feel free to ignore its customs.”

Paul E. Johnson, KU

When to have clean code

“When writing a program using iterative-enhancement, it is an excellent idea to beautify your code at the end of each enhancement, before proceeding to the next one; each enhancement should be the best that it can be, before continuing. Ultimately, this strategy will save you time compared to the strategy often used by students: ignore style until the program is completely written. This is a penny-wise, pound-foolish strategy. It is much harder to ‘finish’ a poorly-styled program, because it is harder to read and understand it; (software) engineers must learn to practice techniques that overcome human nature; this is one example.” Richard E. Pattis, CMU
- aka continuously

R Style Guides

Hadley Wickham

Google

Lesser known:

Tidyverse

Quick Walkthrough of Hadley Wickham’s Style Guide

Notation and naming

“There are only two hard things in Computer Science: cache invalidation and naming things.”

— Phil Karlton

Variable and function names should be lowercase. Use an underscore (_) to separate words within a name. Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).

Naming Examples:

Good

day_one
day_1

Bad

first_day_of_the_month
DayOne
dayone
djm1

Where possible, avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code.

Bad

T <- FALSE
c <- 10
mean <- function(x) sum(x)

Assignment

Use <-, not =, for assignment.

Good

x <- 5

Bad

x = 5

Syntax: Spacing

Place spaces around all infix operators (=, +, -, <-, etc.). The same rule applies when using = in function calls. Always put a space after a comma, and never before (just like in regular English).

Good

average <- mean(feet / 12 + inches, na.rm = TRUE)
x <- 1:10 # no spaces with ":"

if (debug) do(x)
plot(x, y)

Bad

average<-mean(feet/12+inches,na.rm=TRUE)
x <- 1 : 10
if(debug)do(x)
plot (x, y)

Extra spacing (i.e., more than one space in a row) is ok if it improves alignment of equal signs or assignments (<-).

list(
  total = a + b + c, 
  mean  = (a + b + c) / n
)

Do not place spaces around code in parentheses or square brackets (unless there’s a comma, in which case see above).

Good

if (debug) do(x)
diamonds[5, ]

Bad

if ( debug ) do(x)  # No spaces around debug
x[1,]   # Needs a space after the comma
x[1 ,]  # Space goes after comma not before

Curly braces

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else.

Always indent the code inside curly braces.

Good

if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

Bad

if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

It’s ok to leave very short statements on the same line:

if (y < 0 && debug) message("Y is negative")

Line length

Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
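As a sketch of what "encapsulate some of the work in a separate function" can look like, the example below takes a line that runs past 80 characters and moves it into a small function. The function name and the data are hypothetical.

```r
# A long one-liner that runs past 80 characters is hard to scan:
# result <- round(mean(heights_cm[!is.na(heights_cm) & heights_cm > 0]) / 2.54, 1)

# Encapsulating the work in a small, well-named function keeps lines short
average_height_inches <- function(heights_cm) {
  valid <- heights_cm[!is.na(heights_cm) & heights_cm > 0]
  round(mean(valid) / 2.54, digits = 1)
}

average_height_inches(c(170, NA, 182.5, 0, 165))
```

The intermediate name valid also documents what the filtering step is for.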

Indentation

When indenting your code, use two spaces. Never use tabs or mix tabs and spaces.

The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:

long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

Organisation

Commenting guidelines - Comment your code. Each line of a comment should begin with the comment symbol and a single space: #. Comments should explain the why, not the what.

Use commented lines of - and = to break up your file into easily readable chunks.

# Load data ---------------------------

# Plot data ---------------------------
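A short sketch of both guidelines together, section breaks plus comments that explain the why. The data and the story about the recording device are made up for illustration.

```r
# Load data ---------------------------

# Hypothetical weights in kilograms
weights <- c(3.2, 2.9, -1, 3.6)

# A "what" comment would only repeat the code: set negative values to NA
# A "why" comment adds context: the recording device writes -1 when a
# reading fails, so negative values are not real weights
weights[weights < 0] <- NA

# Summarise data ----------------------

mean_weight <- mean(weights, na.rm = TRUE)
```

The "why" comment is the one a future reader cannot reconstruct from the code alone.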

Visual Theory, Human Perception and Coding Style
