Statistical Computing, 36-350
Friday - July 12, 2019
ggplot
Summary:
ggplot is a plotting package for the future of R computing
ggplot acts like a grammar - we are able to extend the layers iteratively
ggplot() defines global information for all following elements (relative to data and aes mappings)
aes allows you to define how to map columns of a data.frame to aesthetics of the graphic
geom_... define geometric attributes of the graphic
facets allow one to divide the graphic up conditional on a factor variable
Visualization Theory
This part of the lecture combines a blog post I wrote a long time ago (which has lots of spelling errors) with a condensed version of common theory that is sometimes taught in 315.
Why Graphics
When graphics (Maybe better to ask “when not graphics”)
~Tufte, The Visual Display of Quantitative Information, pg 20 and pg 178 (see extra reading for Hw 2)
Rules to make Good Graphics:
Human Perception
General dos & don’ts for graphics
preserve the size of effect / differences in the data
Tufte suggests optimizing the Lie Factor, ideally keeping it close to 1: \[\mathrm{Lie\ Factor} = \frac{\mathrm{size\ of\ the\ effect\ in\ the\ graph}}{\mathrm{size\ of\ the\ effect\ in\ the\ data}}\]
Above example: double scaling
Other examples: Rose Diagrams, etc
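The Lie Factor formula above can be computed directly. The sketch below is only an illustration: it assumes the "size of the effect" is measured as a relative change between two values, and the function names and the bar chart numbers are hypothetical, not taken from the lecture's example.

```r
# Relative size of an effect: the change from `first` to `second`,
# expressed as a fraction of `first` (an assumed way to measure "effect").
effect_size <- function(first, second) {
  abs(second - first) / abs(first)
}

# Lie Factor: effect shown in the graph divided by effect in the data.
lie_factor <- function(graph_first, graph_second, data_first, data_second) {
  effect_size(graph_first, graph_second) / effect_size(data_first, data_second)
}

# Hypothetical bar chart: the values 50 and 55 are drawn on an axis that
# starts at 45, so the bars are 5 and 10 units tall.
truncated_axis <- lie_factor(
  graph_first = 5, graph_second = 10,
  data_first = 50, data_second = 55
)
truncated_axis
```

In this made-up case a 10% increase in the data is drawn as a 100% increase in bar height, so the Lie Factor is about 10. Starting the axis at zero would bring it back to 1.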
“The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.” ~ Tufte, pg 77
e.g.
make sure your message comes through
Graphics should not draw the viewer’s attention away from the data. Extras get in the way.
Note: Decoration does not refer to appropriate graph labeling. Labels should always be clear, detailed, and thorough. Label key parts of the data. Add text explanations if necessary.
Data Ink should primarily present information about the data: the non-erasable, non-redundant core of a graphic
Tufte suggests (within reason) maximizing the data-ink ratio: \[\mathrm{Data\ ink \ ratio} = \frac{\mathrm{ink\ used\ to\ describe\ data}}{\mathrm{total\ ink\ in\ the\ graph}}\]
“… at least a few computer graphics only evoke the response ‘Isn’t it remarkable that the computer can be programmed to draw like that?’ instead of ‘My, what interesting data.’” (Tufte, pg 120)
Today’s graphics (including ggplot) can auto generate:
“More information is better than less information, especially when the marginal costs of handling and interpreting additional information are low, as they are for most graphics.” (Tufte, pg 168)
Maximize (within reason): \[\text{data density of a graphic } = \frac{\text{number of entries in data matrix}}{\text{area of data graphic}}\]
When a graphic has low “data density”, the reader is often left with questions about “what has been left out?” or “what if this trend was caused by other variables?”.
Visuals with a high “data density” must be designed with care, as elements like color and shape have the potential to overwhelm readers if not used correctly.
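As a rough illustration of the data density formula, the sketch below counts every cell of a data frame as one entry and measures the graphic's area in square inches. Both choices are assumptions made for the example, and `data_density` is a hypothetical helper name.

```r
# Data density = number of entries in the data matrix / area of the graphic.
# Assumes every cell of `data` is shown and that width and height are in inches.
data_density <- function(data, width, height) {
  (nrow(data) * ncol(data)) / (width * height)
}

# mtcars ships with R: 32 rows by 11 columns.
small_plot <- data_density(mtcars, width = 6, height = 4)
large_plot <- data_density(mtcars, width = 12, height = 8)

c(small = small_plot, large = large_plot)
```

Drawing the same data on a canvas with twice the width and height cuts the density to a quarter, which is one way to see why a large, sparse graphic can leave readers wondering what was left out.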
Are a great example of increasing complexity of visualizations
Human Visual Perception
To really understand Cleveland’s work, let’s go through an in-class experiment. I’m going to show you 4 sets of 4 images, and I want you to write down the size of each later object relative to the first object.
| | A | B | C | D |
|---|---|---|---|---|
| Positions | 1 | ? | ? | ? |
| Lengths | 1 | ? | ? | ? |
| Angles | 1 | ? | ? | ? |
| Areas | 1 | ? | ? | ? |
| | A | B | C | D |
|---|---|---|---|---|
| Positions | 1 | 3/4 | 1/4 | 2/4 |
| Lengths | 1 | 2/4 | 3/4 | 1/4 |
| Angles | 1 | 2/3 | 1/3 | 4/3 |
| Areas | 1 | 2/4 | 1/4 | 3/4 |
“generic comparisons” = less accuracy
Lessons:
Comparing weights of newborns: Which age group weighs the least?
Give all small multiples the same structure, usually including axis limits, to make comparisons easier and reduce cognitive load
Ensure design changes are meaningful (tied to data changes)
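A minimal sketch of small multiples with a shared structure in ggplot. It assumes the ggplot2 package is installed and uses the built-in mtcars data as a stand-in, since the newborn weight data is not part of these notes.

```r
library(ggplot2)

# mtcars ships with R; cyl (number of cylinders) is the conditioning variable.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # global data and aes mappings
  geom_point() +                             # geometric layer
  facet_wrap(~ cyl, scales = "fixed")        # one panel per cyl, same axis limits
```

Calling print(p) draws the graphic. scales = "fixed" is the default for facet_wrap() and is written out here only to make the shared axis limits explicit. Switching to scales = "free" would give each panel its own limits and make comparisons across panels harder.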
More consistent redesign, Stephen Few
Orange vs blue crab species: real graphic from a talk
(crabs dataset in MASS package)
Lessons:
Coding Style
“There are only two hard things in Computer Science: cache invalidation and naming things.”
Variable and function names should be lowercase. Use an underscore (_) to separate words within a name. Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).
Good
day_one
day_1
Bad
first_day_of_the_month
DayOne
dayone
djm1
Where possible, avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code.
Bad
T <- FALSE
c <- 10
mean <- function(x) sum(x)
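To see why reusing an existing name is confusing, the short sketch below masks mean() the same way as the last "Bad" line and compares the results. The variable names are made up for the illustration.

```r
values <- c(1, 2, 3)

before <- mean(values)        # uses base::mean, gives 2

mean <- function(x) sum(x)    # masks base::mean in the current environment
masked <- mean(values)        # silently returns the sum, 6

original <- base::mean(values)  # the real function is still reachable by namespace

rm(mean)                      # remove the masking definition again
after <- mean(values)         # back to base::mean, gives 2
```

A reader who sees mean(values) further down the script has no way to know which definition is in effect without scrolling back.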
Use <-, not =, for assignment.
Good
x <- 5
Bad
x = 5
Place spaces around all infix operators (=, +, -, <-, etc.). The same rule applies when using = in function calls. Always put a space after a comma, and never before (just like in regular English).
Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
x <- 1:10 # no spaces with ":"
if (debug) do(x)
plot(x, y)
Bad
average<-mean(feet/12+inches,na.rm=TRUE)
x <- 1 : 10
if(debug)do(x)
plot (x, y)
Extra spacing (i.e., more than one space in a row) is ok if it improves alignment of equal signs or assignments (<-).
list(
  total = a + b + c,
  mean  = (a + b + c) / n
)
Do not place spaces around code in parentheses or square brackets (unless there’s a comma, in which case see above).
Good
if (debug) do(x)
diamonds[5, ]
Bad
if ( debug ) do(x) # No spaces around debug
x[1,] # Needs a space after the comma
x[1 ,] # Space goes after comma not before
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else.
Always indent the code inside curly braces.
Good
if (y < 0 && debug) {
  message("Y is negative")
}
if (y == 0) {
  log(x)
} else {
  y ^ x
}
Bad
if (y < 0 && debug)
message("Y is negative")
if (y == 0) {
  log(x)
}
else {
  y ^ x
}
It’s ok to leave very short statements on the same line:
if (y < 0 && debug) message("Y is negative")
Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
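As one possible illustration of moving work into a separate function, the sketch below replaces an overlong line with a small helper. The helper `to_inches` and the height values are hypothetical.

```r
# Hypothetical helper: converts feet and inches to total inches.
to_inches <- function(feet, inches) {
  feet * 12 + inches
}

# Too long (kept as a comment so the 80 character limit is easy to compare):
# average_height <- mean(c(5 * 12 + 4, 5 * 12 + 11, 6 * 12 + 1, 5 * 12 + 7), na.rm = TRUE)

# Shorter lines once the conversion has a name.
heights <- to_inches(feet = c(5, 5, 6, 5), inches = c(4, 11, 1, 7))
average_height <- mean(heights, na.rm = TRUE)
```

Besides fitting the line limit, the named helper documents what the arithmetic means, which ties back to the naming advice earlier in this section.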
When indenting your code, use two spaces. Never use tabs or mix tabs and spaces.
The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:
long_function_name <- function(a = "a long argument",
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}
Commenting guidelines - Comment your code. Each line of a comment should begin with the comment symbol and a single space: #. Comments should explain the why, not the what.
Use commented lines of - and = to break up your file into easily readable chunks.
# Load data ---------------------------
# Plot data ---------------------------