Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday 10pm, this week.
Important note: this assignment is to be completed using ggplot graphics. Please do not use base R graphic commands.
This week’s agenda: getting familiar with basic plotting tools; understanding the way geoms work; how to map data frame columns to plot aesthetics; recalling basic text manipulations.
1a. The code below creates the my_data data.frame. Describe in words how my_y is created (i.e., explain what the ifelse function is doing, etc.). Add + geom_line to graphic a and + geom_path to graphic b, then revisualize them. Describe any differences you see and explain why (hint: try typing ?geom_line and ?geom_path in the console, and print out my_x).
library(ggplot2)
library(gridExtra)
n <- 200
set.seed(0)
my_x <- runif(n, min = -2, max = 2)
my_c <- sample(c(0, 1), size = n, replace = TRUE)
my_y <- ifelse(my_c == 0, my_x^3, my_x^2) + rnorm(n)
my_data <- data.frame(x = my_x, y = my_y, c = my_c)
# Two identical scatter plots, stacked vertically by grid.arrange()
a <- ggplot(data = my_data, aes(x = x, y = y)) + geom_point()
b <- ggplot(data = my_data, aes(x = x, y = y)) + geom_point()
grid.arrange(a, b)
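As a hedged illustration on toy data (the data frame toy and its values are made up here and are not part of the lab), the sketch below shows how row order matters for the two geoms: geom_path() connects points in the order the rows appear, while geom_line() connects them in order of x.

```r
library(ggplot2)

# Toy data (hypothetical, not the lab's my_data): x is deliberately unsorted.
toy <- data.frame(x = c(3, 1, 2), y = c(9, 1, 4))

# geom_path() joins points in row order: (3, 9) -> (1, 1) -> (2, 4).
p_path <- ggplot(toy, aes(x = x, y = y)) + geom_point() + geom_path()

# geom_line() joins points in order of x: (1, 1) -> (2, 4) -> (3, 9).
p_line <- ggplot(toy, aes(x = x, y = y)) + geom_point() + geom_line()

# Sorting the rows by x first makes geom_path() trace the same curve as geom_line().
toy_sorted <- toy[order(toy$x), ]
p_path_sorted <- ggplot(toy_sorted, aes(x = x, y = y)) + geom_point() + geom_path()
```

Printing p_path and p_line side by side (for example with grid.arrange()) makes the difference visible.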
1b. Examining data conditional on certain properties / attributes is very common in statistics. In scatter plots, a common way to show different classes is by using different colors. The color argument can be used to do this. Plot x and y from my_data in a scatter plot as above, but map the c column to the color aesthetic of the plot. Does the legend make sense to you (comment on what it looks like)? Try using color = factor(c) instead. Does the legend make more sense? Look up ?factor and, in your own words, describe what a factor is in R (data type, etc.).
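For orientation only, here is a small base R sketch of how a numeric vector and its factor version differ. The vector cls is a hypothetical stand-in for a 0/1 class column, not the lab's data.

```r
# Toy class labels (hypothetical values, mirroring a 0/1 indicator column).
cls <- c(0, 1, 1, 0)

is.numeric(cls)        # TRUE: R treats cls as a continuous quantity
cls_f <- factor(cls)   # convert to a categorical variable
levels(cls_f)          # the distinct categories, stored as character labels
is.numeric(cls_f)      # FALSE: a factor is not a numeric vector
as.integer(cls_f)      # internal integer codes (1, 2), not the original 0/1
```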
1c. The xlim and ylim add-ons can be used to change the limits on the x-axis and y-axis, respectively. Each takes a vector of length 2, as in + xlim(c(-1, 0)), which sets the x limit to be from -1 to 0. Plot y versus x (with class coloring), with the x limit set to be from -1 to 1, and the y limit set to be from -5 to 5. Assign x and y labels “Trimmed x” and “Trimmed y”, respectively. After plotting the figure for yourself, update “{r}” to “{r warning = FALSE}”.
1d. Again plot y versus x, only showing points whose x values are between -1 and 1. But this time, make a new data.frame (name it my_data_trimmed) that contains only x values between -1 and 1. Now recreate the figure in 1c. without using + xlim() and + ylim(): you should now see that the y limit is (automatically) set as “tight” as possible. Hint: use logical indexing to define your new data.frame.
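The two approaches in 1c. and 1d. can be contrasted on a toy example. This is a minimal sketch under stated assumptions: the data frame toy and the names keep and toy_trimmed are hypothetical and chosen only for illustration.

```r
library(ggplot2)

# Toy data (hypothetical): some x values fall outside [-1, 1].
toy <- data.frame(x = c(-2, -0.5, 0, 0.5, 2), y = c(-8, -1, 0, 1, 8))

# Option 1: + xlim() keeps the full data frame; points outside the limits
# are dropped when the plot is drawn (ggplot2 warns about removed rows).
p_limits <- ggplot(toy, aes(x = x, y = y)) + geom_point() + xlim(c(-1, 1))

# Option 2: logical indexing. The comparison yields a TRUE/FALSE vector,
# and only the rows marked TRUE are kept.
keep <- toy$x >= -1 & toy$x <= 1
toy_trimmed <- toy[keep, ]

# With the trimmed data, the axis ranges are chosen from the remaining points.
p_trimmed <- ggplot(toy_trimmed, aes(x = x, y = y)) + geom_point()
```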
1e. The shape argument controls the point type in the display (and size controls the size of the point displayed). In the lecture examples we varied size and shape; do so below with the trimmed plot, mapping the class to size and shape.
1f. Those sizes are very different (assuming you correctly used the factor(c) notation). We can use the + scale_*_ functions to change the values for each aesthetic. Specifically, examine the documentation for scale_size_manual() and scale_shape_manual(). Please use sizes 1 and 1.25, and an open circle and an open diamond for the shapes (see https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/ for shape help, or type ?pch into your console).
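A rough sketch of how the manual scales work, using made-up data and deliberately different values from the ones requested above (the data frame toy, the column cls, and the chosen sizes and shapes are all assumptions for illustration):

```r
library(ggplot2)

# Toy data (hypothetical) with a two-level class column.
toy <- data.frame(
  x = 1:4,
  y = c(2, 1, 4, 3),
  cls = factor(c("a", "b", "a", "b"))
)

# Named vectors tie each factor level to an explicit size and shape.
# Shape codes follow ?pch: 0 is an open square, 2 is an open triangle.
p_manual <- ggplot(toy, aes(x = x, y = y, size = cls, shape = cls)) +
  geom_point() +
  scale_size_manual(values = c(a = 2, b = 3)) +
  scale_shape_manual(values = c(a = 0, b = 2))
```

Naming the elements of values is optional, but it protects against the levels being matched in an unexpected order.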
ggplot2: Ben thinks this question is very important for your future use of ggplot. Also, if you’ve already taken 315, you’ve probably read all of these; just say “I took 315” instead of answering this question (but they’re a good refresher if you still want to do them).
2a. Read this article on ggplot() from Liz Sander.
… ggplot() discussed in this article?
… ggplot().
… plot()) or ggplot() graphs? Why?
2b. Read this tutorial on ggplot().
… R, feel free to go through the other lessons on this page.
2c. Read this article on ggplot().
… ggplot2 grammar?
2d. Read this article on data visualization in R.
… ggplot() for data visualization in R?
3a. Create the following two data frames:
my_data_cube: an x column defined by sort(runif(100, -2, 2)), a y_clean column that is x^3, and a y column defined as y_clean + rnorm(100)
my_data_square: an x column defined by sort(runif(50, -2, 2)) (make it different values from the x column in my_data_cube), a y_clean column that is x^2, and a y column defined as y_clean + rnorm(50)
3b. Produce a ggplot that displays the cubic x and y points and a line defined by x and y_clean. Note that there are multiple ways to order the aesthetic mapping: demonstrate 2 options, including the most “efficient” one (the one that uses the least code).
3c. Update the previous graphic with similar visualizations relative to my_data_square. That is, use the same code you have in 3b. but also add in geom_point and geom_line relative to my_data_square. This time, make the elements related to my_data_square blue. Note that you can define data inside geoms. Relative to just the points, this figure and the figure in 1a. look pretty similar; comment on the differences.
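To illustrate the note about defining data inside geoms, here is a minimal sketch with two made-up data frames (d1 and d2 are hypothetical names, not the lab's objects):

```r
library(ggplot2)

# Two toy data frames (hypothetical) sharing the column names x and y.
d1 <- data.frame(x = c(0, 1, 2), y = c(0, 1, 4))
d2 <- data.frame(x = c(0, 1, 2), y = c(3, 2, 1))

# Layers without a data argument inherit d1 from ggplot().
# The second geom_point() overrides the data but reuses the aes(x, y) mapping.
# color = "blue" sits outside aes(), so it is a fixed setting, not a mapping.
p_two <- ggplot(d1, aes(x = x, y = y)) +
  geom_point() +
  geom_point(data = d2, color = "blue")
```

Because both data frames use the same column names, the mapping only has to be written once, in ggplot().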
3d. Because we know how y relates to x (and that the noise component has a standard deviation of 1), add to my_data_cube the columns y_upper and y_lower that contain y_clean \(\pm\) 2. Plot x and y on a scatter plot. Then examine the documentation for geom_ribbon and add a band around y_clean, between y_upper and y_lower (you’ll still have x = x). (Ask Ben if this is vague…). Please make the ribbon transparent, say .5 (hint: alpha). Statistically, given the generative process for y, how many of the 100 points would you expect to lie outside this ribbon?
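As a loose sketch of the geom_ribbon mechanics on unrelated toy data (the data frame toy and its columns center, lower, and upper are assumptions made up for this example):

```r
library(ggplot2)

# Toy data (hypothetical): a straight-line trend with a band of +/- 1 around it.
toy <- data.frame(x = seq(0, 5, by = 0.5))
toy$center <- 2 * toy$x
toy$lower <- toy$center - 1
toy$upper <- toy$center + 1

# geom_ribbon() fills the region between ymin and ymax at each x.
# alpha is set outside aes() because it is a constant, not a data mapping.
p_band <- ggplot(toy, aes(x = x)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.5) +
  geom_line(aes(y = center))
```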
3e. geom_ribbon isn’t the last geom you’ll have to use in your life. From your experience above, describe how you would go about learning how to use a new geom. Fill us in on how you’d go about checking your options for different parameters (like you did with shape and size above).
4a. Define attack_time to be the Date column of shark_data. Define a character vector attack_year to contain the first 4 characters of each entry of attack_time. Hint: hopefully you haven’t forgotten … use substr() here. Finally, convert attack_year into a numeric vector. Do a similar thing for attack_month. Then append these onto the shark_data matrix (as year and month). We will use these added columns later.
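A small base R sketch of the substr() pattern on made-up strings. The "YYYY-MM-DD" layout used here is an assumption; the actual format of the Date column should be checked before choosing character positions.

```r
# Toy date strings (hypothetical). The "YYYY-MM-DD" layout is an assumption;
# inspect the real Date column before choosing start/stop positions.
dates <- c("2018-06-25", "1999-12-01")

year_chr <- substr(dates, start = 1, stop = 4)   # characters 1 to 4
year_num <- as.numeric(year_chr)                 # character -> numeric

month_num <- as.numeric(substr(dates, start = 6, stop = 7))

# New columns can be appended to a data frame with $.
toy <- data.frame(Date = dates)
toy$year <- year_num
toy$month <- month_num
```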
4b. Characterize the distribution of the Age of shark attack victims with a histogram. Do you notice anything weird? Do you have a guess of what is happening? Use a different xlim to make the histogram look more natural. Make a second graphic with geom_histogram(aes(..., y = ..density..)). What does ..density.. do?
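One way to explore the density scaling is to inspect the computed bars directly. The sketch below uses a simulated sample (hypothetical, not the shark data) and after_stat(density), which recent ggplot2 versions offer as the newer spelling of ..density..; treat the version detail as an assumption to check against your installation.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(v = rnorm(200))  # hypothetical sample

# Default histogram: bar heights are counts.
p_count <- ggplot(toy, aes(x = v)) + geom_histogram(bins = 20)

# Density-scaled histogram: bar heights are rescaled so the total bar area is 1.
p_dens <- ggplot(toy, aes(x = v)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20)

# Check the scaling: sum of height * width over all bars.
bars <- layer_data(p_dens)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
```

layer_data() returns the values ggplot2 computed for a layer, which is a convenient way to see what a stat produced.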
4c. Add a kernel density estimate onto your scaled histogram (using geom_density). Make a second figure that changes the order of geom_density and geom_histogram. What do you see? Discuss how this relates to layering.
4d. Examine whether year has an effect on the Age of the shark attack victims with a scatter plot. (Continue to correct for age oddities.) Add a geom_smooth(method=lm) to this plot. What does this addition do?
4e. What if I believe that we might not see this trend if we condition on gender? Add color globally to the plot above. What do you see?
4f. Now change the aes mapping of color to be within geom_point. Describe what happened relative to the ggplot paradigm.
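The difference between a global and a layer-level mapping can be sketched on toy data. Everything below (the data frame toy, the grouping column g, and the plot names) is hypothetical and only meant to show the mechanics.

```r
library(ggplot2)

# Toy data (hypothetical) with a two-level grouping column g.
toy <- data.frame(
  x = rep(1:5, times = 2),
  y = c(1.1, 1.9, 3.2, 3.8, 5.0, 5.1, 3.9, 3.0, 2.2, 0.9),
  g = factor(rep(c("a", "b"), each = 5))
)

# Color mapped in ggplot(): every layer inherits it, so geom_smooth()
# fits one line per level of g.
p_global <- ggplot(toy, aes(x = x, y = y, color = g)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped only in geom_point(): geom_smooth() never sees g,
# so it fits a single line to all points.
p_local <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = g)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of separately fitted lines in the smooth layer of each plot.
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local <- length(unique(layer_data(p_local, 2)$group))
```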