Chapter 5 Data visualization with ggplot2
5.1 Introduction to data visualization principles
Data visualization is a fundamental aspect of data analysis and interpretation. It involves the graphical representation of data to communicate insights, patterns, and trends effectively. In data science and analytics, creating informative and visually appealing visualizations is essential to convey complex information to both technical and non-technical audiences. This section outlines some key principles of data visualization to help you create compelling and meaningful visual representations of your data.
Know your audience: Understand who will be viewing your visualizations. Are they experts in the field, managers, or the general public? Tailor your visualizations to match their level of expertise and interests.
Choose the right visualization: Different types of data call for different types of visualizations. Bar charts, line graphs, scatter plots, histograms, and more each have their specific use cases. Select the one that effectively represents the relationships and patterns in your data.
Simplify complexity: Avoid cluttering your visualizations with unnecessary elements. Keep the design clean and focus on conveying the most important information. Use labels, legends, and colors judiciously to prevent overwhelming the audience.
Use appropriate scaling: Proper scaling of axes is crucial. Misleading visualizations can arise from incorrectly scaled axes that distort the perception of trends and relationships.
Color and contrast: Color can enhance visualizations, but it should be used carefully. Ensure sufficient contrast, especially for text and data points, and be mindful of colorblind accessibility. Here is a website I find helpful for creating color palettes: https://www.learnui.design/tools/data-color-picker.html
Annotations and titles: Clearly label your visualizations with informative titles, axis labels, and data source references. Annotations such as data points, trend lines, and explanatory notes can provide context and clarity.
Avoid misrepresentation: Be honest and accurate in representing data. Distorted visualizations can lead to misinterpretations and incorrect conclusions.
In the upcoming sections, we will explore various types of visualizations that can be created using R, along with practical examples and code snippets to help you get started on your data visualization journey. By following these principles, you’ll be able to create impactful visualizations that enable meaningful data-driven insights and effective communication.
5.2 The grammar of ggplot2
At the core of every ggplot2 visualization is the ggplot() function. It initializes a plot object and serves as the canvas on which you build your visualization layers.
The function takes the data you want to visualize and defines the basic aesthetics using the aes() function (short for aesthetic). The aes() function is used to define how variables from your dataset are mapped to different visual properties of a plot. These properties include the x-axis, y-axis, color, fill, shape, size, and more. The aes() function forms the foundation of the ggplot2 grammar, allowing you to create intricate visualizations by specifying how your data attributes relate to the plot’s aesthetics.
Here’s an explanation of some commonly used aesthetics.
x and y define the variables that determine the position of data points on the x-axis and y-axis, respectively.
color specifies the variable or value that determines the color of data points, lines, or borders. It’s often used to distinguish different groups or categories.
fill is similar to the color aesthetic but is specifically used for filling areas, such as in bar plots, histograms, and box plots.
shape determines the shape of individual data points. It’s useful when you want to differentiate data points within a group.
size specifies the size of data points or other elements based on a variable. It’s often used to highlight the importance or quantity of a certain aspect.
# Template only: replace data and variable1 to variable6 with your own data frame and column names
data %>%
  ggplot(aes(x = variable1,
             y = variable2,
             color = variable3,
             fill = variable4,
             shape = variable5,
             size = variable6))
5.2.1 Geoms and layers
Geometric objects, or geoms, determine the visual representation of your data points. They correspond to the type of plot you want to create and, by default, will inherit the aesthetic properties defined in the aes() function. You can add geoms using functions like geom_point(), geom_line(), geom_bar(), etc.
Geoms, as well as other objects, can be added to a ggplot as a layer using the + operator. Each + adds a new layer to your plot.
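To make layering concrete, here is a minimal sketch that stacks two geoms on one plot. The data frame layer_demo and its columns are made up for illustration and are not part of the chapter's data sets.

```r
# Load packages
library(ggplot2)

# Hypothetical data, invented for this illustration
layer_demo <- data.frame(time  = c(1, 2, 3, 4, 5, 6),
                         score = c(2, 3, 5, 4, 6, 7))

# Start with the canvas, then add one layer per `+`
p_layers <- ggplot(layer_demo, aes(x = time, y = score)) +
  geom_line() +   # Layer 1: a line connecting the observations
  geom_point()    # Layer 2: points drawn on top of the line

p_layers
```

Both geoms inherit the x and y mappings from ggplot(), so neither needs its own aes() call.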
5.3 Basic plots: Scatter plots, bar plots, and line plots
5.3.1 Scatter plots
Scatter plots are typically used to visualize the relationship between two continuous variables. Here is an example of a basic scatter plot.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a scatter plot
data %>%
  ggplot(aes(x = score1,
             y = score2)) +
  geom_point()
5.3.2 Bar plots
Bar plots are a good way to display the distribution of a variable. They can also be used to represent average scores of independent groups.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a bar plot with the frequency distribution of score1
data %>%
  ggplot(aes(x = score1)) +
  geom_bar()
# Create a bar plot with the average of score1 by condition
data %>%
  group_by(condition) %>% # Group by condition
  summarize(score1_mean = mean(score1)) %>% # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition,
             y = score1_mean)) +
  geom_bar(stat = "identity") # Use the y values as bar heights instead of counting rows
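As a side note, ggplot2 also provides geom_col(), which takes bar heights directly from the y variable and so behaves like geom_bar(stat = "identity"). The sketch below assumes the group means have already been computed; the data frame means_demo is typed in by hand for illustration.

```r
# Load packages
library(ggplot2)

# Hypothetical pre-computed group means, typed in by hand for this illustration
means_demo <- data.frame(condition   = c("high", "medium", "low"),
                         score1_mean = c(2, 4, 3.75))

# geom_col() uses the y values as bar heights, so no stat argument is needed
p_col <- ggplot(means_demo, aes(x = condition, y = score1_mean)) +
  geom_col()

p_col
```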
5.3.3 Line plots
Line plots are typically used to visualize differences in paired or dependent data, such as how variables change across time.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1))
# Create a line plot with score1 over time
data %>%
  ggplot(aes(x = time,
             y = score1)) +
  geom_line()
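When the data contain repeated measurements from several units, each unit usually needs its own line. The following is a minimal sketch under that assumption; the data frame long_demo and its participant labels are invented for illustration. The group aesthetic tells geom_line() which observations belong together.

```r
# Load packages
library(ggplot2)

# Hypothetical repeated-measures data: 3 participants, 4 time points each
long_demo <- data.frame(id    = rep(c("p1", "p2", "p3"), each = 4),
                        time  = rep(1:4, times = 3),
                        score = c(1, 2, 4, 5,
                                  2, 2, 3, 6,
                                  3, 4, 4, 7))

# Draw one line per participant and color the lines by participant
p_lines <- ggplot(long_demo, aes(x = time, y = score,
                                 group = id, color = id)) +
  geom_line() +
  geom_point()

p_lines
```

Without group (or a grouping aesthetic such as color), geom_line() would connect all twelve observations into a single zigzag line.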
5.4 Customizing plots: Axes, colors, and themes
5.4.1 Axes
Axes are highly customizable in ggplot2. There are many ways of customizing them, but the approach I’ve found most effective is to use the scale_y_continuous() and scale_x_continuous() functions when x or y is continuous, and the scale_y_discrete() and scale_x_discrete() functions when x or y is categorical.
Axis functions have their own grammar that is important to understand. Here is an explanation of some commonly used arguments as they apply to scale_y_continuous() and scale_x_continuous().
name sets the title of the axis.
limits sets the minimum and maximum values of the axis.
breaks sets the data values at which tick marks are placed on the axis.
labels assigns text or numbers to the data values defined by breaks, to be presented on the axis.
Here is an explanation of the same arguments as they apply to scale_x_discrete() and scale_y_discrete().
name sets the title of the axis.
limits defines the data values shown on the axis and their order.
labels assigns text or numbers to the data values defined by limits, to be presented on the axis.
Here are some examples of how axes can be customized.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a scatter plot with custom x- and y-axes
data %>%
  ggplot(aes(x = score2,
             y = score1)) +
  geom_point() +
  scale_y_continuous(name = "Score 1", # Set y-axis title to "Score 1"
                     limits = c(0, 6), # Set y-axis minimum value to 0 and maximum to 6
                     breaks = c(0, 2, 4, 6), # Put tick marks at 0, 2, 4, and 6 (must match labels in length)
                     labels = c(0, 2, 4, 6)) + # Put labels on every other integer from 0 to 6
  scale_x_continuous(name = "Score 2", # Set x-axis title to "Score 2"
                     limits = c(0, 6), # Set x-axis minimum value to 0 and maximum to 6
                     breaks = c(0, 2, 4, 6), # Put tick marks at 0, 2, 4, and 6 (must match labels in length)
                     labels = c(0, 2, 4, 6)) # Put labels on every other integer from 0 to 6
# Create a bar plot with custom x- and y-axes
data %>%
  group_by(condition) %>% # Group by condition
  summarize(score1_mean = mean(score1)) %>% # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition,
             y = score1_mean)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(name = "Average of Score 1", # Set y-axis title to "Average of Score 1"
                     limits = c(0, 6), # Set y-axis minimum value to 0 and maximum to 6
                     breaks = 0:6, # Put tick marks at every integer from 0 to 6 (must match labels in length)
                     labels = 0:6) + # Put labels on every integer from 0 to 6
  scale_x_discrete(name = "Condition", # Set x-axis title to "Condition"
                   limits = c("low", "medium", "high"), # Set x-axis order to low, medium, then high
                   labels = c("Low", "Medium", "High")) # Set x-axis condition labels to "Low", "Medium", and "High"
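One behavior of limits is worth knowing when using continuous scales: values outside the limits are treated as missing rather than merely hidden. The sketch below contrasts this with coord_cartesian(), which crops the view without discarding data. The data frame zoom_demo is invented for illustration.

```r
# Load packages
library(ggplot2)

# Hypothetical data with one value (7) above the intended axis maximum of 6
zoom_demo <- data.frame(x = 1:5,
                        y = c(1, 3, 7, 4, 2))

# Option 1: limits on the scale. The point at y = 7 falls outside the limits,
# is converted to a missing value, and is not drawn (ggplot2 warns about it).
p_scale <- ggplot(zoom_demo, aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 6))

# Option 2: limits on the coordinate system. All data are kept and the
# visible region is simply cropped to the range 0 to 6.
p_coord <- ggplot(zoom_demo, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 6))
```

The difference matters most for geoms that compute summaries, such as fitted lines or box plots, because dropped values change the result rather than just the view.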
5.4.2 Colors
It is possible to change the color of your geoms using the color and fill arguments. color generally dictates the color of lines and geom borders whilst fill dictates the inner space of geoms. However, this is not always the case and I find you will sometimes have to try both to get the desired appearance.
The simplest way to change the color of your geoms is to set them all to a single color. This is done by passing the color and fill arguments to the relevant geom function and specifying the color either by name (e.g., “red”) or by hexadecimal code (e.g., “#00FFFF”). Some examples are presented below.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a scatter plot with red data points using the string "red"
data %>%
  ggplot(aes(x = score1,
             y = score2)) +
  geom_point(color = "red")
# Create a bar plot with cyan bars using the hex code "#00FFFF"
data %>%
  group_by(condition) %>% # Group by condition
  summarize(score1_mean = mean(score1)) %>% # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition,
             y = score1_mean)) +
  geom_bar(stat = "identity",
           fill = "#00FFFF")
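To illustrate how color and fill divide the work, here is a small sketch that sets both on the same geom. The data frame color_demo is invented for illustration. Note that the default point shape only responds to color; shape 21 is one of the point shapes that has both a border and an interior.

```r
# Load packages
library(ggplot2)

# Hypothetical data, invented for this illustration
color_demo <- data.frame(group = c("a", "b", "c"),
                         value = c(3, 5, 2))

# Bars: fill colors the inside of each bar, color draws its outline
p_bars <- ggplot(color_demo, aes(x = group, y = value)) +
  geom_col(fill = "#00FFFF", color = "black")

# Points: shape 21 is a circle with a border (color) and an interior (fill)
p_points <- ggplot(color_demo, aes(x = group, y = value)) +
  geom_point(shape = 21, size = 4, color = "black", fill = "red")
```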
It is also possible to vary the color of your geoms as a function of a variable by placing the color and fill arguments inside the aes() function. When this approach is taken, we can also add the scale_color_manual() and scale_fill_manual() functions to exert further control over the colors.
Mapping a variable to color or fill inside aes() adds a legend to your figure, and scale_color_manual() or scale_fill_manual() lets you customize that legend along with the colors. Many familiar arguments can be applied to these functions. Here is an explanation of them.
name sets the title of the color or fill legend.
limits defines the data values to be colored or filled and their order.
labels assigns text or numbers to the data values defined by limits, to be presented in the color or fill legend.
values defines colors and assigns them to the data values defined by limits.
Here are some examples of how you can use the fill and color arguments in the aes() function, together with scale_color_manual() and scale_fill_manual(), to change the appearance of your plots.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "high", "high", "low", "low", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a scatter plot with different colors by condition
data %>%
  ggplot(aes(x = score1,
             y = score2,
             color = condition)) +
  geom_point() +
  scale_color_manual(name = "Condition", # Set color legend title
                     limits = c("high", "low"), # Set color order to high then low
                     labels = c("High", "Low"), # Set color labels to "High" and "Low"
                     values = c("red", "blue")) # Set colors using color names
# Create a bar plot with different colors by condition
data %>%
  group_by(condition) %>% # Group by condition
  summarize(score1_mean = mean(score1)) %>% # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition,
             y = score1_mean,
             fill = condition)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(name = "Condition", # Set fill legend title
                    limits = c("high", "low"), # Set fill order to high then low
                    labels = c("High", "Low"), # Set fill labels to "High" and "Low"
                    values = c("#003f5c", "#ffa600")) # Set fill colors using hex codes
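An alternative to relying on the order of limits is to pass values as a named vector, in which case each color is matched to the data value that shares its name. The following is a minimal sketch of that approach; the data frame fill_demo is invented for illustration.

```r
# Load packages
library(ggplot2)

# Hypothetical summary data, invented for this illustration
fill_demo <- data.frame(condition   = c("high", "low"),
                        score1_mean = c(2.5, 4))

# The names in values tie each color to a data value, so the order in which
# the colors are listed does not matter
p_named <- ggplot(fill_demo, aes(x = condition, y = score1_mean,
                                 fill = condition)) +
  geom_col() +
  scale_fill_manual(name = "Condition",
                    values = c(low = "#ffa600", high = "#003f5c"))

p_named
```

Naming the colors can make the code more robust, since adding or reordering categories will not silently shift which color goes with which group.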
5.4.3 Themes
Themes allow you to control various visual aspects of your plots, such as colors, fonts, grid lines, axis labels, and more. Many of the theme functions apply numerous changes to provide a consistent and coherent look across your plots, making it easier to maintain a cohesive visual style in your data visualizations. They can be applied by adding a theme function to your plot, such as theme_classic(), theme_bw(), or theme_minimal().
Here are some examples of cohesive themes and how to apply them.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a scatter plot with classic theme
data %>%
  ggplot(aes(x = score1,
             y = score2)) +
  geom_point() +
  theme_classic()
# Create a scatter plot with black and white theme
data %>%
  ggplot(aes(x = score1,
             y = score2)) +
  geom_point() +
  theme_bw()
# Create a scatter plot with minimal theme
data %>%
  ggplot(aes(x = score1,
             y = score2)) +
  geom_point() +
  theme_minimal()
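The complete theme functions also accept arguments of their own. As a minimal sketch, assuming you want larger text throughout a figure, the base_size argument sets the base font size (11 by default) from which the other text sizes are derived. The data frame theme_demo is invented for illustration.

```r
# Load packages
library(ggplot2)

# Hypothetical data, invented for this illustration
theme_demo <- data.frame(score1 = c(1, 4, 1, 2),
                         score2 = c(1, 2, 3, 1))

# Apply the classic theme with a larger base font size
p_big <- ggplot(theme_demo, aes(x = score1, y = score2)) +
  geom_point() +
  theme_classic(base_size = 16)

p_big
```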
The theme() function allows more fine-grained control over visual elements, such as axis font and legend formatting.
# Load packages
library(tidyverse)
library(ggplot2)
# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "high", "high", "low", "low", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))
# Create a scatter plot with customized theme
data %>%
  ggplot(aes(x = score1,
             y = score2)) +
  geom_point() +
  theme(panel.background = element_blank(), # Remove grey background
        axis.line = element_line(linewidth = 1, color = "black"), # Add bold black axis line (linewidth replaces size for lines in ggplot2 3.4.0 and later)
        axis.text = element_text(size = 12, color = "black")) # Set the axis text to size 12 and black
# Create a bar plot with different colors by condition with customized theme
data %>%
  group_by(condition) %>% # Group by condition
  summarize(score1_mean = mean(score1)) %>% # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition,
             y = score1_mean,
             fill = condition)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none", # Hide figure legend
        axis.ticks = element_blank()) # Hide axis ticks
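Complete themes and theme() can be combined, and the order in which they are added matters. The sketch below, which uses an invented data frame combo_demo, adds theme() after theme_classic() so that the individual adjustments are kept.

```r
# Load packages
library(ggplot2)

# Hypothetical data, invented for this illustration
combo_demo <- data.frame(score1 = c(1, 4, 1, 2, 6, 3),
                         score2 = c(1, 2, 3, 1, 3, 3))

# Add the complete theme first, then fine-tune it with theme().
# A complete theme added after theme() would overwrite these adjustments.
p_combo <- ggplot(combo_demo, aes(x = score1, y = score2)) +
  geom_point() +
  theme_classic() +
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.ticks = element_blank())

p_combo
```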
5.5 Plotting inferential statistics: Confidence intervals and predicted outcomes
As researchers, we often want to generalize from the sample we have to some population. We rely on inferential statistics to do so and, as such, often want to represent these statistics visually, including lines of best fit, point estimates, and the precision with which they are estimated.
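As a rough sketch of what this can look like, the example below draws a line of best fit with a 95% confidence band using geom_smooth(), and group means with 95% confidence intervals using geom_errorbar(). The data frame stats_demo and the t-based interval calculation are assumptions made for illustration; your own analysis may call for a different model or interval.

```r
# Load packages
library(ggplot2)
library(dplyr)

# Hypothetical data, invented for this illustration
stats_demo <- data.frame(condition = rep(c("high", "medium", "low"), each = 4),
                         score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1),
                         score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Scatter plot with a linear line of best fit and its 95% confidence band
p_fit <- ggplot(stats_demo, aes(x = score1, y = score2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

# Compute the mean and the half-width of a t-based 95% confidence interval
# for score1 within each condition
ci_data <- stats_demo %>%
  group_by(condition) %>%
  summarize(n_obs = n(),
            score1_mean = mean(score1),
            score1_se = sd(score1) / sqrt(n_obs),
            ci_half = qt(0.975, df = n_obs - 1) * score1_se)

# Plot the means as points with error bars spanning the confidence intervals
p_ci <- ggplot(ci_data, aes(x = condition, y = score1_mean)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = score1_mean - ci_half,
                    ymax = score1_mean + ci_half),
                width = 0.2)
```

Setting se = FALSE in geom_smooth() would draw the fitted line without the band.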
5.6 Advanced visualizations: Box plots, density distributions, and violin plots
Some scientists have expressed concern over the tendency to display simplified and abstracted indices in figures, arguing that this can obscure important features of the underlying data such as skewness. A proposed solution is to include visual elements that more faithfully communicate relevant aspects of the raw data. Box plots, density distributions, and violin plots fit the bill as they can give a better sense of the underlying data than can averages and trend lines.
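Here is a minimal sketch of the three plot types using geom_boxplot(), geom_violin(), and geom_density(). The data frame dist_demo is invented for illustration, and with so few observations per group the shapes are only indicative.

```r
# Load packages
library(ggplot2)

# Hypothetical data, invented for this illustration
dist_demo <- data.frame(condition = rep(c("high", "medium", "low"), each = 4),
                        score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1))

# Box plot: median, quartiles, and range of score1 by condition
p_box <- ggplot(dist_demo, aes(x = condition, y = score1)) +
  geom_boxplot()

# Violin plot with the raw observations overlaid as jittered points
p_violin <- ggplot(dist_demo, aes(x = condition, y = score1)) +
  geom_violin() +
  geom_jitter(width = 0.1, height = 0)

# Density distributions of score1, one semi-transparent curve per condition
p_density <- ggplot(dist_demo, aes(x = score1, fill = condition)) +
  geom_density(alpha = 0.5)
```

Overlaying the raw points, as in the violin plot, is one way to keep the individual observations visible alongside the summary shape.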