Chapter 5 Data visualization with ggplot2

5.1 Introduction to data visualization principles

Data visualization is a fundamental aspect of data analysis and interpretation. It involves the graphical representation of data to communicate insights, patterns, and trends effectively. In data science and analytics, creating informative and visually appealing visualizations is essential for conveying complex information to both technical and non-technical audiences. This section outlines key principles of data visualization to help you create compelling and meaningful visual representations of your data.

Know your audience: Understand who will be viewing your visualizations. Are they experts in the field, managers, or the general public? Tailor your visualizations to match their level of expertise and interests.

Choose the right visualization: Different types of data call for different types of visualizations. Bar charts, line graphs, scatter plots, histograms, and more each have their specific use cases. Select the one that effectively represents the relationships and patterns in your data.

Simplify complexity: Avoid cluttering your visualizations with unnecessary elements. Keep the design clean and focus on conveying the most important information. Use labels, legends, and colors judiciously to prevent overwhelming the audience.

Use appropriate scaling: Proper scaling of axes is crucial. Misleading visualizations can arise from incorrectly scaled axes that distort the perception of trends and relationships.

Color and contrast: Color can enhance visualizations, but it should be used carefully. Ensure sufficient contrast, especially for text and data points, and be mindful of colorblind accessibility. Here is a website I find helpful for creating color palettes: https://www.learnui.design/tools/data-color-picker.html

Annotations and titles: Clearly label your visualizations with informative titles, axis labels, and data source references. Annotations such as data points, trend lines, and explanatory notes can provide context and clarity.

Avoid misrepresentation: Be honest and accurate in representing data. Distorted visualizations can lead to misinterpretations and incorrect conclusions.

In the upcoming sections, we will explore various types of visualizations that can be created using R, along with practical examples and code snippets to help you get started on your data visualization journey. By following these principles, you’ll be able to create impactful visualizations that enable meaningful data-driven insights and effective communication.

5.2 The grammar of ggplot2

At the core of every ggplot2 visualization is the ggplot() function. It initializes a plot object and serves as the canvas on which you build your visualization layers.

The function takes the data you want to visualize and defines the basic aesthetics using the aes() function (short for aesthetic). The aes() function is used to define how variables from your dataset are mapped to different visual properties of a plot. These properties include the x-axis, y-axis, color, fill, shape, size, and more. The aes() function forms the foundation of the ggplot2 grammar, allowing you to create intricate visualizations by specifying how your data attributes relate to the plot’s aesthetics.

Here’s an explanation of some commonly used aesthetics.

  • x and y define the variables that determine the position of data points on the x-axis and y-axis, respectively.

  • color specifies the variable or value that determines the color of data points, lines, or borders. It’s often used to distinguish different groups or categories.

  • fill is similar to the color aesthetic but is specifically used for filling areas, such as in bar plots, histograms, and box plots.

  • shape determines the shape of individual data points. It’s useful when you want to differentiate data points within a group.

  • size specifies the size of data points or other elements based on a variable. It’s often used to highlight the importance or quantity of a certain aspect.

data %>%
  ggplot(aes(x = variable1, 
             y = variable2,
             color = variable3,
             fill = variable4,
             shape = variable5,
             size = variable6))
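
To make the template above concrete, here is a minimal sketch using R's built-in mtcars data set (an assumption for illustration; it is not used elsewhere in this chapter). Note that ggplot() with aes() alone draws only an empty canvas with axes, so the sketch also adds geom_point() to display the mapped data.

```r
# Load ggplot2 (also attached by library(tidyverse))
library(ggplot2)

# Map variables from the built-in mtcars data set to aesthetics
p <- ggplot(mtcars,
            aes(x = wt,               # Car weight on the x-axis
                y = mpg,              # Fuel economy on the y-axis
                color = factor(cyl),  # Color by number of cylinders, treated as categories
                size = hp)) +         # Point size by horsepower
  geom_point()                        # A geom is needed to draw the mapped data

p
```

Wrapping cyl in factor() tells ggplot2 to treat it as a set of categories rather than a continuous number, which gives one distinct color per group.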

5.2.1 Geoms and layers

Geometric objects, or geoms, determine the visual representation of your data points. They correspond to the type of plot you want to create and, by default, will inherit the aesthetic properties defined in the aes() function. You can add geoms using functions like geom_point(), geom_line(), geom_bar(), etc.

Geoms, as well as other objects, can be added to a ggplot as a layer using the + operator. Each + adds a new layer to your plot.

data %>%
  ggplot(aes(x = variable_x, 
             y = variable_y)) +
  geom_point() +
  geom_line()
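
As a sketch of how inheritance works, the example below sets some aesthetics in ggplot(), where every layer inherits them, and one aesthetic inside a single geom, where it applies to that layer only. The data set and its column names (time, score, group) are hypothetical.

```r
library(ggplot2)

# Hypothetical example data: two groups measured at three time points
data <- data.frame(time  = c(1, 2, 3, 1, 2, 3),
                   score = c(2, 4, 5, 1, 2, 2),
                   group = c("a", "a", "a", "b", "b", "b"))

p <- ggplot(data, aes(x = time, y = score, group = group)) +  # Inherited by every layer
  geom_line() +                                               # Uses x, y, and group only
  geom_point(aes(color = group))                              # Adds color for this layer only

p
```

Here the lines keep the default color while the points are colored by group, because the color mapping was declared inside geom_point() rather than in ggplot().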

5.3 Basic plots: Scatter plots, bar plots, and line plots

5.3.1 Scatter plots

Scatter plots are typically used to visualize the relationship between two continuous variables. Here is an example of a basic scatter plot.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a scatter plot
data %>%
  ggplot(aes(x = score1, 
             y = score2)) +
  geom_point()

5.3.2 Bar plots

Bar plots are a good way to display the distribution of a variable. They can also be used to represent average scores of independent groups.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a bar plot with the frequency distribution of score1
data %>%
  ggplot(aes(x = score1)) +
  geom_bar()

# Create a bar plot with the average of score1 by condition
data %>%
  group_by(condition) %>%                    # Group by condition
  summarize(score1_mean = mean(score1)) %>%  # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition, 
             y = score1_mean)) +
  geom_bar(stat = "identity")
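
Two alternatives to geom_bar(stat = "identity") are sketched below, under the assumption that only ggplot2 is loaded (the summary step uses base R's aggregate() instead of dplyr). geom_col() draws bars at the heights given in the data, and stat_summary() computes the group means while plotting.

```r
library(ggplot2)

# Same condition and score1 values as in the example above
data <- data.frame(condition = rep(c("high", "medium", "low"), each = 4),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1))

# Option 1: summarize first with base R, then draw the bars with geom_col()
means <- aggregate(score1 ~ condition, data = data, FUN = mean)
p_col <- ggplot(means, aes(x = condition, y = score1)) +
  geom_col()

# Option 2: let ggplot2 compute the means from the raw data while plotting
p_summary <- ggplot(data, aes(x = condition, y = score1)) +
  stat_summary(fun = mean, geom = "bar")

p_col
p_summary
```

Both plots show the same bar heights; which one to use is mostly a question of whether you need the summarized data for anything else.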

5.3.3 Line plots

Line plots are typically used to visualize differences in paired or dependent data, such as how variables change across time.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1))

# Create a line plot with score1 over time
data %>%
  ggplot(aes(x = time,
             y = score1)) +
  geom_line()
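
When the same units are measured repeatedly, it is often useful to draw one line per unit. The sketch below uses a hypothetical repeated-measures data set (three participants, four time points each) and the group aesthetic to tell geom_line() which observations belong together.

```r
library(ggplot2)

# Hypothetical repeated-measures data: three participants, four time points each
data <- data.frame(id = rep(c(1, 2, 3), each = 4),
                   time = rep(1:4, times = 3),
                   score1 = c(1, 2, 2, 4, 3, 3, 5, 6, 2, 4, 4, 7))

p <- ggplot(data, aes(x = time,
                      y = score1,
                      group = id,             # Draw one line per participant
                      color = factor(id))) +  # Color the lines by participant
  geom_line()

p
```

Without the group aesthetic (or a categorical color mapping), geom_line() would connect all twelve observations into a single zigzag line.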

5.4 Customizing plots: Axes, colors, and themes

5.4.1 Axes

Axes are highly customizable in ggplot2. There are many ways of customizing them, but the approach I’ve found most effective is to use the scale_y_continuous() and scale_x_continuous() functions when x or y is continuous, and the scale_y_discrete() and scale_x_discrete() functions when x or y is categorical.

Axis functions have their own grammar that is important to understand. Here is an explanation of some commonly used arguments as they apply to scale_y_continuous() and scale_x_continuous().

  • name sets the title of the axis.

  • limits sets the minimum and maximum values of the axis.

  • breaks sets the data values at which tick marks and grid lines are placed on the axis.

  • labels assigns text or numbers to data values, as defined by breaks, to be presented on the axis.

Here is an explanation of the same arguments as they apply to scale_x_discrete() and scale_y_discrete().

  • name sets the title of the axis.

  • limits defines which data values appear on the axis and in what order.

  • labels assigns text or numbers to data values, as defined by limits, to be presented on the axis.

Here are some examples of how axes can be customized.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a scatter plot with custom x- and y-axes
data %>%
  ggplot(aes(x = score2, 
             y = score1)) +
  geom_point() +
  scale_y_continuous(name = "Score 1",          # Set y-axis title to "Score 1"
                     limits = c(0, 6),          # Set y-axis minimum value to 0 and maximum to 6 
                     breaks = c(0, 2, 4, 6),    # Put lines from 0 to 6 (must match labels)
                     labels = c(0, 2, 4, 6)) +  # Put labels on every other integer from 0 to 6
  scale_x_continuous(name = "Score 2",          # Set x-axis title to "Score 2"
                     limits = c(0, 6),          # Set x-axis minimum value to 0 and maximum to 6
                     breaks = c(0, 2, 4, 6),    # Put lines from 0 to 6 (must match labels)
                     labels = c(0, 2, 4, 6))    # Put labels on every other integer from 0 to 6

# Create a bar plot with custom x- and y-axes
data %>%
  group_by(condition) %>%                                # Group by condition
  summarize(score1_mean = mean(score1)) %>%              # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition, 
             y = score1_mean)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(name = "Average of Score 1",        # Set y-axis title to "Average of Score 1"
                     limits = c(0, 6),                   # Set y-axis minimum value to 0 and maximum to 6 
                     breaks = 0:6,                       # Put lines from 0 to 6 (must match labels)
                     labels = 0:6) +                     # Put labels on every integer from 0 to 6
  scale_x_discrete(name = "Condition",                   # Set x-axis title to "Condition"
                   limits = c("low", "medium", "high"),  # Set x-axis order to low, medium, then high
                   labels = c("Low", "Medium", "High"))  # Set x-axis condition labels to "Low", "Medium", and "High"
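
One consequence of the limits argument is worth a sketch. By default, a continuous scale treats values outside its limits as missing, so they are dropped from the plot (with a warning when it is drawn). If you only want to crop the view, coord_cartesian() is an alternative that keeps all the data. The example reuses the score values from section 5.3, where one score1 value is 7.

```r
library(ggplot2)

# score1 contains one value (7) that lies above the chosen upper limit of 6
data <- data.frame(score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

base <- ggplot(data, aes(x = score2, y = score1)) +
  geom_point()

# Scale limits: the point with score1 = 7 is treated as missing and not drawn
clipped <- base + scale_y_continuous(limits = c(0, 6))

# Coordinate limits: the view is cropped to 0-6, but no data are discarded
zoomed <- base + coord_cartesian(ylim = c(0, 6))

clipped
zoomed
```

The difference matters most when a layer computes something from the data, such as a mean or a fitted line, because scale limits remove values before that computation happens.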

5.4.2 Colors

It is possible to change the color of your geoms using the color and fill arguments. color generally dictates the color of lines and geom borders whilst fill dictates the inner space of geoms. However, this is not always the case, and I find you will sometimes have to try both to get the desired appearance.

The simplest way to change the color of your geoms is to set them all to a single color. This is done by passing the color and fill arguments to the relevant geom function and referring to the color by name (e.g., “red”) or by hexadecimal code (e.g., “#00FFFF”). Some examples are presented below.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a scatter plot with red data points using the string "red"
data %>%
  ggplot(aes(x = score1, 
             y = score2)) +
  geom_point(color = "red")

# Create a bar plot with cyan bars using the hex code "#00FFFF"
data %>%
  group_by(condition) %>%                    # Group by condition
  summarize(score1_mean = mean(score1)) %>%  # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition, 
             y = score1_mean)) +
  geom_bar(stat = "identity",
           fill = "#00FFFF")
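
Points are one case where color and fill interact in a way that can be surprising. The default point shape has no separate interior, so only color affects it, whereas shapes 21 to 25 have both a border (color) and an interior (fill). A minimal sketch:

```r
library(ggplot2)

data <- data.frame(score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

p <- ggplot(data, aes(x = score1, y = score2)) +
  geom_point(shape = 21,       # A filled circle: has both a border and an interior
             color = "black",  # Border color
             fill = "red",     # Interior color
             size = 3)         # Larger points make the border easier to see

p
```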

It is also possible to vary the color of your geoms as a function of a variable by placing the color and fill arguments inside the aes() function. When this approach is taken, we can also add the scale_color_manual() and scale_fill_manual() functions to exert further control over the colors.

Mapping a variable to color or fill inside aes() adds a legend to your figure, and scale_color_manual() and scale_fill_manual() control how that legend looks. Many familiar arguments can be applied to these functions. Here is an explanation of them.

  • name sets the title of the color or fill legend.

  • limits defines the data values and their order to be colored and filled.

  • labels assigns text or numbers to data values, as defined by limits, to be presented in the color or fill legend.

  • values defines colors and assigns them to data values, as defined by limits.

Here are some examples of how you can use the fill and color arguments in the aes() function as well as scale_color_manual() and scale_fill_manual() to change the appearance of your plots.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "high", "high", "low", "low", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a scatter plot with different colors by condition
data %>%
  ggplot(aes(x = score1, 
             y = score2,
             color = condition)) +
  geom_point() +
  scale_color_manual(name = "Condition",         # Set color legend title
                     limits = c("high", "low"),  # Set color order to high then low 
                     labels = c("High", "Low"),  # Set color labels to "High" and "Low" 
                     values = c("red", "blue"))  # Set point colors using color names

# Create a bar plot with different colors by condition
data %>%
  group_by(condition) %>%                             # Group by condition
  summarize(score1_mean = mean(score1)) %>%           # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition, 
             y = score1_mean,
             fill = condition)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(name = "Condition",               # Set fill legend title
                    limits = c("high", "low"),        # Set fill order to high then low 
                    labels = c("High", "Low"),        # Set fill labels to "High" and "Low" 
                    values = c("#003f5c", "#ffa600")) # Set fill colors using hex codes
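
An alternative to pairing limits with an unnamed values vector is to give values names. In this sketch, each color is matched to a data value by name rather than by position, so the order in which the colors are listed no longer matters.

```r
library(ggplot2)

data <- data.frame(condition = rep(c("high", "low"), each = 6),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

base <- ggplot(data, aes(x = score1, y = score2, color = condition)) +
  geom_point()

# Named values: "high" is always red and "low" is always blue
p1 <- base + scale_color_manual(name = "Condition",
                                values = c(high = "red", low = "blue"))

# Listing the colors in the opposite order gives the same result
p2 <- base + scale_color_manual(name = "Condition",
                                values = c(low = "blue", high = "red"))

p1
```

Named values are a useful safeguard when the same color scheme is reused across plots whose groups may appear in a different order.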

5.4.3 Themes

Themes allow you to control various visual aspects of your plots, such as colors, fonts, grid lines, axis labels, and more. Many of the theme functions apply numerous changes at once to provide a consistent look across your plots, making it easier to maintain a cohesive visual style in your data visualizations. They can be applied by adding a theme function to your plot, such as theme_classic(), theme_bw(), or theme_minimal().

Here are some examples of cohesive themes and how to apply them.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a scatter plot with classic theme
data %>%
  ggplot(aes(x = score1, 
             y = score2)) +
  geom_point() +
  theme_classic()

# Create a scatter plot with black and white theme
data %>%
  ggplot(aes(x = score1, 
             y = score2)) +
  geom_point() +
  theme_bw()

# Create a scatter plot with minimal theme
data %>%
  ggplot(aes(x = score1, 
             y = score2)) +
  geom_point() +
  theme_minimal()

The theme() function allows more fine-grained control over visual elements, such as axis font and legend formatting.

# Load packages
library(tidyverse)
library(ggplot2)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition = c("high", "high", "high", "high", "high", "high", "low", "low", "low", "low", "low", "low"),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Create a scatter plot with customized theme
data %>%
  ggplot(aes(x = score1, 
             y = score2)) +
  geom_point() +
  theme(panel.background = element_blank(),                        # Remove grey background
        axis.line = element_line(linewidth = 1, color = "black"),  # Add bold black axis line (use size = 1 in ggplot2 versions before 3.4.0)
        axis.text = element_text(size = 12, color = "black"))      # Set the axis text to size 12 and black

# Create a bar plot with different colors by condition with customized theme
data %>%
  group_by(condition) %>%                             # Group by condition
  summarize(score1_mean = mean(score1)) %>%           # Summarize by computing the mean of score1 by condition
  ggplot(aes(x = condition, 
             y = score1_mean,
             fill = condition)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",     # Hide figure legend
        axis.ticks = element_blank()) # Hide axis ticks  
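
Complete themes and theme() can be combined, and the order matters: a complete theme such as theme_classic() resets every theme setting, so theme() adjustments should be added after it. A minimal sketch, using base R's aggregate() for the summary step so that only ggplot2 is required:

```r
library(ggplot2)

data <- data.frame(condition = rep(c("high", "low"), each = 6),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1))

# Compute the mean of score1 by condition
means <- aggregate(score1 ~ condition, data = data, FUN = mean)

p <- ggplot(means, aes(x = condition, y = score1, fill = condition)) +
  geom_bar(stat = "identity") +
  theme_classic() +                                             # Start from a complete theme
  theme(legend.position = "bottom",                             # Then move the legend below the plot
        axis.text = element_text(size = 12, color = "black"))   # And enlarge the axis text

p
```

If the two theme calls were swapped, theme_classic() would overwrite the legend position and axis text settings.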

5.5 Plotting inferential statistics: Confidence intervals and predicted outcomes

As researchers, we often want to extrapolate from the sample we have to some population. We rely on inferential statistics to do so and, as such, often want to represent these statistics visually, including lines of best fit, points, and the precision with which they are estimated.

5.5.1 Confidence intervals

This section will demonstrate how to use the geom_smooth() function to plot a line of best fit along with its confidence interval.
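
As a starting point, here is a minimal sketch of geom_smooth() fitting a straight line with a 95% confidence band. The choice of a linear model (method = "lm") is an assumption made for illustration; geom_smooth() supports other smoothing methods as well.

```r
library(ggplot2)

data <- data.frame(score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

p <- ggplot(data, aes(x = score1, y = score2)) +
  geom_point() +
  geom_smooth(method = "lm",    # Fit a straight line by ordinary least squares
              formula = y ~ x,  # State the model formula explicitly
              se = TRUE,        # Draw the confidence band (the default)
              level = 0.95)     # Confidence level of the band (the default)

p
```

Setting se = FALSE removes the band, and changing level widens or narrows it.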

5.5.2 Model predictions with ggeffects

The simple estimates derived from geom_smooth() and stat_smooth() are not always sufficient, for example when the model of interest includes more than one predictor. In these cases, it is often more appropriate to plot model predictions. This section will show how to do so using ggpredict() and ggeffect() from the ggeffects package.
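
A minimal sketch of this workflow is shown below, assuming the ggeffects package is installed. The model (score2 predicted from score1 and condition) is hypothetical and chosen only to illustrate the steps: fit a model, obtain predictions with ggpredict(), and plot them with standard geoms.

```r
library(ggplot2)
library(ggeffects)

data <- data.frame(condition = rep(c("high", "low"), each = 6),
                   score1 = c(1, 4, 1, 2, 2, 4, 6, 3, 4, 6, 4, 1),
                   score2 = c(1, 2, 3, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Hypothetical model: score2 as a function of score1, adjusting for condition
model <- lm(score2 ~ score1 + condition, data = data)

# Predicted score2 across values of score1, with condition held at a fixed value
pred <- as.data.frame(ggpredict(model, terms = "score1"))

p <- ggplot(pred, aes(x = x, y = predicted)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),  # Confidence band around the predictions
              alpha = 0.2) +
  geom_line()                                          # Predicted values

p
```

The columns x, predicted, conf.low, and conf.high are the names ggpredict() gives to its output, which is why they appear in aes() instead of the original variable names.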

5.6 Advanced visualizations: Box plots, density distributions, and violin plots

Some scientists have expressed concern over the tendency to display simplified and abstracted indices in figures, arguing that this can obscure important features of the underlying data such as skewness. A proposed solution is to include visual elements that more faithfully communicate relevant aspects of the raw data. Box plots, density distributions, and violin plots fit the bill as they can give a better sense of the underlying data than can averages and trend lines.
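
As a rough sketch of the idea, the example below layers a violin plot, a narrow box plot, and the raw data points in one figure, using the condition and score1 values from section 5.3. The specific widths are arbitrary choices made for readability.

```r
library(ggplot2)

data <- data.frame(condition = rep(c("high", "medium", "low"), each = 4),
                   score1 = c(1, 4, 1, 2, 2, 4, 7, 3, 4, 6, 4, 1))

p <- ggplot(data, aes(x = condition, y = score1)) +
  geom_violin() +                        # Mirrored density estimate for each condition
  geom_boxplot(width = 0.2) +            # Median and quartiles, drawn narrow to fit inside the violin
  geom_jitter(width = 0.05, height = 0)  # Raw data points, nudged sideways to reduce overlap

p
```

Setting height = 0 in geom_jitter() keeps the points at their true score1 values, so only their horizontal position is randomized.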

5.6.1 Box plots

5.6.2 Density distributions

5.6.3 Violin plots

5.7 Multi-panel plots: Faceting and combining figures

5.7.1 Faceting

5.7.2 Combining plots with ggstance

5.8 Exporting and saving plots