install.packages("ggplot2")
## or
install.packages("tidyverse")
International College of Digital Innovation, CMU
2023-09-16
Created by Hadley Wickham in 2005 and released in 2007, ggplot2 is a crucial data visualization package for the R programming language.
ggplot2 can serve as a replacement for the base graphics in R (ggplot2 is one part of the tidyverse). ggplot2 now has many extension packages.
Note: The “gg” in ggplot2 stands for “grammar of graphics”.
Each ggplot2 plot is built from up to seven kinds of layers.
The first three layers are necessary to produce a plot (data + aesthetics + geometries).
Data (ggplot()): Where is the data you want to plot? (A data frame is recommended.)
Aesthetics (aes()): What should be mapped to the x and y axes, fill, color, group, or alpha? (Variables can be character, numeric, or date.)
Geometries (geom_XXX()): The type of plot: scatter plot, line plot, bar plot, or a mixed plot.
Facets (facet_XXX()): Generate multiple small plots, each showing a different subset of the data.
Statistics (stat_XXX()): Apply a statistical transformation to the data (for example, draw an empirical distribution or a probability density function).
Coordinates (coord_XXX()): Combine the two position aesthetics to produce a 2D position, using a linear or non-linear coordinate system.
Theme: Set up the background, title, legend, axes, fonts, etc.
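The seven layers listed above can be sketched in one plot. This is a minimal illustration using the built-in mtcars dataset; the variable choices (wt, mpg, cyl, am) and the object name p_layers are assumptions made for the example, not part of the original notes.

```r
library(ggplot2)

# Build one plot by adding the layers in the order described above.
p_layers <- ggplot(data = mtcars) +              # 1. data
  aes(x = wt, y = mpg) +                         # 2. aesthetics
  geom_point(aes(color = factor(cyl))) +         # 3. geometry
  facet_wrap(~ am) +                             # 4. facet (one panel per am value)
  stat_smooth(method = "lm", se = FALSE) +       # 5. statistics (linear fit per panel)
  coord_cartesian(xlim = c(1, 6)) +              # 6. coordinates
  theme_bw()                                     # 7. theme

p_layers
```

Only the first three lines are required; the remaining layers refine the plot.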
How to use ggplot2.
In this note, we build every ggplot from a data frame only.
The aesthetic mapping can be given inside ggplot() or added as a separate aes() layer.
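Both ways of writing the mapping appear in the code in this note. As a small sketch (the mtcars variables and the names p_a and p_b are chosen only for illustration), the two forms below should produce the same plot:

```r
library(ggplot2)

# Form 1: the mapping is supplied inside ggplot()
p_a <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Form 2: the mapping is added as a separate aes() layer
p_b <- ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()
```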
Move the variable from the x-axis to the y-axis to draw the bars horizontally.
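A minimal sketch of this, assuming the mtcars dataset with cyl converted to a factor (the names mt_bar, p_vertical, and p_horizontal are illustrative):

```r
library(ggplot2)

# Treat the number of cylinders as a categorical variable
mt_bar <- transform(mtcars, cyl = factor(cyl))

# Vertical bars: the categorical variable is on the x-axis
p_vertical <- ggplot(mt_bar, aes(x = cyl)) +
  geom_bar()

# Horizontal bars: the same variable is moved to the y-axis
p_horizontal <- ggplot(mt_bar, aes(y = cyl)) +
  geom_bar()

p_horizontal
```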
Put another categorical variable into the fill argument.
A stacked bar graph is a bar graph with an additional character or factor variable.
To place the bars side by side instead of stacking them, set geom_bar(position = "dodge") or geom_bar(position = "dodge2").
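The difference between stacked and dodged bars can be sketched as follows. The example assumes mtcars with cyl and am converted to factors; the object names are illustrative.

```r
library(ggplot2)

mt_fill <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Stacked bars (the default position when fill is mapped)
p_stack <- ggplot(mt_fill, aes(x = cyl, fill = am)) +
  geom_bar()

# Side-by-side bars
p_dodge <- ggplot(mt_fill, aes(x = cyl, fill = am)) +
  geom_bar(position = "dodge")

# Side-by-side bars with a small gap between bars in a group
p_dodge2 <- ggplot(mt_fill, aes(x = cyl, fill = am)) +
  geom_bar(position = "dodge2")

p_dodge
```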
The histogram is the most commonly used graph for showing frequency distributions (discrete or continuous). It looks like a bar plot, but the two differ in essential ways. A histogram is useful when:
The data are numerical
You want to see the shape of the data’s distribution.
You want help choosing a parametric distribution function to test against the data.
You want to determine whether the distributions of two or more variables differ.
You wish to communicate data distribution quickly and efficiently to others.
By default, the number of bins is 30. However, we can change it with the argument bins = "any positive integer" in geom_histogram().
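As a small sketch of the bins argument, assuming the mpg column of mtcars (the object names are illustrative):

```r
library(ggplot2)

# Default: 30 bins (ggplot2 prints a message suggesting a better choice)
p_hist_default <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()

# Explicit number of bins
p_hist_10 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

p_hist_10
```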
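To show density instead of counts on the y-axis, add the argument y = ..density.. in aes(). In ggplot2 3.4.0 and later, y = after_stat(density) is the preferred spelling.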
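A minimal sketch of a density-scaled histogram, written with after_stat(density), which is the newer equivalent of ..density.. (the dataset, bin count, and object name are assumptions for the example):

```r
library(ggplot2)

# The bar areas now sum to 1 instead of the bar heights summing to n
p_hist_density <- ggplot(mtcars, aes(x = mpg, y = after_stat(density))) +
  geom_histogram(bins = 10)

p_hist_density
```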
To draw histograms for each level of a categorical variable, we set the argument fill = variable in aes() and the argument color = "any color" in geom_histogram().
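For instance, a sketch that splits the mpg histogram by transmission type (am converted to a factor; the names and the bin count are illustrative choices):

```r
library(ggplot2)

mt_hist <- transform(mtcars, am = factor(am))

# fill separates the groups; color draws the outline of every bar
p_hist_group <- ggplot(mt_hist, aes(x = mpg, fill = am)) +
  geom_histogram(bins = 10, color = "black")

p_hist_group
```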
A density plot is a smoothed version of the histogram and is used for the same purpose. It represents the distribution of a numeric variable, using a kernel density estimate to show the probability density function of the variable.
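A density plot can be sketched with geom_density(). The dataset, fill color, and transparency below are assumptions for the example:

```r
library(ggplot2)

# Kernel density estimate of mpg, drawn as a filled curve
p_density <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "skyblue", alpha = 0.5)

p_density
```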
In ggplot2, a graph of two continuous variables (a scatter plot) is used to observe the correlation between them. A scatter plot can carry more information by using different colors to display a character or factor variable and by mapping the size of each point to a numeric value. We call this graph a bubble plot. Moreover, a scatter plot in ggplot2 can include a prediction line from a linear or nonlinear regression, with or without a confidence interval, and ellipses that display the bivariate normal distribution.
The examples use the mtcars dataset.
The variable given to the size argument should be numeric.
For the variable given to the color argument, we recommend using a character or factor variable, or setting up the color palette yourself.
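The ideas above can be sketched with two plots: a bubble plot, and a scatter plot with a regression line and an ellipse. The variable choices and object names are assumptions for illustration; stat_ellipse(type = "norm") is used here because it assumes a bivariate normal distribution.

```r
library(ggplot2)

# Bubble plot: numeric variable for size, factor for color
p_bubble <- ggplot(mtcars, aes(x = disp, y = mpg,
                               size = hp, color = factor(cyl))) +
  geom_point(alpha = 0.7)

# Scatter plot with a linear fit (with confidence band) and a 95% ellipse
p_fit <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  stat_ellipse(type = "norm", level = 0.95)

p_bubble
```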
ggplot2 has 26 point shapes, numbered 0-25. The default shape is number 19, and a shape's color is set by the color argument. For shapes 21-25, the color argument sets the outline and the fill argument sets the interior.
We recommend using a character or factor variable for the shape argument. Expert users can consult the help page for scale_shape_identity() and supply the integer shape codes 0-25 directly.
To choose the shapes yourself, add scale_shape_manual() with the values argument.
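A minimal sketch of manual shapes, assuming cyl as the factor and shapes 21, 22, and 24 (fillable shapes, so both color and fill apply); the names are illustrative:

```r
library(ggplot2)

mt_shape <- transform(mtcars, cyl = factor(cyl))

# Map cyl to shape, then pick one shape code per factor level
p_shape <- ggplot(mt_shape, aes(x = wt, y = mpg, shape = cyl)) +
  geom_point(size = 3, color = "black", fill = "orange") +
  scale_shape_manual(values = c("4" = 21, "6" = 22, "8" = 24))

p_shape
```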
By default, ggplot2 often cannot display text in Chinese, Thai, or other non-English languages.
We can solve this problem by changing the font in the theme() function.
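A sketch of setting the font through theme() follows. The family name is an assumption: "serif" is only a placeholder that exists on most devices, and it should be replaced with the name of a font installed on your system that supports the script you need. Whether the glyphs render also depends on the graphics device.

```r
library(ggplot2)

# Placeholder family; replace with an installed font that covers Thai, Chinese, etc.
font_family <- "serif"

p_font <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "น้ำหนัก (wt)", y = "mpg") +              # Thai axis label ("weight")
  theme(text = element_text(family = font_family))  # apply the font to all text
```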
A line plot (line chart) is handy for displaying time series data. In R, we recommend learning the structure of ts, zoo, and xts objects. But if you keep your data in a data frame with at least one date variable and one numeric variable, you can use ggplot2 to draw the line chart.
spc_tbl_ [574 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:574], format: "1967-07-01" "1967-08-01" ...
$ pce : num [1:574] 507 510 516 512 517 ...
$ pop : num [1:574] 198712 198911 199113 199311 199498 ...
$ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
$ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
$ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...
In ggplot2, we use geom_line() to draw a line plot.
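As a minimal sketch with the economics data frame shown above (the choice of the unemploy column and the object name are assumptions):

```r
library(ggplot2)

# economics ships with ggplot2; date is a Date column, unemploy is numeric
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_line
```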
Line plot with area under the line
We can combine a line chart with an area chart (geom_area()).
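A sketch of a line chart drawn over an area chart, assuming the psavert column of economics; the colors and object name are illustrative:

```r
library(ggplot2)

# The area is drawn first so that the line sits on top of it
p_area <- ggplot(economics, aes(x = date, y = psavert)) +
  geom_area(fill = "lightblue", alpha = 0.6) +
  geom_line(color = "darkblue")

p_area
```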
Use stat_smooth() and set its arguments (for example, method and se).
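One possible sketch of a smoothed trend over a line chart. The loess method, the span value, and the column uempmed are assumptions chosen for the example:

```r
library(ggplot2)

# Raw series in grey, loess trend with a 95% confidence band on top
p_smooth <- ggplot(economics, aes(x = date, y = uempmed)) +
  geom_line(color = "grey40") +
  stat_smooth(method = "loess", span = 0.3, se = TRUE, level = 0.95)

p_smooth
```

Setting se = FALSE removes the confidence band, and method = "lm" would fit a straight line instead.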
In ggplot2, we can add horizontal, vertical, or diagonal lines by using the following geoms and their specific arguments.
geom_hline(yintercept = ): for a horizontal line.
geom_vline(xintercept = ): for a vertical line.
geom_abline(intercept = , slope = ): for a diagonal line.
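ggplot(data = economics) +
  aes(x = date, y = uempmed) +
  geom_line(color = "green") +
  # horizontal line at y = 15
  geom_hline(yintercept = 15, color = "red") +
  # vertical line at the 500th date (dates are numeric internally)
  geom_vline(xintercept = as.numeric(economics$date[500]),
             color = "blue") +
  # diagonal line: intercept 5 at numeric date 0 (1970-01-01),
  # rising 15 units over the span from the 1st to the 500th date
  geom_abline(intercept = 5,
              slope = (20 - 5) / (as.numeric(economics$date[500]) -
                                    as.numeric(economics$date[1])),
              color = "black")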
We can use geom_text() to add plain text and geom_label() to add text inside a rectangle. Both geoms take the label argument in aes(), and the text is placed at the x and y positions given there. We recommend labelling only the important points that we need to focus on. Example: we focus on this car.
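A sketch of highlighting one car follows. The choice of "Toyota Corolla" and the helper names (mt_lab, focus) are assumptions for illustration; the original notes do not say which car was highlighted.

```r
library(ggplot2)

# Keep the car names (row names of mtcars) as a regular column
mt_lab <- transform(mtcars, car = rownames(mtcars))
focus <- subset(mt_lab, car == "Toyota Corolla")

p_label <- ggplot(mt_lab, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_point(data = focus, color = "red", size = 3) +         # highlight the point
  geom_label(data = focus, aes(label = car), vjust = -0.5)    # text in a rectangle

# Same idea with plain text instead of a boxed label
p_text <- ggplot(mt_lab, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text(data = focus, aes(label = car), vjust = -1)

p_label
```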
The default theme in ggplot2 is theme_gray() (there is no need to add it explicitly). You can use another theme:
theme_bw(): The classic dark-on-light ggplot2 theme.
theme_linedraw(): A theme with only black lines of various widths on white backgrounds, reminiscent of a line drawing.
theme_light(): A theme similar to theme_linedraw() but with light grey lines and axes, to direct more attention towards the data.
theme_dark(): The dark cousin of theme_light(), with similar line sizes but a dark background.
theme_minimal(): A minimalistic theme with no background annotations.
theme_void(): A completely empty theme.
Try the different themes yourself.
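One way to try them is to build a base plot once and add a different theme each time. The base plot and object names below are assumptions for the example:

```r
library(ggplot2)

# A base plot that uses the default theme_gray()
p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Add a theme as the last layer to change the overall look
p_bw      <- p_base + theme_bw()
p_minimal <- p_base + theme_minimal()
p_void    <- p_base + theme_void()

p_bw
```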
For extra themes in ggplot2, you need to install the “ggthemes” package.
## 1 plot
p1 <- ggplot(mtcars, aes(x = disp, y = mpg, size = gear,
                         color = cyl)) +
  geom_point() + geom_smooth()
p2 <- ggplot(mtcars, aes(x = disp, y = mpg, size = gear,
                         color = cyl)) +
  geom_point() + geom_smooth(method = lm)
p3 <- ggplot(mtcars, aes(x = disp, y = mpg, size = gear,
                         color = as.character(cyl))) +
  geom_point() + geom_smooth(method = lm, se = FALSE)
We can turn a static ggplot into an interactive plot using the plotly package. Interactive plots are used in HTML output. The basic plotly workflow has two steps:
Step 1: Make a ggplot and assign it to a variable (for example p1, p2, or p3 in the previous topic).
Step 2: Pass the variable to the ggplotly() command.
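The two steps can be sketched as below. This assumes the plotly package is installed; the requireNamespace() guard and the object names are additions for the example so that the code does not fail when plotly is missing.

```r
library(ggplot2)

# Step 1: make a ggplot and assign it to a variable
p_static <- ggplot(mtcars, aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point()

# Step 2: pass the variable to ggplotly() (only if plotly is available)
if (requireNamespace("plotly", quietly = TRUE)) {
  p_interactive <- plotly::ggplotly(p_static)
}
```

In an HTML document, printing p_interactive shows the plot with hover tooltips and zooming.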