Code
research_funding <- readr::read_csv(
"data/chap03/gun_publications_funds_2004_2015_ch3.csv",
show_col_types = FALSE)
save_data_file("chap03", research_funding, "research_funding.rds")
Objectives for chapter 03
SwR Achievements
Achievements for chapter 03
Research about gun violence in under developed. Harris refers to an article by Stark & Shah (2017) (Figure 1 and 2).
Data for figure 3 (Homicides in the US by guns 2012-2016) comes from the Uniform Crime Reporting (UCR):
The Uniform Crime Reporting (UCR) Program generates reliable statistics for use in law enforcement. It also provides information for students of criminal justice, researchers, the media, and the public. The program has been providing crime statistics since 1930.
The UCR Program includes data from more than 18,000 city, university and college, county, state, tribal, and federal law enforcement agencies. Agencies participate voluntarily and submit their crime data either through a state UCR program or directly to the FBI’s UCR Program.
Figure 4: Handguns were the most widely used type of gun for homicide in 2016.
Gun manufacturers play an essential role: Figure 5 and 6.
Resource 3.1 : Data, codebook, and R packages for data visualization
Data
There are two options:
I will only work with the second option.
Manufactured firearms data not mentioned
Harris provides the “total_firearms_manufactured_US_1990to2015.csv” file with firearm production in the US from 1990-2015 but did not mention the source. I have looked around on the web and reported the results of my research in Section 3.3.2.1.
Codebook
Again there are two options:
I will only work with the second option.
Packages
Instead of downloading the file with {httr} I will use utils::download.file()
Harris lists the older {httr} package, but now there is {httr2}, “a modern re-imagining of {httr} that uses a pipe-based interface and solves more of the problems that API wrapping packages face.” (See Section A.40)
The {httr} package is in the book just used for downloading the excel file from the FBI website. For this specific task there is no need to download, to install and to learn a new package. You can use utils::download.file()
, a function as I have it already applied successfully in Listing / Output 2.1.
There are different sources of data for this chapter. A special case are the data provided by Harris about guns manufactured in the US between 1990 and 2015. There is no source available because this dataset was not mentioned in the section about “Data, codebook, and R packages” (see: Section 3.3.1). But a Google searched revealed several interesting possible sources:
.csv
file from USAFacts a reference to a PDF file where all the data I am interesting in are listed. To the best of my knowledge there are no better accessible data on the ATF website.But even if you have data of manufactured, exported and imported guns, this does not provide the exact numbers of guns circulating in the US:
Although a few data points are consistently collected, there is a clear lack of data on firearms in the US. It is impossible, for instance, to know the total number of firearms owned in the United States, or how many guns were bought in the past year, or what the most commonly owned firearms are. Instead, Americans are left to draw limited conclusions from available data, such as the number of firearms processed by the National Firearm Administration (NFA), the number of background checks conducted for firearm purchase, and the number of firearms manufactured. However, none of these metrics provide a complete picture because state laws about background checks and gun registration differ widely. (USAFact.org)
Remark 3.1. How to calculate the numbers of circulated guns in the US
If you are interested to research the relationship between gun ownership in the USA and homicides then you would need to reflect how to get a better approximation as the yearly manufactured guns. Besides that not all gun manufacturer have reported their production numbers to the ATF, there is a big difference between gun production and gun sales. Additionally there are some other issues that influence the number of circulated guns in the US. So you have to take into account for instance
As an exercise I have subtracted from the manufactured gun the exported guns and have added the imported firearms. You will find the result in Listing / Output 3.14.
To get the data for this chapter is a three step procedure:
Procedure 3.1 : How to get data from the internet
utils::donwload.file()
.readxl::read_excel()
Example 3.1 : Get data for chapter 3
R Code 3.1 : Get data from the FBI’s Uniform Crime Reporting database
(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.4.)
In changed the recommended R code by Harris for downloading the FBI Excel Data I in several ways:
utils::download.file()
. This worked in the console (compiling the R code chunk), but did not work when rendered with Quarto. I changed to {curl} and used the curl_download()
function which worked in both situations (see: Section A.11).base::data.frame()
I used a tibble::tibble()
. This has the advantage that the column names were not changed. In the original files the column names are years, but in base R is not allowed that column names start with a number. In tidyverse this is possible but you must refer to this column names with enclosed accents like 2016
.R Code 3.2 : Get NHANES data 2011-2012 from the CDC website with {RNHANES}
(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.5)
{RNHANES} combines as a package specialized to download data from the National Health and Nutrition Examination Survey (NHANES) step 2 and 3 of Procedure 3.1. But as it turned out it only doesn’t work with newer audiology data than 2012. I tried to use the package with data from 2016 and 2018 (For 2014 there are no audiology data available), but I got an error message.
Error in validate_year(year) : Invalid year: 2017-2018
The problem lies in the function RNHANES:::validate_year()
. It qualifies in version 1.1.0 only downloads until ‘2013-2014’ as valid:
function (year, throw_error = TRUE)
{
if (length(year) > 1) {
Map(validate_year, year, throw_error = throw_error) %>%
unlist() %>% unname() %>% return()
}
else {
valid <- switch(as.character(year), `1999-2000` = TRUE,
`2001-2002` = TRUE, `2003-2004` = TRUE, `2005-2006` = TRUE,
`2007-2008` = TRUE, `2009-2010` = TRUE, `2011-2012` = TRUE,
`2013-2014` = TRUE, FALSE)
if (throw_error == TRUE && valid == FALSE) {
stop(paste0("Invalid year: ", year))
}
return(valid)
}
}
Conclusion: Special data packages can facilitate your work, but to know how to download data programmatically on your own is an indispensable data science skill.
See tab “NHANES 2018” how this is done.
R Code 3.3 : Get NHANES data 2017-2018 from the CDC website with {haven}
(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.6.)
The download with {haven} has the additional advantage that the variables are labelled as already explained in Section 1.8.
R Code 3.4 : Research funding for different kind of research topics (2004-2015)
research_funding <- readr::read_csv(
"data/chap03/gun_publications_funds_2004_2015_ch3.csv",
show_col_types = FALSE)
save_data_file("chap03", research_funding, "research_funding.rds")
(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.7.)
R Code 3.5 : Get guns manufactured 1986-2019 from USAFacts.org
guns_manufactured_2019 <- readr::read_csv(
"https://a.usafacts.org/api/v4/Metrics/csv/16185?&_ga=2.61283563.611609354.1708598369-436724430.1708598369", show_col_types = FALSE
)
save_data_file("chap03", guns_manufactured_2019, "guns_manufactured_2019.rds")
(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.8.)
R Code 3.6 : Get guns exported from a PDF report by the ATF
guns_export <- tabulizer::extract_tables(
"data/chap03/firearms_commerce_2019.pdf",
pages = 4
)
guns_exported_2017 <- guns_export[[1]]
save_data_file("chap03", guns_exported_2017, "guns_exported_2017.rds")
(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.9.)
There are several innovation applied in this R code chunk:
1. First of all I have used the {**tabulizer**} package to scrap the export and import data tables from the original <a class='glossary' title='Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF): ATF’s responsibilities include the investigation and prevention of federal offenses involving the unlawful use, manufacture, and possession of firearms and explosives; acts of arson and bombings; and illegal trafficking of alcohol and tobacco products. The ATF also regulates, via licensing, the sale, possession, and transportation of firearms, ammunition, and explosives in interstate commerce. (ATF)'>ATF</a>-PDF. To make this procedure reproducible I have downloaded the [PDF from the ATF website](https://www.atf.gov/firearms/docs/report/2019-firearms-commerce-report/download) as I assume that this PDF will updated regularily.
WATCH OUT! Read carefully the installation instructions
Please read carefully the installation advices at the start of the page but — even more important — Installing Java on Windows with Chocolatey if you working on a windows machine or generally the Troubleshooting section for macOs, Linux and Windows.
I have recoded the data in several ways: - I turned the resulted matrices from the {tabulizer} package into tibbles. - Now I could rename all the columns with one name vector. - As the export data end with the year 2017 I had to reduce the import data ans also my original gun manufactured file to this shorter period.
R Code 3.7 : Get and recode guns imported from a PDF by the ATF
guns_import <- tabulizer::extract_tables(
"data/chap03/firearms_commerce_2019.pdf",
pages = 6
)
guns_imported_2018 <- guns_import[[1]]
save_data_file("chap03", guns_imported_2018, "guns_imported_2018.rds")
(For this R code chunk is no output available. But you can inspect the data at Listing / Output 3.10.)
Example 3.2 : Show raw data for chapter 3
R Code 3.8 : Show raw data from the FBI’s Uniform Crime Reporting database
#> tibble [18 × 6] (S3: tbl_df/tbl/data.frame)
#> $ Weapons: chr [1:18] "Total" "Total firearms:" "Handguns" "Rifles" ...
#> $ 2012 : num [1:18] 12888 8897 6404 298 310 ...
#> $ 2013 : num [1:18] 12253 8454 5782 285 308 ...
#> $ 2014 : num [1:18] 12270 8312 5673 258 264 ...
#> $ 2015 : num [1:18] 13750 9778 6569 258 272 ...
#> $ 2016 : num [1:18] 15070 11004 7105 374 262 ...
Name | fbi_deaths |
Number of rows | 18 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Weapons | 0 | 1 | 4 | 44 | 0 | 18 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
2012 | 0 | 1 | 1926.28 | 3652.75 | 8 | 87.75 | 304.0 | 1403.50 | 12888 | ▇▁▁▁▁ |
2013 | 0 | 1 | 1831.11 | 3449.87 | 2 | 87.25 | 296.5 | 1330.00 | 12253 | ▇▁▁▁▁ |
2014 | 0 | 1 | 1825.11 | 3430.87 | 7 | 75.50 | 261.0 | 1414.25 | 12270 | ▇▁▁▁▁ |
2015 | 0 | 1 | 2071.00 | 3918.39 | 1 | 87.75 | 265.0 | 1410.00 | 13750 | ▇▁▁▁▁ |
2016 | 0 | 1 | 2285.78 | 4327.90 | 1 | 100.25 | 318.0 | 1428.75 | 15070 | ▇▁▁▁▁ |
R Code 3.9 : Show raw NHANES data 2011-2012 from the CDC website with {RNHANES}
#> 'data.frame': 9364 obs. of 81 variables:
#> $ SEQN : num 62161 62162 62163 62164 62165 ...
#> $ cycle : chr "2011-2012" "2011-2012" "2011-2012" "2011-2012" ...
#> $ SDDSRVYR : num 7 7 7 7 7 7 7 7 7 7 ...
#> $ RIDSTATR : num 2 2 2 2 2 2 2 2 2 2 ...
#> $ RIAGENDR : num 1 2 1 2 2 1 1 1 1 1 ...
#> $ RIDAGEYR : num 22 3 14 44 14 9 6 21 15 14 ...
#> $ RIDAGEMN : num NA NA NA NA NA NA NA NA NA NA ...
#> $ RIDRETH1 : num 3 1 5 3 4 3 5 5 5 1 ...
#> $ RIDRETH3 : num 3 1 6 3 4 3 7 6 7 1 ...
#> $ RIDEXMON : num 2 1 2 1 2 2 1 1 1 1 ...
#> $ RIDEXAGY : num NA 3 14 NA 14 10 6 NA 15 14 ...
#> $ RIDEXAGM : num NA 41 177 NA 179 120 81 NA 181 175 ...
#> $ DMQMILIZ : num 2 NA NA 1 NA NA NA 2 NA NA ...
#> $ DMQADFC : num NA NA NA 2 NA NA NA NA NA NA ...
#> $ DMDBORN4 : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ DMDCITZN : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ DMDYRSUS : num NA NA NA NA NA NA NA NA NA NA ...
#> $ DMDEDUC3 : num NA NA 8 NA 7 3 0 NA 9 7 ...
#> $ DMDEDUC2 : num 3 NA NA 4 NA NA NA 3 NA NA ...
#> $ DMDMARTL : num 5 NA NA 1 NA NA NA 5 NA NA ...
#> $ RIDEXPRG : num NA NA NA 2 NA NA NA NA NA NA ...
#> $ SIALANG : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ SIAPROXY : num 1 1 1 2 1 1 1 2 1 1 ...
#> $ SIAINTRP : num 2 2 2 2 2 2 2 1 2 2 ...
#> $ FIALANG : num 1 1 1 1 1 1 1 NA 1 2 ...
#> $ FIAPROXY : num 2 2 2 2 2 2 2 NA 2 2 ...
#> $ FIAINTRP : num 2 2 2 2 2 2 2 NA 2 2 ...
#> $ MIALANG : num 1 NA 1 NA 1 1 NA 1 1 1 ...
#> $ MIAPROXY : num 2 NA 2 NA 2 2 NA 2 2 2 ...
#> $ MIAINTRP : num 2 NA 2 NA 2 2 NA 2 2 2 ...
#> $ AIALANGA : num 1 NA 1 NA 1 NA NA 1 1 1 ...
#> $ WTINT2YR : num 102641 15458 7398 127351 12210 ...
#> $ WTMEC2YR : num 104237 16116 7869 127965 13384 ...
#> $ SDMVPSU : num 1 3 3 1 2 1 2 1 3 3 ...
#> $ SDMVSTRA : num 91 92 90 94 90 91 103 92 91 92 ...
#> $ INDHHIN2 : num 14 4 15 8 4 77 14 2 15 9 ...
#> $ INDFMIN2 : num 14 4 15 8 4 77 14 2 15 9 ...
#> $ INDFMPIR : num 3.15 0.6 4.07 1.67 0.57 NA 3.48 0.33 5 2.46 ...
#> $ DMDHHSIZ : num 5 6 5 5 5 6 5 5 4 4 ...
#> $ DMDFMSIZ : num 5 6 5 5 5 6 5 5 4 4 ...
#> $ DMDHHSZA : num 0 2 0 1 1 0 0 0 0 0 ...
#> $ DMDHHSZB : num 1 2 2 2 2 4 2 1 2 2 ...
#> $ DMDHHSZE : num 0 0 1 0 0 0 1 0 0 0 ...
#> $ DMDHRGND : num 2 2 1 1 2 1 1 1 1 1 ...
#> $ DMDHRAGE : num 50 24 42 52 33 44 43 51 38 43 ...
#> $ DMDHRBR4 : num 1 1 1 1 2 1 1 2 2 2 ...
#> $ DMDHREDU : num 5 3 5 4 2 5 4 1 5 3 ...
#> $ DMDHRMAR : num 1 6 1 1 77 1 1 4 1 1 ...
#> $ DMDHSEDU : num 5 NA 4 4 NA 5 5 NA 5 4 ...
#> $ AUQ054 : num 2 1 2 1 2 1 1 2 2 2 ...
#> $ AUQ060 : num 1 NA NA NA NA NA NA 2 NA NA ...
#> $ AUQ070 : num NA NA NA NA NA NA NA 1 NA NA ...
#> $ AUQ080 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ090 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ100 : num 5 NA NA 4 NA NA NA 4 NA NA ...
#> $ AUQ110 : num 5 NA NA 5 NA NA NA 4 NA NA ...
#> $ AUQ136 : num 1 NA NA 2 NA NA NA 2 NA NA ...
#> $ AUQ138 : num 1 NA NA 2 NA NA NA 2 NA NA ...
#> $ AUQ144 : num 4 NA NA 4 NA NA NA 2 NA NA ...
#> $ AUQ146 : num 2 NA NA 2 NA NA NA 2 NA NA ...
#> $ AUD148 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ152 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ154 : num 2 NA NA 2 NA NA NA 2 NA NA ...
#> $ AUQ191 : num 2 NA NA 1 NA NA NA 2 NA NA ...
#> $ AUQ250 : num NA NA NA 5 NA NA NA NA NA NA ...
#> $ AUQ255 : num NA NA NA 1 NA NA NA NA NA NA ...
#> $ AUQ260 : num NA NA NA 2 NA NA NA NA NA NA ...
#> $ AUQ270 : num NA NA NA 1 NA NA NA NA NA NA ...
#> $ AUQ280 : num NA NA NA 1 NA NA NA NA NA NA ...
#> $ AUQ300 : num 2 NA NA 1 NA NA NA 2 NA NA ...
#> $ AUQ310 : num NA NA NA 2 NA NA NA NA NA NA ...
#> $ AUQ320 : num NA NA NA 1 NA NA NA NA NA NA ...
#> $ AUQ330 : num 2 NA NA 2 NA NA NA 2 NA NA ...
#> $ AUQ340 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ350 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ360 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ AUQ370 : num 2 NA NA 2 NA NA NA 2 NA NA ...
#> $ AUQ380 : num 1 NA NA 6 NA NA NA 5 NA NA ...
#> $ file_name : chr "AUQ_G" "AUQ_G" "AUQ_G" "AUQ_G" ...
#> $ begin_year: num 2011 2011 2011 2011 2011 ...
#> $ end_year : num 2012 2012 2012 2012 2012 ...
Name | nhanes_2012 |
Number of rows | 9364 |
Number of columns | 81 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 79 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
cycle | 0 | 1 | 9 | 9 | 0 | 1 | 0 |
file_name | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
SEQN | 0 | 1.00 | 67029.29 | 2811.36 | 62161.00 | 64598.75 | 67024.50 | 69457.25 | 71916.0 | ▇▇▇▇▇ |
SDDSRVYR | 0 | 1.00 | 7.00 | 0.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.0 | ▁▁▇▁▁ |
RIDSTATR | 0 | 1.00 | 1.96 | 0.20 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
RIAGENDR | 0 | 1.00 | 1.50 | 0.50 | 1.00 | 1.00 | 2.00 | 2.00 | 2.0 | ▇▁▁▁▇ |
RIDAGEYR | 0 | 1.00 | 32.72 | 24.22 | 1.00 | 11.00 | 28.00 | 53.00 | 80.0 | ▇▅▃▃▃ |
RIDAGEMN | 9130 | 0.02 | 17.94 | 3.41 | 12.00 | 15.00 | 18.00 | 21.00 | 24.0 | ▇▆▇▆▆ |
RIDRETH1 | 0 | 1.00 | 3.24 | 1.25 | 1.00 | 3.00 | 3.00 | 4.00 | 5.0 | ▃▃▇▇▅ |
RIDRETH3 | 0 | 1.00 | 3.45 | 1.60 | 1.00 | 3.00 | 3.00 | 4.00 | 7.0 | ▆▇▇▁▅ |
RIDEXMON | 408 | 0.96 | 1.52 | 0.50 | 1.00 | 1.00 | 2.00 | 2.00 | 2.0 | ▇▁▁▁▇ |
RIDEXAGY | 5946 | 0.37 | 9.64 | 5.18 | 2.00 | 5.00 | 9.00 | 14.00 | 20.0 | ▇▇▅▆▃ |
RIDEXAGM | 5737 | 0.39 | 114.53 | 64.66 | 12.00 | 57.00 | 109.00 | 167.00 | 239.0 | ▇▇▇▅▅ |
DMQMILIZ | 3357 | 0.64 | 1.91 | 0.29 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
DMQADFC | 8813 | 0.06 | 1.50 | 0.59 | 1.00 | 1.00 | 1.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
DMDBORN4 | 0 | 1.00 | 1.27 | 2.11 | 1.00 | 1.00 | 1.00 | 1.00 | 99.0 | ▇▁▁▁▁ |
DMDCITZN | 5 | 1.00 | 1.13 | 0.44 | 1.00 | 1.00 | 1.00 | 1.00 | 7.0 | ▇▁▁▁▁ |
DMDYRSUS | 7292 | 0.22 | 7.44 | 14.71 | 1.00 | 3.00 | 5.00 | 6.00 | 99.0 | ▇▁▁▁▁ |
DMDEDUC3 | 6765 | 0.28 | 6.04 | 6.13 | 0.00 | 2.00 | 5.00 | 9.00 | 66.0 | ▇▁▁▁▁ |
DMDEDUC2 | 3804 | 0.59 | 3.47 | 1.28 | 1.00 | 3.00 | 4.00 | 5.00 | 9.0 | ▃▇▃▁▁ |
DMDMARTL | 3804 | 0.59 | 2.75 | 3.34 | 1.00 | 1.00 | 2.00 | 5.00 | 99.0 | ▇▁▁▁▁ |
RIDEXPRG | 8156 | 0.13 | 2.02 | 0.34 | 1.00 | 2.00 | 2.00 | 2.00 | 3.0 | ▁▁▇▁▁ |
SIALANG | 0 | 1.00 | 1.12 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 | 2.0 | ▇▁▁▁▁ |
SIAPROXY | 4 | 1.00 | 1.65 | 0.48 | 1.00 | 1.00 | 2.00 | 2.00 | 2.0 | ▅▁▁▁▇ |
SIAINTRP | 0 | 1.00 | 1.96 | 0.18 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
FIALANG | 99 | 0.99 | 1.08 | 0.27 | 1.00 | 1.00 | 1.00 | 1.00 | 2.0 | ▇▁▁▁▁ |
FIAPROXY | 99 | 0.99 | 2.00 | 0.04 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
FIAINTRP | 99 | 0.99 | 1.97 | 0.17 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
MIALANG | 2651 | 0.72 | 1.05 | 0.22 | 1.00 | 1.00 | 1.00 | 1.00 | 2.0 | ▇▁▁▁▁ |
MIAPROXY | 2651 | 0.72 | 1.99 | 0.08 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
MIAINTRP | 2651 | 0.72 | 1.97 | 0.17 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
AIALANGA | 3610 | 0.61 | 1.11 | 0.37 | 1.00 | 1.00 | 1.00 | 1.00 | 3.0 | ▇▁▁▁▁ |
WTINT2YR | 0 | 1.00 | 32347.76 | 34440.37 | 3600.63 | 11761.87 | 18490.48 | 35509.75 | 220233.3 | ▇▁▁▁▁ |
WTMEC2YR | 0 | 1.00 | 32347.76 | 35612.05 | 0.00 | 11582.07 | 18605.93 | 36132.36 | 222579.8 | ▇▁▁▁▁ |
SDMVPSU | 0 | 1.00 | 1.64 | 0.64 | 1.00 | 1.00 | 2.00 | 2.00 | 3.0 | ▇▁▇▁▂ |
SDMVSTRA | 0 | 1.00 | 95.88 | 3.98 | 90.00 | 92.00 | 96.00 | 99.00 | 103.0 | ▇▆▃▆▅ |
INDHHIN2 | 78 | 0.99 | 11.53 | 16.49 | 1.00 | 5.00 | 7.00 | 14.00 | 99.0 | ▇▁▁▁▁ |
INDFMIN2 | 49 | 0.99 | 11.10 | 16.20 | 1.00 | 4.00 | 7.00 | 14.00 | 99.0 | ▇▁▁▁▁ |
INDFMPIR | 805 | 0.91 | 2.22 | 1.64 | 0.00 | 0.87 | 1.64 | 3.62 | 5.0 | ▇▇▃▂▆ |
DMDHHSIZ | 0 | 1.00 | 3.73 | 1.70 | 1.00 | 2.00 | 4.00 | 5.00 | 7.0 | ▇▅▆▅▅ |
DMDFMSIZ | 0 | 1.00 | 3.56 | 1.78 | 1.00 | 2.00 | 4.00 | 5.00 | 7.0 | ▇▅▅▃▃ |
DMDHHSZA | 0 | 1.00 | 0.49 | 0.77 | 0.00 | 0.00 | 0.00 | 1.00 | 3.0 | ▇▂▁▁▁ |
DMDHHSZB | 0 | 1.00 | 0.95 | 1.13 | 0.00 | 0.00 | 1.00 | 2.00 | 4.0 | ▇▃▃▁▁ |
DMDHHSZE | 0 | 1.00 | 0.41 | 0.71 | 0.00 | 0.00 | 0.00 | 1.00 | 3.0 | ▇▂▁▁▁ |
DMDHRGND | 0 | 1.00 | 1.49 | 0.50 | 1.00 | 1.00 | 1.00 | 2.00 | 2.0 | ▇▁▁▁▇ |
DMDHRAGE | 0 | 1.00 | 45.85 | 15.88 | 18.00 | 34.00 | 43.00 | 57.00 | 80.0 | ▅▇▆▃▃ |
DMDHRBR4 | 361 | 0.96 | 1.44 | 3.12 | 1.00 | 1.00 | 1.00 | 2.00 | 99.0 | ▇▁▁▁▁ |
DMDHREDU | 358 | 0.96 | 3.43 | 1.33 | 1.00 | 2.00 | 4.00 | 4.00 | 9.0 | ▅▇▃▁▁ |
DMDHRMAR | 128 | 0.99 | 3.17 | 7.49 | 1.00 | 1.00 | 1.00 | 5.00 | 99.0 | ▇▁▁▁▁ |
DMDHSEDU | 4716 | 0.50 | 3.59 | 1.36 | 1.00 | 3.00 | 4.00 | 5.00 | 9.0 | ▃▇▆▁▁ |
AUQ054 | 1 | 1.00 | 1.93 | 4.07 | 1.00 | 1.00 | 2.00 | 2.00 | 99.0 | ▇▁▁▁▁ |
AUQ060 | 6459 | 0.31 | 1.34 | 0.92 | 1.00 | 1.00 | 1.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ070 | 8587 | 0.08 | 1.30 | 0.65 | 1.00 | 1.00 | 1.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ080 | 9151 | 0.02 | 1.29 | 0.69 | 1.00 | 1.00 | 1.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ090 | 9310 | 0.01 | 1.46 | 0.50 | 1.00 | 1.00 | 1.00 | 2.00 | 2.0 | ▇▁▁▁▇ |
AUQ100 | 4689 | 0.50 | 4.14 | 1.13 | 1.00 | 4.00 | 5.00 | 5.00 | 9.0 | ▂▆▇▁▁ |
AUQ110 | 4689 | 0.50 | 4.59 | 0.85 | 1.00 | 4.00 | 5.00 | 5.00 | 9.0 | ▁▂▇▁▁ |
AUQ136 | 3805 | 0.59 | 2.02 | 1.32 | 1.00 | 2.00 | 2.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ138 | 3805 | 0.59 | 2.00 | 0.62 | 1.00 | 2.00 | 2.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ144 | 4689 | 0.50 | 3.84 | 1.53 | 1.00 | 3.00 | 4.00 | 5.00 | 9.0 | ▅▇▇▁▁ |
AUQ146 | 4689 | 0.50 | 1.99 | 0.12 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
AUD148 | 9301 | 0.01 | 1.03 | 0.18 | 1.00 | 1.00 | 1.00 | 1.00 | 2.0 | ▇▁▁▁▁ |
AUQ152 | 9303 | 0.01 | 3.15 | 1.63 | 1.00 | 1.00 | 3.00 | 5.00 | 5.0 | ▇▂▃▅▇ |
AUQ154 | 4689 | 0.50 | 1.98 | 0.12 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
AUQ191 | 4689 | 0.50 | 1.86 | 0.38 | 1.00 | 2.00 | 2.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ250 | 8680 | 0.07 | 3.22 | 1.50 | 1.00 | 2.00 | 3.00 | 5.00 | 9.0 | ▆▇▆▁▁ |
AUQ255 | 8680 | 0.07 | 3.02 | 1.57 | 1.00 | 1.00 | 3.00 | 5.00 | 9.0 | ▇▇▅▁▁ |
AUQ260 | 8680 | 0.07 | 1.88 | 0.72 | 1.00 | 2.00 | 2.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ270 | 8680 | 0.07 | 1.62 | 0.56 | 1.00 | 1.00 | 2.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ280 | 8680 | 0.07 | 2.15 | 1.00 | 1.00 | 1.00 | 2.00 | 3.00 | 5.0 | ▅▇▅▁▁ |
AUQ300 | 4689 | 0.50 | 1.66 | 0.48 | 1.00 | 1.00 | 2.00 | 2.00 | 7.0 | ▇▁▁▁▁ |
AUQ310 | 7751 | 0.17 | 2.11 | 1.42 | 1.00 | 1.00 | 2.00 | 3.00 | 9.0 | ▇▃▁▁▁ |
AUQ320 | 7751 | 0.17 | 3.06 | 1.81 | 1.00 | 1.00 | 3.00 | 5.00 | 9.0 | ▇▂▇▁▁ |
AUQ330 | 4689 | 0.50 | 1.69 | 0.50 | 1.00 | 1.00 | 2.00 | 2.00 | 3.0 | ▅▁▇▁▁ |
AUQ340 | 7828 | 0.16 | 4.77 | 4.59 | 1.00 | 3.00 | 5.00 | 7.00 | 99.0 | ▇▁▁▁▁ |
AUQ350 | 7828 | 0.16 | 1.37 | 0.52 | 1.00 | 1.00 | 1.00 | 2.00 | 9.0 | ▇▁▁▁▁ |
AUQ360 | 8383 | 0.10 | 4.65 | 3.59 | 1.00 | 3.00 | 5.00 | 7.00 | 99.0 | ▇▁▁▁▁ |
AUQ370 | 4689 | 0.50 | 1.88 | 0.32 | 1.00 | 2.00 | 2.00 | 2.00 | 2.0 | ▁▁▁▁▇ |
AUQ380 | 4689 | 0.50 | 4.68 | 2.11 | 1.00 | 5.00 | 5.00 | 6.00 | 77.0 | ▇▁▁▁▁ |
begin_year | 0 | 1.00 | 2011.00 | 0.00 | 2011.00 | 2011.00 | 2011.00 | 2011.00 | 2011.0 | ▁▁▇▁▁ |
end_year | 0 | 1.00 | 2012.00 | 0.00 | 2012.00 | 2012.00 | 2012.00 | 2012.00 | 2012.0 | ▁▁▇▁▁ |
R Code 3.10 : Show raw NHANES data 2017-2018 from the CDC website with {haven}
#> tibble [8,897 × 58] (S3: tbl_df/tbl/data.frame)
#> $ SEQN : num [1:8897] 93703 93704 93705 93706 93707 ...
#> ..- attr(*, "label")= chr "Respondent sequence number"
#> $ AUQ054 : num [1:8897] 1 1 2 1 1 1 1 1 1 1 ...
#> ..- attr(*, "label")= chr "General condition of hearing"
#> $ AUQ060 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Hear a whisper from across a quiet room?"
#> $ AUQ070 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Hear normal voice across a quiet room?"
#> $ AUQ080 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Hear a shout from across a quiet room?"
#> $ AUQ090 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Hear if spoken loudly to in better ear?"
#> $ AUQ400 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "When began to have hearing loss?"
#> $ AUQ410A: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Genetic/Hereditary"
#> $ AUQ410B: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Ear infections"
#> $ AUQ410C: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Ear diseases"
#> $ AUQ410D: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Illness/Infections"
#> $ AUQ410E: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Drugs/Medications"
#> $ AUQ410F: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Head/Neck injury"
#> $ AUQ410G: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Loud brief noise"
#> $ AUQ410H: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Long-term noise"
#> $ AUQ410I: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Aging"
#> $ AUQ410J: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Cause of hearing loss-Others"
#> $ AUQ156 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Ever used assistive listening devices?"
#> $ AUQ420 : num [1:8897] NA NA NA 2 1 NA 2 NA 2 NA ...
#> ..- attr(*, "label")= chr "Ever had ear infections or earaches?"
#> $ AUQ430 : num [1:8897] NA NA NA NA 1 NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Ever had 3 or more ear infections?"
#> $ AUQ139 : num [1:8897] NA NA NA NA 2 NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Ever had tube placed in ear?"
#> $ AUQ144 : num [1:8897] NA NA NA 1 3 NA 4 NA 2 NA ...
#> ..- attr(*, "label")= chr "Last time hearing tested by specialist?"
#> $ AUQ147 : num [1:8897] NA NA NA 2 2 NA 2 NA 2 NA ...
#> ..- attr(*, "label")= chr "Now use hearing aid/ amplifier/ implant"
#> $ AUQ149A: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Now use a hearing aid"
#> $ AUQ149B: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Now use a personal sound amplifier"
#> $ AUQ149C: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Now have a cochlear implant"
#> $ AUQ153 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Past 2 weeks how often worn hearing aid"
#> $ AUQ630 : num [1:8897] NA NA NA 2 2 NA 2 NA 2 NA ...
#> ..- attr(*, "label")= chr "Ever worn hearing aid/amplifier/implant"
#> $ AUQ440 : num [1:8897] NA NA NA NA 1 NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Ever had Special Ed/Early Intervention"
#> $ AUQ450A: num [1:8897] NA NA NA NA 1 NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Had service for speech-language"
#> $ AUQ450B: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Had service for reading"
#> $ AUQ450C: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Had service for hearing/listening skills"
#> $ AUQ450D: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Had service for intellectual disability"
#> $ AUQ450E: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Had service for movement/mobility issues"
#> $ AUQ450F: num [1:8897] NA NA NA NA 6 NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Had service for other disabilities"
#> $ AUQ460 : num [1:8897] NA NA NA NA 2 NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Exposed to very loud noise 10+ hrs/wk"
#> $ AUQ470 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "How long exposed to loud noise 10+hrs/wk"
#> $ AUQ101 : num [1:8897] NA NA NA 5 NA NA 5 NA 5 NA ...
#> ..- attr(*, "label")= chr "Difficult to follow conversation w/noise"
#> $ AUQ110 : num [1:8897] NA NA NA 5 NA NA 5 NA 5 NA ...
#> ..- attr(*, "label")= chr "Hearing causes frustration when talking?"
#> $ AUQ480 : num [1:8897] NA NA NA 5 NA NA 5 NA 4 NA ...
#> ..- attr(*, "label")= chr "Avoid groups, limit social life?"
#> $ AUQ490 : num [1:8897] NA NA NA 2 NA NA 1 NA 2 NA ...
#> ..- attr(*, "label")= chr "Problem with dizziness, lightheadedness?"
#> $ AUQ191 : num [1:8897] NA NA NA 2 NA NA 2 NA 1 NA ...
#> ..- attr(*, "label")= chr "Ears ringing, buzzing past year?"
#> $ AUQ250 : num [1:8897] NA NA NA NA NA NA NA NA 1 NA ...
#> ..- attr(*, "label")= chr "How long bothered by ringing, buzzing?"
#> $ AUQ255 : num [1:8897] NA NA NA NA NA NA NA NA 5 NA ...
#> ..- attr(*, "label")= chr "In past yr how often had ringing/roaring"
#> $ AUQ260 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#> ..- attr(*, "label")= chr "Bothered by ringing after loud sounds?"
#> $ AUQ270 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#> ..- attr(*, "label")= chr "Bothered by ringing when going to sleep?"
#> $ AUQ280 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#> ..- attr(*, "label")= chr "How much of a problem is ringing?"
#> $ AUQ500 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#> ..- attr(*, "label")= chr "Discussed ringing with doctor?"
#> $ AUQ300 : num [1:8897] NA NA NA 2 NA NA 2 NA 2 NA ...
#> ..- attr(*, "label")= chr "Ever used firearms for any reason?"
#> $ AUQ310 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "How many total rounds ever fired?"
#> $ AUQ320 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "Wear hearing protection when shooting?"
#> $ AUQ330 : num [1:8897] NA NA NA 2 NA NA 2 NA 1 NA ...
#> ..- attr(*, "label")= chr "Ever had job exposure to loud noise?"
#> $ AUQ340 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#> ..- attr(*, "label")= chr "How long exposed to loud noise at work"
#> $ AUQ350 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#> ..- attr(*, "label")= chr "Ever exposed to very loud noise at work"
#> $ AUQ360 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "label")= chr "How long exposed to very loud noise?"
#> $ AUQ370 : num [1:8897] NA NA NA 2 NA NA 2 NA 1 NA ...
#> ..- attr(*, "label")= chr "Had off-work exposure to loud noise?"
#> $ AUQ510 : num [1:8897] NA NA NA NA NA NA NA NA 1 NA ...
#> ..- attr(*, "label")= chr "How long exposed to loud noise 10+hrs/wk"
#> $ AUQ380 : num [1:8897] NA NA NA 6 NA NA 5 NA 3 NA ...
#> ..- attr(*, "label")= chr "Past year: worn hearing protection?"
#> - attr(*, "label")= chr "Audiometry"
Name | nhanes_2018 |
Number of rows | 8897 |
Number of columns | 58 |
_______________________ | |
Column type frequency: | |
numeric | 58 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
SEQN | 0 | 1.00 | 98333.86 | 2671.90 | 93703 | 96017 | 98348 | 100645 | 102956 | ▇▇▇▇▇ |
AUQ054 | 0 | 1.00 | 1.79 | 2.02 | 1 | 1 | 2 | 2 | 99 | ▇▁▁▁▁ |
AUQ060 | 7234 | 0.19 | 1.49 | 1.13 | 1 | 1 | 1 | 2 | 9 | ▇▁▁▁▁ |
AUQ070 | 8293 | 0.07 | 1.46 | 1.02 | 1 | 1 | 1 | 2 | 9 | ▇▁▁▁▁ |
AUQ080 | 8683 | 0.02 | 1.36 | 1.01 | 1 | 1 | 1 | 2 | 9 | ▇▁▁▁▁ |
AUQ090 | 8842 | 0.01 | 1.62 | 1.52 | 1 | 1 | 1 | 2 | 9 | ▇▁▁▁▁ |
AUQ400 | 8291 | 0.07 | 7.12 | 12.62 | 1 | 4 | 6 | 7 | 99 | ▇▁▁▁▁ |
AUQ410A | 8813 | 0.01 | 34.83 | 46.87 | 1 | 1 | 1 | 99 | 99 | ▇▁▁▁▅ |
AUQ410B | 8816 | 0.01 | 2.00 | 0.00 | 2 | 2 | 2 | 2 | 2 | ▁▁▇▁▁ |
AUQ410C | 8879 | 0.00 | 3.00 | 0.00 | 3 | 3 | 3 | 3 | 3 | ▁▁▇▁▁ |
AUQ410D | 8871 | 0.00 | 4.00 | 0.00 | 4 | 4 | 4 | 4 | 4 | ▁▁▇▁▁ |
AUQ410E | 8889 | 0.00 | 5.00 | 0.00 | 5 | 5 | 5 | 5 | 5 | ▁▁▇▁▁ |
AUQ410F | 8879 | 0.00 | 6.00 | 0.00 | 6 | 6 | 6 | 6 | 6 | ▁▁▇▁▁ |
AUQ410G | 8807 | 0.01 | 7.00 | 0.00 | 7 | 7 | 7 | 7 | 7 | ▁▁▇▁▁ |
AUQ410H | 8709 | 0.02 | 8.00 | 0.00 | 8 | 8 | 8 | 8 | 8 | ▁▁▇▁▁ |
AUQ410I | 8585 | 0.04 | 9.00 | 0.00 | 9 | 9 | 9 | 9 | 9 | ▁▁▇▁▁ |
AUQ410J | 8871 | 0.00 | 10.00 | 0.00 | 10 | 10 | 10 | 10 | 10 | ▁▁▇▁▁ |
AUQ156 | 8291 | 0.07 | 1.80 | 0.40 | 1 | 2 | 2 | 2 | 2 | ▂▁▁▁▇ |
AUQ420 | 5544 | 0.38 | 1.55 | 0.71 | 1 | 1 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ430 | 7276 | 0.18 | 1.62 | 0.96 | 1 | 1 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ139 | 7276 | 0.18 | 1.95 | 0.79 | 1 | 2 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ144 | 5544 | 0.38 | 3.00 | 1.88 | 1 | 2 | 2 | 5 | 9 | ▇▃▃▁▁ |
AUQ147 | 5544 | 0.38 | 1.94 | 0.25 | 1 | 2 | 2 | 2 | 2 | ▁▁▁▁▇ |
AUQ149A | 8690 | 0.02 | 1.00 | 0.00 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
AUQ149B | 8889 | 0.00 | 2.00 | 0.00 | 2 | 2 | 2 | 2 | 2 | ▁▁▇▁▁ |
AUQ149C | 8893 | 0.00 | 3.00 | 0.00 | 3 | 3 | 3 | 3 | 3 | ▁▁▇▁▁ |
AUQ153 | 8682 | 0.02 | 3.65 | 0.91 | 1 | 3 | 4 | 4 | 5 | ▁▁▂▇▁ |
AUQ630 | 5759 | 0.35 | 1.99 | 0.12 | 1 | 2 | 2 | 2 | 2 | ▁▁▁▁▇ |
AUQ440 | 7181 | 0.19 | 1.79 | 0.44 | 1 | 2 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ450A | 8658 | 0.03 | 1.00 | 0.00 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
AUQ450B | 8742 | 0.02 | 2.00 | 0.00 | 2 | 2 | 2 | 2 | 2 | ▁▁▇▁▁ |
AUQ450C | 8859 | 0.00 | 3.00 | 0.00 | 3 | 3 | 3 | 3 | 3 | ▁▁▇▁▁ |
AUQ450D | 8849 | 0.01 | 4.00 | 0.00 | 4 | 4 | 4 | 4 | 4 | ▁▁▇▁▁ |
AUQ450E | 8855 | 0.00 | 5.00 | 0.00 | 5 | 5 | 5 | 5 | 5 | ▁▁▇▁▁ |
AUQ450F | 8819 | 0.01 | 6.00 | 0.00 | 6 | 6 | 6 | 6 | 6 | ▁▁▇▁▁ |
AUQ460 | 7181 | 0.19 | 1.99 | 0.27 | 1 | 2 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ470 | 8866 | 0.00 | 2.55 | 1.75 | 1 | 1 | 2 | 4 | 9 | ▇▆▁▁▁ |
AUQ101 | 7261 | 0.18 | 3.64 | 1.25 | 1 | 3 | 4 | 5 | 9 | ▃▇▅▁▁ |
AUQ110 | 7261 | 0.18 | 4.24 | 1.11 | 1 | 4 | 5 | 5 | 9 | ▂▅▇▁▁ |
AUQ480 | 7261 | 0.18 | 4.59 | 0.90 | 1 | 5 | 5 | 5 | 9 | ▁▂▇▁▁ |
AUQ490 | 7261 | 0.18 | 1.65 | 0.58 | 1 | 1 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ191 | 7261 | 0.18 | 1.84 | 0.45 | 1 | 2 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ250 | 8615 | 0.03 | 3.58 | 1.61 | 1 | 2 | 4 | 5 | 9 | ▆▆▇▁▁ |
AUQ255 | 8615 | 0.03 | 2.62 | 1.67 | 1 | 1 | 2 | 4 | 9 | ▇▃▃▁▁ |
AUQ260 | 8615 | 0.03 | 1.96 | 1.02 | 1 | 2 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ270 | 8615 | 0.03 | 1.70 | 0.64 | 1 | 1 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ280 | 8615 | 0.03 | 2.18 | 0.96 | 1 | 1 | 2 | 3 | 5 | ▆▇▆▁▁ |
AUQ500 | 8615 | 0.03 | 1.53 | 0.81 | 1 | 1 | 1 | 2 | 9 | ▇▁▁▁▁ |
AUQ300 | 7261 | 0.18 | 1.61 | 0.51 | 1 | 1 | 2 | 2 | 7 | ▇▁▁▁▁ |
AUQ310 | 8251 | 0.07 | 2.09 | 1.40 | 1 | 1 | 2 | 3 | 9 | ▇▃▁▁▁ |
AUQ320 | 8251 | 0.07 | 3.52 | 1.71 | 1 | 2 | 4 | 5 | 9 | ▆▃▇▁▁ |
AUQ330 | 7261 | 0.18 | 1.77 | 0.60 | 1 | 1 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ340 | 8412 | 0.05 | 4.99 | 2.14 | 1 | 3 | 6 | 7 | 7 | ▃▂▂▂▇ |
AUQ350 | 8412 | 0.05 | 1.42 | 0.69 | 1 | 1 | 1 | 2 | 9 | ▇▁▁▁▁ |
AUQ360 | 8600 | 0.03 | 5.82 | 7.94 | 1 | 4 | 6 | 7 | 99 | ▇▁▁▁▁ |
AUQ370 | 7261 | 0.18 | 1.89 | 0.36 | 1 | 2 | 2 | 2 | 9 | ▇▁▁▁▁ |
AUQ510 | 8712 | 0.02 | 2.75 | 1.40 | 1 | 1 | 3 | 4 | 9 | ▆▇▁▁▁ |
AUQ380 | 7261 | 0.18 | 4.83 | 1.18 | 1 | 5 | 5 | 5 | 6 | ▁▁▁▇▃ |
R Code 3.11 : Show raw research funding data for different kinds of research topics (2004-2015)
#> spc_tbl_ [30 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ Cause of Death : chr [1:30] "Heart disease" "Cancer" "Lung disease" "Cerebrovascular disease" ...
#> $ Mortality Rate per 100,000 Population: num [1:30] 201.5 186.4 44.7 43.9 26 ...
#> $ Publications : num [1:30] 358915 1078144 259196 136502 40179 ...
#> $ Funding : num [1:30] 1.29e+10 2.06e+10 1.09e+10 7.58e+09 5.93e+09 ...
#> $ Predicted Publications : num [1:30] 260875 248160 99170 98070 70085 ...
#> $ Publications (Studentized Residuals) : num [1:30] 0.27 1.25 0.75 0.26 -0.43 0.66 -0.11 -0.73 -0.13 0.02 ...
#> $ Predicted Funding : chr [1:30] "$12,075,841,121" "$11,415,379,454" "$4,065,188,916" "$4,014,434,824" ...
#> $ Funding (Studentized Residuals) : num [1:30] 0.04 0.33 0.52 0.34 0.4 0.92 -0.19 -0.64 -0.57 -0.7 ...
#> - attr(*, "spec")=
#> .. cols(
#> .. `Cause of Death` = col_character(),
#> .. `Mortality Rate per 100,000 Population` = col_double(),
#> .. Publications = col_double(),
#> .. Funding = col_double(),
#> .. `Predicted Publications` = col_number(),
#> .. `Publications (Studentized Residuals)` = col_double(),
#> .. `Predicted Funding` = col_character(),
#> .. `Funding (Studentized Residuals)` = col_double()
#> .. )
#> - attr(*, "problems")=<externalptr>
Name | research_funding |
Number of rows | 30 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Cause of Death | 0 | 1 | 3 | 44 | 0 | 30 | 0 |
Predicted Funding | 0 | 1 | 12 | 15 | 0 | 29 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Mortality Rate per 100,000 Population | 0 | 1 | 22.42 | 4.807000e+01 | 0.59 | 1.77 | 7.760000e+00 | 1.481000e+01 | 2.015400e+02 | ▇▁▁▁▁ |
Publications | 0 | 1 | 93913.90 | 2.028434e+05 | 1034.00 | 12550.00 | 3.949800e+04 | 5.406425e+04 | 1.078144e+06 | ▇▁▁▁▁ |
Funding | 0 | 1 | 4136597358.43 | 5.799997e+09 | 3474852.00 | 358026213.50 | 1.659781e+09 | 4.830357e+09 | 2.059661e+10 | ▇▂▁▁▁ |
Predicted Publications | 0 | 1 | 48162.03 | 6.143627e+04 | 6163.00 | 12448.50 | 3.221550e+04 | 4.880050e+04 | 2.608750e+05 | ▇▁▁▁▁ |
Publications (Studentized Residuals) | 0 | 1 | -0.01 | 1.060000e+00 | -2.63 | -0.36 | 1.200000e-01 | 7.400000e-01 | 1.460000e+00 | ▂▁▂▇▇ |
Funding (Studentized Residuals) | 0 | 1 | -0.02 | 1.090000e+00 | -3.71 | -0.53 | 3.400000e-01 | 5.800000e-01 | 1.920000e+00 | ▁▁▃▇▂ |
R Code 3.12 : Show raw data about guns manufactured in ten US (1986 - 2019) from USAFact.org website
#> spc_tbl_ [11 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ Years: chr [1:11] "Firearms manufactured (Items)" "By type of firearm" "Pistols (Items)" "Revolvers (Items)" ...
#> $ 1980 : chr [1:11] NA NA NA NA ...
#> $ 1981 : chr [1:11] NA NA NA NA ...
#> $ 1982 : chr [1:11] NA NA NA NA ...
#> $ 1983 : chr [1:11] NA NA NA NA ...
#> $ 1984 : chr [1:11] NA NA NA NA ...
#> $ 1985 : chr [1:11] NA NA NA NA ...
#> $ 1986 : chr [1:11] "3040934" NA "662973" "761414" ...
#> $ 1987 : num [1:11] 3559663 NA 964561 722512 857949 ...
#> $ 1988 : num [1:11] 3963877 NA 1101011 754744 928070 ...
#> $ 1989 : num [1:11] 4418393 NA 1404753 628573 935541 ...
#> $ 1990 : num [1:11] 3959968 NA 1371427 470495 848948 ...
#> $ 1991 : num [1:11] 3563106 NA 1378252 456966 828426 ...
#> $ 1992 : num [1:11] 4175836 NA 1669537 469413 1018204 ...
#> $ 1993 : num [1:11] 5055637 NA 2093362 562292 1144940 ...
#> $ 1994 : num [1:11] 5173217 NA 2004298 586450 1254926 ...
#> $ 1995 : num [1:11] 4316342 NA 1195284 527664 1173645 ...
#> $ 1996 : num [1:11] 3854439 NA 987528 498944 925732 ...
#> $ 1997 : num [1:11] 3593504 NA 1036077 370428 915978 ...
#> $ 1998 : num [1:11] 3713590 NA 960365 324390 868639 ...
#> $ 1999 : num [1:11] 4047747 NA 995446 335784 1106995 ...
#> $ 2000 : num [1:11] 3793541 NA 962901 318960 898442 ...
#> $ 2001 : num [1:11] 2932655 NA 626836 320143 679813 ...
#> $ 2002 : num [1:11] 3366895 NA 741514 347070 741325 ...
#> $ 2003 : num [1:11] 3308404 NA 811660 309364 726078 ...
#> $ 2004 : num [1:11] 3099025 NA 728511 294099 731769 ...
#> $ 2005 : num [1:11] 3241494 NA 803425 274205 709313 ...
#> $ 2006 : num [1:11] 3653324 NA 1021260 385069 714618 ...
#> $ 2007 : num [1:11] 3922613 NA 1219664 391334 645231 ...
#> $ 2008 : num [1:11] 4498944 NA 1609381 431753 630710 ...
#> $ 2009 : num [1:11] 5555818 NA 1868258 547195 752699 ...
#> $ 2010 : num [1:11] 5459240 NA 2258450 558927 743378 ...
#> $ 2011 : num [1:11] 6541886 NA 2598133 572857 862401 ...
#> $ 2012 : num [1:11] 8578610 NA 3487883 667357 949010 ...
#> $ 2013 : num [1:11] 10844792 NA 4441726 725282 1203072 ...
#> $ 2014 : num [1:11] 9050626 NA 3633454 744047 935411 ...
#> $ 2015 : num [1:11] 9358661 NA 3557199 885259 777273 ...
#> $ 2016 : num [1:11] 11497441 NA 4720075 856291 848617 ...
#> $ 2017 : num [1:11] 8327792 NA 3691010 720917 653139 ...
#> $ 2018 : num [1:11] 9052628 NA 3881158 664835 536126 ...
#> $ 2019 : num [1:11] 7011945 NA 3046013 580601 480735 ...
#> - attr(*, "spec")=
#> .. cols(
#> .. Years = col_character(),
#> .. `1980` = col_character(),
#> .. `1981` = col_character(),
#> .. `1982` = col_character(),
#> .. `1983` = col_character(),
#> .. `1984` = col_character(),
#> .. `1985` = col_character(),
#> .. `1986` = col_character(),
#> .. `1987` = col_double(),
#> .. `1988` = col_double(),
#> .. `1989` = col_double(),
#> .. `1990` = col_double(),
#> .. `1991` = col_double(),
#> .. `1992` = col_double(),
#> .. `1993` = col_double(),
#> .. `1994` = col_double(),
#> .. `1995` = col_double(),
#> .. `1996` = col_double(),
#> .. `1997` = col_double(),
#> .. `1998` = col_double(),
#> .. `1999` = col_double(),
#> .. `2000` = col_double(),
#> .. `2001` = col_double(),
#> .. `2002` = col_double(),
#> .. `2003` = col_double(),
#> .. `2004` = col_double(),
#> .. `2005` = col_double(),
#> .. `2006` = col_double(),
#> .. `2007` = col_double(),
#> .. `2008` = col_double(),
#> .. `2009` = col_double(),
#> .. `2010` = col_double(),
#> .. `2011` = col_double(),
#> .. `2012` = col_double(),
#> .. `2013` = col_double(),
#> .. `2014` = col_double(),
#> .. `2015` = col_double(),
#> .. `2016` = col_double(),
#> .. `2017` = col_double(),
#> .. `2018` = col_double(),
#> .. `2019` = col_double()
#> .. )
#> - attr(*, "problems")=<externalptr>
Name | guns_manufactured_2019 |
Number of rows | 11 |
Number of columns | 41 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 33 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Years | 2 | 0.82 | 8 | 29 | 0 | 9 | 0 |
1980 | 9 | 0.18 | 6 | 58 | 0 | 2 | 0 |
1981 | 9 | 0.18 | 6 | 35 | 0 | 2 | 0 |
1982 | 10 | 0.09 | 5 | 5 | 0 | 1 | 0 |
1983 | 9 | 0.18 | 3 | 51 | 0 | 2 | 0 |
1984 | 9 | 0.18 | 11 | 79 | 0 | 2 | 0 |
1985 | 9 | 0.18 | 4 | 7 | 0 | 2 | 0 |
1986 | 3 | 0.73 | 4 | 15 | 0 | 8 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
1987 | 5 | 0.55 | 1186554.3 | 1218822 | 6980 | 756371.2 | 911255.0 | 996886 | 3559663 | ▂▇▁▁▂ |
1988 | 5 | 0.55 | 1321292.3 | 1355892 | 35345 | 798075.5 | 1014540.5 | 1133783 | 3963877 | ▅▇▁▁▂ |
1989 | 5 | 0.55 | 1472797.7 | 1531902 | 42126 | 705315.0 | 1170147.0 | 1406738 | 4418393 | ▅▇▁▁▂ |
1990 | 5 | 0.55 | 1319989.3 | 1379959 | 57434 | 565108.2 | 1030306.0 | 1331486 | 3959968 | ▅▇▁▁▂ |
1991 | 5 | 0.55 | 1187702.0 | 1249591 | 15980 | 549831.0 | 855954.0 | 1254560 | 3563106 | ▅▇▁▁▂ |
1992 | 5 | 0.55 | 1391945.3 | 1473834 | 16849 | 602518.0 | 1010018.5 | 1506704 | 4175836 | ▅▇▁▁▂ |
1993 | 5 | 0.55 | 1685212.3 | 1783997 | 81349 | 707954.0 | 1159317.0 | 1863445 | 5055637 | ▇▇▃▁▃ |
1994 | 5 | 0.55 | 1724405.7 | 1821553 | 10936 | 753569.0 | 1285766.5 | 1832375 | 5173217 | ▅▇▁▁▂ |
1995 | 5 | 0.55 | 1438780.7 | 1502410 | 8629 | 689159.2 | 1184464.5 | 1357161 | 4316342 | ▅▇▁▁▂ |
1996 | 5 | 0.55 | 1284813.0 | 1346281 | 17920 | 605641.0 | 956630.0 | 1315118 | 3854439 | ▅▇▁▁▂ |
1997 | 5 | 0.55 | 1197834.7 | 1258599 | 19680 | 506815.5 | 976027.5 | 1197525 | 3593504 | ▅▇▁▁▂ |
1998 | 5 | 0.55 | 1237863.3 | 1321963 | 24506 | 460452.2 | 914502.0 | 1391859 | 3713590 | ▇▇▃▁▃ |
1999 | 5 | 0.55 | 1349249.0 | 1432202 | 39837 | 500699.5 | 1051220.5 | 1454012 | 4047747 | ▅▇▁▁▂ |
2000 | 5 | 0.55 | 1264513.7 | 1352038 | 30196 | 463830.5 | 930671.5 | 1428007 | 3793541 | ▇▇▃▁▃ |
2001 | 5 | 0.55 | 977551.7 | 1046414 | 21309 | 396816.2 | 653324.5 | 1133369 | 2932655 | ▇▇▃▁▃ |
2002 | 5 | 0.55 | 1122298.3 | 1207898 | 21700 | 445633.8 | 741419.5 | 1321843 | 3366895 | ▇▇▃▁▃ |
2003 | 5 | 0.55 | 1102801.3 | 1181269 | 30978 | 413542.5 | 768869.0 | 1275658 | 3308404 | ▇▇▃▁▃ |
2004 | 5 | 0.55 | 1033008.3 | 1105477 | 19508 | 402702.0 | 730140.0 | 1176796 | 3099025 | ▇▇▃▁▃ |
2005 | 5 | 0.55 | 1080498.0 | 1164096 | 23179 | 382982.0 | 756369.0 | 1274385 | 3241494 | ▇▇▃▁▃ |
2006 | 5 | 0.55 | 1217774.7 | 1295505 | 35872 | 467456.2 | 867939.0 | 1377694 | 3653324 | ▇▂▂▁▂ |
2007 | 5 | 0.55 | 1307537.7 | 1398868 | 55461 | 454808.2 | 932447.5 | 1513108 | 3922613 | ▇▂▂▁▂ |
2008 | 5 | 0.55 | 1499648.0 | 1608622 | 92564 | 481492.2 | 1120045.5 | 1703247 | 4498944 | ▇▅▁▁▂ |
2009 | 5 | 0.55 | 1851939.3 | 1986052 | 138815 | 598571.0 | 1310478.5 | 2153703 | 5555818 | ▇▅▁▁▂ |
2010 | 5 | 0.55 | 1819746.7 | 1962427 | 67929 | 605039.8 | 1286967.0 | 2151476 | 5459240 | ▇▂▂▁▂ |
2011 | 5 | 0.55 | 2180628.7 | 2345097 | 190407 | 645243.0 | 1590244.5 | 2528122 | 6541886 | ▇▅▁▁▂ |
2012 | 5 | 0.55 | 2859536.7 | 3103979 | 306154 | 737770.2 | 2058608.0 | 3407964 | 8578610 | ▇▅▁▁▂ |
2013 | 5 | 0.55 | 3614930.7 | 3923969 | 495142 | 844729.5 | 2591321.0 | 4326187 | 10844792 | ▇▅▁▁▂ |
2014 | 5 | 0.55 | 3016875.3 | 3270622 | 358165 | 791888.0 | 2157480.0 | 3569978 | 9050626 | ▇▅▁▁▂ |
2015 | 5 | 0.55 | 3119553.7 | 3378333 | 447131 | 804269.5 | 2221229.0 | 3658149 | 9358661 | ▇▅▁▁▂ |
2016 | 5 | 0.55 | 3832480.3 | 4158420 | 833123 | 850535.5 | 2547813.0 | 4599890 | 11497441 | ▇▅▁▁▂ |
2017 | 5 | 0.55 | 2775930.7 | 2984401 | 653139 | 730346.2 | 1631363.0 | 3394280 | 8327792 | ▇▅▁▁▂ |
2018 | 5 | 0.55 | 3017542.7 | 3243302 | 536126 | 771119.5 | 1985254.5 | 3631002 | 9052628 | ▇▅▁▁▂ |
2019 | 5 | 0.55 | 2337315.0 | 2488560 | 480735 | 672183.0 | 1452298.0 | 2773926 | 7011945 | ▇▅▁▁▂ |
R Code 3.13 : Show raw data of exported guns from the ATF PDF report (1998-2019)
#> chr [1:32, 1:7] "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993" ...
Name | guns_exported_2017 |
Number of rows | 32 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
V1 | 0 | 1 | 4 | 4 | 0 | 32 | 0 |
V2 | 0 | 1 | 6 | 7 | 0 | 32 | 0 |
V3 | 0 | 1 | 6 | 7 | 0 | 32 | 0 |
V4 | 0 | 1 | 6 | 7 | 0 | 32 | 0 |
V5 | 0 | 1 | 6 | 7 | 0 | 32 | 0 |
V6 | 0 | 1 | 3 | 6 | 0 | 32 | 0 |
V7 | 0 | 1 | 7 | 7 | 0 | 32 | 0 |
R Code 3.14 : Show raw data of imported guns from the ATF PDF report (1986-2018)
#> chr [1:33, 1:5] "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993" ...
Name | guns_imported_2018 |
Number of rows | 33 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
V1 | 0 | 1 | 4 | 4 | 0 | 33 | 0 |
V2 | 0 | 1 | 7 | 7 | 0 | 33 | 0 |
V3 | 0 | 1 | 7 | 9 | 0 | 33 | 0 |
V4 | 0 | 1 | 7 | 9 | 0 | 33 | 0 |
V5 | 0 | 1 | 7 | 9 | 0 | 33 | 0 |
Example 3.3 : Recode data for chapter 3
R Code 3.15 : Recode FBI deaths data 2012-2016
fbi_deaths <- base::readRDS("data/chap03/fbi_deaths.rds")
fbi_deaths_clean <- fbi_deaths |>
dplyr::slice(3:7) |>
tidyr::pivot_longer(
cols = -Weapons,
values_to = "deaths",
names_to = "year"
) |>
dplyr::rename(weapons = Weapons)
base::saveRDS(fbi_deaths_clean,
"data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean
#> # A tibble: 25 × 3
#> weapons year deaths
#> <chr> <chr> <dbl>
#> 1 Handguns 2012 6404
#> 2 Handguns 2013 5782
#> 3 Handguns 2014 5673
#> 4 Handguns 2015 6569
#> 5 Handguns 2016 7105
#> 6 Rifles 2012 298
#> 7 Rifles 2013 285
#> 8 Rifles 2014 258
#> 9 Rifles 2015 258
#> 10 Rifles 2016 374
#> # ℹ 15 more rows
R Code 3.16 : Recode gun use variable AUQ300
from NHANES data 2011-2012
## load data
nhanes_2012 <- base::readRDS("data/chap03/nhanes_2012.rds")
## recode data
nhanes_2012_clean1 <- nhanes_2012 |>
dplyr::mutate(AUQ300 =
dplyr::na_if(x = AUQ300, y = 7)) |>
dplyr::mutate(AUQ300 =
dplyr::na_if(x = AUQ300, y = 9)) |>
## see my note in the text under the code
# tidyr::drop_na() |>
dplyr::mutate(AUQ300 = forcats::as_factor(AUQ300)) |>
dplyr::mutate(AUQ300 =
forcats::fct_recode(AUQ300,
"Yes" = "1",
"No" = "2")
) |>
dplyr::rename(gun_use = AUQ300) |>
dplyr::relocate(gun_use)
gun_use_2012 <- nhanes_2012_clean1 |>
dplyr::count(gun_use)|>
dplyr::mutate(percent = round(n / sum(n), 2) * 100)
glue::glue("Result calculated manually with `dplyr::count()` and `dplyr::mutate()`")
#> Result calculated manually with `dplyr::count()` and `dplyr::mutate()`
gun_use_2012
#> gun_use n percent
#> 1 Yes 1613 17
#> 2 No 3061 33
#> 3 <NA> 4690 50
glue::glue(" ")
glue::glue("*******************************************************************")
#> *******************************************************************
glue::glue("Result calculated with `janitor::tabyl()` and `janitor::adorn_pct_formating()`")
#> Result calculated with `janitor::tabyl()` and `janitor::adorn_pct_formating()`
nhanes_2012_clean1 |>
janitor::tabyl(gun_use) |>
janitor::adorn_pct_formatting()
#> gun_use n percent valid_percent
#> Yes 1613 17.2% 34.5%
#> No 3061 32.7% 65.5%
#> <NA> 4690 50.1% -
For recoding the levels of the categorical variable I have looked up the appropriate passage in the codebook (see: Graph 3.1).
With the last line dplyr::relocate(gun_use)
I put the column gun_use
to the front of the data frame. If neither the .before
nor the .after
argument of the function are supplied the column will move columns to the left-hand side of the data frame. So it is easy to find for visual inspections via the RStudio data frame viewer.
When to remove the NA’s?
It would be not correct to remove the NA’s here in the recoding code chunk, because this would remove the rows with missing values from the gun_use
variable across the whole data frame! This implies that values of other variable that are not missing would be removed too. It is correct to remove the NA’s when the output for the analysis (graph or table) is prepared via the pipe operator without changing the stored data.
R Code 3.17 : Recode rounds fired variable AUQ310
from NHANES data 2011-2012
nhanes_2012_clean2 <- nhanes_2012_clean1 |>
dplyr::mutate(AUQ310 = forcats::as_factor(AUQ310)) |>
dplyr::mutate(AUQ310 = forcats::fct_recode(AUQ310,
"1 to less than 100" = "1",
"100 to less than 1000" = "2",
"1000 to less than 10k" = "3",
"10k to less than 50k" = "4",
"50k or more" = "5",
"Refused to answer" = "7",
"Don't know" = "9")
) |>
dplyr::rename(rounds_fired = AUQ310) |>
dplyr::relocate(rounds_fired, .after = gun_use)
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `AUQ310 = forcats::fct_recode(...)`.
#> Caused by warning:
#> ! Unknown levels in `f`: 7
#> rounds_fired prop n
#> 1 1 to less than 100 43 701
#> 2 100 to less than 1000 26 423
#> 3 1000 to less than 10k 18 291
#> 4 10k to less than 50k 7 106
#> 5 50k or more 4 66
#> 6 Don't know 2 26
I got a warning about a unknown level 7
because no respondent refused an answer. But this is not important. I could either choose to not recode level 7
or turn warning off in the chunk option — or simply to ignore the warning.
R Code 3.18 : Recode sex variable RIAGENDR
from NHANES data 2011-2012
nhanes_2012_clean3 <- nhanes_2012_clean2 |>
dplyr::mutate(RIAGENDR = forcats::as_factor(RIAGENDR)) |>
dplyr::mutate(RIAGENDR = forcats::fct_recode(RIAGENDR,
"Male" = '1',
"Female" = '2')
) |>
dplyr::rename(sex = RIAGENDR) |>
dplyr::relocate(sex, .after = rounds_fired)
nhanes_2012_clean3 |>
janitor::tabyl(sex) |>
janitor::adorn_pct_formatting()
#> sex n percent
#> Male 4663 49.8%
#> Female 4701 50.2%
R Code 3.19 : Recode ear plugs variable AUQ320
from NHANES data 2011-2012
nhanes_2012_clean4 <- nhanes_2012_clean3 |>
dplyr::mutate(AUQ320 = forcats::as_factor(AUQ320)) |>
dplyr::mutate(AUQ320 = forcats::fct_recode(AUQ320,
"Always" = "1",
"Usually" = "2",
"About half the time" = "3",
"Seldom" = "4",
"Never" = "5",
# "Refused to answer" = "7", # nobody refused
"Don't know" = "9")
) |>
dplyr::rename(ear_plugs = AUQ320) |>
dplyr::relocate(ear_plugs, .after = gun_use)
base::saveRDS(nhanes_2012_clean4,
"data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean4 |>
janitor::tabyl(ear_plugs) |>
janitor::adorn_pct_formatting()
#> ear_plugs n percent valid_percent
#> Always 583 6.2% 36.1%
#> Usually 152 1.6% 9.4%
#> About half the time 123 1.3% 7.6%
#> Seldom 110 1.2% 6.8%
#> Never 642 6.9% 39.8%
#> Don't know 3 0.0% 0.2%
#> <NA> 7751 82.8% -
R Code 3.20 : Recode guns exported from a PDF report by the ATF
guns_exported_2017 <- base::readRDS("data/chap03/guns_exported_2017.rds")
lookup_export <- c(Year = "V1",
Pistols = "V2",
Revolvers = "V3",
Rifles = "V4",
Shotguns = "V5",
Misc = "V6",
Total = "V7")
guns_exported_clean <- dplyr::rename(
tibble::as_tibble(guns_exported_2017),
dplyr::all_of(lookup_export)
) |>
## comma separated character columns to numeric
dplyr::mutate(dplyr::across(2:7,
function(x)
{ base::as.numeric(as.character(base::gsub(",", "", x))) })
) |>
## add Pistols and Revolvers to Handguns
dplyr::mutate(Handguns = Pistols + Revolvers) |>
## specify the same order for all three data frames
dplyr::select(-c(Pistols, Revolvers)) |>
dplyr::relocate(c(Year, Handguns, Rifles,
Shotguns, Misc, Total))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> tibble [32 × 6] (S3: tbl_df/tbl/data.frame)
#> $ Year : chr [1:32] "1986" "1987" "1988" "1989" ...
#> $ Handguns: num [1:32] 121082 159552 131859 118464 180218 ...
#> $ Rifles : num [1:32] 37224 42161 53896 73247 71834 ...
#> $ Shotguns: num [1:32] 58943 76337 68699 67559 104250 ...
#> $ Misc : num [1:32] 199 9995 2728 2012 5323 ...
#> $ Total : num [1:32] 217448 288045 257182 261282 361625 ...
Name | guns_exported_clean |
Number of rows | 32 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Year | 0 | 1 | 4 | 4 | 0 | 32 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Handguns | 0 | 1 | 130703.47 | 60973.09 | 39081 | 81605.00 | 138057.5 | 172895.25 | 297100 | ▇▃▇▁▁ |
Rifles | 0 | 1 | 86827.50 | 37709.88 | 37224 | 62492.25 | 77941.0 | 92641.50 | 207934 | ▆▇▁▂▁ |
Shotguns | 0 | 1 | 64981.97 | 37306.30 | 18797 | 36113.00 | 56324.5 | 87122.00 | 171475 | ▇▅▃▁▁ |
Misc | 0 | 1 | 8057.84 | 7633.02 | 199 | 2505.50 | 5689.0 | 10987.25 | 34022 | ▇▃▂▁▁ |
Total | 0 | 1 | 290570.78 | 99914.95 | 139920 | 204422.50 | 281317.5 | 380893.75 | 488300 | ▇▆▅▆▃ |
I have recoded the data in several ways:
- I turned the resulted matrices from the {**tabulizer**} package into tibbles.
- Now I could rename all the columns with one named vector `lookup_export`.
- All the columns are of character type. Before I could change the appropriate column to numeric I had to remove the comma for big numbers, otherwise the base::as-numeric() function would not have worked.
- I added `Pistols` and `Revolvers` to `Handguns`, because the dataset about imported guns have only this category.
- Finnaly I relocated the columns to get the same structure in all data frames.
R Code 3.21 : Recode guns imported from a PDF by the ATF
guns_imported_2018 <- base::readRDS("data/chap03/guns_imported_2018.rds")
lookup_import <- c(Year = "V1",
Shotguns = "V2",
Rifles = "V3",
Handguns = "V4",
Total = "V5")
## reduce the reported period for all file from 1980 to 2017
guns_imported_clean <- dplyr::rename(
tibble::as_tibble(guns_imported_2018),
dplyr::all_of(lookup_import)
) |>
## comma separated character columns to numeric
dplyr::mutate(dplyr::across(2:5,
function(x)
{ base::as.numeric(as.character(base::gsub(",", "", x))) })
) |>
dplyr::slice(1:dplyr::n() - 1) |>
dplyr::mutate(Misc = 0) |>
dplyr::relocate(c(Year, Handguns, Rifles,
Shotguns, Misc, Total))
save_data_file("chap03", guns_imported_clean, "guns_imported_clean.rds")
utils::str(guns_imported_clean)
skimr::skim(guns_imported_clean)
#> tibble [32 × 6] (S3: tbl_df/tbl/data.frame)
#> $ Year : chr [1:32] "1986" "1987" "1988" "1989" ...
#> $ Handguns: num [1:32] 231000 342113 621620 440132 448517 ...
#> $ Rifles : num [1:32] 269000 413780 282640 293152 203505 ...
#> $ Shotguns: num [1:32] 201000 307620 372008 274497 191787 ...
#> $ Misc : num [1:32] 0 0 0 0 0 0 0 0 0 0 ...
#> $ Total : num [1:32] 701000 1063513 1276268 1007781 843809 ...
Name | guns_imported_clean |
Number of rows | 32 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Year | 0 | 1 | 4 | 4 | 0 | 32 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Handguns | 0 | 1 | 1228289.6 | 959074.1 | 231000 | 486461.0 | 858514 | 1739603.2 | 3671837 | ▇▂▂▁▁ |
Rifles | 0 | 1 | 613776.0 | 390186.3 | 198191 | 290524.0 | 556201 | 797873.2 | 1592522 | ▇▅▃▁▂ |
Shotguns | 0 | 1 | 433837.8 | 233098.7 | 106296 | 239432.2 | 417866 | 570714.2 | 973465 | ▇▆▆▃▂ |
Misc | 0 | 1 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | ▁▁▇▁▁ |
Total | 0 | 1 | 2275903.5 | 1419964.9 | 701000 | 1049580.0 | 1895880 | 3095591.8 | 5539539 | ▇▂▂▂▂ |
Here applies similar note as in the previous tab:
- I turned the resulted matrices from the {**tabulizer**} package into tibbles.
- Now I could rename all the columns with one named vector `lookup_import`.
- All the columns are of character type. Before I could change the appropriate column to numeric I had to remove the comma for big numbers, otherwise the base::as-numeric() function would not have worked.
- I relocated the columns to get the same structure in all data frames.
R Code 3.22 : Recode firearms production dataset 1986-2019 from USAFact.org
guns_manufactured_2019 <- base::readRDS("data/chap03/guns_manufactured_2019.rds")
guns_manufactured_2019_clean <- guns_manufactured_2019 |>
dplyr::select(-c(2:7)) |>
dplyr::slice(c(1,3:7)) |>
## strange: `1986` is a character variable
dplyr::mutate(`1986` = as.numeric(`1986`))
base::saveRDS(guns_manufactured_2019_clean,
"data/chap03/guns_manufactured_2019_clean.rds")
utils::str(guns_manufactured_2019_clean)
skimr::skim(guns_manufactured_2019_clean)
#> tibble [6 × 35] (S3: tbl_df/tbl/data.frame)
#> $ Years: chr [1:6] "Firearms manufactured (Items)" "Pistols (Items)" "Revolvers (Items)" "Shotguns (Items)" ...
#> $ 1986 : num [1:6] 3040934 662973 761414 641482 4558 ...
#> $ 1987 : num [1:6] 3559663 964561 722512 857949 6980 ...
#> $ 1988 : num [1:6] 3963877 1101011 754744 928070 35345 ...
#> $ 1989 : num [1:6] 4418393 1404753 628573 935541 42126 ...
#> $ 1990 : num [1:6] 3959968 1371427 470495 848948 57434 ...
#> $ 1991 : num [1:6] 3563106 1378252 456966 828426 15980 ...
#> $ 1992 : num [1:6] 4175836 1669537 469413 1018204 16849 ...
#> $ 1993 : num [1:6] 5055637 2093362 562292 1144940 81349 ...
#> $ 1994 : num [1:6] 5173217 2004298 586450 1254926 10936 ...
#> $ 1995 : num [1:6] 4316342 1195284 527664 1173645 8629 ...
#> $ 1996 : num [1:6] 3854439 987528 498944 925732 17920 ...
#> $ 1997 : num [1:6] 3593504 1036077 370428 915978 19680 ...
#> $ 1998 : num [1:6] 3713590 960365 324390 868639 24506 ...
#> $ 1999 : num [1:6] 4047747 995446 335784 1106995 39837 ...
#> $ 2000 : num [1:6] 3793541 962901 318960 898442 30196 ...
#> $ 2001 : num [1:6] 2932655 626836 320143 679813 21309 ...
#> $ 2002 : num [1:6] 3366895 741514 347070 741325 21700 ...
#> $ 2003 : num [1:6] 3308404 811660 309364 726078 30978 ...
#> $ 2004 : num [1:6] 3099025 728511 294099 731769 19508 ...
#> $ 2005 : num [1:6] 3241494 803425 274205 709313 23179 ...
#> $ 2006 : num [1:6] 3653324 1021260 385069 714618 35872 ...
#> $ 2007 : num [1:6] 3922613 1219664 391334 645231 55461 ...
#> $ 2008 : num [1:6] 4498944 1609381 431753 630710 92564 ...
#> $ 2009 : num [1:6] 5555818 1868258 547195 752699 138815 ...
#> $ 2010 : num [1:6] 5459240 2258450 558927 743378 67929 ...
#> $ 2011 : num [1:6] 6541886 2598133 572857 862401 190407 ...
#> $ 2012 : num [1:6] 8578610 3487883 667357 949010 306154 ...
#> $ 2013 : num [1:6] 10844792 4441726 725282 1203072 495142 ...
#> $ 2014 : num [1:6] 9050626 3633454 744047 935411 358165 ...
#> $ 2015 : num [1:6] 9358661 3557199 885259 777273 447131 ...
#> $ 2016 : num [1:6] 11497441 4720075 856291 848617 833123 ...
#> $ 2017 : num [1:6] 8327792 3691010 720917 653139 758634 ...
#> $ 2018 : num [1:6] 9052628 3881158 664835 536126 1089973 ...
#> $ 2019 : num [1:6] 7011945 3046013 580601 480735 946929 ...
Name | guns_manufactured_2019_cl… |
Number of rows | 6 |
Number of columns | 35 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Years | 0 | 1 | 14 | 29 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
1986 | 0 | 1 | 1013644.7 | 1044520 | 4558 | 646854.8 | 712193.5 | 918233.8 | 3040934 | ▂▇▁▁▂ |
1987 | 0 | 1 | 1186554.3 | 1218822 | 6980 | 756371.2 | 911255.0 | 996886.0 | 3559663 | ▂▇▁▁▂ |
1988 | 0 | 1 | 1321292.3 | 1355892 | 35345 | 798075.5 | 1014540.5 | 1133783.0 | 3963877 | ▅▇▁▁▂ |
1989 | 0 | 1 | 1472797.7 | 1531902 | 42126 | 705315.0 | 1170147.0 | 1406738.2 | 4418393 | ▅▇▁▁▂ |
1990 | 0 | 1 | 1319989.3 | 1379959 | 57434 | 565108.2 | 1030306.0 | 1331486.2 | 3959968 | ▅▇▁▁▂ |
1991 | 0 | 1 | 1187702.0 | 1249591 | 15980 | 549831.0 | 855954.0 | 1254559.5 | 3563106 | ▅▇▁▁▂ |
1992 | 0 | 1 | 1391945.3 | 1473834 | 16849 | 602518.0 | 1010018.5 | 1506703.8 | 4175836 | ▅▇▁▁▂ |
1993 | 0 | 1 | 1685212.3 | 1783997 | 81349 | 707954.0 | 1159317.0 | 1863445.0 | 5055637 | ▇▇▃▁▃ |
1994 | 0 | 1 | 1724405.7 | 1821553 | 10936 | 753569.0 | 1285766.5 | 1832375.2 | 5173217 | ▅▇▁▁▂ |
1995 | 0 | 1 | 1438780.7 | 1502410 | 8629 | 689159.2 | 1184464.5 | 1357161.0 | 4316342 | ▅▇▁▁▂ |
1996 | 0 | 1 | 1284813.0 | 1346281 | 17920 | 605641.0 | 956630.0 | 1315118.2 | 3854439 | ▅▇▁▁▂ |
1997 | 0 | 1 | 1197834.7 | 1258599 | 19680 | 506815.5 | 976027.5 | 1197525.0 | 3593504 | ▅▇▁▁▂ |
1998 | 0 | 1 | 1237863.3 | 1321963 | 24506 | 460452.2 | 914502.0 | 1391858.8 | 3713590 | ▇▇▃▁▃ |
1999 | 0 | 1 | 1349249.0 | 1432202 | 39837 | 500699.5 | 1051220.5 | 1454012.5 | 4047747 | ▅▇▁▁▂ |
2000 | 0 | 1 | 1264513.7 | 1352038 | 30196 | 463830.5 | 930671.5 | 1428006.8 | 3793541 | ▇▇▃▁▃ |
2001 | 0 | 1 | 977551.7 | 1046414 | 21309 | 396816.2 | 653324.5 | 1133368.8 | 2932655 | ▇▇▃▁▃ |
2002 | 0 | 1 | 1122298.3 | 1207898 | 21700 | 445633.8 | 741419.5 | 1321843.0 | 3366895 | ▇▇▃▁▃ |
2003 | 0 | 1 | 1102801.3 | 1181269 | 30978 | 413542.5 | 768869.0 | 1275658.0 | 3308404 | ▇▇▃▁▃ |
2004 | 0 | 1 | 1033008.3 | 1105477 | 19508 | 402702.0 | 730140.0 | 1176795.8 | 3099025 | ▇▇▃▁▃ |
2005 | 0 | 1 | 1080498.0 | 1164096 | 23179 | 382982.0 | 756369.0 | 1274385.2 | 3241494 | ▇▇▃▁▃ |
2006 | 0 | 1 | 1217774.7 | 1295505 | 35872 | 467456.2 | 867939.0 | 1377693.8 | 3653324 | ▇▂▂▁▂ |
2007 | 0 | 1 | 1307537.7 | 1398868 | 55461 | 454808.2 | 932447.5 | 1513108.2 | 3922613 | ▇▂▂▁▂ |
2008 | 0 | 1 | 1499648.0 | 1608622 | 92564 | 481492.2 | 1120045.5 | 1703247.2 | 4498944 | ▇▅▁▁▂ |
2009 | 0 | 1 | 1851939.3 | 1986052 | 138815 | 598571.0 | 1310478.5 | 2153702.8 | 5555818 | ▇▅▁▁▂ |
2010 | 0 | 1 | 1819746.7 | 1962427 | 67929 | 605039.8 | 1286967.0 | 2151476.5 | 5459240 | ▇▂▂▁▂ |
2011 | 0 | 1 | 2180628.7 | 2345097 | 190407 | 645243.0 | 1590244.5 | 2528121.8 | 6541886 | ▇▅▁▁▂ |
2012 | 0 | 1 | 2859536.7 | 3103979 | 306154 | 737770.2 | 2058608.0 | 3407963.8 | 8578610 | ▇▅▁▁▂ |
2013 | 0 | 1 | 3614930.7 | 3923969 | 495142 | 844729.5 | 2591321.0 | 4326187.0 | 10844792 | ▇▅▁▁▂ |
2014 | 0 | 1 | 3016875.3 | 3270622 | 358165 | 791888.0 | 2157480.0 | 3569977.8 | 9050626 | ▇▅▁▁▂ |
2015 | 0 | 1 | 3119553.7 | 3378333 | 447131 | 804269.5 | 2221229.0 | 3658149.0 | 9358661 | ▇▅▁▁▂ |
2016 | 0 | 1 | 3832480.3 | 4158420 | 833123 | 850535.5 | 2547813.0 | 4599890.0 | 11497441 | ▇▅▁▁▂ |
2017 | 0 | 1 | 2775930.7 | 2984401 | 653139 | 730346.2 | 1631363.0 | 3394280.5 | 8327792 | ▇▅▁▁▂ |
2018 | 0 | 1 | 3017542.7 | 3243302 | 536126 | 771119.5 | 1985254.5 | 3631002.5 | 9052628 | ▇▅▁▁▂ |
2019 | 0 | 1 | 2337315.0 | 2488560 | 480735 | 672183.0 | 1452298.0 | 2773926.5 | 7011945 | ▇▅▁▁▂ |
R Code 3.23 : Recode the guns manufactured data frame to get an appropriate structure to add the imported and subtract the exported data.
guns_manufactured_clean <- base::readRDS("data/chap03/guns_manufactured_2019_clean.rds")
lookup_manufactured <- c(
Total = "Firearms manufactured (Items)",
Pistols = "Pistols (Items)",
Rifles = "Rifles (Items)",
Shotguns = "Shotguns (Items)",
Revolvers = "Revolvers (Items)",
Misc = "Misc. Firearms (Items)"
)
guns_manufactured_clean <- guns_manufactured_clean |>
dplyr::select(-c(`2018`, `2019`)) |>
tidyr::pivot_longer(
cols = !Years,
names_to = "Year",
values_to = "count"
) |>
tidyr::pivot_wider(
names_from = Years,
values_from = count
) |>
dplyr::rename(
dplyr::all_of(lookup_manufactured)
) |>
dplyr::mutate(Handguns = Pistols + Revolvers) |>
dplyr::select(-c(Pistols, Revolvers)) |>
dplyr::relocate(c(Year, Handguns, Rifles,
Shotguns, Misc, Total))
save_data_file("chap03", guns_manufactured_clean, "guns_manufactured_clean.rds")
utils::str(guns_manufactured_clean)
skimr::skim(guns_manufactured_clean)
#> tibble [32 × 6] (S3: tbl_df/tbl/data.frame)
#> $ Year : chr [1:32] "1986" "1987" "1988" "1989" ...
#> $ Handguns: num [1:32] 1424387 1687073 1855755 2033326 1841922 ...
#> $ Rifles : num [1:32] 970507 1007661 1144707 1407400 1211664 ...
#> $ Shotguns: num [1:32] 641482 857949 928070 935541 848948 ...
#> $ Misc : num [1:32] 4558 6980 35345 42126 57434 ...
#> $ Total : num [1:32] 3040934 3559663 3963877 4418393 3959968 ...
Name | guns_manufactured_clean |
Number of rows | 32 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Year | 0 | 1 | 4 | 4 | 0 | 32 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Handguns | 0 | 1 | 2294575.4 | 1305428 | 946979 | 1387554.2 | 1838570.0 | 2696084.8 | 5576366 | ▇▂▁▂▁ |
Rifles | 0 | 1 | 1815044.8 | 910422 | 883482 | 1276250.8 | 1463938.5 | 1935129.8 | 4239335 | ▇▂▁▁▁ |
Shotguns | 0 | 1 | 864117.9 | 172169 | 630710 | 730346.2 | 853448.5 | 935443.5 | 1254926 | ▇▅▅▁▂ |
Misc | 0 | 1 | 134637.4 | 216279 | 4558 | 19637.0 | 35608.5 | 104126.8 | 833123 | ▇▁▁▁▁ |
Total | 0 | 1 | 5108375.4 | 2388668 | 2932655 | 3585904.5 | 4005812.0 | 5483384.5 | 11497441 | ▇▂▁▂▁ |
R Code 3.24 : Manufactured guns - exported guns + imported guns
As a first approximation to get guns data circulated in the US I have taken the manufactured numbers, subtracted the exported guns and added the imported guns.
I have already listed some reasons why the above result is still not the real amount of circulated guns per year in the US (see: Remark 3.1)
There are different possible classification of graph types. In the book Harris uses as major criteria the types and numbers of variables. This is very sensitive subject orientated arrangement addressed at the statistics novice with the main question: What type of graph should I use for my data?
The disadvantage of the subject oriented selection criteria is that there some graph types (e.g. the bar chart) that appear several times under different headings. Explaining the graph types is therefore somewhat redundant on the one hand and piecemeal on the other hand.
Another classification criteria would be the type of the graph itself. Under this pattern one could concentrate of the special features for each graph type. One of these features would be their applicability referring the variable types.
Example 3.4 : Five different categorization approaches
You can see the redundancy when you categorize the graph types by variable types. But the advantage is, that your choice of graph is driven by essential statistical aspects.
This chapter will follow the book and therefore it will present the same order in the explication of the graphs as in the book outlined.
This is a very abridged list of graphs used for presentation of statistical results. Although it is just a tiny selection of possible graphs it contains those graphs that are most important, most used and most well know types.
Although this list has only 8 categories it is in contrast to Bullet List 3.2 a more complete list of different graphs. It features also not so well known graph types. Besides a miscellaneous category where the members of this group do not share a common feature the graph are sorted in categorization schema that has — with the exception of bar charts — no redundancy, e.g. is almost a taxonomy of graph types.
This is the same variable oriented approach as in the book but with much more details and differentiation. It is cluttered with redundancies but should be helpful for selecting an appropriate graph type for your data analysis. And the interactive style on the web allows for a much better orientation as implemented in the above list.
There are several options for visualizing a single categorical variable:
Example 3.5 : Creating Bar Charts for Firearm Usage
R Code 3.25 : Bar charts for gun use (NHANES 2011-2012) with different width of bars
## bar chart: bars with wide width
p_normal <- nhanes_2012_clean1 |>
dplyr::select(gun_use) |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use)
) +
ggplot2::geom_bar() +
ggplot2::labs(x = "Gun use",
y = "Number of participants") +
ggplot2::theme_minimal()
## bar chart: bars with small width
p_small <- nhanes_2012_clean1 |>
dplyr::select(gun_use) |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use)
) +
ggplot2::geom_bar(width = 0.4) +
ggplot2::theme_minimal() +
ggplot2::theme(aspect.ratio = 4/1) +
ggplot2::labs(x = "Gun use",
y = "Number of participants")
## display both charts side by side
gridExtra::grid.arrange(p_normal, p_small, ncol = 2)
width
argument inside the ggplot2::geom_bar()
function because that would create unproportional wide space between the two bars. One need to apply the aspect ratio for the used theme as well. In this case all commands to the theme (e.g. my ggplot2::the_bw()
) has to come before the aspect.ratio
argument. One has to try out which aspect ratio works best.R Code 3.26 : Bar charts for gun use (NHANES 2011-2012) with different colorizing methods
## bar chart: filled colors within aes() (data controlled)
p_fill_in <- nhanes_2012_clean1 |>
dplyr::select(gun_use) |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
ggplot2::geom_bar(
ggplot2::aes(fill = gun_use), width = 0.4) +
ggplot2::theme_bw() +
ggplot2::theme(legend.position = "none") +
ggplot2::theme(aspect.ratio = 3/1) +
ggplot2::labs(x = "Gun use",
y = "Number of participants",
subtitle = "Filled inside \naes()")
## bar chart: filled colors outside aes() (manually controlled)
p_fill_out <- nhanes_2012_clean1 |>
dplyr::select(gun_use) |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
# ggplot2::theme(legend.position = "none") +
ggplot2::geom_bar(width = 0.4, fill = c("darkred", "steelblue")) +
ggplot2::theme_bw() +
ggplot2::theme(aspect.ratio = 3/1) +
ggplot2::labs(x = "Gun use",
y = "Number of participants",
subtitle = "Filled outside \naes() my colors")
## ## bar chart: fill = data controlled by my own colors
p_fill_in_my_colors <- nhanes_2012_clean1 |>
dplyr::select(gun_use) |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
ggplot2::geom_bar(
ggplot2::aes(fill = gun_use), width = 0.4) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("darkred", "steelblue"), guide = "none") +
ggplot2::theme(aspect.ratio = 3/1) +
ggplot2::labs(x = "Gun use",
y = "Number of participants",
subtitle = "Filled inside \nwith my colors")
## bar chart: manually controlled colors with my own color
p_fill_out_my_colors <- nhanes_2012_clean1 |>
dplyr::select(gun_use) |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
ggplot2::geom_bar(width = 0.4, fill = c("darkred", "steelblue")) +
ggplot2::theme_bw() +
ggplot2::theme(aspect.ratio = 3/1) +
ggplot2::labs(x = "Gun use",
y = "Number of participants",
subtitle = "Filled outside \nwith my colors")
## patchwork with :: syntax ############################
## display all 4 charts side by side
## using the trick from
## https://github.com/thomasp85/patchwork/issues/351#issuecomment-1931140157
patchwork:::"|.ggplot"(
p_fill_in,
p_fill_out
)
patchwork:::"|.ggplot"(
p_fill_in_my_colors,
p_fill_out_my_colors
)
# library(patchwork)
# p_fill_in | p_fill_out |
# p_fill_in_my_colors | p_fill_out_my_colors
aes()
and is therefore data controlled. This means that the colors will be settled automatically by factor level.aes()
and is therefore manually controlled. One needs to supply colors otherwise one gets a graph without any colors at all.aes()
and is therefore data controlled, you can change the color composition. But you has also the responsibility to provide a correct legend — or as I have done in this example — to remove the legend from the display. (The argument guide = FALSE
as used in the book is superseded with guide = "none"
)aes()
with specified colors.I used {patchwork} here to show all four example graphs (see Section A.62). As always I didn’t want to use the base::library()
function to load and attach the package. But I didn’t know how to do this with the {patchwork} operators. Finally I asked my question on StackOverflow and received as answer the solution.
At first I tried it with the +
operator. But that produced two very small graphs in the first row of the middle of the display area, and other two small graphs in the second row of the middle of the display area. Beside this ugly display the text of the subtitle was also truncated. After some experimentation I learned that I had to use the |
operator.
Pie charts show parts of a whole. The pie, or circle, represents the whole. The slices of pie shown in different colors represent the parts. A similar graph type is the but they are not recommended” for several reasons
Resource 3.2 : Why you shouldn’t use pie charts
But there are some cases, where pie chart (or donut charts sometimes also called ring chart) are appropriate:
Visualize an important number by highlighting just one junk of the circle
BTW: Donut charts are even worse than pie charts:
The middle of the pie is gone. The middle of the pie … where the angle is established, which is what humans would look at to try to determine wedge size, proportion, and data. Even though we aren’t accurate at interpreting angles, the situation is made worse when we remove the middle of the pie. Now we are left to judge curvature and … compare wedges by both curvature and angle (Evergreen 2019, 32).
Use a very limited number of wedges (best not more than two) for making a clear point.
Example 3.6 : Creating Pie & Donut Charts for Firearm Usage
R Code 3.27 : Visualize percentage of gun user from NHANES survey 2011-2012
lab <- "<span style='font-size:36pt; color:white'>**35%**</span>"
gun_use_2012 |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = '', y = percent)) +
ggplot2::geom_col(fill = c("#fa8324", "grey")) +
ggplot2::coord_polar(theta = 'y') +
ggtext::geom_richtext(ggplot2::aes(
x = 1.1, y = 8,
label = lab),
fill = "#fa8324",
label.colour = "#fa8324") +
ggplot2::annotate("text", x = .7, y = 10,
label = " used firearms",
color = "white",
size = 6) +
ggplot2::theme_void()
#> Warning in ggtext::geom_richtext(ggplot2::aes(x = 1.1, y = 8, label = lab), : All aesthetics have length 1, but the data has 2 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#> a single row.
The most important code line to create a pie graph is ggplot2::coord_polar(theta = 'y')
. In the concept of gg
(grammar of graphics) a car chart and a pie chart are — with the exception of the above code line — identical (C 2010).
Beside the ggplot2::annotate()
function for text comments inside graphics I had for to get the necessary formatting options for the big number also to use {ggtext}, one of 132 registered {ggplot2} extensions. {ggtext} enables the rendering of complex formatted plot labels (see Section A.33).
For training purposes I tried to create exactly the same pattern (color, text size etc.) of a pie chart as in Graph 3.5 (a).
R Code 3.28 : Ever used firearms for any reason? (NHANES survey 2011-2012)
gun_use_2012 |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = '', y = percent,
fill = forcats::fct_rev(gun_use))
) +
ggplot2::geom_col() +
ggplot2::geom_text(ggplot2::aes(
label = gun_use),
color = "white",
position = ggplot2::position_stack(vjust = 0.5),
size = 10) +
ggplot2::coord_polar(theta = 'y') +
ggplot2::theme_void() +
ggplot2::theme(legend.position = "none") +
ggplot2::labs(x = '', y = '') +
viridis::scale_fill_viridis(
discrete = TRUE,
option = "turbo",
begin = 0.1,
end = 0.9)
I have used {viridis} to produce colorblind-friendly color maps (see Section A.96). Instead of using as the first default color yellow I have chosen with the color map options and the begin/end argument, what color should appear for this binary variable.
R Code 3.29 : Donut chart with small hole
# Small hole
hsize <- 1
gun_use_small_hole <- gun_use_2012 |>
dplyr::mutate(x = hsize)
gun_use_small_hole |>
tidyr::drop_na() |>
dplyr::mutate(x = hsize) |>
ggplot2::ggplot(
ggplot2::aes(x = hsize, y = percent,
fill = forcats::fct_rev(gun_use))
) +
ggplot2::geom_col() +
ggplot2::coord_polar(theta = 'y') +
ggplot2::xlim(c(0.2, hsize + 0.5)) +
ggplot2::theme_void() +
ggplot2::labs(x = '', y = '', fill = "Gun used?") +
viridis::scale_fill_viridis(
breaks = c('Yes', 'No'),
discrete = TRUE,
option = "turbo",
begin = 0.1,
end = 0.9)
R Code 3.30 : Donut chart with big hole
hsize <- 2
gun_use_big_hole <- gun_use_2012 |>
dplyr::mutate(x = hsize)
gun_use_big_hole |>
tidyr::drop_na() |>
dplyr::mutate(x = hsize) |>
ggplot2::ggplot(
ggplot2::aes(x = hsize, y = percent,
fill = forcats::fct_rev(gun_use))
) +
ggplot2::geom_col() +
ggplot2::coord_polar(theta = 'y') +
ggplot2::xlim(c(0.2, hsize + 0.5)) +
ggplot2::theme_void() +
ggplot2::labs(x = '', y = '', fill = "Gun used?") +
viridis::scale_fill_viridis(
breaks = c('Yes', 'No'),
discrete = TRUE,
option = "turbo",
begin = 0.1,
end = 0.9)
It is a fact that pie chart are still very popular. I recently found for instance a pie chart in one of the most prestigious German weekly newspaper Die Zeit the following pie chart about the financing of the United Nations Relief and Works Agency for Palestine Refugees in the Near East (UNRWA)
The figure is colorful and looks nice. More important is that is has all the important information (Country names and amount of funding) written as labels into the graphic. It seems that this a good example for a pie chart, a kind of an exception to the rule.
But if we get into the details, we see that the graph was a twofold simplification of the complete data set. Twofold because besides the overall ranking of 98 sponsors there is a simplified ranking list of the top 20 donors.The problem with pie charts (and also with waffle charts) is that you can’t use them with many categories.
So it was generally a good choice of “Die Zeit” to contract financiers with less funding and to concentrate to the biggest sponsors. But this has the disadvantage not to get a complete picture that is especially cumbersome for political decisions. As a result we have a huge group of miscellaneous funding sources (2nd in the ranking!) that hides many countries that are important to understand the political commitment im the world for the UNRWA.
But let’s see how this graphic would be appear in alternative charts
Experiment 3.1 : Top 20 UNRWA Donors in 2022
R Code 3.31 : Get UNRWA data of top 20 donors from PDF file with {tabulizer}
R Code 3.32 : Recode and show UNRWA data of top 20 donors 2022
unrwa_donors <- base::readRDS("data/chap03/unrwa_donors.rds")
base::options(scipen = 999)
unrwa_donor_ranking <- unrwa_donors |>
dplyr::rename(
donor = "V1",
total = "V6"
) |>
dplyr::select(donor, total) |>
dplyr::mutate(donor = forcats::as_factor(donor)) |>
dplyr::mutate(donor = forcats::fct_recode(donor,
Spain = "Spain (including Regional Governments)*",
Belgium = "Belgium (including Government of Flanders)",
Kuwait = "Kuwait (including Kuwait Fund for Arab Economic Development)"
)
) |>
dplyr::slice(6:25) |>
dplyr::mutate(dplyr::across(2,
function(x)
{ base::as.numeric(as.character(base::gsub(",", "", x))) })
)
utils::str(unrwa_donor_ranking)
skimr::skim(unrwa_donor_ranking)
#> Warning: There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
#> mangled_skimmers$funs)`.
#> ℹ In group 0: .
#> Caused by warning:
#> ! There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
#> mangled_skimmers$funs)`.
#> Caused by warning in `sorted_count()`:
#> ! Variable contains value(s) of "" that have been converted to "empty".
#> tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
#> $ donor: Factor w/ 23 levels "","Donor","USA",..: 3 4 5 6 7 8 9 10 11 12 ...
#> $ total: num [1:20] 343937718 202054285 114199150 60969987 34180677 ...
Name | unrwa_donor_ranking |
Number of rows | 20 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
donor | 0 | 1 | FALSE | 20 | USA: 1, Ger: 1, EU: 1, Swe: 1 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total | 0 | 1 | 52728341 | 82086354 | 10500000 | 15363671 | 24456320 | 31159321 | 343937718 | ▇▁▁▁▁ |
R Code 3.33 : Top 20 UNRWA donors in 2022
unrwa_donor_ranking |>
ggplot2::ggplot(
ggplot2::aes(
x = stats::reorder(x = donor, X = total),
y = total / 1e6)
) +
ggplot2::geom_point(size = 3) +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Donor",
y = "Donation (in Million US Dollars)"
)
This graph is not as spectacular as “Die Zeit” figure, but carries more information. But what is much more important:
To adapt a quotation from Statistical Consulting Centre in its article Why you shouldn’t use pie charts:
Some people might say this graph is boring, but as Edward Tufte warns in his book Envisioning information:
“Cosmetic decoration, which frequently distorts the data, will never salvage an underlying lack of content.”
Why is my point chart so much better?
- We can estimate the quantities – the gridlines are helpful here.
- We can immediately see the order of the donors.
- We can accurately compare the different donors.
Example 3.7 : Creating Waffle Charts
R Code 3.34 : Creating a waffle chart for number of total rounds fired (NHANES survey 2011-2012)
In contrast to the example in the book I have used percentages and not absolute numbers.
To emulate the percentage view of a pie chart, a 10x10 grid should be used with each square representing 1% of the total. (waffle homepage)
Another advantage: Using percentages I can compare 2011-2012 with 2017-2018 (see Graph 3.16 (a))
R Code 3.35 : Creating a waffle chart for number of total rounds fired (NHANES survey 2017-2018)
nhanes_2018 <- readRDS("data/chap03/nhanes_2018.rds")
rounds_fired_2018 <- nhanes_2018 |>
dplyr::select(AUQ310) |>
tidyr::drop_na() |>
dplyr::mutate(AUQ310 = forcats::as_factor(AUQ310)) |>
dplyr::mutate(AUQ310 = forcats::fct_recode(AUQ310,
"1 to less than 100" = "1",
"100 to less than 1000" = "2",
"1000 to less than 10k" = "3",
"10k to less than 50 k" = "4",
"50k or more" = "5",
"Refused to answer" = "7",
"Don't know" = "9")
) |>
dplyr::rename(rounds_fired = AUQ310)
fired_2018 <- rounds_fired_2018 |>
dplyr::count(rounds_fired) |>
dplyr::mutate(prop = round(n / sum(n), 2) * 100) |>
dplyr::relocate(n, .after = dplyr::last_col())
(
waffle_plot <-
waffle::waffle(parts = fired_2018,
rows = 10,
colors = c("lightblue1", "lightsteelblue1",
"deepskyblue1", "dodgerblue3", "black",
"gold1", "lemonchiffon1"))
)
The number of different levels of the factor variable is almost too high to realize at one glance the differences of the various categories.
Best practices suggest keeping the number of categories small, just as should be done when creating pie charts. (Create Waffle Charts Visualization)
Compare 2011-2012 with 2017-2018 (see Graph 3.15 (a)). You see there is just a small difference: Respondents in the 2017-2018 survey have fired tiny less rounds as the people asked in the 2011-2012 survey. Generally speaking: The fired total of rounds remains more or less constant during the period 2012 - 2018.
R Code 3.36 : Compare the total rounds fired between the NHANES survey participants 2011-2012 and 2017-2018
fired_df <- dplyr::full_join(x = fired_2012,
y = fired_2018,
by = dplyr::join_by(rounds_fired)
)
fired_df <- fired_df |>
dplyr::rename(
"Rounds fired" = rounds_fired,
`2012(%)` = prop.x,
`n (2012)` = n.x,
`2018(%)` = prop.y,
`n (2018)` = n.y,
) |>
dplyr::mutate(`Diff (%)` = `2012(%)` - `2018(%)`)
fired_df
#> Rounds fired 2012(%) n (2012) 2018(%) n (2018) Diff (%)
#> 1 1 to less than 100 43 701 45 289 -2
#> 2 100 to less than 1000 26 423 24 154 2
#> 3 1000 to less than 10k 18 291 20 131 -2
#> 4 10k to less than 50k 7 106 NA NA NA
#> 5 50k or more 4 66 2 15 2
#> 6 Don't know 2 26 2 10 0
#> 7 10k to less than 50 k NA NA 7 45 NA
#> 8 Refused to answer NA NA 0 2 NA
The participants of the NHANES survey 2011-2012 and 2017-2018 fired almost the same numbers of total rounds. The participants in 2017-2018 fired just a tiny amount of bullets less.
R Code 3.37 : Creating a waffle chart for number of total rounds fired (NHANES survey 2017-2018) with cividis
color scale
nhanes_2018 <- readRDS("data/chap03/nhanes_2018.rds")
rounds_fired_2018 <- nhanes_2018 |>
dplyr::select(AUQ310) |>
tidyr::drop_na() |>
dplyr::mutate(AUQ310 = forcats::as_factor(AUQ310)) |>
dplyr::mutate(AUQ310 = forcats::fct_recode(AUQ310,
"1 to less than 100" = "1",
"100 to less than 1000" = "2",
"1000 to less than 10k" = "3",
"10k to less than 50 k" = "4",
"50k or more" = "5",
"Refused to answer" = "7",
"Don't know" = "9")
) |>
dplyr::rename(rounds_fired = AUQ310)
fired_2018 <- rounds_fired_2018 |>
dplyr::count(rounds_fired) |>
dplyr::mutate(prop = round(n / sum(n), 2) * 100) |>
dplyr::relocate(n, .after = dplyr::last_col())
waffle::waffle(parts = fired_2018,
rows = 10,
colors = c("#BCAF6FFF", "#A69D75FF",
"#918C78FF", "#7C7B78FF",
"#6A6C71FF", "#575C6DFF", "#414D6BFF")
)
cividis
color scale
In contrast to Graph 3.16 (a) — where I have used individual choices of different colors without any awareness of color blindness or black & white printing — here I have used the color blindness friendly cividis
palette form the {viridis} package. Read more about my reflections about choosing color palettes in Section 3.9.1.
Options are:
Bullet List
Histograms and density plots are very similar to each other and show the overall shape of the data. These two types of graphs are especially useful in determining whether or not a variable has a normal distribution.
Boxplots show the central tendency and spread of the data, which is another way to determine whether a variable is normally distributed or skewed.
Violin plots are also useful when looking at a continuous variable and are like a combination of boxplots and density plots. Violin plots are commonly used to examine the distribution of a continuous variable for different levels (or groups) of a factor (or categorical) variable.
Experiment 3.2 : Histograms of research funding (2004-2015) with 10 and 30 bins
R Code 3.38 : Histogram of research funding (2004-2015) with 10 bins
p_histo_funding <- research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_histogram(bins = 10,
fill = "grey90",
color = "black") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Number of causes") +
ggplot2::theme_bw()
p_histo_funding
R Code 3.39 : Histogram of research funding (2004-2015) with 30 bins
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_histogram(bins = 30,
fill = "grey90",
color = "black") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Number of research topics ('Causes of Deadth')") +
ggplot2::theme_bw()
With histograms it is important to play around with the number of bins
. This changes the appearance of histograms sometimes quite profoundly. The default of 30 bins displays a warning that one should choose a better value with the argument binwidth
(another option to control the number of bins.)
It is not quite clear for me what would be the optimal number of bins of a given data set. It was easy for Graph 2.11: There were only 30 different values, each for one. So to provide the same number of bins as number of observed days (= 30) was a sensible choice.
There is not much difference in the case of 10 or 30 bins of Experiment 3.2. A big difference would be for example when the number of modes is changing, or the mode is moving far to another value. It seems to me that with a density plot it is simpler to choose the optimal curve (even if I do not understand the underlying rationale of this kernel density estimation (KDE) procedure).
Experiment 3.3 : Density plot of research funding (2004-2015) with different bandwidth
R Code 3.40 : Density plot with standard bandwidth bw = "nrd0"
p_dens_funding <- research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "grey90",
color = "black") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
p_dens_funding
This is the density plot without changing the default bandwidth. It uses nrd0
as bandwidth. One can see that it is somewhat similar to bw = 1.5
(see Graph 3.19).
bw.nrd0
implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman’s ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31))) unless the quartiles coincide when a positive result will be guaranteed. (Quoted form the help file of stats::bandwidth()
)R Code 3.41 : Density plot with bandwidth of 0.5
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "#7d70b6",
color = "black",
bw = .5) +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
I have replicated Figure 3.26 without knowing what bw = 0.5
means and how it is computed.
: Density plot of research funding (2004-2015) with bandwidth of 1.5
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "#7d70b6",
color = "black",
bw = 1.5) +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
I have replicated Figure 3.27 without knowing what bw = 1.5
means and how it is computed. This is the figure that was chosen in the book as appropriate to represent the data distribution. It is very similar to the {ggplot2} standard version (Graph 3.17), where neither bw
nor kernel
was changed.
R Code 3.42 : Density plot with bandwidth selector nrd0
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "grey90",
color = "black",
bw = "nrd0") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
Here I have stated expressively that bw = "nrd0"
. This is the default value that implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It is appropriate for normal-like distributions.
R Code 3.43 : Density plot with bandwidth selector SJ
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "grey90",
color = "black",
bw = "SJ") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
#> Warning: Computation failed in `stat_density()`.
#> Caused by error in `precompute_bw()`:
#> ! `bw` must be one of "nrd0", "nrd", "ucv", "bcv", "sj", "sj-ste", or
#> "sj-dpi", not "SJ".
#> ℹ Did you mean "sj"?
bw = "SJ"
select the bandwidth using pilot estimation of derivatives and is appropriate for multimodal or general non-normal distribution (od 2018).
R Code 3.44 : Density plot with bandwidth ucv
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "grey90",
color = "black",
bw = "ucv") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
#> Warning in stats::bw.ucv(x): minimum occurred at one end of the range
ucv
is one of the two extremes. It chooses a very small bandwidth. (The other extreme selector is bcv
which chooses very a wide bandwidth.)
R Code 3.45 : Density plot with bandwith bcv
research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_density(fill = "grey90 ",
color = "black",
bw = "bcv") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
y = "Probability density") +
ggplot2::theme_bw()
#> Warning in stats::bw.bcv(x): minimum occurred at one end of the range
bcv
is one of the two extremes. It chooses a very wide bandwidth. (The other extreme selector is ucv
which chooses a very narrow bandwidth.)
The equivalent of binwidth
in histograms is bw
(smoothing bandwidth) in density plots. I have to confess that I do not understand all the relevant factors to choose the optimal bandwidth for density plots (or binwidth for histograms).
In density plots there is also the kernel
argument for choosing the appropriate smoothing kernel. I learned from the video by od webel (2018) that the chosen kernel (gaussian”, the standard, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “cosine” or “optcosine”) is not so important because the result all in similar distributions. Most important, however, is to choose the appropriate bandwidth.
It is said that one disadvantage of density plots is its feature to smooth out the distribution so that you cannot see anymore — in contrast to histograms — if there are gaps in the data. But if you choose a very small bandwidth like ucv
then you get a distribution similar to a histogram.
Resource 3.3 : Information about kernel density estimation and bandwidth
Much of my understanding about bandwidth and kernel density estimation comes from the video Intro to Kernel Density Estimation by od webel (2018).
Another article on this complex topic: What is Kernel Density Estimation?. From this article I learned that a kernel is nothing else as a weighting function to estimate the probability density function (PDF). (Until now I had a sketchy impression about this concept derived from another meaning of “kernel”, related with the core part of a computer.)
Histograms and density plots are great for examining the overall shape of the data for a continuous variable. Boxplots in contrasts are useful for identifying the middle value and the boundaries around the middle half of the data.
R Code 3.46 : Box plot of research funding (2004-2015)
p_box_funding <- research_funding |>
ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
ggplot2::geom_boxplot(fill = "grey90",
color = "black") +
ggplot2::labs(x = "Research funding (2004-2015) in billions dollar") +
ggplot2::theme_bw()
p_box_funding
Each of the three presented plot types (histogram, density plot and box plot) has it strength and disadvantage.
All three graphs show the right skew clearly, while the histogram and boxplot show gaps in the data toward the end of the tail. The boxplot is the only one of the three that clearly identifies the central tendency and spread of the variable.
R Code 3.47 : Comparison of histogram, density plot and box plot
gridExtra::grid.arrange(p_histo_funding,
p_dens_funding,
p_box_funding,
nrow = 3)
Mosaic plots also called Marimekko charts or Mekko charts show the relative size of groups across two or more variables. It gives an overview of the data and makes it possible to recognize relationships between different variables. For example, independence is shown when the boxes across categories all have the same areas.
Mosaic plots are the multidimensional extension of spineplots, which graphically display the same information for only one variable.
Bullet List
Example 3.8 : Mosaic chart of gun use, rounds fired, ear plugs and by sex (NHANES survey 2011-2012)
R Code 3.48 : Firearm use by sex in the US among 2011–2012 NHANES participants
nhanes_2012_clean <- readRDS("data/chap03/nhanes_2012_clean.rds")
library(ggmosaic)
library(ggplot2)
nhanes_2012_clean |>
tidyr::drop_na(c(gun_use, sex)) |>
ggplot2::ggplot() +
ggmosaic::geom_mosaic(ggplot2::aes(
x = ggmosaic::product(gun_use, sex),
fill = gun_use)
) +
ggplot2::theme_classic() +
ggplot2::labs(x = "Sex",
y = "Ever used a gun?",
fill = "Gun use") +
ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE)) +
ggplot2::scale_fill_viridis_d(
alpha = 1,
begin = 0.15,
end = 0.35,
direction = -1,
option = "turbo",
aesthetics = "fill"
)
Better use of mosaic charts when variables have different levels with different proportions
Graph 3.27 is a bad example for a mosaic chart: One of its features (that the area of the columns reflect the proportion of the level) does not shine in this example because both levels (men and women) have approximately the same proportion (49.8 vs. 50.2%). Although it is better to have variables with several levels.
WATCH OUT! Loading and attaching {ggmosaic} is mandatory
Just calling the {ggmosaic} functions with the ::
syntax results into an error. I have to expressively load and attach the package with library(ggmosiac)
. But the error only appears when I render the whole document; running just the R code chunk works. I suppose that this results from a misbehavior of the {ggmosaic} code.
There is also another warning with {ggmosaic} version 0.33: One should use tidyr::unite(
) instead of tidyr::unite_()
since {tidyr} version 1.2.0. This is a missing adaption in the {ggmosaic} package as the warning message pointed out. I could get rid of this warning message by installing the GitHub version 0.34 as recommended in this GitHub post.
R Code 3.49 : Rounds fired by sex in the United States among 2011–2012 NHANES participants
nhanes_2012_clean |>
dplyr::mutate(rounds_fired =
dplyr::na_if(rounds_fired, "Don't know")) |>
tidyr::drop_na(c(rounds_fired, sex)) |>
dplyr::mutate(rounds_fired =
forcats::fct_drop(rounds_fired)) |>
ggplot2::ggplot() +
ggmosaic::geom_mosaic(ggplot2::aes(
x = ggmosaic::product(rounds_fired, sex),
fill = rounds_fired)
) +
ggplot2::theme_classic() +
ggplot2::labs(x = "Sex",
y = "Total number of rounds fired",
fill = "Rounds fired") +
ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE)) +
ggplot2::scale_fill_viridis_d(
alpha = 1, # alpha does not work!
begin = .25,
end = .75,
direction = -1,
option = "cividis"
)
This is already a more informative chart as Graph 3.27. It shows that women are not only less using a gun, those they do, they fire less round than men.
But mosaic charts really shine when both variables have several levels as shown in Graph 3.29.
I have experimented with several approaches to provide a specific scale. In this case the color friendly cividis
scale from the {viridis} package. From the 5 different scaling option I have 4 commented out. See the code for the details.
R Code 3.50 : Number of rounds fired by wearing ear plugs among the 2011–2012 NHANES participants
nhanes_2012_clean |>
dplyr::mutate(rounds_fired =
dplyr::na_if(rounds_fired, "Don't know")) |>
dplyr::mutate(ear_plugs =
dplyr::na_if(ear_plugs, "Don't know")) |>
tidyr::drop_na(c(rounds_fired, ear_plugs)) |>
dplyr::mutate(rounds_fired =
forcats::fct_drop(rounds_fired)) |>
dplyr::mutate(ear_plugs =
forcats::fct_drop(ear_plugs)) |>
ggplot2::ggplot() +
ggmosaic::geom_mosaic(ggplot2::aes(
x = ggmosaic::product(rounds_fired, ear_plugs),
fill = rounds_fired)
) +
ggplot2::theme_bw() +
ggplot2::theme(axis.text.x =
ggplot2::element_text(angle = 45,
vjust = 1,
hjust = 1)) +
ggplot2::labs(x = "Wearing ear plugs",
y = "Total number of rounds fired",
fill = "Rounds fired") +
ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE)) +
viridis::scale_fill_viridis(
discrete = TRUE,
option = "turbo",
direction = -1)
The chart clearly shows that a part (more than 1/3 of the respondent) never use ear plugs, even if they fire many rounds.
R Code 3.51 : Firearm use by sex in the US among 2011–2012 NHANES participants using {vcd}
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
test <- nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na()
vcd::mosaic(~ gun_use + sex,
data = test,
highlighting = "gun_use",
highlighting_fill = c("#5fefbd", "#7298f6"),
highlighting_direction = "top",
direction = c("h", "v"),
main = "Firearm use by sex using {vcd}")
I have tried to replicate Graph 3.27 with the vcd::mosaic()
function. But I noticed that there are several facilities of {ggplot2} not available (for instance
Because the vcd::mosaic()
function is too complex (for my) and requires a new package to learn I will stick with {ggplot2} approaches, even if I have the impression that there are some programming glitches with {ggmosaic}. (See for instance the discussion in the post forum).
Even in all the examples I found on the Internet (see for instance Creating a Mosaic Plot in R Graphics Cookbook (Chang 2024) I could not find a solution for my problems mentioned above.
My own adaptions to the mosaic figures
In contrast to the book I have used three additional adaptions for the figures produced with the {ggmosaic} package:
alpha
for setting the transparency value does not work with {viridis}.)Concerning color friendly palettes I have added several experiments with different packages in Section 3.9.1.
Resource 3.4 {vcd} / {vcdExtra} are other packages for mosaic graphs (and more)
The packages {vcd} and {vcdExtra} are specialized package for the visualization of categorical data. I have tried it out, but its usage for the mosaic function is more complex and need time to learn.
{vcd} and {vcdExtra} are support packages for the book, “Discrete Data Analysis with R” (Friendly and Meyer 2015) (see Section A.95). There is also a website DDAR (abbreviation of the book title) and a course description with additional material. There are several introductory vignettes to learn different aspects of the two packages.
So if you are interested to learn more about visualization of categorical data besides the mosaic graph then DDAR would be the right place.
There are two different kinds of bar chars: stacked bar charts and grouped bar charts:
ggplot2::geom_bar()
if you are providing the records (rows) and geom_bar()
has to calculate the number of cases in each group (as we have already done in @#sec-chap03-bar-chart-1) orggplot2::geom_col()
if you have already counted the number of cases in each groups and provide these summarized numbers to geom_col()
.During working on Section 3.6.1 I have added several experiments for color friendly palettes (see Section 3.9.1). Here I will use these lesson learned and changed the default {ggplot2} color palette to the base R Okabe-Ito
palette, one of the available safe colorblind palettes. (Another recommended choice are the eight {viridis} palettes, especially the viridis::cividis()
palette)
Experiment 3.4 : Working with colorblind friendly color palettes
R Code 3.52 : Choose two colors from a color friendly palette
my_2_colors <- grDevices::palette.colors(2, palette = "Okabe-Ito", alpha = .8, recycle = FALSE)
# my_2_colors <- paletteer::paletteer_c("viridis::cividis", n = 10)
# my_2_colors <- viridis::cividis(2, direction = -1, alpha = .8)
colorblindcheck::palette_check(my_2_colors, plot = TRUE)
#> name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1 normal 2 64.73729 1 1 64.73729 64.73729 64.73729
#> 2 deuteranopia 2 64.73729 1 1 67.24857 67.24857 67.24857
#> 3 protanopia 2 64.73729 1 0 60.15494 60.15494 60.15494
#> 4 tritanopia 2 64.73729 1 0 62.72480 62.72480 62.72480
Checking the color distances with {colorblindcheck} we see a very great distance (more than 60) not only in the normal vision palette but also with all CVD palettes. (See for additional details of how to use and interpret results of the {colorblindcheck} in Section 3.9.1.3.3).
There are several options to define a new color palette. I have here chosen three approaches, but used only one (e.g., the other two options I have commented out). I have applied the base R color vision deficiency friendly palette Okabe-Ito
.
For the next few graphs I need only two colors. Normally it would not change the appearance if I define a color palette with more color. The first two would always the same. But there is one exception: When I need to reverse the two color used — as I have done in Listing / Output 3.18 — then the reverse function is applied to the complete color palette and results in a different color composition.
This problem could be overcome with the {ggokabeito} package, where you could manually define the order of the colors (see Section A.28). Additionally it provide the Okabe Ito
scale for direct use in {ggplot}.
Nevertheless I haven’t use more than two colors. The categorical variable in Graph 3.35 is an ordered variable (0 to 50k rounds fired) but the Okabe Ito
palette is with its distinct discrete colors not well suited to display ordered variables.
R Code 3.53 : Change the {ggplot2} default color palette
## change 2 ggplot2 colors #############
options(
ggplot2.discrete.colour = my_2_colors,
ggplot2.discrete.fill = my_2_colors
)
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
p1 <- nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = sex, fill = gun_use)) +
ggplot2::geom_bar() +
ggplot2::theme_minimal() +
ggplot2::theme(legend.position = "none") +
ggplot2::theme(aspect.ratio = 4/1) +
ggplot2::labs(x = "Gun use",
y = "Number of respondents")
p2 <- nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
ggplot2::ggplot(ggplot2::aes(x = sex, fill = gun_use)) +
ggplot2::geom_bar(position = "dodge") +
ggplot2::theme_minimal() +
ggplot2::theme(legend.position = "none") +
ggplot2::theme(aspect.ratio = 2/1) +
ggplot2::labs(x = "Gun use",
y = "Number of respondents")
patchwork:::"-.ggplot"(p1, p2)
## restore 2 ggplot2 colors #############
# options(
# ggplot2.discrete.color = NULL,
# ggplot2.discrete.fill = NULL
# )
I have changed the default colors of the {ggplot2} palette with two discrete colors using options(ggplot2.discrete.fill = my_2_colors)
. To restore the default {ggplot2} color palette use options(ggplot2.discrete.fill = NULL)
.
Stacked bar charts (like pie & waffle charts) show parts of a whole and have similar problems as pie charts and mosaic plots. If there are many groups or groups of similar size, stacked bar charts are difficult to interpret and not recommended.
Example 3.9 : Stacked bar charts for two categorical variables
R Code 3.54 : Gun use by sex (using raw data = geom_bar()
)
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use, fill = sex)
) +
ggplot2::geom_bar() +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Gun use",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Gun use"
))
geom_bar()
R Code 3.55 : Gun use by sex (using summarized data = geom_col()
)
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
janitor::tabyl(gun_use, sex) |>
tidyr::pivot_longer(
cols = !gun_use,
names_to = "sex",
values_to = "count"
) |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use,
y = count,
fill = sex)
) +
ggplot2::geom_col(position =
ggplot2::position_stack(reverse = TRUE)) +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Gun use",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Gun use",
reverse = TRUE
)) +
ggplot2::scale_fill_discrete(
type = base::rev(getOption("ggplot2.discrete.fill")))
geom_col()
)
geom_col()
)
To get exactly the same appearance with geom_col()
as with geom_bar()
in Graph 3.32 I had to add three important adaptions:
To get the correct structure of the data I had — after the summation with janitor::tabyl()
— to apply tidyr::pivot_longer()
.
To get the same order of stacked bars (starting with female) I had to add inside ggplot2::geom_col()
the argument position = ggplot2::position_stack(reverse = TRUE)
.
To get the same order of colors I had to reverse the color order in ggplot2::scale_fill_discrete()
with type = base::rev(getOption("ggplot2.discrete.fill"))
.
R Code 3.56 : Stacked bar chart: Gun use by sex in percent of total respondents
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use, fill = sex,
y = 100 * ggplot2::after_stat(count)
/ sum(ggplot2::after_stat(count)))
) +
## adornments
ggplot2::geom_bar() +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Gun use",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Gun use"
))
WATCH OUT! This bar chart is not correct.
Instead of displaying parts of whole (100% in each group) it relates its grouped value to the total amount of observations. This lead to a misleading perception of the relationships if the groups have different sizes. For instance in this graph male using and not using guns have approximately the same percentage (about 25%). But this is not correct: Almost 75% of the men are using fire arms.
R Code 3.57 : Proportional stacked bar chart: Gun use by sex in percent of grouped respondents
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
# prepare for proportional stacked bar
dplyr::group_by(gun_use, sex) |>
dplyr::count() |>
## pick the variable that will add # to 100%
dplyr::group_by(gun_use) |>
## compute percents within chosen variable
dplyr::mutate(percent = 100 * (n / sum(n))) |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use,
y = percent,
fill = sex)) +
ggplot2::geom_col() +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Gun use",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Gun use"
))
This is the same code as in the book. It is more complex as necessary because:
position = "fill"
in geom_bar()
/ geom_col()
. Then the complex preparation for proportional bars is not necessary anymore.n / sum(n)
with 100 one could use ggplot2::scale_y_continuous(labels = scales::percent)
. This has the additional advantage that %
sign is appended at the y-values.I will show the more concise code in Listing / Output 3.20.
R Code 3.58 : Proportional stacked bar chart: Gun use by sex
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use, fill = sex)) +
ggplot2::geom_bar(position = "fill") +
## adornments
ggplot2::theme_bw() +
ggplot2::labs(
x = "Gun use",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Gun use"
)) +
ggplot2::scale_y_continuous(labels = scales::percent)
This is the reduced code of the book snippet from Listing / Output 3.19. It does not need a special preparation for proportional bar parts. Instead just add position = "fill"
into the ggplot2::geom_bar()
or ggplot2::geom_col()
function.
Grouped bar charts are the preferred option for bar charts. If there are more than two levels of categorical variables stacked bar charts are difficult to interpret. They lack a common reference point as the different observations or percentage in each levels starts and ends at different absolute position. A comparison of the relative size is for the human eye therefore awkward.
From the conceptual point of view there is no difference in the creation between stacked and grouped bar charts. The only difference is that grouped bar charts have the argument position = "dodge"
inside the geom_bar()
or geom_col()
argument.
Example 3.10 : Grouped bar charts for two categorical variables
R Code 3.59 : Grouped bar chart with two variables with only two levels: Gun use by sex
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, sex) |>
tidyr::drop_na() |>
ggplot2::ggplot(
ggplot2::aes(x = gun_use, fill = sex)
) +
ggplot2::geom_bar(position = "dodge") +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Gun use",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Gun use"
))
R Code 3.60 : Choose six colors from a color friendly palette
my_6_colors <- viridis::cividis(6, direction = -1, alpha = .8)
colorblindcheck::palette_check(my_6_colors, plot = TRUE)
## change 6 ggplot2 colors #############
options(
ggplot2.discrete.colour = my_6_colors,
ggplot2.discrete.fill = my_6_colors
)
#> name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1 normal 6 11.43205 15 15 11.43205 41.93861 94.45609
#> 2 deuteranopia 6 11.43205 15 14 11.32162 42.45045 95.77164
#> 3 protanopia 6 11.43205 15 14 11.15705 40.71083 90.60717
#> 4 tritanopia 6 11.43205 15 13 10.49463 35.80623 78.10328
For the bar charts rounds fired by sex we need a different scale with
Checking the color distances of the cividis
palette from the {viridis} packages with {colorblindcheck} we can see that the minimum distance is 10.5. This is relatively low but still acceptable. I have therefore set the default {ggplot2} color palette to these six colors of the cividis
palette.
R Code 3.61 : Grouped bar chart of two variables with several levels: Rounds fired by sex
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(rounds_fired, sex) |>
dplyr::mutate(rounds_fired =
dplyr::na_if(rounds_fired, "Don't know")) |>
tidyr::drop_na() |>
base::droplevels() |>
ggplot2::ggplot(
ggplot2::aes(x = sex, fill = rounds_fired)
) +
ggplot2::geom_bar(position =
ggplot2::position_dodge()
) +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Rounds fired",
y = "Number of respondents"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Rounds fired")
)
From this graph it is obvious that men fire generally more rounds than women. But the absolute value is not conclusive because we cannot see how many people in each group was excluded because they didn’t even fire one round. It could be that there are much more men than women that did not even fire one round. We already know that this is not the case but from this graph alone you wouldn’t know.
To get a more complete picture:
R Code 3.62 : Grouped bar chart of two variables with several levels: Rounds fired by sex of all respondents
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
dplyr::select(gun_use, rounds_fired, sex) |>
dplyr::mutate(rounds_fired =
dplyr::if_else(gun_use == "No", "0 rounds fired", rounds_fired)
) |>
dplyr::mutate(rounds_fired =
dplyr::na_if(rounds_fired, "Don't know")) |>
tidyr::drop_na() |>
base::droplevels() |>
## prepare for percentage calculation
dplyr::group_by(rounds_fired, sex) |>
dplyr::count() |>
## pick the variable that will add to 100%
dplyr::group_by(sex) |>
## compute percents within chosen variable
dplyr::mutate(percent = n / sum(n)) |>
ggplot2::ggplot(
ggplot2::aes(x = sex,
y = percent,
fill = rounds_fired)
) +
ggplot2::geom_col(position =
ggplot2::position_dodge()
) +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Rounds fired",
y = "Percentage"
) +
ggplot2::guides(
fill = ggplot2::guide_legend(
title = "Rounds fired")
) +
ggplot2::scale_y_continuous(
labels = scales::label_percent()
) +
ggplot2::scale_fill_brewer(palette = "BuGn")
Now we can say generally that more man use firearms than women. From those respondents (men and women) that are using guns, men fire generally more rounds than women.
The is just one difference for bar charts with one categorical and one continuous variable in comparison to the bar charts used previously in this chapter: We need a summary statistics for the continuous variable so that we can present accordingly the height of the bar. These summary statistics has to be inserted into the geom_bar()
function.
Example 3.11 : Bar charts with one categorical and one continuous variable
R Code 3.63 : The R code from the book does not work
#> Warning in ggplot2::geom_bar(stat = "summary", fun.y = mean): Ignoring unknown
#> parameters: `fun.y`
#> No summary function supplied, defaulting to `mean_se()`
Here I was using exactly the same code as in the book for Figure 3.45. Because of changes in {ggplot2) this code does not work anymore. I got two messages:
- No summary function supplied, defaulting to
mean_se()
- Warning: Ignoring unknown parameters:
fun.y
The problem is that the arguments inside ggplot2::geom_bar()
didn’t work.
R Code 3.64 : Revised book code using ggplot2::stat_summary()
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean |>
ggplot2::ggplot(
ggplot2::aes(x = weapons)
) +
ggplot2::geom_bar() +
ggplot2::stat_summary(
ggplot2::aes(y = deaths),
fun = "mean",
geom = "bar"
) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed")
This code chunk works without any message and warning. There are three differences to the code from the book:
ggplot()
call.geom_bar()
is empty in contrast to the book where arguments for summarizing statistics are added inside the geom_bar()
parenthesis.stat_summary()
is added with supplies all the necessary arguments to generate the summarizing statistics. It needs an extra aesthetics for the y-axis where the calculation takes place.R Code 3.65 : Bar chart: Calculating mean values with stat_summary()
and flipped coordinates
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean |>
ggplot2::ggplot(
ggplot2::aes(
x = stats::reorder(x = weapons, X = -deaths))
) +
ggplot2::geom_bar() +
ggplot2::stat_summary(
ggplot2::aes(
y = deaths,
fill = weapons,
group = weapons),
fun = "mean",
geom = "bar"
) +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed") +
ggplot2::scale_fill_manual(values =
c("Handguns" = "#7463AC",
"Firearms, type not stated" = "gray",
"Rifles" = "gray",
"Shotguns" = "gray",
"Other guns" = "gray"), guide = "none")
geom_bar()
with stat_summary()
and flipped coordinates
geom_bar()
with stat_summary()
and flipped coordinates
Here I have used the same code as before but have added several scale improvements like flipped axis, reordered the bars and changed to a sparse theme.
I had considerable problems with the fill color for this chart, because I did not know where and how to add the fill
aesthetics. As I added during my experimentation a new ggplot2::eas()
layer with the argument fill = weapons
, I got the following warning message:
The following aesthetics were dropped during statistical transformation. This can happen when ggplot fails to infer the correct grouping structure in the data. Did you forget to specify a
group
aesthetic or to convert a numerical variable into a factor?
I didn’t know how to add the group
aesthetic and all the examples I found on the web are referring to geom_line()
. (See for instance the page of the online manual for ggplot2 Aesthetics: grouping or Mapping variable values to colors in the first edition of R Graphics Cookbook). I finally found the solution. The following text snippet is now more understandable for me:
The group aesthetic is by default set to the interaction of all discrete variables in the plot. This choice often partitions the data correctly, but when it does not, or when no discrete variable is used in the plot, you will need to explicitly define the grouping structure by mapping group to a variable that has a different value for each group. Aesthetics: grouping
An easier solution (for me) is to compute the mean values of the different firearms and then to apply a bar chart using geom_col()
as I have done in the next tab with Listing / Output 3.22.
R Code 3.66 : Bar chart: Using mean values with geom_col()
and flipped coordinates
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean |>
dplyr::group_by(weapons) |>
dplyr::summarize(mean_deaths_2012_2016 = mean(deaths)) |>
ggplot2::ggplot(
ggplot2::aes(fill = weapons,
x = stats::reorder(
x = weapons,
X = -mean_deaths_2012_2016),
y = mean_deaths_2012_2016)
) +
ggplot2::geom_col() +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed") +
ggplot2::scale_fill_manual(values =
c("Handguns" = "#7463AC",
"Firearms, type not stated" = "gray",
"Rifles" = "gray",
"Shotguns" = "gray",
"Other guns" = "gray"), guide = "none")
geom_col()
and flipped coordinates
geom_col()
and flipped coordinates
In this version I have calculated the mean beforehand I have passed the data to {ggplot2}. This was easier for me to understand and I had this solution long before I solved the group
problem in Listing / Output 3.21.
R Code 3.67 : Number of Homicides by type of firearms (2012-2016)
p <- fbi_deaths_clean |>
ggplot2::ggplot(
ggplot2::aes(fill = year,
y = deaths,
x = stats::reorder(
x = weapons,
X = -deaths)
)
) +
ggplot2::geom_bar(position = 'dodge', stat = 'identity') +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed",
fill = "Year")
my_5_colors <- viridis::cividis(5, direction = -1, alpha = .8)
withr::with_options(
list(ggplot2.discrete.fill = my_5_colors),
print(p)
)
Here I have the first time used the {withr} package (see: Section A.98) for a just temporary change of the {ggplot2) standard color palette. This has the advantage that one doesn’t restore the original value.
R Code 3.68 : Number of homicides in the years 2012-2016 per type of firearms
p <- fbi_deaths_clean |>
ggplot2::ggplot(
ggplot2::aes(fill = weapons,
x = year,
y = deaths)
) +
ggplot2::geom_bar(position = 'dodge', stat = 'identity') +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Year",
y = "Number of homicides committed",
fill = "Weapons")
my_5_colors <- viridis::cividis(5, direction = -1, alpha = .8)
withr::with_options(
list(ggplot2.discrete.fill = my_5_colors),
print(p)
)
This graph is not very informative because it is difficult to compare the type of firearms for different years. To get this information a line chart would be much better (see Section 3.6.3.1).
Point charts are an alternative for simple bar graphs. The use less ink and are generated with the same code as bar charts with two exceptions:
geom_bar()
/ geom_col()
this type of graph uses geom_point()
.color
scale instead of the fill
scale..Example 3.12 : Point charts
R Code 3.69 : Point chart: Mean annual homicides by firearm type in the United States, 2012–2016
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean |>
dplyr::group_by(weapons) |>
dplyr::summarize(mean_deaths_2012_2016 = mean(deaths)) |>
ggplot2::ggplot(
ggplot2::aes(color = weapons,
x = stats::reorder(
x = weapons,
X = -mean_deaths_2012_2016),
y = mean_deaths_2012_2016)
) +
ggplot2::geom_point(size = 4) +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed") +
ggplot2::scale_color_manual(values =
c("Handguns" = "#7463AC",
"Firearms, type not stated" = "gray",
"Rifles" = "gray",
"Shotguns" = "gray",
"Other guns" = "gray"), guide = "none")
R Code 3.70 : Point chart with error bars: Mean annual homicides by firearm type in the United States, 2012–2016
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean |>
dplyr::group_by(weapons) |>
dplyr::summarize(
central = mean(x = deaths),
spread = sd(x = deaths)
) |>
ggplot2::ggplot(
ggplot2::aes(x = stats::reorder(
x = weapons,
X = -central),
y = central)
) +
ggplot2::geom_errorbar(
ggplot2::aes(ymin = central - spread,
ymax = central + spread,
linetype = "Mean\n+/- sd"),
width = .2) +
ggplot2::geom_point(
ggplot2::aes(color = weapons,
size = "Mean (2012-2016)"),
alpha = 0.5
) +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed") +
ggplot2::scale_color_manual(values =
c("Handguns" = "#7463AC",
"Firearms, type not stated" = "gray",
"Rifles" = "gray",
"Shotguns" = "gray",
"Other guns" = "gray"),
guide = "none"
) +
ggplot2::scale_linetype_manual(values = 1, name = "Error bars") +
ggplot2::scale_size_manual(values = 4, name = "") +
ggplot2::theme(legend.position = "top")
I added alpha = 0.5
so that the small error bars are still visible “behind” the big dots.
This is a more complex graph: It has two layers (geom_errorbar()
and geom_point()
) and three different scales (weapons, line type and size). The difficulty for me is to know where to put the different aesthetics. For instance: I could put size = 4
after the alpha = 0.5
argument, but then the argument scale_size_manual(values = 4)
would not work anymore. Otherwise it is not possible to add the alpha argument into the scale_size_manual()
function.
Only for “firearms, type not stated” and “Handguns” had a remarkable size of standard deviation from the mean. For the other types (“Riffles”, “shotguns” and “other guns”) the spread is so small that they did not even extend outside of the dots. (To see them, I had to apply an alpha
argument.)
In the book Harris supposes that the the small number of years is the reason for this tendency. I do not believe that this may be the main reason. I think it has more to do with the absolute small numbers of observation in these type of firearms.
While the bar chart and point chart were great for comparing the means of the groups, the boxplot will provide more information about the distribution in each group.
Example 3.13 : Box Plots
R Code 3.71 : Boxplot: Annual homicides by firearm type in the United States, 2012–2016
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
fbi_deaths_clean |>
ggplot2::ggplot(
ggplot2::aes(x = stats::reorder(
x = weapons,
X = -deaths),
y = deaths)
) +
ggplot2::geom_boxplot() +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed")
In contrast to the book I have not colored / filled the boxes.
In this type of graph we do not only see the bigger spread of handguns and firearms not stated, but we also see the skew of these two distributions.
In the next tab we can see the distribution of the mean values in relation to the boxplots.
R Code 3.72 : Boxplot with data points: Annual homicides by firearm type in the United States, 2012–2016
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")
## if jitter should always result
## in the same horizontal position
# set.seed(500)
fbi_deaths_clean |>
ggplot2::ggplot(
ggplot2::aes(x = stats::reorder(
x = weapons,
X = -deaths),
y = deaths)
) +
ggplot2::geom_boxplot() +
ggplot2::geom_jitter() +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::labs(x = "Firearm type",
y = "Number of homicides committed")
I didn’t need alpha = .8
as in the book, because I didn’t fill the boxes with color.
I noticed that whenever I run the code chunk the horizontal position of the data points changes. This is the effect of the jitter
command. I you want to have always exactly the same position than you would have to add a set.seed()
in front of the computation.
It is interesting to see that just one data point far from the median cannot change its position. Compare the far right dot of firearms not reported. It has the same relative position to the box as the far right dot of the handguns category. But it didn’t pull the median to its position. In contrast to the handguns where several other points on the right side support the one far right dot and drag the median to the right.
Violin plots can be seen as a graph type between boxplots and density plots. They are typically used to look at the distribution of continuous data within categories.
For the homicide data they do not work because there are not enough data. There were too few cases in some categories. So the book applied the violin plot to the age by sex data from the NHANES survey.
R Code 3.73 : Using a violin plot to compare respondent sex of the NHANES 2012 survey
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
nhanes_2012_clean |>
ggplot2::ggplot(
ggplot2::aes(x = sex, y = RIDAGEYR)
) +
ggplot2::geom_violin(
ggplot2::aes(fill = sex)
) +
ggplot2::theme_bw() +
ggplot2::labs(
x = "Sex",
y = "Age in years"
) +
ggokabeito::scale_fill_okabe_ito(
guide = "none"
)
When to use line graphs and when to use scatterplots
Situations where a line graph is more useful than a scatterplot:
Relationships where there is no measure of time and data that include a lot of variation are better shown with a scatterplot.
Example 3.14 : Firearms per type manufactured in the United States
R Code 3.74 : Firearms circulating in the United States 1986-2017
guns_total <-
base::readRDS("data/chap03/guns_total.rds")
p_guns_total <- guns_total |>
tidyr::pivot_longer(
cols = 2:6,
names_to = "gun_type",
values_to = "gun_count"
) |>
dplyr::mutate(Year = base::as.numeric(Year),
gun_type = forcats::as_factor(gun_type)
) |>
ggplot2::ggplot(
ggplot2::aes(
x = Year,
y = gun_count / 1e5)
) +
ggplot2::geom_line(
ggplot2::aes(color = gun_type)
) +
ggplot2::theme_bw() +
ggplot2::labs(
y = "Number of firearms (in 100,000s)",
) +
ggplot2::scale_x_continuous(
limits = c(1986, 2017),
breaks = c(1986, 1990, 1995, 2000,
2005, 2010, 2015, 2017)) +
ggokabeito::scale_color_okabe_ito(
order = c(1:3, 7, 9),
name = "Type of
Firearm",
breaks = c('Total', 'Handguns', 'Rifles',
'Shotguns', 'Misc'))
p_guns_total
# # # palette_okabe_ito(order = 1:9, alpha = NULL, recycle = FALSE)
# "#E69F00" "#56B4E9" "#009E73" "#F0E442" "#0072B2" "#D55E00" "#CC79A7" "#999999" "#000000"
There are four remarkable code lines:
palette_okabe_ito(order = 1:9, alpha = NULL, recycle = FALSE)
generates Hex color code for all nine colors of palette. If you copy the (not runable) resulted hex code into the R code chunk, you can examine the appropirate colors visually. — I commented these two code line out.I took the ATF data form the PDF file and had calculated guns manufactured (from the USAFact.org website) and subtracted exported guns (page 4 of the ATF PDF) and added imported guns (page 6 of the ATF PDF). For details of the sources see Section 3.3.2.1.
R Code 3.75 : Firearms manufactured in the United States 1986-2019
guns_manufactured_2019_clean <-
base::readRDS("data/chap03/guns_manufactured_2019_clean.rds")
lookup_manufactured <- c(
Total = "Firearms manufactured (Items)",
Pistols = "Pistols (Items)",
Rifles = "Rifles (Items)",
Shotguns = "Shotguns (Items)",
Revolvers = "Revolvers (Items)",
Misc = "Misc. Firearms (Items)"
)
guns_manufactured_2019_clean |>
tidyr::pivot_longer(
cols = -Years,
names_to = "year",
values_to = "gun_count"
) |>
dplyr::rename(gun_type = "Years") |>
dplyr::mutate(year = base::as.numeric(year),
gun_type = forcats::as_factor(gun_type)
) |>
tidyr::pivot_wider(
names_from = gun_type,
values_from = gun_count
) |>
dplyr::rename(
dplyr::all_of(lookup_manufactured)
) |>
tidyr::pivot_longer(
cols = -year,
names_to = "gun_type",
values_to = "gun_count"
) |>
ggplot2::ggplot(
ggplot2::aes(
x = year,
y = gun_count / 1e5)
) +
ggplot2::geom_line(
ggplot2::aes(color = gun_type)
) +
ggplot2::theme_bw() +
ggplot2::labs(
y = "Number of firearms (in 100,000s)",
x = "Year manufactured"
) +
ggplot2::scale_x_continuous(
limits = c(1986, 2019),
breaks = c(1986, 1990, 1995, 2000,
2005, 2010, 2015, 2019)) +
ggokabeito::scale_color_okabe_ito(
order = c(1:3, 6, 7, 9),
name = "Type of
Firearm",
breaks = c('Total', 'Pistols', 'Rifles',
'Misc', 'Revolvers', 'Shotguns'))
## palette_okabe_ito(order = 1:9, alpha = NULL, recycle = FALSE)
# "#E69F00" "#56B4E9" "#009E73" "#F0E442" "#0072B2" "#D55E00" "#CC79A7" "#999999" "#000000"
Here applies the same comment about the code structure as written for the previous tab “Circulating guns”.
Line graphs need an ggplot2::aes()
specification inside the ggplot2::geom_line()
In contrast to other types of graph: Besides the definition of x and y variables, there is a second aes()
specification inside the geom_line()
necessary where you define linetype
or the color
of lines. Otherwise the graph would not work as line graph!
Example 3.15 : Scatterplot for Mortality Rate versus Funding
R Code 3.76 : Scatterplot: Mortality Rate versus Funding (with metric x-scale)
research_funding <- base::readRDS("data/chap03/research_funding.rds")
research_funding |>
ggplot2::ggplot(
ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
y = Funding)
) +
ggplot2::geom_point() +
ggplot2::theme_bw()
R Code 3.77 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale and a ‘loess’ smoother)
## prevent exponential (scientific) notation
base::options(scipen = 999)
research_funding |>
ggplot2::ggplot(
ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
y = Funding / 1e9)
) +
ggplot2::geom_point() +
ggplot2::theme_bw() +
ggplot2::labs(
y = "Funding, US $ billion"
) +
ggplot2::geom_smooth(
method = "loess",
formula = 'y ~ x'
) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10()
ggplot2::geom_smooth()
and not ggplot2::stat_smooth()
as in the book. Frankly, I do not understand the difference, because the output is the same. As parameters I have added the standard value for less than 1,000 observations method = "loess"
and formula = 'y ~x'
.options(scipen = 999)
.A penalty to be applied when deciding to print numeric values in fixed or exponential notation. Positive values bias towards fixed and negative towards scientific notation: fixed notation will be preferred unless it is more than
scipen
digits wider. Help page for Options Settings
R Code 3.78 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale and a linear smoother)
research_funding |>
ggplot2::ggplot(
ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
y = Funding / 1e9)
) +
ggplot2::geom_point() +
ggplot2::theme_bw() +
ggplot2::labs(
y = "Funding, US $ billion"
) +
ggplot2::geom_smooth(
method = "lm",
formula = 'y ~ x'
) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10()
In contrast to Graph 3.48 I have here used for the trend line, e.g. the line for the best fit, the smoother for the linear model (‘lm’).
R Code 3.79 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale, a linear smoother and points labelled)
research_funding |>
ggplot2::ggplot(
ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
y = Funding / 1e9)
) +
ggplot2::geom_point() +
ggplot2::theme_bw() +
ggplot2::labs(
y = "Funding, US $ billion"
) +
ggplot2::geom_smooth(
method = "lm",
formula = 'y ~ x'
) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10() +
ggplot2::geom_text(
ggplot2::aes(label = `Cause of Death`)
)
The problem here is that the label are overlapping each other. This can be repaired with the {ggrepel} package (see Section A.32). (See Graph 3.51)
R Code 3.80 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale, a linear smoother and points labelled with {ggrepel})
research_funding <- base::readRDS("data/chap03/research_funding.rds")
research_funding |>
ggplot2::ggplot(
ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
y = Funding / 1e9)
) +
ggplot2::geom_point() +
ggplot2::theme_bw() +
ggplot2::labs(
y = "Funding, US $ billion"
) +
ggplot2::geom_smooth(
method = "lm",
formula = 'y ~ x'
) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10() +
ggrepel::geom_text_repel(
ggplot2::aes(label = `Cause of Death`)
)
## return from fixed to standard notation
base::options(scipen = 0)
With the {ggrepel} package one prevents overlapping of labels in a scatterplot. Compare to Graph 3.50.
For a graph to stand alone, it should have as many of these features as possible:
- Clear labels and titles on both axes
- An overall title describing what is in the graph along with
- Source of data (added, pb)
- Date of data collection
- Units of analysis (e.g., people, organizations)
- Sample size
In addition, researchers often use the following to improve a graph:
- Scale variables with very large or very small values (e.g., using millions or billions).
- Color to draw attention to important or relevant features of a graph.
Most of the content in this chapter were tutorial material to choose and produce appropriate graphs for the data at hand. Therefore I didn’t always follow the above rules. Especially the source of the data, and the date of the data collection were most of the time not added.
In the next graph I try to add all the important information to fulfill the above advice.
R Code 3.81 : Firearms circulating in the United States 1986-2017
p_guns_total +
ggplot2::labs(
title = "Firearms circulating in the United States 1986-2017",
subtitle = "Own Calculation: Firearms manufactured (p.2)\nsubtracted Exports (p.4) and added Imports (p.6).",
caption = "Source: Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF)"
)
lightbox = true
directive in my _quarto.yml
file, because the whole picture is an active link for the lightbox. But hyperlinks do work in the figure caption (fig-cap
).\n
. But this did not work for fig-cap
.At least since creating graphs for two variable at once (Section 3.6) the question arises: What color palette should be chosen?
To dive into the topic of color palettes for R is a very deep water and I can here only scratch the surface. My main purpose in this section is to provide practical orientation on tools to choose appropriate color palettes. I am following here mainly the compiled material by Emil Hvitfeldt (2024), who is also the developer of {paleteer} (see Section A.61) (Hvitfeldt 2021).
It is not enough that the colors do not offend the eye, e.g. that they are aesthetically pleasant. There are two other important consideration as well:
Often you have a beauty colorful graphic which looks very nice in RStudio and on your web site. But how does it look when the graphics is printed out or appear in book published only in black and white?
You can check the black/white appearance of the colors your plot is using with the function colorspace::desaturate()
.
To test if the palette you want to use will be distorted when in black and white, use the colorspace::desaturate()
function.
Experiment 3.5 : Using colorspace::desaturate()
to test how color palettes perform in black & white
R Code 3.82 : Standard color palette for {ggplot2} in color and desaturated
The standard color palette of {ggplot2} is completely useless when you print it out in black & white! The problem is that the colors are picked to have constant chroma and luminance thus yielding the same shade of grey when desaturated.
R Code 3.83 : {viridis} color palette “magma” (option ‘A’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “A” (“magma”)
R Code 3.84 : {viridis} color palette “inferno” (option ‘B’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “B” (“inferno”)
R Code 3.85 : {viridis} color palette “plasma” (option ‘C’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “C” (“plasma”)
R Code 3.86 : {viridis} color palette “viridis” (option ‘D’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “D” (“viridis”)
R Code 3.87 : {viridis} color palette “cividis” (option ‘E’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “D” (“cividis”)
R Code 3.88 : {viridis} color palette “rocket” (option ‘F’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “F” (“rocket”)
R Code 3.89 : {viridis} color palette “mako” (option ‘G’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “G” (“mako”)
R Code 3.90 : {viridis} color palette “turbo” (option ‘H’) in color and desaturated
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “H” (“turbo”)
WATCH OUT! Begin and end of the “viridis::turbo” color scale are very similar in black & white
In this chapter I preferred “viridis::turbo” because it is the most colorful palette of {viridis}. But it turned out that the begin and end of the color scale are not distinguishable in black & white. I had therefore to determine a different begin and/or end of the palette to prevent this similarity.
R Code 3.91 : Test how my used colors look for printing in black & white
pal_data <- list(names = c("Normal", "desaturated"),
color = list(scales::viridis_pal(
alpha = 1,
begin = .15,
end = .35,
direction = -1,
option = "turbo")(2),
colorspace::desaturate(scales::viridis_pal(
alpha = 1,
begin = .15,
end = .35,
direction = -1,
option = "turbo")(2)))
)
list_plotter(pal_data$color, pal_data$names,
"Colors and black & white of graph gun-use by sex")
Summary and my personal preferences
Read more details about {viridis} scales in the vignette Introduction to the {viridis} color maps. These palettes are not only colorful and pretty but also perceptually uniform and robust to color blindness (CVD).
viridis
) which is easier to read for most of the CVDs. cividis
is a corrected version providing easiness for all kinds of CVDs.turbo
stands out, because it is a rainbow scale developed to address the shortcomings of the Jet rainbow color map. It is not perceptually uniform.colorspace::desaturate()
. For instance: colorspace::desaturate(viridis::turbo(5, 0.5, .25, .75, -1))
See for a practice test Listing / Output 3.23.Resource 3.5 Other colorblind friendly palettes
scale_color_OkabeIto
and scale_fill_OkabeIto
from {blindnessr} works also better for people with color-vision deficiency (CVD).350 million people are color blind. Men are affected more than women: 28% (1 out of 12) men and 0.5% (1 out of 200) women. (EnChroma)
Usually when people talk about color blindness, they are referring to the most common forms of red-green color blindness. But there are also other color vision deficiency (CVD):
Bullet List
There are different possibilities to determine the effect of color blindness on the used palettes or used graph colors.
The {dichromat} package can simulate color blindness on individual colors or entire palettes.
Experiment 3.6 : Effects of color blindness on different palettes using {dichromat}
R Code 3.92 : Standard color palette for {ggplot2} in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(scales::hue_pal()(256),
dichromat::dichromat(scales::hue_pal()(256), type = "deutan"),
dichromat::dichromat(scales::hue_pal()(256), type = "protan"),
dichromat::dichromat(scales::hue_pal()(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names, "The effect of color blindness on the ggplot2 standard palette")
R Code 3.93 : {viridis} color palette “magma” (option ‘A’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::magma(256),
dichromat::dichromat(viridis::magma(256), type = "deutan"),
dichromat::dichromat(viridis::magma(256), type = "protan"),
dichromat::dichromat(viridis::magma(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'magma' palette")
R Code 3.94 : {viridis} color palette “inferno” (option ‘B’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::inferno(256),
dichromat::dichromat(viridis::inferno(256), type = "deutan"),
dichromat::dichromat(viridis::inferno(256), type = "protan"),
dichromat::dichromat(viridis::inferno(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'inferno' palette")
R Code 3.95 : {viridis} color palette “plasma” (option ‘C’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::plasma(256),
dichromat::dichromat(viridis::plasma(256), type = "deutan"),
dichromat::dichromat(viridis::plasma(256), type = "protan"),
dichromat::dichromat(viridis::plasma(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'plasma' palette")
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “C” (“plasma”)
R Code 3.96 : {viridis} color palette “viridis” (option ‘D’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::viridis(256),
dichromat::dichromat(viridis::viridis(256), type = "deutan"),
dichromat::dichromat(viridis::viridis(256), type = "protan"),
dichromat::dichromat(viridis::viridis(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'viridis' palette")
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “D” (“viridis”)
R Code 3.97 : {viridis} color palette “cividis” (option ‘E’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::cividis(256),
dichromat::dichromat(viridis::cividis(256), type = "deutan"),
dichromat::dichromat(viridis::cividis(256), type = "protan"),
dichromat::dichromat(viridis::cividis(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'cividis' palette")
R Code 3.98 : {viridis} color palette “rocket” (option ‘F’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::rocket(256),
dichromat::dichromat(viridis::rocket(256), type = "deutan"),
dichromat::dichromat(viridis::rocket(256), type = "protan"),
dichromat::dichromat(viridis::rocket(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'rocket' palette")
R Code 3.99 : {viridis} color palette “mako” (option ‘G’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::mako(256),
dichromat::dichromat(viridis::mako(256), type = "deutan"),
dichromat::dichromat(viridis::mako(256), type = "protan"),
dichromat::dichromat(viridis::mako(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'mako' palette")
One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “G” (“mako”)
R Code 3.100 : {viridis} color palette “turbo” (option ‘H’) in color and desaturated
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(viridis::turbo(256),
dichromat::dichromat(viridis::turbo(256), type = "deutan"),
dichromat::dichromat(viridis::turbo(256), type = "protan"),
dichromat::dichromat(viridis::turbo(256), type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on the viridis 'turbo' palette")
R Code 3.101 : Color blindness test of my chosen colors for the graph of gun-use by sex
my_colors = scales::viridis_pal(
alpha = 1,
begin = .15,
end = .35,
direction = -1,
option = "turbo")(2)
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(my_colors,
dichromat::dichromat(my_colors, type = "deutan"),
dichromat::dichromat(my_colors, type = "protan"),
dichromat::dichromat(my_colors, type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects on my chosen colors for gun use by sex")
R Code 3.102 : Color blindness test of my chosen colors for the graph rounds fired by sex
my_colors = scales::viridis_pal(
alpha = 1,
begin = .25,
end = .75,
direction = -1,
option = "cividis")(5)
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(my_colors,
dichromat::dichromat(my_colors, type = "deutan"),
dichromat::dichromat(my_colors, type = "protan"),
dichromat::dichromat(my_colors, type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects for my graph: rounds fired by sex")
R Code 3.103 : Color blindness test of my chosen colors for the waffle graph: rounds fired
my_colors <- c("lightblue1", "lightsteelblue1",
"deepskyblue1", "dodgerblue3", "black",
"gold1", "lemonchiffon1")
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(my_colors,
dichromat::dichromat(my_colors, type = "deutan"),
dichromat::dichromat(my_colors, type = "protan"),
dichromat::dichromat(my_colors, type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects for my waffle graph: rounds fired 2018")
R Code 3.104 : Color blindness test of my chosen colors for the waffle graph: rounds fired 2018 with cividis
color scale
my_colors <- scales::viridis_pal(
alpha = 1,
begin = .25,
end = .75,
direction = -1,
option = "cividis")(7)
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
color = list(my_colors,
dichromat::dichromat(my_colors, type = "deutan"),
dichromat::dichromat(my_colors, type = "protan"),
dichromat::dichromat(my_colors, type = "tritan")))
list_plotter(pal_data$color, pal_data$names,
"Color blindness effects for my waffle graph:\nrounds fired with cividis color scale")
With the exception of the turbo palette form the {viridis} package show all palettes a color blindness friendly performance. Again this experiments shows that one should not rely on the standard {ggplot2} palette if one would design for CVD.
But be aware that it worked for my test cases with only 2, 5 and 7 colors. I think one should always test if the used colors are also readable with black and white printing and for people with CVD.
Another helpful package to produce colorblind friendly plots is {colorblindr}. This package is only available at GitHub. It depends on the development versions of {cowplot} and {colorspace}.
It simulates the plot colors for all kinds of CVD with three different methods:
colorblindr::cvd_grid(<gg-object>)
: Displays the various color-vision-deficiency simulations as a plot in R.colorblindr::view_cvd(<gg-object>)
: Inspect the graph in the interactive app. Use this function only interactively (not programmatically!)You get 5 different plots to compare:
R Code 3.105 : Demo of the usage of {colorblindr}
colorblindr::cvd_grid(waffle_plot)
colorblindr::cvd_grid()
function
I have used the plot for the proportion of total rounds fired (NHANES survey 2017-2018). Compare the simulation with the original in Graph 3.16 (a),
Both methods to check for colorblind friendly plots discussed so far (Section A.17 and Section A.6) have the disadvantage that one has to determine the colors by the subjective impression of the personal visual inspection. {colorblindcheck} provides in addition to the visual inspection some objective data: It calculates the distances between the colors in the input palette and between the colors in simulations of the color vision deficiencies.
In the following experiment I am going to investigate my hand-picked colors for the waffle graph in Graph 3.16 (a). I picked these colors because the appeared to me distant enough to make an understandable and pleasant graph.
Experiment 3.7 : Experiments with the {colorblindcheck} package
R Code 3.106 : Convert colors by name into hexadecimal code
#> [1] "#BFEFFF" "#CAE1FF" "#00BFFF" "#1874CD" "#000000" "#FFD700" "#FFFACD"
I had chosen my colors by name. So the first task to apply the function of the {colorblindcheck} package is to translate them into hex code. For this conversion I have used the col2hex()
function of the {gplots} package (see Section A.30)
R Code 3.107 : List different parameters for color distance values to check for colorblind friendliness
colorblindcheck::palette_check(waffle_colors)
#> name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1 normal 7 10.54291 21 21 10.542912 47.97701 98.42826
#> 2 deuteranopia 7 10.54291 21 20 2.795989 47.79041 98.41086
#> 3 protanopia 7 10.54291 21 20 4.073523 46.14816 97.30477
#> 4 tritanopia 7 10.54291 21 20 6.730892 41.78801 96.26958
The colorblindcheck::palette_check()
function returns a data.frame with 4 observations and 8 variables:
The table shows in the normal color view that the minimal distance between colors is about 10.5. But in all three CVDs palettes only 20 from the 21 pairs are above this distance. So there is one color pair where people with CVD can’t distinguish.
Let’s go into the details. At first we will inspect the color palettes visually and then we will investigate the distances between all pairs for all palettes of CVDs.
R Code 3.108 : Plot waffle colors palette normal and for different CVDs
colorblindcheck::palette_plot(waffle_colors, severity = 1)
From the visual inspections it seems that there is a problem with the color pair 1 / 2.
R Code 3.109 : Color pair distances for the normal colors
colorblindcheck::palette_dist(waffle_colors)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] NA 10.54291 18.10731 37.10142 89.17120 41.73917 26.00648
#> [2,] NA NA 19.28529 33.38842 84.16342 48.19716 31.10567
#> [3,] NA NA NA 25.22187 64.86977 60.28084 44.44362
#> [4,] NA NA NA NA 42.63815 67.09988 59.51180
#> [5,] NA NA NA NA NA 85.64217 98.42826
#> [6,] NA NA NA NA NA NA 20.57238
#> [7,] NA NA NA NA NA NA NA
The smallest distance in the normal color palette is about 10.5 between 1 and 2. But this is a very good distance, all the other colors perform even better with higher values of 18.
R Code 3.110 : Color pair distances for the deuteranopia CVD
colorblindcheck::palette_dist(waffle_colors, cvd = "deu")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] NA 2.795989 19.67791 38.31696 87.00229 42.72322 28.91615
#> [2,] NA NA 17.19632 36.15745 82.97331 45.39762 31.72386
#> [3,] NA NA NA 21.43223 61.57162 60.53591 47.99180
#> [4,] NA NA NA NA 41.37258 70.95919 60.97332
#> [5,] NA NA NA NA NA 87.88964 98.41086
#> [6,] NA NA NA NA NA NA 19.58039
#> [7,] NA NA NA NA NA NA NA
Here we can see clearly that the distance between 1 and 2 is only 2.8!. All the other colors perform excellent with higher values of 17.
R Code 3.111 : Color pair distances for the protanopia CVD
colorblindcheck::palette_dist(waffle_colors, cvd = "pro")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] NA 4.073523 16.80032 35.11481 91.09592 39.70098 26.01640
#> [2,] NA NA 12.95911 31.90976 85.83276 43.51167 30.38048
#> [3,] NA NA NA 20.91352 68.81192 54.71840 43.20273
#> [4,] NA NA NA NA 44.94125 65.17437 54.56921
#> [5,] NA NA NA NA NA 82.19301 97.30477
#> [6,] NA NA NA NA NA NA 19.88647
#> [7,] NA NA NA NA NA NA NA
Again: The problem is the color pair 1 and 2 with a slightly better difference of 4 in comparison with the deuteranopia palette of 2.8.
R Code 3.112 : Color pair distances for the tritanopia CVD
colorblindcheck::palette_dist(waffle_colors, cvd = "tri")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] NA 6.730892 13.01156 30.72849 90.39176 38.96913 24.90487
#> [2,] NA NA 15.41221 29.64808 84.57220 34.48380 20.06730
#> [3,] NA NA NA 21.02786 69.32457 43.19480 31.56913
#> [4,] NA NA NA NA 43.74885 46.44875 42.21174
#> [5,] NA NA NA NA NA 78.32859 96.26958
#> [6,] NA NA NA NA NA NA 16.50397
#> [7,] NA NA NA NA NA NA NA
Here again the problem lies smallest distance of the first two color. But 6.7 is not so bad for distinctions. Therefore it is always necessary to see the actual value and not only to stick with the first table in Table 3.4.
It is interesting to notice that my thought from the visual inspection that the pair 1 /3 is also problematic is unsubstantiated.
R Code 3.113 : Color distances with the CVD friendly {cividis} palette
waffle_colors2 = scales::viridis_pal(alpha = 1,
direction = -1,
option = "cividis")(7)
colorblindcheck::palette_check(waffle_colors2, plot = TRUE)
#> name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1 normal 7 9.474139 21 21 9.474139 40.15113 94.45609
#> 2 deuteranopia 7 9.474139 21 20 9.187656 40.67049 95.77164
#> 3 protanopia 7 9.474139 21 20 9.309515 39.02739 90.60717
#> 4 tritanopia 7 9.474139 21 20 8.932613 34.28656 78.10328
From the table we can see that all three CVDs palette have a smaller tolerance as the normal color palette. But we also see that the differences are minimal. The worst case (tritanopia) is only .05 belwo and has still a distance of almost 9.
Use colorblind friendly palettes
I have tried very hard to choose colorblind friendly manually. But I didn’t succeed.
Lessons learned: Do not rely on your subjective visual impressions but use the professionally designed palettes that improve readability for all kinds of color vision deficiencies.
I have learned to add palettes with different approaches. The following experiments list all the options I became aware in the last few days.
Experiment 3.8 : How to add palettes to {ggplot2} graphics
R Code 3.114 : Creating the gg
object for adding the palette
library(ggmosaic)
library(ggplot2)
## restore ggplot2 colors #############
options(
ggplot2.discrete.color = NULL,
ggplot2.discrete.fill = NULL
)
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")
(
gg <- nhanes_2012_clean |>
dplyr::mutate(rounds_fired =
dplyr::na_if(rounds_fired, "Don't know")) |>
tidyr::drop_na(c(rounds_fired, sex)) |>
dplyr::mutate(rounds_fired =
forcats::fct_drop(rounds_fired)) |>
ggplot2::ggplot() +
ggmosaic::geom_mosaic(ggplot2::aes(
x = ggmosaic::product(rounds_fired, sex),
fill = rounds_fired)
) +
ggplot2::theme_classic() +
ggplot2::labs(x = "Sex",
y = "Total number of rounds fired",
fill = "Rounds fired") +
ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE))
)
R Code 3.115 : OPTION 1: Viridis color palette from {viridisLite} using {ggplot2}
gg +
ggplot2::scale_fill_viridis_d(
alpha = 1, # alpha does not work!
begin = .25,
end = .75,
direction = -1,
option = "cividis"
)
R Code 3.116 : OPTION 2: Create your own discrete scale using {ggplot2} and {scales}
gg +
ggplot2::scale_fill_manual(values = scales::viridis_pal(
alpha = 1,
begin = .25,
end = .75,
direction = -1,
option = "cividis")(5)
)
R Code 3.117 : OPTION 3: Viridis Color Palette for ggplot2 using {viridis}
gg +
viridis::scale_fill_viridis(
begin = .25,
end = .75,
discrete = TRUE,
option = "cividis",
direction = -1
)
R Code 3.118 : OPTION 4: Viridis manual scale with {paletteer}
library(paletteer)
gg +
ggplot2::scale_fill_manual(values =
paletteer::paletteer_c("viridis::cividis",
n = 5,
direction = -1)
)
This option with {paletteer} lacks some arguments (alpha
, begin
and end
).
R Code 3.119 : OPTION 5: Wrapper around continuous function (no direction!)
library(paletteer)
gg +
ggplot2::scale_fill_manual(values =
base::rev(paletteer:::paletteer_c_viridis(
name = "cividis",
n = 5))
)
This option uses an abbreviated function of the {paletteer} package. But again it misses the arguments alpha
, begin
, end
and direction
. It is possible to use base:rev()
to reverse the direction but for the other arguments I have not found an alternative.
There are different option to add a palette. The option 4 and 5 with the {paletteer} lacks some options. I have not much experience with palettes and color scales. There I am not sure if the missing options are a result of my missing knowledge or if it is a design feature or bug.
WATCH OUT! Changing the alpha
argument does not work
All the options to add the {viridis} palette does not change the alpha
argument. I believe that this is an error on the {ggmosiac} package.
Update: Yes, In the meanwhile I have confirmed my assumption! R Code 5.13 shows an example where the alpha
argument is working!
Resource 3.6 : Online resources of {ggplot2} code examples
term | definition |
---|---|
APIx | An API, or application programming interface, is a set of defined rules that enable different applications to communicate with each other. It acts as an intermediary layer that processes data transfers between systems, letting companies open their application data and functionality to external third-party developers, business partners, and internal departments within their companies. (<a href="https://www.ibm.com/topics/api">IBM</a>) |
ATF Agency | Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF): ATF’s responsibilities include the investigation and prevention of federal offenses involving the unlawful use, manufacture, and possession of firearms and explosives; acts of arson and bombings; and illegal trafficking of alcohol and tobacco products. The ATF also regulates, via licensing, the sale, possession, and transportation of firearms, ammunition, and explosives in interstate commerce. (<a href="https://www.atf.gov/about/what-we-do">ATF</a>) |
Bar Charts | Bar charts are visual displays of data often used to examine similarities and differences across categories of things; bars can represent frequencies, percentages, means, or other statistics. (SwR, Glossary) |
Boxplots | Boxplots are a visual representation of data that shows central tendency (usually the median) and spread (usually the interquartile range) of a numeric variable for one or more groups; boxplots are often used to compare the distribution of a continuous variable across several groups. (SwR, Glossary) |
Color blindness | Color blindness (also spelled colour blindness) or color vision deficiency (CVD) or includes a wide range of causes and conditions and is actually quite complex. It's a condition characterized by an inability or difficulty in perceiving and differentiating certain colors due to abnormalities in the three color-sensing pigments of the cones in the retina. (<a href="https://enchroma.com/pages/types-of-color-blindness">EnChroma</a>) |
CVD | Color vision deficiency (CVD) or color blindness (also spelled colour blindness) includes a wide range of causes and conditions and is actually quite complex. It's a condition characterized by an inability or difficulty in perceiving and differentiating certain colors due to abnormalities in the three color-sensing pigments of the cones in the retina. (<a href="https://enchroma.com/pages/types-of-color-blindness">EnChroma</a>) |
Density Plots | Density plots are used for examining the distribution of a variable measured along a continuum; density plots are similar to histograms but are smoothed and may not show existing gaps in data (SwR, Glossary) |
Donut Charts | Donut or doughnut charts (sometimes also called ring charts) are an alternative chart for pie charts, which have a hole in the middle, making them cleaner to read than pie charts. (<a href="https://r-charts.com/part-whole/donut-chart/">R Charts</a>) |
Grouped Bar Chart | A grouped bar chart is a data visualization that shows two categorical variables in a bar chart where one group is shown along the x-axis for vertical bars or y-axis for horizontal bars and the other grouping is shown as separate bars within each of the first grouping variable categories; the bars are often different colors to distinguish the groups. (SwR, Glossary) |
Histograms | Histograms are visual displays of data used to examine the distribution of a numeric variable. (SwR, Glossary) |
Kernel Density Estimation | Kernel density estimation (KDE) extrapolates data to an estimated population probability density function. It’s called kernel density estimation because each data point is replaced with a kernel—a weighting function to estimate the pdf. The function spreads the influence of any point around a narrow region surrounding the point. (<a href="https://www.statisticshowto.com/kernel-density-estimation/">Statistics How To</a>) |
Line Graph | A line graph is a visual display of data often used to examine the relationship between two continuous variables or for something measured over time. (SwR, Glossary) |
Mosaic Plots | Mosaic plots are visual representations of data to show the relationship between two categorical variables; useful primarily when both variables have few categories. (SwR, Glossary) |
NHANES | The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. (<a href="https://www.cdc.gov/nchs/nhanes/about_nhanes.htm">NHANES</a>) |
Pie Charts | Pie charts are used to show parts of a whole; pie charts get their name from looking like a pie with pieces representing different groups, and they are not recommended for most situations because they can be difficult to interpret |
Point Charts | Point charts are charts that show summary values for a numeric variable, typically across groups; for example, a point chart could be used in place of a bar graph to show mean or median across groups. (SwR, Glossary) |
Probability Density Function | A probability density function (PDF) tells us the probability that a random variable takes on a certain value. (<a href="https://www.statology.org/cdf-vs-pdf/">Statology</a>) The probability density function (PDF) for a given value of random variable X represents the density of probability (probability per unit random variable) within a particular range of that random variable X. Probability densities can take values larger than 1. ([StackExchange Mathematics](https://math.stackexchange.com/a/1464837/1215136)) We can use a continuous probability distribution to calculate the probability that a random variable lies within an interval of possible values. To do this, we use the continuous analogue of a sum, an integral. However, we recognise that calculating an integral is equivalent to calculating the area under a probability density curve. We use `p(value)` for probability densities and `P |
Scatterplot | A scatterplot is a graph that shows one dot for each observation in the data set (SwR, Glossary) |
Stacked Bar Chart | A stacked bar chart is a data visualization that shows parts of a whole in a bar chart format; this type of chart can be used to examine two categorical variables together by showing the categories of one variable as the bars and the categories of the other variable as different colors within each bar. (SwR, Glossary) |
Statista | Statista is a global data and business intelligence platform with an extensive collection of statistics, reports, and insights on over 80,000 topics from 22,500 sources in 170 industries. Established in Germany in 2007, Statista operates in 13 locations worldwide and employs around 1,100 professionals. (<a href="https://www.statista.com/aboutus/">statista</a>) |
Violin Plots | Violing plots are visual displays of data that combine features of density plots and boxplots to show the distribution of numeric variables, often across groups. (SwR, Glossary) |
Waffle Charts | Waffle Charts are visual displays of data that show the parts of a whole similar to a pie chart; waffle charts are generally preferred over pie charts. (SwR, Glossary) |
Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os macOS Sequoia 15.1
#> system x86_64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Vienna
#> date 2024-11-13
#> pandoc 3.5 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> base64enc 0.1-3 2015-07-28 [2] CRAN (R 4.4.0)
#> bitops 1.0-9 2024-10-03 [2] CRAN (R 4.4.1)
#> caTools 1.18.3 2024-09-04 [2] CRAN (R 4.4.1)
#> cli 3.6.3 2024-06-21 [2] CRAN (R 4.4.1)
#> colorblindcheck 1.0.2 2023-05-13 [2] CRAN (R 4.4.0)
#> colorblindr 0.1.0 2024-04-25 [2] Github (clauswilke/colorblindr@90d64f8)
#> colorspace 2.1-1 2024-07-26 [2] CRAN (R 4.4.0)
#> commonmark 1.9.2 2024-10-04 [2] CRAN (R 4.4.1)
#> cowplot 1.1.3.9000 2024-04-25 [2] Github (wilkelab/cowplot@e1334a2)
#> curl 6.0.0 2024-11-05 [2] CRAN (R 4.4.1)
#> data.table 1.16.2 2024-10-10 [2] CRAN (R 4.4.1)
#> dichromat 2.0-0.1 2022-05-02 [2] CRAN (R 4.4.0)
#> digest 0.6.37 2024-08-19 [2] CRAN (R 4.4.1)
#> dplyr 1.1.4 2023-11-17 [2] CRAN (R 4.4.0)
#> DT 0.33 2024-04-04 [2] CRAN (R 4.4.0)
#> evaluate 1.0.1 2024-10-10 [2] CRAN (R 4.4.1)
#> extrafont 0.19 2023-01-18 [2] CRAN (R 4.4.0)
#> extrafontdb 1.0 2012-06-11 [2] CRAN (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [2] CRAN (R 4.4.0)
#> farver 2.1.2 2024-05-13 [2] CRAN (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.4.0)
#> forcats 1.0.0 2023-01-29 [2] CRAN (R 4.4.0)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.4.0)
#> ggmosaic * 0.3.4 2024-04-25 [2] Github (haleyjeppson/ggmosaic@b8045be)
#> ggokabeito 0.1.0 2021-10-18 [2] CRAN (R 4.4.0)
#> ggplot2 * 3.5.1 2024-04-23 [2] CRAN (R 4.4.0)
#> ggrepel 0.9.6 2024-09-07 [2] CRAN (R 4.4.1)
#> glossary * 1.0.0.9003 2024-08-05 [2] Github (debruine/glossary@05e4a61)
#> glue 1.8.0 2024-09-30 [2] CRAN (R 4.4.1)
#> gplots 3.2.0 2024-10-05 [2] CRAN (R 4.4.1)
#> gridExtra 2.3 2017-09-09 [2] CRAN (R 4.4.0)
#> gtable 0.3.6 2024-10-25 [2] CRAN (R 4.4.1)
#> gtools 3.9.5 2023-11-20 [2] CRAN (R 4.4.0)
#> here 1.0.1 2020-12-13 [2] CRAN (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.4.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.4.0)
#> httr 1.4.7 2023-08-15 [2] CRAN (R 4.4.0)
#> janitor 2.2.0 2023-02-02 [2] CRAN (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] CRAN (R 4.4.1)
#> kableExtra 1.4.0 2024-01-24 [2] CRAN (R 4.4.0)
#> KernSmooth 2.23-24 2024-05-17 [2] CRAN (R 4.4.2)
#> knitr 1.49 2024-11-08 [2] CRAN (R 4.4.1)
#> labeling 0.4.3 2023-08-29 [2] CRAN (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.2)
#> lazyeval 0.2.2 2019-03-15 [2] CRAN (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.4.0)
#> lmtest 0.9-40 2022-03-21 [2] CRAN (R 4.4.0)
#> lubridate 1.9.3 2023-09-27 [2] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.4.0)
#> markdown 1.13 2024-06-04 [2] CRAN (R 4.4.0)
#> MASS 7.3-61 2024-06-13 [2] CRAN (R 4.4.2)
#> Matrix 1.7-1 2024-10-18 [2] CRAN (R 4.4.2)
#> mgcv 1.9-1 2023-12-21 [2] CRAN (R 4.4.2)
#> microbenchmark 1.5.0 2024-09-04 [2] CRAN (R 4.4.1)
#> munsell 0.5.1 2024-04-01 [2] CRAN (R 4.4.0)
#> nlme 3.1-166 2024-08-14 [2] CRAN (R 4.4.2)
#> paletteer * 1.6.0 2024-01-21 [2] CRAN (R 4.4.0)
#> patchwork 1.3.0 2024-09-16 [2] CRAN (R 4.4.1)
#> pillar 1.9.0 2023-03-22 [2] CRAN (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.4.0)
#> plotly 4.10.4 2024-01-13 [2] CRAN (R 4.4.0)
#> plyr 1.8.9 2023-10-02 [2] CRAN (R 4.4.0)
#> prismatic 1.1.2 2024-04-10 [2] CRAN (R 4.4.0)
#> productplots 0.1.1 2016-07-02 [2] CRAN (R 4.4.0)
#> purrr 1.0.2 2023-08-10 [2] CRAN (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.4.0)
#> RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.4.0)
#> Rcpp 1.0.13-1 2024-11-02 [1] CRAN (R 4.4.1)
#> rematch2 2.1.2 2020-05-01 [2] CRAN (R 4.4.0)
#> repr 1.1.7 2024-03-22 [2] CRAN (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] CRAN (R 4.4.0)
#> rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.4.1)
#> rprojroot 2.0.4 2023-11-05 [2] CRAN (R 4.4.0)
#> rstudioapi 0.17.1 2024-10-22 [2] CRAN (R 4.4.1)
#> Rttf2pt1 1.3.12 2023-01-22 [2] CRAN (R 4.4.0)
#> rversions 2.1.2 2022-08-31 [2] CRAN (R 4.4.0)
#> scales 1.3.0 2023-11-28 [2] CRAN (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.4.0)
#> skimr 2.1.5 2022-12-23 [2] CRAN (R 4.4.0)
#> snakecase 0.11.1 2023-08-27 [2] CRAN (R 4.4.0)
#> spacesXYZ 1.3-0 2024-01-23 [2] CRAN (R 4.4.0)
#> stringi 1.8.4 2024-05-06 [2] CRAN (R 4.4.0)
#> stringr 1.5.1 2023-11-14 [2] CRAN (R 4.4.0)
#> svglite 2.1.3 2023-12-08 [2] CRAN (R 4.4.0)
#> systemfonts 1.1.0 2024-05-15 [2] CRAN (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [2] CRAN (R 4.4.0)
#> tidyr 1.3.1 2024-01-24 [2] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.4.0)
#> timechange 0.3.0 2024-01-18 [2] CRAN (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [2] CRAN (R 4.4.0)
#> vcd 1.4-13 2024-09-16 [2] CRAN (R 4.4.1)
#> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.4.0)
#> viridis 0.6.5 2024-01-29 [2] CRAN (R 4.4.0)
#> viridisLite 0.4.2 2023-05-02 [2] CRAN (R 4.4.0)
#> waffle 1.0.2 2023-09-30 [2] CRAN (R 4.4.0)
#> withr 3.0.2 2024-10-28 [2] CRAN (R 4.4.1)
#> xfun 0.49 2024-10-31 [2] CRAN (R 4.4.1)
#> xml2 1.3.6 2023-12-04 [2] CRAN (R 4.4.0)
#> yaml 2.3.10 2024-07-26 [2] CRAN (R 4.4.0)
#> zoo 1.8-12 2023-04-13 [2] CRAN (R 4.4.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.4-x86_64/library
#> [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────