3  Data visualization

3.1 Achievements to unlock

Objectives for chapter 03

SwR Achievements

  • Achievement 1: Choosing and creating graphs for a single categorical variable (Section 3.4)
  • Achievement 2: Choosing and creating graphs for a single continuous variable (Section 3.5)
  • Achievement 3: Choosing and creating graphs for two variables at once (Section 3.6)
  • Achievement 4: Ensuring graphs are well-formatted (Section 3.7)

Achievements for chapter 03

3.2 Gun violence in the US

  1. Research about gun violence in under developed. Harris refers to an article by Stark & Shah (2017) (Figure 1 and 2).

  2. Data for figure 3 (Homicides in the US by guns 2012-2016) comes from the Uniform Crime Reporting (UCR):

    The Uniform Crime Reporting (UCR) Program generates reliable statistics for use in law enforcement. It also provides information for students of criminal justice, researchers, the media, and the public. The program has been providing crime statistics since 1930.

    The UCR Program includes data from more than 18,000 city, university and college, county, state, tribal, and federal law enforcement agencies. Agencies participate voluntarily and submit their crime data either through a state UCR program or directly to the FBI’s UCR Program.

  3. Figure 4: Handguns were the most widely used type of gun for homicide in 2016.

  4. Gun manufacturers play an essential role: Figure 5 and 6.

3.3 Resources & Chapter Outline

3.3.1 Data, codebook, and R packages

Resource 3.1 : Data, codebook, and R packages for data visualization

Data

There are two options:

  1. Download the three data files from https://edge.sagepub.com/harris1e
    • nhanes_2011_2012_ch3.csv
    • fbi_deaths_2016_ch3.csv
    • gun_publications_funds_2004_2015_ch3.csv
  2. Download
    • the raw data directly from the Internet for the FBI deaths data
    • the NHANES data by following the instructions in Box 3.1
    • the gun_publications_funds_2004_2015_ch3.csv from https://edge.sagepub.com/harris1e

I will only work with the second option.

Manufactured firearms data not mentioned

Harris provides the “total_firearms_manufactured_US_1990to2015.csv” file with firearm production in the US from 1990-2015 but did not mention the source. I have looked around on the web and reported the results of my research in Section 3.3.2.1.

Codebook

Again there are two options:

  1. Download from https://edge.sagepub.com/harris1e
    • nhanes_demographics_2012_codebook.html
    • nhanes_auditory_2012_codebook.html
  2. Access the codebooks online on the National Health and Nutrition Examination Survey (NHANES) website

I will only work with the second option.

Packages

  1. Packages used with the book (sorted alphabetically)

Instead of downloading the file with {httr} I will use utils::download.file()

Harris lists the older {httr} package, but now there is {httr2}, “a modern re-imagining of {httr} that uses a pipe-based interface and solves more of the problems that API wrapping packages face.” (See Section A.40)

The {httr} package is in the book just used for downloading the excel file from the FBI website. For this specific task there is no need to download, to install and to learn a new package. You can use utils::download.file(), a function as I have it already applied successfully in Listing / Output 2.1.

  1. My additional packages (sorted alphabetically)

3.3.2 Get data

3.3.2.1 Gun production

There are different sources of data for this chapter. A special case are the data provided by Harris about guns manufactured in the US between 1990 and 2015. There is no source available because this dataset was not mentioned in the section about “Data, codebook, and R packages” (see: Section 3.3.1). But a Google searched revealed several interesting possible sources:

  • ATF: The original data are generated and published by the Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF). Scrolling down you will find the header “Annual Firearms Manufacturers And Export Report” (AFMER). But the data are separated by year and only available as a summarized PDF fact sheet. But finally I found in a downloaded .csv file from USAFacts a reference to a PDF file where all the data I am interesting in are listed. To the best of my knowledge there are no better accessible data on the ATF website.
  • Statista: With a free account of statista it is possible to download Number of firearms manufactured in the United States from 1986 to 2021. But here we are missing the detailed breakdown by type of firearms. Another restriction is that the publication of the data are only allowed if you have a professional or enterprise account, starting with € 199,- per month.
  • The Trace: Another option is to download the collected data by The Trace, an American non-profit journalism outlet devoted to gun-related news in the United States (Wikipeda). The quoted article is referring to a google spreadsheet where you can access the collected data for the US gun production from 1899 (!) until today.
  • USAFacts: Finally I found the data I was looking for in a easy accessible format to download on the website of USAFacts.org, a not-for-profit organization and website that provides data and reports on the United States population, its government’s finances, and government’s impact on society.(Wikipedia). The data range from 1986 to 2019 and they are based on the original ATF data from the PDF report. They are higher than the data provided by Harris because the include exports. The AFMER report excludes production for the U.S. military but includes also firearms purchased by domestic law enforcement agencies.

But even if you have data of manufactured, exported and imported guns, this does not provide the exact numbers of guns circulating in the US:

Although a few data points are consistently collected, there is a clear lack of data on firearms in the US. It is impossible, for instance, to know the total number of firearms owned in the United States, or how many guns were bought in the past year, or what the most commonly owned firearms are. Instead, Americans are left to draw limited conclusions from available data, such as the number of firearms processed by the National Firearm Administration (NFA), the number of background checks conducted for firearm purchase, and the number of firearms manufactured. However, none of these metrics provide a complete picture because state laws about background checks and gun registration differ widely. (USAFact.org)

Remark 3.1. How to calculate the numbers of circulated guns in the US

If you are interested to research the relationship between gun ownership in the USA and homicides then you would need to reflect how to get a better approximation as the yearly manufactured guns. Besides that not all gun manufacturer have reported their production numbers to the ATF, there is a big difference between gun production and gun sales. Additionally there are some other issues that influence the number of circulated guns in the US. So you have to take into account for instance

  • the export and import of guns,
  • that guns fall out of circulation because of broken part, attrition, or illegal exports

As an exercise I have subtracted from the manufactured gun the exported guns and have added the imported firearms. You will find the result in Listing / Output 3.14.

3.3.2.2 Three steps procedure

To get the data for this chapter is a three step procedure:

Procedure 3.1 : How to get data from the internet

  1. My first step is always to go to the website and download the file manually. Some people may believe that this is superfluous, but I think there are three advantages for this preparatory task:
    • Inspecting the website and checking if the URL is valid and points to the correct dataset.
    • Checking the file extension
    • Inspecting the file after downloaded to see if there is something to care about (e.g., the file starts with several lines, that are not data, or other issues).
  2. Download the file using utils::donwload.file().
  3. Read the imported file into R with the appropriate program function, in the first case readxl::read_excel()

3.3.2.3 Get & save data

Example 3.1 : Get data for chapter 3

R Code 3.1 : Get data from the FBI’s Uniform Crime Reporting database

Code
## run only once (manually)
# create a variable that contains the web
# URL for the data set
url1 <- base::paste0("https://ucr.fbi.gov/crime-in-the-u.s",
                     "/2016/crime-in-the-u.s.-2016/tables/",
                     "expanded-homicide-data-table-4.xls/output.xls")

## code worked in the console
## but did not work when rendered in quarto
# utils::download.file(url = url1,
#                      destfile = paste0(here::here(),
#                                        "/data/chap03/fbi_deaths.xls"),
#                      method = "libcurl",
#                      mode = "wb"
# )

## the following code line worked but I used {**curl**}
# httr::GET(url = url1, 
#           httr::write_disk(
#               path = base::paste0(here::here(),
#                     "/data/chap03/fbi_deaths.xls",
#               overwrite = TRUE)
#               )
#         )

curl::curl_download(url = url1,
                     destfile = paste0(here::here(),
                                       "/data/chap03/fbi_deaths.xls"),
                     quiet = TRUE,
                     mode = "wb"
)

fbi_deaths <- tibble::tibble(
     readxl::read_excel(path = paste0(here::here(), 
                                      "/data/chap03/fbi_deaths.xls"),
                        sheet = 1,
                        skip = 3,
                        n_max = 18))

save_data_file("chap03", fbi_deaths, "fbi_deaths.rds")
Listing / Output 3.1: Get data from the FBI’s Uniform Crime Reporting database

(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.4.)


In changed the recommended R code by Harris for downloading the FBI Excel Data I in several ways:

  1. Instead of of using {httr}, I tried first with utils::download.file(). This worked in the console (compiling the R code chunk), but did not work when rendered with Quarto. I changed to {curl} and used the curl_download() function which worked in both situations (see: Section A.11).
  2. Instead creating a data frame with base::data.frame() I used a tibble::tibble(). This has the advantage that the column names were not changed. In the original files the column names are years, but in base R is not allowed that column names start with a number. In tidyverse this is possible but you must refer to this column names with enclosed accents like 2016.
  3. Instead of saving the data as an Excel file I think that it is more convenient to store it as an R object with the extension “.rds”. (I believe that Harris saved it in the book only to get the same starting condition with the already downloaded file in the books companion web site.)

R Code 3.2 : Get NHANES data 2011-2012 from the CDC website with {RNHANES}

Code
## run only once (manually)
# download audiology data (AUQ_G)
# with demographics
nhanes_2012 <- 
    RNHANES::nhanes_load_data(
        file_name = "AUQ_G", 
        year = "2011-2012",
        demographics = TRUE)

save_data_file("chap03", nhanes_2012, "nhanes_2012.rds")
Listing / Output 3.2: Get NHANES data (2011-2012) from the CDC website with {RNHANES}

(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.5)


{RNHANES} combines as a package specialized to download data from the National Health and Nutrition Examination Survey (NHANES) step 2 and 3 of Procedure 3.1. But as it turned out it only doesn’t work with newer audiology data than 2012. I tried to use the package with data from 2016 and 2018 (For 2014 there are no audiology data available), but I got an error message.

Error in validate_year(year) : Invalid year: 2017-2018

The problem lies in the function RNHANES:::validate_year(). It qualifies in version 1.1.0 only downloads until ‘2013-2014’ as valid:


function (year, throw_error = TRUE) 
{
    if (length(year) > 1) {
        Map(validate_year, year, throw_error = throw_error) %>% 
            unlist() %>% unname() %>% return()
    }
    else {
        valid <- switch(as.character(year), `1999-2000` = TRUE, 
            `2001-2002` = TRUE, `2003-2004` = TRUE, `2005-2006` = TRUE, 
            `2007-2008` = TRUE, `2009-2010` = TRUE, `2011-2012` = TRUE, 
            `2013-2014` = TRUE, FALSE)
        if (throw_error == TRUE && valid == FALSE) {
            stop(paste0("Invalid year: ", year))
        }
        return(valid)
    }
}

Conclusion: Special data packages can facilitate your work, but to know how to download data programmatically on your own is an indispensable data science skill.

See tab “NHANES 2018” how this is done.

R Code 3.3 : Get NHANES data 2017-2018 from the CDC website with {haven}

Code
## run only once (manually)
# download audiology data (AUQ_J)
nhanes_2018 <- haven::read_xpt(
    file = "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/AUQ_J.XPT"
)

save_data_file("chap03", nhanes_2018, "nhanes_2018.rds")
Listing / Output 3.3: Get NHANES data 2017-2018 from the CDC website with {haven}

(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.6.)

The download with {haven} has the additional advantage that the variables are labelled as already explained in Section 1.8.

R Code 3.4 : Research funding for different kind of research topics (2004-2015)

Code
research_funding <- readr::read_csv(
    "data/chap03/gun_publications_funds_2004_2015_ch3.csv",
    show_col_types = FALSE)

save_data_file("chap03", research_funding, "research_funding.rds")

(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.7.)


R Code 3.5 : Get guns manufactured 1986-2019 from USAFacts.org

Code
guns_manufactured_2019 <-  readr::read_csv(
    "https://a.usafacts.org/api/v4/Metrics/csv/16185?&_ga=2.61283563.611609354.1708598369-436724430.1708598369", show_col_types = FALSE
)

save_data_file("chap03", guns_manufactured_2019, "guns_manufactured_2019.rds")

(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.8.)


R Code 3.6 : Get guns exported from a PDF report by the ATF

Code
guns_export <- tabulizer::extract_tables(
    "data/chap03/firearms_commerce_2019.pdf", 
    pages = 4
    )

guns_exported_2017 <- guns_export[[1]]

save_data_file("chap03", guns_exported_2017, "guns_exported_2017.rds")

(For this R code chunk is no output available, but you can inspect the data at Listing / Output 3.9.)


There are several innovation applied in this R code chunk:

1. First of all I have used the {**tabulizer**} package to scrap the export and import data tables from the original <a class='glossary' title='Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF): ATF’s responsibilities include the investigation and prevention of federal offenses involving the unlawful use, manufacture, and possession of firearms and explosives; acts of arson and bombings; and illegal trafficking of alcohol and tobacco products. The ATF also regulates, via licensing, the sale, possession, and transportation of firearms, ammunition, and explosives in interstate commerce. (ATF)'>ATF</a>-PDF. To make this procedure reproducible I have downloaded the [PDF from the ATF website](https://www.atf.gov/firearms/docs/report/2019-firearms-commerce-report/download) as I assume that this PDF will updated regularily. 

WATCH OUT! Read carefully the installation instructions

Please read carefully the installation advices at the start of the page but — even more important — Installing Java on Windows with Chocolatey if you working on a windows machine or generally the Troubleshooting section for macOs, Linux and Windows.

I have recoded the data in several ways: - I turned the resulted matrices from the {tabulizer} package into tibbles. - Now I could rename all the columns with one name vector. - As the export data end with the year 2017 I had to reduce the import data ans also my original gun manufactured file to this shorter period.

R Code 3.7 : Get and recode guns imported from a PDF by the ATF

Code
guns_import <- tabulizer::extract_tables(
    "data/chap03/firearms_commerce_2019.pdf", 
    pages = 6
    )

guns_imported_2018 <- guns_import[[1]]

save_data_file("chap03", guns_imported_2018, "guns_imported_2018.rds")

(For this R code chunk is no output available. But you can inspect the data at Listing / Output 3.10.)


3.3.3 Show raw data

Example 3.2 : Show raw data for chapter 3

R Code 3.8 : Show raw data from the FBI’s Uniform Crime Reporting database

Code
fbi_deaths <- base::readRDS("data/chap03/fbi_deaths.rds")

utils::str(fbi_deaths)
skimr::skim(fbi_deaths)
#> tibble [18 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ Weapons: chr [1:18] "Total" "Total firearms:" "Handguns" "Rifles" ...
#>  $ 2012   : num [1:18] 12888 8897 6404 298 310 ...
#>  $ 2013   : num [1:18] 12253 8454 5782 285 308 ...
#>  $ 2014   : num [1:18] 12270 8312 5673 258 264 ...
#>  $ 2015   : num [1:18] 13750 9778 6569 258 272 ...
#>  $ 2016   : num [1:18] 15070 11004 7105 374 262 ...
Data summary
Name fbi_deaths
Number of rows 18
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Weapons 0 1 4 44 0 18 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
2012 0 1 1926.28 3652.75 8 87.75 304.0 1403.50 12888 ▇▁▁▁▁
2013 0 1 1831.11 3449.87 2 87.25 296.5 1330.00 12253 ▇▁▁▁▁
2014 0 1 1825.11 3430.87 7 75.50 261.0 1414.25 12270 ▇▁▁▁▁
2015 0 1 2071.00 3918.39 1 87.75 265.0 1410.00 13750 ▇▁▁▁▁
2016 0 1 2285.78 4327.90 1 100.25 318.0 1428.75 15070 ▇▁▁▁▁
Listing / Output 3.4: Show raw data from the FBI’s Uniform Crime Reporting database

R Code 3.9 : Show raw NHANES data 2011-2012 from the CDC website with {RNHANES}

Code
nhanes_2012 <- base::readRDS("data/chap03/nhanes_2012.rds")

utils::str(nhanes_2012)
skimr::skim(nhanes_2012)
#> 'data.frame':    9364 obs. of  81 variables:
#>  $ SEQN      : num  62161 62162 62163 62164 62165 ...
#>  $ cycle     : chr  "2011-2012" "2011-2012" "2011-2012" "2011-2012" ...
#>  $ SDDSRVYR  : num  7 7 7 7 7 7 7 7 7 7 ...
#>  $ RIDSTATR  : num  2 2 2 2 2 2 2 2 2 2 ...
#>  $ RIAGENDR  : num  1 2 1 2 2 1 1 1 1 1 ...
#>  $ RIDAGEYR  : num  22 3 14 44 14 9 6 21 15 14 ...
#>  $ RIDAGEMN  : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ RIDRETH1  : num  3 1 5 3 4 3 5 5 5 1 ...
#>  $ RIDRETH3  : num  3 1 6 3 4 3 7 6 7 1 ...
#>  $ RIDEXMON  : num  2 1 2 1 2 2 1 1 1 1 ...
#>  $ RIDEXAGY  : num  NA 3 14 NA 14 10 6 NA 15 14 ...
#>  $ RIDEXAGM  : num  NA 41 177 NA 179 120 81 NA 181 175 ...
#>  $ DMQMILIZ  : num  2 NA NA 1 NA NA NA 2 NA NA ...
#>  $ DMQADFC   : num  NA NA NA 2 NA NA NA NA NA NA ...
#>  $ DMDBORN4  : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ DMDCITZN  : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ DMDYRSUS  : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ DMDEDUC3  : num  NA NA 8 NA 7 3 0 NA 9 7 ...
#>  $ DMDEDUC2  : num  3 NA NA 4 NA NA NA 3 NA NA ...
#>  $ DMDMARTL  : num  5 NA NA 1 NA NA NA 5 NA NA ...
#>  $ RIDEXPRG  : num  NA NA NA 2 NA NA NA NA NA NA ...
#>  $ SIALANG   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ SIAPROXY  : num  1 1 1 2 1 1 1 2 1 1 ...
#>  $ SIAINTRP  : num  2 2 2 2 2 2 2 1 2 2 ...
#>  $ FIALANG   : num  1 1 1 1 1 1 1 NA 1 2 ...
#>  $ FIAPROXY  : num  2 2 2 2 2 2 2 NA 2 2 ...
#>  $ FIAINTRP  : num  2 2 2 2 2 2 2 NA 2 2 ...
#>  $ MIALANG   : num  1 NA 1 NA 1 1 NA 1 1 1 ...
#>  $ MIAPROXY  : num  2 NA 2 NA 2 2 NA 2 2 2 ...
#>  $ MIAINTRP  : num  2 NA 2 NA 2 2 NA 2 2 2 ...
#>  $ AIALANGA  : num  1 NA 1 NA 1 NA NA 1 1 1 ...
#>  $ WTINT2YR  : num  102641 15458 7398 127351 12210 ...
#>  $ WTMEC2YR  : num  104237 16116 7869 127965 13384 ...
#>  $ SDMVPSU   : num  1 3 3 1 2 1 2 1 3 3 ...
#>  $ SDMVSTRA  : num  91 92 90 94 90 91 103 92 91 92 ...
#>  $ INDHHIN2  : num  14 4 15 8 4 77 14 2 15 9 ...
#>  $ INDFMIN2  : num  14 4 15 8 4 77 14 2 15 9 ...
#>  $ INDFMPIR  : num  3.15 0.6 4.07 1.67 0.57 NA 3.48 0.33 5 2.46 ...
#>  $ DMDHHSIZ  : num  5 6 5 5 5 6 5 5 4 4 ...
#>  $ DMDFMSIZ  : num  5 6 5 5 5 6 5 5 4 4 ...
#>  $ DMDHHSZA  : num  0 2 0 1 1 0 0 0 0 0 ...
#>  $ DMDHHSZB  : num  1 2 2 2 2 4 2 1 2 2 ...
#>  $ DMDHHSZE  : num  0 0 1 0 0 0 1 0 0 0 ...
#>  $ DMDHRGND  : num  2 2 1 1 2 1 1 1 1 1 ...
#>  $ DMDHRAGE  : num  50 24 42 52 33 44 43 51 38 43 ...
#>  $ DMDHRBR4  : num  1 1 1 1 2 1 1 2 2 2 ...
#>  $ DMDHREDU  : num  5 3 5 4 2 5 4 1 5 3 ...
#>  $ DMDHRMAR  : num  1 6 1 1 77 1 1 4 1 1 ...
#>  $ DMDHSEDU  : num  5 NA 4 4 NA 5 5 NA 5 4 ...
#>  $ AUQ054    : num  2 1 2 1 2 1 1 2 2 2 ...
#>  $ AUQ060    : num  1 NA NA NA NA NA NA 2 NA NA ...
#>  $ AUQ070    : num  NA NA NA NA NA NA NA 1 NA NA ...
#>  $ AUQ080    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ090    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ100    : num  5 NA NA 4 NA NA NA 4 NA NA ...
#>  $ AUQ110    : num  5 NA NA 5 NA NA NA 4 NA NA ...
#>  $ AUQ136    : num  1 NA NA 2 NA NA NA 2 NA NA ...
#>  $ AUQ138    : num  1 NA NA 2 NA NA NA 2 NA NA ...
#>  $ AUQ144    : num  4 NA NA 4 NA NA NA 2 NA NA ...
#>  $ AUQ146    : num  2 NA NA 2 NA NA NA 2 NA NA ...
#>  $ AUD148    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ152    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ154    : num  2 NA NA 2 NA NA NA 2 NA NA ...
#>  $ AUQ191    : num  2 NA NA 1 NA NA NA 2 NA NA ...
#>  $ AUQ250    : num  NA NA NA 5 NA NA NA NA NA NA ...
#>  $ AUQ255    : num  NA NA NA 1 NA NA NA NA NA NA ...
#>  $ AUQ260    : num  NA NA NA 2 NA NA NA NA NA NA ...
#>  $ AUQ270    : num  NA NA NA 1 NA NA NA NA NA NA ...
#>  $ AUQ280    : num  NA NA NA 1 NA NA NA NA NA NA ...
#>  $ AUQ300    : num  2 NA NA 1 NA NA NA 2 NA NA ...
#>  $ AUQ310    : num  NA NA NA 2 NA NA NA NA NA NA ...
#>  $ AUQ320    : num  NA NA NA 1 NA NA NA NA NA NA ...
#>  $ AUQ330    : num  2 NA NA 2 NA NA NA 2 NA NA ...
#>  $ AUQ340    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ350    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ360    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ AUQ370    : num  2 NA NA 2 NA NA NA 2 NA NA ...
#>  $ AUQ380    : num  1 NA NA 6 NA NA NA 5 NA NA ...
#>  $ file_name : chr  "AUQ_G" "AUQ_G" "AUQ_G" "AUQ_G" ...
#>  $ begin_year: num  2011 2011 2011 2011 2011 ...
#>  $ end_year  : num  2012 2012 2012 2012 2012 ...
Data summary
Name nhanes_2012
Number of rows 9364
Number of columns 81
_______________________
Column type frequency:
character 2
numeric 79
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
cycle 0 1 9 9 0 1 0
file_name 0 1 5 5 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SEQN 0 1.00 67029.29 2811.36 62161.00 64598.75 67024.50 69457.25 71916.0 ▇▇▇▇▇
SDDSRVYR 0 1.00 7.00 0.00 7.00 7.00 7.00 7.00 7.0 ▁▁▇▁▁
RIDSTATR 0 1.00 1.96 0.20 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
RIAGENDR 0 1.00 1.50 0.50 1.00 1.00 2.00 2.00 2.0 ▇▁▁▁▇
RIDAGEYR 0 1.00 32.72 24.22 1.00 11.00 28.00 53.00 80.0 ▇▅▃▃▃
RIDAGEMN 9130 0.02 17.94 3.41 12.00 15.00 18.00 21.00 24.0 ▇▆▇▆▆
RIDRETH1 0 1.00 3.24 1.25 1.00 3.00 3.00 4.00 5.0 ▃▃▇▇▅
RIDRETH3 0 1.00 3.45 1.60 1.00 3.00 3.00 4.00 7.0 ▆▇▇▁▅
RIDEXMON 408 0.96 1.52 0.50 1.00 1.00 2.00 2.00 2.0 ▇▁▁▁▇
RIDEXAGY 5946 0.37 9.64 5.18 2.00 5.00 9.00 14.00 20.0 ▇▇▅▆▃
RIDEXAGM 5737 0.39 114.53 64.66 12.00 57.00 109.00 167.00 239.0 ▇▇▇▅▅
DMQMILIZ 3357 0.64 1.91 0.29 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
DMQADFC 8813 0.06 1.50 0.59 1.00 1.00 1.00 2.00 9.0 ▇▁▁▁▁
DMDBORN4 0 1.00 1.27 2.11 1.00 1.00 1.00 1.00 99.0 ▇▁▁▁▁
DMDCITZN 5 1.00 1.13 0.44 1.00 1.00 1.00 1.00 7.0 ▇▁▁▁▁
DMDYRSUS 7292 0.22 7.44 14.71 1.00 3.00 5.00 6.00 99.0 ▇▁▁▁▁
DMDEDUC3 6765 0.28 6.04 6.13 0.00 2.00 5.00 9.00 66.0 ▇▁▁▁▁
DMDEDUC2 3804 0.59 3.47 1.28 1.00 3.00 4.00 5.00 9.0 ▃▇▃▁▁
DMDMARTL 3804 0.59 2.75 3.34 1.00 1.00 2.00 5.00 99.0 ▇▁▁▁▁
RIDEXPRG 8156 0.13 2.02 0.34 1.00 2.00 2.00 2.00 3.0 ▁▁▇▁▁
SIALANG 0 1.00 1.12 0.33 1.00 1.00 1.00 1.00 2.0 ▇▁▁▁▁
SIAPROXY 4 1.00 1.65 0.48 1.00 1.00 2.00 2.00 2.0 ▅▁▁▁▇
SIAINTRP 0 1.00 1.96 0.18 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
FIALANG 99 0.99 1.08 0.27 1.00 1.00 1.00 1.00 2.0 ▇▁▁▁▁
FIAPROXY 99 0.99 2.00 0.04 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
FIAINTRP 99 0.99 1.97 0.17 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
MIALANG 2651 0.72 1.05 0.22 1.00 1.00 1.00 1.00 2.0 ▇▁▁▁▁
MIAPROXY 2651 0.72 1.99 0.08 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
MIAINTRP 2651 0.72 1.97 0.17 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
AIALANGA 3610 0.61 1.11 0.37 1.00 1.00 1.00 1.00 3.0 ▇▁▁▁▁
WTINT2YR 0 1.00 32347.76 34440.37 3600.63 11761.87 18490.48 35509.75 220233.3 ▇▁▁▁▁
WTMEC2YR 0 1.00 32347.76 35612.05 0.00 11582.07 18605.93 36132.36 222579.8 ▇▁▁▁▁
SDMVPSU 0 1.00 1.64 0.64 1.00 1.00 2.00 2.00 3.0 ▇▁▇▁▂
SDMVSTRA 0 1.00 95.88 3.98 90.00 92.00 96.00 99.00 103.0 ▇▆▃▆▅
INDHHIN2 78 0.99 11.53 16.49 1.00 5.00 7.00 14.00 99.0 ▇▁▁▁▁
INDFMIN2 49 0.99 11.10 16.20 1.00 4.00 7.00 14.00 99.0 ▇▁▁▁▁
INDFMPIR 805 0.91 2.22 1.64 0.00 0.87 1.64 3.62 5.0 ▇▇▃▂▆
DMDHHSIZ 0 1.00 3.73 1.70 1.00 2.00 4.00 5.00 7.0 ▇▅▆▅▅
DMDFMSIZ 0 1.00 3.56 1.78 1.00 2.00 4.00 5.00 7.0 ▇▅▅▃▃
DMDHHSZA 0 1.00 0.49 0.77 0.00 0.00 0.00 1.00 3.0 ▇▂▁▁▁
DMDHHSZB 0 1.00 0.95 1.13 0.00 0.00 1.00 2.00 4.0 ▇▃▃▁▁
DMDHHSZE 0 1.00 0.41 0.71 0.00 0.00 0.00 1.00 3.0 ▇▂▁▁▁
DMDHRGND 0 1.00 1.49 0.50 1.00 1.00 1.00 2.00 2.0 ▇▁▁▁▇
DMDHRAGE 0 1.00 45.85 15.88 18.00 34.00 43.00 57.00 80.0 ▅▇▆▃▃
DMDHRBR4 361 0.96 1.44 3.12 1.00 1.00 1.00 2.00 99.0 ▇▁▁▁▁
DMDHREDU 358 0.96 3.43 1.33 1.00 2.00 4.00 4.00 9.0 ▅▇▃▁▁
DMDHRMAR 128 0.99 3.17 7.49 1.00 1.00 1.00 5.00 99.0 ▇▁▁▁▁
DMDHSEDU 4716 0.50 3.59 1.36 1.00 3.00 4.00 5.00 9.0 ▃▇▆▁▁
AUQ054 1 1.00 1.93 4.07 1.00 1.00 2.00 2.00 99.0 ▇▁▁▁▁
AUQ060 6459 0.31 1.34 0.92 1.00 1.00 1.00 2.00 9.0 ▇▁▁▁▁
AUQ070 8587 0.08 1.30 0.65 1.00 1.00 1.00 2.00 9.0 ▇▁▁▁▁
AUQ080 9151 0.02 1.29 0.69 1.00 1.00 1.00 2.00 9.0 ▇▁▁▁▁
AUQ090 9310 0.01 1.46 0.50 1.00 1.00 1.00 2.00 2.0 ▇▁▁▁▇
AUQ100 4689 0.50 4.14 1.13 1.00 4.00 5.00 5.00 9.0 ▂▆▇▁▁
AUQ110 4689 0.50 4.59 0.85 1.00 4.00 5.00 5.00 9.0 ▁▂▇▁▁
AUQ136 3805 0.59 2.02 1.32 1.00 2.00 2.00 2.00 9.0 ▇▁▁▁▁
AUQ138 3805 0.59 2.00 0.62 1.00 2.00 2.00 2.00 9.0 ▇▁▁▁▁
AUQ144 4689 0.50 3.84 1.53 1.00 3.00 4.00 5.00 9.0 ▅▇▇▁▁
AUQ146 4689 0.50 1.99 0.12 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
AUD148 9301 0.01 1.03 0.18 1.00 1.00 1.00 1.00 2.0 ▇▁▁▁▁
AUQ152 9303 0.01 3.15 1.63 1.00 1.00 3.00 5.00 5.0 ▇▂▃▅▇
AUQ154 4689 0.50 1.98 0.12 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
AUQ191 4689 0.50 1.86 0.38 1.00 2.00 2.00 2.00 9.0 ▇▁▁▁▁
AUQ250 8680 0.07 3.22 1.50 1.00 2.00 3.00 5.00 9.0 ▆▇▆▁▁
AUQ255 8680 0.07 3.02 1.57 1.00 1.00 3.00 5.00 9.0 ▇▇▅▁▁
AUQ260 8680 0.07 1.88 0.72 1.00 2.00 2.00 2.00 9.0 ▇▁▁▁▁
AUQ270 8680 0.07 1.62 0.56 1.00 1.00 2.00 2.00 9.0 ▇▁▁▁▁
AUQ280 8680 0.07 2.15 1.00 1.00 1.00 2.00 3.00 5.0 ▅▇▅▁▁
AUQ300 4689 0.50 1.66 0.48 1.00 1.00 2.00 2.00 7.0 ▇▁▁▁▁
AUQ310 7751 0.17 2.11 1.42 1.00 1.00 2.00 3.00 9.0 ▇▃▁▁▁
AUQ320 7751 0.17 3.06 1.81 1.00 1.00 3.00 5.00 9.0 ▇▂▇▁▁
AUQ330 4689 0.50 1.69 0.50 1.00 1.00 2.00 2.00 3.0 ▅▁▇▁▁
AUQ340 7828 0.16 4.77 4.59 1.00 3.00 5.00 7.00 99.0 ▇▁▁▁▁
AUQ350 7828 0.16 1.37 0.52 1.00 1.00 1.00 2.00 9.0 ▇▁▁▁▁
AUQ360 8383 0.10 4.65 3.59 1.00 3.00 5.00 7.00 99.0 ▇▁▁▁▁
AUQ370 4689 0.50 1.88 0.32 1.00 2.00 2.00 2.00 2.0 ▁▁▁▁▇
AUQ380 4689 0.50 4.68 2.11 1.00 5.00 5.00 6.00 77.0 ▇▁▁▁▁
begin_year 0 1.00 2011.00 0.00 2011.00 2011.00 2011.00 2011.00 2011.0 ▁▁▇▁▁
end_year 0 1.00 2012.00 0.00 2012.00 2012.00 2012.00 2012.00 2012.0 ▁▁▇▁▁
Listing / Output 3.5: Show raw NHANES data (2011-2012) from the CDC website with {RNHANES}

R Code 3.10 : Show raw NHANES data 2017-2018 from the CDC website with {haven}

Code
nhanes_2018 <- base::readRDS("data/chap03/nhanes_2018.rds")

utils::str(nhanes_2018)
skimr::skim(nhanes_2018)
#> tibble [8,897 × 58] (S3: tbl_df/tbl/data.frame)
#>  $ SEQN   : num [1:8897] 93703 93704 93705 93706 93707 ...
#>   ..- attr(*, "label")= chr "Respondent sequence number"
#>  $ AUQ054 : num [1:8897] 1 1 2 1 1 1 1 1 1 1 ...
#>   ..- attr(*, "label")= chr "General condition of hearing"
#>  $ AUQ060 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Hear a whisper from across a quiet room?"
#>  $ AUQ070 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Hear normal voice across a quiet room?"
#>  $ AUQ080 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Hear a shout from across a quiet room?"
#>  $ AUQ090 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Hear if spoken loudly to in better ear?"
#>  $ AUQ400 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "When began to have hearing loss?"
#>  $ AUQ410A: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Genetic/Hereditary"
#>  $ AUQ410B: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Ear infections"
#>  $ AUQ410C: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Ear diseases"
#>  $ AUQ410D: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Illness/Infections"
#>  $ AUQ410E: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Drugs/Medications"
#>  $ AUQ410F: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Head/Neck injury"
#>  $ AUQ410G: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Loud brief noise"
#>  $ AUQ410H: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Long-term noise"
#>  $ AUQ410I: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Aging"
#>  $ AUQ410J: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Cause of hearing loss-Others"
#>  $ AUQ156 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Ever used assistive listening devices?"
#>  $ AUQ420 : num [1:8897] NA NA NA 2 1 NA 2 NA 2 NA ...
#>   ..- attr(*, "label")= chr "Ever had ear infections or earaches?"
#>  $ AUQ430 : num [1:8897] NA NA NA NA 1 NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Ever had 3 or more ear infections?"
#>  $ AUQ139 : num [1:8897] NA NA NA NA 2 NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Ever had tube placed in ear?"
#>  $ AUQ144 : num [1:8897] NA NA NA 1 3 NA 4 NA 2 NA ...
#>   ..- attr(*, "label")= chr "Last time hearing tested by specialist?"
#>  $ AUQ147 : num [1:8897] NA NA NA 2 2 NA 2 NA 2 NA ...
#>   ..- attr(*, "label")= chr "Now use hearing aid/ amplifier/ implant"
#>  $ AUQ149A: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Now use a hearing aid"
#>  $ AUQ149B: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Now use a personal sound amplifier"
#>  $ AUQ149C: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Now have a cochlear implant"
#>  $ AUQ153 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Past 2 weeks how often worn hearing aid"
#>  $ AUQ630 : num [1:8897] NA NA NA 2 2 NA 2 NA 2 NA ...
#>   ..- attr(*, "label")= chr "Ever worn hearing aid/amplifier/implant"
#>  $ AUQ440 : num [1:8897] NA NA NA NA 1 NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Ever had Special Ed/Early Intervention"
#>  $ AUQ450A: num [1:8897] NA NA NA NA 1 NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Had service for speech-language"
#>  $ AUQ450B: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Had service for reading"
#>  $ AUQ450C: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Had service for hearing/listening skills"
#>  $ AUQ450D: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Had service for intellectual disability"
#>  $ AUQ450E: num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Had service for movement/mobility issues"
#>  $ AUQ450F: num [1:8897] NA NA NA NA 6 NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Had service for other disabilities"
#>  $ AUQ460 : num [1:8897] NA NA NA NA 2 NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Exposed to very loud noise 10+ hrs/wk"
#>  $ AUQ470 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "How long exposed to loud noise 10+hrs/wk"
#>  $ AUQ101 : num [1:8897] NA NA NA 5 NA NA 5 NA 5 NA ...
#>   ..- attr(*, "label")= chr "Difficult to follow conversation w/noise"
#>  $ AUQ110 : num [1:8897] NA NA NA 5 NA NA 5 NA 5 NA ...
#>   ..- attr(*, "label")= chr "Hearing causes frustration when talking?"
#>  $ AUQ480 : num [1:8897] NA NA NA 5 NA NA 5 NA 4 NA ...
#>   ..- attr(*, "label")= chr "Avoid groups, limit social life?"
#>  $ AUQ490 : num [1:8897] NA NA NA 2 NA NA 1 NA 2 NA ...
#>   ..- attr(*, "label")= chr "Problem with dizziness, lightheadedness?"
#>  $ AUQ191 : num [1:8897] NA NA NA 2 NA NA 2 NA 1 NA ...
#>   ..- attr(*, "label")= chr "Ears ringing, buzzing past year?"
#>  $ AUQ250 : num [1:8897] NA NA NA NA NA NA NA NA 1 NA ...
#>   ..- attr(*, "label")= chr "How long bothered by ringing, buzzing?"
#>  $ AUQ255 : num [1:8897] NA NA NA NA NA NA NA NA 5 NA ...
#>   ..- attr(*, "label")= chr "In past yr how often had ringing/roaring"
#>  $ AUQ260 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#>   ..- attr(*, "label")= chr "Bothered by ringing after loud sounds?"
#>  $ AUQ270 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#>   ..- attr(*, "label")= chr "Bothered by ringing when going to sleep?"
#>  $ AUQ280 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#>   ..- attr(*, "label")= chr "How much of a problem is ringing?"
#>  $ AUQ500 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#>   ..- attr(*, "label")= chr "Discussed ringing with doctor?"
#>  $ AUQ300 : num [1:8897] NA NA NA 2 NA NA 2 NA 2 NA ...
#>   ..- attr(*, "label")= chr "Ever used firearms for any reason?"
#>  $ AUQ310 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "How many total rounds ever fired?"
#>  $ AUQ320 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Wear hearing protection when shooting?"
#>  $ AUQ330 : num [1:8897] NA NA NA 2 NA NA 2 NA 1 NA ...
#>   ..- attr(*, "label")= chr "Ever had job exposure to loud noise?"
#>  $ AUQ340 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#>   ..- attr(*, "label")= chr "How long exposed to loud noise at work"
#>  $ AUQ350 : num [1:8897] NA NA NA NA NA NA NA NA 2 NA ...
#>   ..- attr(*, "label")= chr "Ever exposed to very loud noise at work"
#>  $ AUQ360 : num [1:8897] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "How long exposed to very loud noise?"
#>  $ AUQ370 : num [1:8897] NA NA NA 2 NA NA 2 NA 1 NA ...
#>   ..- attr(*, "label")= chr "Had off-work exposure to loud noise?"
#>  $ AUQ510 : num [1:8897] NA NA NA NA NA NA NA NA 1 NA ...
#>   ..- attr(*, "label")= chr "How long exposed to loud noise 10+hrs/wk"
#>  $ AUQ380 : num [1:8897] NA NA NA 6 NA NA 5 NA 3 NA ...
#>   ..- attr(*, "label")= chr "Past year: worn hearing protection?"
#>  - attr(*, "label")= chr "Audiometry"
Data summary
Name nhanes_2018
Number of rows 8897
Number of columns 58
_______________________
Column type frequency:
numeric 58
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SEQN 0 1.00 98333.86 2671.90 93703 96017 98348 100645 102956 ▇▇▇▇▇
AUQ054 0 1.00 1.79 2.02 1 1 2 2 99 ▇▁▁▁▁
AUQ060 7234 0.19 1.49 1.13 1 1 1 2 9 ▇▁▁▁▁
AUQ070 8293 0.07 1.46 1.02 1 1 1 2 9 ▇▁▁▁▁
AUQ080 8683 0.02 1.36 1.01 1 1 1 2 9 ▇▁▁▁▁
AUQ090 8842 0.01 1.62 1.52 1 1 1 2 9 ▇▁▁▁▁
AUQ400 8291 0.07 7.12 12.62 1 4 6 7 99 ▇▁▁▁▁
AUQ410A 8813 0.01 34.83 46.87 1 1 1 99 99 ▇▁▁▁▅
AUQ410B 8816 0.01 2.00 0.00 2 2 2 2 2 ▁▁▇▁▁
AUQ410C 8879 0.00 3.00 0.00 3 3 3 3 3 ▁▁▇▁▁
AUQ410D 8871 0.00 4.00 0.00 4 4 4 4 4 ▁▁▇▁▁
AUQ410E 8889 0.00 5.00 0.00 5 5 5 5 5 ▁▁▇▁▁
AUQ410F 8879 0.00 6.00 0.00 6 6 6 6 6 ▁▁▇▁▁
AUQ410G 8807 0.01 7.00 0.00 7 7 7 7 7 ▁▁▇▁▁
AUQ410H 8709 0.02 8.00 0.00 8 8 8 8 8 ▁▁▇▁▁
AUQ410I 8585 0.04 9.00 0.00 9 9 9 9 9 ▁▁▇▁▁
AUQ410J 8871 0.00 10.00 0.00 10 10 10 10 10 ▁▁▇▁▁
AUQ156 8291 0.07 1.80 0.40 1 2 2 2 2 ▂▁▁▁▇
AUQ420 5544 0.38 1.55 0.71 1 1 2 2 9 ▇▁▁▁▁
AUQ430 7276 0.18 1.62 0.96 1 1 2 2 9 ▇▁▁▁▁
AUQ139 7276 0.18 1.95 0.79 1 2 2 2 9 ▇▁▁▁▁
AUQ144 5544 0.38 3.00 1.88 1 2 2 5 9 ▇▃▃▁▁
AUQ147 5544 0.38 1.94 0.25 1 2 2 2 2 ▁▁▁▁▇
AUQ149A 8690 0.02 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
AUQ149B 8889 0.00 2.00 0.00 2 2 2 2 2 ▁▁▇▁▁
AUQ149C 8893 0.00 3.00 0.00 3 3 3 3 3 ▁▁▇▁▁
AUQ153 8682 0.02 3.65 0.91 1 3 4 4 5 ▁▁▂▇▁
AUQ630 5759 0.35 1.99 0.12 1 2 2 2 2 ▁▁▁▁▇
AUQ440 7181 0.19 1.79 0.44 1 2 2 2 9 ▇▁▁▁▁
AUQ450A 8658 0.03 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
AUQ450B 8742 0.02 2.00 0.00 2 2 2 2 2 ▁▁▇▁▁
AUQ450C 8859 0.00 3.00 0.00 3 3 3 3 3 ▁▁▇▁▁
AUQ450D 8849 0.01 4.00 0.00 4 4 4 4 4 ▁▁▇▁▁
AUQ450E 8855 0.00 5.00 0.00 5 5 5 5 5 ▁▁▇▁▁
AUQ450F 8819 0.01 6.00 0.00 6 6 6 6 6 ▁▁▇▁▁
AUQ460 7181 0.19 1.99 0.27 1 2 2 2 9 ▇▁▁▁▁
AUQ470 8866 0.00 2.55 1.75 1 1 2 4 9 ▇▆▁▁▁
AUQ101 7261 0.18 3.64 1.25 1 3 4 5 9 ▃▇▅▁▁
AUQ110 7261 0.18 4.24 1.11 1 4 5 5 9 ▂▅▇▁▁
AUQ480 7261 0.18 4.59 0.90 1 5 5 5 9 ▁▂▇▁▁
AUQ490 7261 0.18 1.65 0.58 1 1 2 2 9 ▇▁▁▁▁
AUQ191 7261 0.18 1.84 0.45 1 2 2 2 9 ▇▁▁▁▁
AUQ250 8615 0.03 3.58 1.61 1 2 4 5 9 ▆▆▇▁▁
AUQ255 8615 0.03 2.62 1.67 1 1 2 4 9 ▇▃▃▁▁
AUQ260 8615 0.03 1.96 1.02 1 2 2 2 9 ▇▁▁▁▁
AUQ270 8615 0.03 1.70 0.64 1 1 2 2 9 ▇▁▁▁▁
AUQ280 8615 0.03 2.18 0.96 1 1 2 3 5 ▆▇▆▁▁
AUQ500 8615 0.03 1.53 0.81 1 1 1 2 9 ▇▁▁▁▁
AUQ300 7261 0.18 1.61 0.51 1 1 2 2 7 ▇▁▁▁▁
AUQ310 8251 0.07 2.09 1.40 1 1 2 3 9 ▇▃▁▁▁
AUQ320 8251 0.07 3.52 1.71 1 2 4 5 9 ▆▃▇▁▁
AUQ330 7261 0.18 1.77 0.60 1 1 2 2 9 ▇▁▁▁▁
AUQ340 8412 0.05 4.99 2.14 1 3 6 7 7 ▃▂▂▂▇
AUQ350 8412 0.05 1.42 0.69 1 1 1 2 9 ▇▁▁▁▁
AUQ360 8600 0.03 5.82 7.94 1 4 6 7 99 ▇▁▁▁▁
AUQ370 7261 0.18 1.89 0.36 1 2 2 2 9 ▇▁▁▁▁
AUQ510 8712 0.02 2.75 1.40 1 1 3 4 9 ▆▇▁▁▁
AUQ380 7261 0.18 4.83 1.18 1 5 5 5 6 ▁▁▁▇▃
Listing / Output 3.6: Show raw NHANES data 2017-2018 from the CDC website with {haven}

R Code 3.11 : Show raw research funding data for different kinds of research topics (2004-2015)

Code
research_funding <- base::readRDS("data/chap03/research_funding.rds")

utils::str(research_funding)
skimr::skim(research_funding)
#> spc_tbl_ [30 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ Cause of Death                       : chr [1:30] "Heart disease" "Cancer" "Lung disease" "Cerebrovascular disease" ...
#>  $ Mortality Rate per 100,000 Population: num [1:30] 201.5 186.4 44.7 43.9 26 ...
#>  $ Publications                         : num [1:30] 358915 1078144 259196 136502 40179 ...
#>  $ Funding                              : num [1:30] 1.29e+10 2.06e+10 1.09e+10 7.58e+09 5.93e+09 ...
#>  $ Predicted Publications               : num [1:30] 260875 248160 99170 98070 70085 ...
#>  $ Publications (Studentized Residuals) : num [1:30] 0.27 1.25 0.75 0.26 -0.43 0.66 -0.11 -0.73 -0.13 0.02 ...
#>  $ Predicted Funding                    : chr [1:30] "$12,075,841,121" "$11,415,379,454" "$4,065,188,916" "$4,014,434,824" ...
#>  $ Funding (Studentized Residuals)      : num [1:30] 0.04 0.33 0.52 0.34 0.4 0.92 -0.19 -0.64 -0.57 -0.7 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   `Cause of Death` = col_character(),
#>   ..   `Mortality Rate per 100,000 Population` = col_double(),
#>   ..   Publications = col_double(),
#>   ..   Funding = col_double(),
#>   ..   `Predicted Publications` = col_number(),
#>   ..   `Publications (Studentized Residuals)` = col_double(),
#>   ..   `Predicted Funding` = col_character(),
#>   ..   `Funding (Studentized Residuals)` = col_double()
#>   .. )
#>  - attr(*, "problems")=<externalptr>
Data summary
Name research_funding
Number of rows 30
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Cause of Death 0 1 3 44 0 30 0
Predicted Funding 0 1 12 15 0 29 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Mortality Rate per 100,000 Population 0 1 22.42 4.807000e+01 0.59 1.77 7.760000e+00 1.481000e+01 2.015400e+02 ▇▁▁▁▁
Publications 0 1 93913.90 2.028434e+05 1034.00 12550.00 3.949800e+04 5.406425e+04 1.078144e+06 ▇▁▁▁▁
Funding 0 1 4136597358.43 5.799997e+09 3474852.00 358026213.50 1.659781e+09 4.830357e+09 2.059661e+10 ▇▂▁▁▁
Predicted Publications 0 1 48162.03 6.143627e+04 6163.00 12448.50 3.221550e+04 4.880050e+04 2.608750e+05 ▇▁▁▁▁
Publications (Studentized Residuals) 0 1 -0.01 1.060000e+00 -2.63 -0.36 1.200000e-01 7.400000e-01 1.460000e+00 ▂▁▂▇▇
Funding (Studentized Residuals) 0 1 -0.02 1.090000e+00 -3.71 -0.53 3.400000e-01 5.800000e-01 1.920000e+00 ▁▁▃▇▂
Listing / Output 3.7: Show raw research funding data for different kinds of research topics (2004-2015)

R Code 3.12 : Show raw data about guns manufactured in ten US (1986 - 2019) from USAFact.org website

Code
guns_manufactured_2019 <-  base::readRDS("data/chap03/guns_manufactured_2019.rds")

utils::str(guns_manufactured_2019)
skimr::skim(guns_manufactured_2019)
#> spc_tbl_ [11 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ Years: chr [1:11] "Firearms manufactured (Items)" "By type of firearm" "Pistols (Items)" "Revolvers (Items)" ...
#>  $ 1980 : chr [1:11] NA NA NA NA ...
#>  $ 1981 : chr [1:11] NA NA NA NA ...
#>  $ 1982 : chr [1:11] NA NA NA NA ...
#>  $ 1983 : chr [1:11] NA NA NA NA ...
#>  $ 1984 : chr [1:11] NA NA NA NA ...
#>  $ 1985 : chr [1:11] NA NA NA NA ...
#>  $ 1986 : chr [1:11] "3040934" NA "662973" "761414" ...
#>  $ 1987 : num [1:11] 3559663 NA 964561 722512 857949 ...
#>  $ 1988 : num [1:11] 3963877 NA 1101011 754744 928070 ...
#>  $ 1989 : num [1:11] 4418393 NA 1404753 628573 935541 ...
#>  $ 1990 : num [1:11] 3959968 NA 1371427 470495 848948 ...
#>  $ 1991 : num [1:11] 3563106 NA 1378252 456966 828426 ...
#>  $ 1992 : num [1:11] 4175836 NA 1669537 469413 1018204 ...
#>  $ 1993 : num [1:11] 5055637 NA 2093362 562292 1144940 ...
#>  $ 1994 : num [1:11] 5173217 NA 2004298 586450 1254926 ...
#>  $ 1995 : num [1:11] 4316342 NA 1195284 527664 1173645 ...
#>  $ 1996 : num [1:11] 3854439 NA 987528 498944 925732 ...
#>  $ 1997 : num [1:11] 3593504 NA 1036077 370428 915978 ...
#>  $ 1998 : num [1:11] 3713590 NA 960365 324390 868639 ...
#>  $ 1999 : num [1:11] 4047747 NA 995446 335784 1106995 ...
#>  $ 2000 : num [1:11] 3793541 NA 962901 318960 898442 ...
#>  $ 2001 : num [1:11] 2932655 NA 626836 320143 679813 ...
#>  $ 2002 : num [1:11] 3366895 NA 741514 347070 741325 ...
#>  $ 2003 : num [1:11] 3308404 NA 811660 309364 726078 ...
#>  $ 2004 : num [1:11] 3099025 NA 728511 294099 731769 ...
#>  $ 2005 : num [1:11] 3241494 NA 803425 274205 709313 ...
#>  $ 2006 : num [1:11] 3653324 NA 1021260 385069 714618 ...
#>  $ 2007 : num [1:11] 3922613 NA 1219664 391334 645231 ...
#>  $ 2008 : num [1:11] 4498944 NA 1609381 431753 630710 ...
#>  $ 2009 : num [1:11] 5555818 NA 1868258 547195 752699 ...
#>  $ 2010 : num [1:11] 5459240 NA 2258450 558927 743378 ...
#>  $ 2011 : num [1:11] 6541886 NA 2598133 572857 862401 ...
#>  $ 2012 : num [1:11] 8578610 NA 3487883 667357 949010 ...
#>  $ 2013 : num [1:11] 10844792 NA 4441726 725282 1203072 ...
#>  $ 2014 : num [1:11] 9050626 NA 3633454 744047 935411 ...
#>  $ 2015 : num [1:11] 9358661 NA 3557199 885259 777273 ...
#>  $ 2016 : num [1:11] 11497441 NA 4720075 856291 848617 ...
#>  $ 2017 : num [1:11] 8327792 NA 3691010 720917 653139 ...
#>  $ 2018 : num [1:11] 9052628 NA 3881158 664835 536126 ...
#>  $ 2019 : num [1:11] 7011945 NA 3046013 580601 480735 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   Years = col_character(),
#>   ..   `1980` = col_character(),
#>   ..   `1981` = col_character(),
#>   ..   `1982` = col_character(),
#>   ..   `1983` = col_character(),
#>   ..   `1984` = col_character(),
#>   ..   `1985` = col_character(),
#>   ..   `1986` = col_character(),
#>   ..   `1987` = col_double(),
#>   ..   `1988` = col_double(),
#>   ..   `1989` = col_double(),
#>   ..   `1990` = col_double(),
#>   ..   `1991` = col_double(),
#>   ..   `1992` = col_double(),
#>   ..   `1993` = col_double(),
#>   ..   `1994` = col_double(),
#>   ..   `1995` = col_double(),
#>   ..   `1996` = col_double(),
#>   ..   `1997` = col_double(),
#>   ..   `1998` = col_double(),
#>   ..   `1999` = col_double(),
#>   ..   `2000` = col_double(),
#>   ..   `2001` = col_double(),
#>   ..   `2002` = col_double(),
#>   ..   `2003` = col_double(),
#>   ..   `2004` = col_double(),
#>   ..   `2005` = col_double(),
#>   ..   `2006` = col_double(),
#>   ..   `2007` = col_double(),
#>   ..   `2008` = col_double(),
#>   ..   `2009` = col_double(),
#>   ..   `2010` = col_double(),
#>   ..   `2011` = col_double(),
#>   ..   `2012` = col_double(),
#>   ..   `2013` = col_double(),
#>   ..   `2014` = col_double(),
#>   ..   `2015` = col_double(),
#>   ..   `2016` = col_double(),
#>   ..   `2017` = col_double(),
#>   ..   `2018` = col_double(),
#>   ..   `2019` = col_double()
#>   .. )
#>  - attr(*, "problems")=<externalptr>
Data summary
Name guns_manufactured_2019
Number of rows 11
Number of columns 41
_______________________
Column type frequency:
character 8
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Years 2 0.82 8 29 0 9 0
1980 9 0.18 6 58 0 2 0
1981 9 0.18 6 35 0 2 0
1982 10 0.09 5 5 0 1 0
1983 9 0.18 3 51 0 2 0
1984 9 0.18 11 79 0 2 0
1985 9 0.18 4 7 0 2 0
1986 3 0.73 4 15 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1987 5 0.55 1186554.3 1218822 6980 756371.2 911255.0 996886 3559663 ▂▇▁▁▂
1988 5 0.55 1321292.3 1355892 35345 798075.5 1014540.5 1133783 3963877 ▅▇▁▁▂
1989 5 0.55 1472797.7 1531902 42126 705315.0 1170147.0 1406738 4418393 ▅▇▁▁▂
1990 5 0.55 1319989.3 1379959 57434 565108.2 1030306.0 1331486 3959968 ▅▇▁▁▂
1991 5 0.55 1187702.0 1249591 15980 549831.0 855954.0 1254560 3563106 ▅▇▁▁▂
1992 5 0.55 1391945.3 1473834 16849 602518.0 1010018.5 1506704 4175836 ▅▇▁▁▂
1993 5 0.55 1685212.3 1783997 81349 707954.0 1159317.0 1863445 5055637 ▇▇▃▁▃
1994 5 0.55 1724405.7 1821553 10936 753569.0 1285766.5 1832375 5173217 ▅▇▁▁▂
1995 5 0.55 1438780.7 1502410 8629 689159.2 1184464.5 1357161 4316342 ▅▇▁▁▂
1996 5 0.55 1284813.0 1346281 17920 605641.0 956630.0 1315118 3854439 ▅▇▁▁▂
1997 5 0.55 1197834.7 1258599 19680 506815.5 976027.5 1197525 3593504 ▅▇▁▁▂
1998 5 0.55 1237863.3 1321963 24506 460452.2 914502.0 1391859 3713590 ▇▇▃▁▃
1999 5 0.55 1349249.0 1432202 39837 500699.5 1051220.5 1454012 4047747 ▅▇▁▁▂
2000 5 0.55 1264513.7 1352038 30196 463830.5 930671.5 1428007 3793541 ▇▇▃▁▃
2001 5 0.55 977551.7 1046414 21309 396816.2 653324.5 1133369 2932655 ▇▇▃▁▃
2002 5 0.55 1122298.3 1207898 21700 445633.8 741419.5 1321843 3366895 ▇▇▃▁▃
2003 5 0.55 1102801.3 1181269 30978 413542.5 768869.0 1275658 3308404 ▇▇▃▁▃
2004 5 0.55 1033008.3 1105477 19508 402702.0 730140.0 1176796 3099025 ▇▇▃▁▃
2005 5 0.55 1080498.0 1164096 23179 382982.0 756369.0 1274385 3241494 ▇▇▃▁▃
2006 5 0.55 1217774.7 1295505 35872 467456.2 867939.0 1377694 3653324 ▇▂▂▁▂
2007 5 0.55 1307537.7 1398868 55461 454808.2 932447.5 1513108 3922613 ▇▂▂▁▂
2008 5 0.55 1499648.0 1608622 92564 481492.2 1120045.5 1703247 4498944 ▇▅▁▁▂
2009 5 0.55 1851939.3 1986052 138815 598571.0 1310478.5 2153703 5555818 ▇▅▁▁▂
2010 5 0.55 1819746.7 1962427 67929 605039.8 1286967.0 2151476 5459240 ▇▂▂▁▂
2011 5 0.55 2180628.7 2345097 190407 645243.0 1590244.5 2528122 6541886 ▇▅▁▁▂
2012 5 0.55 2859536.7 3103979 306154 737770.2 2058608.0 3407964 8578610 ▇▅▁▁▂
2013 5 0.55 3614930.7 3923969 495142 844729.5 2591321.0 4326187 10844792 ▇▅▁▁▂
2014 5 0.55 3016875.3 3270622 358165 791888.0 2157480.0 3569978 9050626 ▇▅▁▁▂
2015 5 0.55 3119553.7 3378333 447131 804269.5 2221229.0 3658149 9358661 ▇▅▁▁▂
2016 5 0.55 3832480.3 4158420 833123 850535.5 2547813.0 4599890 11497441 ▇▅▁▁▂
2017 5 0.55 2775930.7 2984401 653139 730346.2 1631363.0 3394280 8327792 ▇▅▁▁▂
2018 5 0.55 3017542.7 3243302 536126 771119.5 1985254.5 3631002 9052628 ▇▅▁▁▂
2019 5 0.55 2337315.0 2488560 480735 672183.0 1452298.0 2773926 7011945 ▇▅▁▁▂
Listing / Output 3.8: Show raw data about guns manufactured in ten US (1986 - 2019) from USAFact.org website

R Code 3.13 : Show raw data of exported guns from the ATF PDF report (1998-2019)

Code
guns_exported_2017 <-  base::readRDS("data/chap03/guns_exported_2017.rds")

utils::str(guns_exported_2017)
skimr::skim(guns_exported_2017)
#>  chr [1:32, 1:7] "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993" ...
Data summary
Name guns_exported_2017
Number of rows 32
Number of columns 7
_______________________
Column type frequency:
character 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
V1 0 1 4 4 0 32 0
V2 0 1 6 7 0 32 0
V3 0 1 6 7 0 32 0
V4 0 1 6 7 0 32 0
V5 0 1 6 7 0 32 0
V6 0 1 3 6 0 32 0
V7 0 1 7 7 0 32 0
Listing / Output 3.9: Show raw data of exported guns from the ATF PDF report (1986-2017)

R Code 3.14 : Show raw data of imported guns from the ATF PDF report (1986-2018)

Code
guns_imported_2018 <-  base::readRDS("data/chap03/guns_imported_2018.rds")

utils::str(guns_imported_2018)
skimr::skim(guns_imported_2018)
#>  chr [1:33, 1:5] "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993" ...
Data summary
Name guns_imported_2018
Number of rows 33
Number of columns 5
_______________________
Column type frequency:
character 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
V1 0 1 4 4 0 33 0
V2 0 1 7 7 0 33 0
V3 0 1 7 9 0 33 0
V4 0 1 7 9 0 33 0
V5 0 1 7 9 0 33 0
Listing / Output 3.10: Show raw data of imported guns from the ATF PDF report (1986-2018)

3.3.4 Recode data

Example 3.3 : Recode data for chapter 3

R Code 3.15 : Recode FBI deaths data 2012-2016

Code
fbi_deaths <- base::readRDS("data/chap03/fbi_deaths.rds")

fbi_deaths_clean <- fbi_deaths |> 
    dplyr::slice(3:7) |> 
    tidyr::pivot_longer(
        cols = -Weapons,
        values_to = "deaths",
        names_to = "year"
        ) |> 
    dplyr::rename(weapons = Weapons)

base::saveRDS(fbi_deaths_clean,
              "data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean
#> # A tibble: 25 × 3
#>    weapons  year  deaths
#>    <chr>    <chr>  <dbl>
#>  1 Handguns 2012    6404
#>  2 Handguns 2013    5782
#>  3 Handguns 2014    5673
#>  4 Handguns 2015    6569
#>  5 Handguns 2016    7105
#>  6 Rifles   2012     298
#>  7 Rifles   2013     285
#>  8 Rifles   2014     258
#>  9 Rifles   2015     258
#> 10 Rifles   2016     374
#> # ℹ 15 more rows

R Code 3.16 : Recode gun use variable AUQ300 from NHANES data 2011-2012

Code
## load data
nhanes_2012 <- base::readRDS("data/chap03/nhanes_2012.rds")

## recode data
nhanes_2012_clean1 <- nhanes_2012 |> 
    dplyr::mutate(AUQ300 = 
          dplyr::na_if(x = AUQ300, y = 7)) |> 
    dplyr::mutate(AUQ300 = 
          dplyr::na_if(x = AUQ300, y = 9)) |> 
    ## see my note in the text under the code
    # tidyr::drop_na() |> 
    dplyr::mutate(AUQ300 = forcats::as_factor(AUQ300)) |> 
    dplyr::mutate(AUQ300 = 
          forcats::fct_recode(AUQ300, 
                  "Yes" = "1", 
                  "No" = "2")
    ) |> 
    dplyr::rename(gun_use = AUQ300) |> 
    dplyr::relocate(gun_use)


gun_use_2012 <- nhanes_2012_clean1 |> 
    dplyr::count(gun_use)|> 
    dplyr::mutate(percent = round(n / sum(n), 2) * 100) 

glue::glue("Result calculated manually with `dplyr::count()` and `dplyr::mutate()`")
#> Result calculated manually with `dplyr::count()` and `dplyr::mutate()`
Code
gun_use_2012
#>   gun_use    n percent
#> 1     Yes 1613      17
#> 2      No 3061      33
#> 3    <NA> 4690      50
Code
glue::glue(" ")
Code
glue::glue("*******************************************************************")
#> *******************************************************************
Code
glue::glue("Result calculated with `janitor::tabyl()` and `janitor::adorn_pct_formating()`")
#> Result calculated with `janitor::tabyl()` and `janitor::adorn_pct_formating()`
Code
nhanes_2012_clean1 |> 
    janitor::tabyl(gun_use) |> 
    janitor::adorn_pct_formatting()
#>  gun_use    n percent valid_percent
#>      Yes 1613   17.2%         34.5%
#>       No 3061   32.7%         65.5%
#>     <NA> 4690   50.1%             -

For recoding the levels of the categorical variable I have looked up the appropriate passage in the codebook (see: Graph 3.1).

With the last line dplyr::relocate(gun_use) I put the column gun_use to the front of the data frame. If neither the .before nor the .after argument of the function are supplied the column will move columns to the left-hand side of the data frame. So it is easy to find for visual inspections via the RStudio data frame viewer.

When to remove the NA’s?

It would be not correct to remove the NA’s here in the recoding code chunk, because this would remove the rows with missing values from the gun_use variable across the whole data frame! This implies that values of other variable that are not missing would be removed too. It is correct to remove the NA’s when the output for the analysis (graph or table) is prepared via the pipe operator without changing the stored data.

'Ever used firearms for any reason?' this question has several options: - 1 = Yes - 2 = No - 7 = Refused  - 9 = Don’t know - . = Missing
Graph 3.1: Ever used firearms for any reason? Codebook 2011-2012 AUDIOMETRY

R Code 3.17 : Recode rounds fired variable AUQ310 from NHANES data 2011-2012

Code
nhanes_2012_clean2 <- nhanes_2012_clean1 |> 
    dplyr::mutate(AUQ310 = forcats::as_factor(AUQ310)) |> 
    dplyr::mutate(AUQ310 = forcats::fct_recode(AUQ310,
        "1 to less than 100" = "1",
        "100 to less than 1000" = "2",
        "1000 to less than 10k" = "3",
        "10k to less than 50k" = "4",
        "50k or more" = "5",
        "Refused to answer" = "7",
        "Don't know" = "9")
    ) |> 
    dplyr::rename(rounds_fired = AUQ310) |> 
    dplyr::relocate(rounds_fired, .after = gun_use)
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `AUQ310 = forcats::fct_recode(...)`.
#> Caused by warning:
#> ! Unknown levels in `f`: 7
Code
fired_2012 <- nhanes_2012_clean2 |>
    dplyr::select(rounds_fired) |> 
    tidyr::drop_na() |> 
    dplyr::count(rounds_fired) |> 
    dplyr::mutate(prop = round(n / sum(n), 2) * 100) |> 
    dplyr::relocate(n, .after = dplyr::last_col())
fired_2012
#>            rounds_fired prop   n
#> 1    1 to less than 100   43 701
#> 2 100 to less than 1000   26 423
#> 3 1000 to less than 10k   18 291
#> 4  10k to less than 50k    7 106
#> 5           50k or more    4  66
#> 6            Don't know    2  26

I got a warning about a unknown level 7 because no respondent refused an answer. But this is not important. I could either choose to not recode level 7 or turn warning off in the chunk option — or simply to ignore the warning.

R Code 3.18 : Recode sex variable RIAGENDR from NHANES data 2011-2012

Code
nhanes_2012_clean3 <- nhanes_2012_clean2 |> 
    dplyr::mutate(RIAGENDR = forcats::as_factor(RIAGENDR)) |> 
    dplyr::mutate(RIAGENDR = forcats::fct_recode(RIAGENDR,
                 "Male" = '1',
                 "Female" = '2')
    ) |> 
    dplyr::rename(sex = RIAGENDR) |> 
    dplyr::relocate(sex, .after = rounds_fired)

nhanes_2012_clean3 |> 
    janitor::tabyl(sex) |> 
    janitor::adorn_pct_formatting()
#>     sex    n percent
#>    Male 4663   49.8%
#>  Female 4701   50.2%

R Code 3.19 : Recode ear plugs variable AUQ320 from NHANES data 2011-2012

Code
nhanes_2012_clean4 <- nhanes_2012_clean3 |> 
    dplyr::mutate(AUQ320 = forcats::as_factor(AUQ320)) |> 
    dplyr::mutate(AUQ320 = forcats::fct_recode(AUQ320,
        "Always" = "1",
        "Usually" = "2",
        "About half the time" = "3",
        "Seldom" = "4",
        "Never" = "5",
        # "Refused to answer" = "7", # nobody refused
        "Don't know" = "9")
    ) |> 
    dplyr::rename(ear_plugs = AUQ320) |> 
    dplyr::relocate(ear_plugs, .after = gun_use)

base::saveRDS(nhanes_2012_clean4, 
              "data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean4 |> 
    janitor::tabyl(ear_plugs) |> 
    janitor::adorn_pct_formatting()
#>            ear_plugs    n percent valid_percent
#>               Always  583    6.2%         36.1%
#>              Usually  152    1.6%          9.4%
#>  About half the time  123    1.3%          7.6%
#>               Seldom  110    1.2%          6.8%
#>                Never  642    6.9%         39.8%
#>           Don't know    3    0.0%          0.2%
#>                 <NA> 7751   82.8%             -

R Code 3.20 : Recode guns exported from a PDF report by the ATF

Code
guns_exported_2017 <- base::readRDS("data/chap03/guns_exported_2017.rds")

lookup_export <- c(Year = "V1",
                   Pistols = "V2",
                   Revolvers = "V3",
                   Rifles = "V4",
                   Shotguns = "V5",
                   Misc = "V6",
                   Total = "V7")

guns_exported_clean <- dplyr::rename(
    tibble::as_tibble(guns_exported_2017), 
    dplyr::all_of(lookup_export)
    ) |> 
    
    ## comma separated character columns to numeric
    dplyr::mutate(dplyr::across(2:7, 
        function(x)
            { base::as.numeric(as.character(base::gsub(",", "", x))) })
        ) |> 

    ## add Pistols and Revolvers to Handguns
    dplyr::mutate(Handguns = Pistols + Revolvers) |> 
    
    ## specify the same order for all three data frames
    dplyr::select(-c(Pistols, Revolvers)) |> 
    dplyr::relocate(c(Year, Handguns, Rifles, 
                      Shotguns, Misc, Total))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
Code
save_data_file("chap03", guns_exported_clean, "guns_exported_clean.rds")

utils::str(guns_exported_clean)
skimr::skim(guns_exported_clean)
#> tibble [32 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ Year    : chr [1:32] "1986" "1987" "1988" "1989" ...
#>  $ Handguns: num [1:32] 121082 159552 131859 118464 180218 ...
#>  $ Rifles  : num [1:32] 37224 42161 53896 73247 71834 ...
#>  $ Shotguns: num [1:32] 58943 76337 68699 67559 104250 ...
#>  $ Misc    : num [1:32] 199 9995 2728 2012 5323 ...
#>  $ Total   : num [1:32] 217448 288045 257182 261282 361625 ...
Data summary
Name guns_exported_clean
Number of rows 32
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Year 0 1 4 4 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Handguns 0 1 130703.47 60973.09 39081 81605.00 138057.5 172895.25 297100 ▇▃▇▁▁
Rifles 0 1 86827.50 37709.88 37224 62492.25 77941.0 92641.50 207934 ▆▇▁▂▁
Shotguns 0 1 64981.97 37306.30 18797 36113.00 56324.5 87122.00 171475 ▇▅▃▁▁
Misc 0 1 8057.84 7633.02 199 2505.50 5689.0 10987.25 34022 ▇▃▂▁▁
Total 0 1 290570.78 99914.95 139920 204422.50 281317.5 380893.75 488300 ▇▆▅▆▃
Listing / Output 3.11: Recoded: Guns exported from a PDF report by the ATF

I have recoded the data in several ways:

- I turned the resulted matrices from the {**tabulizer**} package into tibbles.
- Now I could rename all the columns with one named vector `lookup_export`.
- All the columns are of character type. Before I could change the appropriate column to numeric I had to remove the comma for big numbers, otherwise the base::as-numeric() function would not have worked.
- I added `Pistols` and `Revolvers` to `Handguns`, because the dataset about imported guns have only this category.
- Finnaly I relocated the columns to get the same structure in all data frames.

R Code 3.21 : Recode guns imported from a PDF by the ATF

Code
guns_imported_2018 <-  base::readRDS("data/chap03/guns_imported_2018.rds")

lookup_import <- c(Year = "V1",
            Shotguns = "V2",
            Rifles = "V3",
            Handguns = "V4",
            Total = "V5")

## reduce the reported period for all file from 1980 to 2017
guns_imported_clean <- dplyr::rename(
    tibble::as_tibble(guns_imported_2018), 
    dplyr::all_of(lookup_import)
    ) |> 
    
    ## comma separated character columns to numeric
    dplyr::mutate(dplyr::across(2:5, 
        function(x)
            { base::as.numeric(as.character(base::gsub(",", "", x))) })
        ) |> 

    dplyr::slice(1:dplyr::n() - 1) |> 
    dplyr::mutate(Misc = 0) |> 
    dplyr::relocate(c(Year, Handguns, Rifles, 
                      Shotguns, Misc, Total))


save_data_file("chap03", guns_imported_clean, "guns_imported_clean.rds")

utils::str(guns_imported_clean)
skimr::skim(guns_imported_clean)
#> tibble [32 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ Year    : chr [1:32] "1986" "1987" "1988" "1989" ...
#>  $ Handguns: num [1:32] 231000 342113 621620 440132 448517 ...
#>  $ Rifles  : num [1:32] 269000 413780 282640 293152 203505 ...
#>  $ Shotguns: num [1:32] 201000 307620 372008 274497 191787 ...
#>  $ Misc    : num [1:32] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Total   : num [1:32] 701000 1063513 1276268 1007781 843809 ...
Data summary
Name guns_imported_clean
Number of rows 32
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Year 0 1 4 4 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Handguns 0 1 1228289.6 959074.1 231000 486461.0 858514 1739603.2 3671837 ▇▂▂▁▁
Rifles 0 1 613776.0 390186.3 198191 290524.0 556201 797873.2 1592522 ▇▅▃▁▂
Shotguns 0 1 433837.8 233098.7 106296 239432.2 417866 570714.2 973465 ▇▆▆▃▂
Misc 0 1 0.0 0.0 0 0.0 0 0.0 0 ▁▁▇▁▁
Total 0 1 2275903.5 1419964.9 701000 1049580.0 1895880 3095591.8 5539539 ▇▂▂▂▂
Listing / Output 3.12: Recoded: Guns imported from a PDF by the ATF

Here applies similar note as in the previous tab:

- I turned the resulted matrices from the {**tabulizer**} package into tibbles.
- Now I could rename all the columns with one named vector `lookup_import`.
- All the columns are of character type. Before I could change the appropriate column to numeric I had to remove the comma for big numbers, otherwise the base::as-numeric() function would not have worked.
- I relocated the columns to get the same structure in all data frames.

R Code 3.22 : Recode firearms production dataset 1986-2019 from USAFact.org

Code
guns_manufactured_2019 <- base::readRDS("data/chap03/guns_manufactured_2019.rds")

guns_manufactured_2019_clean <- guns_manufactured_2019 |> 
    dplyr::select(-c(2:7)) |> 
    dplyr::slice(c(1,3:7)) |> 
    ## strange: `1986` is a character variable
    dplyr::mutate(`1986` = as.numeric(`1986`))

base::saveRDS(guns_manufactured_2019_clean, 
              "data/chap03/guns_manufactured_2019_clean.rds")

utils::str(guns_manufactured_2019_clean)
skimr::skim(guns_manufactured_2019_clean)
#> tibble [6 × 35] (S3: tbl_df/tbl/data.frame)
#>  $ Years: chr [1:6] "Firearms manufactured (Items)" "Pistols (Items)" "Revolvers (Items)" "Shotguns (Items)" ...
#>  $ 1986 : num [1:6] 3040934 662973 761414 641482 4558 ...
#>  $ 1987 : num [1:6] 3559663 964561 722512 857949 6980 ...
#>  $ 1988 : num [1:6] 3963877 1101011 754744 928070 35345 ...
#>  $ 1989 : num [1:6] 4418393 1404753 628573 935541 42126 ...
#>  $ 1990 : num [1:6] 3959968 1371427 470495 848948 57434 ...
#>  $ 1991 : num [1:6] 3563106 1378252 456966 828426 15980 ...
#>  $ 1992 : num [1:6] 4175836 1669537 469413 1018204 16849 ...
#>  $ 1993 : num [1:6] 5055637 2093362 562292 1144940 81349 ...
#>  $ 1994 : num [1:6] 5173217 2004298 586450 1254926 10936 ...
#>  $ 1995 : num [1:6] 4316342 1195284 527664 1173645 8629 ...
#>  $ 1996 : num [1:6] 3854439 987528 498944 925732 17920 ...
#>  $ 1997 : num [1:6] 3593504 1036077 370428 915978 19680 ...
#>  $ 1998 : num [1:6] 3713590 960365 324390 868639 24506 ...
#>  $ 1999 : num [1:6] 4047747 995446 335784 1106995 39837 ...
#>  $ 2000 : num [1:6] 3793541 962901 318960 898442 30196 ...
#>  $ 2001 : num [1:6] 2932655 626836 320143 679813 21309 ...
#>  $ 2002 : num [1:6] 3366895 741514 347070 741325 21700 ...
#>  $ 2003 : num [1:6] 3308404 811660 309364 726078 30978 ...
#>  $ 2004 : num [1:6] 3099025 728511 294099 731769 19508 ...
#>  $ 2005 : num [1:6] 3241494 803425 274205 709313 23179 ...
#>  $ 2006 : num [1:6] 3653324 1021260 385069 714618 35872 ...
#>  $ 2007 : num [1:6] 3922613 1219664 391334 645231 55461 ...
#>  $ 2008 : num [1:6] 4498944 1609381 431753 630710 92564 ...
#>  $ 2009 : num [1:6] 5555818 1868258 547195 752699 138815 ...
#>  $ 2010 : num [1:6] 5459240 2258450 558927 743378 67929 ...
#>  $ 2011 : num [1:6] 6541886 2598133 572857 862401 190407 ...
#>  $ 2012 : num [1:6] 8578610 3487883 667357 949010 306154 ...
#>  $ 2013 : num [1:6] 10844792 4441726 725282 1203072 495142 ...
#>  $ 2014 : num [1:6] 9050626 3633454 744047 935411 358165 ...
#>  $ 2015 : num [1:6] 9358661 3557199 885259 777273 447131 ...
#>  $ 2016 : num [1:6] 11497441 4720075 856291 848617 833123 ...
#>  $ 2017 : num [1:6] 8327792 3691010 720917 653139 758634 ...
#>  $ 2018 : num [1:6] 9052628 3881158 664835 536126 1089973 ...
#>  $ 2019 : num [1:6] 7011945 3046013 580601 480735 946929 ...
Data summary
Name guns_manufactured_2019_cl…
Number of rows 6
Number of columns 35
_______________________
Column type frequency:
character 1
numeric 34
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Years 0 1 14 29 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1986 0 1 1013644.7 1044520 4558 646854.8 712193.5 918233.8 3040934 ▂▇▁▁▂
1987 0 1 1186554.3 1218822 6980 756371.2 911255.0 996886.0 3559663 ▂▇▁▁▂
1988 0 1 1321292.3 1355892 35345 798075.5 1014540.5 1133783.0 3963877 ▅▇▁▁▂
1989 0 1 1472797.7 1531902 42126 705315.0 1170147.0 1406738.2 4418393 ▅▇▁▁▂
1990 0 1 1319989.3 1379959 57434 565108.2 1030306.0 1331486.2 3959968 ▅▇▁▁▂
1991 0 1 1187702.0 1249591 15980 549831.0 855954.0 1254559.5 3563106 ▅▇▁▁▂
1992 0 1 1391945.3 1473834 16849 602518.0 1010018.5 1506703.8 4175836 ▅▇▁▁▂
1993 0 1 1685212.3 1783997 81349 707954.0 1159317.0 1863445.0 5055637 ▇▇▃▁▃
1994 0 1 1724405.7 1821553 10936 753569.0 1285766.5 1832375.2 5173217 ▅▇▁▁▂
1995 0 1 1438780.7 1502410 8629 689159.2 1184464.5 1357161.0 4316342 ▅▇▁▁▂
1996 0 1 1284813.0 1346281 17920 605641.0 956630.0 1315118.2 3854439 ▅▇▁▁▂
1997 0 1 1197834.7 1258599 19680 506815.5 976027.5 1197525.0 3593504 ▅▇▁▁▂
1998 0 1 1237863.3 1321963 24506 460452.2 914502.0 1391858.8 3713590 ▇▇▃▁▃
1999 0 1 1349249.0 1432202 39837 500699.5 1051220.5 1454012.5 4047747 ▅▇▁▁▂
2000 0 1 1264513.7 1352038 30196 463830.5 930671.5 1428006.8 3793541 ▇▇▃▁▃
2001 0 1 977551.7 1046414 21309 396816.2 653324.5 1133368.8 2932655 ▇▇▃▁▃
2002 0 1 1122298.3 1207898 21700 445633.8 741419.5 1321843.0 3366895 ▇▇▃▁▃
2003 0 1 1102801.3 1181269 30978 413542.5 768869.0 1275658.0 3308404 ▇▇▃▁▃
2004 0 1 1033008.3 1105477 19508 402702.0 730140.0 1176795.8 3099025 ▇▇▃▁▃
2005 0 1 1080498.0 1164096 23179 382982.0 756369.0 1274385.2 3241494 ▇▇▃▁▃
2006 0 1 1217774.7 1295505 35872 467456.2 867939.0 1377693.8 3653324 ▇▂▂▁▂
2007 0 1 1307537.7 1398868 55461 454808.2 932447.5 1513108.2 3922613 ▇▂▂▁▂
2008 0 1 1499648.0 1608622 92564 481492.2 1120045.5 1703247.2 4498944 ▇▅▁▁▂
2009 0 1 1851939.3 1986052 138815 598571.0 1310478.5 2153702.8 5555818 ▇▅▁▁▂
2010 0 1 1819746.7 1962427 67929 605039.8 1286967.0 2151476.5 5459240 ▇▂▂▁▂
2011 0 1 2180628.7 2345097 190407 645243.0 1590244.5 2528121.8 6541886 ▇▅▁▁▂
2012 0 1 2859536.7 3103979 306154 737770.2 2058608.0 3407963.8 8578610 ▇▅▁▁▂
2013 0 1 3614930.7 3923969 495142 844729.5 2591321.0 4326187.0 10844792 ▇▅▁▁▂
2014 0 1 3016875.3 3270622 358165 791888.0 2157480.0 3569977.8 9050626 ▇▅▁▁▂
2015 0 1 3119553.7 3378333 447131 804269.5 2221229.0 3658149.0 9358661 ▇▅▁▁▂
2016 0 1 3832480.3 4158420 833123 850535.5 2547813.0 4599890.0 11497441 ▇▅▁▁▂
2017 0 1 2775930.7 2984401 653139 730346.2 1631363.0 3394280.5 8327792 ▇▅▁▁▂
2018 0 1 3017542.7 3243302 536126 771119.5 1985254.5 3631002.5 9052628 ▇▅▁▁▂
2019 0 1 2337315.0 2488560 480735 672183.0 1452298.0 2773926.5 7011945 ▇▅▁▁▂

R Code 3.23 : Recode the guns manufactured data frame to get an appropriate structure to add the imported and subtract the exported data.

Code
guns_manufactured_clean <- base::readRDS("data/chap03/guns_manufactured_2019_clean.rds")

lookup_manufactured <- c(
    Total = "Firearms manufactured (Items)",
    Pistols = "Pistols (Items)",
    Rifles = "Rifles (Items)",
    Shotguns = "Shotguns (Items)",
    Revolvers = "Revolvers (Items)",
    Misc = "Misc. Firearms (Items)"
    )

guns_manufactured_clean  <-  guns_manufactured_clean |>
    dplyr::select(-c(`2018`, `2019`)) |>
    tidyr::pivot_longer(
        cols = !Years,
        names_to = "Year",
        values_to = "count"
    ) |>
    tidyr::pivot_wider(
        names_from = Years,
        values_from = count
    ) |> 
    dplyr::rename(
        dplyr::all_of(lookup_manufactured)
    ) |> 
    dplyr::mutate(Handguns = Pistols + Revolvers) |> 
    dplyr::select(-c(Pistols, Revolvers)) |> 
    dplyr::relocate(c(Year, Handguns, Rifles, 
                      Shotguns, Misc, Total))

save_data_file("chap03", guns_manufactured_clean, "guns_manufactured_clean.rds")

utils::str(guns_manufactured_clean)
skimr::skim(guns_manufactured_clean)
#> tibble [32 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ Year    : chr [1:32] "1986" "1987" "1988" "1989" ...
#>  $ Handguns: num [1:32] 1424387 1687073 1855755 2033326 1841922 ...
#>  $ Rifles  : num [1:32] 970507 1007661 1144707 1407400 1211664 ...
#>  $ Shotguns: num [1:32] 641482 857949 928070 935541 848948 ...
#>  $ Misc    : num [1:32] 4558 6980 35345 42126 57434 ...
#>  $ Total   : num [1:32] 3040934 3559663 3963877 4418393 3959968 ...
Data summary
Name guns_manufactured_clean
Number of rows 32
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Year 0 1 4 4 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Handguns 0 1 2294575.4 1305428 946979 1387554.2 1838570.0 2696084.8 5576366 ▇▂▁▂▁
Rifles 0 1 1815044.8 910422 883482 1276250.8 1463938.5 1935129.8 4239335 ▇▂▁▁▁
Shotguns 0 1 864117.9 172169 630710 730346.2 853448.5 935443.5 1254926 ▇▅▅▁▂
Misc 0 1 134637.4 216279 4558 19637.0 35608.5 104126.8 833123 ▇▁▁▁▁
Total 0 1 5108375.4 2388668 2932655 3585904.5 4005812.0 5483384.5 11497441 ▇▂▁▂▁
Listing / Output 3.13: Recoded: Prepare manufactured data to get an appropriate structure to add the imported and subtract the exported data.

R Code 3.24 : Manufactured guns - exported guns + imported guns

Code
guns_manufactured_clean <-  base::readRDS("data/chap03/guns_manufactured_clean.rds")
guns_exported_clean <- base::readRDS("data/chap03/guns_exported_clean.rds")
guns_imported_clean <- base::readRDS("data/chap03/guns_imported_clean.rds")

guns_total_temp_1 <- guns_manufactured_clean[, 2:6] - guns_exported_clean[, 2:6]
guns_total_temp_2 <- guns_total_temp_1[, 1:5] + guns_imported_clean[, 2:6]

guns_total <- dplyr::bind_cols(guns_manufactured_clean[, 1], guns_total_temp_2)

base::saveRDS(guns_total, "data/chap03/guns_total.rds")

guns_total
#> # A tibble: 32 × 6
#>    Year  Handguns  Rifles Shotguns  Misc   Total
#>    <chr>    <dbl>   <dbl>    <dbl> <dbl>   <dbl>
#>  1 1986   1534305 1202283   783539  4359 3524486
#>  2 1987   1869634 1379280  1089232 -3015 4335131
#>  3 1988   2345516 1373451  1231379 32617 4982963
#>  4 1989   2354994 1627305  1142479 40114 5164892
#>  5 1990   2110221 1343335   936485 52111 4442152
#>  6 1991   1939116 1103700   826766 13016 3882598
#>  7 1992   2930536 2335007  1341010 12202 6618755
#>  8 1993   3709645 2671944  1219579 66586 7667754
#>  9 1994   3333022 2082640  1226268  7716 6649646
#> 10 1995   2199438 1581471  1208470  6146 4995525
#> # ℹ 22 more rows
Listing / Output 3.14: Manufactured guns - exported guns + imported guns

As a first approximation to get guns data circulated in the US I have taken the manufactured numbers, subtracted the exported guns and added the imported guns.

I have already listed some reasons why the above result is still not the real amount of circulated guns per year in the US (see: Remark 3.1)

3.3.5 Classification of Graphs

3.3.5.1 Introduction

There are different possible classification of graph types. In the book Harris uses as major criteria the types and numbers of variables. This is very sensitive subject orientated arrangement addressed at the statistics novice with the main question: What type of graph should I use for my data?

The disadvantage of the subject oriented selection criteria is that there some graph types (e.g. the bar chart) that appear several times under different headings. Explaining the graph types is therefore somewhat redundant on the one hand and piecemeal on the other hand.

Another classification criteria would be the type of the graph itself. Under this pattern one could concentrate of the special features for each graph type. One of these features would be their applicability referring the variable types.

3.3.5.2 Lists of different categorization approaches

Example 3.4 : Five different categorization approaches

Bullet List

  1. One variable
  2. Two variables (achievement 3)
Bullet List 3.1: Book variable types and their corresponding graph types

You can see the redundancy when you categorize the graph types by variable types. But the advantage is, that your choice of graph is driven by essential statistical aspects.

This chapter will follow the book and therefore it will present the same order in the explication of the graphs as in the book outlined.

Bullet List

  • Bar chart
  • Box plot
  • Density plot
  • Histogram
  • Line plot
  • Mosaic plot
  • Pie chart
  • Point chart
  • Scatterplot
  • Violin plot
  • Waffle chart
Bullet List 3.2: Book graph types sorted alphabetically

This is a very abridged list of graphs used for presentation of statistical results. Although it is just a tiny selection of possible graphs it contains those graphs that are most important, most used and most well know types.

Bullet List

  • Correlations
    • Bubble
    • Contour plot
    • Correlogram
    • Heat map
    • Scatterplot
  • Distributions
    • Beeswarm
    • Box plot
    • Density plot
    • Dot plot
    • Dumbbell
    • Histogram
    • Ridgeline
    • Violin plot
  • Evolution
    • Area chart
    • Calendar
    • Candle stick
    • Line chart
    • Slope
  • Flow
    • Alluvial
    • Chord
    • Sankey
    • Waterfall
  • Miscellaneous
    • Art
    • Biology
    • Calendar
    • Computer & Games
    • Fun
    • Image Processing
    • Sports
  • Part of whole
    • Bar chart
    • Dendogram
    • Donut chart
    • Mosaic chart
    • Parliament
    • Pie chart
    • Tree map
    • Venn diagram
    • Voronoi
    • Waffle chart
  • Ranking
    • Bar chart
    • Bump chart
    • Lollipop
    • Parallel Coordinates
    • Radar chart
    • Word cloud
  • Spatial
    • Base map
    • Cartogram
    • Choropleth
    • Interactive
    • Proportion symbol
Bullet List 3.3: Categorization of graph types used by R Chart

Although this list has only 8 categories it is in contrast to Bullet List 3.2 a more complete list of different graphs. It features also not so well known graph types. Besides a miscellaneous category where the members of this group do not share a common feature the graph are sorted in categorization schema that has — with the exception of bar charts — no redundancy, e.g. is almost a taxonomy of graph types.

Bullet List

  • NUMERIC
    • One numeric variable
      • Histogram
      • Density plot
    • Two numeric variables
      • Not ordered
        • Few points
          • Box plot
          • Histogram
          • Scatter plot
        • Many points
          • Violin plot
          • Density plot
          • Scatter with marginal point
          • 2D density plot
      • Ordered
        • Connected scatter plot
        • Area plot
        • Line plot
    • Three numeric variables
      • Not ordered
        • Box plot
        • Violin plot
        • Bubble plot
        • 3D scatter or surface
      • Ordered
        • Stacked area plot
        • Stream graph
        • Line plot
        • Area (SM)
    • Several numeric variables
      • Not ordered
        • Box plot
        • Violin plot
        • Ridge line
        • PCA
        • Correlogram
        • Heatmap
        • Dendogram
      • Ordered
        • Stacked area plot
        • Stream graph
        • Line plot
        • Area (SM)
  • CATEGORICAL
    • One categorical variable
      • Bar plot
      • Lollipop
      • Waffle chart
      • Word cloud
      • Doughnut
      • Pie chart
      • Tree map
      • Circular packing
    • Two or more categorical variables
      • Two independent lists
        • Venn diagram
      • Nested or hierarchical data set
        • Tree map
        • Circular packing
        • Sunburst
        • Bar plot
        • Dendogram
      • Subgroups
        • Grouped scatter
        • Heat map
        • Lollipop
        • Grouped bar plot
        • Stacked bar plot
        • Parallel plot
        • Spider plot
        • Sankey diagram
      • Adjacency
        • Network
        • Chord
        • Arc
        • Sankey diagram
        • Heat map
  • NUMERIC & CATEGORICAL
    • One numeric & one categorical
      • One observation per group
        • Box plot
        • Lollipop
        • Doughnut
        • Pie chart
        • Word cloud
        • Tree map
        • Circular packing
        • Waffle chart
      • Several observations per group
        • Box plot
        • Violin plot
        • Ridge line
        • Density plot
        • Histogram
    • One categorical & several numeric
      • No order
        • Group scatter
        • 2D density
        • Box plot
        • Violin plot
        • PCA
        • Correlogram
      • One numeric is ordered
        • Stacked area
        • Area
        • Stream graph
        • Line plot
        • Connected scatter
      • One value per group
        • Grouped scatter
        • Heat map
        • Lollipop
        • Grouped bar plot
        • Stacked bar plot
        • Parallel plot
        • Spider plot
        • Sankey diagram
    • Several categorical & one numeric
      • Subgroup
        • One observation per group
          • One value per group
          • Grouped scatter
          • Heat map
          • Lollipop
          • Grouped bar plot
          • Stacked bar plot
          • Parallel plot
          • Spider plot
          • Sankey diagram
        • Several observations per group
          • Box plot
          • Violin plot
      • Nested / Hierarchical ordered
        • One observation per group
          • Bar plot
          • Dendogram
          • Sun burst
          • Tree map
          • Circular packing
        • Several observations per group
          • Box plot
          • Violin plot
      • Adjacency
        • Network diagram
        • Chord diagram
        • Arc diagram
        • Sankey diagram
        • Heat map
  • MAPS
    • Map
    • Connection map
    • Choropleth
    • Map hexabin
    • Bubble map
  • NETWORK
    • Simple network
      • Network
      • Chord diagram
      • Arc diagram
      • Sankey diagram
      • Heat map
      • Hive
    • Nested or hierarchical network
      • No value
        • Dendogram
        • Tree map
        • Circular packing
        • Sunburst
        • Sankey diagram
      • Value for leaf
        • Dendogram
        • Tree map
        • Circular packing
        • Sunburst
        • Sankey diagram
      • Value for edges
        • Dendogram
        • Sankey diagram
        • Chord diagram
      • Value for connection Hierarchical edge bundling
  • TIME SERIES
    • One series
      • Box plot
      • Violin plot
      • Ridge line
      • Area
      • Line plot
      • Bar plot
      • Lollipop
    • Several series
      • Box plot
      • Violin plot
      • Ridge line
      • Heat map
      • Line plot
      • Stacked area
      • Stream graph
Bullet List 3.4: Categorization of graph types used by From Data to Viz

This is the same variable oriented approach as in the book but with much more details and differentiation. It is cluttered with redundancies but should be helpful for selecting an appropriate graph type for your data analysis. And the interactive style on the web allows for a much better orientation as implemented in the above list.

Bullet List

  • Correlation
    • Scatterplot
    • Scatterplot With Encircling
    • Jitter Plot
    • Counts Chart
    • Bubble Plot
    • Animated Bubble Plot
    • Marginal Histogram / Boxplot
    • Correlogram
  • Deviation
    • Diverging Bars
    • Diverging Lollipop Chart
    • Diverging Dot Plot
    • Area Chart
  • Ranking
    • Ordered Bar Chart
    • Lollipop Chart
    • Dot Plot
    • Slope Chart
    • Dumbbell Plot
  • Distribution
    • Histogram
    • Density Plot
    • Box Plot
    • Dot + Box Plot
    • Tufte Boxplot
    • Violin Plot
    • Population Pyramid
  • Composition
    • Waffle Chart
    • Pie Chart
    • Treemap
    • Bar Chart
  • Change
    • Time Series Plots
      • From a Data Frame
      • Format to Monthly X Axis
      • Format to Yearly X Axis
      • From Long Data Format
      • From Wide Data Format
    • Stacked Area Chart
    • Calendar Heat Map
    • Slope Chart
    • Seasonal Plot
  • Groups
    • Dendrogram
    • Clusters
  • Spatial
    • Open Street Map
    • Google Road Map
    • Google Hybrid Map
Bullet List 3.5: Categorization of graph types used by Top 50 ggplot2 Visualizations

3.4 Achievement 1: Graphs for a single categorical variable

3.4.1 Introduction

There are several options for visualizing a single categorical variable:


Bullet List 3.7: Graph options for a single categorical variable

3.4.2 Bar Chart

Example 3.5 : Creating Bar Charts for Firearm Usage

R Code 3.25 : Bar charts for gun use (NHANES 2011-2012) with different width of bars

Code
## bar chart: bars with wide width
p_normal <- nhanes_2012_clean1 |> 
    dplyr::select(gun_use) |> 
    tidyr::drop_na() |>     
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use)
    ) +
    ggplot2::geom_bar() + 
    ggplot2::labs(x = "Gun use", 
                  y = "Number of participants") +
    ggplot2::theme_minimal()

## bar chart: bars with small width
p_small <- nhanes_2012_clean1 |>
    dplyr::select(gun_use) |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use)
    ) +
    ggplot2::geom_bar(width = 0.4) + 
    ggplot2::theme_minimal() +
    ggplot2::theme(aspect.ratio = 4/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of participants")

## display both charts side by side
gridExtra::grid.arrange(p_normal, p_small, ncol = 2)
Graph 3.2: Ever used firearms for any reason? (NHANES survey 2011-2012)

  1. Left: Only two bars look horrifying. In this example they are even a lot smaller as normal, because of the second graph to the right.
  2. Right: It is not enough to create smaller bars with the width argument inside the ggplot2::geom_bar() function because that would create unproportional wide space between the two bars. One need to apply the aspect ratio for the used theme as well. In this case all commands to the theme (e.g. my ggplot2::the_bw()) has to come before the aspect.ratio argument. One has to try out which aspect ratio works best.
  3. I used here — as recommended in the book — the {gridExtra} package to display the figures side by side (see Section A.31). But there are other options as well. In the next tab I will use the {patchwork} package, that is especially for {ggplot2} developed (see Section A.62). A third option would be to use one of Quarto formatting commands: See

R Code 3.26 : Bar charts for gun use (NHANES 2011-2012) with different colorizing methods

Code
## bar chart: filled colors within aes() (data controlled)
p_fill_in <- nhanes_2012_clean1 |>
    dplyr::select(gun_use) |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
    ggplot2::geom_bar(
        ggplot2::aes(fill = gun_use), width = 0.4) + 
    ggplot2::theme_bw() +
    ggplot2::theme(legend.position = "none") +
    ggplot2::theme(aspect.ratio = 3/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of participants",
                  subtitle = "Filled inside \naes()") 

## bar chart: filled colors outside aes() (manually controlled)
p_fill_out <- nhanes_2012_clean1 |> 
    dplyr::select(gun_use) |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
    # ggplot2::theme(legend.position = "none") +
    ggplot2::geom_bar(width = 0.4, fill =  c("darkred", "steelblue")) +
    ggplot2::theme_bw() +
    ggplot2::theme(aspect.ratio = 3/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of participants",
                  subtitle = "Filled outside \naes() my colors")

## ## bar chart: fill = data controlled by my own colors
p_fill_in_my_colors <- nhanes_2012_clean1 |> 
    dplyr::select(gun_use) |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
    ggplot2::geom_bar(
        ggplot2::aes(fill = gun_use), width = 0.4) + 
    ggplot2::theme_bw() +
    ggplot2::scale_fill_manual(values = c("darkred", "steelblue"), guide = "none") +
    ggplot2::theme(aspect.ratio = 3/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of participants",
                  subtitle = "Filled inside \nwith my colors") 

## bar chart: manually controlled colors with my own color
p_fill_out_my_colors <- nhanes_2012_clean1 |> 
    dplyr::select(gun_use) |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(ggplot2::aes(x = gun_use)) +
    ggplot2::geom_bar(width = 0.4, fill = c("darkred", "steelblue")) + 
    ggplot2::theme_bw() +
    ggplot2::theme(aspect.ratio = 3/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of participants",
                  subtitle = "Filled outside \nwith my colors")

##  patchwork with :: syntax ############################
## display all 4 charts side by side
## using the trick from 
## https://github.com/thomasp85/patchwork/issues/351#issuecomment-1931140157

patchwork:::"|.ggplot"(
    p_fill_in,
    p_fill_out
)
patchwork:::"|.ggplot"(
    p_fill_in_my_colors,
    p_fill_out_my_colors
)

# library(patchwork)
# p_fill_in | p_fill_out |
#      p_fill_in_my_colors | p_fill_out_my_colors 
Graph 3.3: Ever used firearms for any reason? (NHANES survey 2011-2012)
Graph 3.4: Ever used firearms for any reason? (NHANES survey 2011-2012)

  • Left Top: This graph has the color fill argument within aes() and is therefore data controlled. This means that the colors will be settled automatically by factor level.
  • Right Top: This graph has the color fill argument outside aes() and is therefore manually controlled. One needs to supply colors otherwise one gets a graph without any colors at all.
  • Left bottom: Even if the graph has the color fill argument within aes() and is therefore data controlled, you can change the color composition. But you has also the responsibility to provide a correct legend — or as I have done in this example — to remove the legend from the display. (The argument guide = FALSE as used in the book is superseded with guide = "none")
  • Right bottom: The graph is manually controlled because it has the color fill argument outside aes() with specified colors.

I used {patchwork} here to show all four example graphs (see Section A.62). As always I didn’t want to use the base::library() function to load and attach the package. But I didn’t know how to do this with the {patchwork} operators. Finally I asked my question on StackOverflow and received as answer the solution.

At first I tried it with the + operator. But that produced two very small graphs in the first row of the middle of the display area, and other two small graphs in the second row of the middle of the display area. Beside this ugly display the text of the subtitle was also truncated. After some experimentation I learned that I had to use the | operator.

3.4.3 Pie Chart

3.4.3.1 Introduction

Pie charts show parts of a whole. The pie, or circle, represents the whole. The slices of pie shown in different colors represent the parts. A similar graph type is the but they are not recommended” for several reasons

  • Humans aren’t particularly good at estimating quantity from angles: Once we have more than two categories, pie charts can easily misrepresent percentages and become hard to read.
  • Pie charts do badly when there are lots of categories: Matching the labels and the slices can be hard work and small percentages (which might be important) are tricky to show.

Resource 3.2 : Why you shouldn’t use pie charts

But there are some cases, where pie chart (or donut charts sometimes also called ring chart) are appropriate:

3.4.3.2 Visualize an important number

Visualize an important number by highlighting just one junk of the circle


Donut chart: Circle with hole in the middle, colored 63% orange with text '63% didn't visit a dentist'
(a) Pie chart demo
Donut chart: Circle with hole in the middle, colored 63% orange with text '63% didn't visit a dentist'
(b) Donut chart demo
Graph 3.5: Highlight just one junk to support only one number (Evergreen 2019, 33–35)

BTW: Donut charts are even worse than pie charts:

The middle of the pie is gone. The middle of the pie … where the angle is established, which is what humans would look at to try to determine wedge size, proportion, and data. Even though we aren’t accurate at interpreting angles, the situation is made worse when we remove the middle of the pie. Now we are left to judge curvature and … compare wedges by both curvature and angle (Evergreen 2019, 32).

3.4.3.3 Making a clear point

Use a very limited number of wedges (best not more than two) for making a clear point.

circle colored with two different blue, on the left side - about 45% it says 'Male and on the right side 'Female'
Graph 3.6: Pie charts are acceptable with very few categories (Evergreen 2019, 176)

Example 3.6 : Creating Pie & Donut Charts for Firearm Usage

R Code 3.27 : Visualize percentage of gun user from NHANES survey 2011-2012

Code
lab <- "<span style='font-size:36pt; color:white'>**35%**</span>"
gun_use_2012 |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(ggplot2::aes(x = '', y = percent)) +
    ggplot2::geom_col(fill = c("#fa8324", "grey")) +
    ggplot2::coord_polar(theta = 'y') +
    ggtext::geom_richtext(ggplot2::aes(
        x = 1.1, y = 8,
        label = lab),
        fill = "#fa8324",
        label.colour = "#fa8324") +
    ggplot2::annotate("text", x = .7, y = 10,
                      label = "                used firearms",
                      color = "white",
                      size = 6) +
    ggplot2::theme_void() 
#> Warning in ggtext::geom_richtext(ggplot2::aes(x = 1.1, y = 8, label = lab), : All aesthetics have length 1, but the data has 2 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#>   a single row.
Graph 3.7: Percentage of gun user (NHANES survey 2011-2012)

The most important code line to create a pie graph is ggplot2::coord_polar(theta = 'y'). In the concept of gg (grammar of graphics) a car chart and a pie chart are — with the exception of the above code line — identical (C 2010).

Beside the ggplot2::annotate() function for text comments inside graphics I had for to get the necessary formatting options for the big number also to use {ggtext}, one of 132 registered {ggplot2} extensions. {ggtext} enables the rendering of complex formatted plot labels (see Section A.33).

For training purposes I tried to create exactly the same pattern (color, text size etc.) of a pie chart as in Graph 3.5 (a).

R Code 3.28 : Ever used firearms for any reason? (NHANES survey 2011-2012)

Code
gun_use_2012 |> 
    tidyr::drop_na() |>
    ggplot2::ggplot(
        ggplot2::aes(x = '', y = percent, 
                     fill = forcats::fct_rev(gun_use))
        ) +
    ggplot2::geom_col() + 
    ggplot2::geom_text(ggplot2::aes(
        label = gun_use),
        color = "white",
        position = ggplot2::position_stack(vjust = 0.5),
        size = 10) +
    ggplot2::coord_polar(theta = 'y') +
    ggplot2::theme_void() +
    ggplot2::theme(legend.position = "none") +
    ggplot2::labs(x = '', y = '') +
    viridis::scale_fill_viridis(
        discrete = TRUE,
        option = "turbo",
        begin = 0.1,
        end = 0.9)
Graph 3.8: Ever used firearms for any reason? (NHANES survey 2011-2012)

I have used {viridis} to produce colorblind-friendly color maps (see Section A.96). Instead of using as the first default color yellow I have chosen with the color map options and the begin/end argument, what color should appear for this binary variable.

R Code 3.29 : Donut chart with small hole

Code
# Small hole
hsize <- 1
gun_use_small_hole <- gun_use_2012 |>  
  dplyr::mutate(x = hsize)

gun_use_small_hole |> 
    tidyr::drop_na() |>
    dplyr::mutate(x = hsize) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = hsize, y = percent, 
                     fill = forcats::fct_rev(gun_use))
        ) +
    ggplot2::geom_col() + 
    ggplot2::coord_polar(theta = 'y') +
    ggplot2::xlim(c(0.2, hsize + 0.5)) +
    ggplot2::theme_void() +
    ggplot2::labs(x = '', y = '', fill = "Gun used?") +
    viridis::scale_fill_viridis(
        breaks = c('Yes', 'No'),
        discrete = TRUE,
        option = "turbo",
        begin = 0.1,
        end = 0.9)
Graph 3.9: Ever used firearms for any reason? (NHANES survey 2011-2012)

R Code 3.30 : Donut chart with big hole

Code
hsize <- 2
gun_use_big_hole <- gun_use_2012  |>
  dplyr::mutate(x = hsize)

gun_use_big_hole |>
    tidyr::drop_na() |>
    dplyr::mutate(x = hsize) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = hsize, y = percent, 
                     fill = forcats::fct_rev(gun_use))
        ) +
    ggplot2::geom_col() + 
    ggplot2::coord_polar(theta = 'y') +
    ggplot2::xlim(c(0.2, hsize + 0.5)) +
    ggplot2::theme_void() +
    ggplot2::labs(x = '', y = '', fill = "Gun used?") +
    viridis::scale_fill_viridis(
        breaks = c('Yes', 'No'),
        discrete = TRUE,
        option = "turbo",
        begin = 0.1,
        end = 0.9)
Graph 3.10: Ever used firearms for any reason? (NHANES survey 2011-2012)

3.4.3.4 Case Study

It is a fact that pie chart are still very popular. I recently found for instance a pie chart in one of the most prestigious German weekly newspaper Die Zeit the following pie chart about the financing of the United Nations Relief and Works Agency for Palestine Refugees in the Near East (UNRWA)

A German graph that shows as a doughnut the amount and proportion of financing the UNRWA through the ten biggest sponsors in 2022. In the middle (the whole) is written the total sum with 1.17 billion dollar. From right to left: USA 344 Mio, Several countries including Kuwait (13) and Qatar (12) with 278 Mio , Germany (202 Mio ), EU (114 Mio ), Sweden (61), Norway (34), Japan (39), France (29), Saudi Arabia (27), Switzerland (26), and Turkey with 25 Mio. US dollar.
Graph 3.11: Who is financing the UNRWA?

The figure is colorful and looks nice. More important is that is has all the important information (Country names and amount of funding) written as labels into the graphic. It seems that this a good example for a pie chart, a kind of an exception to the rule.

But if we get into the details, we see that the graph was a twofold simplification of the complete data set. Twofold because besides the overall ranking of 98 sponsors there is a simplified ranking list of the top 20 donors.The problem with pie charts (and also with waffle charts) is that you can’t use them with many categories.

So it was generally a good choice of “Die Zeit” to contract financiers with less funding and to concentrate to the biggest sponsors. But this has the disadvantage not to get a complete picture that is especially cumbersome for political decisions. As a result we have a huge group of miscellaneous funding sources (2nd in the ranking!) that hides many countries that are important to understand the political commitment im the world for the UNRWA.

But let’s see how this graphic would be appear in alternative charts

Experiment 3.1 : Top 20 UNRWA Donors in 2022

R Code 3.31 : Get UNRWA data of top 20 donors from PDF file with {tabulizer}

Code
unrwa <- tabulizer::extract_tables(
    "data/chap03/top_20_donors_2022_overall_ranking.pdf")

unrwa_donors <- tibble::as_tibble(unrwa[[1]])

base::saveRDS(unrwa_donors, "data/chap03/unrwa_donors.rds")

(For this R code chunk is no output available)

R Code 3.32 : Recode and show UNRWA data of top 20 donors 2022

Code
unrwa_donors <- base::readRDS("data/chap03/unrwa_donors.rds")


base::options(scipen = 999)
unrwa_donor_ranking <- unrwa_donors |> 
    dplyr::rename(
        donor = "V1",
        total = "V6"
    ) |> 
    dplyr::select(donor, total) |> 
    dplyr::mutate(donor = forcats::as_factor(donor)) |> 
    dplyr::mutate(donor = forcats::fct_recode(donor,
          Spain = "Spain (including Regional Governments)*",
          Belgium = "Belgium (including Government of Flanders)",
          Kuwait = "Kuwait (including Kuwait Fund for Arab Economic Development)"
          )
    ) |> 
    dplyr::slice(6:25) |> 
    dplyr::mutate(dplyr::across(2, 
    function(x)
        { base::as.numeric(as.character(base::gsub(",", "", x))) })
    ) 

utils::str(unrwa_donor_ranking)
skimr::skim(unrwa_donor_ranking)
#> Warning: There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
#>   mangled_skimmers$funs)`.
#> ℹ In group 0: .
#> Caused by warning:
#> ! There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
#>   mangled_skimmers$funs)`.
#> Caused by warning in `sorted_count()`:
#> ! Variable contains value(s) of "" that have been converted to "empty".
#> tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ donor: Factor w/ 23 levels "","Donor","USA",..: 3 4 5 6 7 8 9 10 11 12 ...
#>  $ total: num [1:20] 343937718 202054285 114199150 60969987 34180677 ...
Data summary
Name unrwa_donor_ranking
Number of rows 20
Number of columns 2
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
donor 0 1 FALSE 20 USA: 1, Ger: 1, EU: 1, Swe: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total 0 1 52728341 82086354 10500000 15363671 24456320 31159321 343937718 ▇▁▁▁▁

R Code 3.33 : Top 20 UNRWA donors in 2022

Code
unrwa_donor_ranking |> 
    ggplot2::ggplot(
        ggplot2::aes(
            x = stats::reorder(x = donor, X = total),
            y = total / 1e6)
        ) +
    ggplot2::geom_point(size = 3) +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Donor",
        y = "Donation (in Million US Dollars)"
    )
Graph 3.12: Top 20 UNRWA donors in 2022 (in millions US dollar)

This graph is not as spectacular as “Die Zeit” figure, but carries more information. But what is much more important:

  1. It shows much better the ranking.
  2. It makes it obvious that there are only 4 donors (USA, Germany, EU and Sweden) that stand out from all the other financiers.

To adapt a quotation from Statistical Consulting Centre in its article Why you shouldn’t use pie charts:

Some people might say this graph is boring, but as Edward Tufte warns in his book Envisioning information:

“Cosmetic decoration, which frequently distorts the data, will never salvage an underlying lack of content.”

Why is my point chart so much better?

  • We can estimate the quantities – the gridlines are helpful here.
  • We can immediately see the order of the donors.
  • We can accurately compare the different donors.


3.4.4 Waffle Chart

Example 3.7 : Creating Waffle Charts

R Code 3.34 : Creating a waffle chart for number of total rounds fired (NHANES survey 2011-2012)

Code
fired_2012 |> 
    waffle::waffle(rows = 10,
                   colors = c("lightblue1", "lightsteelblue1", 
                              "deepskyblue1", "dodgerblue3", "black",
                    "lemonchiffon1"))
(a) Proportion of total rounds fired (NHANES survey 2011-2012)
Listing / Output 3.15: Proportion of total rounds fired (NHANES survey 2011-2012)

In contrast to the example in the book I have used percentages and not absolute numbers.

To emulate the percentage view of a pie chart, a 10x10 grid should be used with each square representing 1% of the total. (waffle homepage)

Another advantage: Using percentages I can compare 2011-2012 with 2017-2018 (see Graph 3.16 (a))

'How many rounds in total have you ever fired?' this question has several options: - 1 = 1 to less than 100 rounds  - 2 = 100 to less than 1,000 rounds  - 3 = 1,000 to less than 10,000 rounds  - 4 = 10,000 to less than 50,000 rounds  - 5 = 50,000 rounds or more  - 7 = Refused  - 9 = Don’t know
Graph 3.13: How many total rounds have you ever fired? Codebook 2011-2012 AUDIOMETRY

R Code 3.35 : Creating a waffle chart for number of total rounds fired (NHANES survey 2017-2018)

Code
nhanes_2018 <- readRDS("data/chap03/nhanes_2018.rds")

rounds_fired_2018 <- nhanes_2018 |> 
    dplyr::select(AUQ310) |> 
    tidyr::drop_na() |> 
    dplyr::mutate(AUQ310 = forcats::as_factor(AUQ310)) |> 
    dplyr::mutate(AUQ310 = forcats::fct_recode(AUQ310,
        "1 to less than 100" = "1",
        "100 to less than 1000" = "2",
        "1000 to less than 10k" = "3",
        "10k to less than 50 k" = "4",
        "50k or more" = "5",
        "Refused to answer" = "7",
        "Don't know" = "9")
    )  |>    
    dplyr::rename(rounds_fired = AUQ310) 

fired_2018 <-  rounds_fired_2018 |> 
    dplyr::count(rounds_fired) |>
    dplyr::mutate(prop = round(n / sum(n), 2) * 100) |> 
    dplyr::relocate(n, .after = dplyr::last_col())

(
    waffle_plot <- 
        waffle::waffle(parts = fired_2018,
                       rows = 10,
                       colors = c("lightblue1", "lightsteelblue1", 
                                  "deepskyblue1", "dodgerblue3", "black",
                       "gold1", "lemonchiffon1"))
)
(a) Proportion of total rounds fired (NHANES survey 2017-2018)
Listing / Output 3.16: Proportion of total rounds fired (NHANES survey 2017-2018)

The number of different levels of the factor variable is almost too high to realize at one glance the differences of the various categories.

Best practices suggest keeping the number of categories small, just as should be done when creating pie charts. (Create Waffle Charts Visualization)

Compare 2011-2012 with 2017-2018 (see Graph 3.15 (a)). You see there is just a small difference: Respondents in the 2017-2018 survey have fired tiny less rounds as the people asked in the 2011-2012 survey. Generally speaking: The fired total of rounds remains more or less constant during the period 2012 - 2018.

'How many rounds in total have you ever fired?' this question has several options: - 1 = 1 to less than 100 rounds  - 2 = 100 to less than 1,000 rounds  - 3 = 1,000 to less than 10,000 rounds  - 4 = 10,000 to less than 50,000 rounds  - 5 = 50,000 rounds or more  - 7 = Refused  - 9 = Don’t know
Graph 3.14: How many total rounds have you ever fired? Codebook 2017-2018 AUDIOMETRY

R Code 3.36 : Compare the total rounds fired between the NHANES survey participants 2011-2012 and 2017-2018

Code
fired_df <- dplyr::full_join(x = fired_2012,
                          y = fired_2018,
                          by = dplyr::join_by(rounds_fired) 
)

fired_df <-  fired_df |> 
    dplyr::rename(
        "Rounds fired" = rounds_fired,
        `2012(%)` = prop.x,
        `n (2012)` = n.x,
        `2018(%)` = prop.y,
        `n (2018)` = n.y,
        ) |> 
    dplyr::mutate(`Diff (%)` = `2012(%)` - `2018(%)`)

fired_df
Table 3.1: Total rounds fired of NHANES survey participants 2011-2012 and 2017-2018
#>            Rounds fired 2012(%) n (2012) 2018(%) n (2018) Diff (%)
#> 1    1 to less than 100      43      701      45      289       -2
#> 2 100 to less than 1000      26      423      24      154        2
#> 3 1000 to less than 10k      18      291      20      131       -2
#> 4  10k to less than 50k       7      106      NA       NA       NA
#> 5           50k or more       4       66       2       15        2
#> 6            Don't know       2       26       2       10        0
#> 7 10k to less than 50 k      NA       NA       7       45       NA
#> 8     Refused to answer      NA       NA       0        2       NA

The participants of the NHANES survey 2011-2012 and 2017-2018 fired almost the same numbers of total rounds. The participants in 2017-2018 fired just a tiny amount of bullets less.

R Code 3.37 : Creating a waffle chart for number of total rounds fired (NHANES survey 2017-2018) with cividis color scale

Code
nhanes_2018 <- readRDS("data/chap03/nhanes_2018.rds")

rounds_fired_2018 <- nhanes_2018 |> 
    dplyr::select(AUQ310) |> 
    tidyr::drop_na() |> 
    dplyr::mutate(AUQ310 = forcats::as_factor(AUQ310)) |> 
    dplyr::mutate(AUQ310 = forcats::fct_recode(AUQ310,
        "1 to less than 100" = "1",
        "100 to less than 1000" = "2",
        "1000 to less than 10k" = "3",
        "10k to less than 50 k" = "4",
        "50k or more" = "5",
        "Refused to answer" = "7",
        "Don't know" = "9")
    )  |>    
    dplyr::rename(rounds_fired = AUQ310) 

fired_2018 <-  rounds_fired_2018 |> 
    dplyr::count(rounds_fired) |>
    dplyr::mutate(prop = round(n / sum(n), 2) * 100) |> 
    dplyr::relocate(n, .after = dplyr::last_col())

waffle::waffle(parts = fired_2018,
               rows = 10,
               colors = c("#BCAF6FFF", "#A69D75FF",
                          "#918C78FF", "#7C7B78FF",
                          "#6A6C71FF", "#575C6DFF", "#414D6BFF")
)
(a) Proportion of total rounds fired (NHANES survey 2017-2018) with cividis color scale
Listing / Output 3.17: Proportion of total rounds fired (NHANES survey 2017-2018) with cividis color scale

In contrast to Graph 3.16 (a) — where I have used individual choices of different colors without any awareness of color blindness or black & white printing — here I have used the color blindness friendly cividis palette form the {viridis} package. Read more about my reflections about choosing color palettes in Section 3.9.1.

3.5 Achievement 2: Graphs for a single continuous variable

3.5.1 Introduction

Options are:


Bullet List 3.8: Graph options for a single continuous variable

  • Histograms and density plots are very similar to each other and show the overall shape of the data. These two types of graphs are especially useful in determining whether or not a variable has a normal distribution.

  • Boxplots show the central tendency and spread of the data, which is another way to determine whether a variable is normally distributed or skewed.

  • Violin plots are also useful when looking at a continuous variable and are like a combination of boxplots and density plots. Violin plots are commonly used to examine the distribution of a continuous variable for different levels (or groups) of a factor (or categorical) variable.

3.5.2 Histogram

Experiment 3.2 : Histograms of research funding (2004-2015) with 10 and 30 bins

R Code 3.38 : Histogram of research funding (2004-2015) with 10 bins

Code
p_histo_funding <- research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_histogram(bins = 10,
                            fill = "grey90",
                            color = "black") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Number of causes") +
    ggplot2::theme_bw()

p_histo_funding
Graph 3.15: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

R Code 3.39 : Histogram of research funding (2004-2015) with 30 bins

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_histogram(bins = 30,
                            fill = "grey90",
                            color = "black") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Number of research topics ('Causes of Deadth')") +
    ggplot2::theme_bw()
Graph 3.16: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

With histograms it is important to play around with the number of bins. This changes the appearance of histograms sometimes quite profoundly. The default of 30 bins displays a warning that one should choose a better value with the argument binwidth (another option to control the number of bins.)

It is not quite clear for me what would be the optimal number of bins of a given data set. It was easy for Graph 2.11: There were only 30 different values, each for one. So to provide the same number of bins as number of observed days (= 30) was a sensible choice.

There is not much difference in the case of 10 or 30 bins of Experiment 3.2. A big difference would be for example when the number of modes is changing, or the mode is moving far to another value. It seems to me that with a density plot it is simpler to choose the optimal curve (even if I do not understand the underlying rationale of this kernel density estimation (KDE) procedure).

3.5.3 Density plot

Experiment 3.3 : Density plot of research funding (2004-2015) with different bandwidth

R Code 3.40 : Density plot with standard bandwidth bw = "nrd0"

Code
p_dens_funding <- research_funding |>
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "grey90",
                          color = "black") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()

p_dens_funding
Graph 3.17: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

This is the density plot without changing the default bandwidth. It uses nrd0 as bandwidth. One can see that it is somewhat similar to bw = 1.5 (see Graph 3.19).

  • bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman’s ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31))) unless the quartiles coincide when a positive result will be guaranteed. (Quoted form the help file of stats::bandwidth())

R Code 3.41 : Density plot with bandwidth of 0.5

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "#7d70b6",
                          color = "black",
                          bw = .5) +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()
Graph 3.18: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

I have replicated Figure 3.26 without knowing what bw = 0.5 means and how it is computed.

: Density plot of research funding (2004-2015) with bandwidth of 1.5

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "#7d70b6",
                          color = "black",
                          bw = 1.5) +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()
Graph 3.19: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

I have replicated Figure 3.27 without knowing what bw = 1.5 means and how it is computed. This is the figure that was chosen in the book as appropriate to represent the data distribution. It is very similar to the {ggplot2} standard version (Graph 3.17), where neither bw nor kernel was changed.

R Code 3.42 : Density plot with bandwidth selector nrd0

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "grey90",
                          color = "black",
                          bw = "nrd0") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()
Graph 3.20: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

Here I have stated expressively that bw = "nrd0". This is the default value that implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It is appropriate for normal-like distributions.

R Code 3.43 : Density plot with bandwidth selector SJ

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "grey90",
                          color = "black",
                          bw = "SJ") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()
#> Warning: Computation failed in `stat_density()`.
#> Caused by error in `precompute_bw()`:
#> ! `bw` must be one of "nrd0", "nrd", "ucv", "bcv", "sj", "sj-ste", or
#>   "sj-dpi", not "SJ".
#> ℹ Did you mean "sj"?
Graph 3.21: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

bw = "SJ" select the bandwidth using pilot estimation of derivatives and is appropriate for multimodal or general non-normal distribution (od 2018).

R Code 3.44 : Density plot with bandwidth ucv

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "grey90",
                          color = "black",
                          bw = "ucv") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()
#> Warning in stats::bw.ucv(x): minimum occurred at one end of the range
Graph 3.22: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

ucv is one of the two extremes. It chooses a very small bandwidth. (The other extreme selector is bcv which chooses very a wide bandwidth.)

R Code 3.45 : Density plot with bandwith bcv

Code
research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_density(fill = "grey90    ", 
                          color = "black",
                          bw = "bcv") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar",
                  y = "Probability density") +
    ggplot2::theme_bw()
#> Warning in stats::bw.bcv(x): minimum occurred at one end of the range
Graph 3.23: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

bcv is one of the two extremes. It chooses a very wide bandwidth. (The other extreme selector is ucv which chooses a very narrow bandwidth.)

The equivalent of binwidth in histograms is bw (smoothing bandwidth) in density plots. I have to confess that I do not understand all the relevant factors to choose the optimal bandwidth for density plots (or binwidth for histograms).

In density plots there is also the kernel argument for choosing the appropriate smoothing kernel. I learned from the video by od webel (2018) that the chosen kernel (gaussian”, the standard, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “cosine” or “optcosine”) is not so important because the result all in similar distributions. Most important, however, is to choose the appropriate bandwidth.

It is said that one disadvantage of density plots is its feature to smooth out the distribution so that you cannot see anymore — in contrast to histograms — if there are gaps in the data. But if you choose a very small bandwidth like ucv then you get a distribution similar to a histogram.

Resource 3.3 : Information about kernel density estimation and bandwidth

Much of my understanding about bandwidth and kernel density estimation comes from the video Intro to Kernel Density Estimation by od webel (2018).

Another article on this complex topic: What is Kernel Density Estimation?. From this article I learned that a kernel is nothing else as a weighting function to estimate the probability density function (PDF). (Until now I had a sketchy impression about this concept derived from another meaning of “kernel”, related with the core part of a computer.)

3.5.4 Box Plot

Histograms and density plots are great for examining the overall shape of the data for a continuous variable. Boxplots in contrasts are useful for identifying the middle value and the boundaries around the middle half of the data.


Explains the parts of a box plots: The middle 50% are called Interquartile Range (IQR). It is followed by the 'Whiskers'. Whiskers are calculated: Take the first (lower fence) or third (upper fence) 'Quartile' and add 1.5 x IQR.  A quartile is 25% of the data. Whiskers are followed by the outliers.
Graph 3.24: Parts of a boxplot (Ferreira et al. 2016, 211)

R Code 3.46 : Box plot of research funding (2004-2015)

Code
p_box_funding <- research_funding |> 
    ggplot2::ggplot(ggplot2::aes(x = Funding / 1000000000)) +
    ggplot2::geom_boxplot(fill = "grey90",
                          color = "black") +
    ggplot2::labs(x = "Research funding (2004-2015) in billions dollar") +
    ggplot2::theme_bw()

p_box_funding
Graph 3.25: Research funding (2004-2015) for the top 30 mortility casues in the U.S. (in billions dollar)

3.5.5 Summary

Each of the three presented plot types (histogram, density plot and box plot) has it strength and disadvantage.

All three graphs show the right skew clearly, while the histogram and boxplot show gaps in the data toward the end of the tail. The boxplot is the only one of the three that clearly identifies the central tendency and spread of the variable.

R Code 3.47 : Comparison of histogram, density plot and box plot

Code
gridExtra::grid.arrange(p_histo_funding, 
                        p_dens_funding, 
                        p_box_funding, 
                        nrow = 3)
Graph 3.26: Comparison: Histogram, densitiy plot and box plot

3.6 Achievement 3: Graph for two variable at once

3.6.1 Two categorical variables

3.6.1.1 Mosaic Plot

3.6.1.1.1 Properties

Mosaic plots also called Marimekko charts or Mekko charts show the relative size of groups across two or more variables. It gives an overview of the data and makes it possible to recognize relationships between different variables. For example, independence is shown when the boxes across categories all have the same areas.

Mosaic plots are the multidimensional extension of spineplots, which graphically display the same information for only one variable.


Bullet List

  • The displayed variables are categorical or ordinal scales.
  • The plot is of at least two variables. There is no upper limit, but too many variables may be confusing in graphic form.
  • The number of observations is not limited, but not read in the image.
  • The areas of the rectangular tiles that are available for a combination of features are proportional to the number of observations that have this combination of features.
  • Unlike as other graphs, it is not possible for the mosaic plot to plot a confidence interval.
Bullet List 3.9: Properties of Mosaic Charts (Wikipedia)

3.6.1.1.2 Mosaic charts with {ggmosaic}

Example 3.8 : Mosaic chart of gun use, rounds fired, ear plugs and by sex (NHANES survey 2011-2012)

R Code 3.48 : Firearm use by sex in the US among 2011–2012 NHANES participants

Code
nhanes_2012_clean <- readRDS("data/chap03/nhanes_2012_clean.rds")

library(ggmosaic)
library(ggplot2)

nhanes_2012_clean |> 
    tidyr::drop_na(c(gun_use, sex)) |> 
    ggplot2::ggplot() + 
    ggmosaic::geom_mosaic(ggplot2::aes(
        x = ggmosaic::product(gun_use, sex), 
        fill = gun_use)
        ) +
    ggplot2::theme_classic() +
    ggplot2::labs(x = "Sex",
              y = "Ever used a gun?",
              fill = "Gun use") +
    ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE)) +
    ggplot2::scale_fill_viridis_d(
      alpha = 1,
      begin = 0.15,
      end = 0.35,
      direction = -1,
      option = "turbo",
      aesthetics = "fill"
    )
Graph 3.27: Firearm use by sex in the US among 2011–2012 NHANES participants

Better use of mosaic charts when variables have different levels with different proportions

Graph 3.27 is a bad example for a mosaic chart: One of its features (that the area of the columns reflect the proportion of the level) does not shine in this example because both levels (men and women) have approximately the same proportion (49.8 vs. 50.2%). Although it is better to have variables with several levels.

WATCH OUT! Loading and attaching {ggmosaic} is mandatory

Just calling the {ggmosaic} functions with the :: syntax results into an error. I have to expressively load and attach the package with library(ggmosiac). But the error only appears when I render the whole document; running just the R code chunk works. I suppose that this results from a misbehavior of the {ggmosaic} code.

There is also another warning with {ggmosaic} version 0.33: One should use tidyr::unite() instead of tidyr::unite_() since {tidyr} version 1.2.0. This is a missing adaption in the {ggmosaic} package as the warning message pointed out. I could get rid of this warning message by installing the GitHub version 0.34 as recommended in this GitHub post.

R Code 3.49 : Rounds fired by sex in the United States among 2011–2012 NHANES participants

Code
nhanes_2012_clean |> 
    dplyr::mutate(rounds_fired =
          dplyr::na_if(rounds_fired, "Don't know")) |> 
    tidyr::drop_na(c(rounds_fired, sex)) |> 
    dplyr::mutate(rounds_fired =
          forcats::fct_drop(rounds_fired)) |> 
    ggplot2::ggplot() + 
    ggmosaic::geom_mosaic(ggplot2::aes(
        x = ggmosaic::product(rounds_fired, sex), 
        fill = rounds_fired)
        ) +
    ggplot2::theme_classic() +
    ggplot2::labs(x = "Sex",
                  y = "Total number of rounds fired",
                  fill = "Rounds fired") +
    ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE)) +
    ggplot2::scale_fill_viridis_d(
        alpha = 1,  # alpha does not work!
        begin = .25,
         end = .75,
        direction = -1,
        option = "cividis"
    )
Graph 3.28: Rounds fired by sex in the United States among 2011–2012 NHANES participants

This is already a more informative chart as Graph 3.27. It shows that women are not only less using a gun, those they do, they fire less round than men.

But mosaic charts really shine when both variables have several levels as shown in Graph 3.29.

I have experimented with several approaches to provide a specific scale. In this case the color friendly cividis scale from the {viridis} package. From the 5 different scaling option I have 4 commented out. See the code for the details.

R Code 3.50 : Number of rounds fired by wearing ear plugs among the 2011–2012 NHANES participants

Code
nhanes_2012_clean |> 
    dplyr::mutate(rounds_fired =
        dplyr::na_if(rounds_fired, "Don't know")) |> 
    dplyr::mutate(ear_plugs =
        dplyr::na_if(ear_plugs, "Don't know")) |> 
    tidyr::drop_na(c(rounds_fired, ear_plugs)) |> 
    dplyr::mutate(rounds_fired =
        forcats::fct_drop(rounds_fired)) |>
    dplyr::mutate(ear_plugs =
        forcats::fct_drop(ear_plugs)) |>
    ggplot2::ggplot() + 
    ggmosaic::geom_mosaic(ggplot2::aes(
        x = ggmosaic::product(rounds_fired, ear_plugs), 
        fill = rounds_fired)
        ) +
    ggplot2::theme_bw() +
    ggplot2::theme(axis.text.x = 
           ggplot2::element_text(angle = 45, 
                                 vjust = 1, 
                                 hjust = 1)) +
    ggplot2::labs(x = "Wearing ear plugs",
                  y = "Total number of rounds fired",
                  fill = "Rounds fired") +
    ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE)) +
        viridis::scale_fill_viridis(
        discrete = TRUE,
        option = "turbo",
        direction = -1)
Graph 3.29: Rounds fired with ear plugs among the 2011–2012 NHANES participants

The chart clearly shows that a part (more than 1/3 of the respondent) never use ear plugs, even if they fire many rounds.

R Code 3.51 : Firearm use by sex in the US among 2011–2012 NHANES participants using {vcd}

Code
nhanes_2012_clean <-  base::readRDS("data/chap03/nhanes_2012_clean.rds")

test <- nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na()


vcd::mosaic(~ gun_use + sex,
            data = test,
            highlighting = "gun_use", 
            highlighting_fill = c("#5fefbd", "#7298f6"),
            highlighting_direction = "top",
            direction = c("h", "v"),
            main = "Firearm use by sex using {vcd}")
Graph 3.30: Firearm use by sex in the US among 2011–2012 NHANES participants

I have tried to replicate Graph 3.27 with the vcd::mosaic() function. But I noticed that there are several facilities of {ggplot2} not available (for instance

  • to rename x- and y-axis;
  • to rotate the text for the axis so that I can read longer variable name;
  • to present a legend with just the variable names,
    (For the first issue I would have to rename the column names in the data frame, for the second issue I couldn’t find a solution.)

Because the vcd::mosaic() function is too complex (for my) and requires a new package to learn I will stick with {ggplot2} approaches, even if I have the impression that there are some programming glitches with {ggmosaic}. (See for instance the discussion in the post forum).

Even in all the examples I found on the Internet (see for instance Creating a Mosaic Plot in R Graphics Cookbook (Chang 2024) I could not find a solution for my problems mentioned above.

3.6.1.1.3 Adaptions and other resources

My own adaptions to the mosaic figures

In contrast to the book I have used three additional adaptions for the figures produced with the {ggmosaic} package:

  1. I have used with the {viridis} package color-blind friendly palettes (see: Section A.96). There are several features to comment:
    • I have the direction of the palettes reversed so that the light color are at the bottom and the dark colors at the top.
    • In Graph 3.27 I had start and end color of the palette “turbo” manually changed to get a better color separation with lighter colors. Otherwise I would have gotten dark blue and dark red. (Unfortunately the argument alpha for setting the transparency value does not work with {viridis}.)
    • With {ggplot2} version 3.0.0 the {viridisLite} package was integrated integrated into {ggplot2}. I have used this new possibility in Graph 3.27.
  2. I have removed the small amount of “Don’t know” answers in Graph 3.28 and Graph 3.29 to get a graph that is easier to interpret.
  3. I have reversed the display of the legend so that the colors match the graphic e.g. from bottom to top are colors in the figure and in the legend identical.
  4. For Graph 3.29 I have the angle for the text of the axis set to 45° to prevent overlapping text under small columns.

Concerning color friendly palettes I have added several experiments with different packages in Section 3.9.1.

Resource 3.4 {vcd} / {vcdExtra} are other packages for mosaic graphs (and more)

The packages {vcd} and {vcdExtra} are specialized package for the visualization of categorical data. I have tried it out, but its usage for the mosaic function is more complex and need time to learn.

{vcd} and {vcdExtra} are support packages for the book, “Discrete Data Analysis with R” (Friendly and Meyer 2015) (see Section A.95). There is also a website DDAR (abbreviation of the book title) and a course description with additional material. There are several introductory vignettes to learn different aspects of the two packages.

So if you are interested to learn more about visualization of categorical data besides the mosaic graph then DDAR would be the right place.

3.6.1.2 Bar Chart

There are two different kinds of bar chars: stacked bar charts and grouped bar charts:

  • ggplot2::geom_bar() if you are providing the records (rows) and geom_bar() has to calculate the number of cases in each group (as we have already done in @#sec-chap03-bar-chart-1) or
  • ggplot2::geom_col() if you have already counted the number of cases in each groups and provide these summarized numbers to geom_col().

During working on Section 3.6.1 I have added several experiments for color friendly palettes (see Section 3.9.1). Here I will use these lesson learned and changed the default {ggplot2} color palette to the base R Okabe-Ito palette, one of the available safe colorblind palettes. (Another recommended choice are the eight {viridis} palettes, especially the viridis::cividis() palette)

Experiment 3.4 : Working with colorblind friendly color palettes

R Code 3.52 : Choose two colors from a color friendly palette

Code
my_2_colors <- grDevices::palette.colors(2, palette = "Okabe-Ito", alpha = .8, recycle = FALSE)
# my_2_colors <- paletteer::paletteer_c("viridis::cividis", n = 10)
# my_2_colors <- viridis::cividis(2, direction = -1, alpha = .8)

colorblindcheck::palette_check(my_2_colors, plot = TRUE)
#>           name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1       normal 2  64.73729   1    1 64.73729  64.73729 64.73729
#> 2 deuteranopia 2  64.73729   1    1 67.24857  67.24857 67.24857
#> 3   protanopia 2  64.73729   1    0 60.15494  60.15494 60.15494
#> 4   tritanopia 2  64.73729   1    0 62.72480  62.72480 62.72480
Table 3.2: Show color distances of the ‘Okabe-Ito’ palette for two colors for normal sighted people and people with CVD


Checking the color distances with {colorblindcheck} we see a very great distance (more than 60) not only in the normal vision palette but also with all CVD palettes. (See for additional details of how to use and interpret results of the {colorblindcheck} in Section 3.9.1.3.3).

There are several options to define a new color palette. I have here chosen three approaches, but used only one (e.g., the other two options I have commented out). I have applied the base R color vision deficiency friendly palette Okabe-Ito.

For the next few graphs I need only two colors. Normally it would not change the appearance if I define a color palette with more color. The first two would always the same. But there is one exception: When I need to reverse the two color used — as I have done in Listing / Output 3.18 — then the reverse function is applied to the complete color palette and results in a different color composition.

This problem could be overcome with the {ggokabeito} package, where you could manually define the order of the colors (see Section A.28). Additionally it provide the Okabe Ito scale for direct use in {ggplot}.

Nevertheless I haven’t use more than two colors. The categorical variable in Graph 3.35 is an ordered variable (0 to 50k rounds fired) but the Okabe Ito palette is with its distinct discrete colors not well suited to display ordered variables.

R Code 3.53 : Change the {ggplot2} default color palette

Code
## change 2 ggplot2 colors #############
options(
  ggplot2.discrete.colour = my_2_colors,
  ggplot2.discrete.fill = my_2_colors
)

nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

p1 <- nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    ggplot2::ggplot(ggplot2::aes(x = sex, fill = gun_use)) +
    ggplot2::geom_bar() +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "none") +
    ggplot2::theme(aspect.ratio = 4/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of respondents")

p2 <- nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    ggplot2::ggplot(ggplot2::aes(x = sex, fill = gun_use)) +
    ggplot2::geom_bar(position = "dodge") +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "none") +
    ggplot2::theme(aspect.ratio = 2/1) +
    ggplot2::labs(x = "Gun use", 
                  y = "Number of respondents")


patchwork:::"-.ggplot"(p1, p2)

## restore 2 ggplot2 colors #############
# options(
#   ggplot2.discrete.color = NULL,
#   ggplot2.discrete.fill = NULL
# )
Graph 3.31: Change the {ggplot2} default color palette to my two CVD save colors

I have changed the default colors of the {ggplot2} palette with two discrete colors using options(ggplot2.discrete.fill = my_2_colors). To restore the default {ggplot2} color palette use options(ggplot2.discrete.fill = NULL).

3.6.1.2.1 Stacked Bar Chart

Stacked bar charts (like pie & waffle charts) show parts of a whole and have similar problems as pie charts and mosaic plots. If there are many groups or groups of similar size, stacked bar charts are difficult to interpret and not recommended.

Example 3.9 : Stacked bar charts for two categorical variables

R Code 3.54 : Gun use by sex (using raw data = geom_bar())

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use, fill = sex)
        ) + 
    ggplot2::geom_bar() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Gun use",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Gun use"
        ))
Graph 3.32: Stacked bar chart: Gun use by sex (using raw data = geom_bar()

R Code 3.55 : Gun use by sex (using summarized data = geom_col())

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    janitor::tabyl(gun_use, sex) |> 
    tidyr::pivot_longer(
        cols = !gun_use,
        names_to = "sex",
        values_to = "count"
    ) |> 

    ggplot2::ggplot(
        ggplot2::aes(x = gun_use,
                     y = count,
                     fill = sex)
        ) + 
    ggplot2::geom_col(position = 
        ggplot2::position_stack(reverse = TRUE)) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Gun use",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Gun use",
            reverse = TRUE
        )) +
    ggplot2::scale_fill_discrete(
        type = base::rev(getOption("ggplot2.discrete.fill")))
(a) Stacked bar chart: Gun use by sex (using summarized data = geom_col())
Listing / Output 3.18: Stacked bar chart: Gun use by sex (using summarized data = geom_col())

To get exactly the same appearance with geom_col() as with geom_bar() in Graph 3.32 I had to add three important adaptions:

  1. To get the correct structure of the data I had — after the summation with janitor::tabyl() — to apply tidyr::pivot_longer().

  2. To get the same order of stacked bars (starting with female) I had to add inside ggplot2::geom_col() the argument position = ggplot2::position_stack(reverse = TRUE).

  3. To get the same order of colors I had to reverse the color order in ggplot2::scale_fill_discrete() with type = base::rev(getOption("ggplot2.discrete.fill")).

R Code 3.56 : Stacked bar chart: Gun use by sex in percent of total respondents

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use, fill = sex,
                     y = 100 * ggplot2::after_stat(count)
                     / sum(ggplot2::after_stat(count)))
        ) + 
    
    ## adornments
    ggplot2::geom_bar() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Gun use",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Gun use"
        ))
Graph 3.33: Stacked bar chart: Gun use by sex in percent of total respondents

WATCH OUT! This bar chart is not correct.

Instead of displaying parts of whole (100% in each group) it relates its grouped value to the total amount of observations. This lead to a misleading perception of the relationships if the groups have different sizes. For instance in this graph male using and not using guns have approximately the same percentage (about 25%). But this is not correct: Almost 75% of the men are using fire arms.

R Code 3.57 : Proportional stacked bar chart: Gun use by sex in percent of grouped respondents

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |>
    tidyr::drop_na() |> 

    # prepare for proportional stacked bar
    dplyr::group_by(gun_use, sex) |>
    dplyr::count() |>
    ## pick the variable that will add # to 100%
    dplyr::group_by(gun_use) |>
    ## compute percents within chosen variable
    dplyr::mutate(percent = 100 * (n / sum(n))) |>
    
    
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use, 
                     y = percent,
                     fill = sex)) + 
    ggplot2::geom_col() +
    
    
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Gun use",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Gun use"
        )) 
(a) Proportional stacked bar chart: Gun use by sex in percent of grouped respondents
Listing / Output 3.19: This code snippet from the book is more complex as necessary

This is the same code as in the book. It is more complex as necessary because:

  1. For proportional bars it is sufficient to use position = "fill" in geom_bar() / geom_col(). Then the complex preparation for proportional bars is not necessary anymore.
  2. Instead of multiplying the n / sum(n) with 100 one could use ggplot2::scale_y_continuous(labels = scales::percent). This has the additional advantage that % sign is appended at the y-values.

I will show the more concise code in Listing / Output 3.20.

R Code 3.58 : Proportional stacked bar chart: Gun use by sex

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use, fill = sex)) + 
    ggplot2::geom_bar(position = "fill") +
    
    ## adornments
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Gun use",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Gun use"
        )) +
    ggplot2::scale_y_continuous(labels = scales::percent)
(a) Proportional stacked bar chart: Gun use by sex
Listing / Output 3.20: More concise code for the proportional stacked bar chart: Gun use by sex

This is the reduced code of the book snippet from Listing / Output 3.19. It does not need a special preparation for proportional bar parts. Instead just add position = "fill" into the ggplot2::geom_bar() or ggplot2::geom_col() function.


3.6.1.2.2 Grouped Bar Chart

Grouped bar charts are the preferred option for bar charts. If there are more than two levels of categorical variables stacked bar charts are difficult to interpret. They lack a common reference point as the different observations or percentage in each levels starts and ends at different absolute position. A comparison of the relative size is for the human eye therefore awkward.

From the conceptual point of view there is no difference in the creation between stacked and grouped bar charts. The only difference is that grouped bar charts have the argument position = "dodge" inside the geom_bar() or geom_col() argument.

Example 3.10 : Grouped bar charts for two categorical variables

R Code 3.59 : Grouped bar chart with two variables with only two levels: Gun use by sex

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, sex) |> 
    tidyr::drop_na() |> 
    ggplot2::ggplot(
        ggplot2::aes(x = gun_use, fill = sex)
        ) + 
    ggplot2::geom_bar(position = "dodge") +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Gun use",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Gun use"
        ))
Graph 3.34: Grouped bar chart with two variables with only two levels: Gun use by sex

R Code 3.60 : Choose six colors from a color friendly palette

Code
my_6_colors <- viridis::cividis(6, direction = -1, alpha = .8)
colorblindcheck::palette_check(my_6_colors, plot = TRUE)
## change 6 ggplot2 colors #############
options(
  ggplot2.discrete.colour = my_6_colors,
  ggplot2.discrete.fill = my_6_colors
)
#>           name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1       normal 6  11.43205  15   15 11.43205  41.93861 94.45609
#> 2 deuteranopia 6  11.43205  15   14 11.32162  42.45045 95.77164
#> 3   protanopia 6  11.43205  15   14 11.15705  40.71083 90.60717
#> 4   tritanopia 6  11.43205  15   13 10.49463  35.80623 78.10328
Table 3.3: Show color distances of the ‘cividis’ palette for six colors for normal sighted people and people with CVD


For the bar charts rounds fired by sex we need a different scale with

  • more colors (six colors in total) and
  • a gradual progression to show the order of the categorical variable

Checking the color distances of the cividis palette from the {viridis} packages with {colorblindcheck} we can see that the minimum distance is 10.5. This is relatively low but still acceptable. I have therefore set the default {ggplot2} color palette to these six colors of the cividis palette.

R Code 3.61 : Grouped bar chart of two variables with several levels: Rounds fired by sex

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(rounds_fired, sex) |> 
    dplyr::mutate(rounds_fired =
          dplyr::na_if(rounds_fired, "Don't know")) |> 
    tidyr::drop_na() |> 
    base::droplevels() |> 
    ggplot2::ggplot(
        ggplot2::aes(x = sex, fill = rounds_fired)
        ) + 
    ggplot2::geom_bar(position = 
        ggplot2::position_dodge()
        ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Rounds fired",
        y = "Number of respondents"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Rounds fired")
        ) 
Graph 3.35: Grouped bar chart of two variables with several levels: Rounds fired by sex

From this graph it is obvious that men fire generally more rounds than women. But the absolute value is not conclusive because we cannot see how many people in each group was excluded because they didn’t even fire one round. It could be that there are much more men than women that did not even fire one round. We already know that this is not the case but from this graph alone you wouldn’t know.

To get a more complete picture:

  1. We add also people without any use of fire arms into the graph.
  2. We calculate the percentage of rounds fired per group (= by sex).

R Code 3.62 : Grouped bar chart of two variables with several levels: Rounds fired by sex of all respondents

Code
nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    dplyr::select(gun_use, rounds_fired, sex) |> 
    dplyr::mutate(rounds_fired = 
          dplyr::if_else(gun_use == "No", "0 rounds fired", rounds_fired)
    ) |> 
    dplyr::mutate(rounds_fired =
          dplyr::na_if(rounds_fired, "Don't know")) |>
    tidyr::drop_na() |>
    base::droplevels() |>
    
    ## prepare for percentage calculation
    dplyr::group_by(rounds_fired, sex) |>
    dplyr::count() |>
    ## pick the variable that will add to 100%
    dplyr::group_by(sex) |>
    ## compute percents within chosen variable
    dplyr::mutate(percent = n / sum(n)) |>

    ggplot2::ggplot(
        ggplot2::aes(x = sex, 
                     y = percent,
                     fill = rounds_fired)
        ) + 
    ggplot2::geom_col(position = 
        ggplot2::position_dodge()
        ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Rounds fired",
        y = "Percentage"
    ) +
    ggplot2::guides(
        fill = ggplot2::guide_legend(
            title = "Rounds fired")
        ) +
    ggplot2::scale_y_continuous(
        labels = scales::label_percent()
        ) +
    ggplot2::scale_fill_brewer(palette = "BuGn")
Graph 3.36: Grouped bar chart of two variables with several levels: Rounds fired by sex of all respondents

Now we can say generally that more man use firearms than women. From those respondents (men and women) that are using guns, men fire generally more rounds than women.

3.6.2 One categorical and one continuous variable

3.6.2.1 Bar Chart

The is just one difference for bar charts with one categorical and one continuous variable in comparison to the bar charts used previously in this chapter: We need a summary statistics for the continuous variable so that we can present accordingly the height of the bar. These summary statistics has to be inserted into the geom_bar() function.

Example 3.11 : Bar charts with one categorical and one continuous variable

R Code 3.63 : The R code from the book does not work

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |>  
    ggplot2::ggplot(
        ggplot2::aes(x = weapons, y = deaths)
        ) + 
    ggplot2::geom_bar(stat = "summary", fun.y = mean) + 
    ggplot2::theme_minimal() + 
    ggplot2::labs(x = "Firearm type", 
                  y = "Number of homicides committed")
#> Warning in ggplot2::geom_bar(stat = "summary", fun.y = mean): Ignoring unknown
#> parameters: `fun.y`
#> No summary function supplied, defaulting to `mean_se()`
Graph 3.37: Using the R code from the book does not work

Here I was using exactly the same code as in the book for Figure 3.45. Because of changes in {ggplot2) this code does not work anymore. I got two messages:

  • No summary function supplied, defaulting to mean_se()
  • Warning: Ignoring unknown parameters: fun.y

The problem is that the arguments inside ggplot2::geom_bar() didn’t work.

R Code 3.64 : Revised book code using ggplot2::stat_summary()

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |> 
    ggplot2::ggplot(
        ggplot2::aes(x = weapons)
    ) +
    ggplot2::geom_bar() +
    ggplot2::stat_summary(
        ggplot2::aes(y = deaths),
        fun = "mean",
        geom = "bar"
        ) +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type", 
                  y = "Number of homicides committed") 
Graph 3.38: Bar chart with calculating the mean for the 2012-2016

This code chunk works without any message and warning. There are three differences to the code from the book:

  1. Only the x variable is the aesthetics for the ggplot() call.
  2. geom_bar() is empty in contrast to the book where arguments for summarizing statistics are added inside the geom_bar() parenthesis.
  3. A new function stat_summary() is added with supplies all the necessary arguments to generate the summarizing statistics. It needs an extra aesthetics for the y-axis where the calculation takes place.

R Code 3.65 : Bar chart: Calculating mean values with stat_summary() and flipped coordinates

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |> 
    ggplot2::ggplot(
        ggplot2::aes(
            x = stats::reorder(x = weapons, X = -deaths))
        ) +
    ggplot2::geom_bar() +
    ggplot2::stat_summary(
        ggplot2::aes(
            y = deaths,
            fill = weapons,
            group = weapons),
        fun = "mean",
        geom = "bar"
        ) +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type", 
                  y = "Number of homicides committed") +
    ggplot2::scale_fill_manual(values =
           c("Handguns" = "#7463AC",
             "Firearms, type not stated" = "gray",
             "Rifles" = "gray",
             "Shotguns" = "gray",
             "Other guns" = "gray"), guide = "none")
(a) Calculating mean values using geom_bar() with stat_summary() and flipped coordinates
Listing / Output 3.21: Calculating mean values using geom_bar() with stat_summary() and flipped coordinates

Here I have used the same code as before but have added several scale improvements like flipped axis, reordered the bars and changed to a sparse theme.

I had considerable problems with the fill color for this chart, because I did not know where and how to add the fill aesthetics. As I added during my experimentation a new ggplot2::eas() layer with the argument fill = weapons, I got the following warning message:

The following aesthetics were dropped during statistical transformation. This can happen when ggplot fails to infer the correct grouping structure in the data. Did you forget to specify a group aesthetic or to convert a numerical variable into a factor?

I didn’t know how to add the group aesthetic and all the examples I found on the web are referring to geom_line(). (See for instance the page of the online manual for ggplot2 Aesthetics: grouping or Mapping variable values to colors in the first edition of R Graphics Cookbook). I finally found the solution. The following text snippet is now more understandable for me:

The group aesthetic is by default set to the interaction of all discrete variables in the plot. This choice often partitions the data correctly, but when it does not, or when no discrete variable is used in the plot, you will need to explicitly define the grouping structure by mapping group to a variable that has a different value for each group. Aesthetics: grouping

An easier solution (for me) is to compute the mean values of the different firearms and then to apply a bar chart using geom_col() as I have done in the next tab with Listing / Output 3.22.

R Code 3.66 : Bar chart: Using mean values with geom_col() and flipped coordinates

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |>
    dplyr::group_by(weapons) |>
    dplyr::summarize(mean_deaths_2012_2016 = mean(deaths)) |>
    ggplot2::ggplot(
        ggplot2::aes(fill = weapons,
            x = stats::reorder(
            x = weapons, 
            X = -mean_deaths_2012_2016),
            y = mean_deaths_2012_2016)
        ) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type",
                  y = "Number of homicides committed") +
    ggplot2::scale_fill_manual(values = 
           c("Handguns" = "#7463AC", 
             "Firearms, type not stated" = "gray", 
             "Rifles" = "gray", 
             "Shotguns" = "gray", 
             "Other guns" = "gray"), guide = "none")
(a) Bar chart: Using already computed mean values with geom_col() and flipped coordinates
Listing / Output 3.22: Bar chart: Using already computed mean values with geom_col() and flipped coordinates

In this version I have calculated the mean beforehand I have passed the data to {ggplot2}. This was easier for me to understand and I had this solution long before I solved the group problem in Listing / Output 3.21.

R Code 3.67 : Number of Homicides by type of firearms (2012-2016)

Code
p <- fbi_deaths_clean |>
    ggplot2::ggplot(
        ggplot2::aes(fill = year,
                     y = deaths,
                     x = stats::reorder(
                            x = weapons, 
                            X = -deaths)
                    )
        ) +
    ggplot2::geom_bar(position = 'dodge', stat = 'identity') +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type",
                  y = "Number of homicides committed",
                  fill = "Year")

my_5_colors <- viridis::cividis(5, direction = -1, alpha = .8)
withr::with_options(
    list(ggplot2.discrete.fill = my_5_colors),
    print(p)
)
Graph 3.39: Number of Homicides by type of firearms (2012-2016)

Here I have the first time used the {withr} package (see: Section A.98) for a just temporary change of the {ggplot2) standard color palette. This has the advantage that one doesn’t restore the original value.

Report
  1. Handguns are in all years constant the type of firearms where the most homicides were committed.
  2. Starting with 2013 the homicides with handguns are rising constantly.
  3. The type of firearms not stated is also rising over the years. What is the reason that this category is on the rise? As the second most cause of homicides it could disturb the previous conclusion. Imagine that most of this category belong to handguns category than we would see a still stepper rise of homicides with handguns. Or the diametrically opposite assumption: If most of the this category belongs to one of the other smaller categories than this would change the ranking of the homicides by firearm types.

R Code 3.68 : Number of homicides in the years 2012-2016 per type of firearms

Code
p <- fbi_deaths_clean |>
    ggplot2::ggplot(
        ggplot2::aes(fill = weapons,
                     x = year,
                     y = deaths)
        ) +
    ggplot2::geom_bar(position = 'dodge', stat = 'identity') +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Year",
                  y = "Number of homicides committed",
                  fill = "Weapons") 
    
my_5_colors <- viridis::cividis(5, direction = -1, alpha = .8)
withr::with_options(
    list(ggplot2.discrete.fill = my_5_colors),
    print(p)
)
Graph 3.40: Number of homicides in the years 2012-2016 per type of firearms
Report

This graph is not very informative because it is difficult to compare the type of firearms for different years. To get this information a line chart would be much better (see Section 3.6.3.1).


3.6.2.2 Point Chart

Point charts are an alternative for simple bar graphs. The use less ink and are generated with the same code as bar charts with two exceptions:

Example 3.12 : Point charts

R Code 3.69 : Point chart: Mean annual homicides by firearm type in the United States, 2012–2016

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |>
    dplyr::group_by(weapons) |>
    dplyr::summarize(mean_deaths_2012_2016 = mean(deaths)) |>
    ggplot2::ggplot(
        ggplot2::aes(color = weapons,
                     x = stats::reorder(
                         x = weapons, 
                         X = -mean_deaths_2012_2016),
                     y = mean_deaths_2012_2016)
        ) +
    ggplot2::geom_point(size = 4) +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type",
                  y = "Number of homicides committed") +
    ggplot2::scale_color_manual(values = 
           c("Handguns" = "#7463AC", 
             "Firearms, type not stated" = "gray", 
             "Rifles" = "gray", 
             "Shotguns" = "gray", 
             "Other guns" = "gray"), guide = "none") 
Graph 3.41: Point chart flipped: Mean annual homicides by firearm type in the United States, 2012–2016

R Code 3.70 : Point chart with error bars: Mean annual homicides by firearm type in the United States, 2012–2016

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |>
    dplyr::group_by(weapons) |>
    dplyr::summarize(
        central = mean(x = deaths), 
        spread = sd(x = deaths)
        ) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = stats::reorder(
                         x = weapons, 
                         X = -central),
                     y = central)
    ) +
    ggplot2::geom_errorbar(
        ggplot2::aes(ymin = central - spread, 
                     ymax = central + spread, 
                     linetype = "Mean\n+/- sd"), 
        width = .2) +
    ggplot2::geom_point(
        ggplot2::aes(color = weapons,
                     size = "Mean (2012-2016)"),
        alpha = 0.5
    ) +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type",
                  y = "Number of homicides committed") +
    ggplot2::scale_color_manual(values = 
           c("Handguns" = "#7463AC", 
             "Firearms, type not stated" = "gray", 
             "Rifles" = "gray", 
             "Shotguns" = "gray", 
             "Other guns" = "gray"), 
           guide = "none"
           ) +
    ggplot2::scale_linetype_manual(values = 1, name = "Error bars") + 
    ggplot2::scale_size_manual(values = 4, name = "") + 
    ggplot2::theme(legend.position = "top")
Graph 3.42: Point chart with error bars: Mean annual homicides by firearm type in the United States, 2012–2016

I added alpha = 0.5 so that the small error bars are still visible “behind” the big dots.

This is a more complex graph: It has two layers (geom_errorbar() and geom_point()) and three different scales (weapons, line type and size). The difficulty for me is to know where to put the different aesthetics. For instance: I could put size = 4 after the alpha = 0.5 argument, but then the argument scale_size_manual(values = 4) would not work anymore. Otherwise it is not possible to add the alpha argument into the scale_size_manual() function.

Report

Only for “firearms, type not stated” and “Handguns” had a remarkable size of standard deviation from the mean. For the other types (“Riffles”, “shotguns” and “other guns”) the spread is so small that they did not even extend outside of the dots. (To see them, I had to apply an alpha argument.)

In the book Harris supposes that the the small number of years is the reason for this tendency. I do not believe that this may be the main reason. I think it has more to do with the absolute small numbers of observation in these type of firearms.

3.6.2.3 Box Plot

While the bar chart and point chart were great for comparing the means of the groups, the boxplot will provide more information about the distribution in each group.

Example 3.13 : Box Plots

R Code 3.71 : Boxplot: Annual homicides by firearm type in the United States, 2012–2016

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

fbi_deaths_clean |>
    ggplot2::ggplot(
        ggplot2::aes(x = stats::reorder(
                         x = weapons, 
                         X = -deaths),
                     y = deaths)
        ) +
    ggplot2::geom_boxplot() +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type",
                  y = "Number of homicides committed") 
Graph 3.43: Boxplot: Annual homicides by firearm type in the United States, 2012–2016

In contrast to the book I have not colored / filled the boxes.

In this type of graph we do not only see the bigger spread of handguns and firearms not stated, but we also see the skew of these two distributions.

In the next tab we can see the distribution of the mean values in relation to the boxplots.

Report
  1. The number of homicides by handguns and firearms, where the type is not stated, varies in the years 2012-2016 a lot.
  2. Both distributions are skewed as the median is not in the middle of the box. Firearms, whose types are not stated, has the median on the far left of the box. The distribution is right skewed because is has some large values to the right.
  3. The situation for handguns is reversed: It is a left skewed distribution because there are some small values to the left of the median resulting in a smaller mean.

R Code 3.72 : Boxplot with data points: Annual homicides by firearm type in the United States, 2012–2016

Code
fbi_deaths_clean <- base::readRDS("data/chap03/fbi_deaths_clean.rds")

## if jitter should always result 
## in the same horizontal position
# set.seed(500) 

fbi_deaths_clean |>
    ggplot2::ggplot(
        ggplot2::aes(x = stats::reorder(
                         x = weapons, 
                         X = -deaths),
                     y = deaths)
        ) +
    ggplot2::geom_boxplot() +
    ggplot2::geom_jitter() +
    ggplot2::coord_flip() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Firearm type",
                  y = "Number of homicides committed") 
Graph 3.44: Boxplot with data points: Annual homicides by firearm type in the United States, 2012–2016

I didn’t need alpha = .8 as in the book, because I didn’t fill the boxes with color.

I noticed that whenever I run the code chunk the horizontal position of the data points changes. This is the effect of the jitter command. I you want to have always exactly the same position than you would have to add a set.seed() in front of the computation.

It is interesting to see that just one data point far from the median cannot change its position. Compare the far right dot of firearms not reported. It has the same relative position to the box as the far right dot of the handguns category. But it didn’t pull the median to its position. In contrast to the handguns where several other points on the right side support the one far right dot and drag the median to the right.

3.6.2.4 Violin Plot

Violin plots can be seen as a graph type between boxplots and density plots. They are typically used to look at the distribution of continuous data within categories.

For the homicide data they do not work because there are not enough data. There were too few cases in some categories. So the book applied the violin plot to the age by sex data from the NHANES survey.

R Code 3.73 : Using a violin plot to compare respondent sex of the NHANES 2012 survey

Code
nhanes_2012_clean <-  base::readRDS("data/chap03/nhanes_2012_clean.rds")

nhanes_2012_clean |> 
    ggplot2::ggplot(
        ggplot2::aes(x = sex, y = RIDAGEYR)
    ) +
    ggplot2::geom_violin(
        ggplot2::aes(fill = sex)
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Sex",
        y = "Age in years"
    ) + 
    ggokabeito::scale_fill_okabe_ito(
        guide = "none"
    )

3.6.3 Two continuous variables

When to use line graphs and when to use scatterplots

Situations where a line graph is more useful than a scatterplot:

    1. when the graph is showing change over time, and
    1. when there is not a lot of variation in the data.

Relationships where there is no measure of time and data that include a lot of variation are better shown with a scatterplot.

3.6.3.1 Line Graph

Example 3.14 : Firearms per type manufactured in the United States

R Code 3.74 : Firearms circulating in the United States 1986-2017

Code
guns_total <-  
    base::readRDS("data/chap03/guns_total.rds")

p_guns_total <- guns_total |>
    tidyr::pivot_longer(
        cols = 2:6,
        names_to = "gun_type",
        values_to = "gun_count"
    ) |> 
    dplyr::mutate(Year = base::as.numeric(Year),
                  gun_type = forcats::as_factor(gun_type)
    ) |> 
    
    ggplot2::ggplot(
        ggplot2::aes(
            x = Year, 
            y = gun_count / 1e5)
        ) +
    ggplot2::geom_line(
        ggplot2::aes(color = gun_type)
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        y = "Number of firearms (in 100,000s)",
    ) + 
    ggplot2::scale_x_continuous(
        limits = c(1986, 2017), 
        breaks = c(1986, 1990, 1995, 2000,
                   2005, 2010, 2015, 2017)) +
    ggokabeito::scale_color_okabe_ito(
        order = c(1:3, 7, 9),
        name = "Type of
Firearm",
        breaks = c('Total', 'Handguns', 'Rifles',
                   'Shotguns',  'Misc'))


p_guns_total

# # # palette_okabe_ito(order = 1:9, alpha = NULL, recycle = FALSE)
# "#E69F00" "#56B4E9" "#009E73" "#F0E442" "#0072B2" "#D55E00" "#CC79A7" "#999999" "#000000"
Graph 3.45: Firearms circulating in the United States 1986-2017

There are four remarkable code lines:

  1. To prevent scientific notation on the y-axis I had changed the measure unit to 100000 guns (1e5).
  2. I added several breaks for the x-axis.
  3. I reordered the legend to match the position with the lines of the gun types.
  4. I used with the Okabe Ito color scale a colorblind friendly color palette and chose those colors that can be distinguished best. The function palette_okabe_ito(order = 1:9, alpha = NULL, recycle = FALSE) generates Hex color code for all nine colors of palette. If you copy the (not runable) resulted hex code into the R code chunk, you can examine the appropirate colors visually. — I commented these two code line out.

I took the ATF data form the PDF file and had calculated guns manufactured (from the USAFact.org website) and subtracted exported guns (page 4 of the ATF PDF) and added imported guns (page 6 of the ATF PDF). For details of the sources see Section 3.3.2.1.

R Code 3.75 : Firearms manufactured in the United States 1986-2019

Code
guns_manufactured_2019_clean <-  
    base::readRDS("data/chap03/guns_manufactured_2019_clean.rds")

lookup_manufactured <- c(
    Total = "Firearms manufactured (Items)",
    Pistols = "Pistols (Items)",
    Rifles = "Rifles (Items)",
    Shotguns = "Shotguns (Items)",
    Revolvers = "Revolvers (Items)",
    Misc = "Misc. Firearms (Items)"
    )


guns_manufactured_2019_clean |>
    tidyr::pivot_longer(
        cols = -Years,
        names_to = "year",
        values_to = "gun_count"
    ) |> 
    dplyr::rename(gun_type = "Years") |> 
    dplyr::mutate(year = base::as.numeric(year),
                  gun_type = forcats::as_factor(gun_type)
    ) |>
    tidyr::pivot_wider(
        names_from = gun_type,
        values_from = gun_count
    ) |> 
    dplyr::rename(
        dplyr::all_of(lookup_manufactured)
    ) |> 
    tidyr::pivot_longer(
        cols = -year,
        names_to = "gun_type",
        values_to = "gun_count"
    ) |> 
    
    ggplot2::ggplot(
        ggplot2::aes(
            x = year, 
            y = gun_count / 1e5)
        ) +
    ggplot2::geom_line(
        ggplot2::aes(color = gun_type)
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        y = "Number of firearms (in 100,000s)",
        x = "Year manufactured"
    ) +
    ggplot2::scale_x_continuous(
        limits = c(1986, 2019),
        breaks = c(1986, 1990, 1995, 2000,
                   2005, 2010, 2015, 2019)) +
    ggokabeito::scale_color_okabe_ito(
        order = c(1:3, 6, 7, 9),
        name = "Type of
Firearm",
        breaks = c('Total', 'Pistols', 'Rifles',
                   'Misc', 'Revolvers', 'Shotguns'))

## palette_okabe_ito(order = 1:9, alpha = NULL, recycle = FALSE)
# "#E69F00" "#56B4E9" "#009E73" "#F0E442" "#0072B2" "#D55E00" "#CC79A7" "#999999" "#000000"
Graph 3.46: Firearms manufactured in the United States 1986-2019

Here applies the same comment about the code structure as written for the previous tab “Circulating guns”.

Line graphs need an ggplot2::aes() specification inside the ggplot2::geom_line()

In contrast to other types of graph: Besides the definition of x and y variables, there is a second aes() specification inside the geom_line() necessary where you define linetype or the color of lines. Otherwise the graph would not work as line graph!

3.6.3.2 Scattterplot

Example 3.15 : Scatterplot for Mortality Rate versus Funding

R Code 3.76 : Scatterplot: Mortality Rate versus Funding (with metric x-scale)

Code
research_funding <- base::readRDS("data/chap03/research_funding.rds")

research_funding |> 
    ggplot2::ggplot(
        ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
                     y = Funding)
        ) +
    ggplot2::geom_point() +
    ggplot2::theme_bw()
Graph 3.47: Scatterplot: Mortality Rate versus Funding (with metric x-scale)

R Code 3.77 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale and a ‘loess’ smoother)

Code
## prevent exponential (scientific) notation
base::options(scipen = 999) 

research_funding |> 
    ggplot2::ggplot(
        ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
                     y = Funding / 1e9)
        ) +
    ggplot2::geom_point() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        y = "Funding, US $ billion"
    ) +
    ggplot2::geom_smooth(
        method = "loess", 
        formula = 'y ~ x'
        ) +
    ggplot2::scale_x_log10() + 
    ggplot2::scale_y_log10()
Graph 3.48: Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale and a ‘loess’ smoother)

  1. To produce the trend line I have used ggplot2::geom_smooth() and not ggplot2::stat_smooth() as in the book. Frankly, I do not understand the difference, because the output is the same. As parameters I have added the standard value for less than 1,000 observations method = "loess" and formula = 'y ~x'.
  2. To prevent scientific notation I have added options(scipen = 999).

A penalty to be applied when deciding to print numeric values in fixed or exponential notation. Positive values bias towards fixed and negative towards scientific notation: fixed notation will be preferred unless it is more than scipen digits wider. Help page for Options Settings

R Code 3.78 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale and a linear smoother)

Code
research_funding |> 
    ggplot2::ggplot(
        ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
                     y = Funding / 1e9)
        ) +
    ggplot2::geom_point() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        y = "Funding, US $ billion"
    ) +
    ggplot2::geom_smooth(
        method = "lm", 
        formula = 'y ~ x'
        ) +
    ggplot2::scale_x_log10() + 
    ggplot2::scale_y_log10()
Graph 3.49: Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale and a linear model smoother)

In contrast to Graph 3.48 I have here used for the trend line, e.g. the line for the best fit, the smoother for the linear model (‘lm’).

R Code 3.79 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale, a linear smoother and points labelled)

Code
research_funding |> 
    ggplot2::ggplot(
        ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
                     y = Funding / 1e9)
        ) +
    ggplot2::geom_point() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        y = "Funding, US $ billion"
    ) +
    ggplot2::geom_smooth(
        method = "lm", 
        formula = 'y ~ x'
        ) +
    ggplot2::scale_x_log10() + 
    ggplot2::scale_y_log10() +
    ggplot2::geom_text(
        ggplot2::aes(label = `Cause of Death`)
    )
Graph 3.50: Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale, a linear model smoother and points labelled)

The problem here is that the label are overlapping each other. This can be repaired with the {ggrepel} package (see Section A.32). (See Graph 3.51)

R Code 3.80 : Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale, a linear smoother and points labelled with {ggrepel})

Code
research_funding <- base::readRDS("data/chap03/research_funding.rds")


research_funding |> 
    ggplot2::ggplot(
        ggplot2::aes(x = `Mortality Rate per 100,000 Population`,
                     y = Funding / 1e9)
        ) +
    ggplot2::geom_point() +
    ggplot2::theme_bw() +
    ggplot2::labs(
        y = "Funding, US $ billion"
    ) +
    ggplot2::geom_smooth(
        method = "lm", 
        formula = 'y ~ x'
        ) +
    ggplot2::scale_x_log10() + 
    ggplot2::scale_y_log10() +
    ggrepel::geom_text_repel(
        ggplot2::aes(label = `Cause of Death`)
        )

## return from fixed to standard notation
base::options(scipen = 0) 
Graph 3.51: Scatterplot: Mortality Rate versus Funding (with logarithmic x-scale, a linear model smoother and points labelled with {ggrepel})

With the {ggrepel} package one prevents overlapping of labels in a scatterplot. Compare to Graph 3.50.

3.7 Ensuring graphs are well formatted

For a graph to stand alone, it should have as many of these features as possible:

  • Clear labels and titles on both axes
  • An overall title describing what is in the graph along with
  • Source of data (added, pb)
  • Date of data collection
  • Units of analysis (e.g., people, organizations)
  • Sample size

In addition, researchers often use the following to improve a graph:

  • Scale variables with very large or very small values (e.g., using millions or billions).
  • Color to draw attention to important or relevant features of a graph.

Most of the content in this chapter were tutorial material to choose and produce appropriate graphs for the data at hand. Therefore I didn’t always follow the above rules. Especially the source of the data, and the date of the data collection were most of the time not added.

In the next graph I try to add all the important information to fulfill the above advice.

R Code 3.81 : Firearms circulating in the United States 1986-2017

Code
p_guns_total +
    ggplot2::labs(
        title = "Firearms circulating in the United States 1986-2017",
        subtitle = "Own Calculation: Firearms manufactured (p.2)\nsubtracted Exports (p.4) and added Imports (p.6).",
        caption = "Source: Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF)"
    )
Graph 3.52: Firearms Commerce in the United States FY 2019 Update (ATF)

  • Hyperlinks: I could not manage to provide hyperlinks in title, subtitle or caption (= “Source”). Maybe it has to do with the lightbox = true directive in my _quarto.yml file, because the whole picture is an active link for the lightbox. But hyperlinks do work in the figure caption (fig-cap).
  • New line: Text for title, subtitle and caption can be split into different lines with \n. But this did not work for fig-cap.

3.8 Exercises (empty)


3.9 Additional resources for this chapter

3.9.1 Color Palettes

3.9.1.1 Introduction

At least since creating graphs for two variable at once (Section 3.6) the question arises: What color palette should be chosen?

To dive into the topic of color palettes for R is a very deep water and I can here only scratch the surface. My main purpose in this section is to provide practical orientation on tools to choose appropriate color palettes. I am following here mainly the compiled material by Emil Hvitfeldt (2024), who is also the developer of {paleteer} (see Section A.61) (Hvitfeldt 2021).

It is not enough that the colors do not offend the eye, e.g. that they are aesthetically pleasant. There are two other important consideration as well:

  1. Palettes have to retain its integrity when printed in black and white
  2. People with colorblindness are able to understand it (Perception of color palettes)

3.9.1.2 Printing black & white

Often you have a beauty colorful graphic which looks very nice in RStudio and on your web site. But how does it look when the graphics is printed out or appear in book published only in black and white?

You can check the black/white appearance of the colors your plot is using with the function colorspace::desaturate().

To test if the palette you want to use will be distorted when in black and white, use the colorspace::desaturate() function.

Experiment 3.5 : Using colorspace::desaturate() to test how color palettes perform in black & white

R Code 3.82 : Standard color palette for {ggplot2} in color and desaturated

Code
pal_data <- list(
    names = c("Normal", "desaturated"),
    color = list(scales::hue_pal()(256),
          colorspace::desaturate(scales::hue_pal()(256))))

list_plotter(pal_data$color, pal_data$names, 
             "Standard color palette for ggplot2")
Graph 3.53: Standard color palette for ggplot2

The standard color palette of {ggplot2} is completely useless when you print it out in black & white! The problem is that the colors are picked to have constant chroma and luminance thus yielding the same shade of grey when desaturated.

R Code 3.83 : {viridis} color palette “magma” (option ‘A’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::magma(256),
     colorspace::desaturate(viridis::magma(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::magma")
Graph 3.54: Viridis palette ‘magma’ (option ‘A’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “A” (“magma”)

R Code 3.84 : {viridis} color palette “inferno” (option ‘B’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::inferno(256),
     colorspace::desaturate(viridis::inferno(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::inferno")
Graph 3.55: Viridis palette ‘inferno’ (option ‘B’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “B” (“inferno”)

R Code 3.85 : {viridis} color palette “plasma” (option ‘C’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::plasma(256),
     colorspace::desaturate(viridis::plasma(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::plasma")
Graph 3.56: Viridis palette ‘plasma’ (option ‘C’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “C” (“plasma”)

R Code 3.86 : {viridis} color palette “viridis” (option ‘D’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::viridis(256),
     colorspace::desaturate(viridis::viridis(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::viridis")
Graph 3.57: Viridis palette ‘viridis’ (option ‘D’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “D” (“viridis”)

R Code 3.87 : {viridis} color palette “cividis” (option ‘E’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::cividis(256),
     colorspace::desaturate(viridis::cividis(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::cividis")
Graph 3.58: Viridis palette ‘cividis’ (option ‘E’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “D” (“cividis”)

R Code 3.88 : {viridis} color palette “rocket” (option ‘F’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::rocket(256),
     colorspace::desaturate(viridis::rocket(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::rocket")
Graph 3.59: Viridis palette ‘rocket’ (option ‘F’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “F” (“rocket”)

R Code 3.89 : {viridis} color palette “mako” (option ‘G’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::mako(256),
     colorspace::desaturate(viridis::mako(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::mako")
Graph 3.60: Viridis palette ‘mako’ (option ‘G’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “G” (“mako”)

R Code 3.90 : {viridis} color palette “turbo” (option ‘H’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "desaturated"),
     color = list(viridis::turbo(256),
     colorspace::desaturate(viridis::turbo(256))))

list_plotter(pal_data$color, pal_data$names, "viridis::turbo")
Graph 3.61: Viridis palette ‘turbo’ (option ‘H’) in color and desaturated

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “H” (“turbo”)

WATCH OUT! Begin and end of the “viridis::turbo” color scale are very similar in black & white

In this chapter I preferred “viridis::turbo” because it is the most colorful palette of {viridis}. But it turned out that the begin and end of the color scale are not distinguishable in black & white. I had therefore to determine a different begin and/or end of the palette to prevent this similarity.

3.9.1.2.0.1 Practice test

R Code 3.91 : Test how my used colors look for printing in black & white

Code
pal_data <- list(names = c("Normal", "desaturated"),
    color = list(scales::viridis_pal(
                                alpha = 1, 
                                begin = .15, 
                                 end = .35, 
                                direction = -1, 
                                option = "turbo")(2),
    colorspace::desaturate(scales::viridis_pal(
                                alpha = 1, 
                                begin = .15, 
                                end = .35, 
                                direction = -1, 
                                option = "turbo")(2)))
    )
list_plotter(pal_data$color, pal_data$names, 
    "Colors and black & white of graph gun-use by sex")
(a) Test if used colors of my graph gun-use by sex are also readable in black & white printing
Listing / Output 3.23: Test how the colors you used for your graph look for printing in black & write

Summary and my personal preferences

Read more details about {viridis} scales in the vignette Introduction to the {viridis} color maps. These palettes are not only colorful and pretty but also perceptually uniform and robust to color blindness (CVD).

  • The default scale is “D” (viridis) which is easier to read for most of the CVDs. cividis is a corrected version providing easiness for all kinds of CVDs.
  • turbo stands out, because it is a rainbow scale developed to address the shortcomings of the Jet rainbow color map. It is not perceptually uniform.
  • I do not like yellow (mostly at the end of the color scale and therefore always used, even with binary factors). But one can prevent the appearance with a different choice of the end point of the scale.
  • If you do not use all colors you can exactly provide all parameters of your scale & color choice to colorspace::desaturate(). For instance: colorspace::desaturate(viridis::turbo(5, 0.5, .25, .75, -1)) See for a practice test Listing / Output 3.23.

Resource 3.5 Other colorblind friendly palettes

  • {scico} is another package that contains 39 colorblind friendly palettes (see: Section A.79).
  • scale_color_OkabeIto and scale_fill_OkabeIto from {blindnessr} works also better for people with color-vision deficiency (CVD).

3.9.1.3 Color blindness

350 million people are color blind. Men are affected more than women: 28% (1 out of 12) men and 0.5% (1 out of 200) women. (EnChroma)

Usually when people talk about color blindness, they are referring to the most common forms of red-green color blindness. But there are also other color vision deficiency (CVD):

  • Deuteranomaly: About 6% of men are red-blind, and so they have trouble distinguishing red from green.
  • Protanomaly: About 2% of men are green-blind, and they also have trouble distinguishing red from green.
  • Tritanomaly : Less than 1% of men are blue-blind, and they have trouble distinguishing blue from yellow. (Polychrome: Color deficits)

Bullet List

  • Deutan = Red-Green Color Blind: Color Cone Sensitivity: Green: Deuteranomaly is the most common type of color blindness, affecting about 6% of men. It is characterized by a reduced sensitivity to green light, making it difficult to differentiate between shades of red and green.
  • Protan = Red-Green Color Blind: Color Cone Sensitivity: Red: Protan (“pro-tan”) is the second most common and is characterized by a reduced sensitivity to red light. People with protanomaly have difficulty distinguishing between shades of red and green.
  • Tritan = Blue-Yellow Color Blind: Color Cone Sensitivity: Blue: Tritanomaly is a rare type of color blindness that affects both males and females equally. It is characterized by a reduced sensitivity to blue light, making it difficult to differentiate between shades of blue and green, as well as yellow and red.
  • Monochromacy and Achromatopsia describes a range of conditions that include rod-Monochromacy, S-cone Monochromacy and Achromatopsia Sometimes these are collectively referred to as types of achromatopsia, as the word “achromat” meaning “no color.” However, not all cases of achromatopsia have “no color” vision.
Bullet List 3.10: Types of color vision deficiency (CVD)

There are different possibilities to determine the effect of color blindness on the used palettes or used graph colors.

3.9.1.3.1 Test with {dichromat}

The {dichromat} package can simulate color blindness on individual colors or entire palettes.

Experiment 3.6 : Effects of color blindness on different palettes using {dichromat}

R Code 3.92 : Standard color palette for {ggplot2} in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
                 color = list(scales::hue_pal()(256),
                              dichromat::dichromat(scales::hue_pal()(256), type = "deutan"),
                              dichromat::dichromat(scales::hue_pal()(256), type = "protan"),
                              dichromat::dichromat(scales::hue_pal()(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, "The effect of color blindness on the ggplot2 standard palette")
Graph 3.62: Effects of color blindness of standard color palette of ggplot2

R Code 3.93 : {viridis} color palette “magma” (option ‘A’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::magma(256),
              dichromat::dichromat(viridis::magma(256), type = "deutan"),
              dichromat::dichromat(viridis::magma(256), type = "protan"),
              dichromat::dichromat(viridis::magma(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'magma' palette")
Graph 3.63: Effects of color blindness of {viridis} palette ‘magma’ (option ‘A’)

R Code 3.94 : {viridis} color palette “inferno” (option ‘B’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::inferno(256),
              dichromat::dichromat(viridis::inferno(256), type = "deutan"),
              dichromat::dichromat(viridis::inferno(256), type = "protan"),
              dichromat::dichromat(viridis::inferno(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'inferno' palette")
Graph 3.64: Effects of color blindness of {viridis} palette ‘inferno’ (option ‘B’)

R Code 3.95 : {viridis} color palette “plasma” (option ‘C’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::plasma(256),
              dichromat::dichromat(viridis::plasma(256), type = "deutan"),
              dichromat::dichromat(viridis::plasma(256), type = "protan"),
              dichromat::dichromat(viridis::plasma(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'plasma' palette")
Graph 3.65: Effects of color blindness of {viridis} palette ‘plasma’ (option ‘C’)

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “C” (“plasma”)

R Code 3.96 : {viridis} color palette “viridis” (option ‘D’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::viridis(256),
              dichromat::dichromat(viridis::viridis(256), type = "deutan"),
              dichromat::dichromat(viridis::viridis(256), type = "protan"),
              dichromat::dichromat(viridis::viridis(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'viridis' palette")
Graph 3.66: Effects of color blindness of {viridis} palette ‘viridis’ (option ‘D’)

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “D” (“viridis”)

R Code 3.97 : {viridis} color palette “cividis” (option ‘E’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::cividis(256),
              dichromat::dichromat(viridis::cividis(256), type = "deutan"),
              dichromat::dichromat(viridis::cividis(256), type = "protan"),
              dichromat::dichromat(viridis::cividis(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'cividis' palette")
Graph 3.67: Effects of color blindness of {cividis} palette ‘cividis’ (option ‘E’)

R Code 3.98 : {viridis} color palette “rocket” (option ‘F’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::rocket(256),
              dichromat::dichromat(viridis::rocket(256), type = "deutan"),
              dichromat::dichromat(viridis::rocket(256), type = "protan"),
              dichromat::dichromat(viridis::rocket(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'rocket' palette")
Graph 3.68: Effects of color blindness of {viridis} palette ‘rocket’ (option ‘F’)

R Code 3.99 : {viridis} color palette “mako” (option ‘G’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::mako(256),
              dichromat::dichromat(viridis::mako(256), type = "deutan"),
              dichromat::dichromat(viridis::mako(256), type = "protan"),
              dichromat::dichromat(viridis::mako(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'mako' palette")
Graph 3.69: Effects of color blindness of {viridis} palette ‘mako’ (option ‘G’)

One of the continuous palettes that satisfy the criteria of readable gray shades are the palettes collected in the {viridis} resp. {viridisLite} packages. Here I have used the option “G” (“mako”)

R Code 3.100 : {viridis} color palette “turbo” (option ‘H’) in color and desaturated

Code
pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
         color = list(viridis::turbo(256),
              dichromat::dichromat(viridis::turbo(256), type = "deutan"),
              dichromat::dichromat(viridis::turbo(256), type = "protan"),
              dichromat::dichromat(viridis::turbo(256), type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on the viridis 'turbo' palette")
Graph 3.70: Effects of color blindness of {viridis} palette ‘turbo’ (option ‘H’)

R Code 3.101 : Color blindness test of my chosen colors for the graph of gun-use by sex

Code
my_colors = scales::viridis_pal(
                                alpha = 1, 
                                begin = .15, 
                                 end = .35, 
                                direction = -1, 
                                option = "turbo")(2)

pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
          color = list(my_colors,
          dichromat::dichromat(my_colors, type = "deutan"),
          dichromat::dichromat(my_colors, type = "protan"),
          dichromat::dichromat(my_colors, type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects on my chosen colors for gun use by sex")
Graph 3.71: Color blindness (CVD) replication of gun-use by sex

R Code 3.102 : Color blindness test of my chosen colors for the graph rounds fired by sex

Code
my_colors = scales::viridis_pal(
                                alpha = 1, 
                                begin = .25, 
                                 end = .75, 
                                direction = -1, 
                                option = "cividis")(5)

pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
          color = list(my_colors,
          dichromat::dichromat(my_colors, type = "deutan"),
          dichromat::dichromat(my_colors, type = "protan"),
          dichromat::dichromat(my_colors, type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects for my graph: rounds fired by sex")
Graph 3.72: Color blindness (CVD) test of my graph: rounds fired by sex

R Code 3.103 : Color blindness test of my chosen colors for the waffle graph: rounds fired

Code
my_colors <- c("lightblue1", "lightsteelblue1", 
                          "deepskyblue1", "dodgerblue3", "black",
               "gold1", "lemonchiffon1")


pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
          color = list(my_colors,
          dichromat::dichromat(my_colors, type = "deutan"),
          dichromat::dichromat(my_colors, type = "protan"),
          dichromat::dichromat(my_colors, type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects for my waffle graph: rounds fired 2018")
Graph 3.73: Color blindness (CVD) test of my waffle graph: rounds fired 2018

R Code 3.104 : Color blindness test of my chosen colors for the waffle graph: rounds fired 2018 with cividis color scale

Code
my_colors <- scales::viridis_pal(
                                alpha = 1,
                                begin = .25,
                                 end = .75,
                                direction = -1,
                                option = "cividis")(7)


pal_data <- list(names = c("Normal", "deuteranopia", "protanopia", "tritanopia"),
          color = list(my_colors,
          dichromat::dichromat(my_colors, type = "deutan"),
          dichromat::dichromat(my_colors, type = "protan"),
          dichromat::dichromat(my_colors, type = "tritan")))

list_plotter(pal_data$color, pal_data$names, 
             "Color blindness effects for my waffle graph:\nrounds fired with cividis color scale")
Graph 3.74: Color blindness (CVD) test of my waffle graph: rounds fired with cividis color scale

With the exception of the turbo palette form the {viridis} package show all palettes a color blindness friendly performance. Again this experiments shows that one should not rely on the standard {ggplot2} palette if one would design for CVD.

But be aware that it worked for my test cases with only 2, 5 and 7 colors. I think one should always test if the used colors are also readable with black and white printing and for people with CVD.

3.9.1.3.2 Test with {colorblindr}

Another helpful package to produce colorblind friendly plots is {colorblindr}. This package is only available at GitHub. It depends on the development versions of {cowplot} and {colorspace}.

It simulates the plot colors for all kinds of CVD with three different methods:

  • colorblindr::cvd_grid(<gg-object>): Displays the various color-vision-deficiency simulations as a plot in R.
  • colorblindr::view_cvd(<gg-object>): Inspect the graph in the interactive app. Use this function only interactively (not programmatically!)
  • Go to the interactive simulator and upload the image you want to test.

You get 5 different plots to compare:

  • Original
  • Desaturated
  • Deuteranope
  • Protanope
  • Tritanope

R Code 3.105 : Demo of the usage of {colorblindr}

Code
colorblindr::cvd_grid(waffle_plot)
Graph 3.75: Demonstration of the colorblindr::cvd_grid() function

I have used the plot for the proportion of total rounds fired (NHANES survey 2017-2018). Compare the simulation with the original in Graph 3.16 (a),

3.9.1.3.3 Test with {colorblindcheck}

Both methods to check for colorblind friendly plots discussed so far (Section A.17 and Section A.6) have the disadvantage that one has to determine the colors by the subjective impression of the personal visual inspection. {colorblindcheck} provides in addition to the visual inspection some objective data: It calculates the distances between the colors in the input palette and between the colors in simulations of the color vision deficiencies.

In the following experiment I am going to investigate my hand-picked colors for the waffle graph in Graph 3.16 (a). I picked these colors because the appeared to me distant enough to make an understandable and pleasant graph.

Experiment 3.7 : Experiments with the {colorblindcheck} package

R Code 3.106 : Convert colors by name into hexadecimal code

Code
waffle_colors <-  gplots::col2hex(c("lightblue1", "lightsteelblue1", 
                  "deepskyblue1", "dodgerblue3",  
                  "black", "gold1", "lemonchiffon1"))


waffle_colors
#> [1] "#BFEFFF" "#CAE1FF" "#00BFFF" "#1874CD" "#000000" "#FFD700" "#FFFACD"

I had chosen my colors by name. So the first task to apply the function of the {colorblindcheck} package is to translate them into hex code. For this conversion I have used the col2hex() function of the {gplots} package (see Section A.30)

R Code 3.107 : List different parameters for color distance values to check for colorblind friendliness

Code
colorblindcheck::palette_check(waffle_colors)
Table 3.4: List different parameters for color distince values to check for colorblind friendliness
#>           name n tolerance ncp ndcp  min_dist mean_dist max_dist
#> 1       normal 7  10.54291  21   21 10.542912  47.97701 98.42826
#> 2 deuteranopia 7  10.54291  21   20  2.795989  47.79041 98.41086
#> 3   protanopia 7  10.54291  21   20  4.073523  46.14816 97.30477
#> 4   tritanopia 7  10.54291  21   20  6.730892  41.78801 96.26958

The colorblindcheck::palette_check() function returns a data.frame with 4 observations and 8 variables:

  • name: original input color palette (normal), deuteranopia, protanopia, and tritanopia
  • n: number of colors
  • tolerance: minimal value of the acceptable difference between the colors to distinguish between them. Here I have used the the default, e.g., the minimal distance between colors in the original input palette. But I think there should be another “normed” values as yard stick to check the color friendliness (maybe about 10-12, depending of the number of colors?)
  • ncp: number of color pairs
  • ndcp: number of differentiable color pairs (color pairs with distances above the tolerance value)
  • min_dist: minimal distance between colors
  • mean_dist: average distance between colors
  • max_dist: maximal distance between colors

The table shows in the normal color view that the minimal distance between colors is about 10.5. But in all three CVDs palettes only 20 from the 21 pairs are above this distance. So there is one color pair where people with CVD can’t distinguish.

Let’s go into the details. At first we will inspect the color palettes visually and then we will investigate the distances between all pairs for all palettes of CVDs.

R Code 3.108 : Plot waffle colors palette normal and for different CVDs

Code
colorblindcheck::palette_plot(waffle_colors, severity = 1)
Graph 3.76: Waffle colors palette normal and for different CVDs with highest severities

From the visual inspections it seems that there is a problem with the color pair 1 / 2.

R Code 3.109 : Color pair distances for the normal colors

Code
colorblindcheck::palette_dist(waffle_colors)
Table 3.5: Color pair distances for the normal color palette without CVD
#>      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
#> [1,]   NA 10.54291 18.10731 37.10142 89.17120 41.73917 26.00648
#> [2,]   NA       NA 19.28529 33.38842 84.16342 48.19716 31.10567
#> [3,]   NA       NA       NA 25.22187 64.86977 60.28084 44.44362
#> [4,]   NA       NA       NA       NA 42.63815 67.09988 59.51180
#> [5,]   NA       NA       NA       NA       NA 85.64217 98.42826
#> [6,]   NA       NA       NA       NA       NA       NA 20.57238
#> [7,]   NA       NA       NA       NA       NA       NA       NA

The smallest distance in the normal color palette is about 10.5 between 1 and 2. But this is a very good distance, all the other colors perform even better with higher values of 18.

R Code 3.110 : Color pair distances for the deuteranopia CVD

Code
colorblindcheck::palette_dist(waffle_colors, cvd = "deu")
Table 3.6: Color pair distances for the deuteranopia CVD
#>      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
#> [1,]   NA 2.795989 19.67791 38.31696 87.00229 42.72322 28.91615
#> [2,]   NA       NA 17.19632 36.15745 82.97331 45.39762 31.72386
#> [3,]   NA       NA       NA 21.43223 61.57162 60.53591 47.99180
#> [4,]   NA       NA       NA       NA 41.37258 70.95919 60.97332
#> [5,]   NA       NA       NA       NA       NA 87.88964 98.41086
#> [6,]   NA       NA       NA       NA       NA       NA 19.58039
#> [7,]   NA       NA       NA       NA       NA       NA       NA

Here we can see clearly that the distance between 1 and 2 is only 2.8!. All the other colors perform excellent with higher values of 17.

R Code 3.111 : Color pair distances for the protanopia CVD

Code
colorblindcheck::palette_dist(waffle_colors, cvd = "pro")
Table 3.7: Color pair distances for the protanopia CVD
#>      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
#> [1,]   NA 4.073523 16.80032 35.11481 91.09592 39.70098 26.01640
#> [2,]   NA       NA 12.95911 31.90976 85.83276 43.51167 30.38048
#> [3,]   NA       NA       NA 20.91352 68.81192 54.71840 43.20273
#> [4,]   NA       NA       NA       NA 44.94125 65.17437 54.56921
#> [5,]   NA       NA       NA       NA       NA 82.19301 97.30477
#> [6,]   NA       NA       NA       NA       NA       NA 19.88647
#> [7,]   NA       NA       NA       NA       NA       NA       NA

Again: The problem is the color pair 1 and 2 with a slightly better difference of 4 in comparison with the deuteranopia palette of 2.8.

R Code 3.112 : Color pair distances for the tritanopia CVD

Code
colorblindcheck::palette_dist(waffle_colors, cvd = "tri")
Table 3.8: Color pair distances for the tritanopia CVD
#>      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
#> [1,]   NA 6.730892 13.01156 30.72849 90.39176 38.96913 24.90487
#> [2,]   NA       NA 15.41221 29.64808 84.57220 34.48380 20.06730
#> [3,]   NA       NA       NA 21.02786 69.32457 43.19480 31.56913
#> [4,]   NA       NA       NA       NA 43.74885 46.44875 42.21174
#> [5,]   NA       NA       NA       NA       NA 78.32859 96.26958
#> [6,]   NA       NA       NA       NA       NA       NA 16.50397
#> [7,]   NA       NA       NA       NA       NA       NA       NA

Here again the problem lies smallest distance of the first two color. But 6.7 is not so bad for distinctions. Therefore it is always necessary to see the actual value and not only to stick with the first table in Table 3.4.

It is interesting to notice that my thought from the visual inspection that the pair 1 /3 is also problematic is unsubstantiated.

R Code 3.113 : Color distances with the CVD friendly {cividis} palette

Code
waffle_colors2 = scales::viridis_pal(alpha = 1,
                                    direction = -1,
                                    option = "cividis")(7)

colorblindcheck::palette_check(waffle_colors2, plot = TRUE)

Color distances with the CVD friendly {cividis} palette

Color distances with the CVD friendly {cividis} palette
#>           name n tolerance ncp ndcp min_dist mean_dist max_dist
#> 1       normal 7  9.474139  21   21 9.474139  40.15113 94.45609
#> 2 deuteranopia 7  9.474139  21   20 9.187656  40.67049 95.77164
#> 3   protanopia 7  9.474139  21   20 9.309515  39.02739 90.60717
#> 4   tritanopia 7  9.474139  21   20 8.932613  34.28656 78.10328

From the table we can see that all three CVDs palette have a smaller tolerance as the normal color palette. But we also see that the differences are minimal. The worst case (tritanopia) is only .05 belwo and has still a distance of almost 9.

Use colorblind friendly palettes

I have tried very hard to choose colorblind friendly manually. But I didn’t succeed.

Lessons learned: Do not rely on your subjective visual impressions but use the professionally designed palettes that improve readability for all kinds of color vision deficiencies.


3.9.2 Adding palettes

I have learned to add palettes with different approaches. The following experiments list all the options I became aware in the last few days.

Experiment 3.8 : How to add palettes to {ggplot2} graphics

R Code 3.114 : Creating the gg object for adding the palette

Code
library(ggmosaic)
library(ggplot2)

## restore ggplot2 colors #############
options(
  ggplot2.discrete.color = NULL,
  ggplot2.discrete.fill = NULL
)

nhanes_2012_clean <- base::readRDS("data/chap03/nhanes_2012_clean.rds")

(
    gg <- nhanes_2012_clean |> 
        dplyr::mutate(rounds_fired =
              dplyr::na_if(rounds_fired, "Don't know")) |> 
        tidyr::drop_na(c(rounds_fired, sex)) |> 
        dplyr::mutate(rounds_fired =
              forcats::fct_drop(rounds_fired)) |> 
        ggplot2::ggplot() + 
        ggmosaic::geom_mosaic(ggplot2::aes(
            x = ggmosaic::product(rounds_fired, sex), 
            fill = rounds_fired)
            ) +
        ggplot2::theme_classic() +
        ggplot2::labs(x = "Sex",
                      y = "Total number of rounds fired",
                      fill = "Rounds fired") +
        ggplot2::guides(fill = ggplot2::guide_legend(reverse = TRUE))
)
Graph 3.77: This is the graph with the default {ggplot2} palette.

R Code 3.115 : OPTION 1: Viridis color palette from {viridisLite} using {ggplot2}

Code
gg +
    ggplot2::scale_fill_viridis_d(
        alpha = 1,  # alpha does not work!
        begin = .25,
         end = .75,
        direction = -1,
        option = "cividis"
    )
Graph 3.78: Viridis color palette from {viridisLite} using {ggplot2}

R Code 3.116 : OPTION 2: Create your own discrete scale using {ggplot2} and {scales}

Code
gg +
    ggplot2::scale_fill_manual(values = scales::viridis_pal(
                                alpha = 1,
                                begin = .25,
                                 end = .75,
                                direction = -1,
                                option = "cividis")(5)
                               )
Graph 3.79: Create your own discrete scale using {ggplot2} and {scales}

R Code 3.117 : OPTION 3: Viridis Color Palette for ggplot2 using {viridis}

Code
gg +
    viridis::scale_fill_viridis(
        begin = .25,
        end = .75,
        discrete = TRUE,
        option = "cividis",
        direction = -1
        )
Graph 3.80: Viridis Color Palette for ggplot2 using {viridis}

R Code 3.118 : OPTION 4: Viridis manual scale with {paletteer}

Code
library(paletteer)

gg +
    ggplot2::scale_fill_manual(values =
           paletteer::paletteer_c("viridis::cividis",
              n = 5,
              direction = -1)
    )
Graph 3.81: Viridis manual scale with {paletteer}

This option with {paletteer} lacks some arguments (alpha, begin and end).

R Code 3.119 : OPTION 5: Wrapper around continuous function (no direction!)

Code
library(paletteer)

gg +
    ggplot2::scale_fill_manual(values =
           base::rev(paletteer:::paletteer_c_viridis(
               name = "cividis",
               n = 5))
    )
Graph 3.82: Viridis Color Palette for ggplot2 using {viridis}

This option uses an abbreviated function of the {paletteer} package. But again it misses the arguments alpha, begin, end and direction. It is possible to use base:rev() to reverse the direction but for the other arguments I have not found an alternative.

There are different option to add a palette. The option 4 and 5 with the {paletteer} lacks some options. I have not much experience with palettes and color scales. There I am not sure if the missing options are a result of my missing knowledge or if it is a design feature or bug.

WATCH OUT! Changing the alpha argument does not work

All the options to add the {viridis} palette does not change the alpha argument. I believe that this is an error on the {ggmosiac} package.

Update: Yes, In the meanwhile I have confirmed my assumption! R Code 5.13 shows an example where the alpha argument is working!


3.9.3 Online resources

Resource 3.6 : Online resources of {ggplot2} code examples

R Graphs Galleries

Videos

  • Plotting anything with {ggplot2}: Video 1 & Video 2: Two 2 hour (!) videos: Part I focuses on teaching the underlying theory of {ggplot2} and how it is reflected in the API. Part II focuses on the extended {ggplot2} universe with practical examples from many extension packages. Further, at the end is a short section on how to approach new graphics (Pedersen 2020b, 2020a).

Books online

  • ggplot2: Elegant Graphics for Data Analysis (3e): This book gives some details on the basics of {ggplot2}, but its primary focus is explaining the Grammar of Graphics that {ggplot2} uses, and describing the full details. It will help you understand the details of the underlying theory, giving you the power to tailor any plot specifically to your needs (Wickham, Navarro, and Pedersen 2024).
  • R Graphics Cookbook, 2nd edition: a practical guide that provides more than 150 recipes to help you generate high-quality graphs quickly, without having to comb through all the details of R’s graphing systems. Each recipe tackles a specific problem with a solution you can apply to your own project, and includes a discussion of how and why the recipe works (Chang 2024).

3.10 Glossary

term definition
APIx An API, or application programming interface, is a set of defined rules that enable different applications to communicate with each other. It acts as an intermediary layer that processes data transfers between systems, letting companies open their application data and functionality to external third-party developers, business partners, and internal departments within their companies. (<a href="https://www.ibm.com/topics/api">IBM</a>)
ATF Agency Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF): ATF’s responsibilities include the investigation and prevention of federal offenses involving the unlawful use, manufacture, and possession of firearms and explosives; acts of arson and bombings; and illegal trafficking of alcohol and tobacco products. The ATF also regulates, via licensing, the sale, possession, and transportation of firearms, ammunition, and explosives in interstate commerce. (<a href="https://www.atf.gov/about/what-we-do">ATF</a>)
Bar Charts Bar charts are visual displays of data often used to examine similarities and differences across categories of things; bars can represent frequencies, percentages, means, or other statistics. (SwR, Glossary)
Boxplots Boxplots are a visual representation of data that shows central tendency (usually the median) and spread (usually the interquartile range) of a numeric variable for one or more groups; boxplots are often used to compare the distribution of a continuous variable across several groups. (SwR, Glossary)
Color blindness Color blindness (also spelled colour blindness) or color vision deficiency (CVD) or includes a wide range of causes and conditions and is actually quite complex. It's a condition characterized by an inability or difficulty in perceiving and differentiating certain colors due to abnormalities in the three color-sensing pigments of the cones in the retina. (<a href="https://enchroma.com/pages/types-of-color-blindness">EnChroma</a>)
CVD Color vision deficiency (CVD) or color blindness (also spelled colour blindness) includes a wide range of causes and conditions and is actually quite complex. It's a condition characterized by an inability or difficulty in perceiving and differentiating certain colors due to abnormalities in the three color-sensing pigments of the cones in the retina. (<a href="https://enchroma.com/pages/types-of-color-blindness">EnChroma</a>)
Density Plots Density plots are used for examining the distribution of a variable measured along a continuum; density plots are similar to histograms but are smoothed and may not show existing gaps in data (SwR, Glossary)
Donut Charts Donut or doughnut charts (sometimes also called ring charts) are an alternative chart for pie charts, which have a hole in the middle, making them cleaner to read than pie charts. (<a href="https://r-charts.com/part-whole/donut-chart/">R Charts</a>)
Grouped Bar Chart A grouped bar chart is a data visualization that shows two categorical variables in a bar chart where one group is shown along the x-axis for vertical bars or y-axis for horizontal bars and the other grouping is shown as separate bars within each of the first grouping variable categories; the bars are often different colors to distinguish the groups. (SwR, Glossary)
Histograms Histograms are visual displays of data used to examine the distribution of a numeric variable. (SwR, Glossary)
Kernel Density Estimation Kernel density estimation (KDE) extrapolates data to an estimated population probability density function. It’s called kernel density estimation because each data point is replaced with a kernel—a weighting function to estimate the pdf. The function spreads the influence of any point around a narrow region surrounding the point. (<a href="https://www.statisticshowto.com/kernel-density-estimation/">Statistics How To</a>)
Line Graph A line graph is a visual display of data often used to examine the relationship between two continuous variables or for something measured over time. (SwR, Glossary)
Mosaic Plots Mosaic plots are visual representations of data to show the relationship between two categorical variables; useful primarily when both variables have few categories. (SwR, Glossary)
NHANES The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. (<a href="https://www.cdc.gov/nchs/nhanes/about_nhanes.htm">NHANES</a>)
Pie Charts Pie charts are used to show parts of a whole; pie charts get their name from looking like a pie with pieces representing different groups, and they are not recommended for most situations because they can be difficult to interpret
Point Charts Point charts are charts that show summary values for a numeric variable, typically across groups; for example, a point chart could be used in place of a bar graph to show mean or median across groups. (SwR, Glossary)
Probability Density Function A probability density function (PDF) tells us the probability that a random variable takes on a certain value. (<a href="https://www.statology.org/cdf-vs-pdf/">Statology</a>) The probability density function (PDF) for a given value of random variable X represents the density of probability (probability per unit random variable) within a particular range of that random variable X. Probability densities can take values larger than 1. ([StackExchange Mathematics](https://math.stackexchange.com/a/1464837/1215136)) We can use a continuous probability distribution to calculate the probability that a random variable lies within an interval of possible values. To do this, we use the continuous analogue of a sum, an integral. However, we recognise that calculating an integral is equivalent to calculating the area under a probability density curve. We use `p(value)` for probability densities and `P
Scatterplot A scatterplot is a graph that shows one dot for each observation in the data set (SwR, Glossary)
Stacked Bar Chart A stacked bar chart is a data visualization that shows parts of a whole in a bar chart format; this type of chart can be used to examine two categorical variables together by showing the categories of one variable as the bars and the categories of the other variable as different colors within each bar. (SwR, Glossary)
Statista Statista is a global data and business intelligence platform with an extensive collection of statistics, reports, and insights on over 80,000 topics from 22,500 sources in 170 industries. Established in Germany in 2007, Statista operates in 13 locations worldwide and employs around 1,100 professionals. (<a href="https://www.statista.com/aboutus/">statista</a>)
Violin Plots Violing plots are visual displays of data that combine features of density plots and boxplots to show the distribution of numeric variables, often across groups. (SwR, Glossary)
Waffle Charts Waffle Charts are visual displays of data that show the parts of a whole similar to a pie chart; waffle charts are generally preferred over pie charts. (SwR, Glossary)

Session Info

Session Info

Code
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       macOS Sequoia 15.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Vienna
#>  date     2024-11-13
#>  pandoc   3.5 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package         * version    date (UTC) lib source
#>  base64enc         0.1-3      2015-07-28 [2] CRAN (R 4.4.0)
#>  bitops            1.0-9      2024-10-03 [2] CRAN (R 4.4.1)
#>  caTools           1.18.3     2024-09-04 [2] CRAN (R 4.4.1)
#>  cli               3.6.3      2024-06-21 [2] CRAN (R 4.4.1)
#>  colorblindcheck   1.0.2      2023-05-13 [2] CRAN (R 4.4.0)
#>  colorblindr       0.1.0      2024-04-25 [2] Github (clauswilke/colorblindr@90d64f8)
#>  colorspace        2.1-1      2024-07-26 [2] CRAN (R 4.4.0)
#>  commonmark        1.9.2      2024-10-04 [2] CRAN (R 4.4.1)
#>  cowplot           1.1.3.9000 2024-04-25 [2] Github (wilkelab/cowplot@e1334a2)
#>  curl              6.0.0      2024-11-05 [2] CRAN (R 4.4.1)
#>  data.table        1.16.2     2024-10-10 [2] CRAN (R 4.4.1)
#>  dichromat         2.0-0.1    2022-05-02 [2] CRAN (R 4.4.0)
#>  digest            0.6.37     2024-08-19 [2] CRAN (R 4.4.1)
#>  dplyr             1.1.4      2023-11-17 [2] CRAN (R 4.4.0)
#>  DT                0.33       2024-04-04 [2] CRAN (R 4.4.0)
#>  evaluate          1.0.1      2024-10-10 [2] CRAN (R 4.4.1)
#>  extrafont         0.19       2023-01-18 [2] CRAN (R 4.4.0)
#>  extrafontdb       1.0        2012-06-11 [2] CRAN (R 4.4.0)
#>  fansi             1.0.6      2023-12-08 [2] CRAN (R 4.4.0)
#>  farver            2.1.2      2024-05-13 [2] CRAN (R 4.4.0)
#>  fastmap           1.2.0      2024-05-15 [2] CRAN (R 4.4.0)
#>  forcats           1.0.0      2023-01-29 [2] CRAN (R 4.4.0)
#>  generics          0.1.3      2022-07-05 [2] CRAN (R 4.4.0)
#>  ggmosaic        * 0.3.4      2024-04-25 [2] Github (haleyjeppson/ggmosaic@b8045be)
#>  ggokabeito        0.1.0      2021-10-18 [2] CRAN (R 4.4.0)
#>  ggplot2         * 3.5.1      2024-04-23 [2] CRAN (R 4.4.0)
#>  ggrepel           0.9.6      2024-09-07 [2] CRAN (R 4.4.1)
#>  glossary        * 1.0.0.9003 2024-08-05 [2] Github (debruine/glossary@05e4a61)
#>  glue              1.8.0      2024-09-30 [2] CRAN (R 4.4.1)
#>  gplots            3.2.0      2024-10-05 [2] CRAN (R 4.4.1)
#>  gridExtra         2.3        2017-09-09 [2] CRAN (R 4.4.0)
#>  gtable            0.3.6      2024-10-25 [2] CRAN (R 4.4.1)
#>  gtools            3.9.5      2023-11-20 [2] CRAN (R 4.4.0)
#>  here              1.0.1      2020-12-13 [2] CRAN (R 4.4.0)
#>  htmltools         0.5.8.1    2024-04-04 [2] CRAN (R 4.4.0)
#>  htmlwidgets       1.6.4      2023-12-06 [2] CRAN (R 4.4.0)
#>  httr              1.4.7      2023-08-15 [2] CRAN (R 4.4.0)
#>  janitor           2.2.0      2023-02-02 [2] CRAN (R 4.4.0)
#>  jsonlite          1.8.9      2024-09-20 [2] CRAN (R 4.4.1)
#>  kableExtra        1.4.0      2024-01-24 [2] CRAN (R 4.4.0)
#>  KernSmooth        2.23-24    2024-05-17 [2] CRAN (R 4.4.2)
#>  knitr             1.49       2024-11-08 [2] CRAN (R 4.4.1)
#>  labeling          0.4.3      2023-08-29 [2] CRAN (R 4.4.0)
#>  lattice           0.22-6     2024-03-20 [2] CRAN (R 4.4.2)
#>  lazyeval          0.2.2      2019-03-15 [2] CRAN (R 4.4.0)
#>  lifecycle         1.0.4      2023-11-07 [2] CRAN (R 4.4.0)
#>  lmtest            0.9-40     2022-03-21 [2] CRAN (R 4.4.0)
#>  lubridate         1.9.3      2023-09-27 [2] CRAN (R 4.4.0)
#>  magrittr          2.0.3      2022-03-30 [2] CRAN (R 4.4.0)
#>  markdown          1.13       2024-06-04 [2] CRAN (R 4.4.0)
#>  MASS              7.3-61     2024-06-13 [2] CRAN (R 4.4.2)
#>  Matrix            1.7-1      2024-10-18 [2] CRAN (R 4.4.2)
#>  mgcv              1.9-1      2023-12-21 [2] CRAN (R 4.4.2)
#>  microbenchmark    1.5.0      2024-09-04 [2] CRAN (R 4.4.1)
#>  munsell           0.5.1      2024-04-01 [2] CRAN (R 4.4.0)
#>  nlme              3.1-166    2024-08-14 [2] CRAN (R 4.4.2)
#>  paletteer       * 1.6.0      2024-01-21 [2] CRAN (R 4.4.0)
#>  patchwork         1.3.0      2024-09-16 [2] CRAN (R 4.4.1)
#>  pillar            1.9.0      2023-03-22 [2] CRAN (R 4.4.0)
#>  pkgconfig         2.0.3      2019-09-22 [2] CRAN (R 4.4.0)
#>  plotly            4.10.4     2024-01-13 [2] CRAN (R 4.4.0)
#>  plyr              1.8.9      2023-10-02 [2] CRAN (R 4.4.0)
#>  prismatic         1.1.2      2024-04-10 [2] CRAN (R 4.4.0)
#>  productplots      0.1.1      2016-07-02 [2] CRAN (R 4.4.0)
#>  purrr             1.0.2      2023-08-10 [2] CRAN (R 4.4.0)
#>  R6                2.5.1      2021-08-19 [2] CRAN (R 4.4.0)
#>  RColorBrewer      1.1-3      2022-04-03 [2] CRAN (R 4.4.0)
#>  Rcpp              1.0.13-1   2024-11-02 [1] CRAN (R 4.4.1)
#>  rematch2          2.1.2      2020-05-01 [2] CRAN (R 4.4.0)
#>  repr              1.1.7      2024-03-22 [2] CRAN (R 4.4.0)
#>  rlang             1.1.4      2024-06-04 [2] CRAN (R 4.4.0)
#>  rmarkdown         2.29       2024-11-04 [2] CRAN (R 4.4.1)
#>  rprojroot         2.0.4      2023-11-05 [2] CRAN (R 4.4.0)
#>  rstudioapi        0.17.1     2024-10-22 [2] CRAN (R 4.4.1)
#>  Rttf2pt1          1.3.12     2023-01-22 [2] CRAN (R 4.4.0)
#>  rversions         2.1.2      2022-08-31 [2] CRAN (R 4.4.0)
#>  scales            1.3.0      2023-11-28 [2] CRAN (R 4.4.0)
#>  sessioninfo       1.2.2      2021-12-06 [2] CRAN (R 4.4.0)
#>  skimr             2.1.5      2022-12-23 [2] CRAN (R 4.4.0)
#>  snakecase         0.11.1     2023-08-27 [2] CRAN (R 4.4.0)
#>  spacesXYZ         1.3-0      2024-01-23 [2] CRAN (R 4.4.0)
#>  stringi           1.8.4      2024-05-06 [2] CRAN (R 4.4.0)
#>  stringr           1.5.1      2023-11-14 [2] CRAN (R 4.4.0)
#>  svglite           2.1.3      2023-12-08 [2] CRAN (R 4.4.0)
#>  systemfonts       1.1.0      2024-05-15 [2] CRAN (R 4.4.0)
#>  tibble            3.2.1      2023-03-20 [2] CRAN (R 4.4.0)
#>  tidyr             1.3.1      2024-01-24 [2] CRAN (R 4.4.0)
#>  tidyselect        1.2.1      2024-03-11 [2] CRAN (R 4.4.0)
#>  timechange        0.3.0      2024-01-18 [2] CRAN (R 4.4.0)
#>  utf8              1.2.4      2023-10-22 [2] CRAN (R 4.4.0)
#>  vcd               1.4-13     2024-09-16 [2] CRAN (R 4.4.1)
#>  vctrs             0.6.5      2023-12-01 [2] CRAN (R 4.4.0)
#>  viridis           0.6.5      2024-01-29 [2] CRAN (R 4.4.0)
#>  viridisLite       0.4.2      2023-05-02 [2] CRAN (R 4.4.0)
#>  waffle            1.0.2      2023-09-30 [2] CRAN (R 4.4.0)
#>  withr             3.0.2      2024-10-28 [2] CRAN (R 4.4.1)
#>  xfun              0.49       2024-10-31 [2] CRAN (R 4.4.1)
#>  xml2              1.3.6      2023-12-04 [2] CRAN (R 4.4.0)
#>  yaml              2.3.10     2024-07-26 [2] CRAN (R 4.4.0)
#>  zoo               1.8-12     2023-04-13 [2] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-x86_64/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────