Appendix A: Packages used
A.1 autoplotly
Package Profile: {autoplotly}
Functionalities to automatically generate interactive visualizations for statistical results supported by {ggfortify}, such as time series, PCA, clustering and survival analysis, with plotly.js and {ggplot2} style. The generated visualizations can also be easily extended using {ggplot2} and {plotly} syntax while staying interactive.
A.2 BSDA
Package Profile: BSDA
Data sets for the book “Basic Statistics and Data Analysis” by Larry J. Kitchens (Kitchens 2002).
A.3 broom
Package Profile: broom
{broom} provides three verbs to make it convenient to interact with model objects:
- tidy() summarizes information about model components
- glance() reports information about the entire model
- augment() adds information about observations to a dataset
For a detailed introduction, please see Introduction to broom.
{broom} tidies 100+ models from popular modelling packages and almost all of the model objects in the stats package that comes with base R.
The vignette Available methods lists the available methods.
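A minimal sketch of the three verbs on a base R linear model (my own illustration, not from the package documentation):

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)     # one row per model term: estimate, std.error, p.value, ...
glance(fit)   # one row for the whole model: r.squared, AIC, ...
augment(fit)  # the original data plus fitted values and residuals
```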
A.4 car
Package Profile: car
Functions to Accompany J. Fox and S. Weisberg, An R Companion to Applied Regression, Third Edition, Sage, 2019. (Fox and Weisberg 2018)
An R Companion to Applied Regression is a broad introduction to the R statistical computing environment in the context of applied regression analysis. The book provides a step-by-step guide to using the free statistical software R, and emphasizes integrating statistical computing in R with the practice of data analysis. The R packages car and effects, written to facilitate the application and interpretation of regression analysis, are extensively covered in the book.
A.5 colorblindcheck
Package Profile: colorblindcheck
(There is no hexagon sticker available for {colorblindcheck}.)
Compare color palettes with simulations of color vision deficiencies - deuteranopia, protanopia, and tritanopia. It includes calculating distances between colors and creating summaries of differences between a color palette and simulations of color vision deficiencies.
Deciding if a color palette is colorblind friendly is a hard task. This cannot be done in an entirely automatic fashion, as the decision needs to be confirmed by visual judgments. The goal of {colorblindcheck} is to provide tools to decide if the selected color palette is colorblind friendly, including:
- palette_dist() - Calculation of the distances between the colors in the input palette and between the colors in simulations of the color vision deficiencies: deuteranopia, protanopia, and tritanopia.
- palette_plot() - Plotting of the original input palette and simulations of color vision deficiencies: deuteranopia, protanopia, and tritanopia.
- palette_check() - Creating summary statistics comparing the original input palette and simulations of color vision deficiencies: deuteranopia, protanopia, and tritanopia.
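A short sketch of these three functions applied to a palette that is known to be problematic (my own example):

```r
library(colorblindcheck)

pal <- rainbow(7)  # rainbow palettes are notoriously not colorblind friendly

palette_check(pal)  # summary statistics: normal vision vs. deu/pro/tri
palette_plot(pal)   # visual comparison of the original and simulated palettes
palette_dist(pal)   # pairwise color distances under each simulation
```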
A.6 colorblindr
Package Profile: colorblindr
(There is no hexagon sticker available for {colorblindr}.)
Provides a variety of functions that are helpful to simulate the effects of colorblindness in R figures. Complete figures can be modified to simulate the effects of various types of colorblindness. The resulting figures are standard grid objects and can be further manipulated or outputted as usual.
A.7 colorspace
Package Profile: colorspace
(There is no hexagon sticker available for {colorspace}.)
The colorspace package provides a broad toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in various kinds of visualizations.
At the core of the package there are various utilities for computing with color spaces (as the name of the package conveys). Thus, the package helps to map various three-dimensional representations of color to each other. A particularly important mapping is the one from the perceptually-based and device-independent color model HCL (Hue-Chroma-Luminance) to standard Red-Green-Blue (sRGB) which is the basis for color specifications in many systems based on the corresponding hex codes (e.g., in HTML but also in R). For completeness further standard color models are included as well in the package: polarLUV() (= HCL), LUV(), polarLAB(), LAB(), XYZ(), RGB(), sRGB(), HLS(), HSV().
The HCL space (= polar coordinates in CIELUV) is particularly useful for specifying individual colors and color palettes as its three axes match those of the human visual system very well: Hue (= type of color, dominant wavelength), chroma (= colorfulness), luminance (= brightness).
There is extensive documentation available. See also the website on HCL Color Space:
The hclwizard provides tools for manipulating and assessing colors and palettes based on the underlying colorspace software (available in R and Python). It leverages the HCL color space: a color model that is based on human color perception and thus makes it easy to choose good color palettes by varying three color properties: Hue (= type of color, dominant wavelength) - Chroma (= colorfulness) - Luminance (= brightness). As shown in the color swatches below each property can be varied while keeping the other two properties fixed.
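A brief sketch of constructing palettes directly in HCL space with {colorspace} (the palette names are examples from the package's standard set):

```r
library(colorspace)

q4 <- qualitative_hcl(4, palette = "Dark 3")  # vary hue; fix chroma and luminance
s9 <- sequential_hcl(9, palette = "Blues 3")  # vary luminance (and chroma)
d9 <- diverging_hcl(9, palette = "Blue-Red")  # two hues, balanced around neutral

# Compare the three palette types side by side
swatchplot(list(qualitative = q4, sequential = s9, diverging = d9))
```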
{colorspace}: My personal evaluation
This toolbox package is very important: all of the other color-palette-related packages use {colorspace} as a basis for their functionality.
A.8 cowplot
Package Profile: cowplot
The {cowplot} package provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots or mix plots with images. The package was originally written for internal use in the Wilke lab, hence the name (Claus O. Wilke’s plot package). It has also been used extensively in the book Fundamentals of Data Visualization.
There are several packages that can be used to align plots. The most widely used ones beside {cowplot} are {egg} and {patchwork} (see Section A.62). All these packages use slightly different approaches to plot alignment, and the respective approaches have different strengths and weaknesses. If you cannot achieve your desired result with one of these packages try another one.
Most importantly, while {egg} and {patchwork} align and arrange plots at the same time, {cowplot} aligns plots independently of how they are arranged. This makes it possible to align plots and then reproduce them separately, or even overlay them on top of each other.
The {cowplot} package now provides a set of complementary themes with different features. I now believe that there isn’t one single theme that works for all figures, and therefore I recommend that you always explicitly set a theme for every plot you make.
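A minimal sketch of aligning and labeling two plots with {cowplot} (my own example):

```r
library(ggplot2)
library(cowplot)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_half_open()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar() + theme_minimal_hgrid()

# Align the plots horizontally and label them for publication
plot_grid(p1, p2, labels = c("A", "B"), align = "h")
```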
A.9 cranlogs
Package Profile: cranlogs
API to the database of CRAN package downloads from the RStudio CRAN mirror. The database itself is at http://cranlogs.r-pkg.org, see https://github.com/r-hub/cranlogs.app for the raw API.
RStudio publishes the download logs from their CRAN package mirror daily at http://cran-logs.rstudio.com.
This R package queries a web API maintained by R-hub that contains the daily download numbers for each package.
The RStudio CRAN mirror is not the only CRAN mirror, but it’s a popular one: it’s the default choice for RStudio users. The actual number of downloads over all CRAN mirrors is unknown.
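A short sketch of querying the download counts (assuming an active network connection):

```r
library(cranlogs)

# Daily downloads of {ggplot2} from the RStudio mirror during the last week
cran_downloads(packages = "ggplot2", when = "last-week")

# Downloads of {data.table} over a fixed period
cran_downloads(packages = "data.table", from = "2024-01-01", to = "2024-01-31")
```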
A.10 crosstable
Package Profile: crosstable
Create descriptive tables for continuous and categorical variables. Apply summary statistics and counting function, with or without a grouping variable, and create beautiful reports using {rmarkdown} or {officer}. You can also compute effect sizes and statistical tests if needed.
{crosstable}: Personal evaluation
I believe that the main usage of this package is to prepare ready-to-print tables. Similar to {gtsummary} (see Section A.37), it provides some descriptive statistics with many display options. But I got the impression that analysis of data is not the main usage of these packages.
For instance you could use crosstable::display_test(chisq.test(x, y)) to get a string as result, for instance: “p value: <0.0001 (Pearson’s Chi-squared test)”. This is nice to include in a table, but for the analysis one would also need the values of the different cells.
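A minimal sketch of a descriptive table with {crosstable} (my own example; iris stands in for real data):

```r
library(crosstable)

# Describe two continuous variables by group, with statistical tests
ct <- crosstable(iris, c(Sepal.Length, Sepal.Width), by = Species, test = TRUE)
as_flextable(ct)  # render as a ready-to-print table
```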
A.11 curl
Package Profile: curl
(There is no hexagon sticker available for {curl}.)
The curl()
and curl_download(
) functions provide highly configurable drop-in replacements for base url()
and download.file()
with better performance, support for encryption (https, ftps), gzip compression, authentication, and other ‘libcurl’ goodies.
The core of the package implements a framework for performing fully customized requests where data can be processed either in memory, on disk, or streaming via the callback or connection interfaces. Some knowledge of ‘libcurl’ is recommended; for a more-user-friendly web client see the ‘httr’ package which builds on this package with http specific tools and logic.
A.12 data.table
Package Profile: data.table
An extension of data.frame. (T. Barrett et al. 2024)
{data.table} provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.
Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.
Features
- fast and friendly delimited file reader: data.table::fread(), see also convenience features for small data
- fast and feature-rich delimited file writer: data.table::fwrite()
- low-level parallelism: many common operations are internally parallelized to use multiple CPU threads
- fast and scalable aggregations; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
- fast and feature-rich joins: ordered joins (e.g. rolling forwards, backwards, nearest and limited staleness), overlapping range joins (similar to IRanges::findOverlaps), non-equi joins (i.e. joins using operators >, >=, <, <=), aggregate on join (by=.EACHI), update on join
- fast add/update/delete of columns by reference by group using no copies at all
- fast and feature-rich data reshaping: data.table::dcast() (pivot/wider/spread) and data.table::melt() (unpivot/longer/gather)
- any R function from any R package can be used in queries, not just the subset of functions made available by a database backend; columns of type list are supported as well
- has no dependencies at all other than base R itself, for simpler production/maintenance
- the R dependency is as old as possible for as long as possible, dated April 2014, and we continuously test against that version; e.g. v1.11.0 released on 5 May 2018 bumped the dependency up from 5 year old R 3.0.0 to 4 year old R 3.1.0
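A tiny sketch of the dt[i, j, by] syntax (my own example):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Aggregate by group: mean mpg and row count per number of cylinders
dt[, .(mean_mpg = mean(mpg), n = .N), by = cyl]

# Add a column by reference: no copy of the table is made
dt[, kpl := mpg * 0.4251]
```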
{data.table}: Personal evaluation
I believe the most important application of {data.table} is working with huge amounts of data (several GB). In the book SwR it is used in the first chapter with the data.table::fread() function. I have used readr::read_csv() from the {tidyverse} collection there instead, because the dataset is very small (29 kB).
With {DT} there is a similar-sounding package that also seems important. It is a wrapper of the JavaScript library ‘DataTables’ (see Section A.19). I was already using {DT} to display interactive tables on websites, but at the time I didn’t completely understand the difference between {data.table} and {DT}. As far as I understand it now, the differences are:
- {data.table}: A package for efficient data manipulation and analysis, focusing on speed, memory efficiency, and flexibility. It provides a powerful data structure for handling large datasets.
- {DT} (datatable): A package for rendering R data frames as interactive HTML tables, focusing on visualization and user interaction. It provides a simple way to create web-based tables with filtering, sorting, and editing capabilities.
A.13 datawizard
Package Profile: datawizard
{datawizard} covers two aspects of data preparation:
- Data manipulation: datawizard offers a very similar set of functions to that of the tidyverse packages, such as {dplyr} and {tidyr}, to select, filter and reshape data, with a few key differences.
  - All data manipulation functions start with the prefix data_* (which makes them easy to identify).
  - Although most functions can be used exactly as their tidyverse equivalents, they are also string-friendly (which makes them easy to program with and use inside functions).
  - Finally, datawizard is super lightweight (no dependencies, similar to {poorman}), which makes it awesome for developers to use in their packages.
- Statistical transformations: {datawizard} also has powerful functions to easily apply common data transformations, including standardization, normalization, rescaling, rank-transformation, scale reversing, recoding, binning, etc.
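A small sketch of both aspects (my own example; note the string-friendly column selection):

```r
library(datawizard)

# Data manipulation: string-friendly analogues of the dplyr verbs
mtcars |>
  data_filter(cyl == 4) |>
  data_select(select = c("mpg", "wt"))

# Statistical transformations
standardize(mtcars$mpg)            # z-scores
rescale(mtcars$mpg, to = c(0, 1))  # rescale to the unit interval
```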
A.14 descr
Package Profile: descr
(There is no hexagon sticker available for {descr}.)
Weighted frequency and contingency tables of categorical variables and of the comparison of the mean value of a numerical variable by the levels of a factor, and methods to produce xtable objects of the tables and to plot them. There are also functions to facilitate the character encoding conversion of objects, to quickly convert fixed width files into csv ones, and to export a data.frame to a text file with the necessary R and SPSS codes to reread the data. (Enzmann et al. 2023)
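A brief sketch of the two table types (my own example; crosstab() mimics SPSS-style CROSSTABS output):

```r
library(descr)

freq(as.factor(mtcars$cyl))                        # frequency table with plot
crosstab(mtcars$cyl, mtcars$gear, prop.r = TRUE)   # contingency table with row %
```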
A.15 DescTools
Package Profile: DescTools
(There is no hexagon sticker available for {DescTools}.)
A collection of miscellaneous basic statistic functions and convenience wrappers for efficiently describing data. The author’s intention was to create a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results.
Furthermore, the package contains functions to produce documents using MS Word (or PowerPoint) and functions to import data from Excel. Many of the included functions can be found scattered in other packages and other sources, written partly by Titans of R. The reason for collecting them here was primarily to have them consolidated in ONE instead of dozens of packages (which themselves might depend on other packages which are not needed at all), and to provide a common and consistent interface as far as function and argument naming, NA handling, recycling rules etc. are concerned. Google style guides were used as naming rules (in the absence of convincing alternatives). The ‘BigCamelCase’ style was consequently applied to functions borrowed from contributed R packages as well.
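A one-line sketch of the descriptive workhorse of the package (my own example):

```r
library(DescTools)

Desc(mtcars$mpg)  # descriptive statistics plus a graphical summary
```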
A.16 dfidx
Package Profile: dfidx
(There is no hexagon sticker available for {dfidx}.)
Provides extended data frames, with a special data frame column which contains two indexes, with potentially a nesting structure.
A.17 dichromat
Package Profile: dichromat
(There is no hexagon sticker available for {dichromat}.)
Collapse red-green or green-blue distinctions to simulate the effects of different types of color-blindness.
A.18 dplyr
Package Profile: dplyr
{dplyr} is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
These all combine naturally with group_by(), which allows you to perform any operation “by group”. You can learn more about them in vignette(“dplyr”). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette(“two-table”). (Wickham et al. 2023)
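A small sketch combining the five verbs with group_by() (my own example):

```r
library(dplyr)

mtcars |>
  filter(mpg > 15) |>                 # pick cases by value
  select(mpg, cyl, wt) |>             # pick variables by name
  mutate(kpl = mpg * 0.4251) |>       # add a derived variable
  group_by(cyl) |>                    # operations below happen per group
  summarise(mean_kpl = mean(kpl)) |>  # collapse to one row per group
  arrange(desc(mean_kpl))             # order the result
```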
A.19 DT
Package Profile: DT
(There is no hexagon sticker available for {DT}.)
Data objects in R can be rendered as HTML tables using the JavaScript library DataTables (typically via {R Markdown} or {Shiny}). The ‘DataTables’ library has been included in this R package. The package name {DT} is an abbreviation of ‘DataTables’.
A.20 dunn.test
Package Profile: dunn.test
(There is no hexagon sticker available for {dunn.test}.)
Computes Dunn’s test (Dunn 1964) for stochastic dominance and reports the results among multiple pairwise comparisons after a Kruskal-Wallis test for stochastic dominance among k groups (Kruskal and Wallis 1952). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other.
{dunn.test} makes k(k-1)/2 multiple pairwise comparisons based on Dunn’s z-test-statistic approximations to the actual rank statistics. The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group that is larger than a randomly selected value from the second group equals one half; this null hypothesis corresponds to that of the Wilcoxon-Mann-Whitney rank-sum test. Like the rank-sum test, if the data can be assumed to be continuous, and the distributions are assumed identical except for a difference in location, Dunn’s test may be understood as a test for median difference. {dunn.test} accounts for tied ranks.
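A minimal sketch (the airquality data is my own choice; missing values are ignored by the function):

```r
library(dunn.test)

# Kruskal-Wallis test followed by Dunn's k(k-1)/2 pairwise comparisons,
# here with Bonferroni adjustment of the p-values
dunn.test(x = airquality$Ozone, g = airquality$Month, method = "bonferroni")
```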
A.21 e1071
Package Profile: e1071
(There is no hexagon sticker available for {e1071}.)
Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, generalized k-nearest neighbor …
A.22 effectsize
Package Profile: effectsize
Provide utilities to work with indices of effect size for a wide variety of models and hypothesis tests (see list of supported models using the function ‘insight::supported_models()’), allowing computation of and conversion between indices such as Cohen’s d, r, odds, etc.
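Two short examples of computing effect sizes (adapted in spirit from the package documentation):

```r
library(effectsize)

# Cohen's d for a two-group comparison
cohens_d(mpg ~ am, data = mtcars)

# Eta squared for an ANOVA model
m <- aov(mpg ~ factor(cyl), data = mtcars)
eta_squared(m)
```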
A.23 fmsb
Package Profile: fmsb
(There is no hexagon sticker available for {fmsb}.)
Several utility functions for the book entitled “Practices of Medical and Health Data Analysis using R” (Pearson Education Japan, 2007) with Japanese demographic data and some demographic analysis related functions.
A.24 forcats
Package Profile: forcats
{forcats} provides a suite of tools that solve common problems with factors, including:
- reordering factor levels
  - moving specified levels to front,
  - ordering by first appearance,
  - reversing, and
  - randomly shuffling
- tools for modifying factor levels
  - collapsing rare levels into other,
  - ‘anonymizing’, and
  - manually ‘recoding’
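A tiny sketch of some of these helpers (my own example):

```r
library(forcats)

f <- factor(c("b", "b", "a", "c", "c", "c"))

fct_infreq(f)         # reorder levels by frequency
fct_rev(f)            # reverse the level order
fct_lump_n(f, n = 1)  # collapse all but the most frequent level into "Other"
```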
A.25 GGally
Package Profile: GGally
(There is no hexagon sticker available for {GGally}.)
The R package {ggplot2} is a plotting system based on the grammar of graphics. {GGally} extends {ggplot2} by adding several functions to reduce the complexity of combining geometric objects with transformed data. Some of these functions include
- a pairwise plot matrix,
- a two group pairwise plot matrix,
- a parallel coordinates plot,
- a survival plot,
- and several functions to plot networks.
A.26 ggfortify
Package Profile: ggfortify
(There is no hexagon sticker available for {ggfortify}.)
Unified plotting tools for statistics commonly used, such as GLM, time series, PCA families, clustering and survival analysis. The package offers a single plotting interface for these analysis results and plots in a unified style using {ggplot2}.
This package offers the fortify() and autoplot() functions to allow {ggplot2} to automatically visualize the statistical results of popular R packages. Check out our R Journal paper for more details on the overall architecture design and a gallery of visualizations created with this package. Also check out the autoplotly package, which can automatically generate interactive visualizations in plotly.js style based on ggfortify. The generated visualizations can also be easily extended using ggplot2 syntax while staying interactive.
A.27 ggmosaic
Package Profile: ggmosaic
{ggmosaic} allows various features to be customized:
- the order of the variables,
- the formula setup of the plot,
- faceting,
- the type of partition, and
- the space between the categories.
A.28 ggokabeito
Package Profile: ggokabeito
(There is no hexagon sticker available for {ggokabeito}.)
Discrete scales for the colorblind-friendly Okabe-Ito palette, including ‘color’, ‘fill’, and ‘edge_colour’. {ggokabeito} provides {ggplot2} and {ggraph} scales to easily use the discrete, colorblind-friendly ‘Okabe-Ito’ palette in your data visualizations.
Currently, {ggokabeito} provides the following scales:
- scale_color_okabe_ito() / scale_colour_okabe_ito()
- scale_fill_okabe_ito()
- scale_edge_color_okabe_ito() / scale_edge_colour_okabe_ito()
A.29 ggplot2
Package Profile: ggplot2
{ggplot2} is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell {ggplot2} how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. (Wickham 2016)
It’s hard to succinctly describe how {ggplot2} works because it embodies a deep philosophy of visualization. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
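A compact sketch mirroring that description (my own example using the built-in mpg data):

```r
library(ggplot2)

ggplot(mpg, aes(displ, hwy, colour = class)) +  # data + aesthetic mapping
  geom_point() +                                # a layer
  scale_colour_brewer(palette = "Dark2") +      # a scale
  facet_wrap(~ cyl) +                           # a faceting specification
  coord_flip()                                  # a coordinate system
```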
A.30 gplots
Package Profile: gplots
(There is no hexagon sticker available for {gplots}.)
Various R programming tools for plotting data, including:
- calculating and plotting locally smoothed summary functions (‘bandplot’, ‘wapply’),
- enhanced versions of standard plots (‘barplot2’, ‘boxplot2’, ‘heatmap.2’, ‘smartlegend’),
- manipulating colors (‘col2hex’, ‘colorpanel’, ‘redgreen’, ‘greenred’, ‘bluered’, ‘redblue’, ‘rich.colors’),
- calculating and plotting two-dimensional data summaries (‘ci2d’, ‘hist2d’),
- enhanced regression diagnostic plots (‘lmplot2’, ‘residplot’),
- formula-enabled interface to ‘stats::lowess’ function (‘lowess’),
- displaying textual data in plots (‘textplot’, ‘sinkplot’),
- plotting a matrix where each cell contains a dot whose size reflects the relative magnitude of the elements (‘balloonplot’),
- plotting “Venn” diagrams (‘venn’),
- displaying Open-Office style plots (‘ooplot’),
- plotting multiple data on same region, with separate axes (‘overplot’),
- plotting means and confidence intervals (‘plotCI’, ‘plotmeans’),
- spacing points in an x-y plot so they don’t overlap (‘space’).
A.31 gridExtra
Package Profile: gridExtra
(There is no hexagon sticker available for {gridExtra}.)
Provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
The {grid} package (part of the R system library) provides low-level functions to create graphical objects (grobs), and position them on a page in specific viewports. The {gtable} package introduced a higher-level layout scheme, arguably more amenable to user-level interaction. With the gridExtra::arrangeGrob() / gridExtra::grid.arrange() pair of functions, {gridExtra} builds upon {gtable} to arrange multiple grobs on a page.
A.32 ggrepel
Package Profile: ggrepel
{ggrepel} provides two geoms for {ggplot2} to repel overlapping text labels: geom_text_repel() and geom_label_repel().
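A minimal sketch (my own example):

```r
library(ggplot2)
library(ggrepel)

ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
  geom_point() +
  geom_text_repel(max.overlaps = 15)  # labels repel each other and the points
```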
A.33 ggtext
Package Profile: ggtext
(There is no hexagon sticker available for {ggtext}.)
The ggtext package provides simple Markdown and HTML rendering for {ggplot2}. Under the hood, the package uses the {gridtext} package for the actual rendering, and consequently it is limited to the feature set provided by gridtext.
Support is provided for Markdown both in theme elements (plot titles, subtitles, captions, axis labels, legends, etc.) and in geoms (similar to ggplot2::geom_text()). In both cases, there are two alternatives, one for creating simple text labels and one for creating text boxes with word wrapping.
A.34 glue
Package Profile: glue
Glue offers interpreted string literals that are small, fast, and dependency-free. Glue does this by embedding R expressions in curly braces which are then evaluated and inserted into the argument string.
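A quick sketch of embedding expressions in braces (my own example):

```r
library(glue)

name <- "R"
glue("Welcome to {name}!")

# Arbitrary R expressions are evaluated inside the braces
glue("The mean mpg is {round(mean(mtcars$mpg), 1)}.")
```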
A.35 gssr
Package Profile: gssr
(There is no hexagon sticker available for {gssr}.)
GSSR Package: The General Social Survey Cumulative Data (1972-2022) and Panel Data files packaged for easy use in R. {gssr} is a data package, developed and maintained by Kieran Healy, the author of Data Visualization. The package bundles several datasets into a convenient format. Because of its large size {gssr} is not hosted on CRAN but as a GitHub repository.
Instead of browsing and examining the complex dataset with the GSS Data Explorer or downloading datasets directly from the National Opinion Research Center (NORC), you can now just work inside R. The current package version 0.4 (see: gssr Update) provides the GSS Cumulative Data File (1972-2022), three GSS Three Wave Panel Data Files (for panels beginning in 2006, 2008, and 2010, respectively), and the 2020 panel file.
Version 0.4 also integrates survey code book information about variables directly into R’s help system, allowing them to be accessed via the help browser or from the console with ?, as if they were functions or other documented objects.
A.36 gt
Package Profile: gt
With the {gt} package, anyone can make wonderful-looking tables using the R programming language. The gt philosophy: we can construct a wide variety of useful tables with a cohesive set of table parts. These include the table header, the stub, the column labels and spanner column labels, the table body, and the table footer.
It all begins with table data (be it a tibble or a data frame). You then decide how to compose your {gt} table with the elements and formatting you need for the task at hand. Finally, the table is rendered by printing it at the console, including it in an R Markdown document, or exporting it to a file using gtsave(). Currently, {gt} supports the HTML, LaTeX, and RTF output formats.
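A small sketch composing a table from its parts (my own example):

```r
library(gt)

head(mtcars[, 1:4]) |>
  gt(rownames_to_stub = TRUE) |>                  # table body plus stub
  tab_header(title = "Motor Trend road tests") |> # table header
  fmt_number(columns = mpg, decimals = 1)         # format a column
```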
A.37 gtsummary
Package Profile: gtsummary
Creates presentation-ready tables summarizing data sets, regression models, and more. The code to create the tables is concise and highly customizable. Data frames can be summarized with any function, e.g. mean(), median(), even user-written functions. Regression models are summarized and include the reference rows for categorical variables. Common regression models, such as logistic regression and Cox proportional hazards regression, are automatically identified and the tables are pre-filled with appropriate column headers.
- Summarize data frames or tibbles easily in R. Perfect for creating a Table 1.
- Summarize regression models in R and include reference rows for categorical variables.
- Customize {gtsummary} tables using a growing list of formatting/styling functions.
- Report statistics inline from summary tables and regression summary tables in R markdown. Make your reports completely reproducible!
By leveraging {broom}, {gt}, and {labelled} packages, {gtsummary} creates beautifully formatted, ready-to-share summary and result tables in a single line of R code!
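A one-liner sketch of a typical “Table 1” (adapting the trial data that ships with the package):

```r
library(gtsummary)

# Summary statistics by treatment group, with p-values added
trial |>
  tbl_summary(by = trt, include = c(age, grade)) |>
  add_p()
```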
A.38 haven
Package Profile: haven
- It creates tibble::tibble()s, which have a better print method for very long and very wide files.
- Dates and times are converted to R date/time classes.
- Character vectors are not converted to factors.
- Value labels are translated into a new haven::labelled() class, which preserves the original semantics and can easily be coerced to factors with haven::as_factor(). Special missing values are preserved. See details in the vignette Conversion semantics.
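A short sketch of the typical import workflow (the file path is hypothetical):

```r
library(haven)

# Hypothetical SPSS file; value labels become haven::labelled vectors
dat <- read_sav("data/survey.sav")
dat <- as_factor(dat)  # coerce the labelled columns to factors
```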
{haven}: Personal Evaluation
In this book I am especially interested in the fourth feature.
A.39 Hmisc
Package Profile: Hmisc
(There is no hexagon sticker available for {Hmisc}.)
The {Hmisc} package takes its name from Frank Harrell Jr. It contains many functions useful for
- data analysis,
- high-level graphics,
- utility operations,
- computing sample size and power,
- simulation,
- importing and annotating datasets,
- imputing missing values,
- advanced table making,
- variable clustering,
- character string manipulation,
- conversion of R objects to \(\LaTeX\) and HTML code,
- recoding variables,
- caching,
- simplified parallel computing,
- encrypting and decrypting data using a safe workflow,
- general moving window statistical estimation,
- assistance in interpreting principal component analysis (Harrell Jr 2023)
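A one-line sketch of one of its best-known functions (my own example):

```r
library(Hmisc)

describe(mtcars[, c("mpg", "cyl", "am")])  # rich descriptive summary per variable
```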
{Hmisc}: Personal Evaluation
This is a big variety of functions. In contrast to other packages that are specifically directed at solving one problem, {Hmisc} seems to be an all-in-one solution.
To learn more I should visit Frank E. Harrell’s Jr Hmisc start page. Especially his online book R Workflow for Reproducible Data Analysis and Reporting seems very interesting to me!
A.40 httr2
Package Profile: httr2
{httr2} (pronounced hitter2) is a ground-up rewrite of {httr} that provides a pipeable API with an explicit request object that solves more problems felt by packages that wrap APIs (e.g. built-in rate-limiting, retries, OAuth, secure secrets, and more). {httr2} is designed to map closely to the underlying HTTP protocol. For more details, read An overview of HTTP from MDN.
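A short sketch of the pipeable request API (using the public httpbin.org test service as an example endpoint):

```r
library(httr2)

resp <- request("https://httpbin.org/json") |>
  req_retry(max_tries = 3) |>  # built-in retries
  req_perform()

resp_body_json(resp)           # parse the JSON body
```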
A.41 janitor
Package Profile: janitor
{janitor} has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can perform many of these tasks already, but with janitor they can do it faster and save their thinking for the fun stuff.
The main janitor functions:
- perfectly format data.frame column names;
- create and format frequency tables of one, two, or three variables - think an improved base::table(); and
- provide other tools for cleaning and examining data.frames.
The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
{janitor} is a {tidyverse}-oriented package. Specifically, it plays nicely with the %>% pipe and is optimized for cleaning data brought in with the {readr} and {readxl} packages.
{janitor}: Personal Evaluation
I am using {janitor} mostly in two ways:
- as a better base::table() function, using janitor::tabyl(), because:
  - base::table() doesn’t accept data.frames and is therefore not compatible with the pipe
  - base::table() doesn’t output data.frames
  - base::table() results are hard to format (the most annoying “feature” for me)
- to add information and formatting to the table with the janitor::adorn_* functions
You could also use {tidyverse} commands (for instance, for a two-way table, dplyr::count() followed by tidyr::pivot_wider()), but the many adorn_* functions make it easy to enhance the results. BTW: The prefix adorn comes from ‘adornment’ (ornament, decoration).
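A small sketch of this tabyl/adorn workflow (my own example):

```r
library(janitor)

mtcars |>
  tabyl(cyl, gear) |>                  # a pipeable, data.frame-based table()
  adorn_totals(where = "row") |>       # add a totals row
  adorn_percentages("row") |>          # cell values as row percentages
  adorn_pct_formatting(digits = 1) |>  # format as "xx.x%"
  adorn_ns()                           # show counts next to the percentages
```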
A.42 kableExtra
Package Profile: kableExtra
Build complex HTML or \(\LaTeX\) tables using kable() from {knitr} and the piping syntax from {magrittr}. The function kable() is a lightweight table generator coming from {knitr}. This package simplifies the way to manipulate the HTML or \(\LaTeX\) code generated by kable() and allows users to construct complex tables and customize styles using a readable syntax.
{kableExtra} is NOT a table generating package. It is a package that can “add features” to a kable() output using a syntax that every useR loves - the pipe %>%. We see similar approaches to deal with plots in packages like ggvis and plotly. There is no reason why we cannot use it with tables.
Most functionalities in {kableExtra} can work in both HTML and PDF. In fact, as long as you specify the format in kable() (which can be set globally through the option knitr.table.format), functions in this package will pick the right way to manipulate the table by themselves. As a result, if users want to left align the table, kable(...) %>% kable_styling(position = "left") will work in both HTML and PDF. Recently, we also introduced a new kbl() function acting as an alternative to kable() but providing better documentation and format detection.
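A brief sketch of piping a kable through styling functions (my own example):

```r
library(kableExtra)

mtcars[1:5, 1:4] |>
  kbl(caption = "First rows of mtcars") |>
  kable_styling(position = "left", full_width = FALSE) |>
  add_header_above(c(" " = 1, "Engine" = 2, "Other" = 2))
```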
A.43 knitr
Package Profile: knitr
(There is no hexagon sticker available for {knitr}.)
Provides a general-purpose tool for dynamic report generation in R using Literate Programming techniques.
A.44 labelled
Package Profile: labelled
See details in the vignette Introduction to labelled and the GitHub website for labelled. There are other vignettes as well and a cheat sheet as PDF for download.
A.45 lmtest
Package Profile: lmtest
(There is no hexagon sticker available for {lmtest}.)
A collection of tests, data sets, and examples for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided.
A.46 lsr
Package Profile: lsr
(There is no hexagon sticker available for {lsr}.)
A collection of tools intended to make introductory statistics easier to teach, including wrappers for common hypothesis tests and basic data manipulation. It accompanies Navarro, D. J. (2015). Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners, Version 0.6.
A.47 MASS
Package Profile: MASS
(There is no hexagon sticker available for {MASS}.)
Functions and datasets to support Venables and Ripley, Modern Applied Statistics with S (Venables and Ripley 2002c)
A.48 mlogit
Package Profile: mlogit
(There is no hexagon sticker available for {mlogit}.)
Maximum likelihood estimation of random utility discrete choice models. The software is described in Croissant (2020b) and the underlying methods in Train (2009).
A.49 modeest
Package Profile: modeest
(There is no hexagon sticker available for {modeest}.)
The {modeest} package provides estimators of the mode of univariate unimodal (and sometimes multimodal) data and values of the modes of usual probability distributions.
{modeest} is a package specialized in mode estimation. It implements many different mode estimators reported in scientific articles. There is a long list of references on different methods of mode estimation.
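A tiny sketch (my own example):

```r
library(modeest)

x <- c(2, 2, 3, 3, 3, 4, 7)
mlv(x, method = "mfv")  # mode estimate: the most frequent value
```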
A.50 moments
Package Profile: moments
(There is no hexagon sticker available for {moments}.)
Functions to calculate: moments, Pearson’s kurtosis, Geary’s kurtosis and skewness; tests related to them.
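A tiny sketch (my own example with a deliberately skewed sample):

```r
library(moments)

set.seed(1)
x <- rexp(500)  # a right-skewed sample

skewness(x)
kurtosis(x)     # Pearson's kurtosis
jarque.test(x)  # normality test based on skewness and kurtosis
```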
A.51 misty
Package Profile: misty
(There is no hexagon sticker available for {misty}.)
Miscellaneous functions for
- data management (e.g., grand-mean and group-mean centering, coding variables and reverse coding items, scale and cluster scores, reading and writing Excel and SPSS files),
- descriptive statistics (e.g., frequency table, cross tabulation, effect size measures),
- missing data (e.g., descriptive statistics for missing data, missing data pattern, Little’s test of Missing Completely at Random, and auxiliary variable analysis),
- multilevel data (e.g., multilevel descriptive statistics, within-group and between-group correlation matrix, multilevel confirmatory factor analysis, level-specific fit indices, cross-level measurement equivalence evaluation, multilevel composite reliability, and multilevel R-squared measures),
- item analysis (e.g., confirmatory factor analysis, coefficient alpha and omega, between-group and longitudinal measurement equivalence evaluation), and
- statistical analysis (e.g., confidence intervals, collinearity and residual diagnostics, dominance analysis, between- and within-subject analysis of variance, latent class analysis, t-test, z-test, sample size determination).
A.52 naniar
Package Profile: naniar
Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. {naniar} provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of {ggplot2} and tidy data. The work is fully discussed in Tierney & Cook (2023b).
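A short sketch of exploring missingness (my own example; airquality contains NAs):

```r
library(naniar)
library(ggplot2)

miss_var_summary(airquality)  # missingness per variable
vis_miss(airquality)          # overview plot of missing values

ggplot(airquality, aes(Ozone, Solar.R)) +
  geom_miss_point()           # missing values made visible in a scatterplot
```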
A.53 nhanesA
Package Profile: nhanesA
(There is no hexagon sticker available for {nhanesA}.)
{nhanesA} is an R package for browsing and retrieving data from the National Health And Nutrition Examination Survey NHANES website. This package is designed to be useful for research and instructional purposes.
The functions in the {nhanesA} package allow for fully customizable selection and import of data directly from the NHANES website; thus it is essential to have an active network connection.
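A minimal sketch of browsing and retrieving a table (requires a network connection; DEMO_J is the 2017-2018 demographics file):

```r
library(nhanesA)

nhanesTables(data_group = "DEMO", year = 2017)  # list demographics tables
demo_j <- nhanes("DEMO_J")                      # import one of them
```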
{nhanesA}: Personal Evaluation
There are other similar packages available, but they are more restricted, as data newer than 2014 can’t be downloaded:
- {NHANES}: for the years 2009-2012
- {RNHANES}: for the years 1999-2014
See my other reflections on packages for downloading NHANES data in Important 1.1 and Section 3.3.3.
A.54 nhstplot
Package Profile: nhstplot
(There is no hexagon sticker available for {nhstplot}.)
Illustrate graphically the most common Null Hypothesis Significance Testing procedures. More specifically, this package provides functions to plot
- Chi-Squared,
- F,
- t (one- and two-tailed) and
- z (one- and two-tailed) tests,
by plotting the probability density under the null hypothesis as a function of the different test statistic values.
Although highly flexible (color theme, fonts, etc.), only the minimal number of arguments (observed test statistic, degrees of freedom) is necessary for a clear and useful graph to be plotted, with the observed test statistic and the p value, as well as their corresponding value labels. The axes are automatically scaled to present the relevant part and the overall shape of the probability density function. This package is especially intended for education purposes, as it provides helpful support for explaining the Null Hypothesis Significance Testing process, its use and/or shortcomings.
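Two short sketches (my own examples):

```r
library(nhstplot)

plotztest(z = 1.96, tails = "two")  # two-tailed z test
plotchisqtest(chisq = 8.2, df = 3)  # chi-squared test
```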
A.55 nnet
Package Profile: nnet
(There is no hexagon sticker available for {nnet}.)
Software for feed-forward neural networks with a single hidden layer, and for multinomial log-linear models.
A.56 nortest
Package Profile: nortest
(There is no hexagon sticker available for {nortest}.)
Five omnibus tests for testing the composite hypothesis of normality.
A.57 odds.n.ends
Package Profile: odds.n.ends
Computes odds ratios and 95% confidence intervals from a generalized linear model object. It also computes model significance with the chi-squared statistic and p-value and it computes model fit using a contingency table to determine the percent of observations for which the model correctly predicts the value of the outcome. Calculates model sensitivity and specificity.
{odds.n.ends} was created in order to take the results from a binary logistic regression model estimated using the glm()
function and compute model significance, model fit, and the odds ratios and 95% confidence intervals typically reported from binary logistic regression analyses.
A.58 onewaytests
Package Profile: onewaytests
(There is no hexagon sticker available for {onewaytests}.)
Performs one-way tests in independent groups designs including homoscedastic and heteroscedastic tests. These are
- one-way analysis of variance (ANOVA),
- Welch’s heteroscedastic F test,
- Welch’s heteroscedastic F test with trimmed means and Winsorized variances,
- Brown-Forsythe test,
- Alexander-Govern test,
- James second order test,
- Kruskal-Wallis test,
- Scott-Smith test,
- Box F test,
- Johansen F test,
- Generalized tests equivalent to Parametric Bootstrap and Fiducial tests,
- Alvandi’s F test,
- Alvandi’s generalized p-value,
- approximate F test,
- B square test,
- Cochran test,
- Weerahandi’s generalized F test,
- modified Brown-Forsythe test,
- adjusted Welch’s heteroscedastic F test,
- Welch-Aspin test,
- Permutation F test.
The package performs pairwise comparisons and graphical approaches.
Also, the package includes
- Student’s t test,
- Welch’s t test and
- Mann-Whitney U test for two samples.
Moreover, it assesses variance homogeneity and normality of data in each group via tests and plots (Dag et al., 2018, https://journal.r-project.org/archive/2018/RJ-2018-022/RJ-2018-022.pdf).
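A minimal sketch of one of these tests (my own example; the package uses a formula interface):

```r
library(onewaytests)

# Welch's heteroscedastic F test for a one-way design
welch.test(Sepal.Length ~ Species, data = iris)
```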
A.59 openintro
Package Profile: openintro
Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of the Windows operating system.
A.60 ordinal
Package Profile: ordinal
(There is no hexagon sticker available for {ordinal}.)
Implementation of cumulative link (mixed) models also known as ordered regression models, proportional odds models, proportional hazards models for grouped survival times and ordered logit/probit/… models. Estimation is via maximum likelihood and mixed models are fitted with the Laplace approximation and adaptive Gauss-Hermite quadrature. Multiple random effect terms are allowed and they may be nested, crossed or partially nested/crossed. Restrictions of symmetry and equidistance can be imposed on the thresholds (cut-points/intercepts). Standard model methods are available (summary, anova, drop-methods, step, confint, predict etc.) in addition to profile methods and slice methods for visualizing the likelihood function and checking convergence.
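A minimal sketch of a cumulative link model (adapting the wine example data that ships with the package):

```r
library(ordinal)

m <- clm(rating ~ temp + contact, data = wine)  # ordered response: rating
summary(m)
```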
A.61 paletteer
Package Profile: paletteer
The palettes are divided into 2 groups: discrete and continuous. For discrete palettes you have the choice between fixed-width palettes and dynamic palettes.
- discrete
  - fixed-width palettes: These are the most common discrete palettes. They have a set amount of colors which doesn’t change when the number of colors requested varies.
  - dynamic palettes: The colors of dynamic palettes depend on the number of colors you need.
- continuous: These palettes provide as many colors as you need for a smooth transition of color.
This package includes 2759 palettes from 75 different packages and information about these can be found in the following data frames: palettes_c_names, palettes_d_names and palettes_dynamic_names. Additionally this github repo showcases all the palettes included in the package and more.
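A quick sketch of the unified “package::palette” interface (my own example):

```r
library(paletteer)

paletteer_d("RColorBrewer::Dark2", n = 5)  # discrete, fixed-width palette
paletteer_c("viridis::viridis", n = 25)    # continuous palette
```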
A.62 patchwork
Package Profile: patchwork
The goal of {patchwork} is to make it ridiculously simple to combine separate ggplots into the same graphic. As such it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid() but using an API that incites exploration and iteration, and scales to arbitrarily complex layouts.
The {ggplot2} package provides a strong API for sequentially building up a plot, but does not concern itself with composition of multiple plots. {patchwork} is a package that expands the API to allow for arbitrarily complex composition of plots by, among others, providing mathematical operators for combining multiple plots. Other packages that try to address this need (but with a different approach) are {gridExtra} and {cowplot} (see Section A.31 and Section A.8).
Before plots can be laid out, they have to be assembled. Arguably one of patchwork’s biggest selling points is that it expands on the use of + in ggplot2 to allow plots to be added together and composed, creating a natural extension of the {ggplot2} API.
While quite complex compositions can be achieved using +, |, and /, it may be necessary to take even more control over the layout. All of this can be controlled using the patchwork::plot_layout() function along with a couple of special placeholder objects.
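A small sketch of the operator-based composition (my own example):

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p3 <- ggplot(mtcars, aes(hp)) + geom_histogram(bins = 10)

# Two plots side by side, stacked on top of a third
(p1 | p2) / p3 + plot_layout(heights = c(2, 1))
```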
{patchwork}: Personal Evaluation
In this book I am using the double colon notation instead of a library() call. Without this call it is more difficult to use the {patchwork} package, because the plot arithmetic operators are not attached. See the following table on using plot arithmetic functions with :: syntax:
| operator | function | effect |
|---|---|---|
| + | ggplot2:::"+.gg"() | side by side |
| - | patchwork:::"-.ggplot"() | |
| \| | patchwork:::"\|.ggplot"() | |
| / | patchwork:::"/.ggplot"() | stacked |
| * | patchwork:::"*.gg"() | |
| & | patchwork:::"&.gg"() | |
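Without attaching the package, the exported patchwork::wrap_plots() function offers an operator-free way to compose plots (a sketch, my own example):

```r
library(ggplot2)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

# Compose the plots without library(patchwork) and without the operators
patchwork::wrap_plots(p1, p2, ncol = 2)
```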
A.63 performance
Package Profile: performance
Utilities for computing measures to assess model quality, which are not directly provided by R’s ‘base’ or ‘stats’ packages. These include e.g. measures like r-squared, intraclass correlation coefficient (Nakagawa, Johnson & Schielzeth (2017) doi:10.1098/rsif.2017.0213), root mean squared error or functions to check models for overdispersion, singularity or zero-inflation and more. Functions apply to a large variety of regression models, including generalized linear models, mixed effects models and Bayesian models. References: Lüdecke et al. (2021) doi:10.21105/joss.03139.
A crucial aspect when building regression models is to evaluate the quality of model fit. It is important to investigate how well models fit the data and which fit indices to report. Functions to create diagnostic plots or to compute fit measures do exist, but they are mostly spread over different packages. There is no unique and consistent approach to assess the model quality for different kinds of models.
The primary goal of the performance package is to fill this gap and to provide utilities for computing indices of model quality and goodness of fit. These include measures like r-squared (\(R^2\)), root mean squared error (RMSE) or the intraclass correlation coefficient (ICC), but also functions to check (mixed) models for overdispersion, zero-inflation, convergence or singularity.
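A small sketch (my own example):

```r
library(performance)

m <- lm(mpg ~ wt + hp, data = mtcars)

r2(m)                 # r-squared
model_performance(m)  # several fit indices at once (AIC, RMSE, ...)
check_model(m)        # visual diagnostic checks (plotting needs {see})
```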
A.64 plotly
Package Profile: plotly
Plotly.js is a standalone Javascript data visualization library, and it also powers the Python and R modules named plotly in those respective ecosystems (referred to as Plotly.py and Plotly.R).
Plotly.js can be used to produce dozens of chart types and visualizations, including statistical charts, 3D graphs, scientific charts, SVG and tile maps, financial charts and more.
A.65 ppcor
Package Profile: ppcor
(There is no hexagon sticker available for {ppcor}.)
Calculates partial and semi-partial (part) correlations along with p value. (Kim 2015b)
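A one-line sketch (my own example):

```r
library(ppcor)

# Partial correlation of mpg and wt, controlling for hp
pcor.test(mtcars$mpg, mtcars$wt, mtcars$hp)
```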
A.66 pscl
Package Profile: pscl
(There is no hexagon sticker available for {pscl}.)
Bayesian analysis of item-response theory (IRT) models, roll call analysis; computing highest density regions; maximum likelihood estimation of zero-inflated and hurdle models for count data; goodness-of-fit measures for GLMs; data sets used in writing and teaching; seats-votes curves.
A.67 psych
Package Profile: psych
(There is no hexagon sticker available for {psych}.)
A general purpose toolbox developed originally for personality, psychometric theory and experimental psychology.
- Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.
- Item Response Theory is done using factor analysis of tetrachoric and polychoric correlations.
- Functions for analyzing data at multiple levels include within and between group statistics, including correlations and factor analysis.
- Validation and cross validation of scales developed using basic machine learning algorithms are provided, as are functions for simulating and testing particular item and test structures.
- Several functions serve as a useful front end for structural equation modeling.
- Graphical displays of path diagrams, including mediation models, factor analysis and structural equation models are created using basic graphics.
- Some of the functions are written to support a book on psychometric theory as well as publications in personality research.
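A short sketch of two typical uses (my own example; Harman74.cor is a classic correlation matrix from base R's datasets):

```r
library(psych)

describe(mtcars)  # descriptive statistics incl. skew and kurtosis

# Factor analysis of a correlation matrix (Harman's 24 ability tests)
fa(Harman74.cor$cov, nfactors = 4, n.obs = 145)
```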
A.68 pubh
Package Profile: pubh
(There is no hexagon sticker available for {pubh}.)
A toolbox for making R functions and capabilities more accessible to students and professionals from Epidemiology and Public Health related disciplines. Includes a function to report coefficients and confidence intervals from models using robust standard errors (when available), functions that expand {ggplot2} plots and functions relevant for introductory papers in Epidemiology or Public Health. Please note that use of the provided data sets is for educational purposes only.
A.69 purrr
Package Profile: purrr
If you’ve never heard of FP (functional programming) before, the best place to start is the family of purrr::map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the purrr::map() functions is the iteration chapter in R for Data Science.
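A tiny sketch replacing a for loop (my own example):

```r
library(purrr)

# Mean of every column in mtcars, returned as a named double vector
map_dbl(mtcars, mean)

# Map over two inputs in parallel
map2_dbl(mtcars$mpg, mtcars$wt, \(m, w) m / w)
```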
A.70 rcompanion
Package Profile: rcompanion
(There is no hexagon sticker available for {rcompanion}.)
Functions and datasets to support Summary and Analysis of Extension Program Evaluation in R, and An R Companion for the Handbook of Biological Statistics. Vignettes are available at https://rcompanion.org. (See also the PDF book (Salvatore S. Mangiafico 2023).)
Both books provide example programs for nearly all of the statistical tests that are described in the Handbook of Biological Statistics (McDonald 2009).
{rcompanion}: Personal Evaluation
Although all three books are now about 15 years old, it seems to me that they cover valuable material that is still important. The book by McDonald explains many different tests that are used in the two books by Mangiafico. I will add here the complete table of contents to remind me that I should look into these three books:
TABLE OF CONTENTS
Introduction
- Purpose of this Book
- Author of this Book
- Using R
- Statistics Textbooks and Other Resources
Statistics for Educational Program Evaluation
- Why Statistics?
- Evaluation Tools and Surveys
Variables, Descriptive Statistics, and Plots
- Types of Variables
- Descriptive Statistics
- Confidence Intervals
- Basic Plots
Understanding Statistics and Hypothesis Testing
- Hypothesis Testing and p-values
- Reporting Results of Data and Analyses
- Choosing a Statistical Test
- Independent and Paired Values
Likert Data
- Introduction to Likert Data
- Descriptive Statistics for Likert Item Data
- Descriptive Statistics with the likert Package
- Confidence Intervals for Medians
- Converting Numeric Data to Categories
Traditional Nonparametric Tests
- Introduction to Traditional Nonparametric Tests
- One-sample Wilcoxon Signed-rank Test
- Sign Test for One-sample Data
- Two-sample Mann–Whitney U Test
- Mood’s Median Test for Two-sample Data
- Two-sample Paired Signed-rank Test
- Sign Test for Two-sample Paired Data
- Kruskal–Wallis Test
- Mood’s Median Test
- Friedman Test
- Quade Test
- Scheirer–Ray–Hare Test
- Aligned Ranks Transformation ANOVA
- Nonparametric Regression and Local Regression
- Nonparametric Regression for Time Series
Permutation Tests
- Introduction to Permutation Tests
- One-way Permutation Test for Ordinal Data
- One-way Permutation Test for Paired Ordinal Data
- Permutation Tests for Medians and Percentiles
Tests for Ordinal Data in Tables
- Association Tests for Ordinal Tables
- Measures of Association for Ordinal Tables
Concepts for Linear Models
- Introduction to Linear Models
- Using Random Effects in Models
- What are Estimated Marginal Means?
- Estimated Marginal Means for Multiple Comparisons
- Factorial ANOVA: Main Effects, Interaction Effects, and Interaction Plots
- p-values and R-square Values for Models
- Accuracy and Errors for Models
Ordinal Tests with Cumulative Link Models
- Introduction to Cumulative Link Models (CLM) for Ordinal Data
- Two-sample Ordinal Test with CLM
- Two-sample Paired Ordinal Test with CLMM
- One-way Ordinal Regression with CLM
- One-way Repeated Ordinal Regression with CLMM
- Two-way Ordinal Regression with CLM
- Two-way Repeated Ordinal Regression with CLMM
Tests for Nominal Data
- Introduction to Tests for Nominal Variables
- Confidence Intervals for Proportions
- Goodness-of-Fit Tests for Nominal Variables
- Association Tests for Nominal Variables
- Measures of Association for Nominal Variables
- Tests for Paired Nominal Data
- Cochran–Mantel–Haenszel Test for 3-Dimensional Tables
- Cochran’s Q Test for Paired Nominal Data
- Models for Nominal Data
Parametric Tests
- Introduction to Parametric Tests
- One-sample t-test
- Two-sample t-test
- Paired t-test
- One-way ANOVA
- One-way ANOVA with Blocks
- One-way ANOVA with Random Blocks
- Two-way ANOVA
- Repeated Measures ANOVA
- Correlation and Linear Regression
- Advanced Parametric Methods
- Transforming Data
- Normal Scores Transformation
Analysis of Count Data and Percentage Data
- Regression for Count Data
- Beta Regression for Percent and Proportion Data
Other Books
A.71 readr
Package Profile: readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results. (Wickham, Hester, and Bryan 2024)
{readr} supports the following formats:
- read_csv(): comma-separated values (CSV)
- read_tsv(): tab-separated values (TSV)
- read_csv2(): semicolon-separated values with , as the decimal mark
- read_delim(): delimited files (CSV and TSV are important special cases)
- read_fwf(): fixed-width files
- read_table(): whitespace-separated files
- read_log(): web log files
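A minimal sketch using a sample file that ships with the package:

```r
library(readr)

path <- readr_example("mtcars.csv")  # a CSV file bundled with {readr}
dat <- read_csv(path, show_col_types = FALSE)

problems(dat)  # informative report of any parsing problems
```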
A.72 readxl
Package Profile: readxl
The readxl package makes it easy to get data out of Excel and into R. Compared to many of the existing packages (e.g. {gdata}, {xlsx}, {xlsReadWrite}) {readxl} has no external dependencies, so it’s easy to install and use on all operating systems. It is designed to work with tabular data and works on Windows, Mac and Linux.
{readxl} supports both the legacy .xls format and the modern xml-based .xlsx format. The embedded libxls C library is used to support .xls, which abstracts away many of the complexities of the underlying binary format. To parse .xlsx, we use the RapidXML C++ library.
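A minimal sketch using the example workbook that ships with the package:

```r
library(readxl)

path <- readxl_example("datasets.xlsx")  # a workbook bundled with {readxl}
excel_sheets(path)                       # list the sheets
read_excel(path, sheet = "mtcars", n_max = 5)
```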
A.73 rms
Package Profile: rms
(There is no hexagon sticker available for {rms}.)
Regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. {rms} is a collection of functions that assist with and streamline modeling.
It also contains functions for binary and ordinal logistic regression models, ordinal models for continuous Y with a variety of distribution families, and the Buckley-James multiple regression model for right-censored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models.
{rms} works with almost any regression model, but it was especially written to work with binary or ordinal regression models, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression.
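A minimal sketch of a binary logistic fit with {rms} (my own example; datadist() registers the data for later plotting and prediction methods):

```r
library(rms)

dd <- datadist(mtcars)
options(datadist = "dd")

m <- lrm(am ~ wt + hp, data = mtcars)  # binary logistic regression
m                                      # coefficients plus fit indices
```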
A.74 report
Package Profile: report
The primary goal of {report} is to bridge the gap between R’s output and the formatted results contained in your manuscript. This package converts statistical models and data frames into textual reports suited for publication, ensuring standardization and quality in results reporting.
{report} automatically produces reports of models and data frames according to best practices guidelines (e.g., APA’s style), ensuring standardization and quality in results reporting.
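A one-line sketch (my own example):

```r
library(report)

m <- lm(mpg ~ wt, data = mtcars)
report(m)  # a ready-to-paste text paragraph describing the model
```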
A.75 RNHANES
Package Profile: RNHANES
(There is no hexagon sticker available for {RNHANES}.)
RNHANES is an R package for accessing and analyzing CDC NHANES (National Health and Nutrition Examination Survey) data that was developed by Silent Spring Institute.
{RNHANES}: Personal Evaluation
The CRAN version of {RNHANES} only works with data before 2015. It is said in the book that for the years 2015-2016 you could use the GitHub developer version. But this didn’t work for me.
The problem is the function RNHANES::validate_year(), which is not up to date. It has the valid years included as fixed strings, which I see as bad programming. (One could generate these pairs of years programmatically, checking with the modulo operator %%, subtracting 4 years from the current year, because the data has to be prepared before it is made publicly available.)
I therefore used code in Listing / Output 3.3 to download the data directly from the website. Recently I learned that there is another, more up-to-date package, {nhanesA}, which I am going to test in chapter 6, where I need NHANES data again.
A.76 rstatix
Package Profile: rstatix
(There is no hexagon sticker available for {rstatix}.)
Provides a simple and intuitive pipe-friendly framework, coherent with the {tidyverse} design philosophy, for performing basic statistical tests, including t-test, Wilcoxon test, ANOVA, Kruskal-Wallis and correlation analyses.
The output of each test is automatically transformed into a tidy data frame to facilitate visualization.
Additional functions are available for reshaping, reordering, manipulating and visualizing correlation matrices. Functions are also included to facilitate the analysis of factorial experiments, including purely ‘within-Ss’ designs (repeated measures), purely ‘between-Ss’ designs, and mixed ‘within-and-between-Ss’ designs.
It’s also possible to compute several effect size metrics, including “eta squared” for ANOVA, “Cohen’s d” for t-test and “Cramer’s V” for the association between categorical variables. The package contains helper functions for identifying univariate and multivariate outliers, assessing normality and homogeneity of variances.
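A minimal sketch of the pipe-friendly workflow, using the built-in ToothGrowth data:
library(rstatix)

ToothGrowth |>
  t_test(len ~ supp)    # returns a tidy data frame (statistic, df, p)

ToothGrowth |>
  cohens_d(len ~ supp)  # effect size, also as a tidy data frame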
A.77 rvest
Package Profile: rvest
{rvest} helps you scrape (or harvest) data from web pages. It is designed to work with {magrittr} to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser.
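A minimal sketch (the URL and the CSS selector are placeholder examples):
library(rvest)

html <- read_html("https://example.com")
html |>
  html_elements("h1") |>  # select elements via a CSS selector
  html_text2()            # extract their text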
{rvest}: Personal Evaluation
If you’re scraping multiple pages, Hadley Wickham highly recommends using {rvest} in concert with {polite}. The {polite} package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.
It is important to read the introductory vignette Web scraping 101. It introduces you to the basics of web scraping with {rvest}. You’ll first learn the basics of HTML and how to use CSS selectors to refer to specific elements, then you’ll learn how to use {rvest} functions to get data out of HTML and into R.
A very important tool to get the appropriate CSS selector is SelectorGadget. To learn how to install and use this tool, read the SelectorGadget help page (https://rvest.tidyverse.org/articles/selectorgadget.html) of {rvest}.
A.78 scales
Package Profile: scales
One of the most difficult parts of any graphics package is scaling: converting from data values to perceptual properties. The inverse of scaling, making guides (legends and axes) that can be used to read the graph, is often even harder! The {scales} package provides the internal scaling infrastructure used by {ggplot2}, and gives you tools to override the default breaks, labels, transformations and palettes.
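A minimal sketch of overriding the default labels (the unit suffix is an arbitrary example):
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(labels = scales::label_number(suffix = " mpg"))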
A.79 scico
Package Profile: scico
Color choice in information visualization is important in order to avoid being misled by inherent bias in the color palette used. The {scico} package provides access to the perceptually uniform and color-blindness-friendly palettes developed by Fabio Crameri and released under the “Scientific Color-Maps” moniker. The package contains 39 different palettes and includes both diverging and sequential types. It uses more or less the same API as {viridis} and provides scales for {ggplot2} without requiring {ggplot2} to be installed.
Features of {scico}
- Perceptually uniform
- Perceptually ordered
- Color-vision-deficiency (CVD) friendly
- Readable in black & white prints
- All color map types & classes in all major formats
- Citable & reproducible
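A minimal sketch of using a {scico} palette in a {ggplot2} plot (the palette choice is an arbitrary example):
library(ggplot2)

ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scico::scale_fill_scico(palette = "batlow")  # sequential, CVD-friendly palette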
A.80 semTools
Package Profile: semTools
(There is no hexagon sticker available for {semTools}.)
Provides tools for structural equation modeling, many of which extend the {lavaan} package; for example, to pool results from multiple imputations, probe latent interactions, or test measurement invariance.
{semTools}: Personal Evaluation
This is a very specialized package. I believe I will not use it at the moment, apart from the functions semTools::skew() and semTools::kurtosis().
A.81 sjlabelled
Package Profile: sjlabelled
{sjlabelled} includes easy ways to get, set or change value and variable label attributes, to convert labelled vectors into factors or numeric (and vice versa), or to deal with multiple declared missing values.
{sjlabelled}: Personal Evaluation
The prefix sj in {sjlabelled} (German: Strenge Jacke, “strict jacket”) refers to other work by Daniel Lüdecke, who has developed many R packages. All sj packages support labelled data.
His packages are divided into two lines of development:
- Most packages are part of the project EasyStats, which provides, with 11 packages, “An R Framework for Easy Statistical Modeling, Visualization, and Reporting”, similar to the {tidyverse} collection. The {easystats} collection is oriented more toward statistics, whereas {tidyverse} is directed more toward data science.
- The other line of package development supports labelled data in combination with different R tasks, like
- Data and Variable Transformation Functions {sjmisc},
- Data Visualization for Statistics in Social Science {sjPlot} and a
- Collection of Convenient Functions for Common Statistical Computations {sjstats}.
Additionally there is {sjtable2df}, a package to convert ‘sjPlot’ HTML tables to R ‘data.frame’ objects.
A.82 sjPlot
Package Profile: sjPlot
Results of various statistical analyses (that are commonly used in social sciences) can be visualized using this package, including simple and cross tabulated frequencies, histograms, box plots, (generalized) linear models, mixed effects models, PCA and correlation matrices, cluster analyses, scatter plots, Likert scales, effects plots of interaction terms in regression models, constructing index or score variables and much more.
{sjPlot}: Personal Evaluation
The standard plot versions are easy to create, but adapting the resulting graph is another issue. Although {sjPlot} uses the {ggplot2} package in the background, you can’t specify changes with ggplot2 commands; I tried it and it produced two different plots.
To customize plot appearance you have to learn the many arguments of sjPlot::set_theme() and sjPlot::plot_grpfrq(). See the documentation of the many specialized functions to tweak the default values.
A.83 sjstats
Package Profile: sjstats
This package aims at providing:
- Shortcuts for statistical measures that otherwise could only be calculated with additional effort (like Cramer’s V, Phi, or effect size statistics like Eta or Omega squared), or for which no functions are currently available.
- Another focus lies on weighted variants of common statistical measures and tests like weighted standard error, mean, t-test, correlation, and more.
The comprised tools include:
- Especially for mixed models: design effect, sample size calculation
- Weighted statistics and tests for: mean, median, standard error, standard deviation, correlation, Chi-squared test, t-test, Mann-Whitney-U-test
A.84 skimr
Package Profile: skimr
{skimr} provides a simple-to-use summary function that works with pipes and displays nicely in the console. The default summary statistics may be modified by the user, as can the default formatting. Support for data frames and vectors is included, and users can implement their own skim methods for specific object types, as described in a vignette. Default summaries include support for inline spark graphs. Instructions for managing these on specific operating systems are given in the Using skimr vignette and the README.
{skimr}: Personal Evaluation
At the moment I am just using the skimr::skim() function. I believe most of the many other functions for adaptation are aimed at developers. But still: I need to have a closer look at this package.
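A minimal sketch:
skimr::skim(iris)   # per-column summaries, grouped by column type

iris |>
  dplyr::group_by(Species) |>
  skimr::skim()     # skim() also respects dplyr groups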
A.85 statpsych
Package Profile: statpsych
(There is no hexagon sticker available for {statpsych}.)
{statpsych} implements confidence interval and sample size methods that are especially useful in psychological research.
The methods can be applied in 1-group, 2-group, paired-samples, and multiple-group designs and to a variety of parameters including means, medians, proportions, slopes, standardized mean differences, standardized linear contrasts of means, plus several measures of correlation and association.
The confidence intervals and sample size functions are applicable to single parameters as well as differences, ratios, and linear contrasts of parameters.
The sample size functions can be used to approximate the sample size needed to estimate a parameter or function of parameters with desired confidence interval precision or to perform a variety of hypothesis tests (directional two-sided, equivalence, superiority, noninferiority) with desired power. For details, see: https://dgbonett.sites.ucsc.edu/.
A.86 stringi
Package Profile: stringi
A collection of character string/text/natural language processing tools for pattern searching (e.g., with ‘Java’-like regular expressions or the ‘Unicode’ collation algorithm), random string generation, case mapping, string transliteration, concatenation, sorting, padding, wrapping, Unicode normalization, date-time formatting and parsing, and many more.
The {stringi} tools are fast, consistent, convenient, and - thanks to ICU (International Components for Unicode) - portable across all locales and platforms. Documentation about {stringi} is provided via its website at https://stringi.gagolewski.com/ and the paper by Gagolewski (2022b).
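Two minimal examples of what these tools look like:
stringi::stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")  # transliteration
stringi::stri_count_regex("The quick brown fox", "\\w+")         # 4 matches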
A.87 stringr
Package Profile: stringr
A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.
Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The {stringr} package provides a cohesive set of functions designed to make working with strings as easy as possible. If you’re not familiar with strings, the best place to start is the chapter on strings in R for Data Science.
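A minimal sketch of the consistent wrapper design (the string always comes first, the pattern second):
library(stringr)

str_detect(c("apple", "banana"), "an")  # FALSE TRUE
str_replace_all("a-b-c", "-", "_")      # "a_b_c"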
{stringr}: Personal Evaluation
Even though I did not use {stringi} in this book, I have added its package profile in the appropriate place, Section A.86.
A.88 tableone
Package Profile: tableone
(There is no hexagon sticker available for {tableone}.)
Creates ‘Table 1’, i.e., a description of baseline patient characteristics, which is essential in every medical research paper. Supports both continuous and categorical variables, as well as p-values and standardized mean differences. Weighted data are supported via the {survey} package.
{tableone}: Personal Evaluation
Instead of using {tableone} I will use {gtsummary} in conjunction with {gt}.
A.89 tabulizer
Package Profile: tabulizer
(There is no hexagon sticker available for {tabulizer}.)
{tabulizer} provides R bindings to the Tabula Java library, which can be used to computationally extract tables from PDF documents. The {tabulizerjars} package (https://github.com/ropensci/tabulizerjars) provides versioned ‘Java’ .jar files, including all dependencies, aligned to releases of ‘Tabula’.
{tabulizer} depends on {rJava}, which implies a system requirement for Java. This can be frustrating, especially on Windows.
{tabulizer}: Personal Evaluation
I just noticed that {tabulizer} was removed from the CRAN repository. But you can still install it from the CRAN archive or, even better, from the GitHub site. I installed it several years ago (version 0.2.3) and it works smoothly.
I have looked and tested alternatives, but nothing worked satisfactorily:
{pdftools}: A great tool to scrape text from PDFs, but not so good with tables: “It is possible to use {pdftools} with some creativity to parse tables from PDF documents, which does not require Java to be installed.”
An example of how to do that is explained in How to extract data tables from PDF in r Tutorial, a video by Data Centrics Inc. Another approach can be found on StackOverflow. But both procedures are way too complex, and they do not repay the effort, especially with the small example table in the video tutorial. It would be much easier to use other tools, for instance the app TextSniper on macOS, or even to input the figures manually.
{PDE}: PDF Data extractor (PDE) seems the right tool for the task because it promises to “Extract Tables and Sentences from PDFs with User Interface”. I couldn’t work with the interactive user interface because it has many different options and I didn’t have time to study them thoroughly. But I succeeded with the programming interface, although the result had some errors: parts of some columns appeared in extra columns. These errors were easy to detect and to repair.
With the following code I could extract all 13 tables from the ATF document and could also scrape the pictures in the PDF and convert them to PNGs.
atf_tables <- PDE::PDE_pdfs2table(
  pdfs = "data/chap03/firearms_commerce_2019.pdf",  # input PDF
  out = "data/chap03/test/",                        # output directory
  table.heading.words = "Exhibit",                  # word that marks table headings
  out.table.format = ".csv (macintosh)"             # export each table as a CSV file
)
It took me about 20-30 seconds and I got the following message:
Following file is processing: ‘firearms_commerce_2019.pdf’
No filter words chosen for analysis.
13 table(s) found in ‘firearms_commerce_2019.pdf’.
Analysis of ‘firearms_commerce_2019.pdf’ complete.
Maybe the interactive UI would also work, but as I am very content with {tabulizer} I did not delve deeply into {PDE}.
My recommendation: As the first choice try to install and use {tabulizer}. If this does not work for you, try {PDE}.
A.90 tibble
Package Profile: tibble
Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
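A minimal sketch of the “surly” behavior:
tbl <- tibble::tibble(x = 1:3, y = letters[1:3])
tbl$z   # returns NULL with a warning instead of silently partial-matching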
A.91 tidyr
Package Profile: tidyr
Tidy data is data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
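A minimal sketch of tidying a wide table with tidyr::pivot_longer(), using the built-in relig_income data:
tidyr::pivot_longer(
  tidyr::relig_income,
  cols = -religion,      # every column except religion
  names_to = "income",   # old column names become a variable
  values_to = "count"    # cell values become a single value column
)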
A.92 tidyselect
Package Profile: tidyselect
(There is no hexagon sticker available for {tidyselect}.)
The {tidyselect} package is the backend of functions like dplyr::select() or dplyr::pull() as well as several {tidyr} verbs. It allows you to create selecting verbs that are consistent with other {tidyverse} packages.
To learn about the selection syntax as a user of {dplyr} or {tidyr}, read the user-friendly ?language reference.
To learn how to implement tidyselect in your own functions, read vignette("tidyselect").
To learn exactly how the {tidyselect} syntax is interpreted, read the technical description in vignette("syntax").
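A minimal sketch of a tidyselect-powered function (my own toy example, following the pattern in the package documentation):
my_select <- function(data, cols) {
  idx <- tidyselect::eval_select(rlang::enquo(cols), data = data)
  rlang::set_names(data[idx], names(idx))
}

my_select(mtcars, c(mpg, starts_with("c")))  # mpg, cyl, carb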
A.93 tidyverse
Package Profile: tidyverse
The {tidyverse} is an opinionated collection of R packages designed for data science.
All packages share an underlying design philosophy, grammar, and data structures (Wickham et al. 2019). Read more about the philosophy and purpose in The tidy tools manifesto and Welcome to the {tidyverse}.
{tidyverse}: Personal Evaluation
In this book I am not going to load {tidyverse} with all its packages. Instead I am using the <package>::<function> format to access the commands. Explicitly mentioning the package with every function call helps me learn which package is responsible for which function.
A.94 vcd
Package Profile: vcd
(There is no hexagon sticker available for {vcd}.)
Visualization techniques, data sets, summary and inference procedures aimed particularly at categorical data. Special emphasis is given to highly extensible grid graphics. The package was originally inspired by the book “Visualizing Categorical Data” by Michael Friendly and is now the main support package for a new book, “Discrete Data Analysis with R” by Michael Friendly and David Meyer (2015).
A.95 vcdExtra
Package Profile: vcdExtra
This package provides additional data sets, documentation, and many functions designed to extend the {vcd} package for Visualizing Categorical Data and the {gnm} package for Generalized Nonlinear Models. In particular, {vcdExtra} extends mosaic, assoc and sieve plots from {vcd} to handle stats::glm() and gnm::gnm() models, and adds a 3D version in vcdExtra::mosaic3d().
{vcdExtra} is a support package for the book Discrete Data Analysis with R (DDAR) by Michael Friendly and David Meyer (2015). There is also a web site for DDAR with all figures and code samples from the book. It is also used in Friendly’s graduate course, Psy 6136: Categorical Data Analysis.
A.96 viridis
Package Profile: viridis
{viridis}, and its companion package {viridisLite} provide a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency. The color maps are also perceptually-uniform, both in regular form and also when converted to black-and-white for printing.
{viridisLite} provides the base functions for generating the color maps in base R. The package is meant to be as lightweight and dependency-free as possible for maximum compatibility with the whole R ecosystem. {viridis} provides additional functionalities, in particular bindings for {ggplot2}.
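A minimal sketch:
viridisLite::viridis(5)  # five hex colors from the default viridis map

library(ggplot2)
ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_viridis_c()  # continuous viridis color scale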
A.97 waffle
Package Profile: waffle
(There is no hexagon sticker available for {waffle}.)
Square pie charts (a.k.a. waffle charts) can be used to communicate parts of a whole for categorical quantities. To emulate the percentage view of a pie chart, a 10x10 grid should be used with each square representing 1% of the total.
Modern uses of waffle charts do not necessarily adhere to this rule and can be created with a grid of any rectangular shape.
Best practices suggest keeping the number of categories small, just as should be done when creating pie charts.
Tools are provided to create waffle charts as well as stitch them together, and to use glyphs for making isotype pictograms.
It uses {ggplot2} and returns a ggplot2 object.
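A minimal sketch (the category counts are made up), producing the 10x10 percentage grid described above:
waffle::waffle(c(A = 40, B = 35, C = 25), rows = 10)  # returns a ggplot2 object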
A.98 withr
Package Profile: withr
Pure functions, such as the sum() function, are easy to understand and reason about: they always map the same input to the same output and have no other impact on the workspace. In other words, pure functions have no side effects: they are not affected by, nor do they affect, the global state in any way apart from the value they return.
The purpose of the {withr} package is to help you manage side effects in your code. You may want to run code with secret information, such as an API key, that you store as an environment variable. You may also want to run code with certain options, with a given random seed, or with a particular working directory.
The {withr} package helps you manage these situations, and more, by providing functions to modify the global state temporarily and safely. These functions modify one of the global settings for the duration of a block of code, then automatically reset it after the block is completed.
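A minimal sketch: the option is changed only for the duration of the call.
withr::with_options(list(digits = 3), print(pi))  # 3.14
print(pi)                                         # 3.141593, default restored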
A.99 qualvar
Package Profile: qualvar
(There is no hexagon sticker available for {qualvar}.)
In 1973, Wilcox published a paper presenting various indices of qualitative variation for social scientists. The problem is to find relevant statistical indices to measure the variation in nominal-scale (i.e. qualitative or categorical) data. Please see the Wilcox paper for more details on the rationale (Wilcox 1973).
Wilcox presents six indices that can be used to measure qualitative variation. The {qualvar} package implements these indices so that R users can easily use them.
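A minimal sketch (the frequency counts are made up; DM() and B() are two of the six implemented indices):
x <- c(60, 25, 15)  # hypothetical frequencies of a three-category nominal variable
qualvar::DM(x)      # deviation from the mode
qualvar::B(x)       # Wilcox’s index B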