Notes on Statistics with R (SwR)
Preface
- I have finished the first draft of all book chapters. I know that it still contains a large number of typos and English errors, and I plan to correct most of them in a second draft. I do not know when I will find time for this additional work, because I have already achieved my main goal: learning statistical concepts and computing practices.
What may still be missing from my learning perspective is a summary and collection of lessons learned, including when to use which test and which R package.
Another plan is to revise the system of my callout boxes, because I changed the scheme slightly in the middle of this project (starting approximately with chapter 7).
But I do not know whether I will have time to keep these good resolutions, because I am planning to take the next step on my learning path. At the moment I am thinking about which book I want to study next as intensively as this one.
WATCH OUT: This is my personal learning material and is therefore neither an accurate replication nor an authoritative textbook.
I am writing this book as a text for others to read because that forces me to become explicit and explain all my learning outcomes more carefully. Please keep in mind that this text is not written by an expert but by a learner.
I have skipped text passages whose content I am already familiar with. Sections of the original text where I needed more in-depth knowledge I have elaborated on, adding my own comments that resulted from my personal research.
Be warned! Although it replicates most of the content, this Quarto book may contain many mistakes. All misapprehensions and errors are of course my own responsibility.
Content and Goals of this Book
This Quarto book collects my personal notes, trials and exercises of Statistics With R: Solving Problems Using Real-World Data by Jenine K. Harris (Harris 2020).
This introductory textbook for Statistics with R has three outstanding features:
Data Wrangling
The book works with real data sets with all their problems: missing data, inconsistent structure, inappropriate data types, incomprehensible labels, extensive accompanying codebooks, etc. Data management is therefore a major part of the book, although it is a subject that is often not taught. Many introductory textbooks work with already cleaned data and fail to guide students in bringing messy data into an analyzable form.
Inclusion
The author aims to support women and other underrepresented groups in pursuing a data science career. Using a narrative style with three prototypical female characters, the book not only discusses the approaches in detail but also shows the effects of suboptimal coding solutions. The solutions are developed step by step, and each improvement repeats the code already written. These repetitions not only help to compare the differences but also show that code has to be developed bit by bit, tested, improved, and tested again. Other interesting practices shown in the book are trying different approaches to the same problem and reusing code already written. All these practices help to lower barriers and to facilitate learning statistics with R.
Text passages
Quotes and personal comments
My text consists mostly of quotes from the first edition of Harris’ book. I converted my Kindle book into a PDF file, from which I copied passages into my Quarto files via the annotation system in Zotero.
Example 1 : Quote
“NA is a reserved “word” in R. In order to use NA, both letters must be uppercase (Na or na does not work), and there can be no quotation marks (R will treat “NA” as a character rather than a true missing value)” (Harris, 2020, p. 121) (pdf)
Example 1 has links to my PDF and also to my annotation of the PDF. These links are a practical way for me to get the context of the quote. But as the linked PDF is saved locally on my hard disk, these links do not work for you! (Zotero groups offer an option to share files, but the PDF is not free to use, so I can’t offer this possibility.)
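The rule quoted in Example 1 can be checked with a few lines of base R. The following is only a minimal sketch; the vectors `true_missing` and `text_na` are hypothetical names that I made up for illustration.

```r
# A true missing value: NA in uppercase, without quotation marks
true_missing <- c(1, NA, 3)

# The string "NA" in quotation marks is ordinary character data
text_na <- c("1", "NA", "3")

# Only the unquoted NA is recognized as a missing value
missing_flags <- base::is.na(true_missing)  # FALSE TRUE FALSE
text_flags <- base::is.na(text_na)          # FALSE FALSE FALSE

# Other spellings are not reserved words: R would look for an object
# called `Na`, which does not exist unless you have created it yourself
na_spelling_known <- base::exists("Na")
```

The last line should return FALSE in a fresh session, which is why typing Na or na at the console ends with an "object not found" error.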
Often I made minor edits (e.g., shortening the text) or put the content into my own words. In these cases I could not quote the text, as it does not correspond to a specific annotation in my Zotero file. Instead, I ended the paraphrase with (Harris ibid.).
In any case most of the text in this Quarto book is not mine but comes from different resources (Harris’ book, R help files, websites). Most of the time I have put my own personal notes into a notes box as shown in Example 2.
Example 2 : Personal note
Assessment 1 : This is a personal note
In this kind of box I will write my personal thoughts and reflections. Usually this box will appear stand-alone (without the wrapping example box).
Glossary
I am using the {glossary} package to create links to glossary entries.
R Code 1 : Load glossary
If you hover with your mouse over the double-underlined links, a window opens with the appropriate glossary text. Try this example: Z-Score.
WATCH OUT! Glossary text not authorized by the author of SWR
I added many of the glossary entries while working through other books, taking the text either from the books I was reading or from other resources found through internet research. I have added the source of each glossary entry. Sometimes I have used abbreviations, but I still need to provide a key to what these short references mean.
Jenine Harris has compiled her own glossary. Wherever it suits my learning path, I have added her entries to my dictionary. To apply the glossary to this text I have used the {glossary} package by Lisa DeBruine.
If you fork the repository of this Quarto book, the glossary will not work out of the box. Download the glossary.yml file from my glossary-pb GitHub repo, store it on your hard disk, and change the path in the code chunk Listing / Output 1.
In any case, I alone am responsible for this text, especially where I have used code from the resources incorrectly or misunderstood a quoted text passage.
R Code and Datasets
Harris provides R code and datasets via her GitHub site, but you can also download these files directly from the Student Resources section of the publisher’s (SAGE) website.
In the book Harris introduces and uses Google’s R Style Guide with camelCase. The reference points to a fork of the Tidyverse Style Guide. I am going to use underscores (_), i.e. snake case, to replace spaces, as studies have shown that it is easier to read (Sharif and Maletic 2010). But I will use the other Google modifications of the tidyverse style guide:
- Start the names of private functions with a dot.
- Don’t use base::attach().
- No right-hand assignments.
- Use explicit returns.
- Qualify namespace.
Especially the last point (qualifying the namespace) is important for my learning. Besides preventing conflicts between functions of identical names from different packages, it helps me to learn (or remember) which function belongs to which package. I think this justifies the small overhead and helps to make R code chunks self-sufficient (no previous package loading or library calls in the setup chunk). To foster learning the relation between function and package, I enclose the package name in curly braces and format it in bold.
I also use the package name for functions from the default installation of base R. This wouldn’t be necessary, but it helps me to understand where the base R functions come from. What follows is a list of the base R packages of the system library that are included in every installation and attached (opened) by default:
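The style rules listed above can be sketched in one short chunk. This is only an illustration under my own assumptions: the helper `.std_error()` and the vector `scores` are hypothetical and do not come from Harris’ book.

```r
# Private helper function: the name starts with a dot (hypothetical example)
.std_error <- function(x) {
  # Every call is qualified with its package: {stats} and {base}
  se <- stats::sd(x) / base::sqrt(base::length(x))
  # Explicit return instead of relying on the last evaluated expression
  return(se)
}

# Left-hand assignment only (no `->`)
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)
se_scores <- .std_error(scores)

# Instead of base::attach(), address the column through its data frame
mpg_values <- datasets::mtcars$mpg
n_cars <- base::length(mpg_values)
```

Because every function carries its package prefix, the chunk runs in a fresh session without any library call.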
- {base}: The R Base Package
- {datasets}: The R Datasets Package
- {graphics}: The R Graphics Package
- {grDevices}: The R Graphics Devices and Support for Colours and Fonts
- {methods}: Formal Methods and Classes
- {stats}: The R Stats Package
- {utils}: The R Utils Package
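Whether these packages are really attached can be checked in a running session. What follows is a minimal sketch, assuming a standard R installation without a customized startup configuration; all object names are hypothetical.

```r
# The seven packages from the list above
default_pkgs <- c(
  "base", "datasets", "graphics", "grDevices",
  "methods", "stats", "utils"
)

# Packages on the search path of the current session
attached <- base::sub(
  "^package:", "",
  base::grep("^package:", base::search(), value = TRUE)
)

# Listed packages that are NOT attached (empty in a standard session)
not_attached <- base::setdiff(default_pkgs, attached)

# All installed packages with priority "base"
base_priority <- base::rownames(
  utils::installed.packages(priority = "base")
)

# Base packages that are installed but not part of the list above
installed_only <- base::setdiff(base_priority, default_pkgs)
```

In my understanding `installed_only` contains packages such as {tools} or {parallel}: they belong to every installation but have to be attached or qualified explicitly.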
I do not always use the exact code snippets for my replications, because I replicate the code not only to see how it works but also to change the values of parameters and observe their influence.
In “Statistics with R” all names of function arguments are mentioned explicitly. This is also the case for functions with just one argument, for instance base::summary(object = <r object to summarize>). When the meaning is clear, I will follow the advice from Hadley Wickham:
When you call a function, you typically omit the names of data arguments, because they are used so commonly. If you override the default value of an argument, use the full name (tidyverse style guide).
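A short sketch of the difference between the two conventions, using a made-up vector `ages` (the data and object names are my own assumptions, not taken from the book):

```r
ages <- c(23, 35, NA, 41)

# Data argument named explicitly, as in Harris' book
summary_named <- base::summary(object = ages)

# Data argument unnamed, as the tidyverse style guide suggests
summary_unnamed <- base::summary(ages)

# Both calls give the same result
same_result <- base::identical(summary_named, summary_unnamed)

# An overridden default (na.rm is FALSE by default) keeps its full name
mean_age <- base::mean(ages, na.rm = TRUE)
```

Here `same_result` is TRUE and `mean_age` is 33, so omitting the name of the data argument changes only the readability, not the outcome.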
For educational reasons Harris develops code step by step and repeats the complete code, including the previous (already explained) snippets. In these cases I use tabs as an organizing structure so that one can see (and compare) the piecemeal development.
Resources
Resource 1 : Resources used for this Quarto book
- Statistics With R: Website
- R Code: Download
- Datasets: Download
Packages introduced in the preface
Resource 2 glossary: Glossaries for Markdown and Quarto Documents
Glossary
term | definition |
---|---|
Z-score | A z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. (StatisticsHowTo: https://www.statisticshowto.com/probability-and-statistics/z-score/#Whatisazscore) |
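
The definition in the table corresponds to z = (raw score - population mean) / population standard deviation. A tiny worked example with invented numbers (a score of 130 in a population with mean 100 and standard deviation 15; these values are assumptions for illustration only):

```r
# Invented values for illustration
raw_score <- 130
pop_mean <- 100
pop_sd <- 15

# Number of standard deviations the raw score lies above the mean
z_score <- (raw_score - pop_mean) / pop_sd  # 2
```

A positive value means the score lies above the population mean, a negative value means it lies below.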
Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os macOS Sequoia 15.1
#> system x86_64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Vienna
#> date 2024-11-13
#> pandoc 3.5 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.6.3 2024-06-21 [2] CRAN (R 4.4.1)
#> colorspace 2.1-1 2024-07-26 [2] CRAN (R 4.4.0)
#> commonmark 1.9.2 2024-10-04 [2] CRAN (R 4.4.1)
#> curl 6.0.0 2024-11-05 [2] CRAN (R 4.4.1)
#> digest 0.6.37 2024-08-19 [2] CRAN (R 4.4.1)
#> evaluate 1.0.1 2024-10-10 [2] CRAN (R 4.4.1)
#> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.4.0)
#> glossary * 1.0.0.9003 2024-08-05 [2] Github (debruine/glossary@05e4a61)
#> glue 1.8.0 2024-09-30 [2] CRAN (R 4.4.1)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.4.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] CRAN (R 4.4.1)
#> kableExtra 1.4.0 2024-01-24 [2] CRAN (R 4.4.0)
#> knitr 1.49 2024-11-08 [2] CRAN (R 4.4.1)
#> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.4.0)
#> markdown 1.13 2024-06-04 [2] CRAN (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [2] CRAN (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] CRAN (R 4.4.0)
#> rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.4.1)
#> rstudioapi 0.17.1 2024-10-22 [2] CRAN (R 4.4.1)
#> rversions 2.1.2 2022-08-31 [2] CRAN (R 4.4.0)
#> scales 1.3.0 2023-11-28 [2] CRAN (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.4.0)
#> stringi 1.8.4 2024-05-06 [2] CRAN (R 4.4.0)
#> stringr 1.5.1 2023-11-14 [2] CRAN (R 4.4.0)
#> svglite 2.1.3 2023-12-08 [2] CRAN (R 4.4.0)
#> systemfonts 1.1.0 2024-05-15 [2] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.4.0)
#> viridisLite 0.4.2 2023-05-02 [2] CRAN (R 4.4.0)
#> xfun 0.49 2024-10-31 [2] CRAN (R 4.4.1)
#> xml2 1.3.6 2023-12-04 [2] CRAN (R 4.4.0)
#> yaml 2.3.10 2024-07-26 [2] CRAN (R 4.4.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.4-x86_64/library
#> [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────