9 Linear regression – Notes on Statistics with R (SwR)

9.1 Achievements to unlock

Objectives for chapter 09

SwR Achievements

Achievement 1: Using exploratory data analysis to learn about the data before developing a linear regression model (Section 9.4)
Achievement 2: Exploring the statistical model for a line (Section 9.5)
Achievement 3: Computing the slope and intercept in a simple linear regression (Section 9.6)
Achievement 4: Slope interpretation and significance ($b_{1}$, p-value, CI) (Section 9.7)
Achievement 5: Model significance and model fit (Section 9.8)
Achievement 6: Checking assumptions and conducting diagnostics (Section 9.9)
Achievement 7: Adding variables to the model and using transformation (Section 9.10)

Objectives 9.1: Achievements for chapter 09

9.2 The needle exchange examination

Some infectious diseases like HIV and Hepatitis C are on the rise again with young people in non-urban areas having the highest increases and needle-sharing being a major factor. Clean needles are distributed by syringe services programs (SSPs), which can also provide a number of other related services including overdose prevention, referrals for substance use treatment, and infectious disease testing. But even there are programs in place — which is not allowed legally in some US states! — some people have to travel long distances for health services, especially for services that are specialized, such as needle exchanges.

In discussing the possible quenstion one could analyse it turned out that for some questions critical data are missing:

There is a distance-to-syringe-services-program variable among the health services data sources of amfAR (https://opioid.amfar.org/).
Many of the interesting variables were not available for much of the nation, and many of them were only at the state level.

Given these limitations, the book focuses whether the distance to a syringe program could be explained by

whether a county is urban or rural,
what percentage of the county residents have insurance (as a measure of both access to health care and socioeconomic status [SES]),
HIV prevalence,
and the number of people with opioid prescriptions.

As there is no variable for rural or urban status in the amfAR database, the book will tale a variable from the U.S. Department of Agriculture Economic Research Services website (https://www.ers.usda.gov/data-products/county-typology-codes/) that classifies all counties as metro or non-metro.

9.3 Resources & Chapter Outline

9.3.1 Data, codebook, and R packages

Resource 9.1 : Data, codebook, and R packages for learning about descriptive statistics

Data

Two options for accessing the data:

Download the clean data set dist_ssp_amfar_ch9.csv from https://edge.sagepub.com/harris1e.
Follow the instructions in Box 9.1 to import, merge, and clean the data from multiple files or from the original online source

Codebook

Two options for accessing the codebook:

Download the codebook file opioid_county_codebook.xlsx from https://edge.sagepub.com/harris1e.
Use the online codebook from the amfAR Opioid & Health Indicators Database website (https://opioid.amfar.org/)

Packages

Packages used with the book (sorted alphabetically)

{broom}, Section A.3 (David Robinson and Alex Hayes)
{car}, Section A.4 (John Fox)
{lmtest}: Section A.45 (Achim Zeileis)
{tableone}: Section A.88 (Kazuki Yoshida)
{tidyverse}: Section A.93 (Hadley Wickham)

My additional packages (sorted alphabetically)

{gt}: Section A.36 (Richard Iannone)
{gtsummary}: Section A.37 (Daniel D. Sjoberg)

9.3.2 Get, recode and show data

I will use the data file provided by the book because I am feeling quite confident with reading and recoding the original data. But I will change the columns names so that the variable names conform to the tidyverse style guide.

Example 9.1 : Data for chapter 9

R Code 9.1 : Get & recode data for chapter 9

Code

## run only once (manually)
distance_ssp <- readr::read_csv(
    "data/chap09/dist_ssp_amfar_ch9.csv",
    show_col_types = FALSE)

distance_ssp_clean <- distance_ssp |> 
    dplyr::rename(
        state = "STATEABBREVIATION",
        dist_ssp = "dist_SSP",
        hiv_prevalence = "HIVprevalence",
        opioid_rate = "opioid_RxRate",
        no_insurance = "pctunins"
    ) |> 
    dplyr::mutate(
        state = forcats::as_factor(state),
        metro = forcats::as_factor(metro)
    ) |> 
    dplyr::mutate(
        hiv_prevalence = dplyr::na_if(
            x = hiv_prevalence,
            y = -1
        )
    )

save_data_file("chap09", distance_ssp_clean, "distance_ssp_clean.rds")

(For this R code chunk is no output available)

R Code 9.2 : Show data for chapter 9

Code

distance_ssp_clean <- base::readRDS("data/chap09/distance_ssp_clean.rds")

skimr::skim(distance_ssp_clean)

Table 9.1: Descriptive statistics for data of chapter 9

Data summary
Name	distance_ssp_clean
Number of rows	500
Number of columns	7
_______________________
Column type frequency:
character	1
factor	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
county	0	1	10	28	0	406	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	49	TX: 50, GA: 30, NC: 21, TN: 21
metro	0	1	FALSE	2	non: 274, met: 226

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
dist_ssp	0	1.00	107.74	94.23	0.0	35.12	75.94	163.83	510.0	▇▃▂▁▁
hiv_prevalence	70	0.86	192.89	213.35	14.4	72.30	118.95	227.48	2150.7	▇▁▁▁▁
opioid_rate	0	1.00	68.33	36.81	0.2	45.12	62.40	89.95	345.1	▇▆▁▁▁
no_insurance	0	1.00	12.18	4.97	3.0	8.60	11.70	15.00	35.9	▅▇▂▁▁

Bullet List 9.1

: Codebook: Explanation of variables used in Chapter 9

county: the county name
state: the two-letter abbreviation for the state the county is in
dist_ssp: distance in miles to the nearest syringe services program
hiv_prevalence: people age 13 and older living with diagnosed HIV per 100,000
opioid_rate: number of opioid prescriptions per 100 people
no_insurance:percentage of the civilian non-institutionalized population with no health insurance coverage
metro: county is either non-metro, which includes open countryside, rural towns, or smaller cities with up to 49,999 people, or metro

R Code 9.3 : Summary of metro variable

Code

distance_ssp_clean |> 
    dplyr::group_by(metro) |> 
    skimr::skim()

Data summary
Name	dplyr::group_by(distance_…
Number of rows	500
Number of columns	7
_______________________
Column type frequency:
character	1
factor	1
numeric	4
________________________
Group variables	metro

Variable type: character

skim_variable	metro	n_missing	complete_rate	min	max	empty	n_unique	whitespace
county	metro	0	1	10	28	0	201	0
county	non-metro	0	1	10	25	0	241	0

Variable type: factor

skim_variable	metro	n_missing	complete_rate	ordered	n_unique	top_counts
state	metro	0	1	FALSE	46	TX: 20, GA: 15, VA: 12, NC: 11
state	non-metro	0	1	FALSE	42	TX: 30, KS: 18, GA: 15, TN: 13

Variable type: numeric

skim_variable	metro	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
dist_ssp	metro	0	1.00	85.00	88.00	0.0	20.31	50.37	123.96	436.0	▇▂▁▁▁
dist_ssp	non-metro	0	1.00	126.50	95.21	6.0	50.61	96.05	181.96	510.0	▇▅▂▁▁
hiv_prevalence	metro	18	0.92	245.59	262.55	14.4	89.30	162.60	316.35	2150.7	▇▁▁▁▁
hiv_prevalence	non-metro	52	0.81	143.51	136.86	22.6	63.65	96.60	159.18	904.3	▇▂▁▁▁
opioid_rate	metro	0	1.00	65.44	28.65	4.7	45.55	60.10	87.43	143.6	▂▇▅▃▁
opioid_rate	non-metro	0	1.00	70.71	42.28	0.2	44.60	65.60	91.97	345.1	▇▆▁▁▁
no_insurance	metro	0	1.00	11.17	4.42	3.0	8.22	10.80	14.05	30.2	▅▇▅▁▁
no_insurance	non-metro	0	1.00	13.02	5.25	3.3	9.17	12.15	15.88	35.9	▅▇▃▁▁

Listing / Output 9.1: Summary of metro variable

For the exploratory data analysis I need more details about the association between the distance to the next SSP separated for people living in metro and non-metro areas. See

WATCH OUT! Do missing values have a pattern?

We know from Table 9.1 that the variable hiv_prevalence has many missing values. In all the forthcoming analyses we will remove those 70 NAs and work with complete cases. 70 NA’s in a sample of 500 is with 14% a big proportion from the available data. The question arises: Is there a reason why there are so many missing values? Could it be that this reason is distorting our analysis?

Most of the time I have provided code that suppresses these warnings. This is a dangerous enterprise as it could bias results and conclusions without knowledge of the researcher. I think that a more prudent approach would need an analysis of the missing values. I do not know how to do this yet, but with {naniar} (Section A.52) there is a package for exploring missing data structures. Its website and package has several vignettes to learn its functions and there is also an scientific article about the package (Tierney and Cook 2023).

Exploring missing data structures is in the book no planned achievement, therefore it is here enough to to get rid of the NA’s and to follow the books outline. But I am planning coming back to this issue and learn how to address missing data structures appropriately.

9.4 Achievement 1: Explorative Data Analysis

9.4.1 Introduction

Instead following linearly the chapter I will try to compute my own EDA. I will try three different method:

Manufacturing the data and graphs myself. Writing own functions and using {tidyverse} packages to provide summary plots and statistics.
Trying out the graphics::pairs() function.
Experimenting with {GGally}, an extension package to {ggplot2} where one part (GGally::ggpairs()) is the equivalent to the base R graphics::pairs() function.

9.4.2 Steps for EDA

I will apply the following steps:

Procedure 9.1 : Some useful steps to explore data for regression analysis

Order and completeness of the following tasks is not mandatory.

Browse the data:
- RStudio Data Explorer: I am always using the data explorer in RStudio to get my first impression of the data. Although this step is not reproducible it forms my frame of mind what EDA steps I should follow and if there are issues I need especially to care about.
- Skim data: Look at the data with skimr::skim() to get a holistic view of the data: names, data types, missing values, ordered (categorical) minimum, maximum, mean, sd, distribution (numerical).
- Read the codebook: It is important to understand what the different variables mean.
- Check structure: Examine with utils::str() if the dataset has special structures, e.g. labelled data, attributes etc.
- Glimpse actual data: To get a feeling about data types and actual values use dplyr::glimpse().
- Glance at example rows: As an alternative of utils::head() / utils::tails() get random row examples including first and last row of the dataset with my own function my_glance_data().
Check normality assumption:
- Draw histograms of numeric variables: To get a better idea I have these histograms overlaid with the theoretical normal distributions and the density plots of the current data. The difference between these two curves gives a better impression if normality is met or not.
- Draw Q-Q plots of numeric variables: Q-Q plots gives even a more detailed picture if normality is met.
- Compute normality tests: If your data has less than 5.000 rows then use the Shapiro-Wilk test, otherwise the Anderson-Darling test.
Check homogeneity assumption: If the normality assumption is not met, then test if the homogeneity of variance assumption between groups is met with Levene’s test or with the more robust Fligner-Killeen’s test. In the following steps use always median instead of mean and do not compute the Pearson’s r but the Spearman’s rho coefficient.
Compute correlation coefficient: Apply either Pearson’s r or the Spearman’s rho coefficient.
Explore categorical data with box plots or violin plots: Box plots work well between a numerical and categorical variable. You could also overlay the data and violin plots to maximize the information in one single graph (see Graph 9.4).

There are packages like {GGally} and {ggfortify} (see Section A.25 and Section A.26) that provide a graphical and statistical representation of all combinations of bivariate relationships. They can be used as convenient shortcuts to many of the task listed here above.

9.4.3 Executing EDA for chapter 9

Example 9.2 : Explorative Data Analysis for chapter 9

R Code 9.4 : Descriptive statics with the {tableone} package

Code

tableone::CreateTableOne(data = distance_ssp_clean,
                         vars = c('dist_ssp', 'hiv_prevalence',
                                  'opioid_rate', 'no_insurance',
                                  'metro'))

Table 9.2: Descriptive statics with the ‘tableone’ package

#>                             
#>                              Overall        
#>   n                             500         
#>   dist_ssp (mean (SD))       107.74 (94.23) 
#>   hiv_prevalence (mean (SD)) 192.89 (213.35)
#>   opioid_rate (mean (SD))     68.33 (36.81) 
#>   no_insurance (mean (SD))    12.18 (4.97)  
#>   metro = non-metro (%)         274 (54.8)

skimr::skim() with Table 9.1 is a much better alternative! The second version of {tableone} in the book with the median instead of the mean is not necessary because it is in skimr::skim() integrated.

R Code 9.5 : Histograms of numeric variables

Code

hist_distance <- my_hist_dnorm(
    df = distance_ssp_clean,
    v = distance_ssp_clean$dist_ssp,
    n_bins = 30,
    x_label = "Nearest syringe services program in miles"
) 

hist_hiv <- my_hist_dnorm(
    df = distance_ssp_clean,
    v = distance_ssp_clean$hiv_prevalence,
    n_bins = 30,
    x_label = "People with diagnosed HIV per 100,000"
) 

hist_opioid <- my_hist_dnorm(
    df = distance_ssp_clean,
    v = distance_ssp_clean$opioid_rate,
    n_bins = 30,
    x_label = "Opioid prescriptions per 100 people"
)

hist_insurance <- my_hist_dnorm(
    df = distance_ssp_clean,
    v = distance_ssp_clean$no_insurance,
    n_bins = 30,
    x_label = "Percentage with no health insurance coverage"
)

gridExtra::grid.arrange(
   hist_distance, hist_hiv, hist_opioid, hist_insurance, nrow = 2
)

Graph 9.1: Histograms for numeric variables of chapter 9

I developed a function where I can overlay the theoretical normal distribution and the density of the current data. The difference between the two curves gives an indication if we have a normal distribution.

From our data we see that the biggest difference is between SPP distance and HIV prevalence. This right skewed distribution could also be detected from other indicator already present in the skimr::skim()view of Table 9.1: - The small histogram on the right is the most right skewed distribution. - The standard deviation of hiv_prevalence is the only one, that is bigger than the mean of the variable. - There is a huge difference between mean and the median (p50) where the mean is much bigger than the median (= right skewed distribution), e.g. there is a long tail to the right as can also be seen in the tiny histogram.

Aside from hiv_prevalence the variable distance_ssp is almost equally right skewed. The situation seems better for the rest of the numeric variables. But let’s manufacture Q-Q plots for all of them to see more in detail if they are normally distributed or not.

R Code 9.6 : Q-Q plots of numeric variables

Code

qq_distance <- my_qq_plot(
    df = distance_ssp_clean,
    v  = distance_ssp_clean$dist_ssp,
    col_qq = "Distance to SSP"
)

qq_hiv <- my_qq_plot(
    df = distance_ssp_clean,
    v  = distance_ssp_clean$hiv_prevalence,
    col_qq = "HIV diagnosed"
)

qq_opioid <- my_qq_plot(
    df = distance_ssp_clean,
    v  = distance_ssp_clean$opioid_rate,
    col_qq = "Opioid prescriptions"
)

qq_insurance <- my_qq_plot(
    df = distance_ssp_clean,
    v  = distance_ssp_clean$no_insurance,
    col_qq = "Health insurance"
)


gridExtra::grid.arrange(
   qq_distance, qq_hiv, qq_opioid, qq_insurance, nrow = 2
)

Graph 9.2: Q-Q plots for numeric variables of chapter 9

It turned out that all four numeric variables are not normally distributed. Some of them looked in the histograms quite OK, because the differences to the normal distribution on the lower and upper end of the data compensate each other.

Testing normality with Shapiro-Wilk or Anderson-Darling test will show that they are definitely not normally distributed.

R Code 9.7 : Normality checking with Shapiro-Wilk & Anderson-Darling tests

Code

dist_test <-  stats::shapiro.test(distance_ssp_clean$dist_ssp)
hiv_test <-  stats::shapiro.test(distance_ssp_clean$hiv_prevalence)
opioid_test <- stats::shapiro.test(distance_ssp_clean$opioid_rate)
insurance_test <- stats::shapiro.test(distance_ssp_clean$no_insurance)

dist_test2 <-  nortest::ad.test(distance_ssp_clean$dist_ssp)
hiv_test2 <-  nortest::ad.test(distance_ssp_clean$hiv_prevalence)
opioid_test2 <- nortest::ad.test(distance_ssp_clean$opioid_rate)
insurance_test2 <- nortest::ad.test(distance_ssp_clean$no_insurance)


normality_test <- 
    dplyr::bind_rows(
        broom:::glance.htest(dist_test),
        broom:::glance.htest(hiv_test),
        broom:::glance.htest(opioid_test),
        broom:::glance.htest(insurance_test),
        broom:::glance.htest(dist_test2),
        broom:::glance.htest(hiv_test2),
        broom:::glance.htest(opioid_test2),
        broom:::glance.htest(insurance_test2)
    ) |> 
    dplyr::bind_cols(
        variable = c("dist_ssp", "hiv_prevalence",
                     "opioid_rate", "no_insurance",
                     "dist_ssp", "hiv_prevalence",
                     "opioid_rate", "no_insurance")
    ) |> 
    dplyr::relocate(variable)

normality_test

Table 9.3: Testing normality with Shapiro-Wilk & Anderson-Darling tests

#> # A tibble: 8 × 4
#>   variable       statistic  p.value method                         
#>   <chr>              <dbl>    <dbl> <chr>                          
#> 1 dist_ssp           0.874 1.10e-19 Shapiro-Wilk normality test    
#> 2 hiv_prevalence     0.649 8.72e-29 Shapiro-Wilk normality test    
#> 3 opioid_rate        0.938 1.55e-13 Shapiro-Wilk normality test    
#> 4 no_insurance       0.946 1.85e-12 Shapiro-Wilk normality test    
#> 5 dist_ssp          18.0   3.7 e-24 Anderson-Darling normality test
#> 6 hiv_prevalence    37.0   3.7 e-24 Anderson-Darling normality test
#> 7 opioid_rate        2.68  9.14e- 7 Anderson-Darling normality test
#> 8 no_insurance       3.43  1.39e- 8 Anderson-Darling normality test

The p-values from both tests are for all four variables very small, e.g. statistically significant. Therefore we have to reject the Null that they are normally distributed.

Report

It turned out that all four variable are not normally distributed. We can’t therefore not use Pearson’s r coefficient.

Before we are going to use Spearman’s rho let’s check the homogeneity of variance assumption (homoscedasticity) with a scatterplot with lm and loess curve and using Levene’s Test and the Fligner-Killeen’s test.

R Code 9.8 : Scatterplots of numeric variables

Code

scatter_dist_hiv <-  my_scatter(
    df = distance_ssp_clean,
    v =  distance_ssp_clean$hiv_prevalence,
    w =  distance_ssp_clean$dist_ssp,
    x_label = "HIV prevalence",
    y_label = "Distance to SSP"
)

scatter_dist_opioid <-  my_scatter(
    df = distance_ssp_clean,
    v =  distance_ssp_clean$opioid_rate,
    w =  distance_ssp_clean$dist_ssp,
    x_label = "Opioid rate",
    y_label = "Distance to SSP"
)

scatter_dist_insurance <-  my_scatter(
    df = distance_ssp_clean,
    v =  distance_ssp_clean$no_insurance,
    w =  distance_ssp_clean$dist_ssp,
    x_label = "No insurance",
    y_label = "Distance to SSP"
)

gridExtra::grid.arrange(
   scatter_dist_hiv, scatter_dist_opioid, scatter_dist_insurance, nrow = 3
)

Graph 9.3: Scatterplots of numeric variables

R Code 9.9 : Testing homogeneity of variances with Levene’s and Fligner-Killeen’s test

Code

hiv_test <-  stats::fligner.test(
    distance_ssp_clean$dist_ssp,
    distance_ssp_clean$hiv_prevalence
    )
opioid_test <- stats::fligner.test(
    distance_ssp_clean$dist_ssp,
    distance_ssp_clean$opioid_rate
    )
insurance_test <- stats::fligner.test(
    distance_ssp_clean$dist_ssp,
    distance_ssp_clean$no_insurance
    )

hiv_test2 <-  car::leveneTest(
    distance_ssp_clean$dist_ssp,
    forcats::as_factor(distance_ssp_clean$hiv_prevalence)
    )
opioid_test2 <- car::leveneTest(
    distance_ssp_clean$dist_ssp,
    forcats::as_factor(distance_ssp_clean$opioid_rate)
    )
insurance_test2 <- car::leveneTest(
    distance_ssp_clean$dist_ssp,
    forcats::as_factor(distance_ssp_clean$no_insurance)
    )


homogeneity_test <- 
    dplyr::bind_rows(
        broom::tidy(hiv_test2),
        broom::tidy(opioid_test2),
        broom::tidy(insurance_test2)
    ) |> 
    dplyr::mutate(method = "Levene's Test for Homogeneity of Variance") |> 
    dplyr::bind_rows(
        broom:::glance.htest(hiv_test),
        broom:::glance.htest(opioid_test),
        broom:::glance.htest(insurance_test),
    ) |> 
    dplyr::bind_cols(
        variable = c("dist_hiv",
                     "dist_opioid", 
                     "dist_insurance",
                     "dist_hiv",
                     "dist_opioid", 
                     "dist_insurance"
                 )
    ) |> 
    dplyr::relocate(variable)

homogeneity_test

Table 9.4: Homogeneity of variances tested with Levene’s and Fligner-Killeen’s test

#> # A tibble: 6 × 7
#>   variable       statistic p.value    df df.residual method            parameter
#>   <chr>              <dbl>   <dbl> <int>       <int> <chr>                 <dbl>
#> 1 dist_hiv           0.493  0.999    395          34 Levene's Test fo…        NA
#> 2 dist_opioid        0.893  0.770    406          93 Levene's Test fo…        NA
#> 3 dist_insurance     0.750  0.983    171         328 Levene's Test fo…        NA
#> 4 dist_hiv         408.     0.312     NA          NA Fligner-Killeen …       395
#> 5 dist_opioid      451.     0.0609    NA          NA Fligner-Killeen …       406
#> 6 dist_insurance   170.     0.502     NA          NA Fligner-Killeen …       171

All p-values are higher than the threshold of .05 and are therefore not statistically significant. The Null must not rejected, the homogeneity of variance assumption for all variables is met.

R Code 9.10 : Correlations for numeric variables of chapter 9

Code

cor_pearson <- distance_ssp_clean |> 
    dplyr::summarize(
        hiv_cor = stats::cor(
        x = dist_ssp,
        y = hiv_prevalence,
        use = "complete.obs",
        method = "pearson"
        ),
        opioid_cor = stats::cor(
        x = dist_ssp,
        y = opioid_rate,
        use = "complete.obs",
        method = "pearson"
        ),
        insurance_cor = stats::cor(
        x = dist_ssp,
        y = no_insurance,
        use = "complete.obs",
        method = "pearson"
        ),
        `n (sample)` = dplyr::n()
    )

cor_spearman <- distance_ssp_clean |> 
    dplyr::summarize(
        hiv_cor = stats::cor(
        x = dist_ssp,
        y = hiv_prevalence,
        use = "complete.obs",
        method = "spearman"
        ),
        opioid_cor = stats::cor(
        x = dist_ssp,
        y = opioid_rate,
        use = "complete.obs",
        method = "spearman"
        ),
        insurance_cor = stats::cor(
        x = dist_ssp,
        y = no_insurance,
        use = "complete.obs",
        method = "spearman"
        ),
        `n (sample)` = dplyr::n()
    )

cor_kendall <- distance_ssp_clean |> 
    dplyr::summarize(
        hiv_cor = stats::cor(
        x = dist_ssp,
        y = hiv_prevalence,
        use = "complete.obs",
        method = "kendall"
        ),
        opioid_cor = stats::cor(
        x = dist_ssp,
        y = opioid_rate,
        use = "complete.obs",
        method = "kendall"
        ),
        insurance_cor = stats::cor(
        x = dist_ssp,
        y = no_insurance,
        use = "complete.obs",
        method = "kendall"
        ),
        `n (sample)` = dplyr::n()
    )

cor_chap09 <- dplyr::bind_rows(cor_pearson, cor_spearman, cor_kendall)
cor_chap09 <- dplyr::bind_cols(
    method = c("Pearson", "Spearman", "Kendall"), cor_chap09)
cor_chap09

Table 9.5: Correlations for numeric variables of chapter 9

#> # A tibble: 3 × 5
#>   method   hiv_cor opioid_cor insurance_cor `n (sample)`
#>   <chr>      <dbl>      <dbl>         <dbl>        <int>
#> 1 Pearson  -0.0240   -0.0998          0.413          500
#> 2 Spearman  0.0855   -0.0164          0.345          500
#> 3 Kendall   0.0600   -0.00918         0.234          500

Here I have computed for a comparison all three correlation coefficients of the nearest distance to the next SSP with the numeric variabeles of the dataset.

Pearson’s $r$ is not allowed for all of the three variables, because our data didn’t meet the normality assumption.
Using Spearman’s $\rho$ or Kendall’s $\tau$ instead of Pearson’s $r$ results in big differences. For instance: the correlation of distance to the next SSP and HIV prevalence reverses it direction.
Kendall’s tau $\tau$ is more conservative (smaller) than Spearman’s rho and it is also preferred in most scenarios. (Kendall’s tau is not mentioned in the book. Maybe the reason is — as far as I understand – that Spearman’s is the most widely used correlation coefficient?)

I want to confirm my internet research with the following quotes:

First quote

In the normal case, Kendall correlation is more robust and efficient than Spearman correlation. It means that Kendall correlation is preferred when there are small samples or some outliers. (Pearson vs Spearman vs Kendall) (Pluviophile 2019).

Second quote

Kendall’s Tau: usually smaller values than Spearman’s rho correlation. Calculations based on concordant and discordant pairs. Insensitive to error. P values are more accurate with smaller sample sizes.

Spearman’s rho: usually have larger values than Kendall’s Tau. Calculations based on deviations. Much more sensitive to error and discrepancies in data.

The main advantages of using Kendall’s tau are as follows:

The distribution of Kendall’s tau has better statistical properties.

The interpretation of Kendall’s tau in terms of the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs is very direct.

In most of the situations, the interpretations of Kendall’s tau and Spearman’s rank correlation coefficient are very similar and thus invariably lead to the same inferences. (Kendall’s Tau and Spearman’s Rank Correlation Coefficient) (Statistics Solutions, n.d.)

Third quote

Kendall Tau-b is more accurate for small sample sizes with strong correlations.

Spearman’s rho is preferred for weak correlations in small datasets.

In large samples, Kendall Tau-b’s reliability surpasses Spearman’s rho.

Kendall’s Tau is a robust estimator against outliers and non-normality.

Overall, Kendall Tau-b outperforms Spearman for most statistical scenarios Kendall Tau-b vs Spearman: Which Correlation Coefficient Wins? (Learn Statistics Easily 2024)

Resource 9.2 Understanding the different correlation coefficients

Kendall Tau-b vs Spearman: Which Correlation Coefficient Wins?: This important article wxplains the decisive factors in choosing the proper non-parametric correlation coefficient (Kendall Tau-b vs Spearman) for your data analysis. (Learn Statistics Easily 2024)
Pearson, Spearman and Kendall correlation coefficients by hand: This articles illustrates how to compute the Pearson, Spearman and Kendall correlation coefficients by hand and under two different scenarios (i.e., with and without ties). (Soetewey 2023)
Chapter 22: Correlation Types and When to Use Them: This chapter of (Demos and Salas 2024) covers the strengths, weaknesses, and when or when not to use three common types of correlations (Pearson, Spearman, and Kendall). It’s part statistics refresher, part R tutorial. (Sarmento, n.d.)

R Code 9.11 : Distance in miles to nearest syringe programs by metro or non-metro status for a sample of 500 counties

Code

distance_ssp_clean |>
    dplyr::group_by(metro) |>
    dplyr::summarize(mean.dist = base::mean(dist_ssp),
                     median.dist = stats::median(dist_ssp),
                     min.dist = base::min(dist_ssp),
                     max.dist = base::max(dist_ssp)
                     )

Table 9.6: Distance in miles to nearest syringe programs by metro or non-metro status for a sample of 500 counties

#> # A tibble: 2 × 5
#>   metro     mean.dist median.dist min.dist max.dist
#>   <fct>         <dbl>       <dbl>    <dbl>    <dbl>
#> 1 metro          85.0        50.4        0      436
#> 2 non-metro     126.         96.0        6      510

The big difference between mean and median reflects a right skewed distribution. There are some people living extremely far from the next SSP both in non-metro and metro areas.

It is no surprise that the distance for people living in a non-metro area is much longer than for people in big city. But what certainly surprised me, is that even in big cities half of people live more than 50 miles form the next SSP.

R Code 9.12 : Distance in miles to nearest syringe programs by metro or non-metro status for a sample of 500 counties

Code

distance_ssp_clean |> 
  ggplot2::ggplot(
      ggplot2::aes(
          x = metro, 
          y = dist_ssp, 
          fill = metro
          )
      ) +
  ggplot2::geom_violin(
      ggplot2::aes(
          color = metro
          ), 
      fill = "white", 
      alpha = .8
      ) +
  ggplot2::geom_boxplot(
      ggplot2::aes(
          fill = metro, 
          color = metro
          ), 
      width = .2, 
      alpha = .3
      ) +
  ggplot2::geom_jitter(
      ggplot2::aes(
          color = metro
          ), 
      alpha = .4
      ) +
  ggplot2::labs(
      x = "Type of county",
      y = "Miles to syringe program"
      ) +
  ggplot2::scale_fill_manual(
      values = c("#78A678", "#7463AC"), 
      guide = "none") +
  ggplot2::scale_color_manual(
      values = c("#78A678", "#7463AC"), 
      guide = "none") +
  ggplot2::coord_flip()

Graph 9.4: Distance in miles to nearest syringe programs by metro or non-metro status for a sample of 500 counties

TODO: Exploring several variables together with {**GGally}

During working the material of SwR I had often to look for more details in the internet. During one of this (re)searches I learned about about the possibility to explore multiple variables together with graphics::pairs and the {tidyverse} pendant {GGally}.

After exploring {GGally} I noticed that there is with the ggnostic() function, working as a wrapper around GGally::ggduo(), also a tool that displays full model diagnostics for each given explanatory variable. There are even many other tools where “GGally extends ggplot2 by adding several functions to reduce the complexity of combining geoms with transformed data” GGally: Extension to ggplot2.

I plotted some examples in Section 9.11 but I need to learn these very practical tools much more in detail. Admittedly I have to understand all these different diagnostic tests before I am going to read the extensive documentation (currently 14 articles) and trying to apply shortcuts with the {GGally} functions.

TODO 9.1: Learn to explore several variables together with {GGally}

9.5 Achivement 2: Exploring line model

9.5.1 Introduction

Formula 9.1 : Equation for linear model

\[ \begin{align*} y = &m_{x}+b \\ y = &b_{0}+b_{1}x \\ y = &c+b_{1}x \end{align*} \tag{9.1}\]

$m, b_{1}$: slope of the line
$b, b_{0}, c$: y-intercept of the line, or the value of y when x = 0
$x, y$: the coordinates of each point along the line

Sometimes $b^*$ is used. This means that the variable had been standardized, or transformed into z-scores, before the regression model was estimated.

An example of a linear equation would be $y = 3 + 2x$.

Important 9.1: Variable names and the difference between deterministic and stochastic

The y variable on the left-hand side of the equation is called the dependent or outcome variable.
The x variable(s) on the right-hand side of the equation is/are called the independent or predictor variable(s).

A deterministic equation, or model, has one precise value for y for each value of x. Some equation in physics are deterministic, e.g., $e = mc^2$.
In a stochastic equation, or model, you cannot predict or explain something exactly. Most of the time, there is some variability that cannot be fully explained or predicted. This unexplained variability is represented by an error term that is added to the equation. Relationships measured in social science are typically stochastic.

Equation 9.1 can be re-written with these terms:

Formula 9.2 : Equation of a linear model (rewritten)

\[ \begin{align*} outcome = &b_{0} + b_{1} \times predictor \\ outcome = &b_{0} + b_{1} \times predictor + error \\ \end{align*} \tag{9.2}\]

9.5.2 Plotting an example

Example 9.3 : Example of a linear model

deterministic
stochastic

R Code 9.13 : Example of a deterministic linear model with gallons of water needed as an outcome and weeks as a predictor

Code

# make a vector called weeks that has the values 1 through 12 in it
weeks <- 1:12

# use the regression model to make a vector called gallons with
# weeks as the values
gallons <- 3 + 2 * weeks

# make a data frame of weeks and gallons
water <- data.frame(weeks, gallons)

# Make a plot (Figure 9.9)
water |> 
  ggplot2::ggplot(
      ggplot2::aes(
          x = weeks, 
          y = gallons
          )
      ) +
  ggplot2::geom_line(
      ggplot2::aes(
          linetype = "Linear model\ngallons=3+2*weeks"
          ), 
      color = "gray60", 
      linewidth = 1
      ) + 
  ggplot2::geom_point(
      ggplot2::aes(
          color = "Observation"
          ), 
      size = 4, 
      alpha = .6
      ) +
  ggplot2::labs(
      x = "Weeks", 
      y = "Gallons of water needed"
      ) +
  ggplot2::scale_linetype_manual(values = 2, name = "") +
  ggplot2::scale_color_manual(values = "#7463AC", name = "")

Graph 9.5: Example of a linear model with gallons of water needed as an outcome and weeks as a predictor

There is nothing new in this code chunk, therefore I have just taken the code from the book only adapted with changes resulting from newer versions of {ggplot2} (e.g., linewidth instead of size).

It is important to know that the graph does not use the calculation of a linear model with ggplot2::geom_smooth() but merely uses ggplot2::geom_line to connect the points. We are using #eq-chap09-linear-model, e.g. a deterministic formula.

R Code 9.14 : Example of a stochastic linear model with gallons of water needed as an outcome and weeks as a predictor

Code

# make a vector called weeks that has the values 1 through 12 in it
weeks <- 1:12

# use the regression model to make a vector called gallons with
# weeks as the values 
# but this time with simulated residuals

set.seed(1001) # for reproducibility
gallons <- 3 + (2 * weeks) + rnorm(12, 0, 2.5)

# make a data frame of weeks and gallons
water <- data.frame(weeks, gallons)

# calculate the residuals from the linear model
res <- base::summary(
    stats::lm(gallons ~ weeks, data = water)
    )$residuals


water |> 
      ggplot2::ggplot(
          ggplot2::aes(
                x = weeks,
                y = gallons
          )
      ) +
      ggplot2::geom_point(
            ggplot2::aes(
                color = "Observation"
          ), 
          size = 4, 
          alpha = .6
      ) +
      ggplot2::geom_smooth(
          formula = y ~ x,
          method = "lm",
          se = FALSE,
          ggplot2::aes(
                linetype = "Linear model\ngallons=3+2*weeks"
          ), 
          color = "gray60", 
          linewidth = 1
      ) +
      ggplot2::geom_segment(
          ggplot2::aes(
              x = weeks,
              y = gallons,
              xend = weeks,
              yend = gallons - res
          )
      )  +
      ggplot2::labs(
          x = "Weeks", 
          y = "Gallons of water needed"
          ) +
      ggplot2::scale_linetype_manual(values = 2, name = "") +
      ggplot2::scale_color_manual(values = "#7463AC", name = "")

Graph 9.6: Example of a linear model with gallons of water needed as an outcome and weeks as a predictor with deviations (errors)

This is my replication of book’s Figure 9.10, where no R example code is available. I am proud to state that I did this graph without looking ahead or to read the tutorial by Jackson (2016) that is recommended later in the book. To draw this graph I had to take three new actions:

I had to simulate with stats::rnorm() residuals to change from Equation 9.1 to the second line of Equation 9.2.
I had to calculate a linear model to get the residuals with base::summary(stats::lm()).
I had this time to compute the line for the linear model with ggplot2::geom_smooth().

Without these three code addition, I wouldn’t have been able to draw the vertical lines from the observations to the line of the linear model.

Although I managed to create Graph 9.6 myself I mixed up in the explaining text the concepts of errors and residuals.

Important 9.2: Errors vs. Residuals

Errors and residuals are two closely related and easily confused measures:

The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean).
The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).

9.6 Achievement 3: Slope and Intercept

9.6.1 Introduction

A simple linear- regression model could be used to examine the relationship between the percentage of people without health insurance and the distance to a syringe program for a county.

Formula 9.3 : Regression of people without health insurance and the distance to SSP

\[ distance = b_{0} + b_{1} \times \text{no\_insurance} \tag{9.3}\]

9.6.2 Computing the slope

The slope formula in Equation 9.4 is adding up the product of differences between the observed values and mean value of percentage uninsured (no_insurance) and the observed values and mean value of distance to syringe program (dist_ssp) for each of the 500 counties. This value is divided by the summed squared differences between the observed and mean values of no_insurance for each county.

Formula 9.4 : Computing the slope

\[ b_{1} = \frac{\sum_{i = 1}^n (x_{i}-m_{x})(y_{i}-m_{y})}{\sum_{i = 1}^n (x_{i}-m_{x})^2} \tag{9.4}\]

$i$: individual observation, in this case a county
$n$: sample size, in this case 500
$x_{i}$: mean value of no_insurance for the sample
$y_{i}$: value of dist_ssp for $i$
$m_{y}$: mean value of dist_ssp for the sample
$\sum$: symbol for the sum
$b_{i}$: slope

9.6.3 Computing the intercept

Once the slope is computed, the intercept can be computed by putting the slope and the values of $m_{x}$ and $m_{y}$ into the equation for the line with x and y replaced by $m_{x}$ and $m_{y}$, $m_{y} = b_{0} + b_{1} times m_{x}$, and solving it for $b_{0}$, which is the y-intercept. Because this method of computing the slope and intercept relies on the squared differences and works to minimize the residuals overall, it is often called ordinary least squares or OLS regression.

9.6.4 Estimating the linear regression model with R

Example 9.4 : Numbered Example Title

lm9.1
summary lm9.1

R Code 9.15 : Linear regression of distance to syringe program by percent uninsured

Code

## linear regression of distance to syringe program by percent uninsured

lm9.1  <-  stats::lm(
        formula = dist_ssp ~ no_insurance,
        data = distance_ssp_clean, 
        na.action = na.exclude
        )

save_data_file("chap09", lm9.1, "lm9.1.rds")

lm9.1

#> 
#> Call:
#> stats::lm(formula = dist_ssp ~ no_insurance, data = distance_ssp_clean, 
#>     na.action = na.exclude)
#> 
#> Coefficients:
#>  (Intercept)  no_insurance  
#>       12.480         7.819

Listing / Output 9.2: Linear regression of distance to syringe program by percent uninsured

The books does not go into details of the results from stats::lm() but recommends immediately to use base::summary(stats::lm()) to get the best results. The reason is that the summary output of the linear model has much more details (See lst-chap09-summary-lm-distance-uninsured). But I think that it is important to know that there are many different aspects of the result incorporated in the lm object that are not reported.

The screenshot lists different aspects of the result. From top to bottom: coefficients, residuals, effects, rank, fitted.values, assign, qr, df.residual, xlevels, call, terms, model — Graph 9.7: Screenshot of the lm9.1 object

Hidden in the object are important results:

residuals: In Section 9.6.5 I have used the computed residuals from the lm9.1 object to draw the vertical lines from the observation points (dist_ssp) to the regression line in Listing / Output 9.4.
fitted.values: These values build the regression line and are identical with the results from stats::predict(lm9.1) used in Listing / Output 9.7.
effects: There is also a vector the size of the sample (500) called effects. I have looked up what this mean and came to the following definition which I do not (up to now) understand:

For a linear model fitted by lm or aov, the effects are the uncorrelated single-degree-of-freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR decomposition during the fitting process. The first r (the rank of the model) are associated with coefficients and the remainder span the space of residuals (but are not associated with particular residuals). (effects: Effects from Fitted Model)

Some help what effects are why they are important is the explanation of orthogonal in a statistical context. After reading (Orthogonal: Models, Definition & Finding) I got some first vague ideas about the meaning of effects. Hopefully I will learn later was they mean in details and how they are computed.

R Code 9.16 : Estimating the linear regression model of people without health insurance and the distance to SSP using R

Code

# linear regression of distance to syringe program by percent uninsured

base::summary(lm9.1)

#> 
#> Call:
#> stats::lm(formula = dist_ssp ~ no_insurance, data = distance_ssp_clean, 
#>     na.action = na.exclude)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -217.71  -60.86  -21.61   47.73  290.77 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   12.4798    10.1757   1.226    0.221    
#> no_insurance   7.8190     0.7734  10.110   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 85.91 on 498 degrees of freedom
#> Multiple R-squared:  0.1703, Adjusted R-squared:  0.1686 
#> F-statistic: 102.2 on 1 and 498 DF,  p-value: < 2.2e-16

Listing / Output 9.3: Summary of linear regression of distance to syringe program by percent uninsured

Intercept: The $y$-intercept of 12.48 is the $y$-value when $x$ is zero. The model predicts that a county with 0% of people being uninsured would have a distance to the nearest syringe program of 12.48 miles.
Slope: The slope of 7.82 is the change in $y$ for every one-unit change in $x$. If the percent uninsured goes up by 1% in a county, the distance in miles to a syringe program would change by 7.82 miles.

\[ \begin{align*} distance = 12.48 + 7.82 \times \text{no\_insurance} \\ distance = 12.48 + 7.82 \times 10 = 90.68 \end{align*} \] Based on the linear regression model, a county with 10% of people uninsured would be 90.68 miles from the nearest syringe program.

9.6.5 Understanding residuals

In Graph 9.6 I have already graphed a demonstration how the residuals relate to the regression line. The regression line minimizes the residual differences between the values predicted by the regression line and the observed values.

This is how OLS works. OLS minimizes those distances captured in Graph 9.6 by the solid vertical lines: It minimizes the residuals.

R Code 9.17 : Regression with residuals between percentage without health insurance and distance to nearest SSP

Listing / Output 9.4: Relationship between percentage without health insurance and distance to nearest syringe program in 500 counties with residuals (vertical lines)

This is the replication of Figure 9.12, where no example code is available. After I had calculated the linear model I needed either the position on the regression line (lm_fitted_values in my case) or the values of the residuals (lm_residuals). I couldn’t use base::summary(stats::lm()) because fitted.values are calculated only for the lm object (in my case lm_dist), which also has the residuals computed and included.

In the end I decided to subtract the residuals from the distance to get the position of the regression line (and the end of the vertical line from the observation).

9.7 Achievement 4: Slope interpretation and significance

9.7.1 Interpreting statistical significance of the slope

The output of Listing / Output 9.2 for the linear model included a p-value for the slope (<2e16) and a p-value for the intercept (0.221). The statistical significance of the slope in linear regression is tested using a Wald test, which is like a one-sample t-test where the hypothesized value of the slope is zero. To get the p-value from the regression model of distance to syringe program, the slope of 12.48 was compared to a hypothesized value of zero using the Wald test.

9.7.1.1 NHST Step 1

Write the null and alternate hypotheses:

Wording for H0 and HA

H0: The slope of the line is equal to zero.
HA: The slope of the line is not equal to zero.

9.7.1.2 NHST Step 2

Compute the test statistic.

The test statistic for the Wald test in OLS regression is the t-statistic.

Formula 9.5 : Formula for the for the Wald test in OLS regression

\[ \begin{align*} t = &\frac{b_{1}-0}{se_{b_{1}}} \\ t = &\frac{7.8190-0}{0.7734} = 10.11 \end{align*} \tag{9.5}\]

Note that the formula is the same as the formula for the one-sample t-test from Equation 6.1, but with the slope of the regression model instead of the mean. The t-statistic, that was computed manually in Equation 9.5 can also be found in the model output of Listing / Output 9.2.

9.7.1.3 NHST Step 3

Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The p-value of the slope in Listing / Output 9.2 is < 2e-16.

9.7.1.4 NHST Step 4

Conclude and write report.

The p-value is < 0.01 and therefore the null hypothesis is rejected in favor of the alternate hypothesis that the slope is not equal to zero.

Report 9.1: Interpretation of the linear regression model lm9.1 after NHST (first draft)

The percentage of uninsured residents in a county is a statistically significant predictor of the distance to the nearest syringe program (b = 7.82; p < .05) in our sample. For every 1% increase in uninsured residents in a county, the predicted distance to the nearest syringe program increases by 7.82 miles.

9.7.2 Computing confidence intervals

stats::confint() computes the confidence interval for the intercept and the slope.

R Code 9.18 : Confidence interval for regression parameters

Code

stats::confint(lm9.1)

#>                  2.5 %    97.5 %
#> (Intercept)  -7.512773 32.472391
#> no_insurance  6.299493  9.338435

Listing / Output 9.5: Confidence interval for regression parameters

The intercept is often reported but not interpreted because it does not usually contribute much to answering the research question.

Report 9.2: Interpretation of the linear regression model lm9.1 after statistical significance and confidence intervals (second draft)

The percentage of uninsured residents in a county is a statistically significant predictor of the distance to the nearest syringe program (b = 7.82; p < .05). For every 1% increase in uninsured residents in a county, the predicted distance to the nearest syringe program increases by 7.82 miles. The value of the slope in the sample is 7.82, and the value of the slope is likely between 6.30 and 9.34 in the population that the sample came from (95% CI: 6.30–9.34). With every 1% increase in uninsured residents in a county, the nearest syringe program is between 6.30 and 9.34 more miles away. These results suggest that counties with a larger percentage of uninsured are farther from this resource, which may exacerbate existing health disparities.

9.7.3 Making predcitions

Predicted values of $y$ are called y-hat and denoted $\hat{y}$. The stats::predict() function can be used to find the predicted values for all observations, or for a specific value of the independent variable.

Important 9.3: stats::predict() as a generic function

stats::predict() is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.

In our case we have to consult the help page of stats::predict.lm(). R knows which method to apply so just using stats::predict() is enough to invoke the correct computation. But for us users to know which arguments to apply we need the specified help page and not the explanation of the generic command.

Most prediction methods which are similar to those for linear models have an argument newdata specifying the first place to look for explanatory variables to be used for prediction.

Example 9.5 : Using the model to make prediction

Predict value
Predict all

R Code 9.19 : Predict distance for a county where 10% of people are uninsured

Code

stats::predict(
    lm9.1,
    newdata = data.frame(no_insurance = 10),
    interval = "confidence"
)

#>        fit      lwr      upr
#> 1 90.66945 82.42356 98.91534

Listing / Output 9.6: Predicted distance for a county where 10% of people are uninsured

The predicted distance to a syringe program from a county with 10% of people uninsured is 90.67 miles with a confidence interval for the prediction (sometimes called a prediction interval) of 82.42 to 98.92 miles.

R Code 9.20 : Find predictions for all observed values

Code

pred_all <- tibble::as_tibble(
    stats::predict(
        lm9.1,
        interval = "confidence"
    )
)

my_glance_data(pred_all)

#> # A tibble: 10 × 4
#>      obs   fit   lwr   upr
#>    <int> <dbl> <dbl> <dbl>
#>  1     1  52.4  39.2  65.5
#>  2    49 245.  218.  273. 
#>  3    74  93.0  84.9 101. 
#>  4   122  78.9  69.5  88.3
#>  5   146  87.5  79.0  96.1
#>  6   153  75.0  65.2  84.9
#>  7   228  78.2  68.7  87.6
#>  8   321  82.1  73.0  91.1
#>  9   485 160.  148.  173. 
#> 10   500  56.3  43.7  68.8

Listing / Output 9.7: Predicted values for all observed x

This is the same code as in Listing / Output 9.6 but without the newdata line.

9.8 Achievement 5: Model fit

9.8.1 Introduction

There is another p-value toward the bottom of Listing / Output 9.3. This p-value was from a test statistic that measures how much better the regression line is at getting close to the data points compared to the mean value of the outcome. Essentially, the question asked to produce this p-value is: Are the predicted values shown by the regression line in Figure 9.13 better than the mean value of the distance to the syringe program at capturing the relationship between no_insurance and dist_ssp?

Important 9.4: F-statistic for linear regression

Like the t-statistic is the test statistic for a t-test comparing two means, the F-statistic is the test statistic for linear regression comparing the regression line to the mean.

It is the same F-statistic that we have seen working with ANOVA in Chapter 7. ANOVA is actually a special type of linear model where all the predictors are categorical.

R Code 9.21 : Distance to mean versus regression line

Listing / Output 9.8: What is smaller? Sum of distances to regression line or distances to mean?

9.8.2 Understanding F-statistic

The F-statistic is a ratio of explained information (in the numerator) to unexplained information (in the denominator). If a model explains more than it leaves unexplained, the numerator is larger and the F-statistic is greater than 1. F-statistics that are much greater than 1 are explaining much more of the variation in the outcome than they leave unexplained. Large F-statistics are more likely to be statistically significant.

Formula 9.6 : F-statistic for the linear regression

\[ \begin{align*} F = \frac{\frac{\sum_{i=1}^{n}(\hat{y}-m_{y})^2}{p-1}}{\frac{\sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^2}{n-p}} \end{align*} \tag{9.6}\]

$i$: individual observation, in this case a county
$n$: sample size, or total number of counties, in this case 500
$p$: number of parameters in the model; slope and intercept are parameters
$y_{i}$: observed outcome of distance to syringe program for county $i$
$\hat{y_{i}}$: predicted value of distance to syringe program for county $i$
$m_{y}$: mean of the observed outcomes of distance to syringe program

Numerator: How much differ the predicted values from the observed mean on average. ($MS_{regression}$) Denominator: How much differ the predicted values from the actual observed values on average. ($MS_{residual}$)

The F-statistic is how much a predicted value differs from the mean value on average — which is explained variance, or how much better (or worse) the prediction is than the mean at explaining the outcome — divided by how much an observed value differs from the predicted value on average, which is the residual information or unexplained variance. Or: Explained variance divided by residual information resp. unexplained variance,

Sometimes, these relationships are referred to in similar terminology to ANOVA: the numerator is the $MS_{regression}$ (where MS stands for mean square) divided by the $MS_{residual}$.

Bullet List 9.2

: Features of the F-statistic

The F-statistic is always positive, due to the squaring of the terms in the numerator and denominator.
The F-statistic starts at 0 where the regression line is exactly the same as the mean.
The larger the F-statistic gets the more the model explains the variation in the outcome.
F-statistics with large values are less likely to occur when there is no relationship between the variables.
The shape of the F-distribution depends on the number of parameters in the statistical model and the sample size, which determine two degrees of freedom (df) values.

For instance in the last line of Listing / Output 9.3 we see that the value of the F-statistic is 102.2 with 1 (p - 1 = 2 - 1 = 1) and 498 (n - p = 500 - 2 = 498) df.

Example 9.6 : F-distributions

Examples for F-distributions
F-distribution lm9.1

R Code 9.22 : Examples of the distribution of probability density for $F$

Listing / Output 9.9: Examples of the distribution of probability density for $F$

This is the replication of Figure 9.15.

R Code 9.23 : F-distribution for model lm9.1

Listing / Output 9.10: F-distribution with 1 and 498 degrees of freedom for model of distance to syringe program by uninsured

This is the replication of Figure 9.16 of the book.

There is very little space under the curve from F = 102.2 to the right, which is consistent with the tiny p-value (p < .001) in Listing / Output 9.3.

9.8.3 Understanding $R^2$ measure of model fit

The measure¹ how well the model fits is $R^2$ or R-squared.

Bullet List 9.3

: Features of R-squared and adjusted R-squared

$R^2$ is the amount of variation in the outcome that the model explains and is reported as a measure of model fit.
When the model predicts values that are close to the observed values, the correlation is high and the $R^2$ is high.
To get the percentage of variance explained by the model, multiply $R^2$ by 100.
Subtracting $R^2$ from 1 (1 – $R^2$) and multiplying by 100 for a percent will give the percent of variance not explained by the model.
The value of $R^2$ tends to increase with each additional variable added to the model, whether the variable actually improves the model or not.
Adjusted $R^2$ ($R^2_{adj}$) penalizes the value of $R^2$ a small amount for each additional variable added to the model to ensure that the only increases when the additional predictors explain a notable amount of the variation in the outcome.

$R^2$ is computed by squaring the value of the correlation between the observed distance to syringe programs in the 500 counties and the values of distance to syringe programs predicted for the 500 counties by the model.

For the relationship between uninsured percentage and distance to syringe program in Listing / Output 9.3, the $R^2$ is 0.1703. To get the percentage of variance explained by the model, multiply by 100, so 17.03% of the variation in distance to syringe programs is explained by the percentage of uninsured people living in a county.

Important 9.5: $R^2_{adj}$ is more commonly reported than $R^2$

9.8.4 Reproting linear regression results

Bullet List 9.4

: Simple linear regression analysis results to report

Interpretation of the value of the slope (b)
Significance of the slope (t and p, confidence intervals)
Significance of the model (F and p)
Model fit ($R^2$ or better $R_{adj}^2$)

The following report is for our linear model lm9.1 example takein into account information from the

summary of the computed model (Listing / Output 9.3) and the
computed confidence intervals (Listing / Output 9.5).

Report 9.3: Interpretation of the linear regression model lm9.1 after model fit (third draft)

A simple linear regression analysis found that the percentage of uninsured residents in a county is a statistically significant predictor of the distance to the nearest syringe program (b = 7.82; p < .001). For every 1% increase in uninsured residents, the predicted distance to the nearest syringe program increases by 7.82 miles. The value of the slope is likely between 6.30 and 9.34 in the population that the sample came from (95% CI: 6.30–9.34). With every 1% increase in uninsured residents in a county, there is likely a 6.30- to 9.34-mile increase to the nearest syringe program. The model was statistically significantly better than the baseline model (the mean of the distance to syringe program) at explaining distance to syringe program [F(1, 498) = 102.2; p < .001] and explained 16.86% of the variance in the outcome ($R_{adj}^2$ = .1686). These results suggest that counties with lower insurance rates are farther from this resource, which may exacerbate existing health disparities.

9.9 Achievement 6: Conducting diagnostics

9.9.1 Introduction

Bullet List 9.5

: Assumptions of simple linear regression

Observations are independent.
The outcome is continuous.
The relationship between the two variables is linear (linearity).
The variance is constant with the points distributed equally around the line (homoscedasticity).
The residuals are independent.
The residuals are normally distributed.
There is no perfect multicollinearity. (Only valid for multiple regression for continuous variables, see Section 9.10.5.)

9.9.2 Linearity

R Code 9.24 : Check linearity with loess curve

Listing / Output 9.11: Checking the linearity assumption

The linearity assumption is met if a scatterplot of the two variables shows a relationship that falls along a line. An example where this assumption is met in the author’s opinion is Graph 8.3. The route of the loess curve follows not exact the regression line, but there are only small deviations from the regression line.

The situation in Listing / Output 9.11 is a little different: Here the loess line is curved and shows deviation from the linear relationship especially at the lower level of the predictor but also at the zeniths of the two arcs. Harris therefore decided that in Listing / Output 9.11 the linearity assumption has failed.

Conclusion: The linearity assumption is not met.

TODO: How to test the linearity assumption more objectively?

I think that linearity judgements on the basis of visualizations are quite subjective. It would be nice to have a more objective test, similar to the Breusch-Pagan test for the homoscedasticity assumption in Listing / Output 9.12 or generally the Shapiro-Wilk or Anderson-Darling tests for the normality assumption.

Diagnostics are very important for linear regression models. I have not (yet) much experience but I am pretty sure there exist some tests for the linearity assumption of linear models. I just have to look and find them!

TODO 9.2: Look for packages with linearity test functions

9.9.3 Checking homoscedasticity

Example 9.7 : Checking homoscedasticity

Breusch-Pagan
predicted values vs. residuals

R Code 9.25 : Check homoscedasticity with Breusch-Pagan test

Code

lmtest::bptest(formula = lm9.1)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  lm9.1
#> BP = 46.18, df = 1, p-value = 1.078e-11

Listing / Output 9.12: Checking the homoscedasticity with the Breusch-Pagan test

The assumption of homoscedasticity requires the data points are evenly distributed around the regression line. One way would be to check again Listing / Output 9.11: The points seem closer to the line on the far left and then are more spread out around the line at the higher values of percentage uninsured. Zhis is an indicator that the assumption has failed.

But the Breusch-Pagan test is another – more objective – way to test the homoscedasticity assumption. The tiny p-value confirm our first impression: The Null assumption, that the data points are evenly distributed around the regression line has to be rejected.

Conclusion: The homoscedasticity assumption is not met.

R Code 9.26 : Check homoscedasticity assumption plotting predicted values vs. residuals

Listing / Output 9.13: Predicted values and residuals from linear regression model of distance to syringe program by percentage uninsured in a county

Another (third) way to examine the constant error variance assumption is plotting the predicted values versus the residuals. In Listing / Output 9.13 a dashed line is shown to indicate no relationship between the fitted (or predicted) values and the residuals, which would be the ideal situation to meet the assumption. This line is a helpful reference point for looking at these types of graphs.

For the homoscedasticity assumption to be met, the points should be roughly evenly distributed around the dashed line with no clear patterns. It should look like a cloud or random points on the graph with no distinguishable patterns.

In Listing / Output 9.13, there is a clear pattern. Under the dashed line the distribution of the points suggest a negative correlation whereas above the line the points are distributed more randomly. This finding confirms the Breusch-Pagan test result.

Conclusion: The homoscedasticity assumption of the residuals is not met.

9.9.4 Independent residuals

R Code 9.27 : Testing the independence of residuals with the Durbin-Watson test

Code

lmtest::dwtest(formula = lm9.1)

#> 
#>  Durbin-Watson test
#> 
#> data:  lm9.1
#> DW = 2.0103, p-value = 0.5449
#> alternative hypothesis: true autocorrelation is greater than 0

Listing / Output 9.14: Testing the independence of residuals with the Durbin-Watson test

Residuals have to be independent or unrelated to each other. Residuals that are independent do not follow a pattern. Conceptually, residuals are the variation in the outcome that the regression line does not explain. If the residuals form a pattern then the regression model is doing better for certain types of observations and worse for others.

The independence assumption of the residuals can be checked with the Durbin-Watson test. A Durbin-Watson (DW, or D-W) statistic of 2 indicates perfectly independent residuals, meaning that the null hypothesis of the test (that the residuals are independet) cannot be rejected. An this in fact the case in Listing / Output 9.14, as the DW-value is very near to 2.

Conclusion: The independence of residuals assumption is met.

9.9.5 Normality of residuals

Example 9.8 : Testing the normality of residuals assumption

hist
q-q plot

R Code 9.28 : Histogramm of residuals

Listing / Output 9.15: Testing the normality of residuals assumption with a histogram of the residuals

The histogram with the overlaid density curve of the theoretical normal distribution in Listing / Output 9.15 shows a right skewed distribution. This is an indicator that the normality assumption is not met.

Conclusion: The normality assumption of the residuals is not met

R Code 9.29 : Checking normality assumption of the residuals with a Q-Q plot

Listing / Output 9.16: Testing the normality of residuals assumption with a q-q plot

The q-q plot in Listing / Output 9.16 shows some deviation form the normal distribution. Although it is again a quite subjective decision (See TODO 9.2) it seems that the assumption of normality has failed.

Conclusion: The normality assumption of the residuals is not met.

R Code 9.30 : Checking normality assumption of the residuals with the Shapiro-Wilk normality test

Code

stats::shapiro.test(df_lm$lm_residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  df_lm$lm_residuals
#> W = 0.94357, p-value = 7.437e-13

Listing / Output 9.17: Checking normality assumption of the residuals of linear model lm9.1 with the Shapiro-Wilk normality test

This method is not mentioned, but I believe that you can also use the Shapiro-Wilk test for testing the normal distribution of residuals. At least the very tiny p-value rejects the Null (that the residuals have a normal distribution) and suggest the same conclusion as the other two (visually decided) tests.

Conclusion: The normality assumption of the residuals is not met.

9.9.6 Interpreting the assumption tests

Bullet List 9.6

: Checking simple linear regression assumptions for model lm9.1m

Observations are independent. No: Counties in the same state are not really independent. (This is a general problem for geographical data analysis.)
The outcome is continuous. Yes: The distance to a syringe program is measured in miles and can take any value of zero or higher.
The relationship between the two variables is linear (linearity). No: See: Section 9.9.2
The variance is constant with the points distributed equally around the line (homoscedasticity). No: See Listing / Output 9.12 and Listing / Output 9.13.
The residuals are independent. Yes: See Section 9.9.4.
The residuals are normally distributed. No: See Listing / Output 9.15 and Listing / Output 9.16.

Because linear model lm9.1 does not meet all the assumptions, the model has to be considered biased and should be interpreted with caution. Specifically, the results of a biased model are not usually considered generalizable to the population.

9.9.7 Using models diagnostics

We are not quite done checking model quality using diagnostics. In addition to testing assumptions, we need to identify problematic observations: outliers or other influential observations.

There are several measures:

standardized residuals,
df-betas,
Cook’s distance, and
leverage.

One good strategy for identifying the truly problematic observations is to identify those observations that are outliers or influential observations on two or more of these four measures.

Example 9.9 : Using models diagnostics to find outliers and influential values

R Code 9.31 : Standardize residuals to find outliers with stats::rstandard()

Code

df2_lm <- df_lm |> 
    dplyr::mutate(
        standard_res = stats::rstandard(model = lm9.1),
        predicted = stats::predict(object = lm9.1)
    ) 

save_data_file("chap09", df2_lm, "df2_lm.rds")

lm9.1_std_res <- df2_lm |> 
    dplyr::filter(base::abs(standard_res) > 1.96) |> 
    dplyr::select(c(county, state, dist_ssp, 
                    no_insurance, standard_res, predicted)) |> 
    dplyr::arrange(standard_res)

print(lm9.1_std_res, n = 30)

#> # A tibble: 26 × 6
#>    county           state dist_ssp no_insurance standard_res predicted
#>    <chr>            <fct>    <dbl>        <dbl>        <dbl>     <dbl>
#>  1 gaines county    TX       49.7          32.6        -2.58     267. 
#>  2 wyandotte county KS        7.25         21          -1.98     177. 
#>  3 harlan county    NE      266.           10.9         1.96      97.7
#>  4 caledonia county VT      224.            5.3         1.99      53.9
#>  5 meade county     SD      268.           10.7         2.00      96.1
#>  6 brazos county    TX      305.           14.1         2.12     123. 
#>  7 falls county     TX      343            18.9         2.13     160. 
#>  8 lampasas county  TX      326.           16.2         2.18     139. 
#>  9 pawnee county    KS      262             7.9         2.19      74.2
#> 10 kiowa county     KS      272.            8.9         2.21      82.1
#> 11 webb county      TX      436            30.2         2.21     249. 
#> 12 gonzales county  TX      386.           22.6         2.31     189. 
#> 13 lincoln county   ME      303.           11.4         2.35     102. 
#> 14 burnet county    TX      342            15.8         2.40     136. 
#> 15 lee county       TX      339.           14.6         2.48     127. 
#> 16 garfield county  NE      300             8.8         2.55      81.3
#> 17 starr county     TX      510            35.1         2.66     287. 
#> 18 duval county     TX      461.           27.7         2.72     229. 
#> 19 kennebec county  ME      316.            8.5         2.76      78.9
#> 20 waldo county     ME      344.           11.9         2.78     106. 
#> 21 comal county     TX      367.           13.9         2.86     121. 
#> 22 coryell county   TX      341.           10.2         2.90      92.2
#> 23 dewitt county    TX      388.           14.8         3.03     128. 
#> 24 guadalupe county TX      387.           14.5         3.04     126. 
#> 25 jim wells county TX      456            21           3.26     177. 
#> 26 brooks county    TX      487            23.5         3.41     196.

Listing / Output 9.18: Standardizing residuals to find outliers with stats::rstandard

Standardized residuals are z-scores for the residual values.

Caution 9.1: Filter counties with absolute values of the residuals greater than 1.96

In my first trial, I forgot to filter using the absolute standardized residual values. I got a list of 24 counties – the two counties at the top of the list with negative values were missing.

That there are only two counties with negative values means that most counties had 2018 distances to syringe programs that were farther away than predicted. Another observation: Most of the outlier counties are situated in Texas, including one that is nearer than predicted.

There are 26 counties with outlier values, e.g., standardized residuals with an absolute value greater than 1.96.

R Code 9.32 : Standardize residuals manually to find outliers

Code

df2a_lm <- df_lm |> 
    dplyr::mutate(
        standard_res = (lm_residuals - 
            base::mean(lm_residuals)) / stats::sd(lm_residuals),
        predicted = stats::predict(object = lm9.1)
    ) 

lm9.1_outlier2 <- df2a_lm |> 
    dplyr::filter(base::abs(standard_res) > 1.96) |> 
    dplyr::select(c(county, state, dist_ssp, 
                    no_insurance, standard_res, predicted)) |> 
    dplyr::arrange(standard_res)

print(lm9.1_outlier2, n = 30)

#> # A tibble: 26 × 6
#>    county           state dist_ssp no_insurance standard_res predicted
#>    <chr>            <fct>    <dbl>        <dbl>        <dbl>     <dbl>
#>  1 gaines county    TX       49.7          32.6        -2.54     267. 
#>  2 wyandotte county KS        7.25         21          -1.97     177. 
#>  3 harlan county    NE      266.           10.9         1.96      97.7
#>  4 caledonia county VT      224.            5.3         1.99      53.9
#>  5 meade county     SD      268.           10.7         2.00      96.1
#>  6 brazos county    TX      305.           14.1         2.12     123. 
#>  7 falls county     TX      343            18.9         2.13     160. 
#>  8 lampasas county  TX      326.           16.2         2.18     139. 
#>  9 webb county      TX      436            30.2         2.18     249. 
#> 10 pawnee county    KS      262             7.9         2.19      74.2
#> 11 kiowa county     KS      272.            8.9         2.21      82.1
#> 12 gonzales county  TX      386.           22.6         2.30     189. 
#> 13 lincoln county   ME      303.           11.4         2.35     102. 
#> 14 burnet county    TX      342            15.8         2.40     136. 
#> 15 lee county       TX      339.           14.6         2.48     127. 
#> 16 garfield county  NE      300             8.8         2.55      81.3
#> 17 starr county     TX      510            35.1         2.60     287. 
#> 18 duval county     TX      461.           27.7         2.70     229. 
#> 19 kennebec county  ME      316.            8.5         2.76      78.9
#> 20 waldo county     ME      344.           11.9         2.78     106. 
#> 21 comal county     TX      367.           13.9         2.86     121. 
#> 22 coryell county   TX      341.           10.2         2.90      92.2
#> 23 dewitt county    TX      388.           14.8         3.03     128. 
#> 24 guadalupe county TX      387.           14.5         3.04     126. 
#> 25 jim wells county TX      456            21           3.25     177. 
#> 26 brooks county    TX      487            23.5         3.39     196.

Listing / Output 9.19: Find outliers with standardized residuals computed manually

With Listing / Output 9.19 I have standardized residuals manually through (lm_residuals - base::mean(lm_residuals)) / stats::sd(lm_residuals). There are small rounding differences in some counties but the list of outlier counties is identical.

R Code 9.33 : Using dfbeta to find influential values

Code

df3a_lm <- df2_lm |> 
    dplyr::mutate(
        dfbeta_intercept = stats::dfbeta(model = lm9.1)[, 1],
        dfbeta_slope = stats::dfbeta(model = lm9.1)[, 2]
    ) 

save_data_file("chap09", df3a_lm, "df3a_lm.rds")

lm9.1_dfbeta <- df3a_lm |> 
    dplyr::filter(base::abs(dfbeta_intercept) > 2 | 
                      base::abs(dfbeta_slope) > 2) |> 
    dplyr::select(c(county, state, dist_ssp, 
                    no_insurance, predicted, 
                    dfbeta_intercept, dfbeta_slope))

print(lm9.1_dfbeta, n = 30)

#> # A tibble: 7 × 7
#>   county     state dist_ssp no_insurance predicted dfbeta_intercept dfbeta_slope
#>   <chr>      <fct>    <dbl>        <dbl>     <dbl>            <dbl>        <dbl>
#> 1 webb coun… TX       436           30.2      249.            -3.04        0.282
#> 2 starr cou… TX       510           35.1      287.            -4.82        0.434
#> 3 hendry co… FL        84.3         29.8      245.             2.55       -0.236
#> 4 presidio … TX       171.          35.9      293.             2.75       -0.247
#> 5 brooks co… TX       487           23.5      196.            -2.70        0.270
#> 6 gaines co… TX        49.7         32.6      267.             4.10       -0.374
#> 7 duval cou… TX       461.          27.7      229.            -3.15        0.298

Listing / Output 9.20: Computing dfbeta to find influential values with dfbeta > 2

Computing dfbeta is a method to find influential values. df_beta removes each observation from the data frame, conducts the analysis again, and compares the intercept and slope for the model with and without the observation. Observations with high dfbeta values may be influencing the model. The book recommends a cutoff of 2.

Warning 9.1: Difference between dfbeta and dfbetas

Reading other resources (Zach 2020; Goldstein-Greenwood 2022) I noticed that they used another threshold: Instead of using 2 they recommended $2 / \sqrt{n}$, where n is the sample size. But this would result in a cutoff value of $2 / \sqrt{500}$ = 0.0894427 with the absurd high number of 343 counties as influential values.

But later I became aware that these resources are talking of dfbetas (plural) and not of dfbeta (singular). Between these two measure there is a big difference: dfbetas are the standardized values, whereas dfbeta values depend on the scale of the data.

Consequently, some analysts opt to standardize DFBETA values—in which case they’re referred to as DFBETAS — by dividing each DFBETA by the standard error estimate for $\hat{\beta_{j}}$ Detecting Influential Points in Regression with DFBETA(S) (Goldstein-Greenwood 2022)

The result is a list of seven counties, six situated in Texas and one in Florida. All of them are listed because of higher absolute dfbeta intercept values.

R Code 9.34 : Using dfbetas to find influential values

Code

df3b_lm <- df3a_lm |> 
    dplyr::mutate(
        dfbetas_intercept = stats::dfbetas(model = lm9.1)[, 1],
        dfbetas_slope = stats::dfbetas(model = lm9.1)[, 2]
    ) 

save_data_file("chap09", df3b_lm, "df3b_lm.rds")

lm9.1_dfbetas <- df3b_lm |> 
    dplyr::filter(base::abs(dfbetas_intercept) > (2 / base::sqrt(dplyr::n())) | 
                  base::abs(dfbetas_slope) > (2 / base::sqrt(dplyr::n()))) |> 
    dplyr::select(c(county, state, dist_ssp, 
                    no_insurance, predicted, 
                    dfbetas_intercept, dfbetas_slope))

print(lm9.1_dfbetas, n = 50)

#> # A tibble: 40 × 7
#>    county  state dist_ssp no_insurance predicted dfbetas_intercept dfbetas_slope
#>    <chr>   <fct>    <dbl>        <dbl>     <dbl>             <dbl>         <dbl>
#>  1 gilles… TX      313            18.2     155.            -0.0617        0.100 
#>  2 webb c… TX      436            30.2     249.            -0.300         0.365 
#>  3 garfie… NE      300             8.8      81.3            0.116        -0.0782
#>  4 starr … TX      510            35.1     287.            -0.476         0.564 
#>  5 hendry… FL       84.3          29.8     245.             0.251        -0.307 
#>  6 scott … KS      210             4.1      44.5            0.164        -0.142 
#>  7 addiso… VT      166.            5.2      53.1            0.0992       -0.0830
#>  8 bennin… VT      149.            4.8      50.0            0.0911       -0.0772
#>  9 mcpher… KS      173.            6        59.4            0.0908       -0.0739
#> 10 presid… TX      171.           35.9     293.             0.271        -0.320 
#> 11 wyando… KS        7.25         21       177.             0.113        -0.158 
#> 12 concor… LA       71.3          26.4     219.             0.178        -0.224 
#> 13 antelo… NE      245.            8.6      79.7            0.0906       -0.0625
#> 14 coryel… TX      341.           10.2      92.2            0.0978       -0.0522
#> 15 hill c… TX      326.           19.7     167.            -0.0856        0.127 
#> 16 orange… VT      202.            6.1      60.2            0.113        -0.0912
#> 17 mitche… KS      198.            6.6      64.1            0.0992       -0.0786
#> 18 jack c… TX      296.           18.9     160.            -0.0620        0.0959
#> 19 suffol… MA      173.            4.4      46.9            0.121        -0.104 
#> 20 essex … MA      191.            3.5      39.8            0.158        -0.139 
#> 21 pawnee… KS      262             7.9      74.2            0.116        -0.0849
#> 22 carrol… IA      152.            3.3      38.3            0.121        -0.107 
#> 23 jim we… TX      456            21       177.            -0.187         0.263 
#> 24 kinney… TX      291            22.8     191.            -0.0845        0.113 
#> 25 llano … TX      318            17.8     152.            -0.0584        0.0986
#> 26 somerv… TX      329.           19.5     165.            -0.0848        0.127 
#> 27 kenneb… ME      316.            8.5      78.9            0.132        -0.0922
#> 28 dallam… TX      106.           24.9     207.             0.106        -0.136 
#> 29 gonzal… TX      386.           22.6     189.            -0.163         0.219 
#> 30 kiowa … KS      272.            8.9      82.1            0.0983       -0.0657
#> 31 lamb c… TX       54            21       177.             0.0816       -0.114 
#> 32 brooks… TX      487            23.5     196.            -0.268         0.353 
#> 33 gaines… TX       49.7          32.6     267.             0.405        -0.486 
#> 34 el pas… TX       33.5          23.8     199.             0.156        -0.204 
#> 35 maveri… TX      330            31.1     256.            -0.126         0.152 
#> 36 bosque… TX      339            20.7     174.            -0.105         0.149 
#> 37 falls … TX      343            18.9     160.            -0.0839        0.130 
#> 38 duval … TX      461.           27.7     229.            -0.312         0.387 
#> 39 caledo… VT      224.            5.3      53.9            0.149        -0.124 
#> 40 hamilt… TX      314.           17.4     149.            -0.0514        0.0908

Listing / Output 9.21: Computing dfbetas to find influential values with dfbetas > $ 2 / (500)$

Instead of seven we got now a bigger list with 40 counties.

R Code 9.35 : Using Cook’s distance to find influential values

Code

df4_lm <- df3b_lm |> 
    dplyr::mutate(
        cooks_d = stats::cooks.distance(model = lm9.1)
    ) 

save_data_file("chap09", df4_lm, "df4_lm.rds")

lm9.1_cooks_d <- df4_lm |> 
    dplyr::filter(cooks_d > (4 / dplyr::n())) |> 
    dplyr::select(c(county, state, dist_ssp, 
                    no_insurance, predicted, 
                    cooks_d))

print(lm9.1_cooks_d, n = 50)

#> # A tibble: 32 × 6
#>    county           state dist_ssp no_insurance predicted cooks_d
#>    <chr>            <fct>    <dbl>        <dbl>     <dbl>   <dbl>
#>  1 gillespie county TX      313            18.2     155.  0.00845
#>  2 webb county      TX      436            30.2     249.  0.0713 
#>  3 garfield county  NE      300             8.8      81.3 0.00954
#>  4 starr county     TX      510            35.1     287.  0.165  
#>  5 hendry county    FL       84.3          29.8     245.  0.0505 
#>  6 scott county     KS      210             4.1      44.5 0.0137 
#>  7 presidio county  TX      171.           35.9     293.  0.0533 
#>  8 wyandotte county KS        7.25         21       177.  0.0164 
#>  9 concordia parish LA       71.3          26.4     219.  0.0281 
#> 10 coryell county   TX      341.           10.2      92.2 0.00977
#> 11 hill county      TX      326.           19.7     167.  0.0115 
#> 12 essex county     MA      191.            3.5      39.8 0.0127 
#> 13 pawnee county    KS      262             7.9      74.2 0.00838
#> 14 dewitt county    TX      388.           14.8     128.  0.0118 
#> 15 jim wells county TX      456            21       177.  0.0446 
#> 16 llano county     TX      318            17.8     152.  0.00862
#> 17 somervell county TX      329.           19.5     165.  0.0117 
#> 18 kennebec county  ME      316.            8.5      78.9 0.0118 
#> 19 dallam county    TX      106.           24.9     207.  0.0107 
#> 20 gonzales county  TX      386.           22.6     189.  0.0291 
#> 21 guadalupe county TX      387.           14.5     126.  0.0113 
#> 22 lamb county      TX       54            21       177.  0.00860
#> 23 brooks county    TX      487            23.5     196.  0.0727 
#> 24 comal county     TX      367.           13.9     121.  0.00918
#> 25 burnet county    TX      342            15.8     136.  0.00885
#> 26 gaines county    TX       49.7          32.6     267.  0.124  
#> 27 el paso county   TX       33.5          23.8     199.  0.0245 
#> 28 maverick county  TX      330            31.1     256.  0.0124 
#> 29 bosque county    TX      339            20.7     174.  0.0147 
#> 30 falls county     TX      343            18.9     160.  0.0129 
#> 31 duval county     TX      461.           27.7     229.  0.0816 
#> 32 caledonia county VT      224.            5.3      53.9 0.0116

Listing / Output 9.22: Using Cook’s distance to find influential values

Cook’s D is computed in a very similar way as dfbeta(s). That is, each observation is removed and the model is re-estimated without it. Cook’s D then combines the differences between the models with and without an observation for all the parameters together instead of one at a time like the dfbeta(s).

The cutoff for a high Cook’s D value is usually $4/n$.

Thirty-two counties had a high Cook’s D. Most were in Texas (TX), but a few were in Maine (ME), Massachusetts (MA), and Vermont (VT).

R Code 9.36 : Using leverage to find influential values

Code

df5_lm <- df4_lm |> 
    dplyr::mutate(
        leverage = stats::hatvalues(model = lm9.1)
    ) 

save_data_file("chap09", df5_lm, "df5_lm.rds")

lm9.1_leverage <- df5_lm |> 
    dplyr::filter(leverage > 2 * 2 / dplyr::n()) |> 
    dplyr::select(c(county, state, dist_ssp, 
                    no_insurance, predicted, 
                    leverage))

print(lm9.1_leverage, n = 50)

#> # A tibble: 28 × 6
#>    county                state dist_ssp no_insurance predicted leverage
#>    <chr>                 <fct>    <dbl>        <dbl>     <dbl>    <dbl>
#>  1 berkshire county      MA      104.            3        35.9  0.00883
#>  2 webb county           TX      436            30.2     249.   0.0283 
#>  3 starr county          TX      510            35.1     287.   0.0446 
#>  4 hendry county         FL       84.3          29.8     245.   0.0271 
#>  5 val verde county      TX      248.           22.8     191.   0.0111 
#>  6 presidio county       TX      171.           35.9     293.   0.0476 
#>  7 wyandotte county      KS        7.25         21       177.   0.00830
#>  8 caldwell parish       LA      116.           21.2     178.   0.00859
#>  9 concordia parish      LA       71.3          26.4     219.   0.0184 
#> 10 essex county          MA      191.            3.5      39.8  0.00811
#> 11 carroll county        IA      152.            3.3      38.3  0.00839
#> 12 jim wells county      TX      456            21       177.   0.00830
#> 13 kinney county         TX      291            22.8     191.   0.0111 
#> 14 nicollet county       MN       17.8           3.5      39.8  0.00811
#> 15 sabine county         TX      168.           22.1     185.   0.00997
#> 16 bremer county         IA       86.7           3.4      39.1  0.00825
#> 17 kodiak island borough AK      150            22.7     190.   0.0110 
#> 18 dallam county         TX      106.           24.9     207.   0.0151 
#> 19 gonzales county       TX      386.           22.6     189.   0.0108 
#> 20 lamb county           TX       54            21       177.   0.00830
#> 21 brooks county         TX      487            23.5     196.   0.0124 
#> 22 jefferson county      TX      177.           21.5     181.   0.00903
#> 23 gaines county         TX       49.7          32.6     267.   0.0358 
#> 24 dawson county         TX       77            20.8     175.   0.00802
#> 25 el paso county        TX       33.5          23.8     199.   0.0129 
#> 26 maverick county       TX      330            31.1     256.   0.0310 
#> 27 brantley county       GA      226.           24.4     203.   0.0141 
#> 28 duval county          TX      461.           27.7     229.   0.0215

Listing / Output 9.23: Using leverage to find influential values

Leverage is determined by the difference between the value of an observation for a predictor and the mean value of the predictor. The farther away an observed value is from the mean for a predictor, the more likely the observation will be influential. Leverage values range between 0 and 1. A threshold of $2p / n$ is often used with p being the number of parameters and n being the sample size.

The leverage values to find influential states are computed by using the stats::hatvalues() function, because the predicted value of $y$ is often depicted as $\hat{y}$.

This time we got a list of 28 counties.

9.9.8 Summarizing outliers and influential values

It would be useful to have all the counties identified by at least two of these four measures in a single list or table to more easily see all the counties that seemed problematic.

Example 9.10 : Summarizing outliers and influential values

R Code 9.37 : Summarizing diagnostics for outliers and influential values with dfbeta

Code

df_lm_diag1 <- df5_lm |> 
    dplyr::mutate(outliers = 
       as.numeric(x = leverage > 2 * 2 / dplyr::n()) +
       as.numeric(x = cooks_d > 4 / dplyr::n()) +
       as.numeric(x = abs(x = dfbeta_intercept) > 2) +
       as.numeric(x = abs(x = dfbeta_slope) > 2) +
       as.numeric(x = abs(x = standard_res) > 1.96)) |> 
    dplyr::filter(outliers >= 2) |> 
    dplyr::select(county, state, dist_ssp, 
                  no_insurance, predicted, outliers) |> 
    dplyr::mutate(county = stringr::word(county, 1))
    
df_print1 <- df_lm_diag1 |> 
    DT::datatable(
        class = "compact",
        options = list(
            paging = FALSE,
            searching = FALSE
            ),
        filter = "top",
        colnames = c("ID" = 1)
    ) |> 
    DT::formatRound(c("dist_ssp", "predicted"), 2) |> 
    DT::formatRound("no_insurance", 1)
    

df_print1

Listing / Output 9.24: Listing of all counties with influential values at least in two tests (using dfbeta like in the book)

Here I have use the threshold for the non-standardized dfbeta values, which in my opinion is the wrong option.

I have used the R datatable package {DT} (see: Section A.19), so you can experiment with the listed counties, for example sorting after certain columns or filtering certain states.

R Code 9.38 : Summarizing diagnostics for outliers and influential values with dfbetas

Code

df_lm_diag2 <- df5_lm |> 
    dplyr::mutate(outliers = 
       as.numeric(x = leverage > 2 * 2 / dplyr::n()) +
       as.numeric(x = cooks_d > 4 / dplyr::n()) +
       as.numeric(x = abs(x = dfbetas_intercept) > 2 / sqrt(dplyr::n())) +
       as.numeric(x = abs(x = dfbetas_slope) > 2 / sqrt(dplyr::n())) +
       as.numeric(x = abs(x = standard_res) > 1.96)) |> 
    dplyr::filter(outliers >= 2) |> 
    dplyr::select(county, state, dist_ssp, 
                  no_insurance, predicted, outliers)  |> 
    dplyr::mutate(county = stringr::word(county, 1))
    
df_print2 <- df_lm_diag2 |> 
    DT::datatable(
        class = "compact",
        options = list(
            paging = FALSE,
            searching = FALSE
            ),
        filter = "top",
        colnames = c("ID" = 1)
    ) |> 
    DT::formatRound(c("dist_ssp", "predicted"), 2) |> 
    DT::formatRound("no_insurance", 1)
    

df_print2

Listing / Output 9.25: Listing of all counties with influential values at least in two tests (using dfbetas, different from the book)

Here I have use the threshold for the standardized dfbetas values, which in my opinion is the correct option. Although there is a big difference in the number of counties with cutoff dfbeta (7 counties) to dfbetas(40 counties) in the end the difference is only 11 counties. The listing shows some counties exceeding the threshold 5 times. The reason is that counties can have outliers corresponding to dfbetas intercept and dfbetas slope.

I have used the R datatable package {DT} (see: Section A.19), so you can experiment with the listed counties, for example sorting after certain columns or filtering certain states. If you sort for instance the distances to the next syringe program you will see that there are some counties with similar values that are not included in Listing / Output 9.24, like kiowa county with similar values as garfied county.

R Code 9.39 : Difference between dfbeta and dfbetas results

Code

county_join <- dplyr::full_join(df_lm_diag1, df_lm_diag2, by = c("county", "state"))
county_diff <- county_join |> dplyr::filter(is.na(dist_ssp.x)) |> 
    dplyr::select(-(3:6)) |> 
    dplyr::arrange(desc(dist_ssp.y))

print(county_diff, n = 20)

#> # A tibble: 11 × 6
#>    county    state dist_ssp.y no_insurance.y predicted.y outliers.y
#>    <chr>     <fct>      <dbl>          <dbl>       <dbl>      <dbl>
#>  1 bosque    TX          339            20.7       174.           3
#>  2 somervell TX          329.           19.5       165.           2
#>  3 hill      TX          326.           19.7       167.           2
#>  4 llano     TX          318            17.8       152.           2
#>  5 gillespie TX          313            18.2       155.           2
#>  6 kinney    TX          291            22.8       191.           2
#>  7 kiowa     KS          272.            8.9        82.1          2
#>  8 scott     KS          210             4.1        44.5          3
#>  9 orange    VT          202.            6.1        60.2          2
#> 10 suffolk   MA          173.            4.4        46.9          2
#> 11 carroll   IA          152.            3.3        38.3          3

Listing / Output 9.26: Additional counties calculated with the standardized dfbetas (and not dfbeta)

These 11 counties have not been included in the book’s version because of a different method (using unstandardized dfbeta with an undocumented threshold of 2 versus standardized dfbetas with the in several resource documented cutoff $2 / \sqrt{n = samplesize}$.

When an observation is identified as an outlier or influential value, it is worth looking at the data to see if there is anything that seems unusual. Sometimes, the observations are just different from the rest of the data, and other times, there is a clear data entry or coding error.

Report 9.4: Interpretation of the linear regression model lm9.1 after diagnostics (final version)

A simple linear regression analysis found that the percentage of uninsured residents in a county is a statistically significant predictor of the distance to the nearest syringe program (b = 7.82; p < .001). For every 1% increase in uninsured residents, the distance to the nearest syringe program is expected to increase by 7.82 miles. The value of the slope is likely between 6.30 and 9.34 in the population that the sample came from (95% CI: 6.30–9.34), so with every 1% increase in uninsured residents, there is likely a 6.30- to 9.34-mile increase in distance to the nearest syringe program. The model was statistically significantly better than the baseline at explaining distance to syringe program (F[1, 498] = 102.2; p < .001) and explained 16.86% of the variance in the outcome ($R_{adj}^2$ = .1686). These results suggest that counties with lower insurance rates are farther from this type of resource, which may exacerbate existing health disparities.

An examination of the underlying assumptions found that the data fail several of the assumptions for linear regression, and so the model should be interpreted with caution; the results do not necessarily generalize to other counties beyond the sample. In addition, regression diagnostics found a number of counties that were outliers or influential observations. Many of these counties were in Texas, which may suggest that counties in Texas are unlike the rest of the sample.

9.10 Achievement 7: Adding variables and transformation

9.10.1 Adding a binary variable `metro` to the model

Example 9.11 : Adding binary variable metro to the model

Compute lm9.2
Plot lm9.2

R Code 9.40 : Linear regression: Distance to syringe program by uninsured percent and metro status

Code

lm9.2 <- stats::lm(formula = dist_ssp ~ no_insurance + metro,
                   data = distance_ssp_clean)

save_data_file("chap09", lm9.2, "lm9.2.rds")

base::summary(lm9.2)

#> 
#> Call:
#> stats::lm(formula = dist_ssp ~ no_insurance + metro, data = distance_ssp_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -219.80  -60.07  -18.76   48.33  283.96 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      3.4240    10.3621   0.330 0.741212    
#> no_insurance     7.3005     0.7775   9.389  < 2e-16 ***
#> metronon-metro  28.0525     7.7615   3.614 0.000332 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 84.89 on 497 degrees of freedom
#> Multiple R-squared:  0.1915, Adjusted R-squared:  0.1883 
#> F-statistic: 58.88 on 2 and 497 DF,  p-value: < 2.2e-16

Listing / Output 9.27: Linear regression: Distance to syringe program by uninsured percent and metro status

Small p-values indicate that percentage uninsured and metro status both statistically significantly help to explain the distance to a syringe program.
The model is statistically significant, with an F-statistic of F(2, 497) = 58.88 and a p-value of < .001.
$R_{adj}^2$ indicates that 18.83% of the variation in distance to syringe program is accounted for by this model with no_insurance and metro in it. This is somewhat higher than the $R_{adj}^2$ of 16.86 from the simple linear model with just no_insurance in it.
The coefficient for no_insurance of 7.30 means that for a county with 1% more uninsured people the distance to SSP grows by 7.30 miles.
metronon-metro is confusing because it combines two aspects: The variable name (here metro) and the category to which the coefficient refers (here non-metro). The other group name metro is the reference group for the metro variable². The non-metro counties are 28.05 miles farther away from the nearest syringe program than the metro counties.

R Code 9.41 : Graphing the regression model lm9.2 with percent uninsured and metro

Listing / Output 9.28: Regression model `lm9.2` with percent uninsured and metro

This is my reproduction of Figure 9.21. There are two differences noteworthy:

I used the colorblind friendly palette scale_color_okabe_ito() from the {ggokabeito} package (See: Barrett (2021)). Therefore my colors are different from Figure 9.21. scale_color_okabe_ito() has 9 colors (default = order = 1:9). With the order argument you determine which colors you want to use. In my case the first and third color. order = c(1, 3).
Listing / Output 9.28 is the first time that I have used the augment() function from the {broom} package (See: Section A.3). I have already used broom::glance() and broom::tidy() in this chapter but I have not understood why and when to use one of these three functions. Therefore I am exploring how to apply these {broom} function in Section 9.11.2.

9.10.2 Using the multiple regression model

Formula 9.7 : Applying multiple regression model lm9.2

I am going to use as an example Miami county (Indiana) with exactly 10% of their residents not insured. This conforms to the book’s example. For a metro county I am using Sonoma county (California) with a slight different uninsured population of 10.1%.

\[ \begin{align*} \text{Distance to SSP} = \text{Intercept} &+ \text{SlopeV1} \times \text{ValueV1} + \\ &+ \text{SlopeV2} \times \text{ValueV2} \\ \end{align*} \tag{9.7}\]

\[ \begin{align*} \text{For Miami (IN) to SSP} &= 3.42 + 7.3 * 10 + 28.05 * 1 = 104.48\\ \text{For Sonoma (CA) to SSP} &= 3.42 + 7.3 * 10.1 + 28.05 * 0 = 77.15 \\ \end{align*} \tag{9.8}\]

Let’s now check these two values with the real data:

R Code 9.42 : Compute the values for Miami and Sonoma county

Code

tbl9.2_check <-  broom::augment(
    lm9.2,
    interval = c("prediction")
    ) |> 
    dplyr::bind_cols(distance_ssp_clean[, 1:2]) |> 
    dplyr::relocate(county, state) |> 
    dplyr::select(1:9) |> 
    dplyr::filter(
        county == "miami county" & state == "IN" |
        county == "sonoma county" & state == "CA") |> 
    dplyr::arrange(no_insurance)

print.data.frame(tbl9.2_check, digits = 6)

#>          county state dist_ssp no_insurance     metro  .fitted   .lower  .upper
#> 1  miami county    IN    48.88         10.0 non-metro 104.4813 -62.6779 271.640
#> 2 sonoma county    CA    13.11         10.1     metro  77.1589 -90.0093 244.327
#>     .resid
#> 1 -55.6013
#> 2 -64.0489

Listing / Output 9.29: Compute lm9.2 model values for Miami county (IN) and Sonoma county (CA)

Our values calculated manually (104.48 and 77.15) match the values in .fitted column! We also see that the real values (48.88 and 13.11) are very far from our predicted values but inside the insanely huge 95% predicted intervals.

If we add the values in the column .resid to our predicted values in .fitted we will get the real distances to the nearest Syringe Services Program (SSP). For example: 104.4813 - 55.50131 = 48.97999.

9.10.3 Adding more variables

Before adding (more) variables one has to check the assumptions. For instance: Before we are going to add another continuous variable we should check for normality. If they are not normally distributed then it doesn’t pay the effort to regress with these variables without an appropriate transformation.

Checking the distribution of the continuous variables brings nothing new to my understanding. It is the already usual plotting of histograms and q-q plots. So I will not go into every details.

But what is interesting is the book’s conclusion of the no_insurance distribution (the percentage of uninsured residents): It claims that the normality assumption is met. But I will show that this is not the case and that a logit transformation improves the situation essentially.

Example 9.12 : Exploring no_insurance distribution

check
logit

R Code 9.43 : Checking normality assumption of no_insurance

Listing / Output 9.30: Checking normality assumption of residents without insurance. Left: Histogram with overlaid density curve. Right: Q-Q plot. Bottom: Shapiro-Wilk test.

All three checks indicate that the normality assumption with the percentage of uninsured residents is not met.

R Code 9.44 : Checking normality assumption of no_insurance after logit transformation

Listing / Output 9.31: Checking normality assumption of residents without insurance after logit transformation. Left: Histogram with overlaid density curve. Right: Q-Q plot. Bottom: Shapiro-Wilk test.

The Shapiro-Wilk test is still statistically significant p < .05, and therefore the Null (that the no_insurance is normally distributed) has to be rejected. But the situation has drastically improved as one can seen in both graphs. In contrast to the book I would the percentage of uninsured residents logit transform.

Note 9.2

In contrast to Listing / Output 8.1 where I computed the logit transformation manually I used here the logit() function from the {car} package (see Section A.4).

9.10.4 Identifying transformation procedures

All of the continuous variables are right skewed. To find an appropriate transformation algorithm the book proposes to check with several procedures and to compare the results: Square, cube, inverse and log transformation are used. (For percent or proportions literature recommend the logit or arcsine transformation. — I have also tried arcsine and square but logit was by far the best.)

It is easy to find out with visual inspection the best transformation by comparing their histograms. Even if the situations with all of them has improved no histogram or q-q plot was so normally shaped as my no_insurance variable. But even my logit() transformation was not good enough that the Shapiro-Wilk test could confirm that the variable is normally distributed!

Remark 9.1. : What should I do if the assumptions are not met?

To choose the most normally distributed of the options does not guarantee that the variable finally really meet the assumption. What does this mean and how has one to proceed?

The only option as far as I have learned in SwR is to interpret the results cautiously and to prevent inferences from the sample to the population. But then I do not understand the purpose of the whole inference procedure when — in the end — one has to abstain from inferences.

As I am not going to reproduce all the details of the section about finding the appropriate transformation, I will at least report two other noteworthy things I learned from the book:

It is not necessary to generate a new column in the data.frame of the transformed variable. You could add the transformation directly into the ggplot2::aes() or stats:lm() code.
A disadvantages of transformed variables is the model interpretation. We cannot take the values from the printed output to interpret the model as I have demonstrated it in Equation 9.8. We have to take into account the transformation and to reverse the transformation for the interpretation. For instance if we have log-transformed one variable then we have to inverse this transformation by the appropriate exponential function. Or if we have applied a cube transformation then we need to raise the printed output by the power of 3.

9.10.5 No multicollinearity assumption

For multiple regression. e.g., when there are multiple predictor variables in the model, there is another assumption to met: Multicollinearity. (I have added this assumption to the Bullet List 9.5.)

Multicollinearity occurs when predictors are linearly related, or strongly correlated, to each other. When two or more variables are linearly related to one another, the standard errors for their coefficients will be large and statistical significance will be impacted. This can be a problem for model estimation, so variables that are linearly related to each other should not be in a model together.

There are several ways to check for multicollinearity. Examining correlation or — with more complex models, where one variable is linearly related to a combination of other variables — the use of variance inflation factor (VIF) statistics.

The VIF statistics are calculated by running a separate regression model for each of the predictors where the predictor is the outcome and everything else stays in the model as a predictor. With this model, for example, the VIF for the no_insurance variable would be computed by running the model $no_insurance = log(x = hiv_prevalence) + metro$.

The $R^2$ from this linear regression model would be used to determine the VIF by substituting it into Equation 9.9:

Formula 9.8 : Computing the variance inflation factor (VIF)

\[ \begin{align*} VIF_{no\_insurance} = \frac{1}{1-R^2} \end{align*} \tag{9.9}\]

Assessment 9.1 : Assessment of VIFs

1: No variance
> 1: There is variance
2.5: Cutoff; too much collinearity. Corresponds to $R^2$ of 60% $\approx$ r of 77%, a little more than often used when correlation coefficients are calculated for checking multicollinearity.

Example 9.13 : Checking for multicollinearity

cor
VIF

R Code 9.45 : Using correlation to check multicollinearity

Code

distance_ssp_clean  |> 
  dplyr::mutate(log_hiv_prevalence = log(x = hiv_prevalence))  |> 
  tidyr::drop_na(log_hiv_prevalence)  |> 
  dplyr::summarize(cor_hiv_insurance = 
       stats::cor(x = log_hiv_prevalence, y = no_insurance)
       )

#> # A tibble: 1 × 1
#>   cor_hiv_insurance
#>               <dbl>
#> 1             0.244

Listing / Output 9.32: Checking for multicollinearity with correlation

.24 is a weak correlation between percent residents uninsured (no_insurance) and the transformed value of hiv_prevalence. Multicollinearity would be a problem when the result would have been .7 or higher.

R Code 9.46 : Checking for multicollinearity with variance inflation factor (VIF)

Code

lm9.3 <- 
    lm(formula = (dist_ssp)^(1/3) ~ no_insurance + log(x = hiv_prevalence) + metro, 
       data = distance_ssp_clean, 
       na.action = na.exclude
       )

car::vif(mod = lm9.3)

#>            no_insurance log(x = hiv_prevalence)                   metro 
#>                1.165165                1.207491                1.186400

Listing / Output 9.33: Checking for multicollinearity with variance inflation factor (VIF)

The VIF values are small, especially given that the lower limit of the VIF is 1. This confirmed no problem with multicollinearity with this model. The model meets the assumption of no perfect multicollinearity.

There is nothing new in checking the other assumption. It is the same procedure as used for simple linear regression model.

9.10.6 Partial-F test: Choosing a model

I skipped so far the summary of the third model with the hiv_prevalencevariable. To fully understand this section I need to add it:

R Code 9.47 : Summarize full model (lm9.3)

Code

base::summary(lm9.3)

#> 
#> Call:
#> lm(formula = (dist_ssp)^(1/3) ~ no_insurance + log(x = hiv_prevalence) + 
#>     metro, data = distance_ssp_clean, na.action = na.exclude)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.2624 -0.8959 -0.0745  0.8032  3.1967 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)              3.03570    0.38448   7.896 2.48e-14 ***
#> no_insurance             0.11269    0.01220   9.237  < 2e-16 ***
#> log(x = hiv_prevalence) -0.06529    0.07729  -0.845 0.398691    
#> metronon-metro           0.48808    0.12763   3.824 0.000151 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.214 on 426 degrees of freedom
#>   (70 observations deleted due to missingness)
#> Multiple R-squared:  0.2372, Adjusted R-squared:  0.2318 
#> F-statistic: 44.16 on 3 and 426 DF,  p-value: < 2.2e-16

Listing / Output 9.34: Summary of the full model (lm9.3)

Which model to choose? The HIV variable was not a significant predictor in the full model, and all three models failed one or more of the assumptions.

Before thinking about how a model performed, we have to ensure that

the model addresses the research question of interest and
the model includes variables that have been demonstrated important in the past to help explain the outcome. (For instance: Smoking has to be included in a lung cancer study.)

One tool for choosing between two linear regression models is a statistical test called the Partial-F test. The Partial-F test compares the fit of two nested models to determine if the additional variables in the larger model improve the model fit enough to warrant keeping the variables and interpreting the more complex model.

Formula 9.9 : Formula for the Partial-F test

\[ \begin{align*} F_{partial} = \frac{\frac{R_{full}^2 - R_{reduced}^2}{q}}{\frac{1-R_{full}^2}{n-p}} \end{align*} \tag{9.10}\]

$R_{full}^2$: $R^2$ for the larger model
$R_{reduced}^2$: $R^2$ for the smaller nestes model
n: sample size
q: difference in the number of parameters for the two models
p: number of parameters in the larger model

The $F_{partial}$ statistic has $q$ and $n – p$ degrees of freedom.

Example 9.14 : Numbered Example Title

manual
anova()

Fill into formula Equation 9.10 the values from the summaries of lm9.1 (Listing / Output 9.3) and lm9.2 (Listing / Output 9.27):

R Code 9.48 : Computing Partial-F statistic manually

Code

((.1915 - .1703) / 1) / ((1 - .1915) / (500 - 3))

#> [1] 13.03203

Listing / Output 9.35: Computing Partial-F statistic manually

R Code 9.49 : Computing Partial-F statistic with anova()

Code

stats::anova(object = lm9.1, lm9.2)

#> Analysis of Variance Table
#> 
#> Model 1: dist_ssp ~ no_insurance
#> Model 2: dist_ssp ~ no_insurance + metro
#>   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
#> 1    498 3675855                                  
#> 2    497 3581712  1     94143 13.063 0.0003318 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Listing / Output 9.36: Computing Partial-F statistic with anova()

The manual calculation of the Partial-F was with 13.03 a pretty good approximation to the stats::anova() computation with 13.06. to understand the p-value of .0003 we need a NHST procedure.

9.10.6.1 NHST Step 1

Write the null and alternate hypotheses:

Wording for H0 and HA

H0: The larger model is no better than the smaller model at explaining the outcome.
HA: The larger model is better than the smaller model at explaining the outcome.

9.10.6.2 NHST Step 2

Compute the test statistic: The Partial-F is 13.06 with 1 and 497 degrees of freedom.

9.10.6.3 NHST Step 3

Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The p-value is very small ($p = .0003$), so the probability is tiny that the test statistic would be this big or bigger if the null hypothesis were true.

9.10.6.4 NHST Step 4

Conclude and write report. The null hypothesis is rejected.

Report 9.5: Interpretation of the linear regression model lm9.3 (without diganostics and assumption testing)

A linear regression model including percent uninsured and metro status of a county to explain distance in miles to the nearest syringe program was statistically significantly better than a baseline model at explaining the outcome [F(2, 497) = 58.88]. The model explained 18.83% of the variation in distance to syringe programs . Percent uninsured was a statistically significant predictor of distance to the nearest syringe program (b = 7.30; t = 9.39; p < .001). In the sample, for every additional 1% uninsured people in a county, the distance to the nearest syringe program is 7.30 miles farther. The 95% confidence interval for the coefficient suggested that a 1% increase in uninsured in a county was associated with a 5.77- to 8.83-mile increase in distance to the nearest syringe program. Metro status was also a significant predictor of distance to a syringe program (b = 28.05; t = 7.76; p = .0003). Nonmetro counties are 28.05 miles farther than metro counties to the nearest syringe exchange in the sample and are 12.80 to 43.30 miles farther than metro counties from syringe programs in the population. Overall, the results suggest that more rural counties and counties that are poorer are farther from this health service, potentially exacerbating existing health disparities.

9.11 Experiments

9.11.1 Compute and visualize several variables combined

During working the material of SwR I consulted many times the internet. So I learned about the reference to plots where several graphs are combined. See: Linear Regression Assumptions and Diagnostics in R: Essentials (Kassambara 2018). It includes functions for plotting a collection of four diagnostic tests either in base R with stats::plot(<model name>) or using {ggfortify} with ggplot2::autoplot(). There exist also with the {autoplotly} package an interactive version of {ggfortify}.

After exploring {GGally} I noticed that there is with the ggnostic() function, working as a wrapper around GGally::ggduo(), also a tool that displays full model diagnostics for each given explanatory variable.

TODO: Looking at the packages {ggfortify} and {autoplotly}

In addition to learn {GGally} I have also to study the packages {ggfortify} — and together with {plotly} — the package {autoplotly}.

{ggfortify}: This package offers fortify() and autoplot() functions to allow automatic {ggplot2} to visualize statistical result of popular R packages. (See: Section A.26).
{autoplotly}: provides functionalities to automatically generate interactive visualizations for many popular statistical results supported by {ggfortify} package with plotly.js and {ggplot2} style. The generated visualizations can also be easily extended using {ggplot2} syntax while staying interactive. (See: Section A.64 and Section A.1).

TODO 9.3: Learning how to work with {ggfortify} and {autoplotly}

To give you examples of the power of these functions I did several experiments. But I have to confess that I do still not understand the different options.

Experiment 9.1 : Visualizing several variables together

R Code 9.50 : Scatterplots of variable pairs from dataset of chapter 9

Code

lt_purple <- t_col("purple3", perc = 50, name = "lt.purple")

panel.hist <- function(x, ...)
{
    usr <- par("usr")
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "grey80", ...)
}


graphics::pairs(distance_ssp_clean[3:6], 
                pch = 23, 
                panel = panel.smooth,
                cex = 1.5, 
                bg = lt_purple, 
                horOdd = TRUE,
                diag.panel = panel.hist, 
                cex.labels = 2, 
                font.labels = 2,
                gap = 0
                )

Graph 9.8: Scatterplots of variable pairs from dataset of chapter 9

R Code 9.51 : Distance to SSP with different numeric variable and metro/nonmetro location

Code

panel.hist <- function(x, ...) {
    usr <- par("usr")
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "grey80", ...)
}

graphics::pairs(distance_ssp_clean[3:6], 
                main = "Distance to SSP -- Metro (red) & Nonmetro (blue)",
                panel = panel.smooth,
                horOdd = TRUE,
                diag.panel = panel.hist, 
                pch = 21, 
                gap = 0,
                bg = c("red", "blue")[unclass(distance_ssp_clean$metro)])

Graph 9.9: Distance to SSP – Metro (red) & Nonmetro (blue)

R Code 9.52 : Bivariate exploratory data analysis for variables of chapter 9

Listing / Output 9.37: Bivariate exploratory data analysis for variables of chapter 9 (removed 70 NA’s from `hiv_prevalence`)

R Code 9.53 : Applying {ggfortify} to model and diagnose our model lm9.1

Listing / Output 9.38: Applying {**ggfortify**} to model and diagnose our model `lm9.1`

9.11.2 Using {broom}

{broom} converts untidy output into tidy tibbles. This is important whenever you want to combine results from multiple model (or tests). I have {broom} already used to compare results from the normality tests (Shapiro-Wilk and Anderson-Darling), and homogeneity tests (Levene and Fligner-Killeen. In the first case (normality) both tests result in htest objects. I could therefore bind them as different rows in a new tibble dataframe (see: Table 9.3). In the second case (homogeneity) I got with htest and anova two different objects and had to work around the differences manually before I could bind them into the same tibble (See: Table 9.4).

At that time I didn’t know that I should have used only broom::tidy() and not improvise with broom:::glance.htest(). The real advantage of glance can be seen when it is applied to a linear model.

Important 9.6: Three verbs in {broom} to make it convenient to interact with model objects

broom::tidy() summarizes information about model components
broom::glance() reports information about the entire model
broom::augment() adds information about observations to a dataset

Experiment 9.2 : How to use {broom}?

R Code 9.54 : Using broom::tidy() to compare different normality tests

Code

(test_shapiro = stats::shapiro.test(distance_ssp_clean$dist_ssp))
(test_ad = nortest::ad.test(distance_ssp_clean$dist_ssp))

test_compare <- 
    dplyr::bind_rows(
        broom::tidy(test_shapiro),
        broom::tidy(test_ad)
    )
test_compare

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  distance_ssp_clean$dist_ssp
#> W = 0.87419, p-value < 2.2e-16
#> 
#> 
#>  Anderson-Darling normality test
#> 
#> data:  distance_ssp_clean$dist_ssp
#> A = 17.995, p-value < 2.2e-16
#> 
#> # A tibble: 2 × 3
#>   statistic  p.value method                         
#>       <dbl>    <dbl> <chr>                          
#> 1     0.874 1.10e-19 Shapiro-Wilk normality test    
#> 2    18.0   3.7 e-24 Anderson-Darling normality test

Listing / Output 9.39: Compare Shapiro-Wilk with Anderson-Darling normality test

Even the details differ both tests result show a tiny p-value < .001 and support therefore the same decision to reject the Null.

It is of interest that the p-value for each test reported individually is the same and does not conform with the value in the htest object. It is with 2.22e-16 — as my internet research had turned out (How should tiny 𝑝 -values be reported? (and why does R put a minimum on 2.22e-16?)) “a value below which you can be quite confident the value will be pretty numerically meaningless - in that any smaller value isn’t likely to be an accurate calculation of the value we were attempting to compute”.

And even more important from the same author:

But statistical meaning will have been lost far earlier. Note that p-values depend on assumptions, and the further out into the extreme tail you go the more heavily the true p-value (rather than the nominal value we calculate) will be affected by the mistaken assumptions, in some cases even when they’re only a little bit wrong. (Glen_b 2013)

As a result of this discussion:

The broom::tidy() version reports the value inside the htestobject, showing that they are different and that Shapiro-Wilk is the more conservative test. But statistically it has no relevance because the report according to the recommendation of the APA style guide is: Do not use any value smaller than p < 0.001 (Jaap 2013).

R Code 9.55 : Using broom::glance() to report information about the entire model

Code

lm9.2
base::summary(lm9.2)
print(broom::glance(lm9.2), width = Inf)

#> 
#> Call:
#> stats::lm(formula = dist_ssp ~ no_insurance + metro, data = distance_ssp_clean)
#> 
#> Coefficients:
#>    (Intercept)    no_insurance  metronon-metro  
#>          3.424           7.300          28.052  
#> 
#> 
#> Call:
#> stats::lm(formula = dist_ssp ~ no_insurance + metro, data = distance_ssp_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -219.80  -60.07  -18.76   48.33  283.96 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      3.4240    10.3621   0.330 0.741212    
#> no_insurance     7.3005     0.7775   9.389  < 2e-16 ***
#> metronon-metro  28.0525     7.7615   3.614 0.000332 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 84.89 on 497 degrees of freedom
#> Multiple R-squared:  0.1915, Adjusted R-squared:  0.1883 
#> F-statistic: 58.88 on 2 and 497 DF,  p-value: < 2.2e-16
#> 
#> # A tibble: 1 × 12
#>   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0.192         0.188  84.9      58.9 1.13e-23     2 -2929. 5865. 5882.
#>   deviance df.residual  nobs
#>      <dbl>       <int> <int>
#> 1 3581712.         497   500

Listing / Output 9.40: Using broom::glance() to report additional information about the entire model

I have put together three types of model output:

The first call is just printing the model: lm9.2.
The second call summarizes more information in a convenient format: `base::summary(lm9.2)``
The third part of Listing / Output 9.40 finally shows the output generated with broom::glance(lm9.2). It adds some important information for evaluation of the model quality and for model comparison. Some of these parameters (logLik, AIC, BIC and deviance) We haven’t not learned so far in this book.

logLik: log-likelihood value

The log-likelihood value of a regression model is a way to measure the goodness of fit for a model. The higher the value of the log-likelihood, the better a model fits a dataset.

The log-likelihood value for a given model can range from negative infinity to positive infinity. The actual log-likelihood value for a given model is mostly meaningless, but it’s useful for comparing two or more models. (Zach 2021b)

AIC: Akaike information criterion

Akaike information criterion (AIC) is a metric that is used to compare the fit of different regression models.

There is no value for AIC that can be considered “good” or “bad” because we simply use AIC as a way to compare regression models. The model with the lowest AIC offers the best fit. The absolute value of the AIC value is not important. (Zach 2021a)

BIC: Bayesian Information Criterion

The Bayesian Information Criterion (BIC) is an index used in Bayesian statistics to choose between two or more alternative models. The BIC is also known as the Schwarz information criterion (abrv. SIC) or the Schwarz-Bayesian information criteria.

Comparing models with the Bayesian information criterion simply involves calculating the BIC for each model. The model with the lowest BIC is considered the best, and can be written BIC* (or SIC* if you use that name and abbreviation). (Glenn 2018)

Deviance

in statistics, a measure of the goodness of fit between a smaller hierarchical model and a fuller model that has all of the same parameters plus more. If the deviance reveals a significant difference, then the larger model is needed. If the deviance is not significant, then the smaller, more parsimonious model is retained as more appropriate. (https://dictionary.apa.org/deviance)

R Code 9.56 : Using broom::augment() to add information about observations to a dataset

Code

tbl9.2_aug <-  broom::augment(
    lm9.2,
    se_fit = TRUE,
    interval = c("prediction")
    ) |> 
    dplyr::bind_cols(distance_ssp_clean[, 1:2]) |> 
    dplyr::relocate(county, state)



save_data_file("chap09", tbl9.2_aug, "tbl9.2_aug.rds")

print(tbl9.2_aug, width = Inf)

#> # A tibble: 500 × 14
#>    county           state dist_ssp no_insurance metro     .fitted    .lower
#>    <chr>            <fct>    <dbl>        <dbl> <fct>       <dbl>     <dbl>
#>  1 wabasha county   MN       47.4           5.1 metro        40.7 -127.    
#>  2 johnson county   GA      125            18.5 non-metro   167.    -0.770 
#>  3 iron county      MO       77.2          15.9 non-metro   148.   -19.6   
#>  4 kosciusko county IN       42.1          12.4 non-metro   122.   -45.1   
#>  5 delaware county  OK       49.2          18.6 non-metro   167.    -0.0478
#>  6 scurry county    TX      141            14.8 non-metro   140.   -27.6   
#>  7 jefferson county MS       81.5          11.2 non-metro   113.   -53.9   
#>  8 lee county       IA       68.4           6.8 non-metro    81.1  -86.2   
#>  9 santa fe county  NM        9.17         15.7 metro       118.   -49.3   
#> 10 cass county      MO       31.2           8.4 metro        64.7 -102.    
#>    .upper .se.fit  .resid    .hat .sigma     .cooksd .std.resid
#>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>       <dbl>      <dbl>
#>  1   208.    7.36    6.78 0.00752   85.0 0.0000162       0.0802
#>  2   334.    6.67  -41.5  0.00617   85.0 0.000499       -0.491 
#>  3   315.    5.60  -70.4  0.00435   84.9 0.00100        -0.831 
#>  4   289.    5.15  -79.9  0.00368   84.9 0.00109        -0.942 
#>  5   335.    6.72 -118.   0.00627   84.8 0.00409        -1.39  
#>  6   307.    5.31    1.48 0.00392   85.0 0.000000398     0.0174
#>  7   280.    5.32  -31.7  0.00393   85.0 0.000184       -0.375 
#>  8   248.    7.05  -12.7  0.00689   85.0 0.0000520      -0.150 
#>  9   285.    6.65 -109.   0.00614   84.8 0.00341        -1.29  
#> 10   232.    6.04  -33.5  0.00507   85.0 0.000266       -0.396 
#> # ℹ 490 more rows

Listing / Output 9.41: Using broom::augment() to add information about observations to a dataset

With broom::augment() we got those diagnostic values that we have calculated individually for the lm9.1 model in previous sections of this chapter:

Residuals and fitted values (columns .resid and .fitted): See Listing / Output 9.4.
Standardized residuals (column .std.resid): See (Listing / Output 9.18).
Cook’s Distance (column .cooksd): See Listing / Output 9.22.
Leverage (column .hat): See Listing / Output 9.23
Sigma, here as column .sigma for the “estimated residual standard deviation when corresponding observation is dropped from model”. (Where’s the .sigma equivalent in this chapter?)

Additionally there is the option to include:

Standard errors of fitted values (column .se.fit): (Where’s the .se.fit equivalent in this chapter?)
Confidence intervals (columns .lower and .upper) or alternative predicted intervals: See Listing / Output 9.5 or Listing / Output 9.7. Here I have used interval = prediction.

9.11.3 Interactive plots

R Code 9.57 : Graphing the regression model lm9.2 as an interactive plot

Code

test_plot <- distance_ssp_clean |> 
  ggplot2::ggplot(
      ggplot2::aes(
          x = no_insurance, 
          y = dist_ssp, 
          group = metro)
      ) +
  ggplot2::geom_line(
      data = broom::augment(x = lm9.2),
      ggplot2::aes(
          y = .fitted, 
          linetype = metro
          )
      ) +
  ggplot2::geom_point(
      ggplot2::aes(
          text = county,
          color = metro
          ), 
      size = 2
      ) +
  ggplot2::labs(
      y = "Miles to nearest syringe program",
      x = "County percent uninsured"
      ) +
  ggokabeito::scale_color_okabe_ito(
      order = c(1,3),
      alpha = .4,
      name = "County"
      ) +
  ggplot2::scale_linetype_manual(
      values = c(1, 2), 
      name = "Regression line\n(predicted values)")

#> Warning in ggplot2::geom_point(ggplot2::aes(text = county, color = metro), :
#> Ignoring unknown aesthetics: text

Code

plotly::ggplotly(
    p = test_plot,
    tooltip = c("dist_ssp", "no_insurance", "text")
)

Listing / Output 9.42: Interactive regression model lm9.2 with percent uninsured and metro

This is an experiment with interactive graphic using the ggplotly() function from the {plotly) package. This is my first try. There is still something wrong with the legend.

TODO: Interactive graphics with {plotly}

The problem with the legend indicates my lacking knowledge about the {plotly} package. I have to go into more detail. Although it is easy to apply {plotly} by converting a {ggplot2} graph, there are many finer points to learn. It is therefore one of my most important intent to study Interactive Web-Based Data Visualization With R, Plotly, and Shiny (Sievert 2020).

BTW: The package {autoplotly} in the context of {ggfortify) as mentioned in TODO 9.3 is another is another tool supporting interactive plots.

TODO 9.4: Learning to develop interactive graphics with {plotly}

9.12 Exercises (empty)

9.13 Glossary

term	definition
Adjusted-R2	Adjusted R-squared is a measure of model fit for ordinary least squares linear regression that penalizes the R-squared, or percentage of variance explained, for the number of variables in the model (SwR, Glossary)
amfAR	amfAR, the Foundation for AIDS Research, known until 2005 as the American Foundation for AIDS Research, is an international nonprofit organization dedicated to the support of AIDS research, HIV prevention, treatment education, and the advocacy of AIDS-related public policy. (<a href="https://en.wikipedia.org/wiki/AmfAR">Wikipedia</a>)
Anderson-Darling	The Anderson-Darling Goodness of Fit Test (AD-Test) is a measure of how well your data fits a specified distribution. It’s commonly used as a test for normality. (<a href="https://www.statisticshowto.com/anderson-darling-test/">Statistics How-To</a>)
ANOVA	Analysis of variance is a statistical method used to compare means across groups to determine whether there is a statistically significant difference among the means; typically used when there are three or more means to compare. (SwR, Glossary)
Breusch-Pagan	Breusch-Pagan is a statistical test for determining whether variance is constant, which is used to test the assumption of homoscedasticity; Breusch-Pagan relies on the [chi-squared] distribution and is used during assumption checking for [homoscedasticity] in [linear regression]. (SwR, Glossary)
Cook’s D	Cook’s distance (often abbreviated Cook’s D) is used in Regression Analysis to find influential outliers in a set of predictor variables. IIt is a way to identify points that negatively affect the regression model. ([Statistics How To](https://www.statisticshowto.com/cooks-distance/))
Correlation	Correlation coefficients are a standardized measure of how two variables are related, or co-vary. They are used to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson’s. Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression. ([Statistics How To](https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/))
Degrees of Freedom	Degree of Freedom (df) is the number of pieces of information that are allowed to vary in computing a statistic before the remaining pieces of information are known; degrees of freedom are often used as parameters for distributions (e.g., chi-squared, F). (SwR, Glossary)
Density Plots	Density plots are used for examining the distribution of a variable measured along a continuum; density plots are similar to histograms but are smoothed and may not show existing gaps in data (SwR, Glossary)
Deterministic	A deterministic equation, or model, has one precise value for y for each value of x. (SwR, Chap09)
Diagnostics	Diagnostics in linear and logistic regression are a set of tests to identify outliers and influential values among the observations. (SwR, Glossary)
Durbin-Watson	Durbin-Watson test is a statistical test that is used to check the assumption of independent residuals in linear regression; a Durbin-Watson statistic of 2 indicates that the residuals are independent. (SwR, Glossary)
EDA	Explorative Data Analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts traditional hypothesis testing. (<a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">Wikipedia</a>)
F-Statistic	F-statistic is a test statistic comparing explained and unexplained variance in [ANOVA] and linear regression. The F-statistic is a ratio where the variation between the groups is compared to the variation within the groups. (SwR, Glossary)
Fligner	The Fligner-Killeen test is a non-parametric test for homogeneity of group variances based on ranks. It is useful when the data is non-normal or when there are outliers. (<a href="https://real-statistics.com/one-way-analysis-of-variance-anova/homogeneity-variances/fligner-killeen-test/">Real Statistics Using Excel</a>)
Histograms	Histograms are visual displays of data used to examine the distribution of a numeric variable. (SwR, Glossary)
Homoscedasticity	Homoscedasticity is [homogeneity of variances], contrast is [Heteroscedasticity]. Homoscedasticity is an assumption of correlation and linear regression that requires that the variance of y be constant across all the values of x; visually, this assumption would show points along a fit line between x and y being evenly spread on either side of the line for the full range of the relationship. (SwR, Glossary)
Influential Observation	An influential observation is an observation that changes the slope of a regression line. (SwR, Glossary) Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly. (<a href= "https://openstax.org/books/introductory-statistics/pages/12-6-outliers">Introductory Statistics 12.6</a>)
Intercept	The intercept is the value of the dependent variables if all independent variables have the value zero. (<a href="https://link.springer.com/referenceworkentry/10.1007/978-94-007-0753-5_1486">Intercept, Slope in Regression</a>)
Levene	Levene’s test is a statistical test to determine whether observed data meet the homogeneity of variances assumption; Levene’s test is used to test this assumption for t-tests and analysis of variance. (SwR, Glossary)
Linearity	Linearity is the assumption of some statistical models that requires the outcome, or transformed outcome, to have a linear relationship with numeric predictors, where linear relationships are relationships that are evenly distributed around a line. (SwR, Glossary)
Mean Square	Mean square is the mean of the squared differences between two values; mean squares are used to compute the F-statistic in analysis of variance and linear regression. (Swr, Glossary)
Model-fit	Model fit means how well the model captures the relationship in the observed data. (SwR, Glossary)
OLS	Ordinary least square regression (OLS) is a method of estimating a linear regression model that finds the regression line by minimizing the squared differences between each data point and the regression line. (Swr; Glossary)
One-sample	One-sample t-test, also known as the single-parameter t-test or single-sample t-test, is an inferential statistical test comparing the mean of a numeric variable to a population or hypothesized mean. (SwR, Glossary)
Outliers	Outliers are observations with unusual values. (SwR, Glossary). Outliers are observed data points that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical distance from the line to the point. (<a href= "https://openstax.org/books/introductory-statistics/pages/12-6-outliers">Introductory Statistics 12.6</a>)
p-value	The p-value is the probability that the test statistic is at least as big as it is under the null hypothesis (SwR, Glossary)
Partial-F	Partial-F test is a statistical test to see if two nested linear regression models are statistically significantly different from each other; this test is usually used to determine if a larger model accounts for enough additional variance to justify the complexity in interpretation that comes with including more variables in a model. (SwR, Glossary)
Pearson	Pearson’s r is a statistic that indicates the strength and direction of the relationship between two numeric variables that meet certain assumptions. (SwR, Glossary)
Q-Q-Plot	A quantile-quantile plot is a visualization of data using probabilities to show how closely a variable follows a normal distribution. (SwR, Glossary) This plot is made up of points below which a certain percentage of the observations fall. On the x-axis are normally distributed values with a mean of 0 and a standard deviation of 1. On the y-axis are the observations from the data. If the data are normally distributed, the values will form a diagonal line through the graph. (SwR, chapter 6)
R-squared	R-squared is the percent of variance in a numeric variable that is explained by one or more other variables; the r-squared is also known as the coefficient of determination and is used as a measure of model fit in linear regression and an effect size in correlation analyses. (SwR, Glossary)
Residually	Residuals are the differences between the observed values and the predicted values. (SwR, Glossary)
Shapiro-Wilk	The Shapiro-Wilk test is a statistical test to determine or confirm whether a variable has a normal distribution; it is sensitive to small deviations from normality and not useful for sample sizes above 5,000 because it will nearly always find non-normality. (SwR, Glossary)
Simple-Linear-Regression	Simple does not mean easy; instead, it is the term used for a statistical model used to predict or explain a continuous outcome by a single predictor. (SwR, Glossary, Chap09)
Slope	The slope is the increase in the dependent variable when the independent variable increases with one unit and all other independent variables remain the same. (<a href="https://link.springer.com/referenceworkentry/10.1007/978-94-007-0753-5_1486">Intercept, Slope in Regression</a>)
Spearman	Spearman’s rho a statistical test used to examine the strength, direction, and significance of the relationship between two numeric variables when they do not meet the assumptions for [Pearson]’s r. (SwR, Glossary)
SSP	SSP stands for Syringe Services Program (SwR)
SwR	SwR is my abbreviation of: Harris, J. K. (2020). Statistics With R: Solving Problems Using Real-World Data (Illustrated Edition). SAGE Publications, Inc.
T-Statistic	The T-Statistic is used in a T test when you are deciding if you should support or reject the null hypothesis. It’s very similar to a Z-score and you use it in the same way: find a cut off point, find your t score, and compare the two. You use the t statistic when you have a small sample size, or if you don’t know the population standard deviation. (<a href="https://www.statisticshowto.com/t-statistic/">Statistics How-To</a>)
T-Test	A t-test is a type of statistical analysis used to compare the averages of two groups and determine whether the differences between them are more likely to arise from random chance. (<a href="https://en.wikipedia.org/wiki/Student%27s_t-test">Wikipedia</a>)
VIF	The variance inflation factor (VIF) is a statistic for determining whether there is problematic multicollinearity in a linear regression model. (SwR, Glossary)
Wald	Wald test is the statistical test for comparing the value of the coefficient in linear or logistic regression to the hypothesized value of zero; the form is similar to a one-sample t-test, although some Wald tests use a t-statistic and others use a z-statistic as the test statistic. (SwR, Glossary)
Z-score	A z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. (<a href="https://www.statisticshowto.com/probability-and-statistics/z-score/#Whatisazscore">StatisticsHowTo</a>)

Session Info

Code

sessioninfo::session_info()

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       macOS Sequoia 15.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Vienna
#>  date     2024-11-13
#>  pandoc   3.5 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  abind          1.4-8      2024-09-12 [2] CRAN (R 4.4.1)
#>  backports      1.5.0      2024-05-23 [2] CRAN (R 4.4.0)
#>  base64enc      0.1-3      2015-07-28 [2] CRAN (R 4.4.0)
#>  broom          1.0.7      2024-09-26 [2] CRAN (R 4.4.1)
#>  bslib          0.8.0      2024-07-29 [2] CRAN (R 4.4.0)
#>  cachem         1.1.0      2024-05-16 [2] CRAN (R 4.4.0)
#>  car            3.1-3      2024-09-27 [2] CRAN (R 4.4.1)
#>  carData        3.0-5      2022-01-06 [2] CRAN (R 4.4.0)
#>  class          7.3-22     2023-05-03 [2] CRAN (R 4.4.2)
#>  cli            3.6.3      2024-06-21 [2] CRAN (R 4.4.1)
#>  colorspace     2.1-1      2024-07-26 [2] CRAN (R 4.4.0)
#>  commonmark     1.9.2      2024-10-04 [2] CRAN (R 4.4.1)
#>  crosstalk      1.2.1      2023-11-23 [2] CRAN (R 4.4.0)
#>  curl           6.0.0      2024-11-05 [2] CRAN (R 4.4.1)
#>  data.table     1.16.2     2024-10-10 [2] CRAN (R 4.4.1)
#>  DBI            1.2.3      2024-06-02 [2] CRAN (R 4.4.0)
#>  digest         0.6.37     2024-08-19 [2] CRAN (R 4.4.1)
#>  dplyr          1.1.4      2023-11-17 [2] CRAN (R 4.4.0)
#>  DT             0.33       2024-04-04 [2] CRAN (R 4.4.0)
#>  e1071          1.7-16     2024-09-16 [2] CRAN (R 4.4.1)
#>  evaluate       1.0.1      2024-10-10 [2] CRAN (R 4.4.1)
#>  fansi          1.0.6      2023-12-08 [2] CRAN (R 4.4.0)
#>  farver         2.1.2      2024-05-13 [2] CRAN (R 4.4.0)
#>  fastmap        1.2.0      2024-05-15 [2] CRAN (R 4.4.0)
#>  forcats        1.0.0      2023-01-29 [2] CRAN (R 4.4.0)
#>  Formula        1.2-5      2023-02-24 [2] CRAN (R 4.4.0)
#>  generics       0.1.3      2022-07-05 [2] CRAN (R 4.4.0)
#>  GGally         2.2.1      2024-02-14 [2] CRAN (R 4.4.0)
#>  ggokabeito     0.1.0      2021-10-18 [2] CRAN (R 4.4.0)
#>  ggplot2        3.5.1      2024-04-23 [2] CRAN (R 4.4.0)
#>  ggstats        0.7.0      2024-09-22 [2] CRAN (R 4.4.1)
#>  glossary     * 1.0.0.9003 2024-08-05 [2] Github (debruine/glossary@05e4a61)
#>  glue           1.8.0      2024-09-30 [2] CRAN (R 4.4.1)
#>  gridExtra      2.3        2017-09-09 [2] CRAN (R 4.4.0)
#>  gtable         0.3.6      2024-10-25 [2] CRAN (R 4.4.1)
#>  haven          2.5.4      2023-11-30 [2] CRAN (R 4.4.0)
#>  here           1.0.1      2020-12-13 [2] CRAN (R 4.4.0)
#>  hms            1.1.3      2023-03-21 [2] CRAN (R 4.4.0)
#>  htmltools      0.5.8.1    2024-04-04 [2] CRAN (R 4.4.0)
#>  htmlwidgets    1.6.4      2023-12-06 [2] CRAN (R 4.4.0)
#>  httr           1.4.7      2023-08-15 [2] CRAN (R 4.4.0)
#>  jquerylib      0.1.4      2021-04-26 [2] CRAN (R 4.4.0)
#>  jsonlite       1.8.9      2024-09-20 [2] CRAN (R 4.4.1)
#>  kableExtra     1.4.0      2024-01-24 [2] CRAN (R 4.4.0)
#>  knitr          1.49       2024-11-08 [2] CRAN (R 4.4.1)
#>  labeling       0.4.3      2023-08-29 [2] CRAN (R 4.4.0)
#>  labelled       2.13.0     2024-04-23 [2] CRAN (R 4.4.0)
#>  lattice        0.22-6     2024-03-20 [2] CRAN (R 4.4.2)
#>  lazyeval       0.2.2      2019-03-15 [2] CRAN (R 4.4.0)
#>  lifecycle      1.0.4      2023-11-07 [2] CRAN (R 4.4.0)
#>  lmtest         0.9-40     2022-03-21 [2] CRAN (R 4.4.0)
#>  magrittr       2.0.3      2022-03-30 [2] CRAN (R 4.4.0)
#>  markdown       1.13       2024-06-04 [2] CRAN (R 4.4.0)
#>  Matrix         1.7-1      2024-10-18 [2] CRAN (R 4.4.2)
#>  mgcv           1.9-1      2023-12-21 [2] CRAN (R 4.4.2)
#>  mitools        2.4        2019-04-26 [2] CRAN (R 4.4.0)
#>  munsell        0.5.1      2024-04-01 [2] CRAN (R 4.4.0)
#>  nlme           3.1-166    2024-08-14 [2] CRAN (R 4.4.2)
#>  nortest        1.0-4      2015-07-30 [2] CRAN (R 4.4.0)
#>  patchwork      1.3.0      2024-09-16 [2] CRAN (R 4.4.1)
#>  pillar         1.9.0      2023-03-22 [2] CRAN (R 4.4.0)
#>  pkgconfig      2.0.3      2019-09-22 [2] CRAN (R 4.4.0)
#>  plotly         4.10.4     2024-01-13 [2] CRAN (R 4.4.0)
#>  plyr           1.8.9      2023-10-02 [2] CRAN (R 4.4.0)
#>  proxy          0.4-27     2022-06-09 [2] CRAN (R 4.4.0)
#>  purrr          1.0.2      2023-08-10 [2] CRAN (R 4.4.0)
#>  R6             2.5.1      2021-08-19 [2] CRAN (R 4.4.0)
#>  RColorBrewer   1.1-3      2022-04-03 [2] CRAN (R 4.4.0)
#>  Rcpp           1.0.13-1   2024-11-02 [1] CRAN (R 4.4.1)
#>  repr           1.1.7      2024-03-22 [2] CRAN (R 4.4.0)
#>  rlang          1.1.4      2024-06-04 [2] CRAN (R 4.4.0)
#>  rmarkdown      2.29       2024-11-04 [2] CRAN (R 4.4.1)
#>  rprojroot      2.0.4      2023-11-05 [2] CRAN (R 4.4.0)
#>  rstudioapi     0.17.1     2024-10-22 [2] CRAN (R 4.4.1)
#>  rversions      2.1.2      2022-08-31 [2] CRAN (R 4.4.0)
#>  sass           0.4.9      2024-03-15 [2] CRAN (R 4.4.0)
#>  scales         1.3.0      2023-11-28 [2] CRAN (R 4.4.0)
#>  sessioninfo    1.2.2      2021-12-06 [2] CRAN (R 4.4.0)
#>  skimr          2.1.5      2022-12-23 [2] CRAN (R 4.4.0)
#>  stringi        1.8.4      2024-05-06 [2] CRAN (R 4.4.0)
#>  stringr        1.5.1      2023-11-14 [2] CRAN (R 4.4.0)
#>  survey         4.4-2      2024-03-20 [2] CRAN (R 4.4.0)
#>  survival       3.7-0      2024-06-05 [2] CRAN (R 4.4.2)
#>  svglite        2.1.3      2023-12-08 [2] CRAN (R 4.4.0)
#>  systemfonts    1.1.0      2024-05-15 [2] CRAN (R 4.4.0)
#>  tableone       0.13.2     2022-04-15 [2] CRAN (R 4.4.0)
#>  tibble         3.2.1      2023-03-20 [2] CRAN (R 4.4.0)
#>  tidyr          1.3.1      2024-01-24 [2] CRAN (R 4.4.0)
#>  tidyselect     1.2.1      2024-03-11 [2] CRAN (R 4.4.0)
#>  utf8           1.2.4      2023-10-22 [2] CRAN (R 4.4.0)
#>  vctrs          0.6.5      2023-12-01 [2] CRAN (R 4.4.0)
#>  viridisLite    0.4.2      2023-05-02 [2] CRAN (R 4.4.0)
#>  withr          3.0.2      2024-10-28 [2] CRAN (R 4.4.1)
#>  xfun           0.49       2024-10-31 [2] CRAN (R 4.4.1)
#>  xml2           1.3.6      2023-12-04 [2] CRAN (R 4.4.0)
#>  yaml           2.3.10     2024-07-26 [2] CRAN (R 4.4.0)
#>  zoo            1.8-12     2023-04-13 [2] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-x86_64/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

9.1 Achievements to unlock

9.2 The needle exchange examination

9.3 Resources & Chapter Outline

9.3.1 Data, codebook, and R packages

9.3.2 Get, recode and show data

9.4 Achievement 1: Explorative Data Analysis

9.4.1 Introduction

9.4.2 Steps for EDA

9.4.3 Executing EDA for chapter 9

9.5 Achivement 2: Exploring line model

9.5.1 Introduction

9.5.2 Plotting an example

9.6 Achievement 3: Slope and Intercept

9.6.1 Introduction

9.6.2 Computing the slope

9.6.3 Computing the intercept

9.6.4 Estimating the linear regression model with R

9.6.5 Understanding residuals

9.7 Achievement 4: Slope interpretation and significance

9.7.1 Interpreting statistical significance of the slope

9.7.1.1 NHST Step 1

9.7.1.2 NHST Step 2

9.7.1.3 NHST Step 3

9.7.1.4 NHST Step 4

9.7.2 Computing confidence intervals

9.7.3 Making predcitions

9.8 Achievement 5: Model fit

9.8.1 Introduction

9.8.2 Understanding F-statistic

9.8.3 Understanding \(R^2\) measure of model fit

9.8.4 Reproting linear regression results

9.9 Achievement 6: Conducting diagnostics

9.9.1 Introduction

9.9.2 Linearity

9.9.3 Checking homoscedasticity

9.9.4 Independent residuals

9.9.5 Normality of residuals

9.9.6 Interpreting the assumption tests

9.9.7 Using models diagnostics

9.9.8 Summarizing outliers and influential values

9.10 Achievement 7: Adding variables and transformation

9.10.1 Adding a binary variable metro to the model

9.10.2 Using the multiple regression model

9.10.3 Adding more variables

9.10.4 Identifying transformation procedures

9.10.5 No multicollinearity assumption

9.10.6 Partial-F test: Choosing a model

9.10.6.1 NHST Step 1

9.10.6.2 NHST Step 2

9.10.6.3 NHST Step 3

9.10.6.4 NHST Step 4

9.11 Experiments

9.11.1 Compute and visualize several variables combined

9.11.2 Using {broom}

9.11.3 Interactive plots

9.12 Exercises (empty)

9.13 Glossary

Session Info

9.10.1 Adding a binary variable `metro` to the model